ECE1657 Course Notes: Game Theory and Evolutionary Game Theory
Lacra Pavel
Systems Control Group
Dept. of Electrical and Computer Engineering
University of Toronto
2016
Contents
2.5 Notes
4 Continuous-Kernel Games
4.1 Introduction
4.2 Game Formulation
4.3 Extension to Mixed-Strategies
4.4 Nash Equilibria and Best-Response Correspondence
4.5 Existence of Pure-Strategy NE
4.6 Example: Optical Network OSNR Game
4.6.1 Iterative Algorithm
7 Replicator Dynamics
7.1 Introduction
7.2 Derivation of Replicator Dynamics (RD)
7.3 RD Equilibria vs. NE Strategies
7.4 RD Equilibria vs. ESS
7.5 Doubly Symmetric (Partnership) Games
7.6 Potential Games
7.7 *Supplementary Material
7.7.1 *Proof of Proposition 7.11
7.7.2 *Doubly Symmetric Games and NE Efficiency
7.7.3 *Extensions to Multi-populations
7.7.4 *Adaptation (Strategy) Dynamics
7.8 Notes
(overloaded symbols)

F(x) = ∇J(x) ∈ R^{Nm_i}   pseudo-gradient in a continuous-kernel game, with F_i(x) = ∇_i J_i(x) ∈ R^{m_i}, i ∈ N

F(x) = J(x) ∈ R^m   vector cost in a population game, with F_j(x) = J_j(x) = J(e_j, x) ∈ R, j ∈ M

F(x) = J(x) ∈ R^{Nm_i}   vector cost in a repeated matrix game, with F_i(x) = J_i(x_{-i}) ∈ R^{m_i}, i ∈ N, and
    J̄_i(x_i, x_{-i}) = E[J_i(e_i, e_{-i})] = x_i^T J_i(x_{-i}) = x_i^T F_i(x) ∈ R

F̄(x) = U(x) ∈ R^{Nm_i}   vector payoff in a repeated matrix game, with F̄_i(x) = U_i(x_{-i}) ∈ R^{m_i}, i ∈ N, and
    Ū_i(x_i, x_{-i}) = E[U_i(e_i, e_{-i})] = x_i^T U_i(x_{-i}) = x_i^T F̄_i(x) = −x_i^T F_i(x) ∈ R, i.e., F̄(x) = −F(x)

F̄(x) = U(x) = −F(x)   vector payoff in a population game, with
    F̄_j(x) = U_j(x) = U(e_j, x) = −J(e_j, x) = −J_j(x) = −F_j(x) ∈ R, j ∈ M;
    F̄_j(x) is also called the fitness of agents using pure strategy j in the population state x

F_j^e(x) = F̄_j(x) − F̄(x)   excess fitness of agents using pure strategy j relative to the average fitness in the population

Δ = {x ∈ R^m_+ | Σ_{j=1}^m x_j = 1}   the set of points in R^m such that x_j ≥ 0 for every j ∈ M = {1, . . . , m} and Σ_{j∈M} x_j = 1. Here "|" means "such that" and "," means "and".
Introduction
Game theory, an area initiated more than fifty years ago [103], has been of interest to researchers working in a broad range of areas, from economics [15], computer science [78, 128], and social studies, to, more recently, engineering and communication networks [16, 70, 154, 130, 7, 72, 119]. The recent popularity it has been enjoying in engineering has to do with the fact that it brings new perspectives to optimization and control of distributed networks.
Game theory is a branch of applied mathematics concerned with the study of situations involving conflicting interests. The field was born with the book by John von Neumann and Oskar Morgenstern [152], although the theory was developed extensively in the 1950s by many researchers, among them John Nash [102, 103]. Game theory mathematically describes behavior in strategic situations, in which an individual's success in making choices depends on the choices of others. It incorporates paradigms such as Nash equilibrium and incentive compatibility, which can help quantify the individual preferences of decision-making agents. In fact, game theory provides a rigorous mathematical framework for modeling the actions of individual selfish or cooperating agents/players and the interactions among players. Furthermore, it has an inherently distributed nature and provides a foundation for developing distributed algorithms for dynamic resource allocation.
While initially developed to analyze competitions in which one individual does better at another's expense (zero-sum games), game theory has been expanded to treat a wide class of interactions, which are classified according to several criteria, among them cooperative versus noncooperative games. Typical classical games are used to model and predict the outcome of a wide variety of scenarios involving a finite number of "players" (or agents) that aim to optimize some individual objective.
Historically, the development of game theory was motivated by studies in economics; however, many interesting applications have emerged in diverse fields such as biology [139], computer science [60], social science and engineering [79]. In engineering, the interest in noncooperative game theory is motivated by the possibility of designing large-scale systems that globally regulate their performance in a distributed and decentralized manner. Modelling a problem within a game-theoretic setting is particularly relevant to any practical application consisting of separate subsystems that compete for the use of some limited resource. Examples of such applications include congestion control in network traffic (e.g., the Internet or transportation), problems of optimal routing [10, 12, 13], and power allocation in wireless communications and optical networks [130, 119].
Moreover, recent interest has been in extending the standard game setup in various ways: some extensions have to do with the computation of equilibria [48, 120, 52], others are concerned with the inherently static nature of a traditional game setup and with how to extend it to a dynamic process by which an equilibrium is to be reached [91, 135], or with addressing equilibrium efficiency [71, 1, 69, 124].
In a noncooperative (Nash) game [103, 125, 19] each player pursues the maximization of its own utility, or equiva-
lently the minimization of its own cost function, in response to the actions of all other players. The stable outcomes
of the interactions of noncooperative selfish agents correspond to Nash equilibria. On the other hand, in a coopera-
tive game framework the natural players / agents can be the network nodes, routers or switches (as software agents),
or users, [8]. These players/agents cooperate to redistribute the network resources (bandwidth, wavelength capacity,
power). Why game theory? Consider the case of a multiple access network problem; most optimization based ap-
proaches find the optimal multiple access control (MAC) and routing parameters that optimize network throughput,
lifetime, delay, etc., and assume that all nodes in the network use these parameters. But there is no reason to believe that
nodes will adhere to the actions that optimize network performance. Cheaters may deviate in order to increase their
payoffs which in turn affects other users. In effect any scenario where there is some strategic interaction among
self-interested players is best modelled via a game theoretic model. Game theory helps to capture this interaction,
the effect of actions of rational players on the performance of the network. Although the selfish behavior of players
causes system performance loss in a Nash game [50, 78, 72, 1, 123], it has been shown in [72, 131] that a proper selection of the network pricing mechanism can help prevent the degradation of network system performance. In the
context of evolution, a game captures the essential features where strategic interactions occur.
In these course notes we shall study game theory as a framework with its branches: classical game theory (CGT),
evolutionary game theory (EGT) and learning game theory (or learning in games) (LGT).
Game theory as a framework is a methodology used to build models of real-world social interactions. The result of
such a process of abstraction is a formal model that typically comprises the set of individuals who interact (called
players/agents), the different choices available to each of the players (called strategies), and a payoff function that
assigns a (usually numerical) value to each player for each possible combination of choices made by every individual.
Including different assumptions about how players behave, or should behave, in this framework gives rise to the different branches that compose game theory: CGT, EGT and LGT.
Classical game theory (CGT): Classical game theory was chronologically the first branch to be developed (Von
Neumann and Morgenstern, 1944), the one where most of the work has been focused historically, and the one with
the largest representation in most game theory textbooks and courses. Classical game theory (CGT) is a branch
of mathematics devoted to the study of how instrumentally rational players should behave in order to obtain the
maximum possible payoff in a formal game.
The main problem in classical game theory is that, in general, rational behaviour for any one player remains un-
defined in the absence of strong assumptions about other players’ behaviour. Hence, in order to derive specific
predictions about how rational players should behave, it is often necessary to make very stringent assumptions about
everyone’s beliefs (e.g. common knowledge of rationality) and their interdependent consistency. If a game involves
only rational agents, each of whom believes all other agents to be rational, then theoretical results offer accurate
predictions of the game outcomes.
Even when the most stringent assumptions are in place, it is often the case that several outcomes are possible,
and it is not clear which, if any, may be achieved, or the process through which this selection would happen. Thus,
the general applicability of classical game theory is limited. A related limitation of classical game theory is that it is
an inherently static theory: it is mainly focused on the study of end-states and possible equilibria, paying hardly any
attention to how such equilibria might be reached. A more realistic modelling scenario involves players that are less
than rational and a repeated game play.
Evolutionary Game Theory (EGT): Some time after the emergence of classical game theory, biologists realized the
potential of game theory as a framework to formally study adaptation and coevolution of biological populations. For
those situations where the fitness of a phenotype is independent of its frequency, optimization theory is the proper
mathematical tool. However, it is most common in nature that the fitness of a phenotype depends on the composition
of the population [106]. In such cases game theory becomes the appropriate framework. In evolutionary game
theory (EGT), players are no longer taken to be rational. Instead, each player, most often meant to represent an
individual animal, always selects the same (potentially mixed) strategy, which represents its behavioural phenotype,
and payoffs are usually interpreted as Darwinian fitness. The emphasis is placed on studying which behavioural
phenotypes (i.e. strategies) are stable under some evolutionary dynamics, and how such evolutionary stable states
are reached. Despite having its origin in biology, the basic ideas behind evolutionary game theory, that successful
strategies tend to spread more than unsuccessful ones, and that fitness is frequency-dependent, have been extended
to other fields. The study of dynamic systems often begins with the identification of their invariant or equilibrium
states, in EGT literature called stable states. This is often called static analysis, as it does not consider the dynamics
of the system explicitly, but only its rest points. The most important concept in the static analysis of EGT is the
concept of Evolutionary Stable Strategy (ESS), proposed by Maynard Smith and Price,[141]. Very informally, a
population playing an ESS is uninvadable by any other strategy.
The main shortcoming of mainstream EGT is that it is founded on assumptions made to ensure that the resulting
models are mathematically tractable. Most of the work assumes one single infinite and homogeneous population
where players using one of a finite set of strategies are randomly matched to play an infinitely repeated 2-player
symmetric game. Most of our treatment will be assuming this standard model. In the last few years various alter-
native models (e.g. finite populations, stochastic strategies, multi-player games, structured populations) are being
explored. Extensive references on this topic can be found in [153], [65], [151], [129].
Learning game theory (LGT): Like evolutionary game theory (EGT), learning in games abandons the demanding
assumptions of classical game theory on players’ rationality and beliefs. However, unlike its evolutionary counterpart
(EGT), learning game theory assumes that individual players adapt, learning over time about the game and the
behaviour of others (e.g. through reinforcement, imitation, or belief updating). Therefore, instead of immediately
playing a perfect move, players adapt their strategy based on the outcomes of previous matches, [56], [135], hence a
classical game with learning, or learning in games. Extensive references on this topic can be found in [56], [104].
A game-theoretic approach can be used for studying both the behavior of independent agents and the structure of
networks they create. Moreover, the formulation of a game-theoretic model leads directly to the development of
distributed algorithms towards finding equilibria, if they exist. Over the years there has been a great deal of research effort, resulting in a rich literature on the application of game theory to transportation systems [128], the Internet [72], wireless networks [70, 130, 7], and optical networks [119, 115, 121].
Among some of the recent game theory applications, communication networks is an area of recent interest. Ap-
plications can involve either cooperative or noncooperative games, static or repeated games, finite or continuous
strategy games, [13]. Typical applications include power control problems in different multiuser environments
[53, 70, 7, 130, 119, 77], routing [11, 109] or congestion control [128, 10, 4, 12, 20, 80, 137, 154], extending the
system-based optimization approaches [75, 95, 87, 86, 88, 147]. The many "players" interact within the network
and make (sequential) decisions, i.e., play a game. For example, in a noncooperative (Nash) game framework
the natural players / agents can be the Internet service providers (ISP) or domain operators, [85, 90], the routers
[11, 20, 155, 109], or even the users themselves in an access network application with dedicated wavelengths, [115].
As another example in wired networks, there could be two sets of players: telecom firms/ISP and end users. Both sets
of players have different objectives and non-negligible interaction across players exists. In wireless networks there
could be wireless LANs where users / players communicate with a fixed access point, or wireless ad-hoc networks
where users /players communicate with each other in the absence of any fixed infrastructure support.
Chapter 1
The Name of the Game
Chapter Summary
"He thinks that I think that he thinks ... "
This chapter provides a brief overview of basic concepts in game theory. The following chapters will start introduc-
ing them formally.
1.1 Introduction
In this chapter we shall introduce the game-theoretical notions in the simplest terms. Our goal later on will be to study and formalize mathematically various game problems, by which we understand problems of conflict with common strategic features. These models are called "games" because they are built on actual games such as bridge and poker. The theory of games stresses the strategic aspects, i.e., the aspects controlled by the players, and in this it goes beyond the classical theory of probability, which treats games limited to aspects of pure chance. The theory of games was first introduced by Borel in 1921, but it was established by John von Neumann in 1928, who, together with Morgenstern, laid the basis of what John Nash later generalized into what is nowadays called the mathematical theory of games, [103].
In any game there are a number of players (multiple decision-makers) who make a sequence of personal moves; at each move, each player has a number of choices from among several possibilities; also possible is a chance or random move (think of throwing a die). Considering examples of games, one well-known case is the game of chess, in which there are no chance moves once the game starts; bridge, which has chance moves but in which skill is important; and roulette, which is entirely a game of chance. In fact, in the game of chess each player knows every move that has been made so far, while in bridge this information is imperfect. At the end of the game there is some payoff to be gained (cost to be paid) by the players, which depends on how the game was played. Noncooperative game theory studies the strategic interaction among self-interested players. This is in contrast to standard optimization, where there is only one decision-maker who aims to minimize an objective function by choosing values of variables from a constrained set such that the system performance is optimized.
So far we have mentioned three elements: alternation of moves (individual or random (chance)), a possible lack of
knowledge and a payoff or cost function. A game G consists of a set of players (agents) N = {1, . . . , N}, an action
set (also referred to as a set of strategies) available to those players and an individual payoff (utility) Ui or cost
function Ji for each player i ∈ N . The convention used in most books on classical game theory (matrix form) and
evolutionary games is utility or payoff maximization. We shall use the convention of cost function minimization that
follows the control setup, keeping in mind the equivalence between the two approaches, [19]. Specifically, in a game
each player i ∈ N individually takes an optimal action to maximize its own payoff (utility) Ui , which is equivalent
to minimizing its own cost (loss) function, Ji , formally defined as Ji = −Ui . So we will always understand that when
a player aims to maximize its payoff (utility) this means to minimize its cost (loss) incurred during a game.
Each player’s success in making decisions depends on the decisions of the others. Let Ωi denote the set of actions
available to player i, which can be finite or infinite. This leads to either finite action set games, also known as matrix games, or infinite (continuous action set) games. In the latter case each player can choose its action from a continuum
of (possibly vector-valued) alternatives. A strategy can be regarded as a rule for choosing an action, depending on
external conditions. Once such a condition is observed, the strategy is implemented as an action. In the case of
mixed strategies, this external condition is the result of some randomization process. Briefly, a mixed strategy for
agent i is a probability distribution xi over its action set Ωi . In some cases actions are pure, or independent of any
external conditions, and the strategy space coincides with the action space. In discussing games in pure strategies
we shall use the terms "strategy" and "action" interchangeably to refer to some u ∈ Ω, and the game G can simply be
specified as G (N , Ωi , Ji ).
In the next sections we introduce these concepts for possible forms of a game as well as what we understand by
various solution concepts.
1.2 Games in Extensive Form

The extensive form of a game amounts to a translation of all the rules into the technical terms of a formal system designed to describe all games.
Extensive form games generally involve several acts or stages, and each player chooses a strategy at each stage. The game's information structure, i.e., how much information is revealed to which players concerning the game's outcomes and their opponents' actions in the previous stages, significantly affects the analysis of such games. Extensive form games are generally represented using a tree graph. Each node (called a decision node) represents a possible state of play of the game as it is played, [19]. Play begins at a unique initial node and flows through the tree along a path determined by the players until a terminal node is reached, where play ends and costs are assigned to all players. Each non-terminal node belongs to a player; that player chooses among the possible moves at that node, and each possible move is an edge leading from that node to another node. The analysis of such games becomes difficult with increasing numbers of players and game stages.
A formal definition is as follows.
Definition 1.1. A N-player game G in an extensive form is defined as a graph theoretic tree of vertices (states)
connected by edges (decisions or choices) with certain properties:
1. a specific vertex indicating the starting point of the game,

2. a function, called the cost function, which assigns an N-vector (tuple) (J1 , . . . , JN ) to each terminal vertex (outcome) of the game G , where Ji denotes the cost of player i, N = {1, . . . , N},
3. the set of non-terminal vertices of G is partitioned into N + 1 sets, S_0 , S_1 , . . . , S_N , called the player sets, where S_0 stands for the choices of chance (nature),

4. each vertex of S_0 has a probability distribution over the edges leading from it,

5. the vertices of each player set S_i , i = 1, . . . , N, are partitioned into disjoint subsets known as information sets, S_{ji} , such that two vertices in the same information set have the same number of immediate followers (choices/edges) and no vertex can follow another vertex in the same information set.
As a consequence of (5) a player knows which information set he is in but not which vertex of the information set.
A player i is said to have perfect information in a game G if each information set for this player consists of one
element. The game G in extensive form is said to have perfect information if every player has perfect information.
A pure strategy for player i, denoted by ui , is defined as a function which assigns, to each of player i's information sets S_{ji} , one of the edges leading from a representative vertex in this set S_{ji} . We denote by Ωi the set of all pure strategies of player i, ui ∈ Ωi , and by u = (u1 , . . . , uN ) the N-tuple of all players' strategies, with u ∈ Ω = Ω1 × · · · × ΩN .
A game in extensive form is finite if it has a finite number of vertices, hence each player has only a finite number of
strategies. Under this definition most parlor games are finite (think of chess). Let us look at a couple of examples.
Example 1.2. In the game of Matching Pennies (see Figure 1.1) player 1 chooses "heads" (H) or "tails" (T); player 2, not knowing this choice, also chooses between H and T. If the two choices match, then player 1 wins 1 cent from player 2 (hence a cost of −1 for player 1 and +1 for player 2); otherwise player 2 wins 1 cent from player 1 (the reverse of the above). The game tree is shown below, with the vectors at the terminal vertices indicating the cost function, while the numbers near the vertices denote the player to whom the move corresponds. The dotted (shaded) area indicates moves in the same information set.
[Figure 1.1: game tree of the Matching Pennies game; cost pairs are attached to the terminal vertices and the dotted area marks the information set of player 2.]
The next two figures show two other zero-sum game examples, which differ by the information available to player 2 at the time of its play (its information set), denoted by the shaded (dotted) area. In the first case, Figure 1.2, the two possible nodes of player 2 are in the same information set, implying that even though player 1 acts before player 2 does, player 2 does not have access to its opponent's decision. This means that at the time of its play, player 2 does not know at which node (vertex) it is. This is the same as saying that both players act simultaneously. The extensive form in the second case, Figure 1.3, admits a different matrix game in normal form. In this case each node of player 2 is included in a separate information set, i.e., player 2 has perfect information as to which branch of the tree player 1 has chosen.
[Figure 1.2: extensive form in which both nodes of player 2 belong to the same information set.]

[Figure 1.3: extensive form in which each node of player 2 is in a separate information set.]
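For concreteness, a small game tree such as the one in Figure 1.1 can be encoded directly as nested data. The sketch below is purely illustrative (Python, not a notation used later in the notes): each non-terminal node records the player to move and its information set, each terminal node records the cost vector (J1, J2), and a pure strategy maps information sets to moves.

# Matching Pennies in extensive form (sketch): player 1 moves first, player 2 moves
# without observing that move, so both of player 2's decision nodes share one information set.
tree = {
    "player": 1, "info_set": "P1",
    "moves": {
        "H": {"player": 2, "info_set": "P2",           # same information set ...
              "moves": {"H": (-1, 1), "T": (1, -1)}},   # ... as the node below
        "T": {"player": 2, "info_set": "P2",
              "moves": {"H": (1, -1), "T": (-1, 1)}},
    },
}

def outcome(node, strategy):
    """Follow the pure strategies (maps from information sets to moves) to a terminal cost vector."""
    while isinstance(node, dict):
        node = node["moves"][strategy[node["info_set"]]]
    return node

print(outcome(tree, {"P1": "H", "P2": "H"}))   # (-1, 1): player 1's cost is -1 when the choices match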
1.3 Games in Normal Form

Games in normal form (strategic form) model scenarios in which two or more players must make a one-time decision simultaneously. These games are sometimes referred to as one-shot games or simultaneous-move games. The normal form is a more condensed form of the game, stripped of all features but the choice of each player's pure strategies, and it is more convenient to analyze. The fact that all players make their choice of strategy simultaneously has
nothing to do with a temporal constraint, but rather with a constraint on the information structure particular to this
type of game. The information structure of a game is a specification of how much each player knows at the time he
chooses his strategy. For example, in Stackelberg games, [19], where there are leaders and followers, some players
(followers) choose their strategies only after the strategic choices made by the leaders have already been revealed.
In order to describe a normal-form game we need to specify players’ strategy spaces and cost functions. A strategy
space for a player is the set of all strategies available to that player, where a strategy is a complete plan of action
for every stage of the game, regardless of whether that stage actually arises in play. A cost function of a player i, Ji , is a mapping from the cross-product of the players' strategy spaces to that player's set of costs (normally the set of real numbers); hence it depends on all players' strategies. We will be mostly concerned with this type of normal-form game herein. For any strategy profile (N-tuple of players' pure strategies) u = (u1 , . . . , uN ) ∈ Ω, where Ω = Ω1 × · · · × ΩN is the overall pure-strategy space, let Ji (u) ∈ R denote the associated cost for player i, i ∈ N .
Now these costs depend on the context: in economics they represent a firm's profits or a consumer's (von Neumann-Morgenstern) utility, while in biology they represent the fitness (expected number of surviving offspring). We gather all these real numbers Ji (u) to form the combined pure-strategy vector cost function of the game, J : Ω → R^N, where

J(u) = [ J_1(u), . . . , J_N(u) ]^T .
We shall denote a normal-form game by G (N , Ωi , Ji ). It is possible to tabulate the function J for all possible values of u1 , . . . , uN ∈ Ω either in the form of a relation (easier for continuous or infinite games), or as an N-dimensional array (table) in the case of finite games (when Ω is a finite set). In this latter case, and when N = 2, this reduces to a matrix whose size is given by the number of available choices for the two players and whose elements are pairs of real numbers corresponding to the outcomes (costs) for the two players. That is the reason why, even when there are N > 2 players, such normal-form games are called matrix games.
Let us look at a few examples for N = 2, where we shall list player 1's choices as the rows and player 2's choices as the columns. Hence entry ( j, k) indicates the outcome of player 1 using its j-th pure strategy and player 2 using its k-th pure strategy.
player 2
H T
player 1 H (-1,1) (1,-1)
T (1,-1) (-1,1)
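As an illustration of how such a bimatrix game can be represented and evaluated computationally, the following sketch (a minimal example in Python, assuming the numpy library and the cost-minimization convention of these notes; all names are illustrative) stores the two cost matrices of the Matching Pennies table above and evaluates the cost pair for a pure strategy profile.

import numpy as np

# Cost matrices: rows = player 1's choices (H, T), columns = player 2's choices (H, T).
A = np.array([[-1, 1],
              [ 1, -1]])   # costs of player 1
B = -A                     # costs of player 2 (zero-sum: J1 + J2 = 0)

def pure_costs(j, k):
    """Return the cost pair (J1, J2) when player 1 plays row j and player 2 plays column k."""
    return A[j, k], B[j, k]

print(pure_costs(0, 0))  # both play H: (-1, 1)
print(pure_costs(0, 1))  # H against T: (1, -1)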
It turns out that one can transform any game in extensive-form into an equivalent game in normal-form, so we shall
restrict most of our theoretical development to games in normal-form only.
1.4 Game Features

Depending on various features of the game, one can classify games into different categories. Below we briefly discuss such a classification depending on the competitive nature of the game, the knowledge/information available to the players, and the number of times the game is repeated.
A cooperative game is one in which there can be cooperation between the players and/or they have the same cost
(also called team games). A non-cooperative game is one where an element of competition exists and we can have
coordination games, constant-sum games, and games of conflicting interests. We give below such examples for the
class of matrix games.
Coordination Games
In coordination games, what is good for one player is good for all players. An example coordination game in normal
form is

M = [ (−3, −3)   (0, 0)
      (0, 0)     (−4, −4) ]
In this game, players try to coordinate their actions. The joint action ( j, k) = (2, 2) is the most desirable (least cost),
but the joint action ( j, k) = (1, 1) also produces negative costs to the players. This particular game is called a pure
coordination game since the players always receive the same payoff.
Other coordination games move more toward the domain of games of conflicting interest. For example, consider the Stag-Hunt (SH) game (we shall come back to this example), in which each player chooses between hunting stag and hunting hare:

M = [ (−4, −4)   (0, −1)
      (−1, 0)    (−1, −1) ]
In this game, each player can choose to hunt stag (first row or first column) or hare (second row or second column).
In order to catch a stag (the biggest animal, hence the bigger payoff or lowest cost of −4), both players must choose
to hunt the stag. However, a hunter does not need help to catch a hare, which yields a cost of −1. Thus, in general,
it is best for the hunters to coordinate their efforts to hunt stag, but there is considerable risk in doing so (if the other
player decides to hunt hare). In this game, the costs (payoffs) are the same for both players when they coordinate
their actions, but their costs are not equal when they do not coordinate their actions.
Constant-Sum Games
Constant-sum games are games in which the players' payoffs sum to the same number for every outcome. These games are games of pure competition, of the type "my gain is your loss." Zero-sum games are a particular example of these games and we shall study them in detail in the next chapter. An example of such a game is the Rock, Paper and Scissors game, with the matrix form below

M = [ (0, 0)     (1, −1)    (−1, 1)
      (−1, 1)    (0, 0)     (1, −1)
      (1, −1)    (−1, 1)    (0, 0) ]
Games of Conflicting Interest
These fall in between constant-sum games and coordination games and cover a large class, whereby the players have somewhat opposing interests, but all players can benefit from making certain compromises. One can say that people (and learning algorithms) are often tempted to play competitively in these games (both in the real world and in games), though they can often hurt themselves by doing so. One of the most celebrated games of this type is the Prisoners' Dilemma game, which we shall encounter many times. Here the two players are two criminals whose action choices are to "Confess" (defect) or "Not Confess" (cooperate), resulting in corresponding jail time.
" #
(5, 5) (0, 15)
M=
(15, 0) (1, 1)
1.4.2 Repetition
Any of the previously mentioned kinds of games can be played any number of times between the same players.
One-shot Games
In one-shot games, players interact for only a single round (or stage). Thus, in these situations there is no possible
way for players to reciprocate (by inflicting punishment or rewards) thereafter.
Repeated Games
In repeated games, players interact with each other for multiple rounds and each time they play the same game. In
such situations, players have opportunities to adapt to each others’ behaviours (i.e.,“ learn") in order to try to become
more successful. There can be finite-horizon repeated games where the same game is repeated a fixed number of
times by the same players, or infinite-horizon games in which the play is repeated indefinitely.
Dynamic Games
The case where the game changes when players interact repeatedly is what can be called a repeated dynamic game,
characterized by a state. These are also called differential games. It is now good to point out that an important such
class are stochastic games or Markov games. These are extensions of Markov decision processes to the scenario
with N multiple adapting players. In a stochastic game we can model probabilistic transitions, and these are similar
to extensive form games, played in a sequence of stages. Formally, a N-player (player) stochastic game is denoted
by (Σ, Ωi , T , Ji ), where Σ is a set of states, Ωi , i = 1, . . . , N is set of actions for player i and Ω = Ω1 × · · · × ΩN is
the set of joint actions, T is a transition function T : Σ × Ω × Σ → [0, 1] and Ji is the cost function Ji : Σ × Ω → R
for player (player) i.
We shall not cover these games in these notes.
Depending on the amount of information a player has, different plays and outcomes may be possible. For example, does a player know the costs (or preference orderings) of the other players? Does the player know its own cost (payoff) matrix? Can it observe the actions and costs of the other players? All of these (and other related) questions are important as they can help determine how a player should learn and act. Theoretically, the more information a player has about the game, the better it should be able to do. In short, the information a player has about the game can vary along the following dimensions: knowledge of the player's own actions; knowledge of the player's own costs; knowledge of the existence of other players; knowledge of the other players' actions; knowledge of the other players' costs; and, in case learning is used, knowledge of the other players' learning algorithms.
In a game with complete information each player has knowledge of the payoffs and possible strategies of other
players. Thus, incomplete information refers to situations in which the payoffs and strategies of other players are not
completely known. The term perfect information refers to situations in which the actual actions taken by associates
are fully observable. Thus, imperfect information implies that the exact actions taken by associates are not fully
known.
1.5 Solution Concepts

A solution concept briefly describes how to use a certain set of mathematical rules to decide how to play the game. Various solution concepts have been developed, in an attempt to indicate/predict how players will behave when they play a generic game. Herein we only introduce these solution concepts briefly; many will be studied further in depth.
One of the most basic solution concepts, defined for every game, is the minimax solution (or minimax strategy), also called a security strategy. The minimax solution is the strategy that minimizes a player's maximum expected loss (cost). There is an
alternate set of terminology we can use (often used in the literature as we mentioned before). Rather than speak of
minimizing our maximum expected loss, we can talk of maximizing our minimum expected payoff. This is known
as the maximin solution. Thus, the terms minimax and maximin can be used interchangeably. We shall study this
solution in depth in the next chapter for zero-sum games.
Let us look at the Prisoner's Dilemma matrix game above. In the Prisoner's Dilemma (PD), both players are faced with the choice to "Confess" (defect) or to "Not Confess" (cooperate). If both players defect (confess), they
both receive a relatively low cost (which is 5 years in prison). However, if one of the players does "Not confess"
and the other "Confesses" (defects), the defector gets a very low cost (0 years or gets free) (called the temptation
cost), and the other ("Not confess") gets all the blame and receives a relatively high cost (15 years). If both players
do "Not confess" (cooperate), they get only 1 year (low cost). So what should you do in this game? Well, there are
a lot of ways to look at it, but if you want to play conservatively, you might want to invoke the minimax solution
concept, which follows from the following reasoning. If you play "Confess" (defect), the worst you can do is get a
cost of 5 (thus, we say that the security of defecting is 5 years). Likewise, if you play "Not Confess" (cooperate), the
worst you can do is get a cost of 15 years (the security of cooperating is 15 years). The minimax strategy (the lower security level of the two) in this game is "Confess", i.e., to defect, and the minimax value is 5. However, even though the
minimax value is the lowest cost you can guarantee yourself without the cooperation of your associates, you might
be able to do much better on average than the minimax strategy if you can either outsmart your associates or get
them to cooperate (see above: only 1 year each if both cooperate) in a game that is not fully competitive. So we need other solution
concepts as well.
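The security-level reasoning above can be checked mechanically. The sketch below (illustrative only, in Python with numpy, using the jail-time cost matrix of the Prisoners' Dilemma given earlier) computes each player's security (minimax) strategy and security level.

import numpy as np

# Prisoners' Dilemma costs (years in prison); row/column 0 = Confess, 1 = Not Confess.
A = np.array([[5, 0],
              [15, 1]])          # player 1's costs
B = A.T                          # player 2's costs (symmetric game)

# Security level of player 1: worst case over player 2's columns, then the best row.
worst_1 = A.max(axis=1)          # [5, 15]
row_star = worst_1.argmin()      # 0, i.e., "Confess"
print("player 1 security strategy:", row_star, "level:", worst_1[row_star])   # 0, 5

# Security level of player 2: worst case over player 1's rows, then the best column.
worst_2 = B.max(axis=0)
col_star = worst_2.argmin()
print("player 2 security strategy:", col_star, "level:", worst_2[col_star])   # 0, 5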
Another basic solution concept in multi-player games is to play the strategy that gives you the lowest cost given your opponents' strategies. That is exactly what the notion of best response suggests. Suppose that you are player i, and your opponents play u−i . Then your best response in terms of pure strategies is u∗i such that

u∗i ∈ arg min_{ui ∈ Ωi} J_i(ui , u−i ) .
We shall see that the best response idea has had a huge impact on learning algorithms. If you know what the other
players are going to do, why not get the lowest cost (highest payoff) you can get (i.e., why not play a best response)?
Taking this one step further, you might reason that if you think you know what other players are going to do, why
not play a best response to that belief ? While this obviously is not an unreasonable idea, it has two problems. The
first problem is that your belief may be wrong, which might expose you to terrible risks. Second (and perhaps more
importantly), we will see that this “best-response" approach can be quite unproductive in a repeated game when
other players are also learning/adapting.
Best response dynamics

In evolutionary game theory, best response dynamics (BR dynamics) represents a class of strategy updating rules, where players' strategies in the next round are determined by their best responses. In a large population model, players choose their next action probabilistically based on which strategies are best responses to the population as a whole. In general this will lead to a best-response correspondence (possibly multi-valued), with "jumps" from one strategy to another. Importantly, in these models players choose only a best response for the next round, i.e., a strategy that would give them the highest payoff (lowest cost) on that round. Players do not consider the effect that choosing a strategy on the next round would have on future play in the game. This constraint results in the dynamical rule often being called myopic best response. In order to avoid the use of multi-valued best response correspondences, some models use smoothed best response functions.
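A minimal sketch of a myopic best-response rule for a two-player matrix game in the cost convention is given below (illustrative only, in Python with numpy; it uses alternating pure-strategy updates rather than a population model, and the Stag-Hunt cost matrices from the earlier example).

import numpy as np

def best_response_dynamics(A, B, j0=0, k0=0, rounds=20):
    """Alternating myopic best-response updates for a bimatrix game (cost convention).
    Each player switches to a pure strategy minimizing its own cost against the
    opponent's current strategy (ties broken by lowest index)."""
    j, k = j0, k0
    for _ in range(rounds):
        j_new = int(np.argmin(A[:, k]))    # player 1's best response to column k
        k_new = int(np.argmin(B[j_new]))   # player 2's best response to the updated row
        if (j_new, k_new) == (j, k):       # neither player wants to move: a pure Nash equilibrium
            break
        j, k = j_new, k_new
    return j, k

# Stag-Hunt costs from the earlier example (index 0 = stag, 1 = hare).
A = np.array([[-4, 0], [-1, -1]])
B = np.array([[-4, -1], [0, -1]])
print(best_response_dynamics(A, B, j0=1, k0=1))  # (1, 1): stays at (hare, hare)
print(best_response_dynamics(A, B, j0=0, k0=0))  # (0, 0): stays at (stag, stag)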
We now briefly introduce the most celebrated solution concept for an N-player non-cooperative game G . John Nash's identification of the Nash equilibrium concept has had perhaps the single biggest impact on game theory. Simply put, in a Nash equilibrium no player has an incentive to unilaterally deviate from its current strategy. Put another way, if each player plays a best response to the strategies of all other players, we have a Nash equilibrium.
We will discuss the extent to which this concept is satisfying by looking at a few examples later on.
Definition 1.4. Given a game G , a strategy N-tuple (profile) u∗ = (u∗1 , . . . , u∗N ) is said to be a Nash equilibrium (or in equilibrium) if and only if

J_i(u∗i , u∗−i ) ≤ J_i(ui , u∗−i ) ,   ∀ ui ∈ Ωi , ∀ i ∈ N ,

where u∗ = (u∗i , u∗−i ) and u∗−i denotes the components of u∗ for all players except the i-th one, N = {1, . . . , N}.
Thus u∗ is an equilibrium if no player has a positive incentive for a unilateral change of his strategy, i.e., assuming the others keep their same strategies. In particular this means that once all choices of pure strategies have been revealed, no player has any cause for regret (hence the related notion of no regret).
Consider, for example, the following coordination game, where the entries are payoffs (to be maximized):
player 2
u2,1 u2,2
player 1 u1,1 (3,1) (0,0)
u1,2 (0,0) (1,3)
and note that both (u1,1 , u2,1 ) and (u1,2 , u2,2 ) are equilibrium pairs. For matrix games we shall use the matrix notation
and for the above we will say that (3,1) and (1,3) are equilibria.
By inspection as in the above, we can conclude that both joint actions ( j, k) = (1, 1) and ( j, k) = (2, 2) are Nash equilibria since in both cases neither player can benefit by unilaterally changing its strategy. Note, however, that this illustrates that not all Nash equilibria are created equal. Some give better outcomes than others (and different players might have different preference orderings over the Nash equilibria).
While all the Nash equilibria we have identified so far for these two games are pure-strategy Nash equilibria, they need not be so. In fact, there is also a third Nash equilibrium in the above coordination game in which both players play mixed strategies. Unfortunately, not every game has a pure-strategy equilibrium. Take a look at the game of Matching Pennies in Example 1.3, and you can see that it does not have (pure) equilibrium pairs.
Strategic dominance is another solution concept that can be used in many games. Loosely, an action is strategically
dominated if it never produces lower costs (higher payoffs) and (at least) sometimes gives higher costs (lower pay-
offs) than some other action. An action is strategically dominant if it strategically dominates all other actions. We
shall formally define this later on. For example, in the Prisoner’s Dilemma (PD) game, the action "Confess" (defect)
strategically dominates the "Not confess" (cooperate) in the one-shot game. This concept of strategic dominance
(or just dominance, as we will sometimes call it) can be used in some games (called iterative dominance solvable
games) to compute a Nash equilibrium.
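For finite games, iterated elimination of strictly dominated actions is easy to automate. The sketch below (illustrative only, in Python with numpy, using the cost convention and strict dominance by pure strategies) repeatedly removes any row or column that is strictly dominated, and applied to the Prisoners' Dilemma it leaves only (Confess, Confess).

import numpy as np

def iterated_strict_dominance(A, B):
    """Iteratively eliminate strictly dominated pure strategies (cost convention).
    Returns the indices of the surviving rows and columns."""
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    changed = True
    while changed:
        changed = False
        # Remove rows of player 1 strictly dominated by another surviving row.
        for j in rows[:]:
            if any(np.all(A[np.ix_([jj], cols)] < A[np.ix_([j], cols)]) for jj in rows if jj != j):
                rows.remove(j)
                changed = True
        # Remove columns of player 2 strictly dominated by another surviving column.
        for k in cols[:]:
            if any(np.all(B[np.ix_(rows, [kk])] < B[np.ix_(rows, [k])]) for kk in cols if kk != k):
                cols.remove(k)
                changed = True
    return rows, cols

# Prisoners' Dilemma (row/column 0 = Confess, 1 = Not Confess).
A = np.array([[5, 0], [15, 1]])
print(iterated_strict_dominance(A, A.T))   # ([0], [0]): only "Confess" survives for both players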
Here are a couple more observations about the Nash equilibrium as a solution concept:
• In constant-sum games, the minimax solution is a Nash equilibrium of the game. In fact, it is the unique Nash equilibrium of a constant-sum game as long as there is not more than one minimax solution (more than one occurs only when two strategies have the same security level). We shall study this in the next chapter.
• Since a game can have multiple Nash equilibria, this concept does not tell us how to play a game (or how we would guess others would play the game). This poses another question: given multiple Nash equilibria, which one should (or will) be played? This leads to the idea of considering refinements of the NE concept.
One of the features of a Nash equilibrium (NE) is that in general it does not correspond to a socially optimal outcome.
That is, for a given game it is possible for all the players to improve their costs (payoffs) by collectively agreeing to
choose a strategy different from the NE. The reason for this is that a posteriori some players may choose to deviate
from such a cooperatively agreed-upon strategy in order to improve their payoffs further at the group’s expense. A
Pareto optimal outcome describes a social optimum in the sense that no individual player can improve his payoff (or lower his cost) without making at least one other player worse off. Pareto optimality is not a solution concept, but it can be an important attribute in determining what solution the players should play (or learn to play). Loosely, a Pareto optimal (also called Pareto efficient) solution is a solution for which there exists no other solution that gives every player in the game a higher payoff (lower cost). A Pareto efficient (PE) solution is formally defined as follows.
Definition 1.6. A solution u∗ is strictly Pareto dominated if there exists a joint action u ∈ Ω for which Ji (u) < Ji (u∗ ) for all i, and weakly Pareto dominated if there exists a joint action u ≠ u∗ , u ∈ Ω, for which Ji (u) ≤ Ji (u∗ ) for all i.
Definition 1.7. A solution u∗ is weakly Pareto efficient (PE) if it is not strictly Pareto dominated and strictly PE if it
is not weakly Pareto dominated.
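Definitions 1.6 and 1.7 can be checked by enumeration in finite games. The following sketch (illustrative only, in Python with numpy, using the cost convention) tests whether a pure strategy profile of the Prisoners' Dilemma is strictly Pareto dominated and lists the weakly Pareto efficient profiles; the equilibrium (Confess, Confess) is not among them.

import numpy as np
from itertools import product

# Prisoners' Dilemma costs (row/column 0 = Confess, 1 = Not Confess).
A = np.array([[5, 0], [15, 1]])
B = A.T

def cost_vector(u):
    j, k = u
    return np.array([A[j, k], B[j, k]])

profiles = list(product(range(2), range(2)))

def strictly_dominated(u):
    """u is strictly Pareto dominated if some profile gives every player a strictly lower cost."""
    return any(np.all(cost_vector(v) < cost_vector(u)) for v in profiles if v != u)

weakly_PE = [u for u in profiles if not strictly_dominated(u)]
print(weakly_PE)                    # [(0, 1), (1, 0), (1, 1)]: (Confess, Confess) is excluded
print(strictly_dominated((0, 0)))   # True: the Nash equilibrium is Pareto dominated by (1, 1)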
Often, a Nash equilibrium (NE) is not Pareto efficient (optimal). Then one speaks of a loss of efficiency, which is also referred to as the price of anarchy. An interesting problem is how to design games with improved Nash efficiency, and pricing or mechanism design is concerned with such issues.
In addition to these solution concepts, other important ones include the Stackelberg equilibrium [19], which is relevant in games where the information structure plays an important role, and correlated equilibria [56], [108], which are relevant in games where the randomizations used to translate players' mixed strategies into actions are correlated.
1.5.5 Examples
Recall Example 1.3 of the Matching Pennies game. This is a zero-sum game, or a strictly competitive game, where
the interests are diametrically opposed and B = −A. It turns out this game has no (pure) Nash equilibrium.
Let us look at a couple of other examples of the simplest two-player matrix games, where the costs of the two players take the form of two matrices, A and B.
This representation does not account for the additional harm that might come from not only going to different
locations, but going to the wrong one as well (e.g. he goes to the opera while she goes to the football game,
satisfying neither). In order to account for this, the game is sometimes represented as in the pair below
" # " #
−3 −1 −2 −1
A= , B=
0 −2 0 −3
This game has two pure strategy Nash equilibria, one where both go to the opera and another where both go to the
football game. For the first game, there is also a Nash equilibrium in mixed strategies, where the players go to their
preferred event more often than the other.
" # " #
0 −4 0 −1
A= , B=
−1 −3 −4 −3
again a symmetric game. In terms of a single matrix M with double entry this is described by
" #
(0, 0) (−4, −1)
M=
(−1, −4) (−3, −3)
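By way of illustration, the pure Nash equilibria of this game can be found by brute force. The sketch below (illustrative only, in Python with numpy, cost convention; index 0 is taken as Hawk and 1 as Dove, consistent with the mutual-conflict outcome being the worst) checks Definition 1.4 for every pure profile.

import numpy as np
from itertools import product

# Hawk-Dove cost matrices from above (index 0 taken as Hawk, 1 as Dove).
A = np.array([[0, -4], [-1, -3]])
B = np.array([[0, -1], [-4, -3]])

def is_pure_nash(j, k):
    """No player can strictly lower its own cost by a unilateral deviation."""
    return A[j, k] <= A[:, k].min() and B[j, k] <= B[j, :].min()

print([(j, k) for j, k in product(range(2), range(2)) if is_pure_nash(j, k)])
# [(0, 1), (1, 0)]: one player plays Hawk while the other plays Dove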
The earliest presentation of a form of the Hawk-Dove game was by John Maynard Smith and George Price in their
1973 Nature paper, "The logic of animal conflict". This game is also known as the game of Chicken, an influential model of conflict between two players in game theory. The principle of the game is that while each player prefers not to
yield to the other, the worst possible outcome occurs when both players do not yield. The name “Hawk-Dove" refers
to a situation in which there is a competition for a shared resource and the contestants can choose either conciliation
“Dove" or conflict “Hawk" ; this terminology is most commonly used in biology and evolutionary game theory
(EGT). From a game-theoretic point of view, "chicken" and "hawk-dove" are identical; the different names stem
from parallel development of the basic principles in different research areas. We shall use the Hawk-Dove game in
the latter part of the notes concerned with the EGT approach.
Chapter 2
Matrix Games: 2-player Zero-Sum
Chapter Summary
This chapter provides basic concepts and results for two-player zero-sum finite (matrix) (2PZSM) games. Material
is mostly adapted from [19], [100, 107].
2.1 Introduction
Recall the Matching Pennies example in Chapter 1, a two-player game in which each player has only two pure
strategies, “Heads" (H) or “Tails" (T). The normal form (strategic form) of this game is described by
A = [ −1   1
       1  −1 ]

where the rows correspond to player 1 (P1) choosing (H) or (T) and the columns to player 2 (P2) choosing (H) or (T),
and B = −A. Here we can think of player 1 (P1) winning a dollar from player 2 (P2) if their choices match and losing a dollar to player 2 (P2) if they do not. This is an example of a two-player zero-sum matrix game that we study
in this chapter.
We start by formalizing such a game, the cost functions, strategy spaces and then move on to solution concepts for
the game.
Consider a two-player game, denoted G (N , Ωi , Ji ), where N = {1, 2} is the set of players, Ωi is the action set and
Ji the cost function for player i ∈ N . Thus J1 , J2 : Ω1 × Ω2 → R. Let the action of player 1 and 2 be denoted by
u1 ∈ Ω1 and u2 ∈ Ω2 , so that the two costs are J1 (u1 , u2 ), J2 (u1 , u2 ).
Such a game is called a two-player zero-sum game (2PZSG) if

J_1(u1 , u2 ) + J_2(u1 , u2 ) = 0 ,   ∀ (u1 , u2 ) ∈ Ω1 × Ω2 ,

and the overall action profile is u = (u1 , u2 ) ∈ Ω1 × Ω2 . Player 1 is the minimizer of J1 , while player 2 is the minimizer of J2 . In a two-player zero-sum game, based on the above relation, player 2 is the maximizer of J1 . In such a
case, sometimes we shall drop the index of cost function, and use J = J1 so that we say that Player 1 is the minimizer
of J, while Player 2 is the maximizer of J, and the game is denoted G (N , Ωi , J ).
Assume now that player 1 (P1) and player 2 (P2) each have a finite number of discrete options/actions or pure
strategies to choose from, m1 = m, m2 = n. Then their action sets Ω1 and Ω2 are finite and sometimes we
use the notation A1 and A2 (and the actions u1 , u2 might be denoted by a1 , a2 ). These sets can be simply identified
with the set of indices M1 := {1, ..., m} and M2 := {1, ..., n} corresponding to these possible actions. The action
u1 ∈ Ω1 of player 1 can have m values, and the j-th action can be identified with the index j ∈ M1 . Thus we let
Ω1 := {e11 , . . . , e1 j , . . . , e1m }, where e1 j ∈ Rm is the j-th unit vector in Rm . Similarly for player 2, u2 ∈ Ω2 can have
n values, and the k-th action can be identified with the index k ∈ M2 . Thus we let Ω2 := {e21 , . . . , e2k , . . . , e2n }, where
e2k ∈ Rn is the k-th unit vector in Rn . Let a jk denote the game outcome for player 1 when the j-th action is used by
player 1 and the k-th by player 2, respectively. This leads to an overall (m × n) matrix A, hence the name matrix
game. In terms of the cost function for player 1, we can write that his cost when (u1 , u2 ) pair is set to the ( j, k)-th
action pair, i.e., (u1 , u2 ) = (e_{1j} , e_{2k} ), is

J_1(e_{1j} , e_{2k} ) = a_{jk} = (e_{1j})^T A e_{2k} .

Thus

J_1(u1 , u2 ) = (u1)^T A u2 ,

where u1 ∈ Ω1 , u2 ∈ Ω2 . Similarly, a cost matrix B can be defined for player 2, and correspondingly its cost function is

J_2(u1 , u2 ) = (u1)^T B u2 .
Specializing the definition of a zero-sum game to finite action set (matrix games), gives the condition
A + B = 0, or B = −A
Hence, we can simply identify G by using only a single cost matrix A, i.e., G (N , Ωi , A). Each entry of the matrix A
is an outcome of the game corresponding to a particular pair of decisions by the players. Player 1 can be seen as
choosing one of the m rows in matrix A and player 2 as choosing one of its n columns.
As an example, consider the case when the first player has m = 3 choices, while the second player has n = 2 choices,
so that
e_{11} = (1, 0, 0)^T ,  e_{12} = (0, 1, 0)^T ,  e_{13} = (0, 0, 1)^T ,
e_{21} = (1, 0)^T ,  e_{22} = (0, 1)^T .
Assume the worst case for player 1: for each row, player 2 would pick the column with the maximum entry in that row; in example (2.1) these row maxima are 5, 2 and 4, and among these player 1 will choose the minimum, which is 2, hence he will choose row 2. Mathematically this means that he chooses row j∗ such that

max_k a_{j∗k} ≤ max_k a_{jk} ,   ∀ j.
If player 1 chooses like that, his losses will be at most 2, also called his loss ceiling, or the security level for his losses, and the strategy j∗ is called a security strategy of player 1. Similarly, player 2, who is the "maximizer", aims to secure his gains, hence to get at least the gain floor, or the security level for his gains, and the strategy k∗ is called a security strategy of player 2. Assuming again the worst case, i.e., that player 1 picks the row with the minimum entry in each column, the column minima are 1, 2 and −3, and player 2 will pick the maximum of them, which is 2, hence he will choose column 2. So in this case a security strategy of player 1 is row 2 and of player 2 is column 2, i.e., (2, 2), and the outcome is 2.
Then, denoting the left hand side in the above as

J_U = min_j max_k a_{jk} .
Definition 2.1. For a given m × n zero-sum matrix game G with cost matrix A, let j∗ ∈ M1 be a (pure) strategy chosen by player 1. Then j∗ is called a security strategy for player 1 if the following holds for all j ∈ M1:

max_{k∈M2} a_{j∗k} ≤ max_{k∈M2} a_{jk} ,   ∀ j ∈ M1 ,     (2.2)

hence

J_U := min_{j∈M1} max_{k∈M2} a_{jk} = min_{j∈M1} max_{k∈M2} (e_{1j})^T A e_{2k} .
Similarly, a security strategy k∗ of player 2 can be defined, giving his security level, denoted by J_L . J_U and J_L are the security levels of player 1 and 2, respectively, or the upper and lower values of the game.
Consider another example given by
0 −1 3
A= 1 0 0 (2.3)
3 4 1
In this case the security level of player 1 is (for each row, the maximum over the columns is [3, 1, 4]^T and player 1 picks the row with the minimum value, so row 2 is picked)

J_U = min_j max_k a_{jk} = 1 ,
while the security level of player 2 is (for each column, the minimum over the rows is [0, −1, 0] and player 2 picks the column with the maximum value, so either column 1 or 3 is picked)

J_L = max_k min_j a_{jk} = 0 .
If both players play their security strategies, hence either they use the strategy pairs (2, 1) or (2, 3), then the outcome
of the game is either 1 or 0.
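The computation of the upper and lower values in this example can be reproduced with a few lines of code (a sketch, in Python with numpy; player 1 minimizes over rows and player 2 maximizes over columns of the single cost matrix A).

import numpy as np

A = np.array([[0, -1, 3],
              [1,  0, 0],
              [3,  4, 1]])   # matrix (2.3)

J_U = A.max(axis=1).min()    # upper value: minimum over rows of the row maxima
J_L = A.min(axis=0).max()    # lower value: maximum over columns of the column minima
print(J_U, J_L)              # 1 0
print(A.max(axis=1).argmin())   # player 1's security strategy: row index 1 (second row)
print(A.min(axis=0).argmax())   # one security strategy of player 2: column index 0 (first column)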
It is obvious that in any matrix game there always exists at least one security strategy for each player, since there is a finite number of choices. Moreover, the (pure) maximin value is ≤ the (pure) minimax value, i.e.,

J_L ≤ J_U ,

since

min_j a_{jk} ≤ a_{jk} ≤ max_k a_{jk} ,   ∀ j, k ,

hence, with the notation above, and taking j∗ and k∗ as the security strategies, this yields

J_L = min_j a_{jk∗} ≤ a_{j∗k∗} ≤ max_k a_{j∗k} = J_U .     (2.4)
This says that in any game the outcome under security strategies lies between J_L and J_U , which are also called the lower and upper values (floor and ceiling values) of the game. If these two values are equal, then the game has a saddle-point equilibrium, and the common value is called the value of the game; we saw that in example (2.1) this is the case and the value of the game is J∗ = 2. Unfortunately, not all games possess such a saddle-point equilibrium.
Loosely speaking we shall call a strategy pair ( j∗ , k∗ ) an equilibrium (or to be in equilibrium) if after the game is
over, based on the outcome obtained, the players have no ground for regretting their past actions. In order to see
where this equilibrium concept is coming from let us look at another example given below in (2.5). We assume that
the two players act independently and the game is played once.
A = [ 4   0  −1
      0  −1   3
      1   2   1 ]        (2.5)
In this case the security level of player 1 is (for each row, the maximum over the columns is [4, 3, 2]^T and player 1 picks the row with the minimum value, so row 3 is picked)

J_U = min_j max_k a_{jk} = 2 ,
and his security strategy is row 3. The security level of player 2 is (for each column, the minimum over the rows is [0, −1, −1] and player 2 picks the column with the maximum value, so column 1 is picked)

J_L = max_k min_j a_{jk} = 0 ,
and his security strategy is column 1. Now if both players play their security strategies, hence they use the strategy pair (3, 1), then the outcome of the game is 1, which is in between J_L and J_U . To test the condition for regret, consider player 1 evaluating this outcome and thinking: "If I had known that player 2 plays his security strategy (hence column 1), I would have chosen row 2 and obtained a smaller outcome of 0", hence he regrets his action. Player 2 can reason similarly, and we see that the security strategy pair does not possess equilibrium properties. On the other hand, such a security strategy pair is conservative, since if a player had chosen a different one, then he might get a worse outcome than his security level. Here we see that the knowledge one player has can make a difference in how he plays and what outcome he gets. Such knowledge would, for example, be available to player 1 if the players do not act independently but player 2 acts first, followed by player 1. In general we shall assume simultaneous play.
Let us look again at example (2.1). The security strategies pair is (2, 2) with value 2 and neither player has any
cause for regret, and we say the strategies are optimal against one another, and in equilibrium, called saddle-point
equilibrium in pure strategies. Also since the security levels coincide it does not matter if they play simultaneously
(independently) or in a predetermined order. We give next the formal definition of such an equilibrium.
Definition 2.2. For a given m × n matrix game A = [a jk ], let row j∗ and column k∗ , i.e., ( j∗ , k∗ ) be a pair of (pure)
strategies chosen by the two players. Then if
a_{j∗k} ≤ a_{j∗k∗} ≤ a_{jk∗} ,   ∀ j = 1, . . . , m, ∀ k = 1, . . . , n,
the pair ( j∗ , k∗ ) is a saddle-point equilibrium and the matrix game is said to have a saddle-point in pure strategies,
and the corresponding outcome J ∗ = a j∗ k∗ is the saddle-point value of the game.
Note that this point is both the minimum in its column and maximum in its row.
The following result gives properties relating saddle-points with security strategies in zero-sum matrix games.
J_U = max_k a_{j*k} ≥ a_{j*k},  ∀k     (2.6)

Note that j* and k* always exist since there is a finite number of choices of actions for each player. Then taking k = k* on the RHS of (2.6) we get

a_{j*k*} ≤ J_U = max_k a_{j*k}
These provide another justification for (2.4). By assumption J_U = J_L, so that from the above

a_{j*k*} ≥ a_{j*k}, ∀k    and    a_{j*k*} ≤ a_{jk*}, ∀j

and from the outer left and right sides it follows that

a_{j*k} ≤ a_{jk*},  ∀j = 1, . . . , m, ∀k = 1, . . . , n

hence

max_k a_{j*k} ≤ a_{jk*},  ∀j = 1, . . . , m

Now since a_{jk*} ≤ max_k a_{jk} always holds, combining with the above yields

max_k a_{j*k} ≤ a_{jk*} ≤ max_k a_{jk},  ∀j = 1, . . . , m

and the outer left and right sides give (2.2), hence j* is a security strategy for player 1. Similarly it can be shown that k* is a security strategy for player 2. Part (iii) is immediate from the above.
Remark 2.4. Any two saddle-point strategies are ordered interchangeable, i.e., if ( j1 , k1 ) and ( j2 , k2 ) are two saddle-
point strategies, then ( j1 , k2 ) and ( j2 , k1 ) are also saddle-point strategies.
As another example let us look again at the Matching Pennies game, with cost matrix
" #
−1 1
A=
1 −1
Note that in each row, max over columns is [1, 1]T and player 1 chooses min so he can pick either row 1 or 2, which
both give JU = 1. Similarly for each column, min over rows is [−1, −1] and player 2 chooses max, so he can pick
either column 1 or 2, giving JL = −1. But JL ≠ JU in this case, so this game has no saddle point and hence no
solution in pure strategies.
As a way to get out of this difficulty one could decide not on a single (pure) strategy but on a choice between pure
strategies as dictated by chance (randomizing). Such a probability combination of the original pure strategies is
called a mixed strategy.
A player is said to use a mixed strategy whenever he/she chooses to randomize over the set of available actions.
Formally, a mixed strategy is a probability distribution that assigns to each available action a likelihood of being
selected. If only one action has a positive probability of being selected, the player is said to use a pure strategy.
Thus when an equilibrium cannot be found in pure strategies the space of strategies can be enlarged, and the players
are allowed to base their decisions on random events, hence mixed strategies. This is similar to solving the equation x² + 1 = 0, which has no solution in the real numbers but does in the enlarged space of complex numbers. Check that in the Matching Pennies game, if both players choose each of their pure strategies with probability 1/2, then each has an expected cost (loss) of 0, which seems an acceptable solution. This is the
case when the game is played repeatedly and the player 1 (player 2) aims to minimize (maximize) the expected cost
(outcome) of individual plays.
The introduction of mixed strategies was successful as a new solution concept because von Neumann [152] was able to show (as early as 1928) that for any matrix game the minimax value equals the maximin value in mixed strategies, hence any such game has a solution in mixed strategies. This is known as von Neumann's Minimax Theorem and is one of the key results of the theory of finite two-player zero-sum games (two-player zero-sum matrix (2PZSM) games). We shall prove it soon, but for now let us look at some formal definitions and properties of mixed strategies and expected costs in such strategies.
Consider a 2PZSM game with cost matrix A and u1 ∈ Ω1 := {e11, . . . , e1j, . . . , e1m}, u2 ∈ Ω2 := {e21, . . . , e2k, . . . , e2n}. Recall that the cost (or game outcome) when the two players use the pure strategy pair (j, k), j ∈ M1 := {1, . . . , m}, k ∈ M2 := {1, . . . , n}, i.e., when (u1, u2) = (e1j, e2k), is [A]_{j,k} = a_{jk}, written as

J(e1j, e2k) = (e1j)^T A e2k = a_{jk}     (2.8)

We identify the mixed strategy of player 1 with the vector x = [x_j], j ∈ M1, where x_j denotes the probability that player 1 chooses action j from his m available (pure) alternatives in Ω1, and x ∈ ∆1, where

∆1 = {x ∈ R^m | 1_m^T x − 1 = 0, x_j ≥ 0, ∀j}     (2.9)

so that

x = ∑_{j=1}^{m} e1j x(e1j) = ∑_{j=1}^{m} x_j e1j     (2.10)
where e1j ∈ R^m, j = 1, . . . , m, are the unit vectors, i.e., the pure strategies. Hence, by (2.10) and (2.9), any mixed strategy is a convex combination of the pure strategies e1j. Moreover, pure strategies are just extreme cases of mixed strategies (vertices of the simplex ∆1), e.g., when x_j = 1 and the rest are 0, giving e1j.
Similarly for player 2, we identify his mixed strategy with y = [y_k], k ∈ {1, . . . , n}, where y_k denotes the probability that player 2 will choose action k from his n available (pure) alternatives in Ω2, and y ∈ ∆2,

∆2 := {y ∈ R^n | ∑_{k=1}^{n} y_k = 1, y_k ≥ 0, ∀k = 1, . . . , n}

and

y = ∑_{k=1}^{n} e2k y(e2k) = ∑_{k=1}^{n} y_k e2k
We sometimes denote ∆X = ∆1 × ∆2 .
Recall the cost J (u1 , u2 ) when a pure strategy pair (u1 , u2 ) = (e1 j , e2k ) is used. Assuming that the two players’
strategies are jointly independent, the probability of selecting pure strategy pair (u1 , u2 ) = (e1 j , e2k ) ∈ Ω1 × Ω2 is
given by x(u1 ) · y(u2 ) = x(e1 j ) · y(e2k ) = x j · yk . Then the expected (averaged) cost is
∑_{u1∈Ω1} ∑_{u2∈Ω2} J(u1, u2) x(u1) y(u2) = ∑_{j=1}^{m} ∑_{k=1}^{n} J(e1j, e2k) x(e1j) y(e2k) = ∑_{j=1}^{m} ∑_{k=1}^{n} x_j J(e1j, e2k) y_k := J(x, y)     (2.11)
and this defines the mixed-strategy cost function, when x ∈ ∆1 , y ∈ ∆2 . Using (2.8) yields for the (average) or
expected cost when mixed-strategy (x, y) is used,
J(x, y) = ∑_{j=1}^{m} ∑_{k=1}^{n} x_j a_{jk} y_k = x^T A y     (2.12)
where x ∈ ∆1 and y ∈ ∆2 and A is the cost matrix in pure strategies, with A = [a jk ]. When x, y are restricted to be
vertices of ∆1 , ∆2 , hence are pure strategies, e1 j , e2k , we see that this recovers J (e1 j , e2k ) = a jk .
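As a small numerical illustration (a sketch, not from the notes), one can evaluate (2.12) directly and check that simplex vertices recover the pure-strategy entries; the Matching Pennies matrix is used as the example.

import numpy as np

A = np.array([[-1, 1],
              [1, -1]])            # Matching Pennies cost matrix of player 1

def expected_cost(x, y, A):
    """Mixed-strategy cost J(x, y) = x^T A y of (2.12)."""
    return float(x @ A @ y)

x = np.array([0.5, 0.5])
y = np.array([0.5, 0.5])
print(expected_cost(x, y, A))      # 0.0 -- the value quoted above for Matching Pennies

# at simplex vertices the pure-strategy cost a_jk is recovered
e11, e22 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(expected_cost(e11, e22, A))  # a_12 = 1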
A game could be denoted by G (N , Ω, J ) in its pure-strategy representation, or when we refer to its mixed-strategy
extension, by G (N , ∆X , J ).
Remark 2.5. Note that in mixed strategies, while player 1 is the minimizer of J 1 (x, y) = J (x, y), player 2 is the
minimizer of J 2 (x, y) = −J (x, y) hence the maximizer of J (x, y) and
J 1 (x, y) + J 2 (x, y) = 0
again indicating a zero-sum game. The function J (x, y) = xT A y that player 1 aims to minimize and player 2 aims to
maximize is called the kernel of the game.
Now that we extended the space of strategies and the cost definition let us extend the concept of security strategies
in mixed strategies.
Definition 2.6. For a given two-player zero-sum matrix (2PZSM) game G(N, ∆X, J) with m × n cost matrix A, let x* ∈ ∆1 be a mixed strategy chosen by player 1. Then x* is called a mixed security strategy for player 1 if

J̄_U := max_{y∈∆2} x*^T A y ≤ max_{y∈∆2} x^T A y,  ∀x ∈ ∆1

hence

x* = arg min_{x∈∆1} max_{y∈∆2} x^T A y    and    J̄_U := min_{x∈∆1} max_{y∈∆2} x^T A y

Similarly, y* ∈ ∆2 is called a mixed security strategy for player 2 if the analogous condition holds for all y ∈ ∆2, hence

y* = arg max_{y∈∆2} min_{x∈∆1} x^T A y    and    J̄_L := max_{y∈∆2} min_{x∈∆1} x^T A y

The quantities J̄_U and J̄_L are the expected (average) security levels of players 1 and 2, respectively, or the expected (average) upper and lower values of the game.
Note that the maximum and minimum in Definition 2.6 are guaranteed to exist. This follows because, for J̄_U, the quantity max_{y∈∆2} x^T A y is the maximum of finitely many linear functions of x, hence a continuous function of x ∈ ∆1, and the simplex ∆1 over which the optimization is done is closed and bounded, hence compact. Similar properties hold for J̄_L.
First we give some properties of the security strategies and security levels in mixed strategy case.
Lemma 2.7. In every two-player zero-sum matrix (2PZSM) game with cost matrix A, the security levels in pure and mixed strategies satisfy

J_L ≤ J̄_L ≤ J̄_U ≤ J_U

Proof: The outer inequalities follow because the pure strategies are particular (vertex) mixed strategies, so enlarging the strategy spaces to ∆1, ∆2 cannot increase the upper value and cannot decrease the lower value. For the middle inequality

J̄_L ≤ J̄_U

note that for every x ∈ ∆1 and y ∈ ∆2,

min_{x'∈∆1} x'^T A y ≤ x^T A y,    hence    max_{y∈∆2} min_{x'∈∆1} x'^T A y ≤ max_{y∈∆2} x^T A y

Then, since this holds for every x ∈ ∆1, taking the minimum with respect to x on the right-hand side yields J̄_L ≤ min_{x∈∆1} max_{y∈∆2} x^T A y = J̄_U.
Definition 2.8. For a given two-player zero-sum matrix (2PZSM) game with m × n cost matrix A = [a_{jk}], let x* = (x*, y*) ∈ ∆X be a pair of mixed strategies chosen by the two players. Then (x*, y*) is a saddle-point equilibrium in mixed strategies if both

x*^T A y* ≤ x^T A y*,  ∀x ∈ ∆1
x*^T A y ≤ x*^T A y*,  ∀y ∈ ∆2

hold. The quantity J* = x*^T A y* is called the saddle-point value or the value of the game in mixed strategies.
Note that if the above holds for x∗ = (x∗ , y∗ ) = (e1 j∗ , e2k∗ ), the equilibrium is pure, identified by ( j∗ , k∗ ), and
Definition 2.2 is recovered. In the pure equilibrium case we saw that equality can hold only in special cases. It turns
out that in the mixed-strategy case the two values are always equal. This is one of the most important theorems in
game theory and we shall prove it below.
This result has many proofs; we give the original one due to John von Neumann (1928), [152]. The proof due to Nash (1950) is based on Kakutani's fixed-point theorem, a much later result (1941).
The proof we give is based on the Separating Hyperplane theorem and uses the following lemma:
Lemma 2.9. Let Q be an arbitrary m × n matrix. Then either (i) or (ii) below must hold:
(i) there exists some y0 ∈ ∆2 such that xT Q y0 ≤ 0, ∀x ∈ ∆1
(ii) there exists some x0 ∈ ∆1 such that xT0 Q y ≥ 0, ∀y ∈ ∆2
Theorem 2.10 (Minimax Theorem). In any two-player zero-sum matrix (2PZSM) game with m × n cost matrix A, or G(N, ∆X, J) with J(x, y) = x^T A y, we have J̄_L = J̄_U, or

min_{x∈∆1} max_{y∈∆2} x^T A y = max_{y∈∆2} min_{x∈∆1} x^T A y
where
∆1 = {x ∈ R^m | 1_m^T x − 1 = 0, x_j ≥ 0, ∀j},    ∆2 = {y ∈ R^n | 1_n^T y − 1 = 0, y_k ≥ 0, ∀k}
Proof:
Step 1: We use Lemma 2.9 to show first that for any constant c we have either

(a)  max_{y∈∆2} min_{x∈∆1} x^T A y ≥ c

or

(b)  min_{x∈∆1} max_{y∈∆2} x^T A y ≤ c

To do this, we apply Lemma 2.9 to the matrix Q = −A + c 1_{m×n}, where 1_{m×n} denotes the m × n matrix with all entries equal to 1 and c is any constant. Note that this matrix satisfies x^T 1_{m×n} y = 1 for every x ∈ ∆1, y ∈ ∆2.
If (i) in Lemma 2.9 holds, it follows that there exists some y0 ∈ ∆2 such that x^T Q y0 ≤ 0 for all x ∈ ∆1, i.e., x^T A y0 ≥ c for all x ∈ ∆1; hence min_{x∈∆1} x^T A y0 ≥ c and therefore (a) holds. If instead (ii) in Lemma 2.9 holds, there exists some x0 ∈ ∆1 such that x0^T Q y ≥ 0 for all y ∈ ∆2. Hence

x0^T A y ≤ c,  ∀y ∈ ∆2

so max_{y∈∆2} x0^T A y ≤ c and therefore (b) holds.
Step 2: By Lemma 2.7 we always have max_{y∈∆2} min_{x∈∆1} x^T A y ≤ min_{x∈∆1} max_{y∈∆2} x^T A y. Assume, by contradiction, that the inequality is strict, i.e., that there is a gap

min_{x∈∆1} max_{y∈∆2} x^T A y − max_{y∈∆2} min_{x∈∆1} x^T A y = k    (2.13)

for some k > 0. Now if such a k > 0 exists, we can take c = max_{y∈∆2} min_{x∈∆1} x^T A y + k/2 and use Step 1. Then either (a) or (b) has to hold for this c as well. However, check that neither (a) nor (b) holds for this choice of c, which is a contradiction, since at least one of them has to be true. Thus the assumption (2.13) is false, hence there is no gap between the two values and the proof is complete.
We now prove Lemma 2.9.
Proof (of Lemma 2.9):
There are two possible cases:
(A) ∃y0 ∈ ∆2 , ∃ξ0 ∈ Rm , ξ0, j ≥ 0, s.t. Q y0 + ξ0 = 0
or,
(B) For ∀y ∈ ∆2 and ξ ∈ R^m with ξ_j ≥ 0, ∀j ∈ {1, . . . , m}, we have Q y + ξ ≠ 0.
Consider first case (A). We show that (i) in the Lemma holds. Under case (A), for every x ∈ ∆1 , xT (Q y0 + ξ0 ) = 0,
i.e.,
xT Q y0 = −xT ξ0 ≤ 0
where we have used the fact that entries in x and ξ0 are nonnegative. The above inequality shows that (i) in the
Lemma statement holds.
Consider now case (B). We show that (ii) in the Lemma holds. This follows in three steps.
Step 1: From the matrix Q, define the convex hull of the columns of [Q  I_m], denoted by C. Specifically, let

C = { x ∈ R^m | x = ∑_{k=1}^{n} q_k α_k + ∑_{j=1}^{m} e1j β_j,  for some α_k, β_j ≥ 0 such that ∑_{k=1}^{n} α_k + ∑_{j=1}^{m} β_j = 1 }

where q_k denotes the k-th column of Q.
Step 2:
CLAIM: If case (B) holds, then the vector 0 ∈ R^m does not belong to the convex hull C, i.e., 0 ∉ C.
Let us prove this claim by contradiction. Assume that 0 ∈ C , while under (B),
Q y + ξ ≠ 0,  ∀y ∈ ∆2, ∀ξ ∈ R^m, ξ ≥ 0    (2.14)
By the definition of C above, since 0 ∈ C this means we could find some convex combination α , β such that
0 = Qα + β,  for some α = [α_1, . . . , α_n]^T ∈ R^n, β = [β_1, . . . , β_m]^T ∈ R^m,  with α_k ≥ 0, β_j ≥ 0 and 1_n^T α + 1_m^T β = 1
Note that 1_n^T α = ∑_{k=1}^{n} α_k ≠ 0, otherwise all α_k would be zero, which from Qα + β = 0 would mean all β_j are zero also, not possible in a convex combination. Then dividing both sides of the above by 1_n^T α yields
Qᾱ + β̄ = 0

where

ᾱ = α / (1_n^T α) ∈ ∆2,    β̄ = β / (1_n^T α) ∈ R^m,  β̄ ≥ 0
This contradicts (2.14) for y = ᾱ, ξ = β̄; hence our assumption that 0 ∈ C is false and the CLAIM is proved.
Step 3:
Using the CLAIM in Step 2, since 0 ∉ C and C is convex, we can use the Separating Hyperplane theorem. A version of this states that two nonempty convex subsets of R^m can be properly separated by a hyperplane if and only if their relative interiors are disjoint.
This can be proved based on Kakutani's fixed point theorem, [73]. In the linear case above, X = ∆1, Y = ∆2, the kernel x^T A y is linear in each variable separately, hence trivially convex in x and concave in y. The proof of Theorem 2.10 given above uses geometric arguments instead.
The Minimax theorem guarantees that every two-player zero-sum matrix (2PZSM) game has optimal mixed strategies, but its proof is an existence proof and unfortunately does not tell us how to compute them. In the simplest case, when a saddle point exists, the corresponding pure strategy pair (j*, k*) is a Nash equilibrium, which is the special case of the mixed strategy x* = e1j*, y* = e2k*. We give below some methods that can be used to solve the easiest games.
Consider a two-player zero-sum matrix (2PZSM) game with m × n cost matrix A. We say that row j dominates row r if

a_{jk} ≤ a_{rk}, ∀k    and    a_{jk} < a_{rk} for at least one k

Then a strategy dominates another strategy if the choice of the first (dominating) strategy is at least as good as the second (dominated) one and in some cases better. Similarly, column k dominates column c if

a_{jk} ≥ a_{jc}, ∀j    and    a_{jk} > a_{jc} for at least one j
Dominated strategies can be eliminated since they will never be chosen. The following result states this.
Proposition 2.13. In a matrix game A assume that rows j1 , . . . , jl are dominated. Then player 1 has an optimal
strategy such that x j1 = · · · = x jl = 0. Moreover any optimal strategy for the game obtained after removing the
dominated strategies will be optimal for the original game.
A similar result holds for columns, hence we can work with smaller dimension matrices which can simplify the
process of finding an equilibrium.
As an example, consider a 4 × 3 matrix game in which the second row dominates the fourth row; hence player 1 will never use the 4-th strategy. Discarding it we are left with the reduced matrix

[ 2  1  4
  0  2  1
  1  5  3 ]
and we note that here column 3 dominates column 1, hence player 2 will never use his first strategy, and we can delete column 1. In

[ 1  4
  2  1
  5  3 ]

row 3 is dominated by row 2, hence we can remove it and we are left with

A′ = [ 1  4
       2  1 ]
so a 2 × 2 matrix game.
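A minimal Python sketch of this elimination procedure, using the reduced 3 × 3 matrix above as input (the original 4 × 3 matrix is not reproduced here): it deletes dominated rows for the minimizing player 1 and dominated columns for the maximizing player 2 until no further deletions are possible (the order of deletions may differ from the text, but the final reduced matrix is the same).

import numpy as np

def eliminate_dominated(A):
    """Iteratively delete dominated rows (player 1 minimizes) and columns (player 2 maximizes)."""
    A = np.array(A, dtype=float)
    rows, cols = list(range(A.shape[0])), list(range(A.shape[1]))
    changed = True
    while changed:
        changed = False
        # row r is dominated by row j if a_jk <= a_rk for all k, with < for some k
        for r in range(A.shape[0]):
            for j in range(A.shape[0]):
                if j != r and np.all(A[j] <= A[r]) and np.any(A[j] < A[r]):
                    A = np.delete(A, r, axis=0); del rows[r]; changed = True
                    break
            if changed: break
        if changed: continue
        # column c is dominated by column k if a_jk >= a_jc for all j, with > for some j
        for c in range(A.shape[1]):
            for k in range(A.shape[1]):
                if k != c and np.all(A[:, k] >= A[:, c]) and np.any(A[:, k] > A[:, c]):
                    A = np.delete(A, c, axis=1); del cols[c]; changed = True
                    break
            if changed: break
    return A, rows, cols

print(eliminate_dominated([[2, 1, 4], [0, 2, 1], [1, 5, 3]]))
# reduces to [[1, 4], [2, 1]] (surviving original rows [0, 1], columns [1, 2], 0-indexed)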
Now if the 2 × 2 game has a saddle point then all is done. If however there is no such saddle point we have to work
a bit more for the mixed-strategy equilibrium. An analytical method will be discussed in the next chapter for more
general bimatrix 2 × 2 games, that is applicable to zero-sum games also. Next we describe a graphical solution based
on minimax security strategies.
This method can be relatively easily applied to any 2 × n game, hence when one of the players has only 2 strategies
so that A has 2 rows and n columns.
Since every mixed strategy is a convex combination of pure strategies, a reasonable choice for the minimax strategy
of player 1 (a conservative strategy to limit his losses) is to choose x so as to secure (minimize) his losses against
every possible pure choice of player 2. When player 2 uses the k-th pure strategy u2 = e2k , while player 1 uses
mixed strategy x, the expected cost is, by (2.12),

J(x, e2k) = x^T A e2k = (A^T)_k x

where (·)_k denotes the k-th row of (·), hence (A^T)_k denotes the k-th row of A^T, i.e., the k-th column of A transposed. Hence the minimax strategy for player 1 is to minimize his maximum loss, that is, to minimize V1(x) := max_{k∈M2} J(x, e2k) over x ∈ ∆1. Then using x = [x1, x2]^T with x2 = 1 − x1, and denoting R_k(x1) = (a_{1k} − a_{2k}) x1 + a_{2k}, this leads to the minimization over x1 ∈ [0, 1] of

V1(x1) = max_k R_k(x1),  k = 1, . . . , n
Thus V1 (x1 ) is the maximum of n linear functions Rk (x1 ) in the single variable x1 . These are called the best-
response functions and can be plotted on the same graph. Then the maximum of these linear functions Rk (x1 ) can
be minimized by graphical methods leading to the mixed security strategy x∗ for P1.
As an example, consider a 2 × 4 game for which these lines are

R1 = −2 x1 + 4
R2 = 2 x1 + 1
R3 = −4 x1 + 5
R4 = 5 x1
Figure 2.1: Graphical method for finding mixed security strategy for P1
The red thick line represents V1(x1) (notice that it is piecewise linear, with kinks where the maximizing line changes). The lowest point on this line gives as its coordinate x1*, hence x2* = 1 − x1*, and the value of the game is V1(x1*).
This method can be used to compute solutions for zero-sum matrix games where A is 2 × 2, and it extends to 2 × n matrices.
We also need the mixed security strategy of player 2, P2. A conservative maximin choice for player 2 is to secure his gain, i.e., to choose y so as to maximize his payoff against every possible pure choice of player 1. When player 1 uses the j-th pure strategy u1 = e1j, while player 2 uses mixed strategy y,

J(e1j, y) = (e1j)^T A y = (A)_j y

where (·)_j denotes the j-th row of (·), hence (A)_j denotes the j-th row of A. Then the maximin strategy of player 2 is to maximize V2(y) := min_{j∈M1} J(e1j, y) over y ∈ ∆2; the maximizer y* is his maximin (mixed security) strategy and V2(y*) is his security level.
When n = 2 immediately as before one obtains y∗ , hence (x∗ , y∗ ). In case n = 3 pure strategies that give worse
outcome (dominated) are not used and can be eliminated; then a smaller 2 × 2 game can be considered. Similar
extension for 2 × n is possible, (see [19]).
An alternative to the graphical method is to convert the matrix game into a linear programming (LP) problem for
which efficient algorithms are available (simplex being the most famous). This relationship between a two-player
zero-sum matrix game and an LP problem is described below.
We assume that the A matrix has all entries positive, hence a jk > 0, ∀ j, k. Then the average (expected) value of the
game is given by
J = min_{x∈∆1} max_{y∈∆2} x^T A y = max_{y∈∆2} min_{x∈∆1} x^T A y > 0
We start with the left hand side. For a given x ∈ ∆1 , the resulting xT A y is maximized over ∆2 , hence leading to a y
that depends on the given x. Thus we denote V1 (x) := maxy∈∆2 xT A y > 0, which satisfies
V1(x) ≥ x^T A y,  ∀y ∈ ∆2    (2.23)
Thus we can write J = minx∈∆1 maxy∈∆2 xT A y = minx∈∆1 V1 (x), where V1 (x) > 0. Since ∆2 is a n-dimensional
simplex, we can write (2.23) for y = e2k , k ∈ M2 = {1, . . . , n}, i.e., V1 (x) ≥ xT A e2k , ∀k ∈ M2 = {1, . . . , n}, and in
an equivalent form this leads to the vector inequality
1_n V1(x) ≥ A^T x
where 1_n is the all-ones vector, 1_n = [1, . . . , 1]^T ∈ R^n. Using the scaling x̃ = x / V1(x), or x = x̃ V1(x), the problem becomes

min V1(x)   subject to   A^T x̃ ≤ 1_n,   x̃^T 1_m = 1 / V1(x),   x̃ ≥ 0

Since minimizing V1(x) > 0 is the same as maximizing 1 / V1(x) = x̃^T 1_m, this is equivalent to the maximization problem

max x̃^T 1_m   subject to   A^T x̃ ≤ 1_n,   x̃ ≥ 0
which is a standard LP problem. Solving this LP problem gives the mixed security strategy of player 1 normalized by the average value of the game J. Similarly, if we start with the player-2 side and introduce V2(y) := min_{x∈∆1} x^T A y ≤ x^T A y, ∀x ∈ ∆1, and ỹ = y / V2(y), we obtain his equivalent problem as the minimization

min ỹ^T 1_n   subject to   A ỹ ≥ 1_m,   ỹ ≥ 0
which is the dual of the maximization problem above. Thus we have shown that, given a matrix game with all entries positive, there exist two LP problems (dual to one another) whose solutions yield the saddle-point solution of the matrix game. In fact the positivity of A is only a convention, and it can be removed by a translation transformation (adding the same constant to every entry of A), see [19].
We mention here an online computation method called fictitious play (FP), which can be used in a repeated game: players improve their payoff (lower their loss) by keeping track of previous plays and making the decision about the next move based on these previous plays. We shall discuss it in Chapter 8.
2.5 Notes
One feature of a mixed strategy equilibrium is that given the strategies chosen by the other players, each player is
indifferent among all the actions that he/she selects with positive probability. Hence, in the Matching Pennies game,
given that player 2 chooses each action with probability 1/2, player 1 is indifferent among choosing H, choosing T,
and randomizing in any way between the two. Because randomization is more complex and cognitively demanding
than is the deterministic selection of a single action, this raises the question of how mixed strategy equilibria can be
sustained and how mixed strategies should be interpreted.
A formal interpretation is given in [62] by John Harsanyi (1973), who showed that a mixed strategy equilibrium of
a game with perfect information can be viewed as the limit point of a sequence of pure strategy equilibria of games
with imperfect information. Specifically, starting from a game with perfect information, one can obtain a family of
games with imperfect information by allowing for the possibility that there are small random variations in payoffs
and that each player is not fully informed of the payoff functions of the other players. Harsanyi showed that the
frequency with which the various pure strategies are chosen in these perturbed games approaches the frequency with
which they are chosen in the mixed strategy equilibrium of the original game as the magnitude of the perturbation
becomes vanishingly small.
Another interpretation comes from the field of evolutionary biology. Consider a large population in which each
individual is programmed to play a particular pure strategy. Individuals are drawn at random from that population
and are matched in pairs to play a game. The cost that results from the adoption of any specific pure strategy will
depend on the frequencies with which the various strategies are represented in the population. Suppose that those
frequencies change over time in response to cost differentials. For specific classes of games any trajectory that
begins at an interior state in which all strategies are present converges to the unique mixed strategy equilibrium of
the game. As another interpretation of the mixed strategy, the population frequency of each strategy corresponds to
the likelihood with which it is played in the mixed strategy equilibrium. We shall use this interpretation in the EGT
context in later chapters.
Chapter 3
Matrix Games: N-player Nonzero Sum
Chapter Summary
This chapter considers normal-form games with finite action sets, hence matrix games, in the general nonzero sum
case. Two-player or bimatrix games are treated first followed by N-player matrix games, both introducing pure and
mixed-strategy Nash equilibrium concepts. Basic results are presented mostly adapted from [19], [108].
3.1 Introduction
In this chapter we consider the class of N-player matrix games, where each player i has a finite number mi of discrete
options to choose from, so that the set of its actions is simply identified with a set of indices {1, ..., mi } corresponding
to these possible actions. On the other hand, in a continuous game each player can choose its action from a continuum
of (possibly vector-valued) alternatives, that is, Ωi ⊂ Rmi . This will be the focus of the next chapter.
We consider first the two-player or bimatrix games and then generalize to N-player matrix games. We discuss pure and mixed-strategy game formulations, review concepts of dominance and best replies, and then prove the important Nash equilibrium theorem, followed by a brief review of Nash equilibria refinements.
Consider a two-player matrix game, where player 1 and 2 have each a finite number m1 = m and m2 = n of discrete
options or pure strategies to choose from. Then the set of their actions Ω1 and Ω2 can be simply identified with the
set of indices M1 := {1, ..., m} and M2 := {1, ..., n} corresponding to these possible actions.
Let an action of player 1 be denoted by u1 ∈ Ω1 , and the j-th action (pure strategy) can be identified with the index
j ∈ M1 . Similarly for player 2, let its action be denoted by u2 ∈ Ω2 , and the k-th action can be identified with the
index k ∈ M2 . As in the previous chapter u1 ∈ Ω1 := {e11 , . . . , e1 j , . . . , e1m } and u2 ∈ Ω2 := {e21 , . . . , e2k , . . . , e2n },
where e1 j ∈ Rm and e2k ∈ Rn are the j-th unit vector in Rm and the k-th unit vector in Rn , respectively.
The two players have cost matrices A and B, respectively. Unlike the previous chapter on two-player zero-sum games, here we do not assume any special relation between B and A. Player 1's cost when the pure strategy pair (u1, u2) = (e1j, e2k) is used is a_{jk} = (e1j)^T A e2k, and player 2's cost is b_{jk} = (e1j)^T B e2k.
Definition 3.1. (Pure-strategy Nash equilibrium (NE)). For a given (A, B) bimatrix game let ( j∗ , k∗ ) be pair of
pure strategies chosen by the two players. If both
a j∗ k∗ ≤ a jk∗ , ∀ j = 1, . . . , m
b j ∗ k∗ ≤ b j ∗ k , ∀k = 1, . . . , n
hold, then ( j∗ , k∗ ) is a Nash equilibrium (NE) solution in pure strategies for the bimatrix game, and (a j∗ k∗ , b j∗ k∗ ) is
an NE equilibrium outcome of the game.
Of course, when b_{jk} = −a_{jk} the above definition recovers Definition 2.2 for zero-sum games. In that case all saddle points (minimax strategies) are NE, and they all have the same value, hence are interchangeable. In bimatrix games this is no longer the case.
In the case when more then a single NE solution exists, we shall denote this set by NE (G ). The game outcome can
be different for different NE solutions and the question of ordering the elements in NE (G ) arises. Since we cannot
have total ordering between pairs of numbers one can resort to partial ordering in order to arrive at some preferential
choice.
Definition 3.2. A pair of pure strategies (j*, k*) is said to be better than another pair (j̃, k̃) if

a_{j*k*} ≤ a_{j̃k̃}  and  b_{j*k*} ≤ b_{j̃k̃}

with at least one of the inequalities strict.
It can be shown that the set of NE equilibrium points NE (G ) is invariant to positive affine transformations on the
cost functions.
Definition 3.3. Two bimatrix games (A, B) and (A, e Be) are said to be strategically equivalent games if there exist
positive constants α1 , α2 > 0 and scalar β1 , β2 such that
aejk = α1 a jk + β1 , ∀ j = 1, . . . , m, ∀k = 1, . . . , n
b̃_{jk} = α2 b_{jk} + β2,  ∀j = 1, . . . , m, ∀k = 1, . . . , n
Proposition 3.4. All strategically equivalent bimatrix games have the same NE equilibria NE (G ).
for the choices of to “Confess" /(defect) (first strategy) or to “Not Confess" (cooperate) (second strategy) available
for the two prisoners. This is a game in which there are gains from cooperation between players; the best outcome
is for the players to not confess (second strategy), but each player has an incentive to be a “free rider" and defect
(he would get free and the other 15 years in prison). Whatever one player does, the other prefers “Confess" so
that game has a unique pure Nash equilibrium (Confess, Confess) or pair (1, 1) giving (5, 5) cost (years in prison).
Indeed player 1’s first pure strategy (“Confess") gives a smaller loss (years in prison) (higher payoff) than its second
pure strategy, irrespective of what strategy player 2 uses. Similarly, player 2’s first pure strategy (confess) gives
always a smaller loss (higher payoff) than its second pure strategy: each entry in first column of B matrix is less
than the corresponding entry in the second column. Hence individual rationality of minimizing its cost or loss leads
to each player to select the first strategy, hence both “Confess" as the NE and both get 5 years. The dilemma arises
because players would have lower loss (1 year) if they were to select together their second strategy (“Not Confess"
(cooperate)), but this would require some trust and coordination.
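A brute-force check of Definition 3.1 can be coded directly. The PD cost matrices below are an assumption: one standard choice consistent with the years quoted in the text (5 years each if both confess, 1 year each if neither confesses, 0 versus 15 years otherwise), since the example's matrices are not reproduced here.

import numpy as np

# row/column 1 = "Confess", row/column 2 = "Not Confess"; both players minimize cost (years in prison)
A = np.array([[5, 0],
              [15, 1]])       # cost of player 1 (row chooser)
B = np.array([[5, 15],
              [0, 1]])        # cost of player 2 (column chooser)

def pure_nash(A, B):
    """Enumerate pure NE of a bimatrix game by checking Definition 3.1 directly."""
    m, n = A.shape
    return [(j, k) for j in range(m) for k in range(n)
            if A[j, k] <= A[:, k].min() and B[j, k] <= B[j, :].min()]

print(pure_nash(A, B))         # [(0, 0)] -- (Confess, Confess), with cost (5, 5)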
Unfortunately not every (A, B) game has a pure NE equilibrium, similar to the case of two-player zero-sum games
case in the previous chapter. As there, this leads to the mixed-strategy extension of a game.
Consider an (A, B) game, where A and B are the m × n cost matrices for player 1 and 2, respectively. Instead of
only using pure strategies u = (u1 , u2 ) , sometimes we allow randomized strategies or mixed-strategies. As in the
previous chapter, let us denote by x and y the mixed strategies of player 1 and player 2, respectively. As before, x_j and
yk denote the probability of selecting the pure j-th and k-th action. Let x = (x1 , x2 ) = (x, y), where x ∈ ∆1 , y ∈ ∆2 .
Then as in (2.11), Chapter 2 we can write for the expected cost of player 1 when mixed-strategy x is used,
Then since a finite number m and n of pure strategies are available to player 1 and 2 this means
J̄_1(x) = J̄_1(x, y) = ∑_{j=1}^{m} ∑_{k=1}^{n} J1(e1j, e2k) x(e1j) y(e2k)

J̄_2(x) = J̄_2(x, y) = ∑_{j=1}^{m} ∑_{k=1}^{n} J2(e1j, e2k) x(e1j) y(e2k)
Let x(e1j) = x_j, y(e2k) = y_k. Then, since J1(e1j, e2k) = (e1j)^T A e2k = a_{jk} and J2(e1j, e2k) = (e1j)^T B e2k = b_{jk}, we can write

J̄_1(x) = J̄_1(x, y) = ∑_{j=1}^{m} ∑_{k=1}^{n} a_{jk} x_j y_k = x^T A y

J̄_2(x) = J̄_2(x, y) = ∑_{j=1}^{m} ∑_{k=1}^{n} b_{jk} x_j y_k = x^T B y
Remark 3.6. In the case when B = −A, the game is a two-player zero-sum matrix game
Definition 3.7. For a given (A, B) bimatrix game let x* = (x*, y*) be a pair of mixed strategies chosen by the two players. Then x* = (x*, y*) ∈ ∆X is a Nash equilibrium in mixed strategies if, for any other mixed strategies (x, y), both
x∗T A y∗ ≤ xT A y∗ , ∀x ∈ ∆1
x∗T B y∗ ≤ x∗T B y, ∀y ∈ ∆2
hold.
Consider the following sets. For every given y ∈ ∆2, player 1 uses a strategy ξ ∈ ∆1 that gives him the best cost among all x ∈ ∆1, i.e., he selects from the set

Φ1(y) := { ξ ∈ ∆1 | ξ^T A y ≤ x^T A y, ∀x ∈ ∆1 }

Similarly, for every given choice x ∈ ∆1 of player 1, player 2 uses a strategy η ∈ ∆2 that gives him the best cost among all y ∈ ∆2, i.e., he selects from the set

Φ2(x) := { η ∈ ∆2 | x^T B η ≤ x^T B y, ∀y ∈ ∆2 }

These are the best-response sets of the two players, and Φ1, Φ2 are the best-response mappings. Because the
image Φ1 (y) can be a set and not a single value, Φ1 is not a function but a set-valued mapping or correspondence
and the same is true for Φ2 . Thus we denote them by Φ1 : ∆2 ⇒ ∆1 , Φ2 : ∆1 ⇒ ∆2 (note the double arrows). From
Definition 3.7, we see that a pair (x∗ , y∗ ) is an NE if (x∗ , y∗ ) ∈ ( Φ1 (y∗ ), Φ2 (x∗ ) ), which means if it is simultaneously
in the best-response set of both players, or in their intersection.
For bimatrix games (N = 2), computational methods for finding mixed NE strategies are fewer and more involved than for zero-sum games. The Lemke–Howson algorithm is one example; another approach converts the problem into a nonlinear programming problem [19].
For bimatrix games with only 2 choices, i.e., for 2 × 2 games this computation is simpler and an analytical method
can be easily obtained. Let ∆1 = ∆2 = ∆ ⊂ R2 and x = (x1 , x2 ) ∈ ∆, where
∆ = {x ∈ R2 | x1 + x2 = 1, x1 ≥ 0, x2 ≥ 0 }
Every x ∈ ∆ can be written as x = [x1 (1 − x1 )]T or x = (x1 , (1 − x1 )), with x1 ∈ [0, 1] and similarly, every y ∈ ∆
can be written as y = [y1 (1 − y1 )]T or y = (y1 , (1 − y1 )), with y1 ∈ [0, 1]. Thus a point (x1 , y1 ) in the unit square
[0, 1] × [0, 1] uniquely represents a mixed-strategy pair (x, y) ∈ ∆ × ∆.
Denote

A = [ a11  a12
      a21  a22 ],
B = [ b11  b12
      b21  b22 ]
The cost function of player 1 can be written as

J̄_1(x, y) = x^T A y = (ã y1 − c̃1) x1 − c̃2 y1 + a22

where

ã = a11 − a12 − a21 + a22,   c̃1 = a22 − a12,   c̃2 = a22 − a21

For a given y1, player 1 minimizes over x1 ∈ [0, 1], so his admissible set is

φ1(y1) = { ξ1 ∈ [0, 1] | (ã y1 − c̃1) ξ1 ≤ (ã y1 − c̃1) x1, ∀x1 ∈ [0, 1] }

or, explicitly,

φ1(y1) =  0        if ã y1 − c̃1 > 0
          [0, 1]   if ã y1 − c̃1 = 0
          1        if ã y1 − c̃1 < 0

Then (x1*, y1) is an admissible point for player 1 if x1* ∈ φ1(y1). Similarly, the cost function of player 2 is

J̄_2(x, y) = x^T B y = (b̃ x1 − d̃2) y1 − d̃1 x1 + b22

where

b̃ = b11 − b12 − b21 + b22,   d̃1 = b22 − b12,   d̃2 = b22 − b21

and for a given x1, player 2 minimizes over y1 ∈ [0, 1], so that

φ2(x1) =  0        if b̃ x1 − d̃2 > 0
          [0, 1]   if b̃ x1 − d̃2 = 0
          1        if b̃ x1 − d̃2 < 0

Then (x1, y1*) is an admissible point for player 2 if y1* ∈ φ2(x1).
Response correspondences φ1 (y1 ) and φ2 (x1 ) for 2 × 2 normal form games can be drawn with a line for each player
in a unit square strategy space, [0, 1] × [0, 1]. In order for a point (x∗1 , y∗1 ) to be admissible for both players we need
x∗1 ∈ φ1 (y∗1 ) and y∗1 ∈ φ2 (x∗1 ). This set can be found graphically by plotting φ1 (y1 ) and φ2 (x1 ) over x1 ∈ [0, 1] and
y1 ∈ [0, 1] on the same graph (note they are discontinuous) and taking their intersection. Thus any such (x∗1 , y∗1 ) ∈
[0, 1] × [0, 1], uniquely represents an admissible mixed-strategy pair (x∗ , y∗ ) ∈ ∆1 × ∆2 , and all of these give the set
of NEs denoted by NE (G ).
From φ1(y1), note that for

y1 = c̃1 / ã

the bracket ã y1 − c̃1 vanishes, so φ1(y1) = [0, 1] and the defining inequality is independent of x1. From φ2(x1), note that for

x1 = d̃2 / b̃

the bracket b̃ x1 − d̃2 vanishes, so φ2(x1) = [0, 1], independent of y1. Thus, when these values lie in [0, 1], the pair (x1, y1) = (d̃2/b̃, c̃1/ã) is in the intersection of the two correspondences, and

(x, y) = ( [ d̃2/b̃ ;  1 − d̃2/b̃ ],  [ c̃1/ã ;  1 − c̃1/ã ] )

is a completely mixed NE. For the Matching Pennies game this yields

NE(G) = {(x*, x*)},    x* = [ 1/2 ; 1/2 ]

i.e., both players independently choose heads (H) and tails (T) with probability 1/2 each.
The method can be applied to zero-sum games, B = −A, as well as to symmetric matrix games, B = AT .
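The interior (completely mixed) NE candidate (x1, y1) = (d̃2/b̃, c̃1/ã) is easy to compute numerically. The sketch below is illustrative only; it assumes ã ≠ 0 and b̃ ≠ 0 and that the resulting probabilities fall in [0, 1], and it checks the formula on Matching Pennies.

import numpy as np

def interior_mixed_ne_2x2(A, B):
    """Candidate completely mixed NE of a 2x2 bimatrix game from the indifference conditions above."""
    A, B = np.array(A, float), np.array(B, float)
    a_t = A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1]     # a~
    c1 = A[1, 1] - A[0, 1]                          # c~_1
    b_t = B[0, 0] - B[0, 1] - B[1, 0] + B[1, 1]     # b~
    d2 = B[1, 1] - B[1, 0]                          # d~_2
    x1, y1 = d2 / b_t, c1 / a_t                     # (x1, y1) = (d~_2 / b~, c~_1 / a~)
    return np.array([x1, 1 - x1]), np.array([y1, 1 - y1])

# Matching Pennies: B = -A; both probabilities should come out 1/2
A = [[-1, 1], [1, -1]]
B = [[-a for a in row] for row in A]
print(interior_mixed_ne_2x2(A, B))                  # [0.5 0.5] [0.5 0.5]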
Consider as another example

A = [ 1  0
      0  4 ],
B = [ 4  0
      0  1 ]

which has the pure strategy NE (e1, e2) and (e2, e1), and a mixed strategy NE given by

(x*, y*) = ( [ 1/5 ; 4/5 ],  [ 4/5 ; 1/5 ] )

that gives an expected cost of 4/5 to both players. Thus NE(G) = {(e1, e2), (e2, e1), (x*, y*)}.
Example 3.8. Recall Example 1.8 (Battle of Sexes or BoS) which is concerned with a couple that agreed to meet
this evening, but cannot decide if they will be attending the opera (1st option) or a football match (2nd option). The
husband would most of all like to go to the football game. The wife would like to go to the opera. Both would prefer
to go to the same place rather than different ones. If they cannot communicate, where should they go?
The cost matrices are (A, B) below, where the wife chooses a row and the husband chooses a column. A is the cost
matrix for the wife and B represents the cost to the husband,
" # " #
−3 0 −2 0
A= , B=
0 −2 0 −3
This game has two pure strategy Nash equilibria, one where both go to the opera, i.e., (1, 1) or (e1 , e1 ), and another
where both go to the football game, (2, 2) or (e2, e2). Verify that there is also a Nash equilibrium in mixed strategies, with x1 = 3/5, y1 = 2/5. This presents an interesting case for game theory since each of the Nash equilibria is deficient
in some way. The two pure strategy Nash equilibria are unfair; one player consistently does better than the other,
while the mixed strategy NE is inefficient (gives higher average cost).
Let us now look at some simple games, called symmetric 2 × 2 games, where B = AT and
" #
a11 a12
A=
a21 a22
This will be the case we treat later in an evolutionary game setup (EGT). The same method as in the previous section can be applied. In this case, moreover, some further classification can be made. Based on Definition 3.3 and Proposition 3.4, we can reduce this to a related game: first, by subtracting a21 from column 1 and a12 from column 2, we obtain
the equivalent matrix

A = [ a11 − a21       0
          0       a22 − a12 ]
which is itself symmetric. Such a game is called a doubly symmetric game. Then, by denoting a1 = a11 − a21 ,
a2 = a22 − a12 (where a2 = c̃1 and a1 + a2 = ã in the previous section), we consider the game with cost matrix
" #
a1 0
A=
0 a2
We call such a process normalization of a symmetric game. We shall classify games according to which quadrant (I, II, III, or IV) of the plane the point a = (a1, a2) ∈ R² lies in, because all games in each category have the same properties.
Let us consider the PD game in Example 3.5. This game has a1 < 0 and a2 > 0 (in quadrant II) and belongs to the
class of games with dominated strategies (after normalization of the matrix A). For instance, in the single-play PD, the “Not Confess” strategy 2 is not optimal against any (mixed) strategy of the opponent. In this case strategy 1 strictly dominates strategy 2 and NE(G) = {(e1, e1)}, i.e., (Confess, Confess).
The second case is when a1 < 0 and a2 < 0 (quadrant III) and such a game is in the class of coordination games, in
which players have minimum loss when they both choose the same strategy. It is evident that (e1, e1) and (e2, e2) are two pure NE. Applying the method of the previous section also gives a mixed NE, with

x = x* = [ a2/(a1 + a2) ;  a1/(a1 + a2) ]

and y = x*, so the set of NEs is
NE (G ) = {(e1 , e1 ), (e2 , e2 ), (x∗ , x∗ )}
Examples of such games include the Stag Hunt (see below). The third class is the class of anti-coordination games, when a1 > 0 and a2 > 0 (quadrant I); here the best reply to a pure strategy is the other pure strategy, so that (e1, e2) and (e2, e1) are two pure NE, as well as a mixed NE
(x∗ , y∗ ) with x∗ = x, y∗ = y hence the set of NEs is
NE (G ) = {(e1 , e2 ), (e2 , e1 ), (x∗ , y∗ )}
An example in this class is the Hawk-Dove (HD) game, Example 1.9, or the game of Chicken (see below). The
fourth case is when a1 > 0 and a2 < 0 so quadrant IV which is the mirror image of the case quadrant II.
In the game of Chicken each player chooses to Swerve or Not Swerve, with B = A^T. Because the cost of swerving is small compared to both players not swerving (a crash), the reasonable strategy would seem to be to swerve. Yet, knowing this, if one believes one's opponent to be reasonable, one may
well decide not to swerve at all. This unstable situation can be formalized by saying there is more than one Nash
equilibrium. In this case, the pure strategy equilibria are the two situations wherein one player Swerves while the
other does Not Swerve. Indeed this gives (e1 , e2 ) and (e2 , e1 ) as the two pure NE and (x∗ , x∗ ) as a mixed-strategy
NE. Both Chicken and Hawk-Dove are anti-coordination games, in which it is mutually beneficial for the players to
play different strategies. In this way it can be thought of as the opposite of a coordination game.
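The classification above can be turned into a small routine. The sketch below normalizes a symmetric 2 × 2 game (B = A^T) and reports its class and NE set as described in the text; degenerate cases a1 = 0 or a2 = 0 are not handled, and the quadrant-IV outcome is stated by symmetry with quadrant II.

import numpy as np

def classify_symmetric_2x2(A):
    """Normalize a symmetric 2x2 game and report its class and NE set (players minimize cost)."""
    A = np.array(A, float)
    a1 = A[0, 0] - A[1, 0]        # a_1 = a_11 - a_21
    a2 = A[1, 1] - A[0, 1]        # a_2 = a_22 - a_12
    if a1 < 0 and a2 > 0:         # quadrant II: strategy 1 dominates
        return (a1, a2), "dominated strategies: NE = {(e1, e1)}"
    if a1 > 0 and a2 < 0:         # quadrant IV (mirror of II): strategy 2 dominates
        return (a1, a2), "dominated strategies: NE = {(e2, e2)}"
    x1 = a2 / (a1 + a2)           # interior mixed NE component when a1, a2 have the same sign
    if a1 < 0 and a2 < 0:         # quadrant III
        return (a1, a2), f"coordination: NE = {{(e1,e1), (e2,e2), mixed x1={x1:.3f}}}"
    return (a1, a2), f"anti-coordination: NE = {{(e1,e2), (e2,e1), mixed x1={x1:.3f}}}"

# PD costs as assumed earlier: a1 = 5 - 15 = -10 < 0, a2 = 1 - 0 = 1 > 0 (quadrant II)
print(classify_symmetric_2x2([[5, 0], [15, 1]]))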
The same game formulation can be easily extended from two-player (bimatrix) games to N-player finite (matrix)
games. We shall need some more compact notations. Let the set of players be denoted by I = {1, . . . , N} and assume
that ui ∈ Ωi where Ωi is the finite action set.
Let the strategy of player i, i ∈ N = {1, . . . , N}, be denoted by ui (pure strategy) or by xi (mixed strategy). Each player i has a finite number mi of actions at his disposal. Player i, i ∈ N, uses a pure strategy ui by simply choosing an action j from his mi possible ones in Mi = {1, . . . , mi}, or equivalently by ui taking a value ei j from his mi available alternatives in Ωi = {ei1, . . . , ei j, . . . , eimi}. Let u = (u1, . . . , uN) ∈ Ω denote the pure strategy (action) profile used by all N players, where Ω = Ω1 × · · · × ΩN is the overall pure-strategy space of the game (the Cartesian product of the players' action sets). The cost function is denoted by Ji = Ji(u1, . . . , uN) as before, corresponding to a cost
matrix Ai .
The j-th pure strategy (action) available to player i is ui = ei j where ei j is the unit vector in Rmi . As before, in
the case of mixed-strategies, the actual action selected is the outcome of some randomization process. Generalizing
from two-player games, a mixed-strategy for player i, i ∈ N , denoted by xi is a probability distribution on the set Ωi
of its pure ui strategies, hence xi (ui ) ∈ ∆i (Ωi ), where ∆i (Ωi ) denotes the mixed-strategy set. In this matrix game
case let xi, j denote the probability that player i will choose action j, ui = ei j , from his mi available (pure) alternatives
in Ωi. Then a mixed strategy xi is just the vector composed of the probabilities associated with the available actions, xi = [xi,1, . . . , xi,mi]^T ∈ ∆i, where

∆i = { xi ∈ R^{mi} | ∑_{j∈Mi} xi,j = 1,  xi,j ≥ 0, ∀j ∈ Mi }

denotes the unit simplex in the R^{mi} space, of dimension mi − 1. The vertices of ∆i are the unit vectors (pure strategies)
ei j above. Since mixed-strategy xi is a probability distribution over the set Ωi of pure strategies and since Ωi is finite
we can write it as
xi = ∑_{j=1}^{mi} xi,j ei j
Thus xi is a convex combination of the basis vectors (pure strategies), and the mixed-strategy simplex ∆i is the
convex hull of its vertices (pure strategies), generalizing the case of N = 2 players.
Definition 3.11. For a mixed-strategy xi ∈ ∆i , we define its support or carrier as the set of pure strategies that have
assigned positive probabilities,
sup(xi ) = { j ∈ Mi | xi, j > 0 }
The subset
int(∆i ) = {xi ∈ ∆i | xi, j > 0 ∀ j}
is called the interior of ∆i . Hence, mixed strategies in the interior xi ∈ int(∆i ) are called completely mixed or interior,
in that they assign positive probabilities to all player’s pure strategies, hence have full support, sup(xi ) = Mi .
Each ∆i is a convex, closed and bounded (hence compact) subset of the Rmi Euclidean space. Hence the overall
∆X = ∆1 × . . . ∆N is a convex, closed and bounded (hence compact) subset of the Rn Euclidean space, n = ∑Ni=1 mi .
Any mixed strategy N-tuple x = (x1 , . . . , xN ) can be viewed as a point x ∈ ∆X .
A game could be denoted by G (N , Ωi , Ji ) in its pure-strategy representation or when we refer to its mixed-strategy
extension G (N , ∆i , J i ). The overall strategy space is Ω = ×i∈N Ωi indicating a pure-strategy game, or ∆X =
×i∈N ∆i , indicating a mixed-strategy game.
Since the game is noncooperative, a reasonable assumption is that mixed strategies viewed as probability distribu-
tions are jointly independent. Then the probability of arriving at a pure strategy profile (N-tuple) u = (u1 , . . . , uN ) ∈
Ω, denoted by x(u) is given as
x(u) = ∏_{i=1}^{N} xi(ui)
where x is the N-tuple x = (x1 , . . . , xN ) ∈ ∆X . Thus the probability of using pure strategy profile u = (e1k1 , . . . , eNkN )
is
x(e1k1, . . . , eNkN) = ∏_{i=1}^{N} xi(eiki) = ∏_{i=1}^{N} xi,ki,   ki ∈ Mi
In terms of pure strategies, the cost for player i when the pure strategy N-tuple u is used is denoted by Ji(u) ∈ R. If mixed strategies xi, i ∈ N = {1, . . . , N}, are used according to the distribution x(u), then the cost is the expected (average) cost

J̄_i(x) = ∑_{u∈Ω} Ji(u) x(u) = ∑_{u1∈Ω1} · · · ∑_{uN∈ΩN} Ji(u1, . . . , uN) ∏_{l=1}^{N} xl(ul)    (3.4)
We often use the following notation: a strategy profile x = (x1 , . . . , xi , . . . , xN ) ∈ ∆X is written as x = (xi , x−i ), where
xi ∈ ∆i and x−i = (x1 , . . . , xi−1 , xi+1 , . . . , xN ) is the (N − 1)-tuple obtained from x without the i-th player component
i ∈ N , x−i ∈ ∆−i where ∆−i = ×l6=i ∆l . We denote by (ui , x−i ) the strategy profile where player i has replaced
xi ∈ ∆i by its pure strategy ui ∈ Ωi, while all other players use strategies according to x ∈ ∆X, i.e., (ui, x−i) = (x1, . . . , xi−1, ui, xi+1, . . . , xN).
We also write (wi , x−i ) ∈ ∆X for the strategy profile in which player i plays mixed strategy wi ∈ ∆i while all other
players use strategies according to x ∈ ∆X . This notation is particularly useful when a single player considers
“deviations" wi ∈ ∆i from a given profile x ∈ ∆X . Usually u, v denote pure strategies, while x, y, w, z mixed
strategies.
Then we formulate the definition of an NE point in mixed strategies (MSNE), or a mixed-strategy NE.
Definition 3.12. (Mixed-strategy NE). Given a noncooperative N-player finite game G, a mixed strategy N-tuple x* = (x*1, . . . , x*N) ∈ ∆X, x*i ∈ ∆i, i ∈ N, is an equilibrium point (or a mixed-strategy Nash equilibrium point) if

J̄_i(x*i, x*−i) ≤ J̄_i(xi, x*−i),  ∀xi ∈ ∆i, ∀i ∈ N    (3.6)

where x* = (x*i, x*−i) and x*−i denotes the tuple of all mixed strategies in x* except the i-th one, i ∈ N.
Since we can have more than a single NE, we denote the set of NE by NE(G), and x* ∈ NE(G) is an element of this set. Then, based on Definition 3.12,

NE(G) = { x* ∈ ∆X | J̄_i(x*i, x*−i) ≤ J̄_i(xi, x*−i), ∀xi ∈ ∆i, ∀i ∈ N }    (3.7)

Notationally, since Ω1 = {e11, . . . , e1k1, . . . , e1m1}, we can use (3.4) and replace u1 ∈ Ω1 by e1k1 for k1 ∈ M1
to write
J̄_i(x*1, . . . , x*N) = ∑_{u1∈Ω1} · · · ∑_{uN∈ΩN} Ji(u1, . . . , uN) ∏_{l=1}^{N} x*l(ul) = ∑_{k1∈M1} · · · ∑_{kN∈MN} Ji(e1k1, . . . , eNkN) ∏_{l=1}^{N} x*_{l,kl}
with the second form explicitly showing the finite matrix form. Using this form and identifying Ji(e1k1, . . . , eNkN) = A^i_{k1,...,kN} (the corresponding element in the cost matrix of player i), we see that (3.6) can be re-written as a set of N inequalities, the first of which reads: for all w1 = [x_{1,j}] ∈ ∆1, j ∈ M1,

∑_{k1∈M1} · · · ∑_{kN∈MN} A^1_{k1,...,kN} ∏_{l=1}^{N} x*_{l,kl}  ≤  ∑_{k1∈M1} · · · ∑_{kN∈MN} A^1_{k1,...,kN} ( ∏_{l=2}^{N} x*_{l,kl} ) x_{1,k1}    (3.8)

Note that for N = 2 this recovers Definition 3.7 for the bimatrix game, where x = (x, y) = (x1, x2), A = A1, B = A2.
We’ll prefer (3.6) as the more compact notation.
A key property of J i (wi , x−i ) is linearity in the wi argument, which leads to the following. Any wi ∈ ∆i can be
written as a convex combination of pure strategies, wi = ∑ j∈Mi ei j αi, j , where αi, j ≥ 0 and ∑ j∈Mi αi, j = 1. Since the
expected cost J i (wi , x−i ) is linear in wi , for any given x−i it follows that
J i (wi , x−i ) = J i (∑ j∈Mi ei j αi, j , x−i ) = ∑ j∈Mi J i (ei j , x−i ) αi, j (3.10)
In the case of two player zero-sum games we saw that it is possible to eliminate pure (strictly) dominated strategies.
Similarly, for N-player noncooperative games we could define partial ordering of a player’s (pure or mixed) strategy
set as given by the consequences on the outcome for that player. We shall define everything herein in terms of mixed
strategies since pure strategies are special cases of these, so we work on the mixed-strategy simplex ∆i of each player
i. A strategy weakly dominates another strategy if it never results in a worse outcome (higher loss) than the second
and sometimes results in lower loss. A strategy is undominated if there is no strategy that weakly dominates it. A
strategy strictly dominates another strategy if it always results in lower loss.
Definition 3.13. A strategy zi ∈ ∆i weakly dominates strategy xi ∈ ∆i if J i (zi , x−i ) ≤ J i (xi , x−i ), ∀x−i ∈ ∆−i , with
strict inequality for some x−i ∈ ∆−i . A strategy xi ∈ ∆i is undominated if no such strategy zi exists.
Definition 3.14. A strategy zi ∈ ∆i strictly dominates strategy xi ∈ ∆i if J i (zi , x−i ) < J i (xi , x−i ), ∀x−i ∈ ∆−i .
It can be possible that a pure strategy is dominated by a mixed strategy while not being dominated by any other pure
strategy. This is seen in the next example.
Example 3.15. Consider a two-player game in which player 1 has the cost matrix

A = [  2    0
       0    2
      3/2  3/2 ]

Thus when player 1 plays his third pure strategy x = e13 = [0 0 1]^T ∈ ∆1, his cost against any strategy y played by player 2 is

J̄_1(e13, y) = (e13)^T A y = (3/2)(y1 + y2) = 3/2

and e13 is not weakly dominated by either of the first two pure strategies (check this). However, take z = [1/2 1/2 0]^T ∈ ∆1 (hence randomizing over the first two pure strategies). Then

J̄_1(z, y) = y1 + y2 = 1 < 3/2

for every y ∈ ∆2, so the mixed strategy z strictly dominates the pure strategy e13.
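The claim is easy to verify numerically; the sketch below (illustrative only) checks that neither pure row dominates e13 but the 50/50 mix of the first two rows does.

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.5, 1.5]])                 # cost matrix of player 1 in Example 3.15

e13 = np.array([0.0, 0.0, 1.0])            # third pure strategy
z = np.array([0.5, 0.5, 0.0])              # mix of the first two pure strategies

for y in np.eye(2):                        # by linearity it suffices to check player 2's pure strategies
    print(e13 @ A @ y, z @ A @ y)          # 1.5 vs 1.0 in both cases: z gives strictly lower cost

# neither pure strategy 1 nor 2 dominates e13: each is worse against one column
print(A[0] <= A[2], A[1] <= A[2])          # [False  True] and [ True False]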
One of the basic assumptions in noncooperative game theory is that players are rational, so that they never use strictly dominated strategies. This is the reason why dominated pure strategies can be deleted without affecting the game outcome.
Let us now define the concept of best-reply or best-response (BR) correspondence in mixed strategies. Consider a
game G(N, ∆i, J̄_i). Given a mixed-strategy profile x−i ∈ ∆−i, a mixed-strategy best reply (response) for player i is a mixed strategy xi ∈ ∆i such that no other mixed strategy gives him lower cost (loss) against x−i. We denote this set, for any given x−i ∈ ∆−i, by

Φi(x−i) := { xi ∈ ∆i | J̄_i(xi, x−i) ≤ J̄_i(wi, x−i), ∀wi ∈ ∆i }    (3.11)

with Φi : ∆−i ⇒ ∆i, also called the optimal response set (or rational reaction set), [19]. For any x*i ∈ Φi(x−i), it follows that for any
given x−i ∈ ∆−i , player i has a cost of
J̄_i(x*i, x−i) = min_{xi∈∆i} J̄_i(xi, x−i)
The overall mixed-strategy best-response correspondence of the N players to a strategy profile x ∈ ∆X is

Φ(x) := Φ1(x−1) × · · · × ΦN(x−N)
Note that Φi (x−i ) ⊂ ∆i is a face of ∆i (a convex hull of some pure strategies (vertices) in ∆i ), and is always non-
empty, closed and convex. It can range from a singleton, when there is only one pure best reply for player i, up to
the whole simplex ∆i , hence Φ is a set-valued mapping or correspondence, denoted by Φ : ∆X ⇒ ∆X .
In this section we give a proof of existence of a mixed-strategy NE for any N-player matrix game. Consider an N-player finite game G(N, ∆i, J̄_i), and let NE(G) ⊂ ∆X denote the set of its NE points, as in (3.7).
Based on the best-reply correspondence Φi in (3.11), for any mixed-strategy Nash profile (equilibrium point) x* = (x*i, x*−i) ∈ NE(G) we can write

x*i ∈ Φi(x*−i),  ∀i ∈ N
In fact, x* ∈ ∆X is an NE solution if x*i solves its own optimization problem given that all other players take the equilibrium actions x*−i, i.e., x*i is player i's best response to all his opponents' actions, and the same is true for all players i ∈ N. Hence

x*i ∈ Φi(x*−i), ∀i ∈ N    ⟺    x* ∈ Φ(x*)    (3.12)

which says that, in terms of best replies, a mixed-strategy profile x* ∈ ∆X is an NE if it is a best reply to itself, or if it
is a fixed-point of the mixed-strategy best reply correspondence Φ, Φ : ∆X ⇒ ∆X . Note that this is not the fixed-point
of a function but of a multi-valued (set-valued) mapping, denoted as
x∗ ∈ Φ(x∗ )
This last interpretation is what leads to one of the proofs for existence of an NE in any N-player finite game, namely
the proof based on Kakutani’s fixed-point theorem given later on.
As a remark, recall the bimatrix case N = 2 and the Φ1 and Φ2 best-response mappings. In that case, following
Definition 3.7, we saw that (x∗ , y∗ ) is an NE if (x∗ , y∗ ) ∈ ( Φ1 (y∗ ), Φ2 (x∗ ) ). If we let x∗ = (x∗ , y∗ ) ∈ ∆1 × ∆2 , and
the corresponding tuple mapping Φ : ∆1 × ∆2 ⇒ ∆1 × ∆2 , defined by Φ(x∗ ) = ( Φ1 (y∗ ), Φ2 (x∗ ) ), we see that this
is equivalent to x∗ ∈ Φ(x∗ ), as in the above.
Definition 3.16. An NE x∗ ∈ ∆X is called a strict NE if each player’s component x∗i is the unique best reply to x∗
(singleton) hence if
Φ(x∗ ) = { x∗ }
We first give the proof based on Brower’s fixed-point Theorem, then the one based on Kakutani’s fixed-point Theo-
rem (see Appendix A).
Before proving it we give two useful results.
Lemma 3.17. Consider any mixed-strategy x = (xi , x−i ) ∈ ∆X , xi ∈ ∆i . Then for every player i, i ∈ N , there exists
a k ∈ sup(xi ) such that
J i (xi , x−i ) ≤ J i (eik , x−i )
Proof: Consider any xi ∈ ∆i, written as a convex combination of pure strategies, xi = ∑_{j∈Mi} xi,j ei j = ∑_{k∈sup(xi)} xi,k eik, where ∑_{k∈sup(xi)} xi,k = 1 and sup(xi) ≠ ∅, since for at least one eik we have xi,k > 0. We prove the statement by
contradiction. Assume that for all pure strategies k ∈ sup(xi), i.e., with xi,k > 0, we have J̄_i(xi, x−i) > J̄_i(eik, x−i). Hence, for all such strategies,

J̄_i(x) xi,k > J̄_i(eik, x−i) xi,k

Summing over k ∈ sup(xi), factoring J̄_i(x) on the LHS, and using on the RHS the linearity of J̄_i together with the representation of xi results in

J̄_i(x) = J̄_i(x) ∑_{k∈sup(xi)} xi,k > ∑_{k∈sup(xi)} J̄_i(eik, x−i) xi,k = J̄_i(xi, x−i) = J̄_i(x)

which is a contradiction.
Lemma 3.18. Let x∗ = (x∗i , x∗−i ) ∈ ∆X . Then x∗ ∈ NE (G ), i.e., is a mixed-strategy Nash equilibrium (NE) if and
only if for every player i ∈ N ,
J i (x∗i , x∗−i ) ≤ J i (ei j , x∗−i ), ∀ j ∈ Mi (3.13)
Theorem 3.19. (Nash Equilibrium Theorem) Every N-player finite (matrix) game G has at least one mixed-strategy equilibrium point, called a mixed-strategy Nash equilibrium (NE).
Proof (based on Brouwer's fixed-point theorem): For each player i ∈ N and each j ∈ Mi define

Ci,j(x) := max{ 0, J̄_i(x) − J̄_i(ei j, x−i) }

and we see that Ci,j is a (single-valued) function that measures the cost improvement available to player i when, in the mixed strategy N-tuple (profile) x, player i replaces his strategy xi by the pure strategy ui = ei j. We also see that Ci,j(x) ≥ 0. Now for each i ∈ N, j ∈ Mi, consider

ηi,j(x) := ( xi(ei j) + Ci,j(x) ) / ( 1 + ∑_{l∈Mi} Ci,l(x) )    (3.14)

where xi(ei j) = xi,j. Each ηi,j(x) is non-negative as well. Note that since xi ∈ ∆i, we have

∑_{j∈Mi} xi(ei j) = ∑_{j∈Mi} xi,j = 1

Thus, for fixed x and i ∈ N, ηi,j(x) can be taken as a probability attached to the strategy ei j, and ηi(x) ∈ ∆i itself. Moreover, since J̄_i(x) is polynomial in the xi,j, each Ci,j is a continuous function of x. Since 1 + ∑_{j∈Mi} Ci,j(x) ≠ 0, it follows that ηi is itself a continuous function of x. Thus, gathering all components into an N-tuple (vector), η = [η1, . . . , ηN]^T is a continuous vector-valued function of x that maps the convex, closed and bounded set ∆X into itself, η : ∆X → ∆X. Then by Brouwer's Fixed-point Theorem it follows that η has a fixed point, i.e., some x̃ ∈ ∆X such that

x̃ = η(x̃)
The theorem is proved if we can show that every such fixed point is necessarily an NE and vice-versa. Since x̃ = (x̃1, . . . , x̃N) is a fixed point of η, by (3.14) it follows that for all i ∈ N and for all l ∈ Mi

x̃i(eil) = ( x̃i(eil) + Ci,l(x̃) ) / ( 1 + ∑_{j∈Mi} Ci,j(x̃) )    (3.15)
For this x̃ ∈ ∆X, by Lemma 3.17 and the definition of Ci,j it follows that for any player i ∈ N there exists a k ∈ Mi, hence a pure strategy eik with x̃i,k := x̃i(eik) > 0, such that

J̄_i(x̃) − J̄_i(eik, x̃−i) ≤ 0

hence Ci,k(x̃) = 0. For this particular strategy (3.15) becomes

x̃i(eik) = x̃i(eik) / ( 1 + ∑_{j∈Mi} Ci,j(x̃) )

so that x̃i(eik) ∑_{j∈Mi} Ci,j(x̃) = 0. Since x̃i(eik) > 0, this implies ∑_{j∈Mi} Ci,j(x̃) = 0. Since all Ci,j ≥ 0, it follows that

Ci,j(x̃) = 0,  ∀i ∈ N, ∀j ∈ Mi

and from the definition of Ci,j this means that for each player i ∈ N we have

J̄_i(x̃) ≤ J̄_i(ei j, x̃−i),  ∀i ∈ N, ∀j ∈ Mi

By Lemma 3.18, x̃ is an NE point and the proof is complete.
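For a bimatrix game, the map η constructed in the proof is easy to write down. The sketch below is illustrative only (iterating η is not a convergent algorithm in general, but its fixed points are exactly the NE); it checks that the mixed NE of Matching Pennies is a fixed point of η.

import numpy as np

def nash_map(x, y, A, B):
    """One application of the map eta from the proof, specialized to a bimatrix game (costs A, B)."""
    J1, J2 = x @ A @ y, x @ B @ y
    C1 = np.maximum(0.0, J1 - A @ y)       # C_{1,j}(x) = max(0, J1(x,y) - J1(e_{1j}, y))
    C2 = np.maximum(0.0, J2 - B.T @ x)     # C_{2,k}(x) = max(0, J2(x,y) - J2(x, e_{2k}))
    return (x + C1) / (1.0 + C1.sum()), (y + C2) / (1.0 + C2.sum())

A = np.array([[-1.0, 1.0], [1.0, -1.0]])   # Matching Pennies, B = -A
B = -A
x = y = np.array([0.5, 0.5])               # the mixed NE should satisfy eta(x*, y*) = (x*, y*)
print(nash_map(x, y, A, B))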
An alternative proof is based on Kakutani’s fixed-point theorem (see Appendix A). It is this variant that John Nash
used in his description of Nash equilibrium for N-player games, work that would later earn him a Nobel Prize in
Economics. Kakutani's fixed-point theorem is a generalization of Brouwer's fixed-point theorem to set-valued maps
or correspondences, requiring upper semi-continuity, or to have a closed graph. Unlike the previous proof where
we had to construct a single-valued function, here we can simply use the best-response correspondence Φi and Φ.
Proof based on Kakutani's Fixed-point Theorem
Recall that Φi(x−i) ⊂ ∆i is the set of best replies of player i to x−i; since there may be a number of responses which are equally good, Φi(x−i) is set-valued rather than single-valued. The Nash equilibria of the game are the fixed points of the mapping (correspondence) Φ, Φ : ∆X ⇒ ∆X. Note that ∆X is a non-empty, compact and convex set. The best-response correspondence (mapping) Φi is upper semi-continuous (has a closed graph); this follows from the continuity of the cost function J̄_i via Berge's maximum theorem (see Appendix A). Alternatively, we can show directly that Φ : ∆X ⇒ ∆X has a closed graph.
For every x ∈ ∆X , the image Φ(x) ⊂ ∆X is a non-empty and convex set. Hence, by Kakutani’s fixed-point theorem,
Theorem A.9, Appendix A, Φ has at least one fixed-point in ∆X , i.e., there exists x∗ ∈ ∆X such that
x∗ ∈ Φ(x∗ )
Both proofs are non-constructive, mainly because of the fixed-point argument used. In general, for an N-player matrix game a relatively simple method can be used to compute a mixed NE if it is interior (completely mixed). Otherwise, even for bimatrix games the computation is more difficult, and methods such as the Lemke–Howson algorithm or nonlinear programming can be used [19].
The next result gives a characterization of mixed-strategy NE that will be useful in the evolutionary game setup.
Formally this is known as the Support Characterization theorem that relates an NE as a mixed-strategy best reply to
a pure-strategy best reply.
Theorem 3.20. (Support Characterization Theorem) Let x* = (x*i, x*−i) ∈ ∆X. Then x* ∈ NE(G), i.e., x* is a mixed-strategy Nash equilibrium (NE), if and only if for every player i ∈ N,

J̄_i(ei j, x*−i) ≤ J̄_i(eik, x*−i),  ∀j ∈ sup(x*i), ∀k ∈ Mi    (3.16)

i.e., every pure strategy in the support of x*i is a (pure) best reply to x*−i.
Remark 3.21. The result states that for a player i, every action (pure strategy) in the support of a Nash equilibrium x∗
is a best-response to x∗−i , hence pure strategies in the support have the same cost. The intuition is that if the strategies
in the support have different costs, then it would be better to just take the pure strategy with the lowest expected cost
and this would contradict the assumption that x∗ is a Nash equilibrium. Using the same argument, it follows that the
pure strategies which are not in the support must have higher (or equal) expected costs.
Proof: (Sufficiency) Consider x* = (x*i, x*−i) such that (3.16) holds for all i ∈ N, which is equivalent to saying that every j ∈ sup(x*i) satisfies J̄_i(ei j, x*−i) = min_{k∈Mi} J̄_i(eik, x*−i). Since x*i ∈ ∆i, we can write it as a convex combination of pure strategies, x*i = ∑_{j∈Mi} ei j x*i,j = ∑_{j∈sup(x*i)} ei j x*i,j, where x*i,j ≥ 0 and ∑_{j∈sup(x*i)} x*i,j = 1. Thus, based on the linearity of J̄_i in x*i,

J̄_i(x*i, x*−i) = ∑_{j∈sup(x*i)} J̄_i(ei j, x*−i) x*i,j = min_{k∈Mi} J̄_i(eik, x*−i) ≤ J̄_i(wi, x*−i),  ∀wi ∈ ∆i

where the last inequality holds since any mixed strategy is a convex combination of pure strategies; hence x*i ∈ Φi(x*−i) by (3.11), and this holds for all i ∈ N. Then by (3.12), x* is an NE.
(Necessity) Consider x∗ = (x∗i , x∗−i ) ∈ NE (G ) an NE and let Ji∗ = J i (x∗i , x∗−i ) = minwi ∈∆i J i (wi , x∗−i ). Then by Lemma
3.18
Ji∗ = J i (x∗i , x∗−i ) ≤ J i (ei j , x∗−i ), ∀ j ∈ Mi (3.18)
We show that all strategies in the support of x*i have the same cost, equal to Ji*. Assume by contradiction that there exists k0 ∈ sup(x*i) such that

J̄_i(eik0, x*−i) ≠ Ji*

which, by (3.18), means that

J̄_i(eik0, x*−i) > Ji*
Then
Ji* = J̄_i(x*i, x*−i) = ∑_{j∈sup(x*i)} J̄_i(ei j, x*−i) x*i,j = ∑_{j∈sup(x*i), j≠k0} J̄_i(ei j, x*−i) x*i,j + J̄_i(eik0, x*−i) x*i,k0 > Ji*
where we used the fact that all j pure actions have equal cost except the k0 one for which strict inequality holds. The
above is a contradiction, hence our assumption is false.
As a note, such an NE is obtained by randomizing over pure actions with equal expected cost. We shall use this
characterization later in the evolutionary game setup.
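As a quick numerical illustration, the following Python sketch checks the support characterization on a hypothetical 2×2 bimatrix cost game (a matching-pennies-like zero-sum game, with cost matrices and the completely mixed equilibrium chosen purely for illustration): every pure strategy in the support attains the minimum expected cost against the opponent's mixed strategy.

    import numpy as np

    # Hypothetical 2x2 bimatrix COST game; A is player 1's cost matrix,
    # B is player 2's cost matrix (zero-sum here for simplicity).
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    B = -A

    # Candidate completely mixed NE: both players randomize uniformly.
    x1 = np.array([0.5, 0.5])
    x2 = np.array([0.5, 0.5])

    # Expected cost of each pure strategy against the opponent's mixed strategy.
    costs1 = A @ x2          # J1(e_1j, x2) for j = 1, 2
    costs2 = x1 @ B          # J2(x1, e_2k) for k = 1, 2

    # Support characterization: every pure strategy in the support achieves the
    # minimum expected cost (here the support is all of M_i).
    assert np.allclose(costs1, costs1.min())
    assert np.allclose(costs2, costs2.min())
    print(costs1, costs2)    # all entries equal, so x* is a mixed NE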
Given a mixed strategy profile x ∈ ∆X or x−i ∈ ∆−i , a pure-strategy best reply (response) of player
i is a pure strategy vi ∈ Ωi such that no other pure strategy gives a lower cost (loss) against x−i . We denote the set of all such pure best replies
by Φip : ∆−i ⇒ Ωi , where the superscript p indicates that we look for a pure response to a (given) mixed-strategy
profile. Then Φip is defined, for any given x−i ∈ ∆−i , by
Φip (x−i ) := arg min ui ∈Ωi J i (ui , x−i ),
or, to indicate that the j-th pure strategy is a pure best reply,
j ∈ Φip (x−i ) if J i (ei j , x−i ) = min k∈Mi J i (eik , x−i ),
for any given x−i ∈ ∆−i . This Φip (x−i ) ⊂ Ωi is called the optimal pure-reaction set of player i ∈ N , i.e., the set of
all optimal pure responses by player i to any fixed x−i ∈ ∆−i . Then for j∗ ∈ Φip (x−i ), it follows that for any given
x−i ∈ ∆−i , player i has a cost equal to J i (ei j∗ , x−i ) = mink∈Mi J i (eik , x−i ).
Now since every mixed strategy is a convex combination of pure strategies and J i (wi , x−i ) is linear in wi , it follows
that no mixed strategy wi ∈ ∆i can give a lower cost against x−i than any one of its pure best replies to x−i . Formally, for any x−i ∈ ∆−i , wi ∈ ∆i and j ∈ Φip (x−i ),
J i (wi , x−i ) ≥ J i (ei j , x−i ), (3.20)
since
J i (wi , x−i ) = ∑ k∈Mi J i (eik , x−i ) wi,k ≥ ∑ k∈Mi J i (ei j , x−i ) wi,k = J i (ei j , x−i ) ∑ k∈Mi wi,k = J i (ei j , x−i ). (3.21)
3.7 *Nash Equilibria Refinements
In this section we present some supplementary material on NE refinements. In order to address certain weaknesses
of the Nash equilibrium concept when several NEs exist, many NE refinements have been proposed since the late 1970s.
We shall review only a few of them.
Let NE (G ) ⊂ ∆X denote the set of NE points of an N-player finite game G (N , ∆i , J i ).
This refinement is due to Selten, [134], and is best known as “trembling hand perfection”. Intuitively, this means that
NEs that are not robust to “trembles” in players’ strategies are discarded.
Consider a game G (N , ∆i , J i ) and let µ be an error function that assigns a number µi,k ∈ (0, 1) to each
player i ∈ N and pure strategy k ∈ Mi , defining the probability that pure strategy k will be played by
mistake (trembling hand), with ∑k∈Mi µi,k < 1. Note that such a small probability > 0 is assigned to every pure strategy.
This means that for each player i ∈ N the error function µ defines a subset of mixed strategies
∆i ( µ ) = {xi ∈ ∆i | xi,k ≥ µi,k , ∀ k ∈ Mi },
and correspondingly ∆X ( µ ) = ∆1 ( µ ) × · · · × ∆N ( µ ); the perturbed game G ( µ ) is the game in which each player i is restricted to ∆i ( µ ).
Definition 3.22. An NE, x∗ ∈ NE (G ) is a perfect equilibrium if, for some sequence {G ( µt )}µt →0 of perturbed
games, there exist profiles xt ∈ NE (G ( µt )) such that xt → x∗ .
Let us call PE (G ) the set of perfect equilibria. Note that every interior (completely mixed) or inner NE is perfect.
This is seen because if x∗ ∈ int(∆X ), then for sufficiently small µi,k , x∗ ∈ int(∆X ( µ )). If in addition x∗ ∈ NE (G )
then x∗ ∈ NE (G ( µ )).
Note that the definition above requires only that the NE be robust with respect to some trembles (error function), so
existence can be established even if there are no interior NE.
Proposition 3.23. For every finite game G , the set of perfect equilibria is non-empty, PE (G ) ≠ ∅.
Proof:
For any sequence {G ( µt )}µt →0 let xt ∈ NE (G ( µt )) for each t. Since {xt }, t = 1, . . . , ∞ is a sequence in a compact
set ∆X it has a convergent subsequence, {xts }, s = 1, . . . , ∞ with limit x∗ ∈ ∆X . For each s, G ( µts ) is the associated
perturbed game. Then since xts ∈ NE (G ( µts )) it follows that x∗ ∈ NE (G ) by continuity arguments, and x∗ is perfect
since xts → x∗ and xts ∈ NE (G ( µts )) for all s.
We give without proof the following useful result (for a proof see [150]).
Proposition 3.24. Every perfect equilibrium x ∈ PE (G ) is undominated, i.e., every player’s strategy component xi
is undominated, i ∈ N . Conversely, in a two-player game (N = {1, 2}), if an NE x ∈ NE (G ) is undominated, then it is perfect, i.e.,
x ∈ PE (G ).
As the most stringent condition on the “trembles”, the concept of strictly perfect equilibrium requires robustness with
respect to every low-probability tremble.
Definition 3.25. An NE, x∗ ∈ NE (G ), is a strictly perfect equilibrium if, for every sequence {G ( µt )}µt →0 of perturbed games, there exist profiles xt ∈ NE (G ( µt )) such that xt → x∗ .
Note that any interior (inner) NE x∗ ∈ NE (G ) is strictly perfect, since one could take xt = x∗ for all t
sufficiently large such that x∗ ∈ NE (G ( µt )).
In fact a perfect NE that is not strictly perfect is vulnerable to some sequence of trembles.
An intermediate refinement is the one introduced by Myerson, [100], which imposes some conditions on the “trembles”
to which the NE should be robust (not just some unqualified trembles). Specifically, an NE should be
robust with respect to those trembles under which more costly “errors” are less probable. This amounts to requiring
the perturbed (“trembled”) strategy to be “proper” in some sense, formalized as follows. Given some ε > 0, a strategy
profile x ∈ int(∆X ) is called ε-proper if, for every player i ∈ N and all pure strategies j, k ∈ Mi ,
J i (ei j , x−i ) > J i (eik , x−i ) implies xi, j ≤ ε xi,k .
An NE x∗ ∈ NE (G ) is called proper if there exist sequences εt → 0 and xt ∈ int(∆X ), with each xt being εt -proper, such that xt → x∗ .
Note that any interior (inner) NE x ∈ NE (G ) is ε-proper for any ε > 0, since for each player i all pure
strategies give the same (minimal) cost, or (maximal) payoff, against x. It is easy to show that any interior (inner)
NE x∗ ∈ NE (G ) is proper - just consider xt (εt ) = x∗ for all t. Also, it can be shown that proper NE
always exist and that every proper NE is perfect.
Another refinement that we only mention is that of an essential equilibrium, which is defined as being robust to
perturbations in players’ “payoffs” or cost functions.
The concept of a strictly perfect NE can be extended to robustness of NE sets, called strategically stable sets.
Definition 3.26. A set of NEs, X ∗ ⊂ NE (G ), is strategically stable if it is the smallest non-empty, closed set such
that for every ε > 0 there exists some δ > 0 such that every strategy-perturbed game G ( µ ) = (N , ∆X ( µ ), J i ) with
errors µi,k < δ has some NE within distance ε from the set X ∗ .
3.8 Notes
In this chapter we reviewed basic concepts and results for N-player matrix games, i.e., games with finite action sets.
In the next chapter we consider games with infinite (continuous) action sets.
Chapter 4
Continuous-Kernel Games
Chapter Summary
This chapter focuses on noncooperative (Nash) games with continuous kernel, i.e., with continuous action spaces
and cost functions. Basic concepts and results are reviewed, mostly adapted from [19].
4.1 Introduction
In this chapter we consider games with continuous (infinite) action spaces and with players’ cost functions being
continuous in the actions. These types of games are called continuous-kernel games. Throughout most of the chapter
we consider pure strategies (actions), and only briefly discuss mixed strategies. We introduce basic NE concepts and NE existence results for the case in which no coupling exists between action spaces. The case of games
with coupled constraints is deferred to the next chapter. A review of associated optimization results is presented in
Appendix B.
4.2 Game Formulation
Let us consider a game with N players where the action sets are continuous (infinite). For each player i ∈ I, instead
of a finite action set Ωi = {ei1 , . . . , eimi }, consider a compact, convex set Ωi ⊂ Rmi , so a continuous (infinite) action
set. Let ui ∈ Ωi denote its pure strategy (action). The overall action space Ω is the Cartesian product of the Ωi ,
Ω = Ω1 × · · · × ΩN . (4.1)
We also let Ω−i := Ω1 × · · · × Ωi−1 × Ωi+1 × · · · × ΩN . It can be seen that Ω is compact, convex, and has a nonempty
interior set. Moreover, any two players i and j can take their actions independently from separate action sets Ωi
and Ω j , j 6= i, respectively. For simplicity in this chapter we consider mi = 1 and let Ωi = [u0 , umax ], where u0 > 0
and umax > u0 are scalars. Results will hold without modification for Ωi ⊂ Rmi . The action space Ω is separable
by construction since there is no coupling among Ωi , and such a game is called a game with uncoupled constraints.
As before an action vector u ∈ Ω can be written as the N-tuple u = [u1 , . . . , uN ]T , or u = (ui , u−i ) with u−i ∈ Ω−i
obtained by deleting the ith element from u.
Let the individual cost function of player i ∈ I be denoted by Ji : Ω → R. A standard assumption is that Ji is jointly
continuous in u. Each player aims to minimize its own cost function Ji (ui , u−i ), in the presence of all other players.
We denote such an N-player game by G (I, Ωi , Ji ).
We shall discuss the Nash equilibrium concept in the context of such games. We shall see that existence of a Nash
equilibrium in pure strategies can be established under relatively mild conditions.
Before we discuss the NE concept we shall briefly remark on the mixed-strategy extension for games with continuous
action spaces.
4.3 Extension to Mixed-Strategies
Similarly to the concept of a mixed strategy for matrix games, mixed strategies can be defined for continuous (or infinite)
games by employing an appropriate distribution function. For simplicity we only consider the N = 2 case. Let the
two players’ pure strategies be denoted u1 ∈ Ω1 , u2 ∈ Ω2 , where this time Ω1 , Ω2 are infinite sets, such as for
example the interval [u0 , umax ]. A mixed-strategy for player 1 is a cumulative distribution function defined on Ω1 ,
denoted by σ1 ∈ ∆1 (Ω1 ), describing the probability distribution over his space of pure strategies. For every u1 ∈ Ω1 ,
σ1 (u1 ) = Prob ( χ ≤ u1 ),
where χ denotes player 1’s randomly selected pure action and Prob denotes probability. Every cumulative distribution function has the following properties: (i) σ1 (u1 ) ≥ 0,
∀u1 ∈ Ω1 ; (ii) σ1 (u0 ) = 0, σ1 (umax ) = 1; (iii) σ1 (u1 ) is non-decreasing; (iv) σ1 (u1 ) is right-continuous on the open
interval (u0 , umax ). The mixed strategy of player 2, σ2 (u2 ), has similar properties. The expected (average) cost for player
1 when player 1 chooses pure strategy u1 and player 2 chooses mixed strategy σ2 (u2 ) is
J 1 (u1 , σ2 ) = E [J1 (u1 , ·)] = ∫Ω2 J1 (u1 , u2 ) d σ2 (u2 ).
The expected (average) cost for player 1 when player 1 and player 2 use mixed strategies σ1 (u1 ) and σ2 (u2 ),
respectively, is
J 1 ( σ1 , σ2 ) = ∫Ω1 ∫Ω2 J1 (u1 , u2 ) d σ1 (u1 ) d σ2 (u2 ),
where the joint probability distribution is taken as the product of σ1 and σ2 , i.e., the players randomize independently.
4.4 Nash Equilibria and Best-Response Correspondence
Let us return to the basic case of games with continuous action spaces and pure strategies, G (I, Ωi , Ji ), as these will
be the focus of this chapter. The formal definition of a Nash equilibrium (NE) is given next.
Definition 4.1 (Nash equilibrium (NE)). An N-tuple u∗ = (u∗i , u∗−i ) ∈ Ω is called a Nash equilibrium (NE) solution
of G (I, Ωi , Ji ) if
Ji (u∗i , u∗−i ) ≤ Ji (ui , u∗−i ), ∀ ui ∈ Ωi , ∀i ∈ I (4.2)
If in addition u∗ is not on the boundary of the action space Ω (is in the interior of Ω), then it is called an inner NE
solution.
As before at an NE no player can benefit by altering its action unilaterally, i.e., an NE is a so-called no-regret
solution. For each player i ∈ I, we can define its best-response correspondence Ri : Ω−i ⇒ Ωi , such that for any
given u−i
Ri (u−i ) = {ξi ∈ Ωi | Ji (ξi , u−i ) ≤ Ji (ui , u−i ), ∀ ui ∈ Ωi } (4.3)
This is called the optimal (best-response) reaction set of player i ∈ I, i.e., the set of its best-responses to any given
action u−i of the other players. Given the actions of the other players, u−i ∈ Ω−i , each player i independently
minimizes its own cost
min Ji (ui , u−i )
ui ∈Ωi
Then an equivalent form of definition (4.2) is that u∗ ∈ Ω is an NE solution if for all i ∈ I, u∗i is a solution to the
foregoing minimization, i.e.,
u∗i ∈ Ri (u∗−i ), ∀i ∈ I
or an NE is in the intersection of all best-response sets. In graphical terms, the existence of a Nash equilibrium (NE)
is equivalent to requiring that the graphs of all Ri (u−i ) have at least one intersection point. The set of all
such intersections in Ω forms the set of all NE, denoted by NE(G ). Alternatively, u∗ is an NE if it is a fixed-point
of the overall best-response correspondence R,
u∗ ∈ R ( u∗ )
where R is the N-tuple R = (R1 , . . . , RN ). An NE is a strategy profile that constitutes a best reply to itself. When
u∗ is the unique best reply to itself we say that u∗ is a strict NE. In that case,
Ji (u∗i , u∗−i ) < Ji (ui , u∗−i ), ∀ ui ∈ Ωi , ui ≠ u∗i , ∀ i ∈ I.
4.5 Existence of Pure-Strategy NE
Consider a continuous-kernel game G (I, Ωi , Ji ). An important question that needs to be addressed is whether a game admits a (possibly unique) Nash equilibrium (NE). We shall see that existence of a Nash equilibrium in pure strategies
can be established under relatively mild conditions, beyond continuity.
Theorem 4.2 (Debreu-Fan-Glicksberg NE Theorem). Consider a game G (I, Ωi , Ji ) where Ωi are non-empty, com-
pact, convex subsets of Rmi , for every i ∈ I. Let u = (ui , u−i ) ∈ Ω. If for every i ∈ I the cost function Ji : Ω → R is
jointly continuous in u and convex in ui , then the game admits at least one NE solution in pure-strategies.
Proof: The proof follows by applying Kakutani’s fixed-point theorem to the best-response correspondence R, where
R is the N-tuple R = (R1 , . . . , Ri , . . . , RN ), and Ri : Ω−i ⇒ Ωi . The proof follows along the same lines as in the
matrix game case, where the mixed-strategy best-response Φi was used. In this continuous-kernel game case the key
part is to show that the best-response set Ri (u−i ) is a convex set, based on the convexity assumption of Ji .
Next we consider the slightly stronger assumption that Ji is strictly convex in ui . This ensures that the best-
response set Ri (u−i ) is a singleton, hence that Ri is a function, and we can use Brouwer’s fixed-point theorem to
prove existence of an NE.
Definition 4.3 (Reaction function). If for every given u−i ∈ Ω−i the best-response set Ri (u−i ) (4.3) is a singleton
(a set with one element), then Ri is called the best-response (reaction) function of player i, Ri : Ω−i → Ωi .
Theorem 4.4. (Theorem 4.3 in [19]) Consider a game G (I, Ωi , Ji ) where Ωi are non-empty, compact, convex subsets
of Rmi , for every i ∈ I. Let u = (ui , u−i ) ∈ Ω. If for every i ∈ I the cost function Ji : Ω → R is jointly continuous in
u and strictly convex in ui , then the game admits at least one NE solution in pure-strategies.
Proof:
By assumption, for every i ∈ I, Ji is strictly convex in ui for any given u−i , so there is a unique minimizer ξi ∈ Ωi
such that
Ji (ξi , u−i ) < Ji (ui , u−i ), ∀ ui ∈ Ωi , ui ≠ ξi ,
and the best-response set Ri (u−i ) = {ξi } is a singleton, so by Definition 4.3 Ri is a best-response (reaction) function
Ri : Ω−i → Ωi . The overall best-response is a function R : Ω → Ω, the N-tuple R = (R1 , . . . , Ri , . . . , RN ), where
Ri is defined as in (4.3). An NE (if it exists) is a fixed-point of the best-response vector-valued function R : Ω → Ω,
i.e.
u∗ = R ( u∗ ) (4.4)
The proof follows by applying Brouwer’s fixed-point theorem to R. By assumption, as the Cartesian product of the Ωi ,
the set Ω is a non-empty, compact, convex subset of a Euclidean space. The best-response function R maps Ω into
itself, so we only need to show continuity of R. Equivalently, we need to show that for every i ∈ I, Ri : Ω−i → Ωi
is continuous in its argument.
We prove continuity of Ri : Ω−i → Ωi by contradiction. Thus assume that there exists a sequence {u−i^(n) } such that
limn→∞ u−i^(n) = u−i , and for each n let ξi^(n) = Ri (u−i^(n) ), with limn→∞ ξi^(n) = ξi , but such that ξi ≠ Ri (u−i ), i.e.,
limn→∞ ξi^(n) = limn→∞ Ri (u−i^(n) ) ≠ Ri ( limn→∞ u−i^(n) ) = Ri (u−i ).
Since by the contradiction assumption ξi ≠ Ri (u−i ), from (4.3) it follows that there exists u′i ∈ Ωi such that
Ji (u′i , u−i ) < Ji (ξi , u−i ).
By joint continuity of Ji , this implies that there exists some finite K such that for every n ≥ K,
Ji (ξi^(n) , u−i^(n) ) > Ji (u′i , u−i^(n) ).
But this contradicts the fact that ξi^(n) = Ri (u−i^(n) ) minimizes Ji ( · , u−i^(n) ) over Ωi , so our assumption is false. Thus Ri is continuous, for every i ∈ I, hence R is continuous. On the other hand, R maps the compact and convex set Ω into itself. From Brouwer’s Fixed-point Theorem
(Theorem A.8), there exists a u∗ such that u∗ = R (u∗ ), hence an NE solution of G (I, Ωi , Ji ).
Note that both Theorem 4.2 and Theorem 4.4 give sufficient conditions for existence of an NE. It is possible for a
game that does not satisfy the conditions of either theorem to still have an NE.
Under assumptions of smoothness of the cost functions, some further characterization of an NE can be made.
Assumption 4.5. Ji (ui , u−i ) is C1 (continuously differentiable) and (strictly) convex in ui for every u−i ∈ Ω−i .
A Nash equilibrium u∗ ∈ Ω can be characterized as a solution to a variational inequality (VI) [52], i.e., such that
(v − u∗ )T ∇J(u∗ ) ≥ 0, ∀v ∈ Ω. (4.7)
We say that u∗ ∈ Ω is a solution to the variational inequality (VI) for ∇J denoted by V I (Ω, ∇J). Note that the left-
hand side of (4.7) represents an inner-product, hence the search is for points (vectors) such that the angle between
the two vectors in (4.7) is acute or at most a right-angle, for all v ∈ Ω. It can be shown that this is equivalent to
u∗ = TΩ [u∗ − α ∇J(u∗ )], for any α > 0,
where TΩ : RN → Ω denotes the Euclidean projection, TΩ (v) = arg minu∈Ω ‖v − u‖. Sometimes the projection
TΩ (·) is denoted as [·]+ (see Appendix B). This geometric characterization means that for an interior (inner) NE
u∗ ∈ int (Ω) the necessary condition is
∇J(u∗ ) = 0.
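A small illustrative Python sketch of this projection fixed-point characterization follows; the two-player quadratic cost functions, the box action set, the step size and the starting point below are all hypothetical values chosen only for illustration.

    import numpy as np

    # Hypothetical two-player quadratic game on the box [0, 8] x [0, 8]:
    # J1 = (u1 - 2)^2 + 0.5*u1*u2,  J2 = (u2 - 3)^2 + 0.5*u1*u2 (assumed costs).
    # Pseudo-gradient F(u) = [dJ1/du1, dJ2/du2].
    def F(u):
        return np.array([2*(u[0] - 2) + 0.5*u[1], 2*(u[1] - 3) + 0.5*u[0]])

    def proj(v, lo=0.0, hi=8.0):
        # Euclidean projection T_Omega onto the box [lo, hi]^2
        return np.clip(v, lo, hi)

    alpha = 0.2
    u = np.array([5.0, 5.0])          # arbitrary starting point
    for _ in range(500):              # iterate u <- T_Omega[u - alpha F(u)]
        u = proj(u - alpha * F(u))

    # At a fixed point u* = T_Omega[u* - alpha F(u*)]; for an inner NE, F(u*) = 0.
    print(u, F(u))                    # approx. (4/3, 8/3), pseudo-gradient ~ 0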
For an interior (inner) NE solution, the following result is immediate from Definition 4.1 and Proposition B.1.
Proposition 4.6. Consider a game G (I, Ωi , Ji ) under Assumption 4.5. Let u∗ be an inner NE solution of G (I, Ωi , Ji ).
Then u = u∗ satisfies the following set of necessary conditions:
∂ Ji /∂ ui (ui , u−i ) = 0, ∀ i ∈ I. (4.8)
Let us consider an optimization problem associated with the cost functions Ji , i.e.,
min ∑i∈I Ji (u)
subject to u ∈ Ω.
In general an NE solution is not an optimal solution of this associated optimization problem. Let us look at a
two-player Nash game for illustration.
Example 4.7. Consider a two-player Nash game with the cost functions
J1 (u1 , u2 ) = 2(u1 )2 − 2u1 − u1 u2 , J2 (u1 , u2 ) = (u2 )2 − (1/2) u2 − u1 u2 ,
and where ui ∈ Ωi = [0, 8], i = 1, 2. The constraint set Ω = Ω1 × Ω2 is convex and compact and has a nonempty
interior set. From the cost functions, it follows that ∂ 2 J1 /∂ (u1 )2 = 4 and ∂ 2 J2 /∂ (u2 )2 = 2. Thus Assumption 4.5 is satisfied
and the reaction functions Ri , i = 1, 2, exist and are continuous. By Definition 4.3, the reaction function R1 (u2 )
can be obtained by optimizing J1 (u1 , u2 ) with respect to u1 for every given u2 . It follows that R1 (u2 ) = (1/4)(u2 + 2).
Similarly, the reaction function R2 (u1 ) is obtained as R2 (u1 ) = (1/2)(u1 + 1/2). The reaction curves are shown in Fig. 4.1.
Figure 4.1: Reaction curves R1 (u2 ) and R2 (u1 ).
By Definition 4.1, an NE solution lies on both reaction curves. Therefore the intersection point of R1 (u2 )
and R2 (u1 ), u∗ = (u∗1 , u∗2 ) = (9/14, 4/7), is an NE solution. Fig. 4.1 indicates that this is in fact the unique NE solution.
The corresponding optimal cost values are J1∗ = J1 (u∗ ) = −81/98 and J2∗ = J2 (u∗ ) = −16/49. Next consider the associated
optimization problem
min J0 (u1 , u2 )
subject to 0 ≤ ui ≤ 8, i = 1, 2,
where J0 (u1 , u2 ) = J1 (u1 , u2 ) + J2 (u1 , u2 ) = 2(u1 )2 + (u2 )2 − 2u1 − (1/2) u2 − 2u1 u2 is the social cost. The constraint
set Ω is still convex and compact. The Hessian of J0 is positive definite for every u ∈ Ω, with
∇2 J0 = [ 4 −2 ; −2 2 ],
so the cost function J0 is strictly convex over Ω. From Proposition B.2 the associated optimization problem
admits a unique global minimum. By using Proposition B.3, the optimal solution is uopt = (5/4, 3/2), with optimal cost value J0 (uopt ) = −13/8, which differs from the NE solution u∗ and from the total NE cost J1∗ + J2∗ = −113/98.
Consider a game G (I, Ωi , Ji ). From the individual cost functions Ji (ui , u−i ) let us now define an augmented, system-like, two-argument cost function J̃ : Ω × Ω → R, [19],
J̃(v; u) := ∑i∈I Ji (vi , u−i ), (4.9)
which we shall call the Nash game (NG) cost function of the game. A useful characterization of an NE and necessary
conditions for an inner NE solution can be formulated with respect to the NG cost function. Moreover, in the next
chapter we show how this NG cost function can be used to solve Nash games with coupled constraints.
Definition 4.8. A vector u∗ ∈ Ω is an NE solution of G (I, Ωi , Ji ) if
J̃(u∗ ; u∗ ) ≤ J̃(v; u∗ ), ∀ v ∈ Ω. (4.10)
Remark 4.9. The two definitions, Definitions 4.1 and 4.8 are equivalent. Indeed (4.10) can be equivalently rewritten
as for every given u∗−i ,
∑i∈I Ji (u∗i , u∗−i ) ≤ ∑i∈I Ji (vi , u∗−i ), ∀ v ∈ Ω.
Thus it is immediately seen that u∗
satisfies Definition 4.8 if u∗ is an NE solution in the sense of Definition 4.1.
Conversely, if u∗ satisfies Definition 4.8, then it constitutes an NE solution in the sense of Definition 4.1 as shown
next by contradiction. Assume to the contrary that such a u∗ is not an NE solution in the sense of Definition 4.1. This
implies that for some i ∈ I, there exists a ũi ≠ u∗i such that Ji (ũi , u∗−i ) < Ji (u∗i , u∗−i ). By adding ∑ j∈I, j≠i J j (u∗j , u∗− j )
to both sides, the following inequality holds,
J̃(ũ; u∗ ) < J̃(u∗ ; u∗ ),
where ũ := (ũi , u∗−i ) and ũi ≠ u∗i , but this contradicts the hypothesis (4.10), so Definitions 4.1 and 4.8 are equivalent.
Based on the concept of the NG cost function J̃, (4.9), a proof of Theorem 4.4 can be given under Assumption 4.5 with twice continuously differentiable cost functions (Theorem 4.4 in [19]). We present it here to give a better interpretation
of the two-argument NG cost function, as it proves to be very useful for games with coupled constraints (in the following chapter).
Proof of Theorem 4.4 under Assumption 4.5 based on J̃
From (4.9), the two-argument NG cost function J̃(v; u) is separable in the (first) extra argument v for every given u,
i.e., each component cost function in J̃(v; u) is decoupled in v for every given u. Therefore, by using (4.9), for every
given u, the gradient of J̃(v; u) with respect to v is written as
∇v J̃(v; u) := [ ∂ J1 /∂ v1 (v1 , u−1 ), . . . , ∂ JN /∂ vN (vN , u−N ) ]T . (4.11)
The Hessian of J̃(v; u) with respect to v, ∇2vv J̃(v; u), is a diagonal matrix with elements ∂ 2 Ji /∂ (vi )2 (vi , u−i ), i = 1, . . . , N.
Under Assumption 4.5, from the strict convexity of Ji (vi , u−i ) with respect to vi (for every given u−i ), it follows that
J̃(v; u) is strictly convex with respect to its argument v, for every given u. Moreover, J̃(v; u) is continuous in its
arguments. For every given u, let us define an augmented best-response set R̃ : Ω ⇒ Ω,
R̃(u) = arg min v∈Ω J̃(v; u),
where the minimization on the right-hand side (RHS) is done with respect to the argument v in J̃(v; u). Recall that Ω is
compact. By the continuity and convexity properties of J̃, it follows from Theorem A.7 (Berge’s Maximum Theorem)
that R̃ has a closed graph (is an upper semi-continuous mapping), and maps each point u in Ω into a compact and
convex subset of Ω. Then by Kakutani’s Fixed-point Theorem (Theorem A.10), there exists a point u∗ such that
u∗ ∈ R̃(u∗ ), i.e., u∗ satisfies (4.10). By Definition 4.8, u∗ is an NE solution of G (I, Ωi , Ji ).
As seen in the proof, an NE solution u∗ is a fixed point satisfying u∗ ∈ R̃(u∗ ), i.e., u = u∗ is a solution of the implicit equation
u = arg min v∈Ω J̃(v; u). (4.12)
For an inner NE solution, the corresponding necessary condition is
∇v J̃(v; u) |v=u = 0, (4.13)
where the notation “ |v=u ” denotes finding a fixed-point solution. By using (4.11), we get the component-wise form
of (4.13):
∂ Ji /∂ vi (vi , u−i ) |vi =ui = 0, ∀ i ∈ I, (4.14)
which is equivalent to the necessary conditions (4.8) in Proposition 4.6, i.e., to u∗i ∈ Ri (u∗−i ). Based on
this, one can summarize the procedure for finding an inner NE solution with respect to J̃ as follows. As a first step, solve
∇v J̃(v; u) = 0 for every given u (i.e., v is the only variable), which gives v as a function of u. Then look for a
fixed-point solution of these equations, obtained by setting v = u as in (4.13). Solving the
resulting set of N equations, (4.14), yields an inner NE solution.
We give next an illustration of this procedure.
Example 4.10. Consider the two-player Nash game in Example 4.7 and its NG cost function
J̃(v; u) = J1 (v1 , u2 ) + J2 (v2 , u1 ) = (2(v1 )2 − 2v1 − v1 u2 ) + ((v2 )2 − (1/2) v2 − u1 v2 ).
Assume there exists an inner NE solution. Then, for every given u, in the first step we solve ∇v J̃(v; u) = 0, i.e.,
∂ J1 /∂ v1 (v1 , u2 ) = 4v1 − 2 − u2 = 0,
∂ J2 /∂ v2 (v2 , u1 ) = 2v2 − 1/2 − u1 = 0.
In the second step we solve for a fixed-point solution by setting v = u in the above, which leads to (u∗1 , u∗2 ) = (9/14, 4/7).
While existence can be guaranteed under relatively mild conditions, uniqueness of an NE depends on the particular
situation and general results are hard to arrive at. In games where an NE is not unique and NE(G ) is not a singleton,
one may seek various refinements of the NE concept in hopes of finding a u∗ ∈ NE(G ) which is “better” than the
others under some refinement, as discussed in the previous chapter.
4.6 Example: Optical Network OSNR Game
In this section we describe, as an application, the basic formulation of a power control game in optical networks,
[121]. In large-scale networks decisions are often made by users independently [84], each according to its own
performance objective. This is also appropriate for large-scale optical networks, where it is difficult to maintain a
centralized system for transmitting real-time information between all channels, and cooperation among channels is
impractical. This makes noncooperative game theory a suitable framework, [103], [19]. This problem belongs to
a class of resource allocation problems in general communication networks [131, 4]. In optical network systems, a signal
over the same fiber link can be regarded as an interfering noise for others, which leads to optical signal-to-noise
(OSNR) degradation. A satisfactory OSNR at Rx for each channel may be achieved by regulating the input power
per channel at Tx. We restrict the analysis to single point-to-point optical links, as the simplest network topology.
Channel OSNR optimization can be formulated as an N-player noncooperative game. Conditions for existence and
uniqueness of the game Nash equilibrium solution can be obtained and an iterative algorithm that uses only channel
specific feedback measurements can be shown to converge to the Nash equilibrium solution, [117, 119].
Consider a point-to-point WDM (wavelength division multiplexed) fiber link shown in Fig. 4.2 in which reconfigu-
ration is finished, i.e., channels will not be added or dropped while performing the optimization. A set I = {1, . . . , N}
of channels are transmitted over the link. The link consists of cascaded spans of optical fiber and optical amplifiers
(OA). Optical amplifiers simultaneously boost the power of all channels but introduce Amplified Spontaneous Emis-
sion (ASE) noise that gets also amplified at each amplification stage. We denote ui and n0i the signal power and noise
power of channel i ∈ I at Tx, respectively. Similarly, we denote pi and ni the signal power and noise power of channel
i ∈ I at Rx, respectively. Let u = [u1 , . . . , uN ]T denote the vector form of the signal power at Tx. Equivalently, we
write u = (ui , u−i ). Signal power at Tx is typically limited for every channel, that is ui ∈ Ωi = [0, umax ] where umax
is a positive constant. Thus ui ∈ Ωi and u ∈ Ω.
The OSNR of channel i at Rx is denoted by yi , where yi = pi /ni , and can be shown to be given as
yi = ui / ( n0i + ∑ j∈I Γi, j u j ), (4.16)
where Γ is a system matrix with positive entries that depend on the amplifier (OA) gains and ASE noise accumulation.
Equivalently, (4.16) can be rewritten as
yi = ui / ( X−i + Γi,i ui ), with X−i = ∑ j∈I, j≠i Γi, j u j + n0i , (4.17)
where X−i denotes the total interference on channel i due to the other channels’ powers.
Consider a game-theoretic approach to solve a channel OSNR optimization problem, based on the OSNR model,
(4.17). Specifically we formulate a Nash game where the players are the channels. Each channel i, i ∈ I is a player
that minimizes its own cost function Ji , by adjusting its transmission power ui , in response to the other channels’
(players’) actions, u−i . Such a game is denoted as before by G (I, Ωi , Ji ). The objective of each player is to minimize
its cost (maximize its utility) related to individual channel OSNR.
Figure 4.2: Point-to-point WDM optical link: (a) cascade of optical amplifier spans OA1 , . . . , OAN ; (b) equivalent lumped link (OA-Link) characterized by the system matrix Γ.
Let each cost function Ji be defined as the difference between a pricing function Pi and a utility function Ui ,
Ji (u) = Pi (u) − Ui (u), ∀ i ∈ I. (4.19)
We consider that the utility Ui is related to the channel’s OSNR performance, while the pricing term Pi is used to
penalize a channel for using too large an action (power). In general a pricing mechanism is known to improve the
NE efficiency, and linear pricing is the simplest one, [130]. Thus
Pi (ui ) = αi ui ,
where αi > 0, and the linear pricing term reflects the fact that increasing one channel’s power degrades the OSNR of
all other channels.
Consider a channel utility function Ui (u) that reflects the channel’s preference for maximizing OSNRi, yi and such
that the following assumptions hold:
(A.ii.1) The utility function Ui (u) is a continuously differentiable function and strictly concave in ui .
(A.ii.2) ui = 0, ui = umax are not solutions to the minimization of the cost function Ji with respect to ui .
Below we indicate how we can construct a utility function that satisfies (A.ii.1), (A.ii.2). Note that OSNRi , i.e.,
yi in (4.17), is a strictly increasing function with respect to ui , and tends to 1/Γi,i , for infinite channel power.
This is similar to the wireless SIR model, [7], even though a different physical mechanism is present (ASE noise
accumulation). In the SIR model the system matrix has a special structure with equal rows, which is instrumental in
the uniqueness results. In contrast, here Γ is a full, general-structure matrix, with coupling due to all channels and all
spans. Moreover, OSNRi is no longer a linear function of ui and a direct logarithmic utility function of the associated
SIR as in the wireless case cannot be applied. For the general full matrix Γ, one can define a more general utility
function Ui (u), here chosen to be a logarithmic function of the associated channel’s OSNR yi (u),
Ui (u) = βi ln ( 1 + ai yi (u) / (1 − Γi,i yi (u)) ), ∀ i ∈ I, (4.20)
where βi > 0 quantifies the desire to maximize the OSNR and ai > 0 is a scaling parameter.
The utility function Ui defined in (4.20) is monotonically increasing in OSNR yi , so that maximizing utility is related
to maximizing channel OSNR yi . Equivalently, using (4.17),
Ui (ui , u−i ) = βi ln ( 1 + ai ui / X−i ), (4.21)
where X−i is given as in (4.17), as a function of the full system matrix Γ. Therefore the cost function (4.19) to be
minimized by each player i ∈ I is
Ji (ui , u−i ) = αi ui − βi ln ( 1 + ai ui / X−i ). (4.22)
In the above αi > 0, βi > 0 are pricing parameters that capture the trade-off between penalizing a channel for using
large power and its desire to maximize the utility. αi and βi are set by the network/link and the channel, respectively,
and act as weighting factors, quantifying the trade-off between pricing and utility. From (4.21) it follows immediately
that Ui satisfies (A.ii.1). Using (4.22), it can be shown that there exists a non-empty interval from which to select
βi /αi , such that (A.ii.2) holds. This is what we assume in the following, hence that αi , βi are selected such that an
NE is an interior point of the action set, or an inner NE solution.
This is only one particular choice of utility function; being logarithmic, it has the nice property of allowing a closed-form expression of the Nash equilibrium solution. Other examples can be given, such as a function linear in OSNR.
Some reasons for this choice of a logarithmic function are as follows. A logarithmic function is analytically useful
and moreover is widely used as a utility function in flow control [74, 137, 6] and power control [7, 3, 5] for general
communication networks. In some cases, the logarithmic utility function is intimately associated with the concept
of proportional fairness [74].
The following result characterizes a Nash equilibrium (NE) solution and gives a uniqueness condition.
Theorem 4.11. Consider G (I, Ωi , Ji ) with individual cost functions Ji , (4.22). This game admits a unique NE
solution u∗ if the parameters ai are selected such that
ai > ∑ j∈I, j≠i Γi, j , ∀ i ∈ I. (4.23)
The unique NE solution is then given by
u∗ = Γ̃−1 b̃, (4.24)
where Γ̃ = [Γ̃i, j ] and b̃ = [b̃i ] are defined as
Γ̃i, j = ai if j = i, Γ̃i, j = Γi, j if j ≠ i, and b̃i = ai βi /αi − n0i ,
with Γ = [Γi, j ] being the link system matrix.
Proof: From (4.19), (4.22) and (A.ii.1) it follows directly that ∂ 2 Ji /∂ (ui )2 > 0. Since the cost function Ji is strictly convex in
ui , there exists a unique minimizer u∗i of Ji ( · , u−i ) over the closed and bounded (compact) set [0, umax ], for any given u−i . Furthermore, by (A.ii.2), u∗i is inner. To find u∗i we solve the
necessary condition ∂ Ji /∂ ui = 0. From (4.22) one obtains
ai u∗i + X−i∗ = ai βi / αi , ∀ i, (4.25)
which defines explicitly the ith player’s best-response (reaction) function Ri ,
u∗i = Ri (u∗−i ), ∀i ∈ I
It can be seen that the reaction function is linear which greatly simplifies the closed-form expression of the NE. From
Theorem 4.4 an NE solution exists. Moreover a vector solution of (4.25) is an NE solution to the N-player game.
Use the definition of X−i∗ , (4.17), to rewrite (4.25) as
ai u∗i + ∑ j∈I, j≠i Γi, j u∗j = ai βi /αi − n0i , ∀ i,
or, in matrix form,
Γ̃ u∗ = b̃, (4.26)
where the matrix Γ̃ and the vector b̃ are defined as Γ̃ = [Γ̃i, j ] and b̃ = [b̃i ] with
Γ̃i, j = ai if j = i, Γ̃i, j = Γi, j if j ≠ i, and b̃i = ai βi /αi − n0i .
Therefore a unique NE solution u∗ exists if the matrix Γ̃ is invertible. Recall that Γ is a positive-entry matrix. If
(4.23) holds, then Γ̃, (4.26), is strictly diagonally dominant; from Gershgorin’s Theorem, [68], it follows that Γ̃ is
invertible, and the unique NE solution is u∗ = Γ̃−1 b̃.
4.6.1 Iterative Algorithm
In this section we discuss a distributed iterative algorithm developed based on the fixed-point interpretation of an
NE, u∗ = R (u∗ ). Thus assume that all players follow a best-response strategy, i.e., best-response
(BR) play or BR dynamics, uk+1 = R (uk ). Based on (4.25), consider the following recursive relation for updating
the transmitter power level
ui^{k+1} = Ri (u−i^k ) = βi /αi − X−i^k /ai , ∀ i, (4.27)
where k denotes the iteration step. Thus (4.27) corresponds to a parallel adjustment scheme (PUA) whereby each
player responds optimally based on the best-response (BR) to the previously selected action of the other players.
Relation (4.27) requires the total interference factor X−i , which from (4.17) depends on all channel powers u j and
all channel gains, i.e., centralized information. However, using (4.17) one can express (4.27) in terms of yi , i.e.,
ui^{k+1} = βi /αi − (1/ai ) ( 1/yi^k − Γi,i ) ui^k . (4.28)
This corresponds to a decentralized algorithm, since the only information fed back is the individual channel OSNR
yi , which can be measured in real time, and the channel gain Γi,i . The following result gives convergence conditions
for this algorithm.
Lemma 4.12. If (4.23) holds, then algorithm (4.28) converges to the unique NE solution.
Proof: Let ei^k = ui^k − u∗i , where u∗ is the NE solution. Using (4.25), (4.27), (4.28) yields
ei^{k+1} = −(1/ai ) ∑ j≠i Γi, j e j^k , (4.29)
so that
‖e^{k+1}‖∞ = maxi |ei^{k+1}| ≤ maxi ( (1/ai ) ∑ j≠i Γi, j |e j^k | ),
where the fact that Γ has positive entries was used. Using |e j^k | ≤ ‖e^k‖∞ , ∀ j, yields
‖e^{k+1}‖∞ ≤ maxi ( (1/ai ) ∑ j≠i Γi, j ) ‖e^k‖∞ .
If (4.23) holds, then maxi ( (1/ai ) ∑ j≠i Γi, j ) < 1, hence
‖e^{k+1}‖∞ < ‖e^k‖∞ , ∀ k = 0, 1, 2, . . . ,
so e^k → 0 and the iterates u^k converge to the unique NE solution u∗ .
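A short simulation sketch of the OSNR-feedback update (4.28) follows; it reuses the same made-up 3-channel data as in the illustration after Theorem 4.11 (again, assumed values only) and checks convergence to the closed-form NE (4.24).

    import numpy as np

    # Illustrative (made-up) 3-channel data, chosen to satisfy condition (4.23).
    Gamma = np.array([[0.6, 0.2, 0.1],
                      [0.2, 0.5, 0.2],
                      [0.1, 0.3, 0.7]])
    n0, a = np.full(3, 0.1), np.ones(3)
    alpha, beta = np.ones(3), np.array([2.0, 2.5, 3.0])

    # Closed-form NE (4.24) for comparison.
    Gamma_t = Gamma.copy(); np.fill_diagonal(Gamma_t, a)
    u_star = np.linalg.solve(Gamma_t, a * beta / alpha - n0)

    # Decentralized iteration (4.28): each channel only needs its own OSNR y_i.
    u = np.full(3, 0.5)                        # arbitrary starting powers
    for k in range(200):
        X = Gamma @ u - np.diag(Gamma) * u + n0    # X_{-i}, entering only via y_i
        y = u / (X + np.diag(Gamma) * u)           # measured OSNR per channel
        u = beta / alpha - (1.0 / a) * (1.0 / y - np.diag(Gamma)) * u

    print(np.max(np.abs(u - u_star)))          # ~0: converged to the NE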
Chapter 5
Continuous-Kernel Games with Coupled Constraints
Chapter Summary
This chapter provides some results for continuous-kernel Nash games with coupled constraints, i.e., coupled action
sets. These are also called generalized Nash games, games with coupled constraints, social equilibria. Game
theoretical formulations of problems and computational approaches towards solving coupled or generalized Nash
games have been areas of much recent interest, despite the fact that work has been going on for more than 50 years.
We present some results based on the recent Lagrangian approach extension in [120], [121].
5.1 Introduction
As seen in the previous chapter, in continuous-kernel Nash games with uncoupled constraints, the action space is the
Cartesian product of the individual action sets and players can affect only the cost functions of the other players but
not their feasible action sets.
On the other hand, in Nash games with coupled constraints, each player’s action affects the feasible action sets of
the other players. In Example 4.7, the action sets are Ωi = [0, 8], i = 1, 2, such that the action space Ω is rectangular
in R2 . Now consider the following example with a modified action space.
Example 5.1. Consider the two-player Nash game in Example 4.7 with action space Ω := Ω1 × Ω2 and ui ∈
Ωi = [0, 8], i = 1, 2. An additional constraint is considered: u1 + u2 ≤ 8. The action space is thus modified to
Ω̄ = {u ∈ Ω | u1 + u2 − 8 ≤ 0}.
Fig. 5.1 shows that Ω is rectangular (the constraints have no coupling) while Ω̄ is triangular (the constraints are coupled).
In the latter case it is not possible to obtain separate action sets from which the players can take actions independently: the feasible sets u1 ∈ [0, 8 − u2 ] and u2 ∈ [0, 8 − u1 ] depend on the other player’s action. That is, the action space Ω̄ is coupled, and this game is called a
two-player Nash game with coupled constraints.
Figure 5.1: Constraints: Rectangular and triangular.
Starting from this example, we present in this chapter some theoretical results for Nash games with coupled
constraints, i.e., coupled action sets. We call such games coupled Nash games; they are also called generalized
Nash games (games with non-disjoint strategy sets), [61], games with coupled constraints, [125], social equilibria
games or pseudo-Nash equilibria games [49, 15]. Game-theoretical formulations of problems and computational
approaches towards solving coupled or generalized Nash games have been areas of much recent interest, [52, 133,
98, 99, 120, 136, 14]. The treatment in this chapter follows mostly the Lagrangian approach extension proposed
in [120]. The construction uses the two-argument NG cost function (see Chapter 4) and relaxes the
constraints into a two-argument form as well. The problem is thus enlarged into a constrained optimization problem in
a space of twice the dimension, followed by a restriction back to the original space (via a fixed-point solution). For
convex constraints, duality leads to a hierarchical decomposition into a lower-level game with no coupled constraints
and an optimization problem for the Lagrangian prices.
The chapter is organized as follows. In Section 5.2 some results on Nash equilibria existence are reviewed and relax-
ation via an augmented optimization problem is considered. This is followed by results for Lagrangian extension in
a game setup in Section 5.3. Sections 5.4 and 5.5 present results for duality extension and hierarchical decomposition
in a game setup, followed by an example section.
To formally define a coupled Nash game, let us consider the following coupled inequality constraints
gr (u) ≤ 0, r = 1, . . . , R, (5.1)
written in vector form as
g(u) := [g1 (u), . . . , gR (u)]T ≤ 0, (5.2)
and, for each r,
Ωr = {u ∈ Ω | gr (u) ≤ 0}, (5.3)
so that the overall coupled (feasible) action set is
Ω̄ = {u ∈ Ω | g(u) ≤ 0} = ∩r=1..R Ωr . (5.4)
As before, for every player i ∈ N an individual cost function Ji : Ω → R is defined that satisfies Assumption 4.5.
For every given u−i ∈ Ω−i , a projection action set is also defined for each i ∈ N ,
Ω̂ i (u−i ) = {vi ∈ Ωi | g(vi , u−i ) ≤ 0}. (5.5)
For each i ∈ N , this projection action set Ω̂ i (u−i ) is the feasible action set under the given u−i . A vector u =
(ui , u−i ) is called feasible if u ∈ Ω̄. The resulting coupled Nash game is denoted by G (N , Ω̂ i , Ji ).
Definition 5.2 (NE for coupled Nash games). A vector u∗ ∈ Ω̄ is called an NE solution of G (N , Ω̂ i , Ji ) if, for every i ∈ N ,
Ji (u∗i , u∗−i ) ≤ Ji (vi , u∗−i ), ∀ vi ∈ Ω̂ i (u∗−i ).
Figure 5.2: Reaction curves in the coupled Nash game: r1 (u2 ) and r2 (u1 ).
The following proposition (adapted from Theorem 4.4 in [19]) gives sufficient conditions for existence of an NE
solution, based on Kakutani’s fixed point theorem.
Proposition 5.3 (Theorem 4.4, [19]). Let the action space Ω be a compact and convex subset of RN . Under
Assumption 4.5, G (N , Ω̂ i , Ji ) admits an NE solution.
Consider the NG cost function J̃(v; u), (4.9), and note that J̃(v; u) is separable in its extra (first) argument v. Then, with respect to J̃(v; u), an NE solution u∗ of
G (N , Ω̂ i , Ji ) satisfies
J̃(u∗ ; u∗ ) ≤ J̃(v; u∗ ), ∀ v ∈ Ω, with g(vi , u∗−i ) ≤ 0, ∀ i ∈ N . (5.7)
This can be obtained by using Definition 5.2, the definition of J̃(v; u), (4.9), and the projection action set Ω̂ i (u−i ), (5.5).
Now, let us also augment the coupled constraints g(u) in (5.2) into an equivalent two-argument form g̃,
g̃(v; u) = ∑i∈N g(vi , u−i ). (5.8)
The use of the augmented function J̃, [19], defined on a space of twice the dimension of the original game, is
instrumental in what follows. This is because it allows one to find a solution of the original Nash game with coupled
constraints by solving a constrained optimization problem for J̃ and searching for a fixed-point solution. Two
main features allow this. Firstly, the NG cost function is separable in the argument v for every given u, i.e., each
component cost function Ji in J̃(v; u) depends only on vi . Secondly, the constraints g have been augmented into a
separable two-argument form g̃, thus enlarging the search set. NG-feasibility is equivalent to g̃(u; u) ≤ 0. Intuitively,
by introducing the NG cost function J̃(v; u), the coupled Nash game is related to a constrained optimization problem
for J̃(v; u) that has a fixed-point solution. This method was used for uncoupled Nash games in Chapter 4.
Herein, the optimization problem for J̃(v; u) is a constrained minimization of J̃(v; u) with respect to v, with constraints g̃(v; u) ≤ 0, (5.8). A solution u∗ of this constrained minimization satisfies, in a fixed-point sense,
J̃(u∗ ; u∗ ) ≤ J̃(v; u∗ ), ∀ v ∈ Ω with g̃(v; u∗ ) ≤ 0, (5.10)
and the next result shows that such a u∗ is an NE solution of the coupled game.
Proof: We prove the result by using a contradiction argument. Assume u∗ is not an NE solution of G (N , Ω̂ i , Ji ). It
follows that for some i ∈ N , there exists a ṽi ∈ Ωi with g(ṽi , u∗−i ) ≤ 0 such that
Ji (ṽi , u∗−i ) < Ji (u∗i , u∗−i ).
By adding the term ∑ j∈N , j≠i J j (u∗j , u∗− j ) to both sides, the following inequality holds,
J̃(ṽ; u∗ ) < J̃(u∗ ; u∗ ), where ṽ := (ṽi , u∗−i ).
Moreover,
g̃(ṽ; u∗ ) := ∑ j∈N , j≠i g(u∗j , u∗− j ) + g(ṽi , u∗−i ) ≤ 0,
so ṽ is admissible in (5.10), and the strict inequality above contradicts (5.10). Hence u∗ is an NE solution of G (N , Ω̂ i , Ji ).
As seen in the above section, for every given u ∈ Ω, the constrained minimization (5.10) is a standard constrained
optimization problem (see Appendix B). In a game context one can use a Lagrangian extension for a two-argument
constrained optimization, as proposed in [118, 120]. This method leads to an elegant hierarchical decomposition.
As in standard optimization, associated to (5.10), J̃ and g̃, a two-argument Lagrangian function L̃ is defined by
L̃(v; u; µ ) = J̃(v; u) + µ T g̃(v; u), (5.11)
where µ = [ µ1 , . . . , µR ]T ≥ 0 is a vector of Lagrange multipliers (prices).
Proposition 5.5. (a) (Necessity): Let u∗ be an NE solution of G (N , Ω̂ i , Ji ) in the sense of (5.10). Then there exists a multiplier vector µ ∗ ≥ 0 such that
∇v L̃(v; u; µ ∗ ) |v=u = 0, (5.12)
µ ∗ T g(u∗ ) = 0, (5.13)
where the notation “ |v=u ” is defined in (4.13) and denotes finding a fixed-point solution.
(b) (Sufficiency): Let u∗ be a feasible vector together with a vector µ = [ µ1 , . . . , µR ]T such that µ ≥ 0 and
µ T g(u∗ ) = 0. Assume that u∗ minimizes the Lagrangian function L̃, (5.11), over v ∈ Ω as a fixed-point
solution, i.e., u = u∗ satisfies
u = arg min v∈Ω L̃(v; u; µ ). (5.14)
Then u∗ is an NE solution of G (N , Ω̂ i , Ji ) in the sense of (5.10).
Note that if u∗ is an inner NE solution, then µ ∗ = 0, so that (5.12) is equivalent to (4.13), the necessary condition
for an NE solution in an uncoupled Nash game. For a proof see [120].
Remark 5.6. The Lagrangian optimality condition in Proposition 5.5 shows that u∗ is obtained by first minimizing
the augmented Lagrangian function L̃(v; u; µ ) with respect to the argument v, which gives v = φ (u) for every given
u. The next step involves finding a fixed-point solution u∗ of φ by setting v = u, i.e., solving u = φ (u). For this u∗ , a
fixed-point solution to the minimization of L̃ over v ∈ Ω, the following holds: L̃(u∗ ; u∗ ; µ ) ≤ L̃(v; u∗ ; µ ), ∀ v ∈ Ω.
Note that u∗ thus obtained depends on µ , written u∗ ( µ ). An optimal µ ∗ is then obtained by solving a higher-level problem in µ , discussed in Section 5.4.
Example 5.7. Consider the two-player Nash game presented in Example 5.1 with a coupled constraint. The corresponding Nash game (NG) cost function J̃(v; u) is
J̃(v; u) = (2(v1 )2 − 2v1 − v1 u2 ) + ((v2 )2 − (1/2) v2 − u1 v2 ),
the augmented constraint is
g̃(v; u) = g(v1 , u2 ) + g(v2 , u1 ) = (v1 + u2 − 8) + (u1 + v2 − 8) ≤ 0,
and the associated Lagrangian function is
L̃(v; u; µ ) = J̃(v; u) + µ g̃(v; u)
= (2(v1 )2 − 2v1 − v1 u2 ) + ((v2 )2 − (1/2) v2 − u1 v2 ) + µ (v1 + u2 − 8 + u1 + v2 − 8).
To find an NE solution u and the corresponding Lagrange multiplier µ , one needs to solve the necessary
conditions (5.12), together with the slackness condition (5.13), with vi = ui , i = 1, 2. Then it follows that
4u1 − 2 − u2 + µ = 0, 2u2 − 1/2 − u1 + µ = 0, µ (u1 + u2 − 8) = 0, µ ≥ 0.
For µ = 0 the solution is (u1 , u2 ) = (9/14, 4/7), which satisfies u1 + u2 < 8; the coupled constraint is thus inactive and the NE solution of the coupled game coincides with that of Example 4.7, with µ ∗ = 0.
Note that as Ji and gr are differentiable convex functions and Ω = RN , the Lagrangian function L̃(v; u; µ ) is convex
with respect to v, so the Lagrangian minimization is equivalent to the first-order necessary condition. Thus in the
presence of convexity the first-order optimality conditions are also sufficient.
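A brief numerical sketch of this example follows: for a given price µ it solves the fixed-point stationarity conditions as a linear system and checks complementary slackness (with the data of Example 5.1 the coupled constraint is inactive, so µ = 0 is consistent); the solver-based approach is simply an illustrative choice.

    import numpy as np

    def inner_solution(mu):
        # Stationarity of the Lagrangian with v = u imposed:
        #   4u1 - 2 - u2 + mu = 0,   2u2 - 1/2 - u1 + mu = 0
        A = np.array([[4.0, -1.0],
                      [-1.0, 2.0]])
        c = np.array([2.0 - mu, 0.5 - mu])
        return np.linalg.solve(A, c)

    P_total = 8.0                      # coupled constraint u1 + u2 <= 8
    u = inner_solution(0.0)            # try the price mu = 0 first
    slack = P_total - u.sum()

    # Complementary slackness: mu = 0 is consistent only if the constraint holds.
    print(u, slack)                    # [0.643, 0.571], slack > 0 -> inactive
    assert slack >= 0.0                # so (u*, mu*) = ((9/14, 4/7), 0) satisfies (5.12)-(5.13)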
5.4 Duality Extension
In this section we present some duality results based on the Lagrangian extension. A dual cost function related to
the minimization of the associated Lagrangian function (cf. Proposition 5.5) is introduced, similar to standard optimization, [33]. For Nash games with convex coupled constraints, one can show that duality enables decomposition
into a lower-level Nash game with no coupled constraints and a higher-level optimization problem.
Consider a Nash game with coupled constraints and recall the associated Lagrangian function L̃ and its minimization
in a fixed-point sense, as in (5.16), with the resulting fixed-point solution written as a function of µ , u∗ = u∗ ( µ ).
We define the dual cost function D( µ ) as
D( µ ) := L̃(u∗ ; u∗ ; µ ), (5.17)
where u∗ minimizes L̃, defined in (5.11), over v ∈ Ω as a fixed-point solution, i.e., u = u∗ satisfies
u = arg min v∈Ω L̃(v; u; µ ).
Equivalently,
D( µ ) := [ min v∈Ω L̃(v; u; µ ) ] | arg min v∈Ω L̃ = u ,
where g̃(u; u) ≤ 0, and the dual optimal value is defined as
D∗ = max µ ≥0 D( µ ). (5.18)
The primal and dual optimal solution pairs are characterized by the following result (Theorem 2 in [120]).
Theorem 5.8. (u∗ ; µ ∗ ) is an optimal NE solution-Lagrange multiplier pair in the sense of (5.10), (5.18), if and only
if:
(1) u∗ ∈ Ω and g̃(u∗ ; u∗ ) ≤ 0 (NG-feasibility);
(2) µ ∗ ≥ 0 (dual feasibility);
(3) u∗ = arg { min v∈Ω L̃(v; u; µ ∗ ) |v=u } (Lagrangian optimality);
(4) µ ∗T g̃(u∗ ; u∗ ) = 0 (complementary slackness).
Proof: If (u∗ ; µ ∗ ) is an optimal NE solution-Lagrange multiplier pair, then u∗ is feasible and µ ∗ is dual feasible and
the first two relations follow directly. The last two relations follow from Proposition 5.5.
For sufficiency, using Lagrangian optimality one obtains
L̃(u∗ ; u∗ ; µ ∗ ) = [ min v∈Ω L̃(v; u; µ ∗ ) ] |v=u ,
so that
L̃(u∗ ; u∗ ; µ ∗ ) ≤ L̃(v; u∗ ; µ ∗ ), ∀ v ∈ Ω.
By complementary slackness, L̃(u∗ ; u∗ ; µ ∗ ) = J̃(u∗ ; u∗ ), while for every v ∈ Ω with g̃(v; u∗ ) ≤ 0 and µ ∗ ≥ 0 we have L̃(v; u∗ ; µ ∗ ) ≤ J̃(v; u∗ ); hence J̃(u∗ ; u∗ ) ≤ J̃(v; u∗ ) for all such v. Therefore (5.10) holds and u∗ is an optimal NE game solution with J̃∗ = J̃(u∗ ; u∗ ). Using (5.17), evaluated at µ ∗ ,
and the foregoing relations yields
D( µ ∗ ) = [ min v∈Ω L̃(v; u; µ ∗ ) ] |v=u = J̃(u∗ ; u∗ ).
The separability in the extra argument of both the NG cost function and the constraints ensures that D( µ ) in (5.17) can
be decomposed. Such a decomposition result for the minimization of L̃(v; u; µ ) (Theorem 3 in [120], where the extra
argument is in fact the second one) is presented next. The result shows that the minimum of L̃(v; u; µ ) with respect
to v ∈ Ω can be obtained by minimizing a set of one-argument Lagrangian functions. Thus the fact that both the NG
cost function and the constraints are separable in the extra argument is exploited to show that the dual cost function
D( µ ) can be decomposed and, equivalently, found by solving a modified Nash game with no coupled constraints.
Proposition 5.9 (Theorem 3, [120]). Consider G (N , Ω̂ i , Ji ). Let the action space Ω be a compact and convex
subset of RN and let Assumption 4.5 be satisfied. The associated dual cost function D( µ ), (5.17), can be decomposed
as
D( µ ) = ∑ i=1..N Li (u∗i ( µ ), u∗−i ( µ ), µ ), (5.19)
where
Li (vi , u−i , µ ) = Ji (vi , u−i ) + µ T g(vi , u−i ) (5.20)
and u∗ ( µ ) = [u∗i ( µ )] ∈ Ω minimizes the set of Li defined in (5.20) over vi ∈ Ωi as a fixed-point solution, ∀ i ∈ N . In
other words, ui = u∗i ( µ ) satisfies
ui = arg min vi ∈Ωi Li (vi , u−i , µ ), ∀ i ∈ N .
Proof: By Proposition 5.5, the necessary conditions for NE optimality with respect to the Lagrangian L̃, (5.11), require solving
∇v L̃(v; u; µ ) |v=u = 0, (5.21)
or, equivalently, component-wise, ∂ L̃(v; u; µ )/∂ vi |vi =ui = 0, i = 1, . . . , N. Using the definitions of J̃ and g̃, one can write L̃, (5.11), as
L̃(v; u; µ ) = ∑ i=1..N Li (vi , u−i , µ ). (5.22)
Moreover, because a fixed-point solution is sought, one sets v = u, i.e., component-wise one needs to solve
vi (u−i ) = ui , ∀ i = 1, . . . , N,
for a fixed-point vector denoted u∗ = [u∗i ], v = [u∗i ], which depends on µ . With this u∗ , let us return now to the
value functional in (5.22). The first step taken in order to obtain u∗ was minimization with respect to v, so that from
(5.22) one has
min v∈Ω L̃(v; u; µ ) = min v∈Ω ∑ i=1..N Li (vi , u−i , µ ),
for any given u, with Li as in (5.20). Since Ω = Ω1 × · · · × ΩN and the right-hand side is separable with respect to
v = [vi ], vi ∈ Ωi , it follows that
min v∈Ω L̃(v; u; µ ) = ∑ i=1..N min vi ∈Ωi Li (vi , u−i , µ ), (5.25)
for any given u. Now evaluating (5.25) at the fixed point u∗ = [u∗i ], v = [u∗i ] obtained as above, one can write
[ min v∈Ω L̃(v; u; µ ) ] |u=u∗ ,v=u∗ = ∑ i=1..N [ min vi ∈Ωi Li (vi , u−i , µ ) ] |ui =u∗i ,vi =u∗i = ∑ i=1..N Li (u∗i ( µ ), u∗−i ( µ ), µ ).
The proof is completed by using (5.17) and recalling that u∗ is a fixed-point solution to the set of N optimizations
(5.24), i.e., equivalently, u∗ is an NE solution to the Nash game with cost functions Li , (5.20).
Proposition 5.9 yields a decomposition into a lower-level modified Nash game with cost functions Li , (5.20), with
no coupled constraints, and a higher-level optimization problem. The interpretation is that a procedural method for
finding a solution to a Nash game with coupled constraints can be based on solving a modified game with no coupled
constraints and an optimization problem. In general, u∗ ( µ ) may not be NE optimal for the given µ , in the sense
of attaining the minimum NG cost such that L∗i = Ji∗ . However, by Theorem 5.8 there exists a dual optimal price
µ ∗ ≥ 0 such that u( µ ∗ ) = [ui ( µ ∗ )] is NE optimal. Hence µ ∗ can be found as the maximizer in (5.18). A sufficient
condition is that the dual cost D( µ ) is strictly concave in µ , for u∗ ( µ ) as obtained from the lower-level game, (5.20).
Alternatively, the price µ can be adjusted until the slackness conditions in Theorem 5.8 are satisfied, indicating that
the dual optimal price µ ∗ has been found.
This decomposition result has a hierarchical game interpretation, [19]. At the upper-level is a Stackelberg game
([19], pp. 179): the link is the leader that sets "prices" (Lagrange multipliers) and the N players are the followers.
Given prices as set by the leader, a Nash game is played at the lower-level between N players, with cost functions
Li , (5.20). Each player reacts to given "prices" and the price acts as a coordination signal.
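To make this hierarchical structure concrete, here is a small illustrative sketch: for a hypothetical two-player quadratic game with one shared linear constraint, the lower-level game with costs Li is solved in closed form for each price µ, and the price is adjusted by a projected subgradient step on the constraint violation until complementary slackness approximately holds. All cost functions and numerical values below are assumptions chosen for illustration.

    import numpy as np

    # Hypothetical two-player game: J1 = (u1-3)^2 + 0.5*u1*u2,
    # J2 = (u2-3)^2 + 0.5*u1*u2, shared constraint g(u) = u1 + u2 - 4 <= 0.
    def lower_level_NE(mu):
        # For a given price mu, each player minimizes L_i = J_i + mu*g, giving
        # the linear fixed-point system  u_i + 0.25*u_{-i} = 3 - mu/2.
        A = np.array([[1.0, 0.25],
                      [0.25, 1.0]])
        c = np.full(2, 3.0 - mu / 2.0)
        return np.linalg.solve(A, c)

    # Higher-level price coordination: projected subgradient on the violation.
    mu, step = 0.0, 0.2
    for _ in range(200):
        u = lower_level_NE(mu)
        mu = max(0.0, mu + step * (u.sum() - 4.0))

    u = lower_level_NE(mu)
    print(mu, u, u.sum())   # mu* ~ 1.0, u* ~ (2, 2); constraint tight, slackness holds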
As in standard optimization, the hierarchical game decomposition may offer computational advantages. For example,
the lower-level game may admit a closed-form explicit solution, or the higher-level problem may have a reduced
dimension. One such application of these results is presented in the following section.
5.6 Example: Optical Network Game with Constraints
Consider the example in Section 4.6. One important constraint that was not considered there is the link capacity constraint.
This is a coupled constraint on channel powers: the total power launched into an optical fiber is restricted
below the nonlinearity threshold P̂, [2, 93]. This constraint has to be imposed at the Txs on all channels that share a link,
so unlike Section 4.6, herein we consider a Nash game with coupled constraints, hence a coupled action space
Ω̄ = { u ∈ Ω | ∑ j∈N u j − P̂ ≤ 0 }. (5.26)
Herein, a coupled action space means that a player’s action affects the feasible action sets of the other players. The
feasible action set for each channel i is the projection set
Ω̂ i (u−i ) = { vi ∈ Ωi | ∑ j∈N , j≠i u j + vi − P̂ ≤ 0 }. (5.27)
The first approach we consider for treating such a game is an indirect one, based on incorporating the coupled
constraint into each cost function. Part of the results in this section are based mainly on [110, 112, 114]. Consider a
new individual cost function Ĵi : Ω → R, for each i ∈ N , defined as
Ĵi (u) = P̂i (u) − Ui (u), (5.28)
where Ui : Ω → R is the same utility function as in (4.20), while the new pricing function P̂i : Ω → R is defined as
P̂i (ui , u−i ) = αi ui + 1 / ( P̂ − ∑ j∈N u j ), ∀ i ∈ N , (5.29)
hence a linear pricing term with αi > 0 as before, plus a new regulation (penalty) term. The regulation term is
constructed from the link capacity constraint. It penalizes any violation of the constraint, in that the regulation
term tends to infinity as the total power approaches the total power target P̂, so the pricing function P̂i (ui , u−i )
increases without bound. Hence the system resource is preserved by forcing all channels to decrease their input
powers, which indirectly enforces the link capacity constraint. This game is denoted by G (N , Ω̂ i , Ĵi ). A solution u∗ of
G (N , Ω̂ i , Ĵi ) is called an NE solution in the sense of Definition 5.2. If, in addition, the solution is not on the boundary of the action space Ω̄, it is called an inner NE solution. Note that points on the hyperplane { u | ∑ j∈N u j = P̂ }
are not NE solutions of G (N , Ω̂ i , Ĵi ). In addition, since ui = 0 means channel i is inactive in the link, an NE solution u∗ of G (N , Ω̂ i , Ĵi ) with a zero component, say u∗1 = 0, implies that channel 1 does not have any effect on the
game. So in this case the game is equivalent to one in which (N − 1) channels play, and the NE solution to the
(N − 1)-player Nash game does not have a zero component. In this sense, we assume that any NE solution u∗ to the
N-player OSNR Nash game does not have zero components, and an NE solution u∗ is always inner. The following
result provides sufficient conditions for existence and uniqueness of an inner NE solution.
Proof: (Existence) The action space Ω̄ is a compact and convex set with a non-empty interior. Each cost function
Ĵi (ui , u−i ) is continuous and bounded, and the first and second partial derivatives of Ĵi (ui , u−i ) with respect to ui are
well defined on Ω̄ except on the hyperplane { u | ∑ j∈N u j = P̂ }, and are given as
∂ Ĵi (u)/∂ ui = αi + 1/(P̂ − ∑ j∈N u j )2 − βi ai /(X−i + ai ui ), ∀ i ∈ N . (5.33)
In this section we use a direct approach to treat the coupled constraints, based on Lagrangian extension. As we shall
see this offers reduced complexity in computing a Nash equilibrium.
Consider that each channel i ∈ N minimizes not a penalized cost Ĵi as in the previous section, but a cost function
Ji : Ω → R as in Section 4.6, defined in (4.19), (4.22). Here the action space is coupled as in (5.26), and the action set
of channel i ∈ N is the projection set Ω̂ i (u−i ) defined in (5.27). We denote this game by G (N , Ω̂ i , Ji ), a game
in the class of N-player Nash games with coupled utilities and coupled constraints. It follows from Section 4.6 that Ji
is continuously differentiable in its arguments and convex in ui . Note that the overall coupled action space Ω̄ is
compact and convex as well. Then from Proposition 5.3, G (N , Ω̂ i , Ji ) admits an NE solution. Moreover, due to the
coupled constraint (5.26), solving directly for an NE solution of this game requires coordination among possibly all
channels.
In the following, we use the Lagrangian extension and decomposition results as a natural way to obtain a hierarchical decomposition and compute an NE solution of G (N , Ω̂ i , Ji ). For this game, consider the separable NG cost
function J̃(v; u) and the separable augmented constraint g̃(v; u), (5.8), together with the augmented Lagrangian
function L̃(v; u; µ ), (5.11), and the dual cost function D( µ ), (5.17). Then G (N , Ω̂ i , Ji ) is related to a constrained
minimization of J̃, (4.9), (5.8), with respect to the extra argument v, that admits a fixed-point solution. Individual
components of a solution u∗ to this constrained minimization constitute an NE solution to G (N , Ω̂ i , Ji ) in the sense
of Definition 5.2. From Remark 5.6, we know that u∗ can be obtained by first minimizing L̃(v; u; µ ) with respect to
v, obtaining v(u). The next step involves finding a fixed-point solution v(u∗ ) = u∗ , which depends on µ , i.e., u∗ ( µ ).
Proposition 5.11. Consider the coupled OSNR Nash game G(N, Ω̂_i, J_i) with cost functions J_i(u_i, u_{−i}), (4.19), subject to the linear constraint (5.26), i.e., over Ω̂_i. Then the dual cost function D(µ) can be decomposed as

D(µ) = ∑_{i=1}^N L_i(u_i∗(µ), u_{−i}∗, µ) + ∑_{i=1}^N µ (1_{N−1}^T u_{−i}∗ − P̂)    (5.35)

where u∗(µ) = [u_i∗(µ)] is an NE solution of the Nash game G(N, Ω_i, L_i) with costs L_i (5.36) and no coupled constraints.
Proof: Each cost function J_i, (4.19), is continuously differentiable and convex in u_i, and the constraints are linear, so that one can apply Proposition 5.9. The linear constraint (5.26) is rewritten in a two-argument form

g̃_i(v_i, u_{−i}) = v_i + 1_{N−1}^T u_{−i} − P̂ ≤ 0,   i ∈ N
Then by using Proposition 5.9, the dual cost function D(µ) can be decomposed as

D(µ) = ∑_{i=1}^N L_i(u_i∗(µ), u_{−i}∗(µ), µ)
Recall that in D(µ) in Proposition 5.9 one has to minimize first with respect to v_i on the right-hand side, and then solve for a fixed-point solution. From (5.38) it can be seen that only the first two terms depend on v_i. Hence, substituting for L_i(v_i, u_{−i}, µ), (5.38), on the right-hand side of D(µ) and isolating the terms that are independent of v_i, yields

D(µ) = ∑_{i=1}^N min_{v_i ∈ Ω_i} L_i(v_i, u_{−i}, µ)|_{v_i = u_i} + ∑_{i=1}^N µ (1_{N−1}^T u_{−i} − P̂)

where L_i is defined as in the corollary statement (5.36). A fixed-point solution u∗ = [u_i∗] to the set of N optimizations on the right-hand side of the foregoing is an NE solution to the Nash game with cost functions L_i, (5.36), and the last part of the claim follows.
Proposition 5.11 leads to a hierarchical decomposition of G(N, Ω̂_i, J_i) into a lower-level modified Nash game G(N, Ω_i, L_i) with cost functions L_i (5.36) and no coupled constraints, and a higher-level optimization problem used for coordination. This decomposition is computationally simpler, as shown below. For a given price µ, the lower-level game admits a closed-form explicit solution. Specifically, using (5.36), (4.19), (4.22) we see that L_i satisfies

L_i(v_i, u_{−i}, µ) = (α_i + µ) u_i − β_i ln(1 + a_i u_i / X_{−i})

i.e., L_i is the same as J_i, (4.22), with α_i replaced by α_i + µ, ∀i. Therefore, for each given µ, the NE solution u∗(µ) to the lower-level game G(N, Ω_i, L_i) with cost L_i is unique and can be obtained as in Theorem 4.11 as

u∗(µ) = Γ̃^{-1} ( Diag[1./(α + µ)] b_0 − n_0 )    (5.39)

where b_0 = [a_i β_i], n_0 = [n_{0i}] and Diag[1./(α + µ)] = Diag([1/(α_i + µ)]). Next, based on the explicit solution (5.39) and on price coordination at the higher level, a recursive hierarchical algorithm is discussed. By Theorem 5.8
applied to the coupled OSNR game G(N, Ω̂_i, J_i) with costs J_i and coupled constraints (5.26), (u∗, µ∗) is an optimal NE solution - Lagrange multiplier pair if and only if u∗ is NG-feasible,

∑_{i=1}^N u_i∗(µ) ≤ P̂,   u_i∗ ∈ Ω_i,   i ∈ N    (5.40)

µ∗ ≥ 0, µ∗(∑_{i=1}^N u_i∗ − P̂) = 0 (slackness condition), and the Lagrangian optimality condition

u∗ = [ arg min_{v∈Ω} L̃(u, v; µ∗) ]_{v=u}    (5.41)
holds. By Proposition 5.11 and (5.35), note that u∗(µ) solving (5.41) can be found as an NE solution to the modified Nash game G(N, Ω_i, L_i) with costs L_i (5.36), with no coupled constraints. For every given price µ, this NE solution u∗(µ) is unique as in (5.39). Furthermore, from (5.39) it is seen that all components of u∗(µ) decrease with µ. One can exploit the linear constraint and adjust the price µ to satisfy the slackness condition. Instead of maximizing D(µ), the optimal price µ∗ can be obtained such that the slackness condition holds, i.e., as the point of intersection between the curve representing total power, u_T∗(µ) = ∑_{i=1}^N u_i∗(µ), and the level P̂ (Fig. 1). This method has the interpretation of a coordination mechanism. The link, as the coordinator, sets the price at the optimal value µ∗. The channels respond by adjusting their power levels to u_i∗(µ∗), which minimize their own costs.
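As an illustration of (5.39) and of the monotone dependence of u∗(µ) on the price, the following Python sketch evaluates the closed-form lower-level solution for a small hypothetical 3-channel link; all numerical values (Γ̃, a_i, α_i, β_i, n_0) are made-up placeholders, not data from the OSNR model:

import numpy as np

# Hypothetical 3-channel data; Gamma_tilde plays the role of the system matrix in (5.39).
Gamma_tilde = np.array([[1.0, 0.2, 0.1],
                        [0.2, 1.2, 0.2],
                        [0.1, 0.2, 0.9]])
a     = np.array([1.0, 1.2, 0.9])      # a_i
alpha = np.array([0.5, 0.6, 0.4])      # alpha_i
beta  = np.array([1.0, 1.0, 1.0])      # beta_i
n0    = np.array([0.05, 0.05, 0.05])   # n_{0i}

def u_star(mu):
    # Closed-form lower-level NE (5.39): u*(mu) = Gamma_tilde^{-1}(Diag[1/(alpha+mu)] b0 - n0)
    b0 = a * beta                       # b0 = [a_i beta_i]
    return np.linalg.solve(Gamma_tilde, b0 / (alpha + mu) - n0)

for mu in (0.0, 0.5, 1.0):
    u = u_star(mu)
    print(mu, u, u.sum())               # each component, and the total power, decreases with mu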
A hierarchical adjustment algorithm for coordinating both the link price (higher level) and the channel powers (lower level) is discussed next.
Link Algorithm
Every K iterations of the channel algorithm, the new link price µ is computed based on the received total power for all channels in the link, u_T(K) = ∑_{j=1}^N u_j(K), as

µ(k̄ + 1) = [ µ(k̄) + η (u_T(K) − P̂) ]^+    (5.42)

where k̄ is the link iteration number, η is the step-size and [z]^+ = max{z, 0}. This simple price update requires only the measurement of total power. Moreover, it corresponds to a gradient ascent technique on the dual cost if the link price is adjusted more slowly than the channel powers. At the higher level, µ(k̄) acts as a coordination signal that aligns individual optimality with
the system constraint, (5.26) or (5.40).
[Fig. 1: the total power curve ∑_{i=1}^N u_i∗(µ), decreasing in µ, intersects the level P̂ at the optimal price µ∗.]
This is the new price given to the channels, who repeat K iterations of the
following algorithm.
Channel Algorithm
Based on the price µ(k̄) from the link, the optimal channel power u∗(µ(k̄)) can be found explicitly as in (5.39), but this requires global centralized information. Instead, the following iterative update algorithm can be used

u_i(k + 1) = β_i/(α_i + µ(k̄)) − (1/a_i)(1/y_i(k) − Γ_{i,i}) u_i(k)    (5.43)

where k is the channel iteration number. For fixed µ this algorithm converges to the optimal NE solution (5.39).
Convergence of the combined algorithms can be proved, even for the multi-link case, based on time-scale decoupling [115]. Thus individual channels do not have to coordinate with other channels in the lower-level game.
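The following Python sketch puts the two levels together: K channel iterations of (5.43) per link price update, with the price adjusted by the projected step (5.42). It assumes a simple OSNR-type measurement model y_i = u_i/(n_{0i} + ∑_j Γ_{ij} u_j) purely to generate y_i(k); the matrix Γ, the parameters and the step size are illustrative placeholders, and the step size is chosen small enough for the two-time-scale argument to apply:

import numpy as np

# Illustrative link data (placeholders, not OSNR measurements).
Gamma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.2, 0.2],
                  [0.1, 0.2, 0.9]])
a     = np.array([1.0, 1.2, 0.9])
alpha = np.array([0.5, 0.6, 0.4])
beta  = np.array([1.0, 1.0, 1.0])
n0    = np.array([0.05, 0.05, 0.05])
P_hat = 2.0                              # total power target
eta, K = 0.05, 50                        # link step-size and channel iterations per price update

def measured_osnr(u):
    # Assumed measurement model, used only to generate y_i(k) for the channel update.
    return u / (n0 + Gamma @ u)

u, mu = 0.1 * np.ones(3), 0.0
for k_bar in range(300):                 # higher level: link price updates
    for k in range(K):                   # lower level: channel power updates, cf. (5.43)
        y = measured_osnr(u)
        u = beta / (alpha + mu) - (1.0 / a) * (1.0 / y - np.diag(Gamma)) * u
        u = np.maximum(u, 1e-9)          # keep powers positive
    mu = max(mu + eta * (u.sum() - P_hat), 0.0)   # projected price step, cf. (5.42)

print("total power:", u.sum(), " price:", mu)    # total power settles near P_hat

With these illustrative values the constraint is active, so the price converges to a positive µ∗ and the total power approaches P̂, mirroring the intersection in Fig. 1.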
5.7 Notes
Game theoretical formulations of problems and computational approaches towards solving generalized or coupled
Nash games have been areas of much recent interest, [52, 133, 98]. The study of conditions for existence and unique-
ness of Nash equilibrium [102] in pure strategies continues to be a fundamental issue. Only sufficient conditions
for existence are available. Uniqueness results exist only for special classes of games. Recent work on this topic
has focused on S-modular games [9], potential games [133], routing games in parallel links [18]. Uniqueness of a
normalized equilibrium point is studied in [125]. From a computation point of view, the study of generalized Nash
equilibrium presents severe analytical difficulties. Insightful theoretical results have been obtained for computation
of equilibria in classes of games with structures, such as two-player polynomial games [116], separable games [142]
or potential games [96]. Duality has also received interest from the perspective of games. Duality and dual games
are studied for repeated two-player zero-sum games in [47]. Results in [120], and closely related results in [52] that appeared independently at almost the same time, indicate continued interest in this area after more than 50 years. The work in [52] shows that a generalized Nash equilibrium can be calculated by solving a variational inequality, and the results express conditions in terms of the variational inequality problem and Karush-Kuhn-Tucker (KKT) conditions for the pseudo-gradient. Another related work is [98], where the authors present a scheme that associates a dual problem and KKT conditions to a generalized variational inequality, thus allowing primal and dual problems to be solved in the spirit of classical Lagrangian duality for constrained optimization, using set-theoretic concepts and set-valued operators. Extensions based on this approach have been developed for non-compact
and non-smooth settings in [136].
Chapter 6
Evolutionary Games and Evolutionary Stability
Chapter Summary
This chapter introduces evolutionary games and the concept of evolutionary stability in a large population of agents.
We start by introducing the concept of evolutionary stable strategy (ESS) for normal form symmetric matrix games
and then for more general population games.
6.1 Introduction
Classical game theory essentially requires that all the players make rational choices, basing their strategic choice on a wholly rationally determined evaluation of probable outcomes. Therefore it is fundamental that each player considers the strategic analysis that his opponents are making in determining that his own choice is
appropriate. Evolutionary game theory (EGT) shifts the paradigm: the game is imagined played over and over by
conditioned players or agents (conditioned either biologically or socially) who are drawn from large populations.
Actually this interpretation was initially mentioned by Nash himself. For a surprisingly long period of time, game
theorists forgot about Nash’s statistical population interpretation of his equilibrium concept (presented in his unpub-
lished doctoral thesis). Instead, they devised ever more sophisticated theories or definitions of rational behaviour.
However, the rationality assumptions became so stringent and demanding that the predictive value of the theory became doubtful. Secondly, there has been little success in solving the equilibrium selection problem. One of the great benefits of evolutionary game theory (EGT) is that it has shifted the focus away from viewing an equilibrium as a point from which one does not move, with no explanation of how one gets there in the first place, toward dynamical theories which explicitly model how one gets to such a point. This was possible once the idea of replicator dynamics (RD) was introduced (we formally describe it in the next chapter). On the other hand, classical game theory (CGT) is still the main approach used; EGT has not yet developed far enough to provide applied researchers with a sufficiently sophisticated toolset. Learning in games (LGT) provides connections between classical game theory (CGT)
and EGT, and we shall see this in the last chapter.
In this chapter we introduce evolutionary game theory (EGT). Material here follows mainly [153], [65], [151],
[129]. Evolutionary game theory (EGT) is concerned with the application of game theory to evolving populations
originally for lifeforms in biology. It defines a framework of contests, strategies and analyses into which Darwinian
competition can be modelled. The original work on EGT dates from 1973 when John Maynard Smith and George
R. Price formalized how such contests can be analyzed as "strategies" as well as the mathematical criteria which
can be used to predict the resulting prevalence of such competing strategies, [141]. Despite its origin and original
purpose in biology and the social sciences, EGT has become of increasing interest to economists and game theorists. A
crucial new development on this front was the publication in 1982 of John Maynard Smith’s seminal work “Evolution
and the Theory of Games", [140]. Maynard Smith envisaged randomly drawn members from populations of pre-
programmed players (individuals, agents) meeting and playing strategic games, i.e., having a strategy. The results of
the game will test how good that strategy is. This is what evolution does: it tests alternative strategies for the ability
to survive and reproduce. In biology, individuals are typically animals and strategies are genetically inherited traits
that control an individual's actions (strategies), which are algorithmic just like computer programs. In other fields,
individuals in a population are agents, firms, users, nodes, individuals etc. interacting strategically. Applications
can range from economics (markets) to transportation science (highway network congestion) to computer science
(selfish routing of Internet traffic in communication networks).
As a motivating example consider the Hawk-Dove (HD) game, introduced in Chapter 1. This is a typical example in EGT, used by evolutionary biologists to model animal conflicts. The earliest presentation of a form
of the Hawk-Dove game was in Maynard Smith & Price’s 1973 Nature paper, “The logic of animal conflict". The
principle of the game is that while each player prefers not to yield to the other, the worst possible outcome occurs
when both players do not yield.
Example 6.1. Recall Example 1.9 of the Hawk-Dove (HD) game, in which each player is an animal fighting over some prey (or deciding whether or not to start a conflict). Each can behave like a Hawk (1st choice, H) or a Dove (2nd choice, D), hence the parallel to belligerent vs. non-belligerent. The best outcome for each animal/player is the one in which it acts as a Hawk while the other acts as a Dove; the worst outcome is the one in which both act like Hawks. Hence each animal prefers to be Hawkish if the other is Dovish, and Dovish if the opponent is Hawkish. Such a bimatrix game can be captured by the cost matrices

A = [  0  −4 ]
    [ −1  −2 ]

and B = A^T, hence a symmetric game. We can analyze this game following the graphical method in Chapter 3. Any strategy profile in which one player chooses H and the other picks D is in equilibrium, hence the two pure-strategy NE are (H, D) and (D, H). In addition, there is a mixed-strategy NE in which each player selects H with probability x_1∗ = 2/3, so the mixed NE strategy is

x∗ = [2/3  1/3]^T.
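A quick numerical sanity check of this mixed NE (a sketch that just verifies the indifference condition for the cost matrices of Example 6.1):

import numpy as np

A = np.array([[0.0, -4.0],
              [-1.0, -2.0]])             # Hawk-Dove cost matrix (costs; lower is better)
x_star = np.array([2.0 / 3.0, 1.0 / 3.0])

# Indifference: against x*, both pure strategies H and D incur the same expected cost,
# so every mixture, in particular x* itself, is a best reply to x*.
print(A @ x_star)                        # both entries equal -4/3
print(x_star @ A @ x_star)               # cost of x* against itself, also -4/3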
In the setup of EGT consider a large population of players (agents), where individuals are drawn repeatedly at
random to play a 2-player HD game, i.e., pairwise interaction. We shall focus on the symmetric NE (x∗ , x∗ ) as
this is the case in which the players can change roles and the same outcome is obtained. Thus x∗ is the optimal NE
(mixed-) strategy against itself. Assuming that all players in a population use identical strategies, an x∗ strategy is
said to be an evolutionary stable strategy (ESS) if it is robust to evolutionary selection, in the sense described next.
Assume that a small fraction of "invaders" ("mutants") is injected into the population. The mutants all play some
other (pure or mixed) strategy different from x∗ . The incumbent strategy x∗ is an ESS if it still yields lower cost,
hence it resists any such invasion.
This original ESS concept is a static concept. We shall also consider an alternative dynamic interpretation. Assume
that each individual is programmed to play a particular pure strategy (not all using the same one as in the above).
Individuals are drawn at random from that population and are matched in pairs to play the HD game. In this case x
can represent the frequency distribution of pure strategies in the population, or the population state. The cost that
results from the adoption of any specific pure j strategy will depend on the frequencies xk with which the various
k strategies are represented in the population. Suppose that those frequencies x_j change over time in response to cost differentials, with the population share of lower-cost (higher-rewarded, more successful) strategies increasing at the expense of higher-cost strategies. We shall call the dynamic process that describes this change, ẋ_j = dx_j/dt, the Replicator Dynamics (RD), a dynamical system. This allows for a dynamic interpretation of ESS. We will analyze equilibrium points of the RD dynamical system and show that any such rest point must be an NE strategy. Moreover, in the special case of the Hawk-Dove game, any state trajectory of the RD system that begins at an initial state in which both pure strategies are present converges to the mixed NE strategy x∗. This provides another interpretation of the mixed strategy: the frequency of each pure strategy corresponds to the likelihood with which it is played in the mixed NE strategy. Moreover, we shall see that this x∗ is an ESS. This explains why it is not possible for either the Hawk or the Dove type to become prevalent in the population while the other type disappears over time, a fact that biologists had noticed.
An evolutionary game is always a game with a very large population of competitors, or a population game. Each
player is preprogrammed to some behavior (this being a strategy in the game). The key differentiator in an EGT
setup is that the success of a strategy is not just determined by how good the strategy is in itself, but by how
good is that strategy in the presence of other alternative strategies, and that this depends on the frequency of other
strategies employed within a population. A biological (or social) selection process would then change over time the
proportions (frequency) of the different populations of pre-programmed "types". The objective in an evolutionary
game is to become more fit than competitors, i.e., to produce as many replicas of oneself as one can; the payoff is in
units of fitness, i.e., relative worth in being able to reproduce. Rules describe the contest as in classical games but
for evolutionary games rules include exactly how the players that are more fit will spawn more replicas of themselves
into the population. This evolutionary (selection) process is described by the replicator dynamics (RD). Interesting
questions arise: what are the connections between solution concepts such as the Nash equilibrium (NE) in game theory and the long-term behavior of such an evolutionary process? Are some strategies going to “disappear" in the long term? Will aggregate behavior converge toward an NE? Are some NEs more likely to emerge in this process? These
are some of the questions addressed by an EGT approach.
The concept of an evolutionary stable strategy (ESS) is key in addressing these questions. Much research in EGT
has focused on population games of the type generated when populations of agents are matched to play a normal
form game. Indeed, Nash informally introduced population games in proposing the “mass action" interpretation
of his equilibrium concept. This population-matching interpretation of normal form games is what we shall use
for introducing the concepts (as originally done) as well as for many simple examples. With the HD game as a
motivational example, the focus of this chapter and the next chapter is to characterize the relationship between NE,
ESS, and equilibria of the RD dynamics.
We will derive a chain of useful mathematical results that link the concept of ESS, the dynamics of the basic RD and
the concept of NE. The logical chain is as follows:
− (I) the population state induced by an ESS is asymptotically stable in terms of the RD dynamics;
− (II) the mixed strategy corresponding to an asymptotically stable equilibrium of the RD is a (symmetric) NE strategy;
− (III) a mixed strategy played at a symmetric Nash equilibrium (in a two-player symmetric game with a finite
set of pure strategies) induces a stationary population state (equilibrium) of the RD.
We start by first introducing the ESS as a static stability concept for normal-form (strategic) symmetric matrix
games and extend it to population games. We follow in the next chapter with a derivation of the continuous-time
replicator dynamics (RD), and the systematic relationship between RD equilibria, Nash equilibria (NE) strategies and
ESS strategies. In the last chapter we present learning approaches in games (LGT) and show how the imitative
behavior of agents in large populations is related to replicator dynamics (RD).
Let us formally introduce the evolutionary stable strategy (ESS) concept. We use the canonical example used in
evolutionary games where a population game is defined from a normal-form symmetric two-player game. Later we
show how it can be extended to more general population games. Thus assume a large population of players (agents)
with individuals drawn repeatedly at random to play a two-player (symmetric) matrix game (A, AT ) (as in Chapter 3
on bimatrix games).
Let us first specialize results from Chapter 3 for a matrix game (A, A^T), with A an m × m cost matrix. The associated mixed-strategy set is ∆ = {x ∈ R^m | ∑_{j∈M} x_j = 1, x_j ≥ 0, ∀j ∈ M}, M = {1, . . . , m}. Let x = (x, y) ∈ ∆_X, where ∆_X = ∆ × ∆ is the set of mixed-strategy profiles. Note that J̄_1(x, y) = x^T A y := J(x, y) and J̄_2(x, y) = x^T A^T y = y^T A x = J(y, x).
Thus for a symmetric game, interchanging the strategies (or the roles of the players) one with another yields the
same outcome. For this reason we can say that in such games a strategy x is played against another strategy y and
the cost will be denoted by J (x, y) = xT A y.
From the bimatrix games section in Chapter 3, x∗ = (x∗, y∗) ∈ ∆_X = ∆ × ∆ is an NE pair if (x∗, y∗) ∈ (Φ1(y∗), Φ2(x∗)), where Φ1, Φ2 are as in (3.1). For this (A, A^T) case it can be shown that Φ2(x∗) = Φ1(x∗), where

Φ1(x∗) = {η ∈ ∆ | η^T A x∗ ≤ y^T A x∗, ∀y ∈ ∆}
Note the symmetry in the two inequalities and consider the case when x∗ = y∗ (same strategy used). This is the case
when agents are oblivious of which role they play; in effect we consider a strategy as played against another strategy.
This leads to a special NE called symmetric NE.
Such strategies x∗ ∈ ∆ can be regarded as being in Nash equilibrium with themselves or optimal against themselves.
We denote this set of NE strategies by ∆NE, ∆NE := {x ∈ ∆ | x ∈ Φ1(x)}; the corresponding set of symmetric strategy profiles is the diagonal D := {x ∈ ∆_X | x = (x, y), x = y}.
Moreover, we can see that ∆NE is the set of fixed points of the best-reply correspondence Φ1 : ∆ ⇒ ∆, x∗ ∈ Φ1(x∗), i.e., x∗ is optimal against itself, or is a best reply to itself. By applying Kakutani's fixed-point theorem to Φ1, it follows that ∆NE ≠ ∅. Symmetric NEs are not the only NEs (there might be asymmetric NEs also), but in the case of evolutionary games these are the ones we will be interested in, since typically the players will not know which player
is which. This is the case treated in evolutionary systems, where one checks to see if a strategy is an evolutionary
stable strategy (ESS).
The ESS concept is informally defined as follows: assuming that all individuals in a "population" of players play x∗
and some small deviation from x∗ appears, then this x∗ strategy is called an evolutionary stable strategy (ESS) if it remains uninvadable under any such small deviation (robust to invasion), i.e., if the payoff (cost) under the existing x∗ strategy is still the better one. Thus the concept of an evolutionary stable strategy (ESS) is in fact a refinement of an NE strategy. Note that if all agents (players) in the population use the same strict Nash equilibrium strategy, every agent that deviates from it will be penalized, hence such a deviating (invading) strategy will not spread. On the
other hand if there is no strict NE, we cannot assume that every nonstrict NE will resist such invasion by a dissident
minority, the reason being that the minority could use a strategy that does just as well as the NE, and may spread in
the population, unless that incumbent NE strategy is evolutionary stable.
Formally, suppose that a large population of individuals are all programmed to play the incumbent strategy denoted
x∗ ∈ ∆. Assume that in this (monomorphic) population a small fraction ε , ε ∈ (0, 1) of “mutants" appear that all are
programmed to play some other mutant strategy y ∈ ∆. Assume that individuals in this overall population are repeatedly drawn at random in pairs to play the game, each individual being picked with equal probability. Thus, if an individual is picked to play the game, the probability that its opponent will play the mutant strategy y is ε and the probability that it will play x∗ is (1 − ε). Thus the new state (composition of the population, or population
mixture) is w = ε y + (1 − ε )x∗ . Thus the cost in a game in this bimorphic population (with two strategies) is the
same as in a game where an individual agent plays the mixed strategy
w = ε y + (1 − ε ) x∗ ∈ ∆
After the mutant entry the cost to the incumbent (existing) x∗ strategy is J (x∗ , w) while the cost for the mutant
(invading) y strategy is J(y, w). An evolutionary argument dictates that the mutant strategy will be rejected if and only if the fitness of the incumbent is higher than that of the mutant strategy in this new composition of the population. Let the fitness of the incumbent strategy x in the new mixture (state) w be denoted
by F_x(w). In terms of our usual convention of using costs in the game, the lower the cost, the higher the fitness, i.e., F_x(w) = −J(x, w). If the fitness of the incumbent strategy is higher than that of any mutant strategy y ≠ x∗ for
sufficiently small ε then we call x∗ an evolutionary stable strategy (ESS). This concept was introduced in [141] and
its formal definition as given in [148] is presented below.
Definition 6.3. A strategy x∗ ∈ ∆ is an evolutionary stable strategy (ESS) if for every strategy y ∈ ∆, y ≠ x∗, there exists some ε_y ∈ (0, 1) such that for all ε ∈ (0, ε_y) and w = ε y + (1 − ε)x∗ the following holds:

F_{x∗}(w) > F_y(w),   or equivalently,   J(x∗, w) < J(y, w).
Alternatively, x∗ has the interpretation of population state (distribution of strategies in the population) and the in-
vasion can be modelled as some small fraction of agents switching to using another strategy which will change the
distribution, hence the state of the population, from x∗ to w = x∗ + ε(y − x∗), i.e., a small perturbation of x∗.
Let ∆ESS ⊂ ∆ denote the set of evolutionary stable strategies (possibly empty). How is an ESS as in Definition 6.3 related to a (symmetric) NE strategy x∗ ∈ ∆NE (Definition 6.2)? Let us recall that for x∗ ∈ ∆NE we have
J (x∗ , x∗ ) ≤ J (y, x∗ ), ∀ y ∈ ∆
i.e., x∗ is optimal (best reply) against itself or x∗ ∈ Φ1 (x∗ ). What the inequality in Definition 6.3 is saying is that
this holds (with strict inequality <) when we perturb x∗ slightly, i.e., that x∗ is best reply against a perturbed w,
x∗ ∈ Φ1 (w), for sufficiently small ε , hence
∆ESS ⊂ ∆NE
This is stated next.
Proposition 6.4.
∆ESS ⊂ ∆NE
hence every evolutionary stable strategy (ESS) is a symmetric NE strategy.
Now what about the reverse? Is any NE strategy an ESS? It turns out that it is NOT, unless a second-order condition is met. In the case of matrix games, since the average (expected) cost is linear in x∗ and y respectively, we can write, for the inequality in Definition 6.3, that for every y ∈ ∆, y ≠ x∗, the following equivalent conditions hold:

[ESS−1]   J(x∗, x∗) ≤ J(y, x∗),   ∀y ∈ ∆

[ESS−2]   J(x∗, x∗) = J(y, x∗)  =⇒  J(x∗, y) < J(y, y),   ∀y ∈ ∆, y ≠ x∗

Note that [ESS−1] means x∗ ∈ ∆NE, while [ESS−2] is known as the stability condition, which needs to hold for alternative best replies. Thus if both [ESS−1] and [ESS−2] hold, then x∗ is indeed an ESS. In fact this is how the ESS was first introduced in [141]. For this matrix case the two conditions equivalent to Definition 6.3 are written as

x∗^T A x∗ ≤ y^T A x∗,   ∀y ∈ ∆

x∗^T A x∗ = y^T A x∗  =⇒  x∗^T A y < y^T A y,   ∀y ∈ ∆, y ≠ x∗
Remark 6.5. ESS is a refinement of the NE concept. In single-population models only symmetric NE can be ESS. If (x∗, x∗) is a strict NE (see Definition 3.16), i.e., Φ1(x∗) = {x∗} (a singleton), then there is no alternative best reply and x∗ is an ESS by default.
Note that if for every y ∈ ∆ we could take the limit ε → 1 in Definition 6.3, we would obtain w → y and

J(x∗, y) < J(y, y),   ∀y ∈ ∆, y ≠ x∗
i.e., such an ESS x∗ is superior in that it has lower cost (or higher payoff) against all mutant strategies y than they
do against themselves, not only for those that are alternative best-replies. While such a global result is not possible
in general, a local result holds: any ESS x∗ is locally superior, i.e., gives lower cost against all mutant strategies y
in some neighborhood of x∗ , than the cost these mutants y obtain when used against themselves. This result will be
used in a dynamical context in the next chapter.
Proposition 6.6 (Proposition 2.6 in [153]). x∗ ∈ ∆ESS if and only if there exists a neighborhood B of x∗, x∗ ∈ B, such that

J(x∗, y) < J(y, y),   ∀y ∈ B ∩ ∆, y ≠ x∗.

Proof: (Part “If") Since x∗ ∈ B (an open set), for every z ∈ ∆, z ≠ x∗, there exists a sufficiently small ε_z > 0 such that w = x∗ + ε(z − x∗) ∈
B, ∀ε ∈ (0, ε_z). Thus, for any mutant z ≠ x∗, we can express the mixture w between x∗ and z as w ∈ B for sufficiently
small ε , hence by the foregoing inequality
J (w, w) > J (x∗ , w)
Now
J (w, w) = J ((1 − ε )x∗ + ε z, w) = (1 − ε ) J (x∗ , w) + ε J (z, w)
based on the linearity of J, so that the foregoing inequality becomes

(1 − ε) J(x∗, w) + ε J(z, w) > J(x∗, w)

Since ε > 0, this implies J(z, w) > J(x∗, w), and x∗ ∈ ∆ESS by Definition 6.3.
(Part “Only If") Assume x∗ ∈ ∆ESS . First, it can be shown that in finite games the invasion barrier is uniform, i.e.,
εy in Definition 6.3 does not depend on y (see Proposition 2.5 in [153]). This follows by showing there exists the
minimum of εy , for y in some compact set, denoted by ε̄ . Thus if x∗ ∈ ∆ESS let ε̄ > 0 be its uniform invasion barrier.
Let the set Z_{x∗} be defined as

Z_{x∗} := {z ∈ ∆ | z_j = 0, for some j ∈ supp(x∗)}

Thus x∗ ∉ Z_{x∗} and Z_{x∗} is a closed set; in fact Z_{x∗} ⊂ ∂∆ is the union of all boundary faces of ∆ that do not contain x∗. Define the following set V, which is the set of points in the convex hull of x∗ and points z ∈ Z_{x∗}, i.e., of perturbations of x∗. Since Z_{x∗} is a closed set and x∗ is outside it, it follows that there exists an open set B, a neighbourhood of x∗, such that the whole of B is in
V, i.e., B ∩ ∆ ⊂ V. Consider any y ∈ B ∩ ∆, y ≠ x∗. Then y ∈ V (as a perturbed x∗) and since ε̄ > 0 is the uniform
invasion barrier (valid for z also), it follows that
∆NE = {e1 , e2 , x∗ }
Each of the two pure NEs is strict, so e_1 and e_2 are evolutionary stable. However, x∗ is not: all y ∈ ∆ are alternative best replies to x∗, so [ESS-2] needs to be checked. For example, y = e_1 has a lower cost against itself than x∗ has against it, i.e., J(e_1, e_1) = a_1 < λ a_1 = J(x∗, e_1), and [ESS-2] does not hold. Thus ∆ESS = {e_1, e_2}.
Exercise 6.1. Consider the Hawk-Dove game (Example 6.1). Find the set of Nash strategies ∆NE and the set of ESS strategies ∆ESS , using
[ESS-1], [ESS-2].
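As a hint for Exercise 6.1, the conditions [ESS-1], [ESS-2] can be checked numerically by brute force over a grid of mutant strategies y. The sketch below does this for a 2 × 2 symmetric cost game; it is a sanity check for candidate strategies, not a proof:

import numpy as np

def is_ess_candidate(A, x, tol=1e-9, grid=2001):
    # Numerically test [ESS-1]/[ESS-2] for a 2x2 symmetric cost game A at strategy x.
    J = lambda p, q: p @ A @ q
    for t in np.linspace(0.0, 1.0, grid):
        y = np.array([t, 1.0 - t])
        if np.allclose(y, x, atol=1e-12):
            continue
        if J(x, x) > J(y, x) + tol:          # [ESS-1] violated: y does strictly better against x
            return False
        if abs(J(x, x) - J(y, x)) <= tol:    # y is an alternative best reply: check [ESS-2]
            if not (J(x, y) < J(y, y) - tol):
                return False
    return True

A = np.array([[0.0, -4.0], [-1.0, -2.0]])    # Hawk-Dove costs from Example 6.1
print(is_ess_candidate(A, np.array([2/3, 1/3])))   # checks the mixed NE of the HD game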
There exist weaker versions of the ESS, for example neutral stability (NSS), where the strict inequality in Definition 6.3 is relaxed to ≤ (see [153]). The ESS concept is reminiscent of the concept of stability as we know it - informally stating that small perturbations around an equilibrium are admissible (bounded) and, moreover, the equilibrium is recovered (better outcome). However, it is not equivalent to it. So far the ESS concept does not involve any dynamics; it is a static evolutionary stability concept. We have only looked at static conditions that we expect to be fulfilled once the system has settled, i.e., we ignored the dynamic process through which such a point is reached. We shall look at how to add a dynamic interpretation via the Replicator Dynamics (RD) in the next chapter.
Evolutionary games are defined in terms of one or many populations of players, hence can be called population
games. The canonical example used in evolutionary game theory is a symmetric matrix game. We show next that
any symmetric matrix game can be used to define a population game by specifying how the agents in the populations
are matched to play it. Thus we give an alternative usage of symmetric matrix games in EGT by introducing the
concept of population state. We then extend the ESS concept to more general population games.
Consider thus a large population of agents (individuals) that play a game and can use pure strategies selected from
the finite set M = {1, . . . , m}, a typical strategy being denoted by j ∈ M. Assume that agents form a continuum and
that x j denotes the fraction (or the “mass of agents") of the population that uses strategy j ∈ M, with ∑ j∈M x j = 1, or
the frequency of pure strategy j in the population. The vector x describes the strategy distribution in the population,
x ∈ ∆, where ∆ = {x ∈ R^m_+ | 1_m^T x = 1} is the simplex in R^m. Thus x is the (social) population state and the simplex ∆ is the state set (state space). Vertices of ∆ are the elements of Ω = {e_1, . . . , e_m}, and x = e_j ∈ Ω corresponds to a pure population state, where all agents choose the same strategy j (all “mass" placed on the single strategy j).
Assume that the population and the set of strategies is fixed, and that the success of a strategy depends not on the
particular strategy that a random opponent plays, but depends on the strategy distribution in the population, i.e., on
the population state x. Let J j : ∆ → R denote the cost function for strategy j that depends on this state x.
Specifically, consider that agents in the population are paired to play a symmetric normal-form matrix game with
cost matrix A, where a j,k = J (e j , ek ) is the cost of pure strategy j against strategy k. Assume a random matching
between any two individuals. The cost of interaction J_j(x), when the population state is x, is the expected (average) cost of a j-th pure strategist against any other k-th strategist, J(e_j, e_k), weighted by the fraction x_k of agents actually playing this k-th pure strategy, i.e.,

J_j(x) = ∑_{k∈M} J(e_j, e_k) x_k = ∑_{k∈M} a_{j,k} x_k = (A x)_j = J(e_j, x)    (6.3)
which is in fact J (e j , x). A vector cost function J : ∆ → Rm assigns each social state x a vector of costs for all m
strategies used in the population. This vector valued cost function J depends on the population state x and has the
j-th component denoted [ J(x) ] j = J j (x) . Thus associated with cost matrix A we have just defined a population
game described by the linear map
J( x) = A x
Note that J j (x) := J (e j , x) is the same as the cost incurred by pure strategy j against an individual playing the mixed
strategy x. Formally we can identify the population state x ∈ ∆ with a mixed-strategy x ∈ ∆ in a matrix (normal
form) game.
The associated average cost in the population when in state x is J̄(x) = ∑_{j∈M} x_j J_j(x), the (weighted) average cost of the strategies used in the population at state x, i.e.,

J̄(x) = ∑_{j∈M} J(e_j, x) x_j = J(x, x) = x^T A x    (6.4)
hence the same as the cost incurred by the mixed strategy x against itself, justifying the use of the same notation.
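For a concrete feel of (6.3) and (6.4), the short sketch below evaluates the pure-strategy costs J_j(x) = (Ax)_j and the average cost J̄(x) = x^T A x at a given population state, using the Hawk-Dove cost matrix of Example 6.1:

import numpy as np

A = np.array([[0.0, -4.0],
              [-1.0, -2.0]])             # Hawk-Dove cost matrix from Example 6.1

x = np.array([0.5, 0.5])                 # population state: half Hawks, half Doves
J_vec = A @ x                            # J_j(x) = (A x)_j, cost of each pure strategy, (6.3)
J_avg = x @ A @ x                        # average cost in the population, (6.4)
print(J_vec, J_avg)                      # [-2.0, -1.5] and -1.75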
Alternatively, suppose that agents are matched to play the game A with each pair of agents meeting only once, i.e.,
complete (deterministic) matching. The cost of j-strategists is the aggregate (average) cost over all matches when
the population is in state x, i.e., the same as above
J_j(x) = ∑_{k∈M} J(e_j, e_k) x_k = J(e_j, x) = (A x)_j
Thus a population game can be derived from any normal-form symmetric matrix games by such a matching, either
deterministic or random. The resulting cost J j is linear in population state x. More general examples are population
games with nonlinear costs, [106], [105], [129]. Another example of a (nonlinear) population game is when a game
is interpreted as “playing the field" as in congestion games. We shall discuss this example later on. For now, let us
briefly introduce it.
Example 6.8. Consider a highway congestion game where for example drivers commute over a highway network.
Assume that a collection of towns is connected by a network of links, L . For each ordered pair of towns there
is a population of agents, each of whom needs to commute from the first town in the pair to the second one. To
accomplish this, an agent must choose a path (route) connecting the two towns, hence a strategy j from the finite set
M. Let x j be the mass of agents that uses strategy (route) j. The cost for an agent is the delay on the path he takes.
Let L_j ⊂ L denote the links in route j. Each link l ∈ L has a cost function (delay) c_l : R_+ → R that is a function of the usage it gets, i.e., the total “mass" of agents using the link, denoted by q_l,
q_l(x) = ∑_{k∈ρ(l)} x_k
where ρ (l ) = {k ∈ M | l ∈ Lk } contains those strategies in M that require link l. The delay each driver (agent)
experiences depends not only on the route he selects, but also on the congestion created by other agents along this
route. This delay is the sum of the delays on its constituent links,
J_j(x) = ∑_{l∈L_j} c_l( q_l(x) )
Since driving on a link increases the delays experienced by other drivers on that link, cost functions cl are increasing
and can be nonlinear; they are typically convex as well.
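A minimal numerical sketch of Example 6.8, with a hypothetical two-route, three-link network and assumed affine link delays (all values are illustrative and not part of the example above):

import numpy as np

# Hypothetical instance: route 1 uses links {0, 1}; route 2 uses links {0, 2}; link 0 is shared.
routes = [(0, 1), (0, 2)]
delay = [lambda q: 1.0 + 2.0 * q,        # c_l(q): assumed affine, increasing link delays
         lambda q: 0.5 + 1.0 * q,
         lambda q: 0.5 + 1.0 * q]

def route_costs(x):
    # J_j(x) = sum over links l in route j of c_l(q_l(x)), with q_l(x) the mass of agents on link l.
    q = np.zeros(len(delay))
    for j, links in enumerate(routes):
        for l in links:
            q[l] += x[j]
    return np.array([sum(delay[l](q[l]) for l in links) for links in routes])

x = np.array([0.6, 0.4])                 # population state over the two routes
print(route_costs(x))                    # delay experienced on route 1 and on route 2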
Population games offer a simple general model for studying strategic interactions among large numbers of agents,
interactions that occur repeatedly. They are games with the following general properties: (i) The number of players
(agents) is large; (ii) Individual agents are small: Any one agent’s choice has little or no effect on other agents’
outcomes; (iii) Agents interact anonymously: Each agent's outcome depends only on his own strategy and the distri-
bution of others’ strategies; further individualization of the opponents is not required. We also consider two other
features: (iv) The number of roles is finite; (v) Costs (payoffs) are continuous. Each agent is a member of one (or
of a finite number of) populations. Members of a population choose from the same set of strategies, use only pure
strategies, and their costs or payoffs are identical functions of their own behavior and the distribution of opponents’
behavior. Aggregate behavior is described by the population (social) state x, which specifies the empirical distribu-
tion of strategies in the population. By (iv) it follows that the state is finite-dimensional, expressible as a vector with
a finite number of components. The continuum assumption on the number of agents in a population is key to being
able to use deterministic models and dynamics. Moreover, such a continuous population of agents can be obtained as
the limit of a finite population of agents. However, one can equally well define large games with finite populations of
players, where these games provide a tractable environment for stochastic evolutionary dynamics (beyond the scope
of these notes).
Most evolutionary concepts apply to general population games. We saw that any symmetric two-player normal form
game can be seen as a population game by an appropriate re-interpretation of the players and strategies. Next we
generalize the ESS definition and concept from normal form (matrix games) with linear cost J (e j , x) = (A x) j to
more general population games where J (e j , x) = J j (x), with J j (x) possibly nonlinear.
Using the identity J_j(x) = J(e_j, x) it follows that

J(x, x) = ∑_{j∈M} x_j J(e_j, x) = x^T J(x)

and

J(y, x) = ∑_{j∈M} y_j J(e_j, x) = y^T J(x)
and this is the definition used for more general population games with cost vector J(x) not necessarily linear. For-
mally ∆NE(J) = ∆NE, where a population state x is identified with a mixed strategy. In terms of the pure strategies, a characterization similar to the Support Characterization theorem (Theorem 3.20, Chapter 3) can be given for an NE strategy in a general population game: x ∈ ∆NE(J) if and only if every pure strategy in the support of x attains the minimal cost, i.e., J_j(x) = min_{k∈M} J_k(x) for all j with x_j > 0. With this identification, we call x an ESS if it satisfies the equivalent of conditions [ESS−1] and [ESS−2], i.e.,

J(x, x) ≤ J(y, x), ∀y ∈ ∆,   and   J(x, x) = J(y, x) =⇒ J(x, y) < J(y, y), for y ≠ x

and y near x. The local superiority result in Proposition 6.6 is formulated as follows: x∗ ∈ ∆ESS if and only if there exists a neighborhood B, x∗ ∈ B, such that

J(x∗, y) < J(y, y),   ∀y ∈ B ∩ ∆, y ≠ x∗.
Note that agents in population games are assumed to play pure strategies. One of the main reasons for introducing randomized strategies does not apply here: when the populations are continuous, and costs (payoffs) are continuous in
the social state, pure strategy Nash equilibria always exist. This guarantee may be one of the “simplifications" that
von Neumann and Morgenstern had in mind when they looked ahead to the study of games with large numbers of
players. One can view a population game as a normal form game that satisfies certain restrictions on the diversity and
anonymity of the players. Describing behavior using empirical strategy distributions shifts attention from questions
like “Who chose strategy j?" to questions like “How many chose strategy j?" and “What happens to the cost of
strategy j players if some agents switch to strategy k?" Unlike the standard (classical) game theoretical setup where
the players are assumed to follow an NE, in EGT we assume that agents gradually adjust their choices to the current
strategic environment (given by the state x) and hope that in doing so they will converge to a state x∗ that is a Nash
equilibrium. Such dynamic properties will be introduced via the Replicator Dynamics in the next chapter.
Some interesting properties of ESS and connections to refinements of NE are given next.
Proposition 6.9.
∆ESS = {x∗ ∈ ∆NE | J(x∗, y) < J(y, y), ∀y ∈ Φ1(x∗), y ≠ x∗}
Proof: Since ∆ESS ⊂ ∆NE, for the first part we only have to show that for every x∗ ∈ ∆ESS and every y ∈ Φ1(x∗), y ≠ x∗, we have J(y, y) > J(x∗, y). Assume by contradiction that there exists y_0 ∈ Φ1(x∗), y_0 ≠ x∗, such that J(y_0, y_0) ≤ J(x∗, y_0). Since y_0 ∈ Φ1(x∗), it follows that J(y_0, x∗) ≤ J(w, x∗), ∀w ∈ ∆, and in particular J(y_0, x∗) ≤ J(x∗, x∗). Then for any ε ∈ (0, 1) and w = ε y_0 + (1 − ε)x∗, it follows that J(y_0, w) = ε J(y_0, y_0) + (1 − ε)J(y_0, x∗) ≤ ε J(x∗, y_0) + (1 − ε)J(x∗, x∗) = J(x∗, w), hence x∗ is not an ESS, which is a contradiction.
For the reverse, we will show that any x∗ ∈ ∆NE with the property that J(y, y) > J(x∗, y), ∀y ∈ Φ1(x∗), y ≠ x∗, is an ESS. Any possible mutant y ∈ ∆ is either a best reply to x∗ or not, i.e., y ∈ Φ1(x∗) or y ∉ Φ1(x∗). If y ∈ Φ1(x∗), it follows from the assumption that J(y, y) > J(x∗, y), hence a mutant (invading) strategy y will get a higher cost against a mixture w = ε y + (1 − ε)x∗ than x∗ does, i.e., J(y, w) > J(x∗, w). Indeed, since J(y, x∗) = J(x∗, x∗) for an alternative best reply,

J(y, w) = ε J(y, y) + (1 − ε)J(y, x∗) > ε J(x∗, y) + (1 − ε)J(x∗, x∗) = J(x∗, w).

If instead y ∉ Φ1(x∗), then J(y, x∗) > J(x∗, x∗), and for ε sufficiently small J(y, w) > J(x∗, w) follows by continuity.
Proposition 6.10. If x∗ ∈ ∆ESS, then x∗ is not weakly dominated.
Proof: If x∗ ∈ ∆NE is weakly dominated, from Definition 3.13 it follows that there exists y ∈ ∆ such that J(y, w) ≤ J(x∗, w) for all w ∈ ∆, with strict inequality for some w ∈ ∆. Hence taking w = x∗ gives J(y, x∗) ≤ J(x∗, x∗), which shows that y is an alternative best reply to x∗, i.e., y ∈ Φ1(x∗). Taking w = y gives J(y, y) ≤ J(x∗, y), which shows that x∗ fails condition [ESS−2]; by Proposition 6.9, x∗ is not an ESS.
As a consequence, if x∗ ∈ ∆ESS then (x∗, x∗) ∈ ∆_X is an undominated NE; hence, since in every two-player game such an NE is perfect (by Proposition 3.24), we obtain as a corollary that if x∗ ∈ ∆ESS, then (x∗, x∗) ∈ NE(G) is a perfect NE.
In fact a stronger result exists (for a proof see [150]).
Proposition 6.11. If x∗ ∈ ∆ESS, then (x∗, x∗) ∈ NE(G) is a proper NE.
6.5 Notes
The original ESS concept rests on the assumption of a single infinite population of individuals who are repeatedly
randomly drawn to play a 2-player symmetric game. This assumption has important implications. First, this assump-
tion justifies treating as equivalent a mixed strategy x and a population state x (when pure strategies are played with
some frequency in the population). Secondly, this assumption allows in effect a mean-field approximation to equate
the average cost (payoff) actually obtained by a population with the expected value of a probability distribution of
costs (payoffs) (which would be obtained by explicitly modelling players’ interactions).
Chapter 7
Replicator Dynamics
Chapter Summary
This chapter introduces dynamic concepts in EGT using the continuous-time Replicator Dynamics (RD) that mod-
els selection of the fittest strategies in the population. We discuss stability properties of RD equilibria and their
relationship to Nash equilibria (NE), as well as to the static stability concept of ESS.
7.1 Introduction
The criterion of ESS, or evolutionary stability, refers implicitly to the connection between the payoffs (costs) in a game
and the spreading of a strategy in a population. In fact the payoffs are supposed to represent the gain in biological
fitness or reproductive value. In this biological interpretation this generalizes Darwin’s notion of survival of the fittest
to an environment described strategically where the fitness of a given behavior (strategy) depends on the behaviors
(strategies) of others. Note that as with Nash equilibrium, the evolutionary stability property does not explain how a
population arrives at such a strategy, instead it is concerned with whether once reached such a strategy is robust in
the evolutionary sense.
In order to study the dynamics of an evolutionary system explicitly, i.e., beyond the analysis of its equilibria and ESS,
it becomes necessary to specify the particular process that governs such dynamics. The most extensively studied
dynamic process in EGT is the replicator dynamics (RD), proposed by [148].
An evolutionary process in general has two elements: a mutation mechanism (recall the “mutants" above) that provides
variety, and a selection mechanism that will favor some varieties over others. Within the context of game theory,
this selection is based on game outcomes (payoffs/costs), so players that have obtained lower costs (higher payoffs)
are selected preferentially. The replication mechanism ensures that the properties of the individuals (entities) in
the system are preserved, replicated or inherited from one generation to the next. Within the context of EGT, the
replication mechanism ensures that the strategies of selected players are adequately inherited, or transmitted, across
consecutive generations. Selection and replication are two mechanisms that work very closely together, since being
selected means being selected to be preferentially replicated. The replicator-dynamics (RD) is a dynamic model
for evolutionary selection in continuous time. This is the second key concept in EGT besides the main one of an
evolutionary stable strategy (ESS). We saw that ESS is actually a refinement of the Nash equilibrium concept from
classical game theory. On the other hand the RD is described by a system of differential equations (ODE) describing
how a population of different strategies evolves through time under selection as an evolutionary mechanism, [106].
The ESS and evolutionary stability criteria above underscore the role of mutations, while the replicator dynamics (RD)
emphasizes the role of selection. In the RD setup robustness against mutations is indirectly considered via dynamic
stability criteria as we are used to from dynamical systems.
In the replicator dynamics (RD), payoffs are interpreted as the number of offspring that inherit the same behavioural
phenotype (i.e. strategy) as their (single) parent. The basic RD also assumes a single infinite population of individ-
uals who are repeatedly randomly drawn to play a 2-player symmetric game. While in the setup for ESS individuals
can play both pure or mixed strategies, in the setup of replicator dynamics (RD) the usual assumption is that indi-
viduals are only using pure strategies (as in the case of population games), and mutations (and random drift) are not
allowed. Thus instead of interpreting mixed strategies as a particular randomization that each agent in a population
does, one interprets a mixed strategy x as population state where each component x j now represents the fraction
of population of agents that use the corresponding pure strategy j, or the frequency of strategy j in the population.
The same assumption of random pairwise interactions in a large population is made, where high payoffs (low costs) represent higher fitness in terms of the number of “offspring", and where each offspring inherits its single parent's strategy (strategies are said to breed true). This set of assumptions is enough to fully determine a deterministic dynamic
process in which the rate of change in the frequency of any given strategy is equal to the relative difference between
its average payoff and the average payoff obtained across all strategies in the population. Most often, time is treated
as a continuous variable, and this allows the formalisation of the dynamic process as a system of ordinary differential
equations (ODE). If reproduction happens continuously over time, this leads to the replicator dynamics (RD). The
replicators are here the pure strategies and they can be copied without error from parent to child. As the popula-
tion state changes (frequency distribution of the strategies in the population), so do the payoffs (costs) of the pure
strategies, and hence so do their fitness. This will be the approach we shall follow in these notes.
In this section we derive the Replicator Dynamics (RD) from first principles for population games of the type of
normal-form symmetric games, and then extend it to general population games.
Consider a large, finite population of agents (individuals) programmed to use pure strategies j ∈ M, M = {1, . . . , m}
or e j ∈ Ω, Ω = {e1 , . . . , em }, as in the population game setup in the previous chapter. Assume that they are drawn
at random to play a symmetric two-player game, with associated cost function J : Ω × Ω → R. Let its extension be
J : ∆ × ∆ → R, where ∆ is the simplex in Rm .
During each game play, each agent in the population selects a (pure) strategy. At time t let n j (t ) ≥ 0 denote
the number of agents (individuals) that are using pure strategy j and let n(t ) = ∑ j∈M n j (t ) > 0 denote the total
population (number of individuals in the population). Let x j (t ) denote the fraction of the population (or the “mass
of agents in the population") programmed to use strategy j ∈ M,
x_j(t) = n_j(t) / n(t)

We define the population state x(t) = [x_1(t), . . . , x_j(t), . . . , x_m(t)]^T ∈ R^m as the vector whose j-th component is x_j(t). Then 1_m^T x(t) = 1, hence x(t) ∈ ∆; a population state is identified with a mixed strategy x.
Assume a random matching between any two individuals, when the population is in this state x. Recall that the
expected cost incurred when playing any pure strategy j is denoted by J (e j , x). Let us assume the following process:
an individual's fitness F_j(x) is measured as the number of offspring per time unit, and this fitness depends on the costs (payoffs) in the game; namely, the incremental effect of playing the game directly affects fitness. In terms
of our usual convention of using costs in the game, the lower the cost, the higher the fitness, hence higher number of
offspring per time unit, or F j (x) = −J (e j , x).
Assume also that there is some background fitness denoted by b0 ≥ 0 (like the birth rate), i.e., independent of the
outcomes in the game and some death rate d0 ≥ 0, same for all agents. Assume also that reproduction takes place
continuously over time and that each offspring inherits its single parent’s strategy. In the absence of any game being
played, the population changes according to just the birth and death process and the rate of change is given by the
net growth rate which is the same for all strategies, i.e.,
ṅ_j / n_j = b_0 − d_0,   ∀j ∈ M

where ṅ_j = dn_j/dt. This gives the underlying population dynamics rate equation

ṅ_j = (b_0 − d_0) n_j,   ∀j ∈ M
When the game is played, the fitness of agents (individuals) programmed to use a particular pure strategy j, F j (x)
is affected by the expected game cost as F j (x) = −J (e j , x). This corresponds to the reproductive rate in the stan-
dard biological interpretation, where fitness is understood as the reproductive rate and the game is an evolutionary
game describing selections of certain traits (strategies). Thus strategies with lower cost have better fitness and the
population dynamics is modified as
ṅ j = (b0 + F j (x) − d0 ) n j , ∀j ∈ M
In order to obtain the rate equation, or population dynamics, in terms of the fraction of agents x_j = n_j/n, note that

ẋ_j = ṅ_j / n − (ṅ / n)(n_j / n)

Using x_j = n_j/n on the right-hand side, multiplying by n and using (7.1) yields

ẋ_j n = (b_0 − J(e_j, x) − d_0) n_j − x_j ∑_{k∈M} (b_0 − J(e_k, x) − d_0) n_k

Using J(x, x) = ∑_{k∈M} J(e_k, x) n_k/n and ∑_{k∈M} n_k = n, and dividing both sides by n > 0, the constant terms b_0 − d_0 cancel, hence

ẋ_j = − [ J(e_j, x) − J(x, x) ] x_j,   ∀j ∈ M    (7.2)
This is called the replicator dynamics (RD), as introduced by Taylor and Jonker, [148], also written as

ẋ_j = F_j^e(x) x_j,   ∀j ∈ M    (7.3)

where F_j^e(x) denotes the excess fitness (relative to the average fitness in the population)

F_j^e(x) = − [ J(e_j, x) − J(x, x) ] = −J(e_j − x, x),   ∀j ∈ M

based on the bilinearity of J. In terms of fitness, with F_j(x) = −J(e_j, x) and F̄(x) = −J(x, x), the excess fitness is

F_j^e(x) = F_j(x) − F̄(x)
We see that ẋ_j / x_j = lim_{δt→0} δx_j / (x_j δt) = F_j^e(x), i.e., F_j^e specifies the population share growth rate per time unit, or the rate at which the pure strategy j replicates when the population is in state x, and we call it the growth-rate function.
Hence RD describes the population dynamics, in essence describing deterministic but frequency-dependent selection
dynamics. The replicators in the RD dynamics are the pure strategies. Thus F je (x), the rate of population change
or reproductive rate, can be seen as the excess fitness of agents using strategy j versus the average fitness in the
population. Note that this represents selection in the sense that it can favor some present behaviors (pure strategies)
over other present ones, while absent behaviors remain absent. Thus this is unlike the case with mutation where
new strategies can appear. However mutations can be indirectly taken into account through dynamic stability (see
Proposition 7.8).
Note that for symmetric matrix games (7.3) is written in vector notation as

ẋ = −Diag(x) [ A x − (x^T A x) 1_m ]    (7.4)

or, componentwise,

ẋ_j = −[ (A x)_j − (x^T A x) ] x_j,   ∀j ∈ M

and this is the form mostly seen in references, so that the excess fitness is

F_j^e(x) = −[ (A x)_j − x^T A x ],   ∀j ∈ M

In (7.4) Diag(x) is the diagonal matrix with the elements of x on the diagonal, and we used 1_m^T Diag(x) = x^T.
" #
a11 a12
Exercise 7.1. Consider the symmetric 2 × 2 game case, with A = , when a11 − a21 + a22 − a12 6= 0.
a21 a22
(i) Show that with x = [x1 , x2 ]T ∈ ∆, the RD dynamics (7.3) or (7.4) is given by
(iii) Consider a coordination game such as the Stag-Hunt (SH) game (Example 3.9). Write the resulting RD dynamics (7.6) and find all the equilibrium points. Use (7.7) and show that starting from any initial value x_1(0) < a_2/(a_1 + a_2) the solution x_1(t) decreases towards 0, and from any initial value x_1(0) > a_2/(a_1 + a_2) the solution x_1(t) increases towards 1. Hence from any interior initial state x(0) ∈ int(∆), the population state x(t) converges over time to one of the two ESSs of such a game.
(iv) Consider the Hawk-Dove (HD) game in Example 6.1, an anti-coordination game. Repeat the above and describe the evolution over
time of the population state x(t ) when starting from any interior initial state x(0) ∈ int(∆ ). Relate to Exercise 6.1.
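For part (iv), the RD trajectory can also be simulated directly. The sketch below integrates (7.2)/(7.3) with a simple forward-Euler step for the Hawk-Dove cost matrix of Example 6.1; the step size and horizon are arbitrary choices:

import numpy as np

A = np.array([[0.0, -4.0], [-1.0, -2.0]])   # Hawk-Dove costs (Example 6.1)

def rd_step(x, dt=0.01):
    # One forward-Euler step of the replicator dynamics (7.3) for a symmetric cost game.
    J_vec = A @ x                    # J(e_j, x) for each pure strategy j
    J_avg = x @ J_vec                # J(x, x), average cost in the population
    return x + dt * (-(J_vec - J_avg) * x)

x = np.array([0.9, 0.1])             # interior initial state, mostly Hawks
for _ in range(5000):
    x = rd_step(x)
print(x)                             # approaches the mixed NE / ESS [2/3, 1/3]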
Remark 7.1. The general form of the RD in (7.3) is valid not only for pair-wise interactions but in general population
games where the fitness or payoff of a certain type (strategy) depends on the population state x and can be a nonlinear
function of x. This is easily seen using the equivalence J j (x) = J (e j , x) for the cost of strategy j when the state is x.
Let us go back to (7.3) and rewrite it using the fact that J(x) is the vector with j-th component equal to J(e_j, x),

J(x) = [ J(e_1, x), . . . , J(e_m, x) ]^T
which for symmetric matrix games gives J(x) = Ax. Then J(x, x) in (6.4), where J : ∆ × ∆ → R, is written as

J(x, x) = x^T J(x) =: J̄(x)

where J̄ : ∆ → R, J̄(x) := x^T J(x), denotes the aggregate (weighted) average of the cost J(x) in the population. Let F^e(x) denote the vector excess fitness with elements F_j^e(x), j = 1, . . . , m, defined as in (7.3). Then

F_j^e(x) = − [ J(e_j, x) − J(x, x) ] = −J_j(x) + J̄(x),   ∀j ∈ M    (7.8)

and

F^e(x) = −J(x) + J(x, x) 1_m = −J(x) + (x^T J(x)) 1_m = −J(x) + J̄(x) 1_m
Remark 7.2. The typical definition uses (excess) fitness defined as

F_j^e(x) = U_j(x) − Ū(x),   with Ū(x) = x^T U(x)

i.e., as the excess payoff (utility) when compared to the average payoff in the population. We keep the J notation
to be consistent with the setup we used until now. As always we use the equivalence J = −U, hence maximizing
fitness is equivalent to minimizing cost.
In the following we shall analyze properties of the dynamic system (7.3) and based on this we shall obtain some
remarkable connections to the (static) ESS and NE concepts in general.
Let us denote the right-hand side of (7.3) by the vector field f : ∆ → R^m, with components

f_j(x) = F_j^e(x) x_j,   ∀j ∈ M

where F_j^e(x) is as in (7.3) or (7.8), so that we write (7.3) in the familiar form

ẋ = f(x)

of a system of ODEs representing a nonlinear autonomous dynamic system, with ẋ = dx(t)/dt. Since f is polynomial
in x_j, it follows that it is Lipschitz continuous, so by the Picard-Lindelöf theorem (see ECE1647 notes, [89]), for every initial
state x(0) = x0 ∈ Rm , the system of ODEs has a unique solution for all t, that depends continuously on the initial
state. We denote such a solution by φ : R × Rm → Rm , with φ (t, x0 ) the solution at time t and initial condition
φ (0, x0 ) = x0 = x(0). We sometimes use x(t ) instead of φ (t, x0 ). The vector field f represents the velocity of a point
x moving along the solution (orbit) x(t ), and f (x) is tangent to the solution (orbit) x(t ) at point x.
Remark 7.3. Recall also that the equilibria (constant solutions) of a nonlinear system ẋ = f(x) are the points xeq such that f(xeq) = 0.
Then an equilibrium point x = xeq can be classified as stable, unstable or asymptotically stable (see ECE1647 notes,
[89]). As a brief review, xeq is a Lyapunov stable equilibrium if ∀ε > 0 there ∃δ > 0 such that ∀x(0) = x0 ∈
Bδ(xeq), φ(t, x0) = x(t) ∈ Bε(xeq) for ∀t > 0. If in addition there ∃δ > 0 such that ∀x(0) = x0 ∈ Bδ(xeq),
x(t) → xeq as t → ∞ (xeq attractive), then xeq is an asymptotically stable equilibrium. The equilibrium xeq is unstable if it
is not stable. The above definitions are local stability definitions, that can be extended to global stability if they hold
globally for x0 ∈ Rm .
A useful method to characterize stability of an equilibrium point xeq is to study stability of z = 0 for the linearization
of the nonlinear system around the equilibrium point xeq given by the LTI system
ż = AL z
where z = x − xeq and AL = Df(x)|_{x=xeq} is the Jacobian matrix of f, with (i, j) element equal to ∂f_i/∂x_j (x), evaluated
at x = xeq. As long as AL has no eigenvalues with zero real part, stability (asymptotic stability, instability) of
z = 0 implies (local) stability (asymptotic stability, instability) of x = xeq for ẋ = f(x).
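As an illustration of this remark, the following minimal Python sketch (not part of the notes) builds the RD vector field for a cost matrix A, forms the Jacobian at a rest point by central finite differences, and computes its eigenvalues restricted to T∆ via an orthonormal basis of the tangent space. The 3 × 3 coordination-type matrix is a hypothetical example whose barycenter is a rest point; its two tangent eigenvalues come out ≈ 1/3 > 0 (unstable), while the eigenvalue along 1_m is irrelevant for stability on ∆.

    import numpy as np

    def rd_field(x, A):
        """Replicator vector field f_j(x) = (Jbar(x) - J_j(x)) x_j for cost matrix A."""
        J = A @ x
        return (x @ J - J) * x

    def jacobian(f, x, eps=1e-7):
        """Central finite-difference Jacobian of f at x."""
        m = len(x)
        Jac = np.zeros((m, m))
        for j in range(m):
            e = np.zeros(m); e[j] = eps
            Jac[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
        return Jac

    def tangent_eigenvalues(A, xeq):
        """Eigenvalues of the Jacobian restricted to T-Delta = {y : 1^T y = 0}."""
        AL = jacobian(lambda x: rd_field(x, A), xeq)
        m = len(xeq)
        Phi = np.eye(m) - np.ones((m, m)) / m        # orthogonal projector onto T-Delta
        Q, _ = np.linalg.qr(Phi)
        Q = Q[:, :m - 1]                              # orthonormal basis of T-Delta
        return np.linalg.eigvals(Q.T @ AL @ Q)

    # Hypothetical 3-strategy coordination-type cost matrix (cost 0 on matching).
    A = np.array([[0., 1., 1.],
                  [1., 0., 1.],
                  [1., 1., 0.]])
    xeq = np.ones(3) / 3                              # barycenter is a rest point
    print(tangent_eigenvalues(A, xeq))                # expected approx [1/3, 1/3]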
We show next that the set ∆ is in fact an invariant set for (7.3), hence φ : R × ∆ → ∆ is a continuous mapping which to
each initial state x0 ∈ ∆ and time t ∈ R assigns the population state φ(t, x0) ∈ ∆ at time t.
Lemma 7.4. The set ∆ ⊂ Rm is an invariant set for the RD, (7.3).
Proof: Denoting
ϕ(x) := 1_m^T x − 1
we have
∆ = {x ∈ R^m_+ | ϕ(x) = 0}
and we show that
L_f ϕ(x) = 0, ∀x ∈ ∆
where L f ϕ is the Lie derivative of ϕ along vector-field f , defined as L f ϕ (x) = ∇ϕ (x) f (x), with ∇ϕ (x) = 1Tm the
gradient row vector of ϕ .
By using the definition of f(x) we have, ∀x ∈ ∆,
L_f ϕ(x) = 1_m^T f(x) = ∑_{j∈M} F_j^e(x) x_j = ∑_{j∈M} ( −J_j(x) + J̄(x) ) x_j = −x^T J(x) + J̄(x) 1_m^T x = 0
where (7.8), x^T J(x) = J̄(x) being a scalar, and 1_m^T x = 1 were used. Since L_f ϕ(x) = 0 for all x ∈ ∆, and since x_j = 0
implies ẋ_j = 0 (so the faces of R^m_+ cannot be crossed), ∆ is an invariant set. Hence the solution orbit through any
initial state in ∆ is contained in ∆ for all t ∈ R, i.e.,
x0 ∈ ∆ =⇒ x(t) = φ(t, x0) ∈ ∆, ∀t ∈ R
Remark 7.5. Note that the vector 1m is the (outward) normal to the hyperplane {x | 1Tm x − 1 = 0 }, and the tangent
space to ∆ at x, denoted by T ∆, is given by all vectors orthogonal to 1m ,
T ∆ = {y ∈ Rm | 1Tm y = 0 }
The result above shows that f (x) ∈ T ∆, for all x ∈ ∆. Thus when studying stability of an equilibrium point x = xeq
based on the linearization system, we only need to check the eigenvalues of the Jacobian matrix AL = D f (x) |x=xeq
corresponding to the eigenvectors in T ∆.
Exercise 7.2. Consider Exercise 7.1 and the RD dynamics (7.6). Consider (ii), (iii) and (iv) and in each case study the stability of all equilibrium
points based on the linearization of the corresponding 2-D system (7.6). Recall from Remarks 7.3 and 7.5 that only the eigenvalues of the Jacobian
matrix AL corresponding to eigenvectors in T∆ have to be checked. Do these results agree with Exercise 7.1?
The next result says that strictly dominated strategies vanish in the long run (get “wiped out").
Proposition 7.6. Consider the RD (7.3) and let φ (t, x0 ) denote its solution at time t from initial state x0 . If a pure
strategy j is strictly dominated, then for any x0 ∈ int(∆) it follows that φ j (t, x0 ) → 0 as t → ∞.
Proof: Consider that the j-th strategy is strictly dominated by some y ∈ ∆, hence
J(e_j, x) > J(y, x), i.e., (e_j − y)^T J(x) > 0, ∀x ∈ ∆
and, by compactness of ∆, ε := min_{x∈∆} (e_j − y)^T J(x) > 0. With this y as parameter, let us define the function V_j : int(∆) → R by
V_j(x) = ln(x_j) − ∑_{k=1}^m y_k ln(x_k)
Along the solutions of (7.3), using ẋ_k / x_k = F_k^e(x),
V̇_j(x) = F_j^e(x) − ∑_{k=1}^m y_k F_k^e(x) = −J_j(x) + ∑_{k=1}^m y_k J_k(x) = −(e_j − y)^T J(x) ≤ −ε < 0
so V_j(x(t)) ≤ V_j(x(0)) − ε t → −∞ as t → ∞. Since −∑_k y_k ln(x_k) ≥ 0 on int(∆), this forces ln(x_j(t)) → −∞, i.e., φ_j(t, x0) → 0.
7.3 RD Equilibria vs. NE Strategies

In the following we relate the equilibria of the RD dynamics to the set of symmetric NE points, using either the
pairwise notation J(z, x) or the one involving the fitness z^T J(x). Recall Theorem 3.20, the Support Characterization
Theorem, and (3.16). For the case of symmetric matrix games and symmetric NE, this yields that x∗ ∈ ∆^NE if and
only if all pure strategies in its support have the same optimal cost, hence
J(e_j, x∗) = J(x∗, x∗), ∀ j ∈ sup(x∗),  and  J(e_j, x∗) ≥ J(x∗, x∗), ∀ j ∉ sup(x∗)
Recall also that the rest points of the RD (7.3) are the states for which all pure strategies in the support have the same cost,
∆_eq(RD) = { x ∈ ∆ | J(e_j, x) = J(x, x), ∀ j ∈ sup(x) }   (7.10)
and denote the set of interior rest points by ∆^0_eq(RD) := ∆_eq(RD) ∩ int(∆). Interior rest points are NE, ∆^0_eq(RD) = ∆^NE ∩ int(∆),
and moreover ∆^0_eq(RD) is a convex set such that any linear combination z ∈ ∆ of states in ∆^0_eq(RD) belongs to ∆^NE.
Proof: Only the last part needs to be proved. Let x, y ∈ ∆0eq (RD) and let a, b ∈ R such that z = a x + b y ∈ ∆. Now
for any pure strategy j ∈ M based on bilinearity of J, stationarity of x, y and the fact that x, y are interior it follows
that
J (e j , z) = a J (e j , x) + b J (e j , y) = a J (x, x) + b J (y, y)
Thus J (e j , z) = J (z, z) or (e j − z)T J(z) = 0 hence z ∈ ∆eq (RD), by (7.10).
If z ∈ ∆^0_eq(RD) then z ∈ ∆^NE. Otherwise, if z ∉ ∆^0_eq(RD), then z is a boundary point of ∆^0_eq(RD) = ∆^NE ∩ int(∆), and since ∆^NE
is a closed set it follows that z ∈ ∆^NE. Also, since ∆ is convex and z ∈ ∆^0_eq(RD) for all a, b ≥ 0 such that a + b = 1, it
follows that ∆^0_eq(RD) is convex.
(II) ∆^stable_eq(RD) ⊂ ∆^NE, i.e., any Lyapunov stable rest point of the RD (7.3) is a (symmetric) NE strategy.
Proof: Consider x∗ ∈ ∆ an equilibrium of the RD (7.3), x∗ ∈ ∆eq (RD) which is a Lyapunov stable equilibrium. Then
by definition, ∀ε > 0, there ∃δ > 0 such that for ∀x(0) = x0 ∈ Bδ (x∗ ) ∩ ∆, φ (t, x0 ) = x(t ) ∈ Bε (x∗ ) ∩ ∆, for ∀t > 0.
Assume by contradiction that x∗ ∉ ∆^NE. Then for all j ∈ sup(x∗) the cost J(e_j, x∗) is the same and suboptimal,
and there ∃ j0 ∉ sup(x∗) with better cost against x∗, i.e., such that J(e_{j0}, x∗) < J(x∗, x∗). This means that F^e_{j0}(x∗) =
−J(e_{j0}, x∗) + J(x∗, x∗) > 0. Let c = F^e_{j0}(x∗) > 0. For every α ∈ (0, c), let ε̄ = c − α > 0. Then by continuity of J
hence of F^e_{j0}, there ∃ a neighbourhood Bδ̄(x∗) of x∗ such that ∀y ∈ Bδ̄(x∗) ∩ ∆,
| F^e_{j0}(y) − F^e_{j0}(x∗) | < ε̄, or
−ε̄ < F^e_{j0}(y) − F^e_{j0}(x∗) < ε̄
Thus
F je0 (y) > F je0 (x∗ ) − ε̄ = F je0 (x∗ ) − c + α = α > 0, ∀y ∈ Bδ̄ (x∗ ) ∩ ∆ (7.11)
Since j0 ∉ sup(x∗), it follows that x∗_{j0} = 0. On the other hand, let us look at this j0-th component in (7.3),
ẋ j0 (t ) = F je0 (x(t )) x j0 (t )
Take any initial condition x(0) ∈ Bδ̄ (x∗ ) ∩ ∆. Then for all t > 0 such that x(t ) = φ (t, x0 ) ∈ Bδ̄ (x∗ ) ∩ ∆, based on
(7.11),
ẋ j0 (t ) = F je0 (x(t )) x j0 (t ) ≥ α x j0 (t )
x j0 (t ) ≥ eα t x j0 (0) (7.12)
i.e., x j0 (t ) grows exponentially for all t while in Bδ̄ (x∗ ) ∩ ∆. Thus we found a neighbourhood Bδ̄ (x∗ ) such that no
matter how small the initial conditions are, the exponential growth will make the solution exit Bδ̄ (x∗ ), contradicting
the definition of Lyapunov stability for ε = δ̄. This contradicts the assumption of x∗ being Lyapunov stable, hence our assumption x∗ ∉ ∆^NE is false.
The next result extends this to the case of interior limit points only (only the property of convergence (attractiveness)
holds).
Proposition 7.9. [Prop. 3.5, [153]] If there exists x0 ∈ int(∆) such that φ (t, x0 ) → x∗ as t → ∞ (x∗ limit point of the
RD (7.3)), then x∗ ∈ ∆NE , or
L+ (RD) ⊂ ∆NE
Proof: Assume by contradiction that x∗ ∉ ∆^NE. Then, as in the previous proof, there exist a pure strategy j0 and some ε > 0
such that J(e_{j0} − x∗, x∗) < −ε. Now since φ(t, x0) → x∗ and J is continuous, it follows that there exists some T > 0 such that
J(e_{j0} − φ(t, x0), φ(t, x0)) < −ε/2 for ∀t ≥ T, or F^e_{j0}(φ(t, x0)) > ε/2 for ∀t ≥ T. Thus by (7.3) it follows that for the j0
component we have ẋ_{j0} > (ε/2) x_{j0} for ∀t ≥ T, which means that
x_{j0}(t) = φ_{j0}(t, x0) > e^{ε(t−T)/2} x_{j0}(T) = e^{ε(t−T)/2} φ_{j0}(T, x0), ∀t ≥ T
Since x0 ∈ int(∆), φ_{j0}(T, x0) > 0, and this would imply that φ_{j0}(t, x0) → ∞, which is false; hence x∗ ∈ ∆^NE.
Remark 7.10. The above are two essential results for the evolutionary foundations of Nash equilibrium: dynamic
stability and interior convergence (attractive) of the equilibria points of RD are sufficient conditions for an NE.
Respectively they imply aggregate Nash equilibrium (NE) behavior. Ideally we would like necessary and sufficient
conditions, or somehow to avoid rest points (stationary states/equilibrium points) that are not NE. The RD dynamics
does have such points, but by Proposition 7.8 any such equilibrium is unstable. Having all rest points be NE is a desirable
property for many learning algorithms, i.e., that they lead to a mean dynamics with such a property, called the NS
(Nash stationarity) property in [129], see also [42].
In fact in case of asymptotic stability (stronger), we can guarantee some NE robustness, (Proposition 3.9, [153]).
(II) ∆^ass_eq(RD) ⊂ ∆^NE
7.4 RD Equilibria vs. ESS

We have related rest points (stationary states) of RD, as well as limit points of RD, to NE strategies. What about ESS?
Since this is a refinement of NE, do corresponding results exist for ESS? The next result, Proposition 7.12, is one of the
strongest results in EGT, giving necessary conditions for ESS.
Proposition 7.12. Every x∗ ∈ ∆^ESS is an asymptotically stable equilibrium of the RD, (7.3), i.e.,
(N) ∆^ESS ⊂ ∆^ass_eq(RD)
Its proof is based on Proposition 6.6 and Lemma 7.13 below, and uses the relative entropy function.
Lemma 7.13. For any x ∈ ∆, the relative entropy function H_x(y) := ∑_{j∈sup(x)} x_j ln( x_j / y_j ), defined on
Q_x = { y ∈ ∆ | sup(x) ⊂ sup(y) }
satisfies H_x(y) ≥ 0, and H_x(y) = 0 if and only if y = x. Moreover, along the solutions of the RD, (7.3),
dH_x(y(t))/dt = J(x − y, y) = (x − y)^T J(y)
We note that Hx (y) is a measure of distance between x and y. It does not define a metric because it is not symmetric,
but is a convex function.
Proof: By Jensen's inequality applied to the ln function (a concave function) we have, for any a_j ≥ 0 and w_j in its domain,
ln( ∑_j a_j w_j / ∑_j a_j ) ≥ ∑_j a_j ln(w_j) / ∑_j a_j
Applying this for w_j = y_j / x_j and a_j = x_j, j ∈ supp(x), yields
ln( ∑_{j∈supp(x)} x_j ( y_j / x_j ) ) ≥ ∑_{j∈supp(x)} x_j ln( y_j / x_j ) = −H_x(y)
where we have used x ∈ ∆, i.e., ∑_{j∈supp(x)} x_j = 1. Then, since y ∈ ∆ also, the LHS equals ln( ∑_{j∈supp(x)} y_j ) ≤ ln 1 = 0, so
0 ≤ H_x(y)
Ḣ_x(y) = ∑_{j∈sup(x)} ( ∂H_x(y)/∂y_j ) ẏ_j = ∑_{j∈sup(x)} ( x_j / y_j ) ( J(e_j, y) − J(y, y) ) y_j = x^T J(y) − J(y, y) = (x − y)^T J(y) = J(x − y, y)
which proves the lemma.
Proof of Proposition 7.12: Let x∗ ∈ ∆^ESS. By Proposition 6.6, x∗ is locally superior, i.e., there exists a neighbourhood B of x∗ such that x∗^T J(y) < y^T J(y) for all nearby y ≠ x∗, i.e.,
(x∗ − y)^T J(y) < 0, ∀y ∈ B ∩ ∆, y ≠ x∗
Consider the relative entropy function H_{x∗}(y) defined on
Q_{x∗} = { y ∈ ∆ | sup(x∗) ⊂ sup(y) }
which, by Lemma 7.13, satisfies H_{x∗}(y) ≥ 0 with H_{x∗}(y) = 0 if and only if y = x∗, and along the solutions of (7.3),
Ḣ_{x∗}(y) = J(x∗ − y, y) = (x∗ − y)^T J(y). Thus,
dH_{x∗}(y(t))/dt = (x∗ − y)^T J(y) < 0, ∀y ∈ V := B ∩ Q_{x∗}, y ≠ x∗
Hence Hx∗ is continuously differentiable, positive definite and Ḣx∗ is negative definite for all y ∈ V , i.e., Hx∗ is a
strict Lyapunov function on the neighborhood V . By Lyapunov stability theorem it follows that x∗ is asymptotically
stable.
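As a numerical companion to this proof (illustrative only: the Hawk-Dove-type cost matrix below, with V = 2 and C = 4, is an assumed example with interior ESS x∗ = (1/2, 1/2)), the following sketch integrates the RD and checks that the relative entropy H_{x∗}(y(t)) is non-increasing and tends to 0.

    import numpy as np

    A = np.array([[1.0, -2.0],     # illustrative Hawk-Dove-type cost matrix (V=2, C=4)
                  [0.0, -1.0]])    # interior ESS at x* = (1/2, 1/2)
    xstar = np.array([0.5, 0.5])

    def H(xs, y):
        """Relative entropy H_x*(y) = sum_j x*_j ln(x*_j / y_j), for sup(x*) in sup(y)."""
        return float(np.sum(xs * np.log(xs / y)))

    y = np.array([0.9, 0.1])       # interior initial state
    dt, vals = 0.005, []
    for _ in range(4000):
        J = A @ y
        y = y + dt * y * (y @ J - J)   # replicator dynamics, cost convention
        y = y / y.sum()
        vals.append(H(xstar, y))

    # H_x*(y(t)) acts as a strict Lyapunov function: non-increasing and -> 0.
    print(vals[0], vals[len(vals) // 2], vals[-1])
    print("monotone non-increasing:", all(a >= b - 1e-12 for a, b in zip(vals, vals[1:])))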
Remark 7.14. A global stability result follows for a strictly stable game, [129], for which
(x∗ − y)^T ( J(x∗) − J(y) ) > 0
is true for ∀y ∈ ∆, y ≠ x∗. Since (x∗ − y)^T J(x∗) ≤ 0 (the NE condition for x∗), it follows that (x∗ − y)^T J(y) < 0, ∀y ∈ ∆, y ≠ x∗.
The proof follows as in the above. Specifically, let x∗ be the unique NE in a strictly stable game and consider the
function H_{x∗}(y). Then H_{x∗}(y) = 0 iff y = x∗, i.e., H_{x∗}^{-1}(0) = {x∗}, and H_{x∗}(y) is radially unbounded with respect to
Q_{x∗}. Moreover, dH_{x∗}(y(t))/dt = J(x∗ − y, y) = (x∗ − y)^T J(y) ≤ 0, with equality only when y = x∗ (by the definition of
a strictly stable game). Then from LaSalle's invariance principle it follows that x∗ is globally asymptotically stable (GAS)
with respect to Q_{x∗}.
Another case when a global result holds is when the ESS is an interior ESS. In stating this convergence result for
the replicator dynamic RD we need to be careful to specify the set of states from which convergence to equilibrium
occurs. This is because under the replicator dynamic (RD), strategies that are initially unused remain unused for all
time; we will see in the next chapter that this is the same under any imitative dynamic. Therefore, if the state x0 places no
mass on a strategy in the support of the Nash equilibrium x∗, i.e., if j ∈ sup(x∗) but j ∉ sup(x0), the solution of the RD
starting from x0 cannot converge to x∗. Thus we use x0 ∈ int(∆), because this means sup(x0) = M.
Proposition 7.15. If x∗ ∈ ∆ESS ∩ int(∆), then for ∀x0 ∈ int(∆) as initial condition of the RD it follows that φ (t, x0 ) →
x∗ as t → ∞, i.e.,
∆ESS ∩ int(∆) = L+ (RD)
7.5 Doubly Symmetric (Partnership) Games

In this section we look at a special case of games and the relation of RD equilibria to ESS. Consider the special case of
a doubly symmetric game, i.e., such that A = A^T (the cost matrix is itself symmetric). These are also called partnership
games, [65], because the two types of players have identical cost. The game cost function is
J(x, y) = x^T A y
and, since A = A^T,
J̄_1(x, y) = J(x, y) = J(y, x) = J̄_1(y, x) = J̄_2(x, y) = J̄_2(y, x)
hence the game is indifferent to interchanging the players' strategies.
For such games it turns out that asymptotic stability in the RD replicator dynamics, (7.3) is equivalent to (necessary
and sufficient for) ESS evolutionary stability. In general we only saw it as necessary but not sufficient. Another
property is that the average cost decreases along every non-constant solution of RD (7.3). Hence average payoff
increases and evolution as modeled in (7.3) induces a steady increase in social efficiency over time. This is known
as the fundamental theorem of natural selection, a famous biological result due to [54] that says that evolutionary
selection induces a monotonic increase over time in the average population fitness.
First we show that for a doubly symmetric game, under the RD dynamics, the average cost V (x) := J (x, x) is non-
increasing.
Proposition 7.16. Let V (x) := J (x, x). For any doubly symmetric game V̇ (x) ≤ 0 along the solutions of (7.3), with
equality if and only if x ∈ ∆eq (RD).
Proof: We use vector notation and explicitly use the fact that J(x, x) = x^T A x, where A = A^T. Then, along the RD in vector form (7.4),
V̇(x) = ẋ^T A x + x^T A ẋ = 2 x^T A ẋ = −2 x^T A [ Diag(x) A x − ( x^T A x ) x ]
or, since A = A^T,
V̇(x) = −2 [ x^T A Diag(x) A x − ( x^T A x )^2 ]
The bracket is the variance of the entries of A x under the probability weights x_i ≥ 0 (with ∑_i x_i = 1), hence V̇(x) ≤ 0 for all x ∈ ∆,
with equality if and only if, for each i, x_i = 0 or (A x)_i = x^T A x, i.e.,
Diag(x) [ A x − ( x^T A x ) 1_m ] = 0
which is precisely x ∈ ∆_eq(RD).
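A quick numerical sanity check of this identity (an illustration, with a randomly generated symmetric cost matrix as an assumption) compares the closed-form expression for V̇ above with the chain-rule value 2 x^T A ẋ at random points of the simplex; the two agree and are ≤ 0.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 4
    B = rng.normal(size=(m, m))
    A = (B + B.T) / 2                      # a random doubly symmetric (A = A^T) cost matrix

    def Vdot(x, A):
        """dV/dt along the RD for V(x) = x^T A x, A symmetric:
           -2 [ x^T A Diag(x) A x - (x^T A x)^2 ] = -2 Var_x( (Ax)_i )."""
        w = A @ x
        return -2.0 * (np.sum(x * w**2) - (x @ w)**2)

    for _ in range(5):
        z = rng.random(m)
        x = z / z.sum()                     # a random point on the simplex
        w = A @ x
        xdot = x * (x @ w - w)              # RD vector field, cost convention
        direct = 2 * x @ (A @ xdot)         # chain rule: d/dt (x^T A x) = 2 x^T A xdot
        print(round(Vdot(x, A), 8), "=", round(direct, 8), "<= 0")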
Proposition 7.17. For any doubly symmetric game, x∗ ∈ ∆^ESS if and only if x∗ ∈ ∆ is an asymptotically stable equilibrium of the RD dynamics, (7.3), i.e.,
∆^ESS = ∆^ass_eq(RD)
Proof: Necessity follows directly from Proposition 7.12 (valid for any symmetric game). Thus ∆^ESS ⊂ ∆^ass_eq(RD).
For sufficiency, we show ∆^ass_eq(RD) ⊂ ∆^ESS. Consider x∗ ∈ ∆^ass_eq(RD), which means x∗ is an asymptotically stable
equilibrium of the RD dynamics, (7.3). Then x∗ is stable and attractive, i.e., there exists a neighborhood Bδ(x∗)
such that for all x(0) = y ∈ Bδ(x∗) the solution x(t) = φ(t, y) → x∗ as t → ∞.
By Proposition 7.16, V̇(x) < 0 for any x ∉ ∆_eq(RD); since x∗ is asymptotically stable there are no other rest points in Bδ(x∗),
so V(x(t)) is strictly decreasing along any non-constant solution with initial condition x(0) = y ∈ Bδ(x∗), y ≠ x∗. Hence in the
limit as t → ∞ we have V(x∗) < V(x(0)) = V(y), i.e.,
J(x∗, x∗) < J(y, y), ∀y ∈ Bδ(x∗) ∩ ∆, y ≠ x∗
Given y near x∗, consider ỹ := 2y − x∗, which lies in Bδ(x∗) whenever y ∈ B_{δ/2}(x∗); applying the above inequality to ỹ (on the intersection with ∆) gives
J(x∗, x∗) < J(ỹ, ỹ) = J(2y − x∗, 2y − x∗)
or, based on bilinearity of J and the double symmetry of the game,
J(x∗, x∗) < 4 J(y, y) − 4 J(x∗, y) + J(x∗, x∗), i.e., J(x∗, y) < J(y, y)
for all y near x∗, y ≠ x∗. By Proposition 6.6 (local superiority), this means x∗ ∈ ∆^ESS. This also means that the cost of x∗ is better not only when a
small fraction of agents switch to y, but also when all agents in the population switch to y ∈ ∆, as in (local) Pareto efficiency.
7.6 Potential Games

A special class of games for which general convergence results exist in the evolutionary game setup is the class of
potential games.
Definition 7.18. A population game with cost vector function J(x) is a (full) potential game if there exists a C1
function P : R^m_+ → R, called a potential function, such that
∇P(x) = J(x), ∀x ∈ R^m_+
When J is continuously differentiable (C1) this is equivalent to DJ(x) being symmetric for all x ∈ R^m_+, where DJ(x)
denotes the differential of J (Jacobian matrix).
The condition of DJ(x) being symmetric is called externality symmetry and is explicitly written as
∂J_j/∂x_k (x) = ∂J_k/∂x_j (x), ∀ j, k ∈ M
indicating the same marginal effect of one strategy on another. For the case of symmetric bimatrix games, J(x) = A x,
this is equivalent to A being symmetric itself, hence double-symmetric games as in the previous section. Note that
thus the cost vector can be seen as the “gradient" of P, or that J is integrable.
Remark 7.19. The above definition applies when the state space is the whole Rm + , hence the name full potential
game. In the case when the simplex ∆ is the state space, this definition requires modification. In this case the
potential function P : ∆ → R has domain ∆ so that the gradient ∇ P(x) is an element of the tangent space to ∆ at x,
denoted by T ∆,
T ∆ = {y ∈ Rm | 1Tm y = 0 }
Thus the definition of a potential game requires that the gradient ∇ P(x) equals the projection of the cost vector J(x)
onto the subspace T ∆,
∇ P(x) = Φ J(x), ∀x ∈ ∆
where Φ = I − (1/m) 1_m 1_m^T is the orthogonal projection matrix from R^m onto T∆. When J is continuously differentiable
(C1) this is equivalent to DJ(x) being symmetric with respect to T∆, i.e.,
z^T DJ(x) z̃ = z̃^T DJ(x) z, ∀z, z̃ ∈ T∆, x ∈ ∆
Consider a potential game J. A full potential game J̃ can be generated from an arbitrary extension P̃ of the potential
function P and then can be projected back to ∆ (see Chapter 3 in [129]). Theorem 3.2.12 in [129] shows that J
and J̃ have the same best-response correspondences and Nash equilibria, but may have different average cost levels.
Part (ii) of the theorem shows that by choosing the extension P̃ appropriately, we can make J̃ and J identical on
∆. Moreover, the result demonstrates that if the population mass is fixed, then Definition 7.18 does not entail a loss of
generality. In fact almost all results for full potential games hold for potential games. Here we only consider full
potential games.
In the case of potential games, all local minimizers of the potential P on ∆ are Nash equilibria, this being a consequence
of the Karush-Kuhn-Tucker (KKT) first-order necessary conditions. Consider the nonlinear problem of minimizing P(x) over x ∈ ∆, with Lagrangian
L(x, µ, λ) = P(x) + µ ( ∑_{j∈M} x_j − 1 ) − ∑_{j∈M} λ_j x_j
At a local minimizer x the KKT conditions give ∂P/∂x_j (x) = J_j(x) = −µ + λ_j, with λ_j ≥ 0 and λ_j x_j = 0, hence J_j(x) = −µ for all
j ∈ sup(x) and J_j(x) ≥ −µ otherwise, which is exactly the support characterization of an NE.
Example 7.20. Consider a highway congestion game where for example drivers commute over a highway network
as in Example 6.8. The same setup is used in the case of network congestion, where each flow (user/agent) tries to
go over the route (path) that is least congested. This is the leading example of a potential game.
Assume that a collection of towns is connected by a network of links. For each ordered pair of towns there is
a population of agents, each of whom needs to commute from the first town in the pair to the second one. To
accomplish this, an agent must choose a path (route) connecting the two towns, hence a strategy j from the finite
set M. Let x j be the mass of agents that uses strategy (route) j. The delay each driver (agent) experiences depends
not only on the route he selects, but also on the congestion created by other agents along this route. The cost for an
agent is the delay on the path he takes. This delay is the sum of the delays on its constituent links, while the delay
on a link is a function of the number of agents who use that link.
To define a congestion game, we begin with a finite collection of facilities (e.g., links in a network), denoted L .
Every strategy j ∈ M requires the use of some collection of links (facilities) L j ⊂ L , (e.g., the links in route j). The
set ρ (l ) = {k ∈ M | l ∈ Lk } contains those strategies in M that require facility (link) l. Each facility (link) l ∈ L
has a cost function cl : R + → R that is a function of the usage it gets, i.e. the total “mass" of agents using the link
(facility) denoted by ql ,
q_l(x) = ∑_{k∈ρ(l)} x_k
Costs for each strategy j ∈ M in the congestion game are obtained by summing the appropriate facility (link) costs,
J_j(x) = ∑_{l∈L_j} c_l( q_l(x) )
Since driving on a link increases the delays experienced by other drivers on that link, cost functions in models of
highway congestion are increasing; they are typically convex as well.
An agent taking the path j affects the outcome (cost) of the agents that use path k through the marginal increases in
congestion on the links l ∈ L j ∩ Lk that the two paths (strategies) have in common. This means
∂J_j/∂x_k (x) = ∑_{l∈L_j∩L_k} c′_l( q_l(x) ) = ∂J_k/∂x_j (x)
where c′l is the derivative of cl and where the last equality is a result of the fact that the marginal effect of an agent
choosing path k on the congestion cost of agents choosing path j is identical. Hence such a game is a full potential
game, and the potential function can be written as
P(x) = ∑_{l∈L} ∫_0^{q_l(x)} c_l(u) du
Assume now a simple model with a single source town (S) and a single destination town (D). The two towns are
separated by a river. Highways l1 and l4 are expressways that go around bends in the river, and that do not become
congested easily: c_{l1}(q) = c_{l4}(q) = 4 + 20q, where q represents the loading on this link (mass of agents using it).
Highways l2 and l3 cross the river over two short but easily congested bridges: cl2 (q) = cl3 (q) = 2 + 30q2 . In order
to create a direct path between the towns, a city planner considers building a new expressway l5 that includes a
third bridge over the river. Delays on this new expressway are described by cl5 (q) = 1 + 20q. Before link l5 is
constructed, there are two paths from (S) to (D): path 1 (strategy 1) traverses links l1 and l2 , L1 = {l1 , l2 }, while
path 2 traverses links l3 and l4 , L2 = {l3 , l4 }. The equilibrium driving pattern splits the drivers equally over the two
paths (strategies), yielding an equilibrium driving time of 23.5 on each.
Assume that traffic on the new link l5 only flows to the right. After link l5 is constructed, drivers may also take path
3, which uses links l3 , l5 , l2 , i.e., L3 = {l3 , l5 , l2 }. Let x1 , x2 , x3 denote the fraction of agents (drivers) using path
strategy 1, 2, 3, respectively. The "mass" (fraction) of agents using link l2 is thus q_{l2} = x1 + x3; similarly one can
compute the loading for the other links and then evaluate their costs. The resulting population game has the cost vector
function J(x) = [ J1(x), J2(x), J3(x) ]^T, with
J1(x) = 6 + 20 x1 + 30 (x1 + x3)^2
J2(x) = 6 + 20 x2 + 30 (x2 + x3)^2
J3(x) = 5 + 20 x3 + 30 (x1 + x3)^2 + 30 (x2 + x3)^2
The corresponding potential P is convex. The unique minimizer of the potential P on ∆ is the state x = [.4616, .4616, .0768]^T, and this is
the unique Nash equilibrium of the game. In this equilibrium, the driving time on each path is approximately 23.93,
which exceeds the original equilibrium time of 23.5. In other words, adding an additional link to the network
actually increases equilibrium driving times, a phenomenon known as Braess’ paradox. The intuition behind this
phenomenon is easy to see in this case: by opening up the new link l5 , a single driver using path (strategy) 3 can use
both of the easily congested bridges, l2 and l3 . But while using route 3 is bad for the population as a whole (social
cost), it is appealing to individual (selfish) drivers, as drivers do not account for the negative effect their use of the
bridges imposes on others.
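The equilibrium quoted above can be recovered computationally by minimizing the potential P over ∆. The following is a minimal sketch assuming scipy is available; the link costs and path structure are exactly those of the example, and the output should come out close to the quoted values.

    import numpy as np
    from scipy.optimize import minimize

    def link_loads(x):
        x1, x2, x3 = x
        # q_l for links l1..l5 given path flows: L1={l1,l2}, L2={l3,l4}, L3={l3,l5,l2}
        return np.array([x1, x1 + x3, x2 + x3, x2, x3])

    def potential(x):
        q = link_loads(x)
        # P(x) = sum_l int_0^{q_l} c_l(u) du for c1=c4=4+20q, c2=c3=2+30q^2, c5=1+20q
        return (4*q[0] + 10*q[0]**2) + (2*q[1] + 10*q[1]**3) + (2*q[2] + 10*q[2]**3) \
             + (4*q[3] + 10*q[3]**2) + (q[4] + 10*q[4]**2)

    res = minimize(potential, x0=np.ones(3)/3, method="SLSQP",
                   bounds=[(0, 1)]*3,
                   constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1}])
    x = res.x
    J = np.array([6 + 20*x[0] + 30*(x[0]+x[2])**2,
                  6 + 20*x[1] + 30*(x[1]+x[2])**2,
                  5 + 20*x[2] + 30*(x[0]+x[2])**2 + 30*(x[1]+x[2])**2])
    print(x.round(4))   # expected approx [0.4616, 0.4616, 0.0768]
    print(J.round(3))   # path delays approx 23.93 > 23.5 (Braess' paradox)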
Definition 7.21. A population game with cost vector function J(x) is a stable game if
(x − y)^T ( J(x) − J(y) ) ≥ 0, ∀x, y ∈ ∆
or, equivalently when J is C1,
y^T DJ(x) y ≥ 0, ∀x ∈ ∆, y ∈ T∆
where DJ(x) denotes the differential of J and T∆ the tangent space to ∆, i.e., the set of y ∈ R^m such that 1_m^T y = 0.
This last inequality is called self-defeating externalities, (see [129], p.105). For the case of symmetric bimatrix
games, J(x) = A x, and Definition 7.21 is equivalent to (y − x)T A (y − x) ≥ 0, ∀x, y ∈ ∆, hence a positive semidefinite
matrix A satisfies it.
Recall that for continuous-kernel games, the pseudo-gradient vector F : Ω → R^N is monotone on Ω when
( F(x) − F(y) )^T ( x − y ) ≥ 0, ∀x, y ∈ Ω
so stable population games correspond to this class of games from continuous-kernel games.
Proof:
Since any x∗ ∈ ∆ that is an asymptotically stable equilibrium of (7.3) is a stable equilibrium, by Proposition 7.8 it follows
immediately that x∗ ∈ ∆^NE, hence (x∗, x∗) ∈ NE(G).
By contradiction, assume that (x∗, x∗) ∈ NE(G) is not perfect, (x∗, x∗) ∉ PE(G). Then by Proposition 3.24, x∗ is
weakly dominated, i.e., there exists some y ∈ ∆, y ≠ x∗, such that
J(y, z) ≤ J(x∗, z), ∀z ∈ ∆
with strict inequality for at least one z ∈ ∆. Consider, for z ∈ int(∆), the function
V(z) = ∑_{j∈sup(z)} (y_j − x∗_j) ln(z_j)
Along the solutions of (7.3),
V̇(z) = ∑_{j∈M} (y_j − x∗_j) ż_j / z_j = − ∑_{j∈M} (y_j − x∗_j) J(e_j − z, z)
Hence
V̇(z) = − ∑_{j∈M} (y_j − x∗_j) ( J(e_j, z) − J(z, z) ) = −( J(y, z) − J(x∗, z) ) + J(z, z) ∑_{j∈M} (y_j − x∗_j)
or, since x∗ , y ∈ ∆,
V̇ (z) = −(J (y, z) − J (x∗ , z)) = −J (y − x∗ , z) ≥ 0
However we can find a z ∈ B ∩ int(∆) such that V (x∗ ) < V (z) which contradicts this.
Firstly, based on the definition of V,
V(z) − V(x∗) = ∑_{j∈sup(x∗)} (y_j − x∗_j) ln( z_j / x∗_j ) + ∑_{j∉sup(x∗)} y_j ln(z_j)
Specifically, let any δ ∈ (0, 1), ε > 0 and any w ∈ int(∆), and consider a special z given by
z = ε w + (1 − ε) [ x∗ + δ (y − x∗) ]
Since z − x∗ = ε (w − x∗ ) + (1 − ε ) δ (y − x∗ ), it follows that for sufficiently small ε , y j < x∗j implies z j < x∗j , hence
the first term on the RHS of the above equality is positive. It can be shown that sup(y) ⊂ sup(x∗ ) so that the second
term is zero, hence
V (z) > V (x∗ )
For the special case of doubly symmetric (partnership) games, i.e., such that A = AT , we saw that ESS evolutionary
stability is equivalent to asymptotic stability in the RD replicator dynamics, (7.3). Recall that in the ESS definition
the assumption was that all individuals play the same mixed strategy x ∈ ∆, and we wanted to see what the cost
(payoff) of strategy x is when only a few agents (an ε-fraction) in the population switch to some other strategy y ∈ ∆.
We also saw that average payoff increases, so that evolution as modeled in (7.3) induces a steady increase in social
efficiency over time. In this special case we can look at the following question: what is the cost (payoff) of strategy x compared
to the cost of some other strategy y ∈ ∆ when all agents in the population switch to y? If the comparison favours
x then x is called locally socially efficient, i.e., there is no nearby y such that the cost would be lower if all
agents were to switch to y. Formally we define this next.
Definition 7.22. A strategy x ∈ ∆ is locally strictly efficient if there ∃ a neighborhood Bδ (x) such that J (x, x) <
J (y, y), ∀y ∈ Bδ (x) ∩ ∆, y 6= x.
A strategy x ∈ ∆ is locally weakly efficient if there ∃ a neighborhood Bδ (x) such that J (x, x) ≤ J (y, y), ∀y ∈ Bδ (x) ∩
∆, y 6= x. A strategy x ∈ ∆ is globally efficient if J (x, x) ≤ J (y, y), ∀y ∈ ∆.
This last condition is the same as finding the minimum of J(x, x) over ∆. Specifically, let
∆∗ := arg min_{x∈∆} J(x, x)
Since the cost J is continuous and ∆ is compact, it follows that ∆∗ ≠ ∅. Based on this concept we can show that for
doubly symmetric games ESS is equivalent to local strict efficiency.
Proposition 7.23. In a doubly symmetric game x∗ ∈ ∆ESS if and only if x∗ is locally strictly efficient.
Proof: For any y ≠ x∗ let z = (1/2) x∗ + (1/2) y. By Proposition 6.6, x∗ ∈ ∆^ESS if and only if x∗ is locally superior, i.e., there
exists a neighborhood Bδ(x∗) such that
J(x∗, z) < J(z, z), ∀z ∈ Bδ(x∗) ∩ ∆, z ≠ x∗
By bilinearity and the double symmetry, J(x∗, z) = (1/2) J(x∗, x∗) + (1/2) J(x∗, y) and
J(z, z) = (1/4) J(x∗, x∗) + (1/2) J(x∗, y) + (1/4) J(y, y)
Now z − x∗ = (1/2)(y − x∗), so z ∈ Bδ(x∗) is equivalent to y ∈ B_{2δ}(x∗). Using these in the foregoing inequality yields, equivalently,
J(x∗, x∗) < J(y, y), ∀y ∈ B_{2δ}(x∗) ∩ ∆, y ≠ x∗
Remark 7.24. Extension of the ESS concept to multi-populations can be done immediately leading to the concept of
evolutionary stable strategy profiles (ESSP).
Definition 7.25. A strategy profile x∗ = (x∗i )i∈N ∈ ∆X is an evolutionary stable strategy profile (ESSP) if it cannot
be successfully invaded. Formally, if for every strategy profile y = (y_i)_{i∈N} ∈ ∆X, y ≠ x∗, there exists some sufficiently
small εy ∈ (0, 1) such that ∀ε ∈ (0, εy ),
Note that no assumption of symmetry is used here, hence it is not a strategy x_i itself that is an ESS, but the whole strategy
profile x∗ that is an ESSP. Herein we treated the single-population case, but we indicate briefly below how this extension can be
made. A good treatment of results for this more advanced case can be found in [153].
where, ∀ j ∈ M1, k ∈ M2,
F^e_{1,j}(x, y) = −( J̄_1(e_{1j}, y) − J̄_1(x, y) ),   F^e_{2,k}(x, y) = −( J̄_2(e_{2k}, x) − J̄_2(x, y) )   (7.14)
is the excess payoff or fitness, and J̄_1(x, y), J̄_2(x, y) are the average costs (payoffs) in populations 1 and 2.
For a matrix game with J̄_1(x, y) = x^T A y and J̄_2(x, y) = x^T B y = y^T B^T x, in (7.14) we have the fitness
F^e_{1,j}(x, y) = −( (e_{1j})^T A y − x^T A y )   (7.15)
F^e_{2,k}(x, y) = −( (e_{2k})^T B^T x − y^T B^T x ),   ∀ j ∈ M1, k ∈ M2
so that
ẋ_j = −( (A y)_j − x^T A y ) x_j   (7.16)
ẏ_k = −( (B^T x)_k − y^T B^T x ) y_k,   ∀ j ∈ M1, k ∈ M2
as in the asymmetric game case in [65], keeping in mind that A is a cost matrix here. For a symmetric matrix game,
B = A^T, and when one looks for a symmetric solution, i.e., restricted to the diagonal D, with x = y, we can work with
a single vector equation
ẋ j = F je (x) x j , ∀j ∈ M (7.17)
where
F_j^e(x) := F^e_{1,j}(x, x) = −( e_j^T A x − x^T A x ) = −J(e_j, x) + J(x, x)
and J (x, x) is the average cost, thus recovering the single-population case we dealt with in (7.3).
Multi-population ([153], pp. 171). Here x = (x_i, x_{−i}) ∈ ∆X and
ẋ_{ij} = −( J̄_i(e_{ij}, x_{−i}) − J̄_i(x_i, x_{−i}) ) x_{ij}, ∀i ∈ N, ∀ j ∈ M_i   (7.18)
We can denote the cost vector for population (player) i by J_i : ∆_{−i} → R^{m_i}, where the j-th entry is
[ J_i(x_{−i}) ]_j = J̄_i(e_{ij}, x_{−i})
corresponding to the cost when the i-th player (population) uses the j-th pure strategy, while all the others use x_{−i}. Recall
that J̄_i(x_i, x_{−i}) = ∑_{j∈M_i} x_{ij} J̄_i(e_{ij}, x_{−i}), so with the foregoing vector notation it follows that
ẋ_i = −( J_i(x_{−i}) .* x_i − ( x_i^T J_i(x_{−i}) ) x_i ), ∀i ∈ N
where .∗ denotes component-wise product of two vectors. Note that with the equivalence Ji = −Ui , Ji (x−i ) =
−Ui (x−i ), where Ui is payoff, an equivalent form is
ẋi = Ui (x−i ). ∗ xi − xTi Ui (x−i ) xi , ∀i ∈ N
and, in compact form,
ẋ_i = −R(x_i) J_i(x_{−i}) = −R(x_i) F_i(x), with R(x_i) := Diag(x_i) − x_i x_i^T
Equivalently, in terms of the excess fitness, ẋ_{ij} = F^e_{i,j}(x) x_{ij}, where
F^e_{i,j}(x) = −( J̄_i(e_{ij}, x_{−i}) − J̄_i(x) ), ∀i ∈ N, ∀ j ∈ M_i
is the excess payoff or fitness, and J̄_i(x) has the interpretation of the average cost (payoff) in population i. The typical definition uses the fitness defined as
F^e_{i,j}(x) = Ū_i(e_{ij}, x_{−i}) − Ū_i(x)
i.e., as the excess payoff (utility) when compared to the average in the population. We keep the J_i notation to be
consistent with the setup we used until now. As always we use the equivalence Ji = −Ui , hence maximizing fitness
is equivalent to minimizing cost. In vector form,
ẋ_i = Diag( F^e_i(x) ) x_i, ∀i ∈ N
where Diag( F^e_i(x) ) denotes the diagonal matrix with elements from the F^e_i(x) vector, x_i = [x_{i,j}] ∈ ∆_i ⊂ R^{m_i} and F^e_i(x) = [F^e_{i,j}(x)]_{j∈M_i} ∈ R^{m_i}.
Most evolutionary games consider symmetric matrix games, i.e., when J(e_j, e_k) = [A]_{j,k}, hence such that
for pure strategies the cost is identified with the matrix A, where Ω = {e_1, . . . , e_m}, e_j ∈ R^m, j ∈ M. There are some
interesting extensions to the case when this cost is not a constant A but J(u), with some u_j ≠ e_j that can vary
continuously (i.e., not only over discrete values), [106], [151].
According to the evolutionary interpretation, let x denote the vector of fractions of population, x j being the fraction
that plays pure strategy u j . Thus the vector of pure strategies is u and in addition to the evolution of population
dynamics ẋ according to replicator dynamics (RD), one can have an evolution of the strategy, u̇, called strategy
dynamics or adaptation dynamics.
Recall that in the matrix case we had J(e_j, y) = J_j(y) = (A y)_j and J(x, y) = ∑_{j=1}^m ∑_{k=1}^n x_j J(e_j, e_k) y_k = x^T A y,
where we recognized that if we denote by [ J (u) ] j,k the game cost when pure strategy pair ( j, k) is used, then for
matrix games, [ J (u) ] j,k = J (e j , ek ) = A j,k = a jk . Generalizing to J (u) = A for matrix case, in general we can write
the average cost as
J(x, y, u) = ∑_{j=1}^m ∑_{k=1}^n x_j [ J(u) ]_{j,k} y_k = x^T J(u) y
still bilinear in x and y, which is useful when the pure strategy profile u is no longer fixed. Moreover, J (x, x, u) =
xT J (u) x is quadratic, and this can be extended to the more general case of a nonlinear cost (fitness) function,
J (x, x, u) := xT J(x, u).
ẋ_j = F_j^e(x, u) x_j, ∀ j ∈ M   (7.21)
where
F_j^e(x, u) = −( J(u_j, x, u) − J(x, x, u) ), ∀ j ∈ M
or
F_j^e(x, u) = −( J_j(x, u) − J̄(x, u) ), ∀ j ∈ M
Define a function J̃ : ∆ × Ω × Ω^X → R, with an extra argument ν ∈ Ω, such that
ẋ_j = J̃(x, ν, u) |_{ν = u_j} x_j, ∀ j ∈ M
and describes how the fraction of population x j , that plays each strategy u j , varies in time depending on the interac-
tion with the other strategies (i.e., on the environment). Consider now that u j themselves are not fixed but can vary
in time. Then to this population dynamics we add what is called the adaptive (strategy) dynamics, that describes how
the strategy itself u j is adapting. Specifically, adaptive dynamics assumes that u j changes in response to the direction
and magnitude of the fitness gradient on the adaptive landscape. Under some assumptions regarding the distribution
and redistribution of heritable strategies about the population’s mean strategy, the strategy or mean strategy u j value
evolves according to
u̇_j = µ ∂J̃(x, ν, u)/∂ν |_{ν = u_j}, ∀ j ∈ M   (7.22)
or, in gradient (vector) form,
u̇_j = µ ∇_ν J̃(x, ν, u) |_{ν = u_j}, ∀ j ∈ M
In fact u j represents the mean strategy value of the population x j . The dynamics of the strategy depends on the
gradient of Je with respect to ν , denoted by ∇ν Je(x, ν , u), taken as if x and u are constant. All of the strategies in
u with positive population sizes evolve by climbing the adaptive landscape. This is called strategy dynamics in the
proposed model of evolution of [151] and is called adaptation dynamics by [46], [65], [66]. Thus the most general
overall population-strategy dynamics is described by a coupled system of differential equations in x and u.
7.8 Notes
Evolutionary game theory is devoted to the study of the evolution of strategies in a population context. In biological
systems, players are typically assumed to be pre-programmed to play one given strategy, so studying the evolution
of a population of strategies becomes formally equivalent to studying the demographic evolution of a population
of players. By contrast, in socio-economic models, players are usually assumed capable of adapting their behavior
within their lifetime, switching their strategy in response to evolutionary (or competitive) pressure. However, the
distinction between players and strategies is irrelevant for the formal analysis of the system in either case, since
it is strategies that are actually subjected to evolutionary pressures. Thus, without loss of generality and for the
sake of clarity, one can adopt the biological standpoint and assume that players may die and that each individual player uses
the same particular fixed strategy throughout his finite life. Strategies are subjected to selection pressures in
the sense that the relative frequency of strategies which obtain higher payoffs in the population will increase at the
expense of those which obtain relatively lower payoffs. The aim is to identify which strategies (i.e. type of players or
behavioral phenotypes) are most likely to thrive in this “evolving ecosystem of strategies" and which will be wiped
out by selective forces. In this sense, note that EGT is an inherently dynamic theory, even if some of its equilibrium
concepts are formulated statically (e.g. the concept of evolutionarily stable strategy).
Chapter 8
Learning in Population Games
Chapter Summary
This chapter provides an introduction to learning approaches in population games. We generalize the RD dynamics
to the class of payoff (cost) monotonic dynamics. Then we introduce revision protocols that describe the behavior
of agents in large populations and show that different types of protocols lead to different dynamics. We show that
imitative protocols lead to imitation dynamics related to the replicator dynamics (RD). Some other protocols, called
direct protocols lead to best-response dynamics or variants of it.
8.1 Introduction
As seen in the previous chapter, although motivated by mathematical biology, evolutionary game theory (EGT)
leads to remarkable connections to game theory and the NE solution concept. Here we use EGT to contrast classical
game theory (CGT) to the branch of learning game theory (LGT), specifically for population games (LGT-Pop),
(N → ∞). Recall that the classical, rationalistic justification for the Nash equilibrium concept in game theory is
based on three assumptions about the players in the game. First, each player is assumed to be rational, acting to
maximize his payoffs given what he knows. Second, players have knowledge of the game they are playing: they
know what strategies are available, and what outcomes (payoffs or costs) result from every strategy profile. Third,
the players have equilibrium knowledge: they are able to anticipate correctly what their opponents will do. Of the
three assumptions listed above, the equilibrium knowledge assumption is the hardest to accept. Namely, it is hard
to explain how players can introspectively anticipate how others will act, particularly in games with large numbers
of participants. As an example, under the traditional interpretation of equilibrium play in a traffic network, a driver
choosing a route to work has a complete mental account of all of the routes he could take, and he is able to anticipate
the delay that would arise on each route for any possible profile of choices by his fellow drivers.
LGT does not impose assumptions on players’ rationality and beliefs, but assumes instead that players learn over
time about the game and about the behavior of others (e.g. through imitation, reinforcement or belief updating).
There are a variety of approaches one could take to studying dynamics or learning in games (LGT), depending
on the number of players involved, the information the players are expected to possess, and the importance of the
interaction to the players. In the context of finite populations (N < ∞), LGT needs further mathematical apparatus,
such as the use of perturbed Markov processes, which are beyond our scope. Material on LGT (N < ∞) based on
stochastic approximation is presented in the supplementary section on learning in repeated games. A recent textbook
reference dealing with these and related issues is [57].
The approach we study in this chapter considers the dynamics of behavior in large, strategically interacting pop-
ulations (LGT-Pop), (N → ∞), based on connections to evolutionary game theory (EGT). This approach assumes
that agents only occasionally switch strategies, and then use simple myopic rules to decide how to act. While these
assumptions are certainly not appropriate for every application, they seem natural when the interaction in question
is just one among many the agent faces, so that the sporadic application of a rule of thumb is a reasonable way for
the agent to proceed. Material is drawn mainly from [129] and [57].
Learning models deal with behavioural strategies which are slightly more complicated than those used in evolution-
ary models. We extend the setup from the replicator dynamics (RD) as inspired by biological models to other classes
of dynamics. Models in EGT are aggregate in the sense that they describe the aggregate behaviour of a population
of players through various generations; the population is subject to evolutionary pressures (and therefore the pop-
ulation adapts), but the individual components of the population have a predefined fixed behaviour. On the other
hand, in LGT players individually adapt through learning, and it is this learning process that is formally described.
Models in LGT explicitly represent the learning processes that each individual player carries out, and the dynamics
that are generated at the aggregate level emerge out of the strategic interactions among the players. It turns out that
this dynamics at the aggregate level is what links LGT to the setup in EGT.
We focus only on deterministic dynamics and for the most part on regular dynamics only (as in [153]), i.e., assuming
uniqueness and invariance of the ∆ set under corresponding solutions. While regularity rules out mechanisms like
the best response dynamics, we do indicate how those could be approached (more complicated than with the tools
we are concerned with here). Stochastic and non-regular dynamics fall in that more complex case. Under the
assumption of infinite populations the stochastic effects are effectively averaged out, so the obtained deterministic
dynamics can be formalized as a system of differential equations (ODEs). Any model with finite populations can be
formalized as a Markov process, and the system of ODEs is the approximation of the Markov process in the limit as
the population N tends to infinity. We are interested in studying the behaviour of the ODEs system in the long run
(steady-state), which involves calculating the limit of the dynamics as time goes to infinity. This is because one of
the main motivations to study learning models in games is identifying learning algorithms that will lead to NE or to
refinements of NE.
Interestingly, models of learning by imitation and evolutionary models are closely related: one can always understand
an evolutionary model in learning terms, by re-interpreting the death-birth process as a strategy revision-imitation
process conducted by immortal individuals. Imitation occurs whenever a player adopts the strategy of some other
player. With this view in mind, one could argue that LGT actually encompasses EGT. The definition of a particular
imitation rule dictates when and how imitation takes place. Some models prescribe that players receive an imitation
opportunity with some fixed independent probability; in other models the revision opportunity is triggered by some
internal event (e.g. player’s average cost going above a certain threshold). When given the chance to revise his
strategy, the imitator selects one other player to imitate, influenced by the cost obtained by the other players in
previous rounds, or by experimentation (i.e. adoption of a randomly selected strategy).
A basic part of LGT is what is called a revision protocol, [129], which specifies the procedure, i.e., the rules that
agents employ to decide when and how to choose new strategies. Starting with a population game and a revision
protocol, one can derive dynamic processes that describe the aggregate behavior dynamics or mean dynamics of all
players over time. These aggregate behavior dynamics can be both deterministic and stochastic, and in the long-run
can be approximated by a deterministic dynamics as mentioned above. One can place protocols into two broad
categories: under imitative protocols, an agent obtains a candidate strategy by observing the strategy of a randomly
chosen member of his population (hence a strategy already in use). Under direct protocols, agents are assumed to
choose candidate strategies directly; a strategy’s popularity does not directly influence the probability with which it
is considered. Examples are the BNN dynamics, the Smith dynamics or the best response (BR) dynamics, which is
based on optimal myopic choices.
The chapter is organized as follows. In the first part we shall define a class of dynamics that is broader than the
replicator dynamics (RD), called payoff monotonic dynamics. Many of the RD dynamics results in previous chapter
hold for the class of payoff monotonic dynamics. Moreover we will see that the resulting dynamics under imitative
protocols reduce to the RD dynamics or to payoff monotonic dynamics. In contrast, dynamics based on direct
selection protocols behave rather differently than those studied in biology. Direct selection allows unused strategies
to be introduced to the population, which is impossible under pure imitation (or under biological reproduction
without mutations).
8.2 Payoff (Cost) Monotonic Dynamics

In this section we define a class of dynamics broader than the replicator dynamics (RD) inspired by biology. Such dy-
namics are general autonomous continuous-time dynamical systems with certain properties expressed in geometric
form, [55], [153]. Many of the RD results in previous chapter hold for this broader class of dynamics.
Consider a population game with vector-valued cost function J : ∆ → R^m or J : ∆ × ∆ → R and a dynamical system
ẋ_j = F_j^e(x) x_j, ∀ j ∈ M   (8.1)
where F_j^e(x) depends on the cost but has no specific form imposed. However, it has the same interpretation of a
growth-rate, i.e., the rate at which the pure strategy j replicates when the population is in state x. For the RD
dynamics, (7.3), this has the explicit form F je (x) = −J j (x) + J(x), i.e., the excess fitness. We now define only
some general properties for F je that enable generalization to other forms of dynamics, specifically to all imitative
dynamics.
Definition 8.1. A function F e : X → Rm is called regular growth-rate function if it is Lipschitz continuous, its open
domain X is such that ∆ ⊂ X and if xT F e (x) = 0 for all x ∈ ∆.
Geometrically, the condition x^T F^e(x) = 0 says that the growth-rate vector F^e(x) is orthogonal to x ∈ ∆. This is
equivalent to 1_m^T f(x) = 0 and enables one to show that the simplex ∆, its interior and its boundary are invariant sets for
(8.1). It can be seen that x is an equilibrium point if and only if F_j^e(x) = 0 for all j ∈ sup(x). The next result gives
sufficient conditions for asymptotic stability and instability, respectively.
Proposition 8.2. Consider (8.1) for F^e a regular growth-rate function and x ∈ ∆ an equilibrium. If there ∃ some
neighbourhood B(x) such that x^T F^e(y) > 0 for all population states y ≠ x, y ∈ B(x) ∩ ∆, then x is asymptotically
stable.
If there ∃ some neighbourhood B(x) such that x^T F^e(y) < 0 for all population states y ≠ x, y ∈ B(x) ∩ ∆, then x is
unstable.
Exercise 8.1. Prove Proposition 8.2. Proceed as in Lemma 7.13 and show that the relative-entropy function H_x(y) satisfies, along the solutions
of (8.1), Ḣ_x(y) < 0 (or Ḣ_x(y) > 0) on B(x) ∩ Q_x, where Q_x is the domain of definition of H_x.
Since xT F e (y) is the inner product of the two vectors, geometrically, for asymptotic stability, condition xT F e (y) >
0 asks that the growth-rate vector F e (y) makes an acute angle with the population state vector x at any y near x.
If this happens, this guarantees a local drift toward x. Note that in the special case of RD dynamics this geometric
condition is equivalent to evolutionary stability (ESS). This can be seen as follows: x ∈ ∆ESS iff J (x, y) < J (y, y) or
xT J(y) < yT J(y), for all nearby y 6= x (by Proposition 6.6). Thus for the RD dynamics with the special growth-rate
function F^e as excess fitness we have
x^T F^e(y) = x^T ( −J(y) + J̄(y) 1_m ) = −x^T J(y) + y^T J(y)
since x^T 1_m = 1 and J̄(y) = y^T J(y). Thus x is ESS iff x^T F^e(y) > 0, for all nearby y ≠ x, and this implies that x
is asymptotically stable for (8.1). Hence the result in Proposition 8.2 is a generalization of Proposition 7.12 to any
regular growth-rate dynamics.
The next property relates the growth-rate to the cost function and enables proving the connection between the dy-
namic properties and the game theoretic criteria.
Definition 8.3. A regular growth-rate function F^e : X → R^m is called payoff monotonic or cost monotonic if
F_j^e(x) > F_k^e(x) ⟺ J_j(x) < J_k(x), ∀ j, k ∈ M, ∀x ∈ ∆
i.e., strategies with lower cost (higher payoff) have higher growth rates. The associated population dynamics (8.1) is called payoff (cost) monotonic
dynamics. Note that all payoff-monotonic selection dynamics have the same set of stationary states or equilibrium states.
Proposition 8.4.
∆eq = {x ∈ ∆ | J j (x) = J(x), ∀ j ∈ sup(x)}
is the set of stationary states under any payoff-monotonic selection dynamics, (8.1).
Proof: Let an arbitrary x ∈ ∆eq. Then J_j(x) = J̄(x) for all j ∈ sup(x). By monotonicity (see Definition 8.3), it
follows that there exists some α ∈ R such that F_j^e(x) = α for all j ∈ sup(x). Then x^T F^e(x) = ∑_{j∈sup(x)} F_j^e(x) x_j =
α, which by orthogonality (Definition 8.1) means that α = 0. Thus x is an equilibrium point of (8.1).
Conversely, let some y ∈ ∆ be an equilibrium point of some payoff monotonic dynamics (8.1). Then, for all j ∈
sup(y), we have F je (y) = 0. Then by monotonicity, it follows that there exists some γ ∈ R such that for all j ∈
sup(y), J(e_j, y) = γ, or J_j(y) = γ. Then
J̄(y) = y^T J(y) = ∑_{j∈M} y_j J_j(y) = ∑_{j∈sup(y)} y_j J_j(y) = γ
hence y ∈ ∆eq.
In this section we consider the specific mechanism that leads to such a dynamics in the form of a revision protocol.
The following sections will show that specific examples of dynamics can be obtained by different such revision pro-
tocols, of which some (imitation based dynamics) fall into the broader class of payoff monotone dynamics discussed
in the previous section. Other examples such as the best response-dynamics (when direct selection protocols are
used) need special analysis and results exist only for special cases. We shall only briefly discuss these in these notes.
Suppose that a large population of agents (computers, firms, etc) recurrently play a game. We assume that these
players or agents interact forever and “live" forever. Each agent adopts a pure strategy which it uses for some time
interval and from time to time reviews his strategy and based on this can change it by using a fixed revision protocol.
Since there are many agents one expects the stochastic influences to be averaged away, leaving the aggregate be-
havior to evolve in an essentially deterministic fashion. Such a deterministic evolutionary process can be described
by a system of ODEs, which can be derived from the population game and revision protocol and is called the mean
dynamics. In fact in large but finite populations, over finite time spans, the Markov process converges to a solution
trajectory of the mean dynamics as the population size becomes arbitrarily large. While this deterministic approxi-
mation result does not hold over infinite time spans, the infinite-horizon behavior of the Markov process can still be
described through an analysis of the mean dynamics, [129]. We derive the mean dynamics and we present examples
illustrating well-known dynamics from the literature.
As typical assumptions, inertia and myopia are imposed. By inertia, we mean that individual agents do not contin-
ually reevaluate their choices in the game, but instead reconsider their strategies only sporadically. By myopia, we
mean that revising players will condition their choices on current behavior and payoff (cost) opportunities; they do
not attempt to incorporate beliefs about the future course of play into their decisions. Myopic play is a reasonable
assumption in large population interactions; individual members become anonymous and punishments, reputations,
and other possibilities that are central to the theory of repeated game can be ignored. These features, together with
the basic idea that strategies (actions) which are more “fit” tend over time to displace less fit actions, actually de-
scribe the setting of an evolutionary game, [55]. Inertia and myopia are built into the revision protocol: agents
wait a random amount of time before they consider switching strategies, and their decisions at such moments only
condition on current payoffs and the current social state.
We need two elements to model such revision protocol or revision dynamics: firstly the time rate at which this
strategy is reviewed (as given by each agent’s stochastic alarm clock) and secondly the conditional switch rate or
switch probability of the reviewing agent. The times between rings of an agent’s clock are independent, each with a
rate R exponential distribution (this models a Poisson alarm clock). When such a clock rings for an agent that uses
pure strategy j ∈ M, (called a j-strategist), he will revise his strategy. A j-strategist will switch to some strategy
k 6= j, with conditional switch rate p j,k . Again to indicate dependence on the state of the population we use p j,k (x).
Assume that R satisfies
max_{x, j} ∑_{k≠j} p_{j,k}(x) ≤ R.
When the "revision clock" of a j-strategist rings, he switches to a strategy k ≠ j with switch probability p_{j,k}/R.
Note that the switch probability is p j,k itself when R = 1, otherwise, the switch probability is proportional to p j,k .
When revision clock rings, the j-th agent continues to play strategy j with probability equal to 1 − ∑k6= j p j,k /R.
Remark 8.5. Note that when ∑_k p_{j,k} = R (called an exact revision protocol), this gives p_{j,j}/R as the probability
of continuing to play strategy j. For such an exact revision protocol it can be assumed that R = 1, since an equal but
non-unitary review rate r_j = R > 1 leads to similar dynamics, with just a velocity scaling by 1/R.
In the following we describe the expected population (evolutionary) dynamics under such revision protocol and
switching of strategies. Assume a population of agents and let x j denote the fraction of j ∈ M strategists in the
population, x ∈ ∆ denote the population state. As stated above, the times between revision rings of each agent’s
clock are independent and follow a rate R exponential distribution. Then a basic result from probability theory shows
that the number of review rings during time interval [0,t ] follows a Poisson distribution with mean Rt. Let us
define the expected motion of the stochastic process over the next dt time interval. During the next dt (small) time
interval the expected number of revisions that an agent receives is R dt. Assume that all agents Poisson processes are
statistically independent; then when state is x, the expected number of revisions that j-strategists (aggregate) receive
is approximately x j R dt, where we normalized to the population size. Since a j-strategist that receives a revision
opportunity switches to strategy k with probability p j,k /R, the expected number of such switches from a j-strategy
to strategy k during the next dt time units is x j p j,k dt. Thus overall during the time dt the fraction of j-strategists in
the population changes by dx_j (net), which is equal to
ẋ_j = ∑_{k∈M} x_k p_{k,j}(x) − x_j ∑_{k∈M} p_{j,k}(x), ∀ j ∈ M   (8.2)
This is called the aggregate (mean) dynamics generated by the revision protocol and the switching of strategies by agents.
The deterministic mean (evolutionary) dynamics describes the process under the revision protocol: the first term
captures switches from other strategies to strategy j (inflow), while the second captures switches from strategy j to
others (outflow). Thus for a large (continuum) of agents, by the law of large numbers we can model these aggregate
stochastic processes as deterministic flows.
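A compact sketch of how the mean dynamics (8.2) is assembled from a conditional switch-rate matrix p_{j,k}(x) is given below (illustrative only, assuming the symmetric matrix-game setting J(x) = A x with a hypothetical A); it also checks that the pairwise proportional imitation protocol discussed in the next section reproduces the replicator vector field.

    import numpy as np

    def mean_dynamics(x, p):
        """Mean dynamics (8.2): xdot_j = sum_k x_k p_{k,j}(x) - x_j sum_k p_{j,k}(x)
        (inflow into j minus outflow from j)."""
        P = p(x)                                 # P[j, k] = conditional switch rate j -> k
        return x @ P - x * P.sum(axis=1)

    A = np.array([[1.0, 3.0, 0.0],               # an illustrative symmetric-game cost matrix
                  [0.0, 2.0, 4.0],
                  [2.0, 1.0, 1.0]])

    def pairwise_proportional_imitation(x):
        J = A @ x                                # J_j(x): cost of pure strategy j
        diff = J[:, None] - J[None, :]           # diff[j, k] = J_j(x) - J_k(x)
        return x[None, :] * np.maximum(diff, 0)  # p_{j,k}(x) = x_k [J_j(x) - J_k(x)]_+

    x = np.array([0.5, 0.3, 0.2])
    J = A @ x
    rd = x * (x @ J - J)                         # replicator field (Jbar - J_j) x_j
    print(mean_dynamics(x, pairwise_proportional_imitation).round(6))
    print(rd.round(6))                           # the two vectors coincide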
We will see that various mean dynamics subsume the RD replicator dynamics. In the previous chapter we saw that
Lyapunov stability of equilibrium of RD dynamics (in continuous time) implies NE, and asymptotic stability implies
some robustness of this NE, so these results will hold in these cases too. Once one specifies an explicitly dynamic
model of behavior, the natural approach is to determine where the dynamics leads when set in motion from various
initial conditions. If equilibrium occurs as the limiting state of this adjustment process, we can feel some confidence
in predicting equilibrium play. If instead our dynamics lead to limit cycles or other more complicated limit sets, then
these sets rather than the unstable equilibrium points provide superior predictions of behavior.
Let us consider a few examples of mean dynamics that arise from adaptation by (myopic) imitation. Under an
imitative protocol, an agent who plays strategy j receives a revision opportunity and chooses an opponent at random
to observe his strategy. If his opponent uses strategy k, the agent switches from j to k with probability proportional
to the frequency x_k of strategy k and to a function ζ that depends on the cost and the state x, i.e.,
p_{j,k}(x) = x_k ζ_{j,k}( J(x), x ), ∀ j, k ∈ M
This is the general form of such an imitative protocol. Once the form of the ζ function is specified, the protocol is
completely defined.
Suppose that after selecting an opponent, the agent imitates the opponent only if the opponent's cost is lower than
his own, and does this with probability proportional to the cost difference, so that
p_{j,k}(x) = x_k [ J(e_j, x) − J(e_k, x) ]_+, ∀ j, k ∈ M
where in this case ζ(v) = [v]_+, with [v]_+ = v if v > 0 and [v]_+ = 0 otherwise (pairwise proportional imitation).
Under these assumptions the mean dynamics (8.2) becomes
ẋ_j = x_j ∑_{k∈M} x_k ( J(e_k, x) − J(e_j, x) ) = ( J̄(x) − J_j(x) ) x_j = F_j^e(x) x_j   (8.3)
i.e., exactly the replicator dynamics (7.3).
Suppose that when a strategy j agent receives a revision opportunity, he opts to switch strategies with a probability
that is linearly increasing in his current cost. For example, agents might revise when their costs (payoffs) do not
meet a uniformly distributed random aspiration level. In the event that the agent decides to switch, he imitates a
randomly selected opponent. This leads to the revision protocol
p_{j,k}(x) = x_k ( γ + J(e_j, x) ), ∀x ∈ ∆, ∀ j, k ∈ M
where the constant γ is sufficiently large that ( γ + J (e j , x) ) is always positive. In this case ζ (v) = γ + v. Under
these assumptions the mean dynamics (8.2) becomes
ẋ_j = ∑_{k∈M} x_k x_j ( γ + J(e_k, x) ) − x_j ∑_{k∈M} x_k ( γ + J(e_j, x) )
or
x˙j = F je (x) x j (8.4)
where
F_j^e(x) = ∑_{k∈M} x_k J(e_k, x) − J(e_j, x) = J(x, x) − J(e_j, x)
again the replicator dynamics (RD); hence all RD results are valid for this type of imitation.
These types of imitation dynamics can be extended by using any continuously differentiable (C1) function ζ that is
0 on (−∞, 0] and strictly increasing over [0, ∞). For example, in the case of pairwise proportional imitation, if instead of
[·]_+ one uses such a ζ, one obtains (8.3) where
F_j^e(x) := − ∑_{k∈M} x_k ( ζ( J(e_j − e_k, x) ) − ζ( J(e_k − e_j, x) ) )
If ζ is strictly increasing over the range of possible cost differences in the game, then again payoff-monotonic growth
rates are obtained and all results apply as in the RD case. Note that such an assumption does not require an agent to know
the expected cost (payoff) of his current pure strategy and the population state. It is sufficient if some or all agents
have some empirical noisy data on this expected cost. Recall from the previous chapter that the entropy-like function
V (x) = Hx∗ (x) was used as Lyapunov function for the RD dynamics, so all imitation dynamics can be treated in the
same way, and convergence shown for strictly stable games, as well as for potential games (Section 7.4 and 7.5).
Under imitative protocols, agents obtain candidate strategies by observing the behavior of randomly chosen opponents.
In settings where agents are aware of the full set of available strategies, we can assume instead that they
choose candidate strategies directly, without having to see the choices of others. These direct protocols generate
dynamics that differ from those generated by imitative protocols. Their analysis typically requires methods that
apply to non-smooth functions or set-valued mappings. General results exist only for special classes of games, for
example potential games, stable games, or supermodular games. For details on these see [129].
First consider the case where
p_{j,k}(x) := q_k(x), \quad \text{for } k \neq j
so that p_{j,k} is not weighted by x_k (which means the target strategy can have x_k = 0) and is independent of j, as seen in the notation q_k(x). The mean dynamics is obtained from (8.2) as
\dot x_j = \sum_{k \in M} x_k p_{k,j}(x) - x_j \sum_{k \in M} p_{j,k}(x) = q_j(x) - x_j \sum_{k \in M} q_k(x)
where we used the fact that p_{k,j}(x) = q_j(x) is independent of k and \sum_{k \in M} x_k = 1. Note that for those cases where
\sum_{k \in M} q_k(x) = 1, this becomes in vector form
\dot x = q(x) - x
hence the name target dynamics, since f(x) = q(x) - x. The most famous example is the Brown-von Neumann-Nash
(BNN) dynamics.
Example 8.6. (Brown-von Neumann-Nash (BNN) dynamics) Consider a protocol where comparison is done with
respect to the average cost in the population. When an agent's clock rings, he chooses a strategy at random; if that
strategy's cost is below average, the agent switches to it with probability proportional to its extra cost saving, i.e.,
p_{j,k}(x) = q_k(x) = [\, J(x, x) - J(e_k, x) \,]_+
Note that p_{j,k} is not weighted by x_k and is independent of j, as seen in the notation q_k(x). The mean dynamics is
obtained from (8.2),
\dot x_j = \sum_{k \in M} x_k p_{k,j}(x) - x_j \sum_{k \in M} p_{j,k}(x) = q_j(x) - x_j \sum_{k \in M} q_k(x)
This is called the Brown-von Neumann-Nash (BNN) dynamics, written as
\dot x_j = [\, J(x, x) - J(e_j, x) \,]_+ - x_j \sum_{k \in M} [\, J(x, x) - J(e_k, x) \,]_+
For the BNN dynamics, convergence can be shown in the case of stable games by using the Lyapunov function
V(x) = \sum_{j \in M} [\, J(x, x) - J_j(x) \,]_+^2.
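A minimal sketch (purely illustrative, with an arbitrary cost matrix) of the BNN target protocol and the resulting mean dynamics, in the cost convention used above:

import numpy as np

A = np.array([[1.0, 3.0, 2.0],
              [2.0, 1.0, 3.0],
              [3.0, 2.0, 1.0]])   # hypothetical linear costs J(e_j, x) = (A x)_j

def bnn_field(x):
    J_pure = A @ x
    J_avg = x @ J_pure
    q = np.maximum(J_avg - J_pure, 0.0)      # q_j(x) = [ J(x,x) - J(e_j,x) ]_+
    return q - x * q.sum()                    # xdot_j = q_j(x) - x_j * sum_k q_k(x)

x = np.array([0.6, 0.3, 0.1])
for _ in range(5000):
    x += 0.01 * bnn_field(x)
print(x)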
Consider now a protocol based on pairwise comparison of costs (as in pairwise proportional imitation) but not weighted by x_k, i.e., such that
p_{j,k}(x) = [\, J(e_j, x) - J(e_k, x) \,]_+
so that the agent selects a strategy at random again, but he switches from strategy j to k only if the new strategy's cost is
lower than his current strategy's cost, with probability proportional to the difference between the two costs. The
resulting mean dynamics is immediately obtained from (8.2) as
\dot x_j = \sum_{k \in M} x_k [\, J(e_k, x) - J(e_j, x) \,]_+ - x_j \sum_{k \in M} [\, J(e_j, x) - J(e_k, x) \,]_+
and is called the Smith dynamics. As for the BNN dynamics, for the case of stable games, convergence under the
Smith dynamics can be shown by using the Lyapunov function V(x) = \sum_{j \in M} \sum_{k \in M} x_j [\, J_k(x) - J_j(x) \,]_+^2, [129].
Example 8.7. (Logit dynamic) Another example is the logit (exponential) dynamic, related to the perturbed best-response \widetilde{BR} dynamics used in the context of repeated games. In this case
p_{j,k}(x) = [\, \beta^{\varepsilon}( -J(x) ) \,]_k = \frac{ \exp( -\varepsilon^{-1} J_k(x) ) }{ \sum_{l \in M} \exp( -\varepsilon^{-1} J_l(x) ) }
where J_k(x) = J(e_k, x), \beta^{\varepsilon} is the soft-min (max) or logit function, and \varepsilon > 0 is called the temperature or the noise
level. If \varepsilon is large, choice probabilities under the logit rule are nearly uniform. But if \varepsilon is near zero, choices are
optimal with probability close to one. In this case, after a rescaling, the mean dynamics generated is
\dot x_j = [\, \beta^{\varepsilon}( -J(x) ) \,]_j - x_j
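The logit protocol and the resulting dynamic can be sketched as follows (the cost matrix, noise level ε and step size are illustrative choices, not taken from the notes):

import numpy as np

def beta_eps(v, eps):
    # soft-min / logit choice over the cost vector v: proportional to exp(-v/eps)
    w = np.exp(-(v - v.min()) / eps)          # shift by min(v) for numerical stability
    return w / w.sum()

A = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 2.0],
              [2.0, 1.0, 0.0]])               # hypothetical costs J(e_j, x) = (A x)_j
eps = 0.1

def logit_field(x):
    return beta_eps(A @ x, eps) - x           # xdot = beta^eps(-J(x)) - x  (target dynamics)

x = np.ones(3) / 3
for _ in range(5000):
    x += 0.01 * logit_field(x)
print(x)                                       # approximate rest point (perturbed equilibrium)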
8.6 Best Response (BR) Dynamics
The best response dynamics (BR dynamics) represents a class of strategy updating rules, where players' strategies in
the next round are determined by their best responses. Let us consider the case when the actual switch does not come
by imitation; suppose instead that each agent's revision opportunities arrive at a fixed rate, and that when an agent
receives such an opportunity, he chooses a best response (BR) to the current population state. Thus, we assume that
each agent responds optimally to correct beliefs whenever he is revising, but not necessarily at other points in time.
Consider thus a direct protocol as in Section 8.5, so that p_{j,k}(x) = q_k(x) where \sum_k q_k(x) = 1, and the resulting mean
dynamics is \dot x = q(x) - x. When q(x) = BR(x) = \arg\min_{y \in \Delta} y^T J(x) = \arg\min_{y \in \Delta} J(y, x) this gives the best response
(BR) dynamics. Since this is a multi-valued map when J(y, x) = y^T A x, we write the mean BR dynamics as
\dot x \in BR(x) - x
a differential inclusion. The best response (BR) dynamic was introduced in [59] and further studied by Hofbauer,
[65] who gave the interpretation of the best response dynamics as a differential inclusion. In the context of normal
form games, the best response dynamics can be viewed as a continuous-time version of fictitious play (FP). The
most interesting question is whether there are classes of games for which the BR (or pBR) dynamics is guaranteed
to converge to some NE, and how one would go about characterizing these classes. To date, the two main classes
of games for which the BR dynamics is known to converge are stable games and potential games, [129]. In fact, the
proof of stability for stable games is based on a multivalued Lyapunov function ([129], pp. 258-279).
In [66] convergence is proved for BNN, best response (BR), and perturbed (smoothed) best response (pBR) dynamics
in normal form games with an interior ESS.
In order to avoid the discontinuity, a smoothing of the discontinuous best-response is typically considered by employing a perturbed cost, yielding a perturbed BR dynamics (as we have seen in learning in repeated games),
\widetilde J(y, x) = y^T \widetilde J(x) := y^T J(x) + v(y)
for some admissible smooth and convex perturbation function v. This leads to a perturbed minimizer function
\widetilde{BR} : R^m \to int(\Delta), where this time the function \widetilde{BR} is single-valued, continuous, and even differentiable. The mixed
strategy places most of its mass on the optimal pure strategies, but places positive mass on all pure strategies. This is
the perturbed best-response function for the population. This protocol induces the smoothed or perturbed best response
(pBR) dynamics as its mean dynamic
\dot x = \widetilde{BR}(x) - x
which is exactly as if i = 1 in the repeated game case (N < \infty). When v(y) = \varepsilon \sum_{k \in M} y_k \ln(y_k), the unique minimizer,
i.e., the perturbed best-response function, gives exactly the logit function defined before in Example 8.7, i.e.,
q(x) = \widetilde{BR}(x) := \arg\min_{y \in \Delta} \big( y^T J(x) + v(y) \big) = \beta^{\varepsilon}( -J(x) )
This yields the logit dynamics in Example 8.7 as the leading example of a perturbed BR dynamics. The perturbed
minimizer function \widetilde{BR} is similar to the best-response correspondence BR, except that the function does not "jump"
from one pure strategy to another. The idea is to smooth the discontinuous "step" function, using the logit function
defined before. The difference is illustrated in Figure 8.1, where black represents the best-response correspondence
and the other colors each represent different smoothed best-response functions. In standard BR correspondences,
even the slightest benefit to one action will result in the individual playing that action with probability 1. In perturbed
BR, as the difference between two actions decreases, the individual's play approaches 50:50.
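A tiny numerical sketch of this difference for a single decision maker with two actions and utility gap d between action 1 and action 2 (the noise level eps is an illustrative choice): the exact BR jumps at d = 0, while the logit-smoothed BR passes through 50:50.

import numpy as np

def exact_br_prob1(d):
    # probability of playing action 1 under exact best response to a utility gap d
    return 1.0 if d > 0 else (0.0 if d < 0 else 0.5)

def smoothed_br_prob1(d, eps=0.1):
    # logit smoothing: approaches 0.5 as d -> 0, and the step function as eps -> 0
    return 1.0 / (1.0 + np.exp(-d / eps))

for d in [-0.5, -0.05, 0.0, 0.05, 0.5]:
    print(d, exact_br_prob1(d), round(smoothed_br_prob1(d), 3))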
There are several advantages to using perturbed best response (pBR), both theoretical and empirical. First, it is
consistent with psychological experiments; when individuals are roughly indifferent between two actions they appear
to choose more or less at random. Second, the play of individuals is uniquely determined in all cases, since it is a
function and not a correspondence. Finally, using smoothed best response with some learning rules (as in Fictitious
Play) can result in players learning to play mixed-strategy Nash equilibria (Fudenberg and Levine 1998).
8.7 Notes
Most applications of game theory (CGT) suppose that the play will resemble an equilibrium. The theory of learning
in games (LGT) provides a foundation for equilibrium analysis by examining how and what kind of equilibria can arise. These equilibria arise from letting a non-equilibrium dynamic process of learning (adaptation and/or
imitation) run over a long time. The long-run implications of learning depend on what players observe. Learning is
most effective in strategic-form games with known payoffs and observed actions; this is because players' observations are independent of their own actions, so passive learning is enough to identify the distribution of opponents'
strategies.
Learning through imitation is an important class with consequences for the type and the stability of the equilibrium
eventually reached. We studied Fictitious play (FP) and stochastic FP (sFP) as simple learning rules that assume
that the environment is stationary. In some classes of games sFP leads to convergence to an approximate Nash
equilibrium, so that the stationarity assumption is approximately correct. In other games, sFP leads to stable cycles.
The other class is that of reinforcement learning (RL) models (also covered in the supplementary notes for the finite-N
case). These are among the most widespread adaptation mechanisms in nature: reinforcement learners use their
experience to choose or avoid certain actions based on their immediate consequences, and imitation dynamics are
part of them.
We close these notes by remarking that there are a lot of open research questions on the relation between CGT, EGT
and LGT. A topic of much interest lies in studying the conditions under which the solution concepts derived in each
of these fields coincide, e.g., when does a certain learning rule converge to a Nash equilibrium? Are the dynamics
of a certain evolutionary process formally equivalent to those obtained when the game is played by individuals who
learn to play the game?
At the same time, any problem posed in the context of game theory requires some sort of computation. The identification of best-response strategies, Nash equilibria, evolutionarily stable strategies, or any other solution concept,
is a computational problem in the sense that it requires an algorithm, a procedure to compute the result. Another
current area of research concerns the design and computational complexity of such algorithms.
Chapter 9
Computation of Nash Equilibria: Learning in
Repeated Games
Chapter Summary
This chapter provides an introduction to learning in multi-player repeated games (N < ∞).
9.1 Introduction
In a repeated game, agents or players play the same game repeatedly and learn from previous game stages to update
their strategies, depending on their opponents' actions. Learning in repeated games is an active research area; see
Fudenberg and Levine's book, [57], or Young's, [156]. Several classes of algorithms have been proposed and
analyzed with respect to convergence to Nash equilibria. Most of the results are for special classes of games, such
as zero-sum games, N = 2 or 2 × 2 games, or potential games.
The learning/update process is a stochastic Markov process with noise, which can be treated via a stochastic approximation (SA) approach under a martingale noise assumption, by analyzing the mean ODE (deterministic, continuous
time, on the same space \Delta) obtained by an additional sample averaging to remove the noise. Different kinds of
convergence results are possible: convergence in moves (actions) (finite space A, hence Markov chains and stochastic
stability concepts), versus convergence in mixed strategies or sample averages (continuous space \Delta, hence Markov
processes). The latter is the one that can in turn be treated via stochastic approximation results.
The method of ordinary differential equations (ODE) for analysis of stochastic approximation (SA) algorithms is
covered in several textbooks, one such comprehensive text being the work of Kushner and Yin, [81]. An elegant
system theoretic interpretation is developed by Borkar in [35], which allows the treatment of multiple time-scale
learning. The extension to other cases beyond simple linear or stochastic gradient type algorithms was initiated in
the seminal works of Benaïm, [24],[25], [26]. These extensions have led to very interesting applications of ODE-
based SA results to learning in games (LGT), e.g., by Benaïm and Hirsch, [27], Benaïm and Weibull [23], Hofbauer
and Sandholm, [64], Benaïm and Raimond, [32], Benaïm, Hofbauer and Sorin, [31], and Sandholm, [127].
The ODE-based SA approach has the advantage of capturing dynamical interconnections; the question of conver-
gence of an algorithm can be reformulated as the question of stability of a closed-loop system formed between the
running algorithms of all players. Of particular interest is stability of rest points (equilibria points) that are Nash
equilibria of the underlying game. The main steps of such an SA approach are as follows:
1. Show that the stochastic processes generated by the algorithm are asymptotic pseudo-trajectories of a system
of ODEs (stochastic approximation).
2. Analyze the asymptotic behaviour of the ODE system:
(a) Show that the rest points (equilibrium points) include all NEs.
(b) Show convergence towards fixed points (e.g., using Lyapunov analysis).
(c) Characterize stability/instability of fixed points that are not NEs.
3. Conclude that the stochastic processes cannot converge towards unstable rest points.
Such an approach results in an almost sure (a.s.) convergence result towards an NE.
Applications of the ODE-based SA method have been considered for learning in games in the case of a finite number
of agents (N < \infty), also called the multi-agent learning (MA-LGT) case, as covered in the first part of these notes, as well
as in the case of games played among a large population of agents/players. Games in such a setup, where N \to \infty,
are called population games (POP-GT) or evolutionary games (EVOL-GT) and are studied in another part of
these notes (Chapter 8). The generalization from ODEs to differential inclusions, as developed by Benaïm, Hofbauer
and Sorin in [29], [30], is particularly appropriate for games. This is because in many cases the best-response (BR)
is a set-valued map, so algorithms based on best-response play require such a treatment, (see Benaïm, Hofbauer and
Sorin’s work in [31]). We review very briefly this setup in Appendix A.
An open research problem is how to design learning algorithms with minimum information requirements for each
player, since in many games players have limited capabilities for observation and computation. One of the first
introduced and most studied algorithms is the fictitious-play (FP) algorithm, introduced by Brown, [40]. Since then
many other algorithms have been proposed. In this section we review a few classes of them.
Compared to our notes until now, where we minimize the cost J_i, in this whole section we use the following notational
equivalences (here we use the notation as in learning game theory): (cost) J_i \leftrightarrow U_i (utility), \min J_i \leftrightarrow \max U_i. Also,
the best-response is denoted BR_i (the \Phi_i used until now). For an action, instead of u_i we use a_i, but the same notation e_i
is used to denote the pure strategy.
In the next section we present a unified setup in repeated games, followed by best-response (BR) play and then by
Fictitious play (FP) and Reinforcement-Learning (RL).
9.2 Repeated Games
Consider a repeated game interaction of G, such that at every iteration k \in \{0, 1, 2, \ldots\}, each player i \in N selects an
action a_i^k or pure strategy e_i^k and receives the payoff/reward or utility value \pi_i^k := U_i(a_i^k, a_{-i}^k) or \pi_i^k := U_i(e_i^k, e_{-i}^k).
Here the payoff or utility value \pi_i^k = U_i(e^k) = U_i(e_i^k, e_{-i}^k) is the value of the utility function evaluated at the
joint action profile a^k or the pure-strategy profile e^k of all players, hence the payoff value (reward) is a function of
time (iteration). Each player P_i, i \in N, chooses action a_i^k (pure strategy e_i^k) as the j-th action, i.e., e_i^k = e_{ij}, according to
the probability distribution (vector) x_i^k. This selection is a function of the information and observation available to
player i \in N up to iteration k, both depending on the learning process.
The modelling of the learning process for player P_i can be described in a unified manner as follows. Let \omega_i denote
his observation from the environment, and z_i \in Z_i some overall internal state variable that encompasses all his internal
calculations (in some cases his "belief"). The overall state for all players is z = (z_i, z_{-i}) \in Z, and z^k is the N-tuple
of internal variables of all players at time k. Each player P_i updates this internal state z_i based on \omega_i (and maybe
previous values of z_i), then calculates/updates his strategy x_i based on his internal state, hence x_i can be regarded as
an output. He selects his action e_i based on this x_i. Player P_i's processing can be written as
(\Sigma_i^d): \quad z_i^{k+1} = z_i^k + \gamma_k f_i^d(z_i^k, \omega_i^k), \qquad x_i^k = \sigma_i(z_i^k)   (9.1)
where f_i^d denotes the input-to-state map, \sigma_i the state-to-output map and \gamma_k \in (0, 1] a step-size or relaxation parameter.
We note that \omega_i^k is a function of the random variables e_j^k, j \in N, based on x_j^k, hence a function of z_j^k, j \in N, or z^k. Thus
\{\omega_i^k\} is a correlated stochastic process, \omega_i^k = -\Psi_i^d(z^k) for some map \Psi_i^d. After appropriate averaging, (9.1) leads
to a deterministic continuous-time approximation (ODE), obtained as follows. Let v_i^k := f_i^d(z_i^k, \omega_i^k). The right-hand side of (9.1) can be separated into a deterministic and a stochastic component,
RHS = \gamma_k \big( E[\, v_i^k \mid F_k \,] + \xi_i^k \big)
where \xi_i^k := v_i^k - E[\, v_i^k \mid F_k \,]. One needs to compute the mean E[\, v_i^k \mid F_k \,]. In the simplest case, when this mean
depends only on z_i^k and \bar\omega_i^k (independent), as in the FP/SFP case, i.e., E[\, v_i^k \mid F_k \,] := f_i(z_i^k, \bar\omega_i^k), this leads to
(\Sigma_i): \quad \dot z_i = f_i(z_i, \bar\omega_i), \qquad x_i = \sigma_i(z_i)   (9.2)
[Figure: equivalent block-diagram representations of the players' processing systems \Sigma_1, \ldots, \Sigma_N and of the overall closed-loop interconnection through the feedback maps \varphi_i(\cdot), \Psi_i(\cdot); panels (a) per-player view, (b) overall system, with the integrator 1/s isolated on the feedforward path.]
Using \bar\omega_i = -\Psi_i(z) in (9.2) (interconnection) and letting f_i(z_i, -\Psi_i(z)) := f_i(z_i, z_{-i}), leads to the state-space model of
the overall closed-loop interconnected system \Sigma_{CL},
(\Sigma_{CL}): \quad \dot z_i = f_i(z_i, -\Psi_i(z)) := f_i(z_i, z_{-i}), \quad x_i = \sigma_i(z_i), \quad i \in N, \qquad \text{or in vector form} \quad \dot z = f(z), \quad x = \sigma(z)   (9.3)
shown in the equivalent figures below where the integrator is isolated on the feedforward path.
The goal is to show that ΣCL has as equilibrium zeq such that xeq = σ (zeq ) is an NE, x∗ , and to show that this
equilibrium is asymptotically stable, hence the algorithm converges to an NE.
Depending on how much information is available and what observations are made, different learning algorithms
can be obtained. Different algorithms will have different \omega_i, z_i, f_i^d and \sigma_i, hence different f_i and \bar\omega_i. For example,
\omega_i^k = x_{-i}^k = \bar\omega_i^k in the BR-play case (feedback/interconnection with the other players' P_{-i} processing), while \omega_i^k =
U_i(e_i^k, e_{-i}^k) := \pi_i^k (realized payoff) and \bar\omega_i^k = U_i(x_i^k, x_{-i}^k) in reinforcement learning (RL). The state z_i can be just
xi (individual state in case of perfect information), some extra state (in case of partial information, the estimate
of other players’ actions, or of own payoff function) or a combination of them. For each player Pi , sometimes we
prefer to separate the processing in Σi into a strategy decision part Si and possibly an estimator part Ei and we use
σi : Zi → ∆i as a strategy decision map, from zi to xi .
9.3 Best-Response (BR) Type Learning
We first discuss the Best-Response (BR) play, which is a case of perfect (full) information (FI), where \omega_i = x_{-i} and
all other information (cost/utility function) is known by player P_i.
Recall that an NE x^* = (x_i^*, x_{-i}^*) is a solution of x^* \in BR(x^*), or 0 \in BR(x^*) - x^* := \Psi(x^*), i.e., for each i \in N,
x_i^* \in BR_i(x_{-i}^*)
This first form leads to set-valued dynamical systems modelled by differential inclusions.
To avoid working with set-valued maps, a perturbed best-response can be defined as follows. Consider a perturbed
utility function
\widetilde U_i(x_i, x_{-i}) = U_i(x_i, x_{-i}) - \nu_i(x_i)
where \nu_i(x_i), \nu_i : int(\Delta_i) \to R, is a penalty function, i.e., a deterministic smooth, positive definite function, so that the
perturbed utility function is strictly concave. Then the perturbed best-response, denoted by \widetilde{BR}_i : \Delta_{-i} \to \Delta_i,
results in a single-valued function. The best-known example of penalty function \nu_i : \Delta_i \to R is the Gibbs entropy,
\nu_i(x_i) = \varepsilon \sum_{j \in M_i} x_{ij} \ln(x_{ij}), which yields the logit form
\widetilde{BR}_i(x_{-i}) = \beta^{\varepsilon}( U_i(x_{-i}) ), \qquad [\, \beta^{\varepsilon}( U_i(x_{-i}) ) \,]_j = \frac{ \exp( \varepsilon^{-1} U_i(e_{ij}, x_{-i}) ) }{ \sum_{l \in M_i} \exp( \varepsilon^{-1} U_i(e_{il}, x_{-i}) ) }   (9.6)
where U_i(x_{-i}) = [U_i(e_{i1}, x_{-i}), \ldots, U_i(e_{im_i}, x_{-i})]^T. As the temperature parameter \varepsilon \to 0, this approaches the best-response set BR_i(x_{-i}).
The use of perturbed best-response means that, in general, Nash equilibria are no longer fixed points of the overall
smooth best-response function. An alternative Nash distribution can be similarly defined, as a fixed point of this
perturbed best-response, x^* = \widetilde{BR}(x^*), i.e., for each i \in N,
x_i^* = \widetilde{BR}_i(x_{-i}^*)
This perturbed BR leads to dynamical systems modelled by ODEs, which we prefer to use. Note that x^* = (x_i^*, x_{-i}^*)
is a zero of the function \Psi : \Delta \to \Delta, \Psi = (\Psi_1, \ldots, \Psi_i, \ldots, \Psi_N), \Psi_i(x) = -\widetilde{BR}_i(x_{-i}) + x_i.
A recursive iterative scheme that finds a zero of \Psi, or a fixed point of \widetilde{BR}, can be written as
x^{k+1} = (1 - \alpha_k) x^k + \alpha_k \widetilde{BR}(x^k)
[Figure 9.3: block diagram of player P_i's BR-play processing: \widetilde{BR}_i driven by \omega_i = x_{-i}, followed by a 1/(s+1) block producing z_i = x_i.]
x^{k+1} - x^k = \alpha_k \big( \widetilde{BR}(x^k) - x^k \big) := -\alpha_k \Psi(x^k)   (9.7)
for some relaxation parameter \alpha_k \in (0, 1]. Note that a fixed point (equilibrium) x_{eq} of this recursive (difference)
equation is indeed a zero of \Psi, or a fixed point of \widetilde{BR}.
Writing (9.7) component-wise for each player i, with -\Psi_i(x) = \widetilde{BR}_i(x_{-i}) - x_i, leads to
x_i^{k+1} - x_i^k = \alpha_k \big( \widetilde{BR}_i(x_{-i}^k) - x_i^k \big), \quad \forall i \in N   (9.8)
which is a dynamic (parallel-update) discrete-time process for all players, called a (relaxed) BR-iterative scheme or
BR-play. This can be approximated by a continuous-time (CT) system,
\dot x_i = \widetilde{BR}_i(x_{-i}) - x_i
where x_i = x_i(t), \dot x_i = dx_i/dt, which is written in vector form as the BR-dynamics
\dot x = \widetilde{BR}(x) - x
Note that in (9.8) all players are synchronously updating their strategy, assuming full knowledge of their utility (BR-
map), as well as all the other players’ strategies, xk−i . Hence we can call this case a BR-based full information (FI)
scheme.
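A minimal sketch of the relaxed BR-play (9.7)-(9.8) for a hypothetical 2-player matrix game, using a logit perturbed best response; the payoff matrices, step-size and noise level are arbitrary choices for illustration.

import numpy as np

def logit_br(u, eps=0.05):
    # perturbed best response to the utility vector u (soft-max)
    w = np.exp((u - u.max()) / eps)
    return w / w.sum()

# hypothetical payoff matrices: U1[j, l] = U_1(e_{1j}, e_{2l}), U2[j, l] = U_2(e_{1j}, e_{2l})
U1 = np.array([[2.0, 0.0], [0.0, 1.0]])
U2 = np.array([[1.0, 0.0], [0.0, 2.0]])

x1 = np.array([0.9, 0.1]); x2 = np.array([0.2, 0.8])
alpha = 0.1                                   # relaxation parameter alpha_k (kept constant here)
for _ in range(500):
    x1_new = (1 - alpha) * x1 + alpha * logit_br(U1 @ x2)    # (9.8) for player 1
    x2_new = (1 - alpha) * x2 + alpha * logit_br(U2.T @ x1)  # (9.8) for player 2
    x1, x2 = x1_new, x2_new
print(x1, x2)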
Let \omega_i = x_{-i}, z_i = x_i for player P_i, so that the CT version of (9.8) can be written as
(\Sigma_i): \quad \dot z_i = -z_i + \widetilde{BR}_i(\bar\omega_i) := f_i(z_i, \bar\omega_i), \qquad x_i = z_i := \sigma_i(z_i)   (9.11)
where the input is \bar\omega_i = \omega_i = x_{-i}, the output is x_i and the internal state is z_i, as in the equivalent figures above and
below, where \Psi_i(z_i, x_{-i}) = z_i - \widetilde{BR}_i(x_{-i}). Thus the overall state space for player P_i is \Delta_i. Note that, assuming all
other players P_{-i} have a similar algorithm (processing system), \bar\omega_i = x_{-i} = z_{-i}. Thus interconnecting \Sigma_i of P_i with
\Sigma_{-i} of the other players P_{-i}, via \bar\omega_i = x_{-i} = z_{-i}, leads to the closed-loop system (9.9), with \Delta the overall state space
for all players.
This will be the case one starts from even in the case of partial information (PI). We discuss the Fictitious Play (FP)
case in the next section.
[Figure 9.4: equivalent block-diagram representations (a), (b) of player P_i's BR-play processing, with the integrator (1/(s+1), respectively 1/s) on the feedforward path and the feedback through \widetilde{BR}_i, respectively \Psi_i(\cdot), driven by \omega_i = x_{-i}.]
Recall (9.10) and assume now that player P_i, instead of having access at each stage to all other players' strategies
as in \omega_i^k = x_{-i}^k, has access to all other players' actions, i.e., \omega_i^k = a_{-i}^k, or \omega_i^k = e_{-i}^k. This leads to the fictitious play
(FP) algorithm, one of the most studied algorithms. FP is inspired by playing an (optimal) best-response (BR); it
assumes players play a best-response against beliefs about opponents, which are constructed from the average past
play of others. While an actual true best-response requires knowledge of all other players' mixed strategies, x_{-i},
the FP algorithm constructs an estimate of these mixed strategies, denoted \hat x_{-i}, by computing empirical historical
frequencies of each action. The assumption is that each player knows (observes) the actions played by each of his
opponents and uses the history of actions to build these estimates; he plays an action from the best-response set
against these estimates (hence fictitious play). FP thus requires knowledge of the player's own utility/cost function as
well as the history of his opponents' actions. Thus we assume that each player knows the functional form (analytical
structure) of his utility function U_i and can observe the actions of each of the other players, e_{-i}^k, at each step in order to
decide his next action, a_i^k or e_i^k, for all i \in N. In classical fictitious play (FP) a best-response rule is used
(the set-valued BR map BR_i^p), while in stochastic or smooth FP (SFP) a best-response function \widetilde{BR}_i is used. In this
section, for SFP we follow the approach of Benaïm and Hirsch in [27] and Hofbauer and Sandholm, [64], and for
FP we follow [29].
At iteration k player P_i selects the action a_i^k = j, or the pure strategy e_{ij}, based on a best-response to the empirical
frequency \hat x_{-i}^k of its opponents (which is his belief). In classical fictitious play (FP) this is selected as an exact best
response, i.e., at time k player P_i chooses e_i^k as a pure best-response to \hat x_{-i}^k (hence he plays "pure"),
e_i^k \in BR_i^p( \hat x_{-i}^k )   (9.12)
In stochastic fictitious play (SFP), player P_i selects e_i^k as some e_{ij} with probability x_{ij}^k,
P[\, e_i^k = e_{ij} \,] = x_{ij}^k,   (9.13)
which is obtained from a (perturbed) best-response to \hat x_{-i}, i.e., x_{ij}^k = [\, x_i^k \,]_j, where
x_i^k = \widetilde{BR}_i( \hat x_{-i}^k )   (9.14)
\widetilde{BR}_i( \hat x_{-i} ) := \arg\max_{x_i \in \Delta_i} \widetilde U_i(x_i, \hat x_{-i}), \quad \text{where} \quad \widetilde U_i(x_i, \hat x_{-i}) = U_i(x_i, \hat x_{-i}) - \nu_i(x_i)   (9.15)
or via a stochastic perturbation, [27]. In this stochastic case, consider that payoffs are perturbed by zero-mean i.i.d.
noise \varepsilon_{ij'}, yielding \widetilde U_i^{\varepsilon},
\widetilde U_i^{\varepsilon}(e_{ij'}, \hat x_{-i}) = U_i(e_{ij'}, \hat x_{-i}) + \varepsilon_{ij'}
and
[\, \widetilde{BR}_i^{\varepsilon}( \hat x_{-i} ) \,]_j := P\big[ \arg\max_{j' \in M_i} \widetilde U_i^{\varepsilon}(e_{ij'}, \hat x_{-i}) = j \big], \quad j \in M_i   (9.16)
which satisfies
[\, \widetilde{BR}_i^{\varepsilon}( \hat x_{-i} ) \,]_j = [\, \widetilde{BR}_i( \hat x_{-i} ) \,]_j   (9.17)
for some admissible \nu_i, so that the form in (9.15) is preferred. For \nu_i entropy-related, the logit function is obtained.
Thus in SFP, from (9.13)-(9.17), action j, or e_{ij}, is chosen with probability x_{ij}^k = [\, \widetilde{BR}_i( \hat x_{-i}^k ) \,]_j for \widetilde{BR}_i as in (9.15).
Having set the strategy decision rule, the next issue is how to update the internal state variable (belief) z_i of each
player P_i so that it estimates x_{-i}. Consider the empirical frequency (time-average) vector of prior actions of player
P_i after k stages (games), denoted by \bar e_i^k and defined as
\bar e_i^k = \frac{1}{k} \sum_{k'=0}^{k-1} e_i^{k'} \in \Delta_i   (9.18)
Thus \bar e_i^k is a vector with the j-th component giving the frequency (proportion of times) that player P_i has played
action a_i = j (the j-th action) in the first k games (stages). Since \bar e_i^k \approx E(e_i) = E(a_i) = x_i, we can take \bar e_i^k as an
estimate \hat x_i of x_i, i.e., \hat x_i^k := \bar e_i^k.
Player i, P_i, computes for each other player i' \neq i, i' \in N, the estimate \hat x_{i'}^k of the probability distribution x_{i'}, as given by
the time-average (empirical frequency) \bar e_{i'}^k,
\hat x_{i'}^k := \bar e_{i'}^k = \frac{1}{k} \sum_{k'=0}^{k-1} e_{i'}^{k'}
which can be written recursively as
\hat x_i^{k+1} = \frac{k}{k+1} \hat x_i^k + \frac{1}{k+1} e_i^k   (9.20)
and
\hat x_{-i}^{k+1} = \frac{k}{k+1} \hat x_{-i}^k + \frac{1}{k+1} e_{-i}^k   (9.21)
Collecting \hat x_i^k = \bar e_i^k for all players gives the vector (N-tuple) \hat x^k = \bar e^k \in \Delta with the players' empirical frequencies
at time k, \bar e^k = (\bar e_1^k, \ldots, \bar e_i^k, \ldots, \bar e_N^k) \in \Delta, \bar e^k = (\bar e_i^k, \bar e_{-i}^k) \in \Delta_i \times \Delta_{-i} = \Delta, or \hat x^k = (\hat x_i^k, \hat x_{-i}^k) \in \Delta_i \times \Delta_{-i} = \Delta, which
satisfies the recursion
\hat x^{k+1} = \frac{k}{k+1} \hat x^k + \frac{1}{k+1} e^k   (9.22)
Thus \{\bar e^k\} = \{\hat x^k\} can be seen as the state sequence of the infinitely repeated game. Therefore, the state sequence
\{\hat x^k\} is a non-stationary discrete-time Markov process, with values in the compact, convex set \Delta. A sequence of
games (stages) produces a sequence of action profiles \{a^k\} and the equivalent sequence of vertices \{e^k\} in \Delta.
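A quick illustrative check (not from the original notes) that the running-average recursion (9.20)/(9.22) reproduces the batch empirical frequency (9.18); the action sequence is randomly generated for the example.

import numpy as np

m = 3
rng = np.random.default_rng(0)
actions = rng.integers(0, m, size=50)          # a hypothetical sequence of pure actions of one player

xhat = np.zeros(m)                             # recursive estimate
for k, a in enumerate(actions):
    e = np.zeros(m); e[a] = 1.0                # vertex e^k
    xhat = (k / (k + 1)) * xhat + (1 / (k + 1)) * e   # recursion (9.20)

batch = np.bincount(actions, minlength=m) / len(actions)   # direct time-average (9.18)
print(np.allclose(xhat, batch))                # True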
In SFP, using (9.13), (9.14), it follows that
P[\, e_i^k = e_{ij} \mid \hat x^k \,] = [\, \widetilde{BR}_i( \hat x_{-i}^k ) \,]_j   (9.23)
so that, in vector form,
E( e^k \mid \hat x^k ) = \widetilde{BR}( \hat x^k )   (9.25)
Thus, with \xi^k := e^k - E( e^k \mid \hat x^k ),
E( \xi^k \mid \hat x^k ) = 0
and (9.22) is a stochastic approximation recursion with step-size 1/(k+1) and martingale-difference noise, whose averaged (mean) ODE is
\dot{\hat x} = \widetilde{BR}( \hat x ) - \hat x   (9.28)
Theorem 9.1. (The Limit Set Theorem) With probability 1, the limit set L\{\hat x^k\} of the state sequence is a connected, internally chain-recurrent set of the flow induced by the mean ODE (9.28).
Using the ODE method of analysis and this theorem, Benaïm and Hirsch prove the convergence of stochastic fictitious
play (SFP) for 2 × 2 games, [27]. The associated continuous-time ODE system in (z_1(t), z_2(t)) is
\dot z_1 = \widetilde{BR}_1(z_2) - z_1
\dot z_2 = \widetilde{BR}_2(z_1) - z_2
where \dot z_1 = dz_1(t)/dt. By studying this CT system, Benaïm and Hirsch [27] show that SFP converges to distributions
that approximate one of the two pure-strategy equilibria in 2 × 2 coordination games, and not to approximations of
the (unstable) mixed equilibrium, while play converges to the (unique) equilibrium distribution in 2 × 2 games with
a unique mixed-strategy equilibrium. Benaïm and Hirsch simplify SFP by setting all prior weights equal to zero.
They apply Theorem 1 of Pemantle, [122], to rule out convergence to linearly unstable equilibria.
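A sketch of SFP for a hypothetical 2 × 2 coordination game: each player plays a logit best response to the opponent's empirical frequency and the beliefs are updated by (9.21); the payoff matrices, temperature and horizon are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)

def logit_br(u, eps=0.1):
    w = np.exp((u - u.max()) / eps)
    return w / w.sum()

# coordination game (illustrative): U1[j, l] = U_1(e_{1j}, e_{2l}), U2[j, l] = U_2(e_{1j}, e_{2l})
U1 = np.array([[1.0, 0.0], [0.0, 2.0]])
U2 = np.array([[1.0, 0.0], [0.0, 2.0]])

z1 = np.array([0.5, 0.5]); z2 = np.array([0.5, 0.5])   # beliefs = empirical frequencies of the opponent
for k in range(2000):
    x1 = logit_br(U1 @ z1)                  # P1's smoothed BR to his belief about P2
    x2 = logit_br(U2.T @ z2)                # P2's smoothed BR to his belief about P1
    a1 = rng.choice(2, p=x1); a2 = rng.choice(2, p=x2)
    e1 = np.eye(2)[a1]; e2 = np.eye(2)[a2]
    z1 = (k * z1 + e2) / (k + 1)            # belief recursion (9.21)
    z2 = (k * z2 + e1) / (k + 1)
print(z1, z2)   # typically close to one of the pure coordination equilibria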
Hofbauer and Sandholm, [64], give rigorous convergence results on the global stability of Nash equilibria under SFP
play for N = 2 zero-sum games and for games with an interior evolutionarily stable strategy (ESS). They also show that
SFP converges to one of the equilibrium distributions in N-player potential games (see Appendix B) if all of the rest points
are hyperbolic. In a fictitious play setup, where the natural state variable z is the empirical frequency, convergence
follows in beliefs, not in actually implemented play or mixed strategies. However, the two convergence notions are
related, as shown in [64]. Note that an analysis of only the continuous-time system, as in [135], is limited, since
looking only at the positive-limit sets of the continuous-time system does not necessarily give the complete picture
for convergence of the discrete-time system with noise. Consider the example of a flow \Phi on a circle that moves
clockwise everywhere except at a single rest point. This rest point is the unique positive-limit point of the flow, (see
[64]). Now suppose the flow represents the expected motion of some underlying stochastic process. If the stochastic
process reaches the rest point, its expected motion is zero. Still actual motion may occur with positive probability;
in fact the process can jump past the rest point and begin another loop. Therefore in the long run all regions of the
circle are visited infinitely often. The long run behaviour is captured by the notion of chain recurrence, as all points
on the circle are chain recurrent under the flow Φ. The proper concept to analyze the asymptotic behaviour is that
of an internally chain recurrent (ICR) or chain transitive (ICT) set.
In classical FP best-response maps are set-valued. Benaïm, Hofbauer and Sorin generalize DSA from ODEs to
differential inclusions (DIs), [29], and extend the convergence results for potential games. Recall (9.20),
\hat x_i^{k+1} - \hat x_i^k = \frac{1}{k+1} \big( e_i^k - \hat x_i^k \big)
where for classical FP, e_i^k \in BR_i^p( \hat x_{-i}^k ) as in (9.12). Note that
E( \hat x_i^{k+1} - \hat x_i^k \mid F_k ) \in \frac{1}{k+1} \big( Co( BR_i^p( \hat x_{-i}^k ) ) - \hat x_i^k \big)
where Co(\cdot) denotes the convex hull. Since Co( BR_i^p( \hat x_{-i}^k ) ) \subset BR_i( \hat x_{-i}^k ), (see [29]), it follows that
E( \hat x_i^{k+1} - \hat x_i^k \mid F_k ) \in \frac{1}{k+1} \big( BR_i( \hat x_{-i}^k ) - \hat x_i^k \big), \quad \text{with} \quad f_i(\hat x) := BR_i( \hat x_{-i} ) - \hat x_i
where the set-valued map f is defined as f(\hat x) = BR(\hat x) - \hat x, with BR : \Delta \to \Delta the overall best-response map, BR(\hat x) :=
( BR_1(\hat x_{-1}), \ldots, BR_N(\hat x_{-N}) ). Thus if all players play a fictitious play strategy, by SA this leads to the following
differential inclusion (DI) induced by f,
\dot{\hat x} \in BR( \hat x ) - \hat x
This is analyzed in [29], where the result of Monderer and Shapley on convergence of the classical discrete FP for a
potential game with linear utility functions is generalized via Liapunov analysis for DIs.
Remark 9.2. In the following we explicitly write player P_i's calculations (only for the SFP case). At
step k, each player P_i plays according to the best response to his internal state variable z_i^k, z_i^k := \hat x_{-i}^k \in \Delta_{-i} = \Delta_i^{N-1},
Z_i = \Delta_{-i}. His mixed strategy is computed statically as in (9.12) for FP, or (9.13), (9.14) for SFP,
x_i^k = \widetilde{BR}_i( z_i^k )   (9.29)
and he selects his action a_i^k or e_i^k based on it. Since z_i^k depends on prior actions of his opponents, the output from P_i
at iteration k, x_i^k = \widetilde{BR}_i( z_i^k ), is causal. At this step k, based on his inputs, which are his opponents' current actions e_{-i}^k,
player P_i performs internal calculations to update his internal state variable (his estimator/observer state, E_i). From
(9.21), (9.29), it follows that at step k player P_i's internal state and strategy are updated as \Sigma_i^d,
(\Sigma_i^d): \quad z_i^{k+1} - z_i^k = \frac{1}{k+1} \big( -z_i^k + e_{-i}^k \big), \qquad x_i^k = \widetilde{BR}_i( z_i^k )
The internal state variable z_i represents the belief about the others' strategies, or the estimate \hat x_{-i} of x_{-i}, i.e., z_i =
\hat x_{-i} := \bar e_{-i}, z_i \in Z_i = \Delta_{-i} = \Delta_i^{N-1}. Note that z_i^{k+1} will become his current state at the next iteration step (k+1), at
which his implemented strategy is, cf. (9.29), x_i^{k+1} = \widetilde{BR}_i( z_i^{k+1} ). The processing for player P_i can be put in the form
of \Sigma_i^d in (9.1), with \gamma_k = 1/(k+1) and \omega_i^k = e_{-i}^k.
Alternatively, the above calculations in \Sigma_i^d for player P_i can be split into two parts, S_i (last line) and E_i (first line). S_i
is static and for \sigma_i (strategy decision rule) players use a best-response rule, \sigma_i = \widetilde{BR}_i. The estimator E_i state update
rule is based on time-averaging from his partial information, \omega_i^k = e_{-i}^k. From the foregoing, the overall processing of
player P_i is \Sigma_i^d = S_i \circ E_i.
Recall how from (9.22), (9.25), by averaging, we obtained the overall continuous-time approximation (averaged or
mean ODE) as in (9.28). Alternatively, from (\Sigma_i^d) above we obtain the per-player mean ODE
\dot z_i = -z_i + x_{-i}, \qquad x_j = \widetilde{BR}_j( z_j ), \; j \neq i
Note that for the overall state vector of all players, z = (z_1, \ldots, z_i, \ldots, z_N), z_i \in \Delta_{-i},
z^{k+1} = \frac{k}{k+1} z^k + \frac{1}{k+1} q^k   (9.30)
where q^k = (e_{-1}^k, \ldots, e_{-N}^k). In a network information case one needs to use z_i and z, but here all players use the same
calculations and the reduced form of (9.30) is (9.22), in terms of \hat x^k \in \Delta, of reduced dimension.
Convergence of fictitious play (FP) and smooth or stochastic FP (SFP) is studied for the empirical frequencies of play
(z). There exists a large literature specifying classes of games in which these frequencies converge to the set of Nash equilibria.
[Figure 9.6: equivalent block-diagram representations (a), (b) of player P_i's FP processing and of the overall closed-loop interconnection of the players' systems through the maps \Psi_i(\cdot), with G(s) the integrator block and \sigma_i the decision map.]
Convergence for N = 2-player zero-sum games was obtained by Robinson, as well as for general 2 × 2
games by Miyasawa. Monderer and Shapley, [97], proved convergence for potential games, and Berger showed the
same for 2-player 2 × m games. Many of these results have been re-analyzed via stochastic approximation
(SA) theory. Benaïm, Hofbauer and Sorin, [30], extend Monderer and Shapley's result to a general class of potential
games, using the recent extension of SA to DIs.
SA-based convergence results were first developed for smooth fictitious play (SFP), which leads to a smoothed
best-response function (single-valued map) to which ODE-based SA results can be applied. As for FP, convergence
holds for N = 2, 2 × 2 games (by Benaïm and Hirsch in [27]), zero-sum and potential games (by Hofbauer and Sandholm in
[64]), and supermodular games (Benaïm and Faure). More recently, algorithms for which more relaxed assumptions
suffice have been explored. Benaïm and Raimond propose Markovian Fictitious Play (MFP) in [32], where players
have restrictions on their action sets, instead of full access to all states. Convergence is obtained for zero-sum and
potential games. A heterogeneous FP algorithm is considered by Fudenberg and Takahashi in [58], where each player
might have different personal histories. On the other hand, in [28], Benaïm, Hofbauer and Hopkins explore the
case in which a weighted empirical frequency is used by players (greater weight on recent experience), or weighted
stochastic fictitious play (WSFP). They consider the case of a large population of players with N strategies, where
players are repeatedly randomly matched in pairs to play a 2-player matrix game with N strategies. That is, they
consider the single population learning model, symmetric case, as in evolutionary game theory. Under this model,
they found that in games whose Nash equilibria are mixed and are unstable under fictitious play-like learning (such
as some RSP games), i.e., in so called “unstable" games, the time average of play often converges under WSFP, even
while mixed strategies and beliefs continue to cycle.
The requirement of observations of the actions of all other players is relaxed in joint strategy FP (JSFP) as introduced
by Marden, Arslan and Shamma in [92], where each player tracks the empirical frequencies of the joint actions of all
other players. Other fictitious play based approaches aim to reduce the information gathering in graphs. For example
in [145], [146], an average empirical distribution is tracked. Convergence to consensus Nash equilibria is shown, a
subset of NEs identical for all players, for a subclass of games in which such equilibria exist. This "consensus NE"
(identical for all players) is a restrictive notion, since not all games have such an equilibrium.
A dynamical system interpretation is given by Shamma and Arslan in [135], which nicely allows dynamical extensions
of algorithms, one of the first such proposed extensions. The analysis is started directly on the ODE system for
FP/SFP and gradient-based learning algorithms. A derivative term, similar to PD control, is added to the input of the
BR map and of the projection map used in gradient dynamics. This PD component gives significant improvement
for convergence in the RSP game and the Shapley polygon game, where typical FP/gradient-based learning does not converge.
Only the continuous-time system is analyzed, i.e., PD control is added to the ODE system (after DSA), without going back to a
discrete-time (stochastic) implementation, which would be needed for a rigorous application of asymptotic behaviour
results, [26].
9.4 Payoff-Based Learning (P-RL)
We saw that in fictitious play players compute best responses to their opponents' empirical frequencies of play. The
assumptions are that: (1) each player knows the game structure, i.e., his own utility function (model-based); (2)
each player observes the actions of each of his opponents at each stage. The next question is how to relax these
assumptions. One approach is to assume that the players observe only their realized utility value or payoff at each
iteration. This minimal information setting, where players are not assumed to know utility functions (model-free),
nor to observe opponent actions, is a natural setting for the so-called Reinforcement Learning schemes, [144].
In the general class of Reinforcement-Learning (RL) algorithms, P_i uses only measurements \omega_i of actual, realized
costs/payoffs, denoted by \pi_i := U_i(e_i, e_{-i}) \in R, and does not require knowledge of the functional representation of
these costs/payoffs, hence \omega_i^k = \pi_i^k = U_i(e_i^k, e_{-i}^k). In the RL case the internal state variable z_i can be a payoff score vector
over all actions, with z_{ij} being the score for action j. Alternatively, z_i can be interpreted as an
estimate of the cost/payoff vector U_i(x_{-i}) = [U_i(e_{i1}, x_{-i}), \ldots, U_i(e_{im_i}, x_{-i})]^T, z_i = \hat U_i(x_{-i}), with z_{ij} = \hat U_i(e_{ij}, x_{-i}).
Hence zi ∈ Rmi := Zi and the internal state space is the payoff space. This zi is constructed and updated by means
of the aggregate information of Pi and his own history of play. Player Pi uses a rule of behavior (a decision map)
which depends on the state variable zi . Most of the work assumes a decision rule/map which is static or stationary
in the sense of being defined through a time-independent function of an internal state variable, xki = σi (zki ), with
σi : Zi → ∆i a probability map. The typical approach is as follows: a) construct a sequence of mixed strategies {xki }
which are updated taking into account the received payoff πik and zki , b) study the convergence (or non-convergence)
of this sequence {xki }. Almost all work considers dynamics only in the zi variable, with xi mapped via a static map σi
from zi (hence xi has only dynamics as induced by the dynamics of zi ). Some of the researchers analyze convergence
in the payoff space (internal state space) and make inferences about convergence in the strategy space. Others derive
directly the induced iterative process/dynamics in the strategy space and analyze convergence in the strategy space.
Among the types of Reinforcement-Learning (RL), we distinguish Payoff-based RL (P-RL), where the choice/decision
map \sigma_i is any general probability map from Z_i to \Delta_i, [51], [17]. Another type of RL algorithm is the
Q-Learning (Q-RL) type, where the choice/decision map \sigma_i is of optimal best-response type (hence realized payoffs/costs are processed optimally). A similar simple averaging rule is used as the estimator rule to update z_i. In
most cases, x_i is obtained via the static \sigma_i map, while some recent research works consider x_i obtained via a dynamic
decision process, [82], [39]. Yet another type of RL algorithm is Regret-based RL (R-RL) learning, in which the
concept of regret is used, initiated by Hart and Mas-Colell in [63]. This regret is the difference between the
actual payoff/cost that would be realized for a particular choice and the payoffs/costs for any other possible choices,
and thus requires access to all other payoff values. Instead of a Nash equilibrium (NE), convergence is shown to a
correlated equilibrium (CE) (which includes NE as a subset).
In the first class of Payoff-based Reinforcement-Learning (P-RL), two basic models are the Erev-Roth model (P-RL-ER), [51], and the Arthur model (P-RL-AR), [17]. Recent work by Hopkins and Posch, [67], considers P-RL-ER
and P-RL-AR and shows that these models result in specific time-varying step-sizes, while invariance of the simplex is maintained
automatically. Chasparis, Shamma and Rantzer propose a variant of P-RL with more general step-sizes (P-RL-SH), directly dictating the strategy update rule, inspired by the seminal works of Sastry and co-authors, [132], and
Sutton and Barto, [144]. However, in addition to the positive payoff assumption, a projection of the strategy onto the
simplex is needed to maintain invariance. In all these P-RL cases interesting connections to evolutionary dynamics
have been obtained; the resulting mean ODE dynamics are related to variants of the replicator dynamics (RD) from
evolutionary games, [129]: either the standard RD, a modified RD, [67], or a perturbed RD, [42]. In the following, note that the identity
E[X] = E_Y[\, E[X \mid Y] \,] = \sum_y E[X \mid Y = y] \, P[Y = y]
yields
E[\pi_i^k] = E_{e_i^k}\big[ E[\pi_i^k \mid e_i^k] \big] = E_{e_i^k}\big[ E[\, U_i(e_i^k, e_{-i}^k) \mid e_i^k \,] \big] = \sum_{j \in M_i} E[\, U_i(e_i^k, e_{-i}^k) \mid e_i^k = e_{ij} \,] \, P[\, e_i^k = e_{ij} \,]
To start with the P-RL-ER and P-RL-AR models, we note that in both cases the internal state z_i \in R^{m_i} := Z_i is
a vector of "payoff scores" (representing the score/propensity) for each of the actions in A_i. The strategy decision
map is a static map, \sigma_i : Z_i \to \Delta_i,
x_i^k = \sigma_i( z_i^k )   (9.31)
where \sigma_i is a map that gives a probability vector, hence whose range is \Delta_i. This \sigma_i is taken as
\sigma_i( z_i^k ) = \frac{ z_i^k }{ Z_i^k }, \quad \text{where} \quad Z_i^k = \sum_{j'=1}^{m_i} z_{ij'}^k   (9.32)
The \sigma_i map can be regarded as being approximately linear in z_i, which enables an easy derivation of the induced
strategy dynamics for x_i. To complete the model, the update of the internal state z_i needs to be specified. Based on
the realized payoff \pi_i^k = U_i(e_i^k, e_{-i}^k), in P-RL-ER (Erev-Roth), [51], the j-th component of z_i is updated as
z_{ij}^{k+1} = z_{ij}^k + \delta z_{ij}^k   (9.33)
where \delta z_{ij}^k denotes the increment in player i's j-th payoff score/propensity. This \delta z_{ij}^k is taken as follows: if action j
is selected at stage k, i.e., e_i^k = e_{ij}, then \delta z_{ij}^k is set equal to the payoff \pi_i^k \in R obtained at stage k, \pi_i^k = U_i(e_{ij}, e_{-i}^k).
Otherwise, if j is not selected, i.e., if e_i^k = e_{ij'}, j' \neq j, then \delta z_{ij}^k is set to 0. This means
\delta z_{ij}^k = \pi_i^k \, 1_{\{ e_i^k = e_{ij} \}}
Equivalently, in vector form, \delta z_i^k = \pi_i^k e_i^k and (9.33) can be written in compact vector form as
z_i^{k+1} = z_i^k + \pi_i^k e_i^k   (9.34)
[Figure 9.8: block diagram of player P_i's P-RL-ER processing: the realized payoff \omega_i drives an integrator 1/s producing the score vector z_i, mapped through \sigma_i to the strategy x_i, in feedback with U_i(\cdot) and x_{-i}.]
Note that this requires that all payoffs be positive. Also, invariance of the simplex \Delta_i is naturally maintained, albeit
leading to a player-payoff-dependent (endogenous) step-size, where the step-size of player i is taken as 1/Z_i^k. For
the computations of P_i we can write for P-RL-ER, \Sigma_i^d = S_i \circ E_i,
(\Sigma_i^d): \quad z_i^{k+1} - z_i^k = \pi_i^k e_i^k, \qquad x_i^k = \sigma_i( z_i^k )
with conditional mean increment
E[\, z_i^{k+1} - z_i^k \mid F_k \,] = E[\, \pi_i^k e_i^k \mid F_k \,] = \text{diag}(x_i^k) \cdot U_i( x_{-i}^k )
where F_k denotes the \sigma-algebra generated by \{z^1, z^2, \ldots, z^k\}, with z^k the N-tuple of internal variables of all players
at time k, and \text{diag}(x_i^k) is the diagonal matrix with the elements of x_i^k on its diagonal.
This leads to a continuous-time approximation as in the figure above and to a closed-loop system as in the following
one, where \Sigma_i = \sigma_i \circ \frac{1}{s}, \varphi_i(x) = -x_i \,.\!\times\, U_i(x_{-i}), with .\times denoting point-wise multiplication of two vectors.
Comparing to P_i's processing in the FP case in the previous section (see table, where z_i \in \Delta_{-i} = \Delta_i^{N-1}, \omega_i^k = e_{-i}^k), it
can be seen that in the P-RL case z_i \in R^{m_i}, \omega_i^k = \pi_i^k = U_i(e_i^k, e_{-i}^k).
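A minimal sketch of the P-RL-ER (Erev-Roth) scheme \Sigma_i^d above for a hypothetical 2-player game with positive payoffs; the payoff matrices, initial propensities and horizon are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(2)

# positive payoffs (required by the scheme); U1[j, l] = U_1(e_{1j}, e_{2l}), U2[j, l] = U_2(e_{1j}, e_{2l})
U1 = np.array([[3.0, 1.0], [1.0, 2.0]])
U2 = np.array([[3.0, 1.0], [1.0, 2.0]])

z1 = np.ones(2); z2 = np.ones(2)            # initial propensities/scores z_i^0 > 0
for k in range(20000):
    x1 = z1 / z1.sum(); x2 = z2 / z2.sum()  # strategy map sigma_i in (9.32)
    a1 = rng.choice(2, p=x1); a2 = rng.choice(2, p=x2)
    pi1 = U1[a1, a2]; pi2 = U2[a1, a2]      # realized payoffs
    z1[a1] += pi1                            # (9.34): z_i^{k+1} = z_i^k + pi_i^k e_i^k
    z2[a2] += pi2
print(z1 / z1.sum(), z2 / z2.sum())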
The alternative model of Arthur (P-RL-AR), [17], is similar except that the step-size of each agent is normalized at
each step, such that the state update rule is taken as
(E_i): \quad z_i^{k+1} = \frac{ C(k+1) }{ kC + \pi_i^k } \, \big( z_i^k + \pi_i^k e_i^k \big)   (9.36)
for some constant C > 0. This leads to Z_i^k = kC for all i (same step-size for all players). While in the P-RL-ER
model step-sizes are stochastic, of order 1/k, in the P-RL-AR model they are deterministic, of order 1/k. In both
[Figure: closed-loop interconnection of the players' P-RL processing systems \Sigma_1, \ldots, \Sigma_N through the feedback maps \varphi_1(\cdot), \ldots, \varphi_N(\cdot).]
P-RL-ER and P-RL-AR step-sizes on the order 1/k allow the application of stochastic approximation results, [26],
as done by Hopkins and Posch in [67]. In the strategy space xi (induced strategy dynamics) both P-RL-ER and P-
RL-AR have a mean ODE in the class of replicator dynamics (RD) or a variant of it (Lemma 2 in [67]). Nevertheless
they may have different asymptotic behaviour. This can be seen as follows.
First, let us calculate the induced iterative process for the strategies, x_{ij}^{k+1} - x_{ij}^k, and then take the expected value of
the increments, E[\, x_{ij}^{k+1} - x_{ij}^k \,]. We briefly show this for the Erev-Roth model. First, note that if at time k player i
chooses action j (described as Event j), from (9.31), (9.32), (9.33) we can write for the change in x_{ij}^k,
x_{ij}^{k+1} - x_{ij}^k = \frac{ z_{ij}^k + \pi_i^k }{ Z_i^k + \pi_i^k } - \frac{ z_{ij}^k }{ Z_i^k } = \frac{ (1 - x_{ij}^k) \, \pi_i^k }{ Z_i^k + \pi_i^k } = \frac{ (1 - x_{ij}^k) \, \pi_i^k }{ Z_i^k } + O\Big( \frac{1}{ (Z_i^k)^2 } \Big)   (9.37)
where \pi_i^k = U_i(e_{ij}, e_{-i}^k). Similarly, if at time k player i chooses any other action j', j' \neq j, or Event j' happens, then
\pi_i^k = U_i(e_{ij'}, e_{-i}^k) and the change in x_{ij}^k is
x_{ij}^{k+1} - x_{ij}^k = \frac{ z_{ij}^k }{ Z_i^k + \pi_i^k } - \frac{ z_{ij}^k }{ Z_i^k } = \frac{ -x_{ij}^k \, \pi_i^k }{ Z_i^k + \pi_i^k } = \frac{ -x_{ij}^k \, \pi_i^k }{ Z_i^k } + O\Big( \frac{1}{ (Z_i^k)^2 } \Big)   (9.38)
Note that for Event j, when e_i^k = e_{ij} and \pi_i^k = U_i(e_{ij}, e_{-i}^k), taking the expected value conditioned on the j-th action, we
get E[\, \pi_i^k \mid e_i^k = e_{ij} \,] = U_i(e_{ij}, x_{-i}^k), so that
E[\, x_{ij}^{k+1} - x_{ij}^k \mid \text{Event } j \,] = \frac{ (1 - x_{ij}^k) \, U_i(e_{ij}, x_{-i}^k) }{ Z_i^k } + O\Big( \frac{1}{ (Z_i^k)^2 } \Big)   (9.39)
Event j occurs with probability x_{ij}^k and Event j' with probability x_{ij'}^k, for any j' \neq j, and one can compute the expected change
in x_{ij}^k as
E[\, x_{ij}^{k+1} - x_{ij}^k \mid F_k \,] = E_{e_i^k}\big[ E[\, x_{ij}^{k+1} - x_{ij}^k \mid e_i^k \,] \mid F_k \big]
= x_{ij}^k \, E[\, x_{ij}^{k+1} - x_{ij}^k \mid e_i^k = e_{ij} \,] + \sum_{j' \in M_i, j' \neq j} x_{ij'}^k \, E[\, x_{ij}^{k+1} - x_{ij}^k \mid e_i^k = e_{ij'} \,]
= x_{ij}^k \, E[\, x_{ij}^{k+1} - x_{ij}^k \mid \text{Event } j \,] + \sum_{j' \in M_i, j' \neq j} x_{ij'}^k \, E[\, x_{ij}^{k+1} - x_{ij}^k \mid \text{Event } j' \,]
= \frac{1}{Z_i^k} \, x_{ij}^k \Big[ U_i(e_{ij}, x_{-i}^k) - \sum_{j' \in M_i} x_{ij'}^k \, U_i(e_{ij'}, x_{-i}^k) \Big] + O\Big( \frac{1}{ (Z_i^k)^2 } \Big),
where F_k denotes the \sigma-algebra generated by \{z^1, z^2, \ldots, z^k\}, with z^k the N-tuple of internal variables of all players
at time k. Then for the P-RL-ER,
E[\, x_{ij}^{k+1} - x_{ij}^k \mid F_k \,] = \frac{1}{Z_i^k} \, x_{ij}^k \big[ U_i(e_{ij}, x_{-i}^k) - (x_i^k)^T U_i(x_{-i}^k) \big] + O\Big( \frac{1}{ (Z_i^k)^2 } \Big),   (9.40)
Alternatively, from (9.37) and (9.38), combining the two cases we can write in vector form the induced strategy
process of the P-RL-ER model
(S_i^d): \quad x_i^{k+1} - x_i^k = \frac{1}{Z_i^k} \, \pi_i^k \big( e_i^k - x_i^k \big) + O\Big( \frac{1}{ (Z_i^k)^2 } \Big)   (9.41)
Taking the expected value conditioned on e_i^k being the j-th action, with E[\, \pi_i^k \mid e_i^k = e_{ij} \,] = U_i(e_{ij}, x_{-i}^k), we obtain
(ignoring for the moment the O(\cdot) term),
E[\, x_i^{k+1} - x_i^k \mid F_k \,] = E_{e_i^k}\big[ E[\, x_i^{k+1} - x_i^k \mid F_k, e_i^k \,] \big]   (9.42)
= \sum_{j \in M_i} E[\, x_i^{k+1} - x_i^k \mid F_k, e_i^k = e_{ij} \,] \, P[\, e_i^k = e_{ij} \,]
= \frac{1}{Z_i^k} \sum_{j \in M_i} E[\, \pi_i^k ( e_i^k - x_i^k ) \mid F_k, e_i^k = e_{ij} \,] \, P[\, e_i^k = e_{ij} \,] + O\Big( \frac{1}{ (Z_i^k)^2 } \Big)
= \frac{1}{Z_i^k} \sum_{j \in M_i} U_i(e_{ij}, x_{-i}^k) \big( e_{ij} - x_i^k \big) \, x_{ij}^k + O\Big( \frac{1}{ (Z_i^k)^2 } \Big)
= \frac{1}{Z_i^k} \Big[ \text{diag}(x_i^k) \cdot U_i(x_{-i}^k) - x_i^k \, (x_i^k)^T \, U_i(x_{-i}^k) \Big] + O\Big( \frac{1}{ (Z_i^k)^2 } \Big)   (9.43)
which is the vector form of (9.40).
Similarly, for the P-RL-AR (Lemma 1 in [67]),
E[\, x_{ij}^{k+1} - x_{ij}^k \mid F_k \,] = \frac{1}{kC} \, x_{ij}^k \big[ U_i(e_{ij}, x_{-i}^k) - (x_i^k)^T U_i(x_{-i}^k) \big] + O\Big( \frac{1}{ (kC)^2 } \Big),
and
E[\, x_i^{k+1} - x_i^k \mid F_k \,] = \frac{1}{kC} \Big[ \text{diag}(x_i^k) \cdot U_i(x_{-i}^k) - x_i^k \, (x_i^k)^T \, U_i(x_{-i}^k) \Big] + O\Big( \frac{1}{ (kC)^2 } \Big)   (9.44)
Using these, Hopkins and Posch show, [67], that the mean ODE for the induced strategy dynamics of the Arthur
P-RL-AR model is the standard RD, [129],
\dot x_{ij} = x_{ij} \big[ U_i(e_{ij}, x_{-i}) - x_i^T U_i(x_{-i}) \big]   (9.45)
or, in vector form,
\dot x_i = \text{diag}(x_i) \cdot U_i(x_{-i}) - x_i \, x_i^T \, U_i(x_{-i})   (9.46)
For the Erev-Roth P-RL-ER model the corresponding mean ODE is instead a modified and adjusted RD, related to Maynard Smith's version of the RD. The two mean ODE dynamics
have identical equilibria points but their stability properties can be different. Note that the matrix R(x_i) := \text{diag}(x_i) - x_i x_i^T appearing in (9.43), (9.44) has very useful properties:
if x_{ij} > 0 for all j, then R(x_i) is positive semidefinite and y^T R(x_i) y > 0 for all nonzero y \in T\Delta_i = \{ y \in R^{m_i} \mid \sum_{j=1}^{m_i} y_j = 0 \}, (see Hofbauer
and Sigmund, Section 9.6, [65]), which is the tangent space along which the RD trajectories evolve.
While in the P-RL-AR model the resulting (endogenous) step-size is the same, i.e., 1/k, for all players, this is not
true for the P-RL-ER, which makes the analysis more difficult. This is corrected via the introduction of the new variables
\tilde\mu_i^k = k / Z_i^k, [67]. This leads to two recursive equations for x_{ij}^{k+1} and \tilde\mu_i^{k+1} with the same step-size, 1/k, but with extra
noise terms (still martingale noise). In turn the mean ODE is a modified, adjusted RD (Lemma 2 in [67]), which is
like a scaled RD driven by the \tilde\mu dynamics,
\dot x_{ij} = \tilde\mu_i \, x_{ij} \big[ U_i(e_{ij}, x_{-i}) - x_i^T U_i(x_{-i}) \big], \qquad \dot{\tilde\mu}_i = \tilde\mu_i \big( 1 - \tilde\mu_i \, x_i^T U_i(x_{-i}) \big)   (9.47)
This ODE has very nice properties, directly related to the RD dynamics: if x_e is an equilibrium point of the RD
then (x_e, \tilde\mu_e), with \tilde\mu_{e,i} = 1 / ( x_{e,i}^T U_i(x_{e,-i}) ), is an equilibrium point of (9.47); moreover, if x_e is a linearly unstable equilibrium
point of the adjusted RD then (x_e, \tilde\mu_e) is a linearly unstable equilibrium point of (9.47). These properties allow the
derivation of some powerful results. Hopkins and Posch's main result (Theorem 1 in [67]) is applied to equilibria
of the standard RD or adjusted RD and shows that the Erev-Roth model P-RL-ER converges with probability
zero to: (1) a point that is not an NE, (2) an NE that is a linearly unstable equilibrium of the adjusted RD, or (3) if
N = 2, an NE that is a linearly unstable equilibrium of the standard RD, provided the initial conditions are such that z_{ij}^0 > 0
for all i \in N and j \in M_i. However, as mentioned in [67], the nice results for the P-RL-ER model do not hold for
games with a unique NE in mixed strategies (such as RSP), one of the reasons being that the modifications to the
RD dynamics are not enough to overcome its limitations. A possible idea is to add extra dynamics or feedback
control for these cases. The application of stochastic approximation results can be done in the case in which the
associated ODE has a global attractor, so that the stochastic process will converge to that attractor with probability
one. A second method of analysis based on application of stochastic approximation results relies on a globally
applicable Liapunov function (see result in the Appendix) and possibly the use of Pemantle’s negative convergence
result to an equilibrium that is linearly unstable for the ODE. However this result needs noise sufficiently large in all
directions in the state space (see result in the Appendix), so that it cannot be applied to points on the boundary (e.g.
pure equilibria). In [67], Hopkins and Posch apply a more general result of Brandière that only requires the noise to
be sufficiently large in unstable directions. Their result shows that the Erev-Roth model P-RL-ER does not converge
to an NE that is a linearly unstable equilibrium of the corresponding modified and adjusted RD (its mean ODE) if
the initial conditions are such that z_{ij}^0 > 0 for all i \in N and j \in M_i (Proposition 2 in [67]).
Unlike P-RL-ER/AR where a static decision map with range ∆i is used and where the recursive process for the
strategy is obtained as being induced by that of z (as if exploiting the linearity of σ ), in [42], Chasparis and Shamma
start directly from the recursive process in the strategy space. They propose the following P-RL-SH strategy update
process (heuristically proposed, motivated from (9.41) without the O(1/(Z_i^k)^2) term), hence \Sigma_i^d = S_i,
x_i^{k+1} = x_i^k + \alpha_k \, \pi_i^k \big( e_i^k - x_i^k \big)   (9.48)
under the same assumptions of positive and bounded payoffs as in [67], but with an exogenous step-size \alpha_k, where
\pi_i^k := U_i(a^k) = U_i(e^k). However, a projection onto \Delta_i has to be used to maintain invariance and feasibility of x_i^k as
a probability vector (not needed in the case of the static decision map in P-RL-ER, which automatically satisfies it).
While this simplifies the recursive process and allows more general (diminishing) step-sizes \alpha_k, at the same time
it complicates the analysis due to the necessity of performing the projection. This projection step is in fact ignored in
[42], justified by sufficiently small step-sizes. From (9.48), ignoring the projection and following the same steps as in the
derivation following (9.41), one can obtain that
derivation following (9.41), one can obtain that
1 k+1
E[ (x − xki ) |Fk ] = fi (x)
αk i
where Fk = xk and where fi is the right-hand side of (9.46). Then (9.48) can be written as
ẋi = fi (x)
where fi (x) is as the right-hand side of (9.46). The authors in [42] use an additional perturbation on xkij to avoid
convergence to boundary points, which leads to a perturbed RD as the mean ODE dynamics. Unlike [67] the direct
correspondence between the equilibria points is lost, i.e., a perturbed equilibria point is obtained. This means that
the results obtained for equilibrium selection are only to a neighbourhood of the original pure and strict NE, i.e. to
perturbation of the NE. Also, a more rigorous analysis will need to take the projection into account.
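A sketch of the P-RL-SH update (9.48), with a standard Euclidean projection onto the simplex used to keep x_i^k feasible; the projection routine, step-size schedule and payoff matrices are illustrative choices, not taken from [42].

import numpy as np

rng = np.random.default_rng(3)

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex (standard sorting-based routine)
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

U1 = np.array([[3.0, 1.0], [1.0, 2.0]])     # positive payoffs, illustrative
U2 = np.array([[3.0, 1.0], [1.0, 2.0]])

x1 = np.array([0.5, 0.5]); x2 = np.array([0.5, 0.5])
for k in range(20000):
    alpha = 1.0 / (k + 10)                   # diminishing exogenous step-size (illustrative)
    a1 = rng.choice(2, p=x1); a2 = rng.choice(2, p=x2)
    e1 = np.eye(2)[a1]; e2 = np.eye(2)[a2]
    x1 = project_simplex(x1 + alpha * U1[a1, a2] * (e1 - x1))   # (9.48) for player 1
    x2 = project_simplex(x2 + alpha * U2[a1, a2] * (e2 - x2))   # (9.48) for player 2
print(x1, x2)

For small steps the update is already a convex combination toward a vertex and stays in the simplex, so the projection mostly acts as a safeguard for larger payoff/step products.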
Hopkins and Posch, [67], consider reinforcement learning models inspired by urn models, such as the Erev-Roth model
(P-RL-ER), [51], or the Arthur model (P-RL-AR), [17]. They apply a version of non-convergence results such as those of
Brandière and Duflo, [38], [37], or Pemantle, [122], regarding attainability of boundary points. Step-sizes resulting
from these models are quite specific, but invariance of the simplex is maintained automatically. Beggs has several
rigorous convergence results, [22], including for supermodular games, [21]. Chasparis, Shamma and Rantzer propose
a variant of P-RL with more general step-sizes, directly dictating the strategy update rule. This is inspired by the seminal
works of Sastry and co-authors, [132], and Sutton and Barto, [144]. Although not using the payoff in an optimal way,
these algorithms are based on relatively simple rules of reward-based update of the strategies from the
actual measured payoff of each player. However, in addition to the positive payoff assumption, a projection of the
strategy onto the simplex is needed to maintain invariance (this point is treated rather vaguely in [42]).
In all these P-RL cases interesting connections to evolutionary dynamics have been obtained, where the resulting
ODE dynamics are related to variants of the replicator dynamics (RD) from evolutionary games: either the standard RD,
a modified RD, [67], or a perturbed RD, [42].
Next we review Q-learning (Q-RL) algorithms, also in the class of Reinforcement-Learning algorithms. The main
idea of Q-RL is to respond optimally via a best-response map that uses not the cost function, but an estimate of it.
Thus action selection is based on estimated utilities, zi , or Q-values, which characterize the relative utility or score
of a particular action, constructed from measurements of the realized cost/payoff, πik . Either a best-response or a
smooth best-response can be used as the action selection mechanism. A player learns the utility based on the chosen
action and the actual realized payoff, without knowledge of the utility/payoff function (model-free).
Recall that in both cases (P-RL-ER/AR or P-RL-SH) the resulting process is of first order in the strategy space and the
analysis is done on a set of first-order induced ODEs in the strategy space, i.e., the dynamics of x_i as induced by the
dynamics of z_i. The choice function \sigma_i is any choice probability function that gives a probability vector. In Q-RL
or exponential RL, \sigma_i is a best-response type map, a special decision map such as Boltzmann selection or
logit, \sigma_i : Z_i \to \Delta_i, \sigma_i = \beta^{\varepsilon}, where \beta^{\varepsilon} is as in (9.6). Thus x_i is taken via the static map
x_i^k = \sigma_i( z_i^k ) = \beta^{\varepsilon}( z_i^k )   (9.49)
As in P-RL, the strategy dynamics is as induced by the $z_i$ dynamics (see Leslie and Collins, [83], or Cominetti et
al., [44]). When the strategy process $x_i$ also has its own dynamics, fewer results exist (see the time-scale separation
approach in [82], [35]). In the following we review these algorithms. First, note that if the estimated utility $z_i^k$ is such
that $z_i^k = U_i(x_{-i}^k)$ (the true utility), then
\[
[\,(\beta^\varepsilon \circ U_i)(x_{-i}^k)\,]_j = [\,(\sigma_i \circ U_i)(x_{-i}^k)\,]_j = [\,\widetilde{BR}_i(x_{-i}^k)\,]_j
\]
and hence a BR-type scheme is obtained. The issue is how to obtain this only from realized utility values Ui (ek−i ).
In the N-player case of Leslie and Collins in [83], each of the players learns independently, called independent Q-RL
(IQL). Let zi denote the Q-value vector (called Qi in [83]) of Player Pi , i ∈ N . Each Pi chooses an action aki = j or
eki = ei j , based on his strategy xki , receives and measures at time k a payoff (or reward) πik = Ui (ei j , ek−i ) and, based
on it updates only the j-th entry of the estimated utility, or Q-value vector, zki as
\[
z_{ij}^{k+1} = \Big(1 - \gamma_i^k \frac{1}{x_{ij}^k}\Big)\, z_{ij}^k + \gamma_i^k \frac{1}{x_{ij}^k}\, \pi_i^k \qquad (9.50)
\]
\[
z_{ij'}^{k+1} = z_{ij'}^k, \quad \forall\, j' \in M_i,\ j' \neq j
\]
where γik is the learning rate (step-size), assumed to be diminishing. In compact form, (9.50) is written as
\[
z_{ij}^{k+1} = z_{ij}^k + \gamma_i^k\, \frac{1}{x_{ij}^k}\big(\pi_i^k - z_{ij}^k\big)\, 1_{\{e_i^k = e_{ij}\}} \qquad (9.51)
\]
where $1_{\{e_i^k = e_{ij}\}}$ is the indicator function. A similar scheme is proposed in [44] with no normalization,
\[
z_{ij}^{k+1} = z_{ij}^k + \gamma_i^k \big(\pi_i^k - z_{ij}^k\big)\, 1_{\{e_i^k = e_{ij}\}} \qquad (9.52)
\]
In vector form this reads
\[
z_i^{k+1} = z_i^k + \gamma_i^k \big(\pi_i^k I - \mathrm{diag}(z_i^k)\big)\, e_i^k \qquad (9.53)
\]
where $I$ is the identity matrix and $\mathrm{diag}(z_i^k)$ denotes the diagonal matrix with the elements of $z_i^k$ on its diagonal,
i.e., $z_{ij}^k$ as the $(j,j)$ diagonal element. Thus only the $j$-th component of $z_i$ is updated when the $j$-th action is selected,
i.e., when $e_i^k = e_{ij}$. Compare (9.53) with P-RL-ER in (9.34). Similarly, the vector form of (9.51) is the estimator $E_i$,
\[
(E_i): \quad z_i^{k+1} = z_i^k + \gamma_i^k \big(\pi_i^k I - \mathrm{diag}(z_i^k)\big)\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_i^k \qquad (9.54)
\]
\[
(S_i): \quad x_{ij}^k = [\beta^\varepsilon(z_i^k)]_j := [\sigma_i(z_i^k)]_j = \frac{\exp(\frac{1}{\varepsilon}\, z_{ij}^k)}{\sum_{j' \in M_i} \exp(\frac{1}{\varepsilon}\, z_{ij'}^k)}, \quad j \in M_i \qquad (9.55)
\]
based on the Q-value vector $z_i \in \mathbb{R}^{m_i}$, the internal state of player $i$. This is similar to the soft-max exploration method
of reinforcement learning [144]. In Q-RL-type, from (9.54), (9.55), the computations of $P_i$ are $\Sigma_i^d = S_i \circ E_i$,
\[
\Sigma_i^d: \quad z_i^{k+1} - z_i^k = \gamma_i^k \big(\pi_i^k I - \mathrm{diag}(z_i^k)\big)\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_i^k.
\]
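As a concrete illustration of the IQL scheme (9.50)-(9.55), the following is a minimal simulation sketch for a 2-player matrix game; the payoff matrices, the temperature $\varepsilon$ and the step-size sequence $\gamma_i^k = 1/k$ are illustrative assumptions, not taken from [83].
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1                                  # logit temperature (assumed)
U1 = np.array([[1.0, 0.0], [0.0, 2.0]])    # illustrative payoffs: U1[a1, a2]
U2 = np.array([[1.0, 0.0], [0.0, 2.0]])    # and U2[a1, a2]

def logit(z):                              # decision map (S_i), cf. (9.55)
    w = np.exp(z / eps - np.max(z / eps))
    return w / w.sum()

z = [np.zeros(2), np.zeros(2)]             # Q-value (payoff estimate) vectors
for k in range(1, 20001):
    gamma = 1.0 / k                        # diminishing learning rate (assumed)
    x = [logit(z[0]), logit(z[1])]         # strategies from the static map sigma_i
    a = [rng.choice(2, p=x[0]), rng.choice(2, p=x[1])]
    pay = [U1[a[0], a[1]], U2[a[0], a[1]]]
    for i in range(2):
        j = a[i]                           # only the played component is updated, cf. (9.51)
        z[i][j] += gamma * (pay[i] - z[i][j]) / x[i][j]

print("payoff estimates:", np.round(z[0], 3), np.round(z[1], 3))
print("strategies      :", np.round(logit(z[0]), 3), np.round(logit(z[1]), 3))
\end{verbatim}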
Conditioning on Pi choosing the j action or ei j , take expected value of the incremental difference in zi ,
\begin{align*}
E\big[(z_i^{k+1} - z_i^k)/\gamma_i^k \,\big|\, \mathcal{F}_k\big]
&= E_{e_i^k}\Big[ E\big[(z_i^{k+1} - z_i^k)/\gamma_i^k \,\big|\, \mathcal{F}_k,\, e_i^k\big] \Big] \qquad\qquad (9.56)\\
&= \sum_{j \in M_i} E\big[(z_i^{k+1} - z_i^k)/\gamma_i^k \,\big|\, \mathcal{F}_k,\, e_i^k = e_{ij}\big]\; P[e_i^k = e_{ij}]\\
&= \sum_{j \in M_i} E\Big[\pi_i^k\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_i^k \,\Big|\, \mathcal{F}_k,\, e_i^k = e_{ij}\Big]\; P[e_i^k = e_{ij}]\\
&\quad - \sum_{j \in M_i} E\Big[\mathrm{diag}\big(z_i^k / x_i^k\big)\, e_i^k \,\Big|\, \mathcal{F}_k,\, e_i^k = e_{ij}\Big]\; P[e_i^k = e_{ij}]
\end{align*}
where $E[\,\cdot \,|\, \mathcal{F}_k]$ denotes the expectation induced by the history of the process up to time $k$, $\mathcal{F}_k = z^k$. Next,
\begin{align*}
E\Big[\pi_i^k\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_i^k \,\Big|\, e_i^k = e_{ij}\Big]
&= E\Big[U_i(e_i^k, e_{-i}^k)\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_i^k \,\Big|\, e_i^k = e_{ij}\Big]\\
&= E\Big[U_i(e_{ij}, e_{-i}^k)\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_{ij}\Big]
= E\big[U_i(e_{ij}, e_{-i}^k)\big]\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_{ij}
= U_i(e_{ij}, x_{-i}^k)\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_{ij}
\end{align*}
where xk−i = σ−i (zk ) are the opponents’ mixed strategies that result from the vector z. Also,
\[
E\big[\mathrm{diag}(z_i^k / x_i^k)\, e_i^k \,\big|\, e_i^k = e_{ij}\big] = \mathrm{diag}(z_i^k / x_i^k)\, e_{ij}
\]
Figure 9.10: Block diagram of the Q-RL scheme (payoff estimator with internal state $z_i$, static decision map $\sigma_i$ with output $x_i$, and the opponents' strategies $x_{-i}$ entering through $U_i(\cdot)$).
\begin{align*}
E\big[(z_i^{k+1} - z_i^k)/\gamma_i^k \,\big|\, \mathcal{F}_k\big]
&= \sum_{j \in M_i} U_i(e_{ij}, x_{-i}^k)\, \mathrm{diag}\Big(\frac{1}{x_i^k}\Big)\, e_{ij}\, x_{ij}^k
 - \sum_{j \in M_i} \mathrm{diag}\big(z_i^k / x_i^k\big)\, e_{ij}\, x_{ij}^k \qquad (9.57)\\
&= \sum_{j \in M_i} U_i(e_{ij}, x_{-i}^k)\, \frac{1}{x_{ij}^k}\, e_{ij}\, x_{ij}^k
 - \sum_{j \in M_i} z_{ij}^k\, \frac{1}{x_{ij}^k}\, e_{ij}\, x_{ij}^k\\
&= \sum_{j \in M_i} U_i(e_{ij}, x_{-i}^k)\, e_{ij} - \sum_{j \in M_i} z_{ij}^k\, e_{ij}
 = U_i(x_{-i}^k) - z_i^k
\end{align*}
since $\{e_{i1}, \ldots, e_{ij}, \ldots, e_{im_i}\}$ are the basis vectors in $\mathbb{R}^{m_i}$, where $x_{ij}^k = [\sigma_i(z_i^k)]_j = [\beta^\varepsilon(z_i^k)]_j$, and $x_{-i}^k = \sigma_{-i}(z^k) = \beta_{-i}^\varepsilon(z^k)$.
These relations and $x_i^k = \sigma_i(z_i^k) := \beta^\varepsilon(z_i^k)$ lead to the mean ODE in continuous-time,
\[
\dot{z}_i = U_i(x_{-i}) - z_i, \qquad x_i = \sigma_i(z_i) = \beta^\varepsilon(z_i),
\]
a dynamical system with state $z_i$ and output $x_i$, where we note that $x_{-i} = \sigma_{-i}(z)$ are the other players' mixed
strategies that result from the whole vector $z$, as shown in the figure. Note that $z_i$, the internal state variable which is
a utility function estimate, has dynamics of an observer type.
Component-wise,
\[
\dot{z}_{ij} = U_i(e_{ij}, x_{-i}) - z_{ij}, \qquad j \in M_i.
\]
A slightly different ODE (with $x_{ij}$ as a factor on the right-hand side) results in the work of Cominetti, where no
normalization is used, [44]. For $N = 2$-player games, the long-run behavior of the algorithm is analyzed via this
mean ODE in the z (payoff estimation) space, [83], based on standard theorems of stochastic approximation [26]. The
strategy evolution resembles the smooth best response dynamics that characterizes stochastic fictitious play (FP), see
Benaïm and Hirsch’s work, [27], even though individual Q-learning (IQL) uses significantly less information. Using
techniques from [82], extension to N-player partnership games is studied for player-dependent learning rates. The
use of diminishing learning rates is beneficial and allows one to use stochastic approximation results, but in general
leads to slow convergence time. Similar results are obtained in [44] for potential games (congestion games). An
induced $x$ (strategy) dynamics is obtained based on a key property of the logit function $\beta^\varepsilon$ (see Lemma 6 in Section
2.3 in [44]). In addition, a nice Lagrangian interpretation is obtained (Section 3.2), which seems connected with
the HBF and the port-Hamiltonian approach in control systems.
Algorithms of Q-RL type are also called actor-critic algorithms and were initially developed for a single decision
maker (N = 1) in the context of Markov-processes, in the work of Tsitsiklis, [149], Singh and co-authors, [138],
Borkar and Meyn, [36], Kushner and Yin, [81]. While convergence of the Q-RL and Q-learning processes is well estab-
lished for a single agent, it is a more difficult problem in multi-agent settings. In the last several years, extensions
to multiple-decision makers (games) have been made, such as the works of Claus and Boutilier, [43], Leslie and
Collins, [83], [41]. In the single-player case, assuming a stationary environment (i.e., that the probabilities of re-
ceiving specific reinforcement signals do not change over time), if each action is executed in each state an infinite
number of times and the learning rate is decayed appropriately, the Q-values will converge with probability one to
the true optimal ones, [144]. In a game setting one of the main difficulties is the fact that utilities /payoffs of each
player depend on the joint-actions or strategies of all other players, and hence are not stationary. Thus convergence
results are limited because the Markovian property does not hold.
In Q-RL-type algorithms, where the x strategy update rule is a static map, for example in the work of Leslie and
Collins, [83], and Cominetti and co-authors in [44], only the Q or z-dynamics is considered (payoff estimate dynam-
ics). This dynamics is similar to the z dynamics in the P-RL (see [94]). Both [83] and [44] analyze the z-dynamics,
using techniques borrowed from the theory of stochastic approximations. One of the differences between the two
works is that the normalization factor considered in the z update in [83] leads to a simpler ODE for z. Also, for
Q-RL-type, Mertikopoulos and Sandholm, [94], use a $\sigma_i$ of best-response type and start the development of the RL
algorithm directly from the ODE. They analyze the ODE, which is of RD type (with some extensions), for a ``non-steep''
or ``steep'' penalty. This RL or $z$ dynamics (the payoff estimates' ODE) is the same as the Q-dynamics used by Leslie and
Collins in [83]. However, unlike [83], [44], the analysis is done for the $x$ strategy variables, whose induced strategy
dynamics is obtained from the z-dynamics. This approach follows the idea of Coucheney et al. in [45], where a
general perturbation function ν is used, not necessarily only entropy-like, and a logit (exponential) map is used for
the strategy rule, and a relaxation parameter T > 0. The key to derive this induced strategy dynamics is the fact that
σi is an instantaneous static map of best-response type, like a change of coordinates from the payoff estimate z. The
derivation of the induced dynamics is made tractable by exploiting either KKT conditions (Section 2.3 in [45], or
[94]) or specific properties of the logit function (as in Lemma 6 in [44]). In [44], alternative descriptions are obtained
for the dynamics and the corresponding rest points, such as a Lagrangian representation. What is remarkable is that
in all these cases the induced strategy dynamics are variants of the replicator dynamics (RD) again!
However in general the strategy update rule can also be dynamic (has memory). Hence the natural state is the
combined mixed-strategy xi and the payoff estimate zi of each player, leading to a coupled/interconnected system
between the two coupled sub-systems, as done by Leslie and Collins, [82]. We note that there are very few results for
this case. See Leslie and Collins's work [82] on using specific multiple time-scales, inspired by the stochastic
approximation results of Borkar, [35].
9.6 Appendices
Consider a discrete-time stochastic process with diminishing step-sizes which asymptotically tracks its
deterministic continuous-time limit. Such a process is called a stochastic approximation algorithm. The theory
of stochastic approximation algorithms was initiated by Robbins and Monro and by Kiefer and Wolfowitz in
the early 1950s. This work has been extended by a number of authors, among which Benaïm's work stands out as
generalizing the results beyond simple dynamics and gradient systems.
Consider a probability space and a discrete-time process $\{z^k\}$, $k \in \mathbb{N}$, taking values in $\mathbb{R}^n$, which satisfies the following difference equation
\[
z^{k+1} = z^k + \gamma^k \big( f(z^k) + \xi^k \big), \qquad (9.60)
\]
with step-sizes $\gamma^k$ satisfying
\[
\sum_k \gamma^k = \infty, \quad \text{and} \quad \lim_{k \to \infty} \gamma^k = 0.
\]
Such a process can be related to the ODE $\dot{z} = f(z)$: one can compare sample paths $\{z^k\}$ with trajectories of the flow $\Phi$ induced by the vector
field $f$, i.e., of the dynamical system described by the ODE.
The ODE method allows one to analyze the asymptotic behaviour of the algorithm in terms of the behaviour of the
averaged ordinary differential equation (ODE)
\[
\frac{dz}{dt} = f(z)
\]
obtained by suitable averaging, where $f$ is defined as
\[
f(z^k) := E\Big( \frac{1}{\gamma^k}\,(z^{k+1} - z^k) \,\Big|\, \mathcal{F}_k \Big)
\]
when the noise
\[
\xi^k := \frac{1}{\gamma^k}\,(z^{k+1} - z^k) - f(z^k)
\]
has special properties (called ``martingale difference noise''), i.e.,
\[
E(\xi^k \,|\, \mathcal{F}_k) = 0.
\]
(iii) $E(\xi^k \,|\, \mathcal{F}_k) = 0$.
Assumption (A1) means that the vector field $f$ is globally integrable, i.e., there exist unique integral solutions of
the ODE $\dot{x} = f(x)$. A stochastic process $\{z^k\}$ given by (9.60) under Assumptions (A2)(i)-(iii) is called a Robbins-
Monro algorithm, or is said to satisfy the martingale difference noise condition (see Kushner and Yin's book, [81]).
Assumption (A3) bounds the step-sizes and the perturbations. The boundedness assumption (A4) on the iterates
$\{z^k\}$ is satisfied in many applications where the iterates remain in a compact set.
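For intuition, here is a minimal Robbins-Monro sketch: the drift $f(z) = -(z-1)$, the Gaussian martingale-difference noise and the step-sizes $\gamma^k = 1/k$ are illustrative assumptions; the iterates track the mean ODE $\dot z = -(z-1)$ and converge to its equilibrium $z^* = 1$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

def f(z):
    return -(z - 1.0)          # drift of the mean ODE; z* = 1 is its stable equilibrium

z = 5.0
for k in range(1, 100001):
    gamma = 1.0 / k                      # sum_k gamma_k = inf, gamma_k -> 0
    xi = rng.normal()                    # martingale difference noise: E[xi | F_k] = 0
    z = z + gamma * (f(z) + xi)          # z^{k+1} = z^k + gamma_k (f(z^k) + xi^k)

print("z after 1e5 iterations:", round(z, 4))    # close to z* = 1
\end{verbatim}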
As for the assumptions on $f$, most of the initial works considered the simplest dynamics, such as when $f$ is linear or is the
gradient of a cost function. While very appropriate for algorithms designed to minimize a cost function (optimization
setup), such an assumption does not fit typical learning in games, where $f$ may have more complicated dynamics.
In his influential work, [24], [26], Benaïm generalizes such results to other dynamics besides gradients. Specifically,
Benaïm shows that the behaviour of the algorithm can be related to the notion of chain recurrence, which is a weak
notion of recurrence for the ODE. The main theorem states that the limit sets of the solutions of (DSA) are nonempty,
compact, connected sets, invariant under the flow of the ODE and contained in its chain-recurrent set. In other
words, the limit sets of (DSA) look like the $\omega$-limit sets of the ODE. Alternatively stated, the result shows that the
limit sets of the process are almost surely compact, connected, attractor-free (or internally chain transitive (ICT)) sets for
the deterministic flow induced by $f$.
The following is a classical result which relates the asymptotic behaviour of the stochastic algorithm to a notion
of recurrence for the ODE, namely that of a fixed point: Let z∗ be a stable equilibrium for the ODE. If {γk }k≥0
goes to zero at a suitable rate and if the sequence {zk }k≥0 enters infinitely often a compact subset K of the basin of
attraction of z∗ , Ba (z∗ ), then {zk }k≥0 converges almost surely toward z∗ . The main result of Benaïm in [26], [24] is
that for Robbins-Monro type algorithms, the limit sets L(zk ) of the solutions for a discrete stochastic approximation
(DSA) are nonempty, compact, connected sets, invariant under the flow Φ of a corresponding averaged ODE, and
are contained in the chain-recurrent set $CR(\Phi)$; hence the limit sets $L(z^k)$ of (DSA) look like the positive limit sets $L^+(\Phi)$ of the ODE.
This relationship is shown through a particular continuous-time linear interpolation of $z^k$.
In [29], Benaïm and co-authors show that the dynamical system approach can be generalized from an ODE to
\[
\frac{dz}{dt} \in f(z),
\]
i.e., a differential inclusion (DI), as often encountered in game theory.
An important class of games is represented by the class of potential games introduced by Monderer and Shapley,
[97]. These games have attractive properties and can be used to design decentralized algorithms in large-scale
problems (see Scutari et al., [133], and Arslan, Marden and Shamma, [16]). The problem of distributed traffic routing,
modelled as a large-scale congestion game, is one of the best-known examples of such large-scale problems, studied
by Rosenthal forty years ago, [126]. In a potential game all players' utility functions are related to a single function,
called the potential function, [97].
Definition 9.3. [97] $P : A \to \mathbb{R}$ is a potential for the game $G$ if for all $i \in \mathcal{N}$, for all $a_{-i} \in A_{-i}$, and for all $a_i, a_i' \in A_i$,
\[
U_i(a_i, a_{-i}) - U_i(a_i', a_{-i}) = P(a_i, a_{-i}) - P(a_i', a_{-i}).
\]
Intuitively, for any player i’s unilateral change of action, the change in his utility/payoff is equal to the potential
change. A potential game has a number of useful properties.
Theorem 9.4. [97] Every finite potential game has at least one pure Nash equilibrium.
When a player changes his strategy this is called a step in the game. A step in which that player’s utility is improved
is called an improvement step. A sequence of steps, (a(0), a(1), . . . , a(T )) in which at each step one and only one
player changes his strategy is called a path. A path of finite length T has an initial point, a(0) and a terminal point
$a(T)$. A path is called an improvement path if at each step the utility of the deviating player improves. A game has the finite
improvement property if every improvement path is finite.
Theorem 9.5. [97] Every improvement path in an ordinal potential game is finite.
Based on the finite improvement property, algorithms in which players play “better responses" in each iteration
converge to a Nash equilibrium in finite time, as well as simple adaptive processes, [156].
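The finite improvement property can be illustrated on a toy congestion game (the game, the starting profile and the update order below are illustrative assumptions): two players each choose one of two links, a player's cost is the load on its chosen link, and asynchronous strict better responses stop after finitely many improvement steps at a pure NE.
\begin{verbatim}
# Toy 2-player congestion game: action = chosen link in {0, 1}; a player's cost
# equals the number of players using its chosen link (a potential game).
def cost(i, a):
    return sum(1 for aj in a if aj == a[i])

a = [0, 0]                     # both players start on link 0 (assumed)
steps, improved = 0, True
while improved:                # iterate better responses until no one can improve
    improved = False
    for i in range(2):
        for ai in range(2):
            b = list(a)
            b[i] = ai
            if cost(i, b) < cost(i, a):      # strict improvement step for player i
                a, improved, steps = b, True, steps + 1
print("pure NE reached:", a, "after", steps, "improvement step(s)")
\end{verbatim}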
Similar to $U_i(x)$, an extension of the potential function to the mixed-strategy space can be made, so that
\[
P(x) = \sum_{j \in A_i} x_{ij}\, P(e_{ij}, x_{-i}),
\]
where $P(e_{ij}, x_{-i})$ is the expected value of the potential when player $i$ uses the pure strategy $e_{ij}$ and the other players use the mixed strategies $x_{-i}$.
Definition 9.6. A finite $N$-player game with action sets $\{A_i\}_{i=1}^N$ and utility functions $\{U_i\}_{i=1}^N$ is a potential game if
there exists some continuously differentiable potential function $P : \Delta \to \mathbb{R}$ such that
\[
U_i(x_i, x_{-i}) - U_i(x_i', x_{-i}) = P(x_i, x_{-i}) - P(x_i', x_{-i}), \quad \forall\, x_i, x_i' \in \Delta_i,\ \forall\, x_{-i},\ \forall\, i \in \mathcal{N}.
\]
We give below the two most important forms of the fixed-point theorem, Brouwer's and Kakutani's, with wide
applications in general equilibrium theory.
Before that, we review some definitions and a basic Maximum Theorem that is needed by both of them.
Definition A.1. A set-valued mapping or correspondence Φ from X ⊂ Rl to Y ⊂ Rn is some rule that associates
one or more points in Y with each point in X , hence a mapping that associates with each element x ∈ X a (nonempty)
subset Φ(x) ⊂ Y . Thus Φ is a correspondence from X to Y , written as Φ : X ⇒ Y .
This means that for a set-valued mapping (correspondence) the image of each element of the domain corresponds
to a subset of one or more elements of the co-domain (range), or for each input there may be many outputs. We
reserve the name of function for the case in which the image of every point in X under Φ is a singleton in Y , and we
generally denote these by a lower case, e.g. φ : X → Y .
Definition A.3. Let $\Phi : X \rightrightarrows Y$ be a set-valued mapping (correspondence). Then $x$ is a fixed point of $\Phi$ if $x \in \Phi(x)$.
The property of continuity as defined for single-valued functions is not immediately extendible to multi-valued or
set-valued mappings. In order to derive a more generalized definition, the dual concepts of upper semi-continuity
and lower semi-continuity are introduced. A correspondence that has both properties is said to be continuous, in
analogy to the property of the same name for functions.
By the Closed Graph Theorem, for a map $\Phi$ that has $\Phi(x)$ closed and compact in $Y$, a necessary and sufficient
condition for $\Phi$ to be upper semi-continuous at a point $x \in X$ is that $\Phi$ has a closed graph $Gr(\Phi)$ at $x$, i.e., iff for any
sequences $x^n \to x$ and $y^n \to y$ with $y^n \in \Phi(x^n)$, it follows that $y \in \Phi(x)$.
The Maximum Theorem provides conditions for the continuity of an optimized function and of the set of its maximizers
as a parameter changes. The statement was first proven by Claude Berge in 1959. It is a particularly useful result
in game theory, where optimization by a player with respect to its own action, while the other players' actions vary, can be
seen as a parametrized (parametric) optimization problem.
Theorem A.7 ([143], Berge's Maximum Theorem). Let $X$ and $Y$ be metric spaces, $f : X \times Y \to \mathbb{R}$ be a function
jointly continuous in its two arguments, and $Y$ be compact. Then
1. $f^* : X \to \mathbb{R}$ with
\[
f^*(x) := \max_{y \in Y} f(x, y)
\]
is a continuous function;
2. $\Phi : X \rightrightarrows Y$ with
\[
\Phi(x) := \arg\max_{y \in Y} f(x, y) = \{\, y \in Y \mid f(x, y) = f^*(x) \,\}
\]
is a non-empty, compact-valued and upper semi-continuous correspondence.
The theorem is typically interpreted as providing conditions for a parametric optimization problem to have solutions that
are continuous with respect to the parameter ($x$ in this case). As we will see, Brouwer's and Kakutani's fixed-point
theorems, given next, require compactness and continuity, and the Maximum Theorem provides sufficient conditions
for these to hold.
A.1 Brouwer's Fixed-Point Theorem
The Brouwer fixed point theorem is a fundamental result in topology which proves the existence of fixed points for
continuous functions defined on compact, convex subsets of Euclidean spaces.
Theorem A.8 ([19], Brouwer Fixed-point Theorem). If S is a compact and convex subset of Rn and f : S → S is a
continuous function mapping S into itself, then there exists at least one x ∈ S such that x = f (x).
There are many variants of fixed point theorems but Brouwer’s is particularly well known, due in part to its use across
numerous fields of mathematics. The theorem is used for proving results about differential equations and is covered
in most introductory courses on differential geometry. It appears in fields such as game theory and economics in the
proof of existence of general equilibrium.
We give below a very simple sketch of the proof in the one-dimensional case. In higher dimensions topological arguments
come into play, but these are beyond our scope. An intuitive argument is as follows. Consider a continuous function
$f : [a, b] \to [a, b]$. Saying that this function has a fixed point amounts to saying that its graph intersects that of the
identity function $y = x$ (the diagonal) defined on the same interval $[a, b]$. Intuitively, any continuous line from the left edge
of the square to the right edge must necessarily intersect the diagonal.
Formally, consider the function $g$ which maps $x$ to $f(x) - x$. Then $g(a) = f(a) - a \geq 0$ and $g(b) = f(b) - b \leq 0$. By the intermediate
value theorem, $g$ has a zero in $[a, b]$; this zero is a fixed point of $f$. Thus in one dimension, Brouwer's fixed point theorem
is equivalent to the intermediate value theorem.
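The one-dimensional argument translates directly into a numerical procedure: bisection on $g(x) = f(x) - x$ locates a fixed point of any continuous $f : [a, b] \to [a, b]$. The particular map $f = \cos$ on $[0, 1]$ below is an illustrative assumption.
\begin{verbatim}
import math

def fixed_point_1d(f, a, b, tol=1e-10):
    """Find x in [a, b] with f(x) = x by bisection on g(x) = f(x) - x."""
    g = lambda x: f(x) - x
    lo, hi = a, b            # g(lo) >= 0 and g(hi) <= 0 since f maps [a, b] into itself
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: f(x) = cos(x) is continuous and maps [0, 1] into itself.
x_star = fixed_point_1d(math.cos, 0.0, 1.0)
print(x_star, math.cos(x_star))    # both approximately 0.739085
\end{verbatim}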
While Brouwer's fixed point theorem proves the existence of fixed points for continuous functions defined on compact,
convex subsets of Euclidean spaces, Kakutani's theorem extends this to set-valued mappings (correspondences). In
that case a fixed point is a point which is mapped to a set containing it. The theorem was proved by Shizuo Kakutani
in 1941 (hence its name). It is this variant that John Nash used in his description of Nash equilibrium for $N$-player
games, work that would later earn him a Nobel Prize in Economics: every finite game with mixed strategies,
for any number of players, has a Nash equilibrium.
Theorem A.9. Let S be a non-empty, compact and convex subset of Rn , and let Φ : S ⇒ S be a set-valued corre-
spondence, with a closed graph, and which assigns to each x ∈ S a non-empty, convex subset Φ(x) ⊂ S. Then Φ has
a fixed-point, i.e., there exists some x ∈ S such that x ∈ Φ(x).
Note that in this last statement of the Kakutani theorem, Φ is required to be closed-valued and upper semi-continuous.
By the Closed Graph theorem this is equivalent to Φ having a closed graph, Gr(Φ), hence the first statement.
There are many extensions of Brouwer's fixed point theorem. For example, a generalization to infinite-dimensional
Banach spaces is known as Schauder's fixed-point theorem. Another extension is a combinatorial analogue,
known as Sperner's lemma.
Appendix B
Standard Optimization Review
A set Ω ⊂ Rm is convex if for every x, y ∈ Ω and every α ∈ [0, 1], it follows that α x + (1 − α ) y ∈ Ω. A real valued
function f : Ω → R defined over a convex set Ω is convex if for every x, y ∈ Ω and every α ∈ [0, 1],
f ( α x + ( 1 − α ) y ) ≤ α f ( x) + ( 1 − α ) f ( y)
If the above is a strict inequality for every α ∈ (0, 1), then f is called strictly convex.
A function $f : \mathbb{R}^m \to \mathbb{R}$ is said to be differentiable if the partial derivatives of $f$ with respect to the components
of its argument $u \in \mathbb{R}^m$, $\frac{\partial f(u)}{\partial u_i}$, $i = 1, \ldots, m$, exist. If these are also continuous, the function $f$ is called a continuously
differentiable ($C^1$) function. Its gradient at $u$, denoted by $\nabla f(u)$, is defined as the column vector
\[
\nabla f(u) = \Big[\, \frac{\partial f(u)}{\partial u_1} \ \cdots \ \frac{\partial f(u)}{\partial u_m} \,\Big]^T. \qquad \text{(B.1)}
\]
Let $f : \mathbb{R}^m \to \mathbb{R}$ be a twice continuously differentiable ($C^2$) function. Its Hessian, denoted by $\nabla^2 f$, is the symmetric
matrix defined as
\[
\nabla^2 f(u) = \begin{bmatrix}
\frac{\partial^2 f(u)}{\partial u_1^2} & \frac{\partial^2 f(u)}{\partial u_1 \partial u_2} & \cdots & \frac{\partial^2 f(u)}{\partial u_1 \partial u_m} \\
\frac{\partial^2 f(u)}{\partial u_2 \partial u_1} & \frac{\partial^2 f(u)}{\partial u_2^2} & \cdots & \frac{\partial^2 f(u)}{\partial u_2 \partial u_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(u)}{\partial u_m \partial u_1} & \frac{\partial^2 f(u)}{\partial u_m \partial u_2} & \cdots & \frac{\partial^2 f(u)}{\partial u_m^2}
\end{bmatrix} \qquad \text{(B.2)}
\]
A continuously differentiable function $f$ is convex on $\Omega$ if and only if
\[
f(u) \geq f(\xi) + \nabla^T f(\xi)\,(u - \xi), \quad \forall\, u, \xi \in \Omega.
\]
A twice continuously differentiable function is convex on $\Omega$ if and only if its Hessian matrix $\nabla^2 f$ is positive semidefinite
on the interior of $\Omega$.
For a vector-valued function $\mathbf{f} : \mathbb{R}^m \to \mathbb{R}^m$, with components $f_1, \ldots, f_m$ which are twice continuously differentiable,
its pseudo-gradient at $u$, denoted by $\nabla \mathbf{f}(u)$, is defined as the column vector of the first-order partial derivatives of $f_i(u)$
with respect to $u_i$, $\frac{\partial f_i(u)}{\partial u_i}$,
\[
\nabla \mathbf{f}(u) := \Big[\, \frac{\partial f_1(u)}{\partial u_1} \ \cdots \ \frac{\partial f_m(u)}{\partial u_m} \,\Big]^T, \qquad \text{(B.3)}
\]
which is the diagonal of the Jacobian matrix of $\mathbf{f}$. The Jacobian of $\nabla \mathbf{f}(u)$ with respect to $u$ is denoted by $\nabla^2 \mathbf{f}(u)$,
\[
\nabla^2 \mathbf{f}(u) := \begin{bmatrix}
\frac{\partial^2 f_1(u)}{\partial u_1^2} & \frac{\partial^2 f_1(u)}{\partial u_1 \partial u_2} & \cdots & \frac{\partial^2 f_1(u)}{\partial u_1 \partial u_m} \\
\frac{\partial^2 f_2(u)}{\partial u_2 \partial u_1} & \frac{\partial^2 f_2(u)}{\partial u_2^2} & \cdots & \frac{\partial^2 f_2(u)}{\partial u_2 \partial u_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f_m(u)}{\partial u_m \partial u_1} & \frac{\partial^2 f_m(u)}{\partial u_m \partial u_2} & \cdots & \frac{\partial^2 f_m(u)}{\partial u_m^2}
\end{bmatrix} \qquad \text{(B.4)}
\]
where ∇v f (v; u) and ∇u f (v; u) are the gradients defined in (B.1), with respect to the first argument v and the second
argument u, respectively.
This section provides an overview of optimization problems. In particular, the covered topics include unconstrained
and constrained optimization problems, the Lagrange multiplier and the duality approach.
A mathematical optimization problem has the form
min J0 (u)
subject to u ∈ Ω, (B.6)
where u = [u1 , . . . , um ]T is the variable, Ω ⊆ Rm is the general constraint set. The problem (B.6) is called an
unconstrained optimization problem if Ω = Rm . The objective function J0 : Ω → R is the cost function.
A vector satisfying the constraints in (B.6) is called a feasible vector for this problem. A feasible vector $u^{opt}$ is a
local minimum of $J_0(u)$ if there exists an $\varepsilon > 0$ such that
\[
J_0(u^{opt}) \leq J_0(u), \quad \forall\, u \in \Omega \ \text{with}\ \|u - u^{opt}\| < \varepsilon,
\]
where $\|\cdot\|$ denotes the Euclidean norm. A feasible vector $u^{opt}$ is called a global minimum of $J_0(u)$, or a solution of
the problem (B.6) if
\[
J_0(u^{opt}) \leq J_0(u), \quad \forall\, u \in \Omega,
\]
and $u^{opt}$ is called strict if the inequality above is strict for $u \neq u^{opt}$.
The following is a standard result (from Proposition 1.1.1 and Proposition 1.1.2, [33]) regarding a solution of (B.6).
Proposition B.1 ([33]). Let $u^{opt}$ be a local minimum of $J_0 : \Omega \to \mathbb{R}$. Assume that $J_0$ is continuously differentiable
in an open set $\mathcal{S} \subseteq \Omega$ containing $u^{opt}$. Then
\[
\nabla J_0(u^{opt}) = 0. \qquad \text{(B.7)}
\]
If in addition $\mathcal{S}$ is convex and $J_0(u)$ is a convex function over $\mathcal{S}$, then (B.7) is a necessary and sufficient condition
for $u^{opt} \in \mathcal{S}$ to be a global minimum of $J_0(u)$ over $\mathcal{S}$.
The problem (B.6) is called a constrained optimization problem if Ω is a strict subset of Rm , Ω ⊂ Rm . Throughout
this section, it is assumed that Ω is convex. The following result follows directly from Proposition 2.1.1 in [33].
Proposition B.2. If Ω ⊂ Rm is a convex and compact set and J0 : Ω → R is a strictly convex function, then the
problem (B.6) admits a unique global minimum.
For coupled constrained optimization problems, a specific structure constructed by inequalities can be considered
min J0 (u)
subject to gr (u) ≤ 0, r = 1, . . . , R, (B.9)
where constraints g1 (u), . . . , gR (u) are real-valued continuously differentiable functions. In a compact form the
general constraint set can be written as
Ω = {u ∈ Rm | gr (u) ≤ 0, r = 1, . . . , R}.
We denote this optimization problem (B.9) by OPT (Ω, J0 ). For any feasible vector u, the set of active inequality
constraints is denoted by
A (u) = {r | gr (u) = 0, r = 1, . . . , R}. (B.10)
If $r \notin \mathcal{A}(u)$, the constraint $g_r(u)$ is inactive at $u$. A feasible vector $u$ is said to be regular if the active inequality
constraint gradients $\nabla g_r(u)$, $r \in \mathcal{A}(u)$, are linearly independent.
The Lagrangian function $L : \mathbb{R}^{m+R} \to \mathbb{R}$ for problem (B.9) is defined as
\[
L(u, \mu) = J_0(u) + \sum_{r=1}^{R} \mu_r\, g_r(u), \qquad \text{(B.11)}
\]
where µr , r = 1, . . . , R, are scalars. The following proposition (Proposition 3.3.1 in [33]) states necessary conditions
for solving OPT (Ω, J0 ) in terms of the Lagrangian function defined in (B.11).
Proposition B.3 (Karush-Kuhn-Tucker (KKT) Necessary Conditions). Let uopt be a local minimum of the problem
(B.9). Assume that uopt is regular. Then there exists a unique vector µ ∗ = ( µ1∗ , . . . , µR∗ ), called a Lagrange multiplier
vector, such that
∇u L(uopt , µ ∗ ) = 0 (B.12)
µr∗ ≥ 0, ∀ r = 1, . . . , R (B.13)
\[
\mu_r^* = 0, \quad \forall\, r \notin \mathcal{A}(u^{opt}). \qquad \text{(B.14)}
\]
Consider next the more general problem
\[
\begin{aligned}
\min\ & J_0(u) \\
\text{subject to}\ & g_r(u) \leq 0, \quad r = 1, \ldots, R, \qquad \text{(B.15)} \\
& u \in \Omega,
\end{aligned}
\]
where Ω ⊆ Rm , which may be a strict subset of Rm . The conditions are general since differentiability and convexity
of J0 and gr , r = 1, . . . , R, are not required.
Proposition B.4 (General Sufficiency Condition). Consider the problem (B.15). Let $u^{opt}$ be a feasible vector which,
together with a vector $\mu^* = [\mu_1^*, \ldots, \mu_R^*]^T$, satisfies
\[
\mu_r^* \geq 0, \quad \forall\, r = 1, \ldots, R,
\]
\[
\mu_r^* = 0, \quad \forall\, r \notin \mathcal{A}(u^{opt}),
\]
and assume $u^{opt}$ minimizes the Lagrangian function $L(u, \mu^*)$ in (B.11) over $u \in \Omega$, i.e.,
\[
u^{opt} \in \arg\min_{u \in \Omega} L(u, \mu^*). \qquad \text{(B.16)}
\]
Then $u^{opt}$ is a global minimum of the problem (B.15).
Remark B.5. If in addition, J0 and gr , r = 1, . . . , R, are also convex and Ω = Rm , then the Lagrangian function
L(u, µ ) is convex with respect to u. Therefore by Proposition B.1, (B.16) is equivalent to the first-order necessary
condition (B.12) in Proposition B.3. Thus conditions (B.12)-(B.14) are also sufficient.
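As a small worked example of the conditions above (an illustrative problem, not taken from [33]), take $J_0(u) = \frac{1}{2}\|u\|^2$ with the single constraint $g_1(u) = 1 - u_1 - u_2 \leq 0$. Setting $\nabla_u L = 0$ with the constraint active gives $u^{opt} = (\frac{1}{2}, \frac{1}{2})$ and $\mu_1^* = \frac{1}{2}$; the sketch below checks stationarity (B.12), dual feasibility (B.13) and complementarity numerically.
\begin{verbatim}
import numpy as np

# min J0(u) = 0.5*||u||^2   s.t.   g1(u) = 1 - u1 - u2 <= 0   (illustrative problem)
J0  = lambda u: 0.5 * u @ u
g1  = lambda u: 1.0 - u[0] - u[1]
dJ0 = lambda u: u                          # gradient of the cost
dg1 = lambda u: np.array([-1.0, -1.0])     # gradient of the constraint

u_opt = np.array([0.5, 0.5])               # candidate minimizer (from the KKT system)
mu    = 0.5                                # candidate Lagrange multiplier

print("stationarity (B.12):", dJ0(u_opt) + mu * dg1(u_opt))    # ~ [0, 0]
print("dual feasibility (B.13):", mu >= 0)
print("complementarity:", abs(mu * g1(u_opt)) < 1e-12)         # constraint active, mu*g1 = 0
\end{verbatim}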
Definition B.6. [101] Let $A := [a_{ij}]$ be an $m \times m$ matrix. The matrix $A$ is said to be diagonally dominant if
\[
|a_{ii}| \geq \sum_{j=1,\, j \neq i}^{m} |a_{ij}|, \quad \forall\, i = 1, \ldots, m,
\]
and strictly diagonally dominant if the inequality above is strict for every $i$.
Some useful results are shown in the following theorem, adapted from [68].
Theorem B.7 ([68], pp.349). Let the m × m matrix A = [ai j ] be strictly diagonally dominant. Then A is invertible
and
(a) If all main diagonal entries of A are positive, then all the eigenvalues of A have positive real part.
(b) If A is Hermitian and all main diagonal entries of A are positive, then all the eigenvalues of A are real and
positive.
Lemma B.8. Let A be an m × m real matrix with all main diagonal entries positive. Then A is positive definite if A
and AT are both strictly diagonally dominant.
Proof: If $A$ and $A^T$ are both strictly diagonally dominant, then it follows that
\[
a_{ii} > \sum_{j=1,\, j \neq i}^{m} |a_{ij}| \quad \text{and} \quad a_{ii} > \sum_{j=1,\, j \neq i}^{m} |a_{ji}|.
\]
Adding these inequalities gives $2 a_{ii} > \sum_{j \neq i} (|a_{ij}| + |a_{ji}|) \geq \sum_{j \neq i} |a_{ij} + a_{ji}|$, so $A_s = A + A^T$ is also strictly diagonally dominant. From Theorem B.7, it follows that $A_s$
is positive definite. Therefore $A$ is positive definite in the sense that its symmetric part, $\frac{1}{2} A_s$, is positive definite. $\square$
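A quick numerical check of Lemma B.8 for an illustrative matrix (chosen so that both $A$ and $A^T$ are strictly diagonally dominant with positive diagonal):
\begin{verbatim}
import numpy as np

A = np.array([[4.0,  1.0, -1.0],
              [0.5,  3.0,  1.0],
              [1.0, -0.5,  5.0]])

def strictly_diag_dominant(M):
    d = np.abs(np.diag(M))
    off = np.sum(np.abs(M), axis=1) - d
    return bool(np.all(d > off))

print(strictly_diag_dominant(A), strictly_diag_dominant(A.T))    # True True
print(np.round(np.linalg.eigvalsh(0.5 * (A + A.T)), 3))          # all positive
# The symmetric part (1/2)(A + A^T) has positive eigenvalues, so A is
# positive definite in the sense of Lemma B.8.
\end{verbatim}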
The projection method is widely used in developing iterative algorithms for solving constrained optimization problems.
Each time an update jumps outside the feasible set $X$, the algorithm can project it back onto the set $X$.
The projection of a vector $x$ onto a nonempty, closed and convex set $X$ is defined with respect to the Euclidean
norm and denoted by $[\,\cdot\,]^+$, i.e.,
\[
[x]^+ := \arg\min_{z \in X} \|z - x\|.
\]
The properties of the projection are presented in the following theorem.
Theorem B.9 ([34], Projection Theorem). Let X be a nonempty, closed and convex subset of Rn .
1. For every x ∈ Rn , there exists a unique z ∈ X that minimizes kz − xk over all z ∈ X and is denoted by [x]+ .
2. $x^* = [x]^+$ if and only if
\[
(z - x^*)^T (x - x^*) \leq 0, \quad \forall\, z \in X.
\]
3. The mapping $f : \mathbb{R}^n \to X$ defined by $f(x) = [x]^+$ is continuous and nonexpansive, that is,
\[
\|[x]^+ - [y]^+\| \leq \|x - y\|, \quad \forall\, x, y \in \mathbb{R}^n.
\]
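For the probability simplex, the constraint set appearing throughout the learning schemes above, the projection $[\,\cdot\,]^+$ admits the well-known sort-based formula; the sketch below (the test vectors are arbitrary) also checks the nonexpansiveness property in part 3.
\begin{verbatim}
import numpy as np

def project_simplex(x):
    """Euclidean projection of x onto {z >= 0 : sum(z) = 1} (sort-based formula)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, x.size + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - tau, 0.0)

rng = np.random.default_rng(3)
x, y = rng.normal(size=4), rng.normal(size=4)
px, py = project_simplex(x), project_simplex(y)
print(px, px.sum())                                        # lies on the simplex
print(np.linalg.norm(px - py) <= np.linalg.norm(x - y))    # nonexpansiveness (part 3)
\end{verbatim}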
Lipschitz continuity is a smoothness condition for functions which is stronger than regular continuity. For a function
$f(x) : D \to \mathbb{R}^n$, where $D$ is a compact subset of $\mathbb{R}^m$, the Lipschitz condition is defined as
\[
\|f(x) - f(y)\| \leq L\, \|x - y\|, \quad \forall\, x, y \in D, \qquad \text{(B.17)}
\]
where $\|\cdot\|$ denotes the Euclidean norm. The positive constant $L$ is called a Lipschitz constant.
Definition B.10. A function f (x) is said to be Lipschitz continuous if there exists a constant L ≥ 0 such that for all
x, y ∈ D, the Lipschitz condition (B.17) is satisfied.
The function f (x) is called locally Lipschitz continuous if each point of D has a neighborhood D0 such that f (x)
satisfies the Lipschitz condition for all points in D0 with some Lipschitz constant L0 .
A locally Lipschitz continuous function f (x) on D is Lipschitz continuous on every compact (closed and bounded)
set W ⊆ D.
The Lipschitz property is weaker than continuous differentiability, as stated in the next proposition.
Proposition B.11 (Lemma 3.2, [76]). If a function $f(x)$ and $\frac{\partial f}{\partial x}(x)$ are continuous on $D$, then $f(x)$ is locally
Lipschitz continuous on $D$.
Consider the system
\[
\dot{x} = f(t, x) \qquad \text{(B.18)}
\]
where $f : [0, \infty) \times D \to \mathbb{R}^n$ is piecewise continuous in time $t$ and locally Lipschitz in $x$ on $[0, \infty) \times D$, and $D \subset \mathbb{R}^n$ is
a domain that contains the origin $x = 0$. The equilibrium point $x = 0$ of (B.18) is exponentially stable if there exist
positive constants $c_1$, $c_2$ and $c_3$ such that
\[
\|x(t)\| \leq c_1\, \|x(t_0)\|\, e^{-c_2 (t - t_0)}, \quad \forall\, t \geq t_0,\ \forall\, \|x(t_0)\| < c_3, \qquad \text{(B.19)}
\]
and globally exponentially stable if (B.19) is satisfied for any initial state $x(t_0)$.
Bibliography
[1] D. Acemoglu and A. Ozaglar. Costs of competition in general networks. In Proceedings of the 44th IEEE
Conference on Decision and Control, pages 5324–5329, December 2005.
[2] G. P. Agrawal. Fiber-optic Communication Systems. John Wiley, 3rd edition, 2002.
[3] T. Alpcan. Noncooperative games for control of networked systems. Ph.D. Thesis, University of Illinois at
Urbana-Champaign, Illinois, USA, 2006.
[4] T. Alpcan and T. Basar. A game-theoretic framework for congestion control in general topology networks. In
Proceedings of the 41st IEEE Conference on Decision and Control, pages 1218–1224, December 2002.
[5] T. Alpcan and T. Basar. A hybrid systems model for power control in multicell wireless data networks.
Performance Evaluation, 57(4):477–495, August 2004.
[6] T. Alpcan and T. Basar. Distributed algorithms for Nash equilibria of flow control games. Advances in Dy-
namic Games: Applications to Economics, Finance, Optimization, and Stochastic Control, Annals of Dynamic
Games, 7:473–498, 2005.
[7] T. Alpcan, T. Basar, R. Srikant, and E. Altman. CDMA uplink power control as a noncooperative game. In
Proceedings of the 40th IEEE Conference on Decision and Control, pages 197–202, December 2001.
[8] T. Alpcan, X. Fan, T. Basar, M. Arcak, and J. T. Wen. Power control for multicell CDMA wireless networks:
A team optimization approach. Wireless Networks, 14(5):647–657, 2008.
[9] E. Altman and Z. Altman. S-modular games and power control in wireless networks. IEEE Transactions on
Automatic Control, 48(5):839–842, 2003.
[10] E. Altman and T. Basar. Multiuser rate-based flow control. IEEE Transactions on Communications,
46(7):940–949, 1998.
[11] E. Altman, T. Basar, T. Jimenez, and N. Shimkin. Routing into two parallel links: game-theoretic distributed
algorithms. Journal of Parallel and Distributed Computing, 61(9):1367–1381, 2001.
[12] E. Altman, T. Basar, and R. Srikant. Nash equilibria for combined flow control and routing in networks:
asymptotic behavior for a large number of users. IEEE Transactions on Automatic Control, 47(6):917–930,
2002.
[13] E. Altman, T. Boulogne, R. El-Azouzi, T. Jimenez, and L. Wynter. A survey on networking games in
telecommunications. Computers and Operations Research, 33(2):286–311, 2006.
[14] E. Altman and E. Solan. Constrained games: The impact of the attitude to adversary’s constraints. IEEE
Transactions on Automatic Control, 54(10):2435 – 2440, 2009.
[15] K. J. Arrow and G. Debreu. Existence of an equilibrium for a competitive economy. Econometrica, 22(3):265–
290, 1954.
[16] G. Arslan, J.R. Marden, and J.S. Shamma. Autonomous vehicle target assignment: A game-theoretical for-
mulation. ASME Journal of Dynamic Systems, Measurement and Control, (129):584–596, 2007.
[17] W. B. Arthur. On designing economic agents that behave like human agents. J. Evol. Econ., 3:1–22, 1993.
[18] R. El Azouzi and E. Altman. Constrained traffic equilibrium in routing. IEEE Transactions on Automatic
Control, 48(9):1656–1660, 2003.
[19] T. Basar and G. J. Olsder. Dynamic noncooperative game theory. SIAM Series Classics in Applied Mathe-
matics, 2nd edition, 1999.
[20] T. Basar and R. Srikant. Revenue-maximizing pricing and capacity expansion in a many-users regime. In
Proceedings of the 21st IEEE Conference on Computer Communications (INFOCOM), pages 294–301, June
2002.
[21] A. W. Beggs. Learning in bayesian games with binary actions. University of Oxford, Discussion Paper Series,
(232):1–25, 2005.
[22] A. W. Beggs. On the convergence of reinforcement learning. Journal of Economic Theory, (122):1–36, 2005.
[23] M. Benaïm and J. W. Weibull. Deterministic approximation of stochastic evolution in games. Econometrica,
(71):873–903, 2003.
[24] M. Benaïm. A dynamical system approach to stochastic approximations. SIAM J. Control and Optimization,
34(2):437–472, 1996.
[25] M. Benaïm. Recursive algorithms, urn processes and chaining number of chain recurrent sets. Ergodic Theory
and Dynamical Systems, 18(1):53–87, 1998.
[26] M. Benaïm. Dynamics of stochastic approximation algorithms. Le Seminaire de probabilites, Lecture Notes
in Math. 1709, pages 1–68, 1999.
[27] M. Benaïm and M. W. Hirsch. Mixed equilibria and dynamical systems arising from fictitious play in per-
turbed games. Games Econ. Behaviour, 29:36–72, 1999.
[28] M. Benaïm, J. Hofbauer, and E. Hopkins. Learning in games with unstable equilbria. Journal of Economic
Theory, pages 1694–1709, 2009.
[29] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM J. Control
and Optimization, 44:328–348, 2005.
[30] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions, part ii: Appli-
cations. Mathematics of Operations Research, 31:673–695, 2006.
[31] M. Benaïm, J. Hofbauer, and S. Sorin. Perturbations of set-valued dynamical systems with applications to
game theory. Dynamic Games and Applications, 2:195–205, 2012.
[32] M. Benaïm and O. Raimond. A class of self-interacting processes with applications to games and reinforced
random walks. SIAM J. Control and Optimization, 48:4707–4730, 2010.
[34] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods. Prentice-Hall,
1989.
[35] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[36] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforce-
ment learning. SIAM Journal on Control and Optimization, 38:447–469, 2000.
[37] O. Brandière. The dynamic system method and the traps. Advances in Applied Probability, 30(1):137–151,
1998.
[38] O. Brandière and M. Duflo. Les algorithmes stochastiques contournent-ils les pieges. Ann. Inst. Henri
Poincare, 32:395–427, 1996.
[39] M. Bravo and M. Faure. Reinforcement learning with restrictions on the action set. arXiv:1306.2918v1, pages
1–28, 2012.
[40] G. W. Brown. Iterative solutions of games by fictitious play. in Koopmans, T. C. et al., editors, Activity
Analysis of Production and Allocation, Wiley, New York:374–376, 1951.
[41] A. C. Chapman, D. Leslie, A. Rogers, and N. R. Jennings. Convergent learning algorithms for unknown
reward games. SIAM Journal on Control and Optimization, 51(4):3154 – 3180, 2013.
[42] G. C. Chasparis, J. S. Shamma, and A. Rantzer. Perturbed learning automata in potential games. In Proceed-
ings of the Conference on Decision and Control, pages 2453–2458, December 2011.
[43] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In
Proceedings of the 15th AAAI National Conference on Artificial Intelligence, pages 746–752, 1998.
[44] R. Cominetti, E. Melo, and S. Sorin. A payoff-based learning procedure and its application to traffic games.
Games and Economic Behavior, 70:71–83, 2010.
[45] P. Coucheney, B. Gaujal, and P. Mertikopoulos. Penalty-regulated dynamics and robust learning procedures
in games. arXiv:1303.2270v2, pages 1–33, 2014.
[46] R. Cressman and J. Hofbauer. Measure dynamics on a one-dimensional continuous trait space: Theoretical
foundations for adaptive dynamics. Theoretical Population Biology, 67:47–59, 2005.
[47] B. de Meyer. Repeated games, duality and the central limit theorem. Mathematics of Operations Research,
21(1):237–251, 1996.
[48] B. de Meyer and A. Marino. Duality and optimal strategies in the finitely repeated zero-sum games with
incomplete information on both sides. Cahiers de la Maison des Sciences Economiques, pages 1–7, 2005.
[49] G. Debreu. A social equilibrium existence theorem. Proceedings of National Academy Sciences of the United
States of America, 38(10):886–893, October 1952.
[50] P. Dubey. Inefficiency of Nash equilibria. Mathematics of Operations Research, 11(1):1–8, February 1986.
[51] I. Erev and A. E. Roth. Predicting how people play games: Reinforcement learning in experimental games
with unique, mixed strategy equilibria. American Economic Review, 88:848–881, 1998.
[52] F. Facchinei, A. Fischer, and V. Piccialli. On generalized Nash games and variational inequalities. Operations
Research Letters, 35(2):159–164, 2007.
[53] D. Falomari, N. Mandayam, and D. Goodman. A new framework for power control in wireless data net-
works: games utility and pricing. In Proceedings of the Allerton Conference on Communication, Control, and
Computing, pages 546–555, September 1998.
[54] R. A. Fisher. The Genetical Theory of Natural Selection. Oxford, Clarendon Press, 2nd (1958) edition, 1930.
[56] D. Fudenberg and D.M. Kreps. Learning mixed equilibria. Games and Economic Behaviour, 5:320–367,
1993.
[57] D. Fudenberg and D. K. Levine. Theory of Learning in Games. The MIT Press, Cambridge, 1998.
[58] D. Fudenberg and S. Takahashi. Heterogeneous beliefs and local information in stochastic fictitious play.
Games and Economic Behaviour, 71:100–120, 2011.
[59] I. Gilboa and A. Matsui. Social stability and equilibrium. Econometrica, Wiley, New York:859–867, 1991.
[60] A. Greenwald, E.J. Friedman, and S. Shenker. Learning in network contexts: experimental results from
simulations. Games and Economic Behaviour, 35:80–123, 2001.
[61] P. T. Harker. Generalized Nash games and quasi-variational inequalities. European Journal of Operational
Research, 54(1):81–94, 1991.
[62] J. C. Harsanyi. Games with randomly disturbed payoffs: A new rationale for mixed-strategy equilibrium
points. International Journal of Game Theory, 2:1–23, 1973.
[63] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica,
68(5):1127 – 1150, 2000.
[64] J. Hofbauer and W. H. Sandholm. On the global convergence of stochastic fictitious play. Econometrica,
70:2265–2294, 2002.
[65] J. Hofbauer and K. Sigmund. Evolutionary games and population dynamics. Cambridge University Press,
1998.
[66] J. Hofbauer and K. Sigmund. Evolutionary game dynamics. Bulletin of the American Mathematical Society
(New Series), 40:479–519, 2003.
[67] E. Hopkins and M. Posch. Attainability of boundary points under reinforcement learning. Games and Eco-
nomic Behavior, 53:110–125, 2005.
[68] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge University Press, 1999.
[69] M. Huang, R. P. Malhame, and P. E. Caines. Nash certainty equivalence in large population stochastic dy-
namic games: connections with the physics of interacting particle systems. In Proceedings of the 45th IEEE
Conference on Decision and Control, pages 4921–4926, December 2006.
[70] H. Ji and C. Huang. Non-cooperative uplink power control in cellular radio systems. Wireless Networks,
4(3):233–240, 1998.
[71] R. Johari, S. Mannor, and J. N. Tsitsiklis. Efficiency loss in a resource allocation game: A single link in elastic
supply. In Proceedings of the 43rd IEEE Conference on Decision and Control, pages 4679–4683, December
2004.
[72] R. Johari and J. Tsitsiklis. Efficiency loss in a network resource allocation game. Mathematics of Operations
Research, 29(3):407–435, August 2004.
[74] F. Kelly. Charging and rate control for elastic traffic. European Transactions on Telecommunications, 8:33–
37, 1997.
[75] F. P. Kelly, A. K. Maulloo, and D. Tan. Rate control for communication networks: Shadow prices, proportional
fairness and stability. Journal of the Operational Research Society, 49(3):237–252, 1998.
[77] S. Koskie and Z. Gajic. A Nash game algorithm for SIR-based power control in 3G wireless CDMA networks.
IEEE/ACM Transactions on Networking, 13(5):1017–1026, 2005.
[78] E. Koutsoupias and C. H. Papadimitriou. Worst-case equilibria. In Proceedings of the 16th Annual Symposium
STACS, pages 404–413, 1999.
[79] N.N. Krasovskii and A.I. Subbotin. Game-theoretical control problems. Springer-Verlag, 1988.
[80] S. Kunniyur and R. Srikant. End-to-end congestion control schemes: utility functions, random losses and
ECN marks. IEEE/ACM Transactions on Networking, 11(5):689–702, 2003.
[81] H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer,
2003.
[82] D. Leslie and E. Collins. Convergent multiple-timescales reinforcement learning algorithms in normal form
games. Annals of Appl. Probability, 13:1231–1251, 2003.
[83] D. Leslie and E. Collins. Individual Q-learning in normal form games. SIAM Journal on Control and Opti-
mization, 1:495–514, 2005.
[84] L. Libman and A. Orda. The designer’s perspective to atomic noncooperative networks. IEEE/ACM Trans-
actions on Networking, 7(6):875–884, 1999.
[85] A. Loja et al. Inter-domain routing in multiprovider optical networks: game theory and simulations. In
Proceedings of Next Gen. Internet Networks, pages 157–164, May 2005.
[86] S. Low, F. Paganini, and J. Doyle. Internet congestion control. IEEE Control Systems Magazine, 22(1):28–43,
2002.
[87] S. H. Low and D. E. Lapsley. Optimization flow control-I: basic algorithm and convergence. IEEE/ACM
Transactions on Networking, 7(6):861–874, 1999.
[88] Z. Luo and J. Pang. Analysis of iterative waterfiling algorithm for multiuser power control in digital subscriber
lines. EURASIP Journal on Applied Signal Processing, pages 1–10, 2006.
[89] M. Maggiore. Introduction to nonlinear control systems, course notes for ECE1647, 2009. Available in the
handouts section on the course website.
[90] P. Marbach and R. Berry. Downlink resource allocation and pricing for wireless networks. In Proceedings of
the 21st IEEE Conference on Computer Communications (INFOCOM), pages 1470–1479, June 2002.
[91] J. R. Marden, G. Arslan, and J. S. Shamma. Joint strategy fictitious play with inertia for potential games. In
Proceedings of the 44th IEEE Conference on Decision and Control, pages 6692–6697, December 2005.
[92] J. R. Marden, G. Arslan, and J. S. Shamma. Joint strategy fictitious play with inertia for potential games.
IEEE Transactions on Automatic Control, 54:208–220, 2009.
[93] A. Mecozzi. On the optimization of the gain distribution of transmission lines with unequal amplifier spacing.
IEEE Photonics Technology Letters, 10(7):1033–1035, 1998.
[94] P. Mertikopoulos and W. Sandholm. Regularized best responses and reinforcement learning in games.
arXiv:1305.0967v2 TBD, pages 1–33, 2015.
[95] D. Mitra. An asynchronous distributed algorithm for power control in cellular radio systems. In Proceedings
of the 4th WINLAB Workshop on Third Generation Wireless Information Networks, pages 249–257, November
1993.
[96] D. Monderer and L. S. Shapley. Potential games. Games and Economic Behavior, 14(1):124–143, 1996.
[97] D. Monderer and L. S. Shapley. Potential games. Games and Economic Behavior, 14:124–143, 1996.
[98] J. Morgan and M. Romaniello. Generalized quasi-variational inequalities and duality. Journal of Inequalities
in Pure and Applied Mathematics, 4(2):Article 28, 1–7, 2003.
[99] J. Morgan and M. Romaniello. Generalized quasi-variational inequalities: Duality under perturbations. Jour-
nal of Mathematical Analysis and Applications, 324(2):773–784, 2006.
[100] R. Myerson. Game Theory: Analysis of Conflict. Harvard University Press, 1991.
[102] J. Nash. Equilibrium points in n-person games. Proceedings of National Academy Sciences. USA, 36(1):48–
49, 1950.
[103] J. Nash. Non-cooperative games. The Annals of Mathematics, 54(2):286–295, September 1951.
[104] N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani (eds.). Algorithmic Game Theory. Cambridge Univer-
sity Press, 2007.
[105] M. A. Nowak. Evolutionary Dynamics: Exploring the Equations of Life. Belknap/Harvard, Cambridge, 2006.
[106] M. A. Nowak and K. Sigmund. Evolutionary dynamics of biological games. Science, 303:793–799, 2004.
[109] A. Ozdaglar and D. Bertsekas. Routing and wavelength assignment in optical networks. IEEE/ACM Trans-
actions on Networking, 11(2):259–272, 2003.
[110] Y. Pan and L. Pavel. OSNR optimization in optical networks: extension for capacity constraints. In Proceed-
ings of the 2005 American Control Conference, pages 2379–2385, June 2005.
[111] Y. Pan and L. Pavel. Global convergence of an iterative gradient algorithm for the Nash equilibrium in an
extended OSNR game. In Proceedings of the 26th IEEE Conference on Computer Communications (INFO-
COM), pages 206–212, May 2007.
[112] Y. Pan and L. Pavel. Iterative algorithms for Nash equilibrium of an extended OSNR optimization game. In
Proceedings of the 6th International Conference on Networking (ICN), April 2007.
[113] Y. Pan and L. Pavel. OSNR optimization with link capacity constraints in WDM networks: a cross layer
game approach. In Proceedings of the 4th IEEE Conference on Broadband Communications, Networks and
Systems (BroadNets), September 2007.
[114] Y. Pan and L. Pavel. A Nash game approach for OSNR optimization with capacity constraint in optical links.
IEEE Transactions on Communications, 56(11):1919–1928, November 2008.
[115] Y. Pan and L. Pavel. Games with coupled propagated constraints in optical networks with multi-link topolo-
gies. Automatica, 45(4):871–880, 2009.
[116] P. A. Parrilo. Polynomial games and sum of squares optimization. In Proceedings of the 45th IEEE Conference
on Decision and Control, pages 2855–2860, December 2006.
[117] L. Pavel. Power control for OSNR optimization in optical networks: a noncooperative game approach. In
Proceedings of the 43rd IEEE Conference on Decision and Control, pages 3033–3038, December 2004.
[118] L. Pavel. An extension of duality and hierarchical decomposition to a game-theoretic framework. In Proceed-
ings of the 44th IEEE Conference on Decision and Control, pages 5317–5323, December 2005.
[119] L. Pavel. A noncooperative game approach to OSNR optimization in optical networks. IEEE Transactions
on Automatic Control, 51(5):848–852, 2006.
[121] L. Pavel. Game theory for control of optical networks. Birkhäuser-Springer Science, 2012.
[122] R. Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of
Probability, 18(2):698–712, 1990.
[123] G. Perakis. The price of anarchy when costs are non-separable and asymmetric. In Proceedings of the 10th
International Integer Programming and Combinatorial Optimization (IPCO) Conference, pages 46–58, June
2004.
[124] G. Perakis. The “price of anarchy” under nonlinear and asymmetric costs. Mathematics of Operations Re-
search, 32(3):614–628, August 2007.
[125] J. B. Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica,
33(3):520–534, 1965.
[126] R. W. Rosenthal. A class of games possessing pure-strategy Nash equilibria. Int. J. Game Theory, 2:65–67,
1973.
[127] G. Roth and W. H. Sandholm. Stochastic approximations with constant step size and differential inclusions.
SIAM, TBD:manuscript, 2013.
[128] T. Roughgarden. Selfish routing and the price of anarchy. MIT Press, 2005.
[129] W. H. Sandholm. Population Games and Evolutionary Dynamics. Cambridge University Press, 2009.
[130] C. Saraydar, N. B. Mandayam, and D.J. Goodman. Pricing and power control in a multicell wireless data
network. IEEE Journal on Selected Areas in Communications, 19(10):1883–1892, October 2001.
[131] C. Saraydar, N. B. Mandayam, and D.J. Goodman. Efficient power control via pricing in wireless data
networks. IEEE Transactions on Communications, 50(2):291–303, 2002.
[133] G. Scutari, S. Barbarossa, and D. P. Palomar. Potential games: a framework for vector power control problems
with coupled constraints. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2006.
[134] R. Selten. Re-examination of the perfectness concept for equilibrium points in extensive games. International
Journal of Game Theory, 4:25–55, 1975.
[135] J. S. Shamma and G. Arslan. Dynamic fictitious play, dynamic gradient play and distributed convergence to
Nash equilibria. IEEE Transactions on Automatic Control, 50(1):312 – 326, 2005.
[136] U. V. Shanbhag et al. Nash equilibrium problems with congestion costs and shared constraints. In Proceed-
ings of the Conference on Decision and Control, December 2009.
[137] H. Shen and T. Basar. Differentiated Internet pricing using a hierarchical network game model. In Proceedings
of the 2004 American Control Conference, pages 2322–2327, June 2004.
[138] S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state estimation in partially observable Markovian
decision processes. In Proceedings of Machine Learning Conference, 1994.
[139] J. M. Smith. The theory of games and the evolution of animal conflicts. Journal of Theoretical Biology, 7:209
– 221, 1974.
[140] J. Maynard Smith. Evolution and the Theory of Games. Cambridge University Press, 1982.
[141] J. Maynard Smith and G. R. Price. The logic of animal conflict. Nature, 246:15–18, 1973.
[142] N. D. Stein, A. Ozdaglar, and P. A. Parrilo. Separable and low-rank continuous games. In Proceedings of the
45th IEEE Conference on Decision and Control, pages 2849–2854, December 2006.
[143] R. K. Sundaram. A first course in optimization theory. Cambridge University Press, 1996.
[144] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[145] B. Swenson, S. Kar, and J. Xavier. Distributed learning in large-scale multi-agent games: a modified fictitious
play approach. In Proceedings of the Signals, Systems and Computers (Asilomar) Conference, pages 1490–
1495, 2012.
[146] B. Swenson, S. Kar, and J. Xavier. Empirical centroid fictitious play: an approach for distributed learning in
multi-agent games. IEEE Transactions on Signal Processing, pages submitted, arXiv:1304.4577v2, 2014.
[147] C. Tan, D. Palomar, and M. Chiang. Solving nonconvex power control problems in wireless networks: low
SIR regime and distributed algorithms. In Proceedings of the 48th IEEE Global Telecommunications Confer-
ence (GLOBECOM), November 2005.
[148] P. D. Taylor and L. Jonker. Evolutionarily stable strategies and game dynamics. Mathematical Biosciences,
40:145–156, 1978.
[149] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202,
1994.
[150] E. van Damme. Stability and Perfection of Nash Equilibria. Springer Verlag, Berlin, 2nd edition, 1987.
[151] T. L. Vincent and J. S. Brown. Evolutionary game theory, natural selection, and Darwinian dynamics. Cam-
bridge University Press, 2005.
[152] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press,
1944.
[154] H. Yaiche, R. Mazumdar, and C. Rosenberg. A game theoretic framework for bandwidth allocation and
pricing in broadband networks. IEEE/ACM Transactions on Networking, 8(5):667–678, October 2000.
[155] A. Yassine et al. Competitive game theoretic optimal routing in optical networks. In Proceedings SPIE,
pages 27–36, May 2002.
[156] H. P. Young. Strategic Learning and Its Limits. Arne Ryde Memorial Lectures. Oxford University Press,
2005.