Chapter 4: Stochastic Dynamic Programming

4.1 The Axiomatic Approach to Probability: Basic Concepts of Measure Theory
The aim of this chapter is to extend the framework we introduced in Chapter 3 to include
uncertainty. To evaluate decisions, we use the well-known expected utility theory.¹ With uncertainty we will face Bellman equations of the following form:

V(x, z) = sup_{x′ ∈ Γ(x,z)} F(x, x′, z) + β E[V(x′, z′) | z].    (4.1)
We first introduce a set Z which will be our sample space. Any subset E of Z will be called an event. In this way, all results of set theory - unions, intersections, complements, ... - can be directly applied to events as subsets of Z. To each event we also assign a "measure" µ(E) = Pr{E}, called the probability of the event. These values are assigned according to a function µ which, by assumption, has the following properties (or axioms):
1. 0 ≤ µ(E) ≤ 1;
2. µ(Z) = 1;
3. For any finite or infinite sequence of disjoint sets (or mutually exclusive events) E_1, E_2, ..., such that E_i ∩ E_j = ∅ for any i ≠ j, we have

   µ(∪_{i=1}^{N} E_i) = Σ_{i=1}^{N} µ(E_i),  where N possibly equals ∞.
All properties 1-3 are very intuitive for probabilities. Moreover, we would intuitively like to consider E as any subset of Z. Well, if Z is a finite or countable set, then E can literally be any subset of Z. Unfortunately, when Z is an uncountably infinite set - such as the interval [0, 1], for example - it might be impossible to find a function µ defined on all possible subsets of Z that at the same time satisfies all three axioms presented above. Typically, what fails is the last axiom of additivity when N = ∞. Lebesgue managed to keep property 3 above by defining the measure function µ only on the so-called measurable sets (or events). This is not an important limitation, as virtually all events of any practical interest turn out to be measurable. Actually, in applications one typically considers only some class of possible events, a subset of the class of all measurable sets.
The reference class of sets 𝒵 represents the set of possible events, and will constitute a σ-algebra.³ Notice that 𝒵 is a set of sets, hence an event E is an element of 𝒵, i.e. in contrast to E ⊂ Z we will write E ∈ 𝒵. The pair (Z, 𝒵) constitutes a measurable space, while the triple (µ, Z, 𝒵) is denoted a measure (or probability) space.
³ A family 𝒵 of subsets of Z is called a σ-algebra if: (i) both the empty set ∅ and Z belong to 𝒵; (ii) if E ∈ 𝒵 then also its complement (with respect to Z) E^c = Z∖E ∈ 𝒵; and (iii) for any sequence of sets such that E_n ∈ 𝒵 for all n = 1, 2, ..., we have (∪_{n=1}^{∞} E_n) ∈ 𝒵. It is easy to show that whenever 𝒵 is a σ-algebra, then (∩_{n=1}^{∞} E_n) ∈ 𝒵 as well. When Z is a set of real numbers, we can take as our set of possible events the Borel σ-algebra, which is the σ-algebra generated by the set of all open sets.
I am sure it is well known to you that the expectation operator E[·] in (4.1) is nothing more than an integral, or a summation when z takes finitely or countably many values. For example, assume p_i is the probability that z = z_i. The expectation of the function f can then be computed as follows:

E[f(z)] = Σ_{i=1}^{N} p_i f(z_i).
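For concreteness, here is a minimal numerical sketch of this formula; the support, probabilities, and function f below are made up purely for illustration.

```python
# Expectation over a finite support {z_1, ..., z_N} is a probability-weighted sum.
import numpy as np

z = np.array([0.0, 1.0, 2.0])      # hypothetical support z_1, z_2, z_3
p = np.array([0.2, 0.5, 0.3])      # probabilities p_i, summing to one

def f(x):
    return x ** 2                  # any function of the shock

Ef = np.sum(p * f(z))              # E[f(z)] = sum_i p_i f(z_i)
print(Ef)                          # 0.2*0 + 0.5*1 + 0.3*4 = 1.7
```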
One of the advantages of the Lebesgue theory of integration is that it includes both summations and the usual concept of (Riemann) integration in a unified framework. We will be able to compute expectations⁴

E[f(z)] = ∫_Z f(z) dµ(z)

no matter what Z is and no matter what the distribution µ of the events is. For example, we can deal with situations where Z is the interval [0, 1] and the event z = 0 has a positive probability µ({0}) = p_0. Since the set of all measurable events 𝒵 does not include all possible subsets of Z, we must restrict the set of functions f for which we can take expectations (integrals) as well.
Definition 32 A real-valued function f is measurable with respect to 𝒵 if for every real number x the set

E_f^x = {z ∈ Z : f(z) ≥ x}

belongs to 𝒵.
⁴ When the measure µ has a density at z, the notation dµ(z) corresponds to the more familiar f_µ(z) dz. When µ does not admit a density, dµ(z) is just the notation we use for the analogous concept.
In the definition, φ is any simple (positive) function (in its standard representation), that is, φ is a finite weighted sum of indicator functions⁵

φ(z) = Σ_{i=1}^{n} a_i I_{E_i}(z),  a_i ≥ 0,

and its integral is

∫_Z φ(z) dµ(z) = Σ_{i=1}^{n} a_i µ(E_i).
⁵ The indicator function of a set E is defined as I_E(z) = 1 if z ∈ E, and I_E(z) = 0 otherwise.

⁶ See for example SLP, Ch. 7.

⁷ One typical counterexample is the function f : [0, 1] → [0, 1] defined as f(z) = 1 if z is rational, and f(z) = 0 otherwise. This function is Lebesgue integrable, with ∫ f(x) dx = 0, but it is not Riemann integrable.

4.2 Markov Chains and Markov Processes

Consider first the case in which the shock can take only finitely many values, Z = {z_1, z_2, ..., z_N}, and define the conditional probabilities

π_ij = Pr{z′ = z_j | z = z_i},  i, j = 1, 2, ..., N.
Since π_ij describes the probability that the system moves to state z_j when the previous state was z_i, the π_ij are also called transition probabilities, and the stochastic process forms a Markov chain. To be probabilities, the π_ij must satisfy

π_ij ≥ 0,  and  Σ_{j=1}^{N} π_ij = 1  for i = 1, 2, ..., N.
Such an array is called a transition matrix, Markov matrix, or stochastic matrix. If the probability distribution over the states in period t is p^t = (p^t_1, p^t_2, ..., p^t_N), the distribution over the states in period t + 1 is p^t Π = (p^{t+1}_1, p^{t+1}_2, ..., p^{t+1}_N), where

p^{t+1}_j = Σ_{i=1}^{N} p^t_i π_ij,  j = 1, 2, ..., N.
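A minimal sketch of this distribution update in Python; the three-state matrix Π below is an arbitrary illustration, not taken from the notes.

```python
# Hypothetical 3-state transition matrix: row i gives Pr{z' = z_j | z = z_i}.
import numpy as np

Pi = np.array([[0.9, 0.1, 0.0],
               [0.2, 0.6, 0.2],
               [0.0, 0.3, 0.7]])
assert np.allclose(Pi.sum(axis=1), 1.0)    # each row is a probability distribution

p_t = np.array([1.0, 0.0, 0.0])            # degenerate distribution: the state is z_1 today
p_next = p_t @ Pi                          # p^{t+1}_j = sum_i p^t_i pi_ij
print(p_next)                              # equals the first row of Pi: [0.9, 0.1, 0.0]
```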
For example, suppose we want to know the distribution of next period's states if the current state is z_i. This means that the initial distribution is a degenerate one, namely p^t = e_i = (0, ..., 1, ..., 0). As a consequence, the probability distribution over the next period's state is the i-th row of Π: e_i Π = (π_i1, π_i2, ..., π_iN). Similarly, if p^t is the period-t distribution, then by the properties of matrix multiplication, p^t Π^n = p^t (Π · Π · ... · Π) is the period t + n distribution p^{t+n} over the states. It is easy to see that if Π is a Markov matrix then so is Π^n. A set of natural questions then arises. Is there a stationary distribution, that is, a probability distribution p* with the property p* = p* Π? Under what conditions can we be sure that if we start from any initial distribution p^0, the system converges to a unique limiting probability p* = lim_{n→∞} p^0 Π^n?
The answer to the first question turns out to always be affirmative for Markov chains.

Theorem 18 Given a stochastic matrix Π, there always exists at least one stationary distribution p* such that p* = p* Π, with p*_i ≥ 0 and Σ_{i=1}^{N} p*_i = 1.

Note that the stationarity condition p* = p* Π can be rewritten as a linear system in the unknown vector p*:

(I − Π′) p*′ = 0,

where ′ here denotes transposition.
Theorem 19 Assume that π_ij > 0 for all i, j = 1, 2, ..., N. Then there exists a limiting distribution p* such that

p*_j = lim_{n→∞} π^{(n)}_ij,

where π^{(n)}_ij is the (i, j) element of the matrix Π^n, and the p*_j are the unique nonnegative solutions of the following system of equations:

p*_j = Σ_{k=1}^{N} p*_k π_kj  (or p* = p* Π),  and  Σ_{j=1}^{N} p*_j = 1.
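In practice, the p*_j of Theorem 19 can be recovered by solving this linear system numerically. A minimal sketch, with a made-up strictly positive three-state matrix:

```python
# Solve p*(I - Pi) = 0 together with sum_j p*_j = 1, as in Theorem 19.
import numpy as np

Pi = np.array([[0.7, 0.2, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.3, 0.6]])          # illustrative, strictly positive entries
N = Pi.shape[0]

# Stack the stationarity equations (I - Pi')p*' = 0 with the adding-up constraint.
A = np.vstack([np.eye(N) - Pi.T, np.ones(N)])
b = np.concatenate([np.zeros(N), [1.0]])
p_star, *_ = np.linalg.lstsq(A, b, rcond=None)

print(p_star)                              # the stationary distribution p*
print(np.allclose(p_star @ Pi, p_star))    # True: p* = p* Pi
```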
The mapping

T_Π : Δ^N → Δ^N,   T_Π p = p Π,    (4.2)

defines a contraction on the metric space (Δ^N, |·|_N), where Δ^N denotes the set of probability distributions over the N states and

|x|_N ≡ Σ_{i=1}^{N} |x_i|.
Exercise 42 (i) Show that (Δ^N, |·|_N) is a complete metric space. (ii) Moreover, show that if π_ij > 0 for all i, j = 1, 2, ..., N, the mapping T_Π in (4.2) is a contraction of modulus β = 1 − ε, where ε = Σ_{j=1}^{N} ε_j and ε_j = min_i π_ij > 0.

When some π_ij = 0, we might lose uniqueness. However, following the same line of proof, one can show that the stationary distribution is unique as long as ε = Σ_{j=1}^{N} ε_j > 0. Could you explain intuitively why this is the case?
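One purely numerical way to get a feel for the modulus claimed in the exercise (without solving it) is to draw random points in the simplex and compare |pΠ − qΠ|_N with (1 − ε)|p − q|_N. A sketch, with an arbitrary strictly positive Π chosen here for illustration:

```python
# Numerical check: |p Pi - q Pi|_1 <= (1 - eps) |p - q|_1, with eps = sum_j min_i pi_ij.
import numpy as np

rng = np.random.default_rng(0)
Pi = np.array([[0.5, 0.3, 0.2],
               [0.1, 0.6, 0.3],
               [0.3, 0.3, 0.4]])          # all entries strictly positive (illustrative)
eps = Pi.min(axis=0).sum()                # eps = sum_j min_i pi_ij

for _ in range(5):
    p = rng.dirichlet(np.ones(3))          # two random points in the simplex
    q = rng.dirichlet(np.ones(3))
    lhs = np.abs(p @ Pi - q @ Pi).sum()
    rhs = (1.0 - eps) * np.abs(p - q).sum()
    print(lhs <= rhs + 1e-12)              # True in every draw
```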
Moreover, from the contraction mapping theorem, it is easy to see that the above proposition remains valid if the assumption π_ij > 0 is replaced with: there exists an n ≥ 1 such that π^{(n)}_ij > 0 for all i, j (see Corollary 2 of the Contraction Mapping Theorem (Th. 3.2) in SLP).
Notice that the sequence {Π^n}_{n=0}^{∞} might not always converge. For example, consider

Π = [ 0  1 ]
    [ 1  0 ].

It is easy to verify that the sequence jumps between

Π^{2n} = [ 1  0 ]
         [ 0  1 ]

and Π^{2n+1} = Π. However, the fact that in a Markov chain the state space is finite implies that the long-run averages

{ (1/T) Σ_{t=0}^{T-1} Π^t }_{T=1}^{∞}

do always converge to a stochastic matrix P*, and the sequence p^t = p^0 Π^t converges on average to

lim_{T→∞} (1/T) Σ_{t=0}^{T-1} p^t = p^0 P*.
" #
1
PT −1 1/2 1/2
In the example we saw above one can easily verify that T t=0 Πt → P ∗ = ,
1/2 1/2
and the unique stationary distribution is p∗ = (1/2, 1/2) .
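A quick numerical confirmation of this limit for the periodic example, computing the Cesàro average by brute force:

```python
# Pi = [[0,1],[1,0]]: Pi^n does not converge, but (1/T) sum_{t=0}^{T-1} Pi^t does.
import numpy as np

Pi = np.array([[0.0, 1.0],
               [1.0, 0.0]])

T = 1000
avg = np.zeros((2, 2))
Pi_t = np.eye(2)                   # Pi^0
for t in range(T):
    avg += Pi_t
    Pi_t = Pi_t @ Pi
avg /= T

print(avg)                         # close to [[0.5, 0.5], [0.5, 0.5]]
```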
In other cases, the rows of the limit matrix P* are not necessarily always identical to each other. For example, consider now the transition matrix

Π = [ 1  0 ]
    [ 0  1 ].

It is obvious that in this case P* = Π, which has two different rows. It is also clear that both rows constitute a stationary distribution. This is true in general: any row of the limit matrix P* is an invariant distribution for the transition matrix Π.
What is perhaps less obvious is that any convex combination of the rows of P* constitutes a stationary distribution, and that all invariant distributions for Π can be derived by taking convex combinations of the rows of P*.
" #
1 0
Exercise 43 (i) Consider first the above example with P ∗ = Π = . Show that
0 1
any vector p∗λ = (λ, 1 − λ) obtained as a convex combination of the rows of P ∗ constitutes
a stationary distribution for Π. Provide an intuition for the result. (ii) Now consider the
general case, and let p∗ and p∗∗ two stationary distributions for a Markov chains defined
by a generic stochastic matrix Π. Show that any convex combination pλ of p∗ and p∗∗
constitute a stationary distribution for Π.
Markov Processes The more general concept corresponding to a Markov chain, where Z can take countably or uncountably many values, is called a Markov process. Similarly to the case where Z is finite, a Markov process is defined by a transition function (or kernel) Q : Z × 𝒵 → [0, 1] such that: (i) for each z ∈ Z, Q(z, ·) is a probability measure on (Z, 𝒵); and (ii) for each C ∈ 𝒵, Q(·, C) is a measurable function. Given Q, one can compute conditional probabilities

Pr{z_{t+1} ∈ C | z_t = c} = Q(c, C).
Notice that Q can be used to map probability measures into probability measures, since for any µ on (Z, 𝒵) we get a new measure µ′ by assigning to each C ∈ 𝒵 the value

(T_Q µ)(C) = µ′(C) = ∫_Z Q(z, C) dµ(z).
Definition 34 Q has the Feller property if for any bounded and continuous function f, the function

g(z) = (P_Q f)(z) = E[f | z] = ∫_Z f(z′) dQ(z, z′),  z ∈ Z,

is itself bounded and continuous.

The above definition first of all shows another way to view Q. It also defines an operator (sometimes called the transition operator) that in general maps bounded and measurable functions into bounded and measurable functions. When Q has the Feller property, the operator P_Q preserves continuity.
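In the finite-state case both operators reduce to matrix products, which may help fix ideas: P_Q acts on functions of the state, while T_Q acts on distributions. A minimal sketch with a made-up two-state kernel:

```python
# (P_Q f)(z_i) = E[f(z') | z = z_i] = sum_j pi_ij f(z_j), i.e. the matrix product Pi f.
import numpy as np

Pi = np.array([[0.9, 0.1],
               [0.4, 0.6]])        # illustrative 2-state kernel
f = np.array([1.0, -1.0])          # a "function" of the state, stored as a vector

PQ_f = Pi @ f                      # conditional expectation of f, state by state
print(PQ_f)                        # [0.8, -0.2]

# By contrast, the Markov operator T_Q of the previous display acts on distributions:
mu = np.array([0.5, 0.5])
print(mu @ Pi)                     # new distribution mu' = T_Q mu
```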
Technical Digression (optional). It turns out that the Feller property characterizes continuous Markov transitions. The rigorous idea is simple. Let M be the set of all probability measures on the Borel sets 𝒵 of a metrizable space Z, and for each z let Q(z, ·) be a member of M. The usual topology defined on the space of Borel measures is the topology of convergence in distribution (or weak topology).⁸ It is now useful to make pointwise considerations. For each z, the probability measure Q(z, ·) can be seen as a linear mapping from the set of bounded and measurable functions into the real numbers according to x = ⟨f, Q(z, ·)⟩ = ∫ f(z′) dQ(z, z′).

It turns out that a transition function Q : Z → M is continuous if and only if it has the Feller property. The fact that a continuous Q has the Feller property is immediate: by definition of the topology on M (weak topology), via the map F_f(µ) = ⟨f, µ⟩ each bounded and continuous f defines a continuous real-valued function on M, so that z ↦ F_f(Q(z, ·)) = (P_Q f)(z) is continuous whenever Q is.
⁸ In this topology, a sequence {µ_n} in M converges to µ if and only if ∫ f dµ_n → ∫ f dµ for all continuous and bounded functions f.
A stationary (or invariant) distribution for Q is a probability measure µ* such that T_Q µ* = µ*, that is, µ* is a fixed point of the Markov operator T_Q. There are many results establishing existence and uniqueness of a stationary distribution. Here is a result which is among the easiest to understand, and which uses the Feller property of Q.
Theorem 20 If Z is a compact set and Q has the Feller property, then there exists a stationary distribution µ*: µ* = T_Q µ*, where µ = λ if and only if ∫ f dµ = ∫ f dλ for each continuous and bounded function f.
Proof. See SLP, Theorem 12.10, pages 376-77. The basic idea of the proof can also be obtained as an application of one of the infinite-dimensional extensions of the Brouwer fixed point theorem (usually called the Brouwer-Schauder-Tychonoff fixed point theorem). We saw above that whenever Q has the Feller property, the associated Markov operator T_Q is a continuous map from the compact, convex (locally convex Hausdorff) space of distributions Λ into itself. [See Aliprantis and Border (1994), Corollary 14.51, page 485.] Q.E.D.
Similarly to the finite state case, this invariant measure can be obtained by looking at the sequence {(1/T) Σ_{t=1}^{T-1} T_Q^t λ_0}_{T=1}^{∞} of T-period averages.
⁹ Let x_{µ_n} = F_f(µ_n). By definition of the weak topology, if µ_n → µ then F_f(µ_n) → F_f(µ).

¹⁰ The interested reader can have a look at Aliprantis and Border (1994), Theorem 15.14, pages 531-2.

When the state space is not finite, we may define several different concepts of convergence for distributions. The best known ones are weak convergence (commonly called convergence in distribution) and strong convergence (or convergence in the total variation norm, also called setwise convergence). We will not deal with these issues in these class notes. The concept of weak convergence is in most cases all that we care about in the
context of describing the dynamics of an economic system. Theorem 20 deals with weak
convergence. The best known uniqueness results use some monotonicity conditions on the Markov operator, together with some mixing conditions. For a quite general treatment of monotone Markov operators, with direct applications to economics and dynamic programming, see Hopenhayn and Prescott (1992).
If we require strong convergence, uniqueness can be guaranteed under conditions similar to those of Theorem 19, using the contraction mapping theorem. See Chapter 11 in SLP, especially Theorem 11.12.
4.3 Bellman Principle in the Stochastic Framework

The Finite Z case. When the shocks belong to a finite set, all the results we saw for the deterministic case remain true in the stochastic environment as well. The Bellman Principle of optimality remains true since both Lemma 1 and 2 remain true. Expectations are simply weighted sums of the continuation values. In this case, Theorem 12 remains true under the same conditions as in the deterministic case. From the proofs of Theorems 13 and 14 it is easy to see that the verification and sufficiency theorems can also easily be extended to the stochastic case with finite shocks. We just need to require boundedness to hold for all z. Even Theorems 15 and 16 are easily extended to the stochastic case following the same lines of proof we proposed in Chapter 3.1. In order to show you that there is practically no difference between the deterministic and the stochastic case when Z is finite, let me be a bit boring and consider for example the stochastic extension of Theorem 15. Assume w.l.o.g. that z may take N values, i.e. Z = (z_1, z_2, ..., z_N). We can always consider our fixed point

V(x, z_i) = sup_{x′ ∈ Γ(x,z_i)} F(x, x′, z_i) + β Σ_{j=1}^{N} π_ij V(x′, z_j),  ∀i,

viewing V as a vector of N functions V = (V(·, z_1), ..., V(·, z_N)) and endowing the space of such vectors with the metric
d^N_∞(V, W) = Σ_{i=1}^{N} d_∞(V_i, W_i) = Σ_{i=1}^{N} sup_x |V(x, z_i) − W(x, z_i)|.
One can easily show that this metric space of functions is complete, and that the same conditions used for a contraction in the deterministic case can be used here to show that the operator T : C^N(X) → C^N(X), defined by

T V(x) = ( sup_{x′ ∈ Γ(x,z_1)} F(x, x′, z_1) + β Σ_{j=1}^{N} π_1j V(x′, z_j) ,
           sup_{x′ ∈ Γ(x,z_2)} F(x, x′, z_2) + β Σ_{j=1}^{N} π_2j V(x′, z_j) ,
           ... ,
           sup_{x′ ∈ Γ(x,z_N)} F(x, x′, z_N) + β Σ_{j=1}^{N} π_Nj V(x′, z_j) ),
is a contraction with modulus β. It is easy to see that both boundedness and - by the Theorem of the Maximum - continuity are preserved under T. Similarly, given that (conditional) expectations are nothing more than convex combinations, concavity is preserved under T, and the same conditions used in the deterministic case can be assumed here to guarantee the stochastic analogue of Theorem 16.
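To illustrate how the operator T is applied in practice, here is a minimal value-function-iteration sketch for the finite-shock stochastic growth model with log utility and full depreciation. The grid, shock values, transition matrix, and parameters are assumptions made purely for illustration and are not part of the notes.

```python
# Value function iteration: V is stored as a vector of N functions, one column per shock.
import numpy as np

alpha, beta = 0.36, 0.95
z_vals = np.array([0.9, 1.1])                 # two shock levels (assumed)
Pi = np.array([[0.8, 0.2],
               [0.2, 0.8]])                   # transition probabilities (assumed)
k_grid = np.linspace(0.05, 0.5, 200)          # grid for the endogenous state

V = np.zeros((len(k_grid), len(z_vals)))      # V(k, z_i)
for _ in range(1000):
    EV = V @ Pi.T                             # EV[m, i] = sum_j pi_ij V(k'_m, z_j)
    V_new = np.empty_like(V)
    for i, z in enumerate(z_vals):
        # consumption for every (k, k') pair: rows index k, columns index k'
        c = z * k_grid[:, None] ** alpha - k_grid[None, :]
        u = np.where(c > 0, np.log(np.maximum(c, 1e-12)), -1e10)
        V_new[:, i] = np.max(u + beta * EV[:, i][None, :], axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:      # sup-norm convergence check
        V = V_new
        break
    V = V_new

print(V[:3, :])                                # a few values of the approximate fixed point
```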
The General case When Z is continuous, we need to use measure theory. We need to assume some additional technical restrictions to guarantee that the integrals involved in the expectations, and the limits inside those integrals, are well defined.

Unfortunately, these technical complications prevent the possibility of having a result along the lines of Theorem 12. The reason is that one cannot be sure that the true value function is measurable. As a consequence, the typical results in this case take the form of verification or sufficiency theorems. Before stating formally the result, we need to introduce some notation.
¹² A function is said to be h^t-measurable when it is measurable with respect to the σ-algebra generated by the set of all possible h^t histories, H^t.
A plan π is a value π_0 ∈ X together with a sequence of h^t-measurable functions π_t : H^t → X for all t ≥ 1, where H^t is the set of all length-t histories of shocks: h^t = (z_0, z_1, ..., z_t), z_t ∈ Z. That is, π_t(h^t) is the value of the endogenous state x_{t+1} that is chosen in period t, when the (partial) history up to that moment is h^t. So, in a stochastic framework agents choose contingent plans: they decide what to do for any possible history, even though some of these histories are never going to happen. Moreover, for any partial history h^t ∈ H^t one can define a probability measure µ^t : µ^t(C) = Pr{h^t ∈ C}, C ⊆ H^t. In
this environment, feasibility is defined similarly to the deterministic case. We say that
the plan π is feasible, and write π ∈ Π(x0 , z0 ) if π0 ∈ Γ(x0 , z0 ) and for each t ≥ 1 and
ht we have πt (ht ) ∈ Γ(πt−1 (ht−1 ), zt ). We will always assume that F, Γ, β and µ are such
that Π(x0 , z0 ) is nonempty for any (x0 , z0 ) ∈ X × Z, and that the objective function
U(π) = lim_{T→∞} [ F(x_0, π_0, z_0) + Σ_{t=1}^{T} β^t ∫_{H^t} F(π_{t-1}(h^{t-1}), π_t(h^t), z_t) dµ^t(h^t) ]
     = lim_{T→∞} [ F(x_0, π_0, z_0) + Σ_{t=1}^{T} β^t E_0[ F(π_{t-1}(h^{t-1}), π_t(h^t), z_t) ] ]
is well defined for any π ∈ Π(x_0, z_0) and (x_0, z_0). Similarly to the compact notation for the deterministic case, the true value function V* is defined as follows:

V*(x_0, z_0) = sup_{π ∈ Π(x_0,z_0)} U(π).    (4.3)
Theorem 21 Assume that V(x, z) is a measurable function which satisfies the Bellman equation (4.1). Moreover, assume that

lim_{t→∞} β^{t+1} E_0[V(π_t(h^t), z_{t+1})] = 0

for every possible contingent plan π ∈ Π(x_0, z_0), for all (x_0, z_0) ∈ X × Z; and that the policy correspondence

G(x, z) = { x′ ∈ Γ(x, z) : V(x, z) = F(x, x′, z) + β ∫_Z V(x′, z′) dQ(z, z′) }    (4.4)

is nonempty and permits a measurable selection. Then V = V* and all plans generated by G are optimal.
Proof. The idea of the proof follows very closely the lines of Theorems 13 and 14. A plan that solves the Bellman equation and that does not have any left-over value at infinity is optimal. Of course, we must impose a few additional technical conditions required by measure theory.¹³ For details the reader can see Chapter 9 of SLP. Q.E.D.
In order to be able to recover Theorem 12, we need to make an assumption on the endogenous object V*:

Theorem 22 Let F be bounded and measurable. Assume that the value function V*(x_0, z_0) defined in (4.3) is measurable and that the correspondence analogous to (4.4) admits a measurable selection. Then V*(x_0, z_0) satisfies the functional equation (4.1) for all (x_0, z_0), and any optimal plan π* (which solves (4.3)) also solves

V*(π*_{t-1}(h^{t-1}), z_t) = F(π*_{t-1}(h^{t-1}), π*_t(h^t), z_t) + β ∫_Z V*(π*_t(h^t), z_{t+1}) dQ(z_t, z_{t+1}).
Proof. The idea of the proof is similar to that of Theorem 12. For the several details, however, the reader is referred to Theorem 9.4 in SLP. Q.E.D.
Let us finally state the analogues of Theorems 15 and 16 for the stochastic environment, allowing for continuous shocks.

Theorem 23 Under the same conditions on F, Γ and β as in Theorem 15, if in addition Q has the Feller property, then the operator

(T V)(x, z) = sup_{x′ ∈ Γ(x,z)} F(x, x′, z) + β ∫_Z V(x′, z′) dQ(z, z′)

has a unique fixed point V in the space of continuous and bounded functions.

Proof. Once we have noted that the Feller property of Q guarantees that if W is a bounded and continuous function then ∫_Z W(x′, z′) dQ(z, z′) is also bounded and continuous as a function of (x′, z), we can apply basically line by line the proof of Theorem 15. Q.E.D.
The corresponding extension of Theorem 16 guarantees, under the concavity assumptions of the deterministic case, that the fixed point V is concave.

Proof. Again, the proof is similar to the deterministic case. Once we have noted that the linearity of the integral preserves concavity (since ∫_Z dQ(z, z′) = 1), we can basically apply line by line the proof of Theorem 16. Q.E.D.
It is important to notice that whenever the conditions of Theorem 23 are met, the boundedness of V and an application of the Maximum Theorem imply that the conditions of Theorem 21 are also satisfied, hence V = V*, which is a continuous (hence measurable) function. In this case the Bellman equation fully characterizes the optimization problem even with uncountably many possible levels of the shock.
Exercise 44 Let u(c) = ln c and f(z, k) = z k^α, 0 < α < 1 (so δ = 1). I tell you that the optimal policy function takes the form k_{t+1} = αβ z_t k_t^α for any t and z_t. (i) Use this fact to calculate an expression for the optimal policy π*_t(h^t) [recall that h^t = (z_0, ..., z_t)] and the value function V*(k_0, z_0) for any initial values (k_0, z_0), and verify that V* solves the following Bellman equation:

V(k, z) = max_{0 ≤ k′ ≤ z k^α} ln(z k^α − k′) + β E[V(k′, z′)].
This model can be extended in many directions. With persistent shocks and an elastic labor supply, it has been used in the Real Business Cycles literature to study the effects of technological shocks on aggregate variables such as consumption and employment. This line of research started in the 1980s, and for many macroeconomists it is still the building block for any study of the aggregate real economy. RBC models will be the next topic of these notes. Moreover, since most interesting economic problems do not have closed forms, you must first learn how to use numerical methods to approximate V and perform simulations.
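As a small first step in that direction, here is a minimal simulation sketch using the closed-form policy of Exercise 44; the i.i.d. lognormal shock process and the parameter values are assumptions made only for illustration.

```python
# Simulate k_{t+1} = alpha * beta * z_t * k_t^alpha under i.i.d. lognormal shocks.
import numpy as np

alpha, beta = 0.36, 0.95
T = 10_000
rng = np.random.default_rng(1)
z = np.exp(rng.normal(0.0, 0.1, size=T))   # i.i.d. shocks with E[ln z] = 0 (assumed)

k = np.empty(T + 1)
k[0] = 0.1
for t in range(T):
    k[t + 1] = alpha * beta * z[t] * k[t] ** alpha

# Long-run average of ln k should be close to (ln(alpha*beta) + E[ln z]) / (1 - alpha).
print(np.mean(np.log(k[1000:])))
print(np.log(alpha * beta) / (1 - alpha))
```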
Bibliography
[2] Bertsekas, D. P., and S. E. Shreve (1978) Stochastic Control: The Discrete Time
Case. New York: Academic Press.
[4] Stokey, N., R. Lucas and E. Prescott (1989), Recursive Methods in Economic Dynamics, Harvard University Press.