
Lecture Notes

Quantum Information Theory

Renato Renner
with contributions by Matthias Christandl

February 6, 2013
Contents

1 Introduction 5

2 Probability Theory 6
2.1 What is probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Definition of probability spaces and random variables . . . . . . . . . . . . . 7
2.2.1 Probability space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Notation for events . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.4 Conditioning on events . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Probability theory with discrete random variables . . . . . . . . . . . . . . . 9
2.3.1 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Marginals and conditional distributions . . . . . . . . . . . . . . . . 9
2.3.3 Special distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.4 Independence and Markov chains . . . . . . . . . . . . . . . . . . . . 10
2.3.5 Functions of random variables, expectation values, and Jensen’s in-
equality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.6 Trace distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.7 I.i.d. distributions and the law of large numbers . . . . . . . . . . . 12
2.3.8 Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Information Theory 15
3.1 Quantifying information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Approaches to define information and entropy . . . . . . . . . . . . . 15
3.1.2 Entropy of events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.3 Entropy of random variables . . . . . . . . . . . . . . . . . . . . . . 17
3.1.4 Conditional entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.5 Mutual information . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.6 Smooth min- and max- entropies . . . . . . . . . . . . . . . . . . . . 20
3.1.7 Shannon entropy as a special case of min- and max-entropy . . . . . 20
3.2 An example application: channel coding . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Definition of the problem . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 The general channel coding theorem . . . . . . . . . . . . . . . . . . 21
3.2.3 Channel coding for i.i.d. channels . . . . . . . . . . . . . . . . . . . . 24
3.2.4 The converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 Quantum States and Operations 26
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 Hilbert spaces and operators on them . . . . . . . . . . . . . . . . . 26
4.1.2 The bra-ket notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.3 Tensor products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.4 Trace and partial trace . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.5 Decompositions of operators and vectors . . . . . . . . . . . . . . . . 29
4.1.6 Operator norms and the Hilbert-Schmidt inner product . . . . . . . 31
4.1.7 The vector space of Hermitian operators . . . . . . . . . . . . . . . . 32
4.2 Postulates of quantum mechanics . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Quantum states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.1 Density operators — definition and properties . . . . . . . . . . . . . 34
4.3.2 Quantum-mechanical postulates in the language of density operators 34
4.3.3 Partial trace and purification . . . . . . . . . . . . . . . . . . . . . . 35
4.3.4 Mixtures of states . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.5 Hybrid classical-quantum states . . . . . . . . . . . . . . . . . . . . . 37
4.3.6 Distance between states . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Evolution and measurements . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.1 Completely positive maps (CPMs) . . . . . . . . . . . . . . . . . . . 44
4.4.2 The Choi-Jamiolkowski isomorphism . . . . . . . . . . . . . . . . . . 45
4.4.3 Stinespring dilation . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.4 Operator-sum representation . . . . . . . . . . . . . . . . . . . . . . 48
4.4.5 Measurements as CPMs . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.6 Positive operator valued measures (POVMs) . . . . . . . . . . . . . 50
4.4.7 The diamond norm of CPMs . . . . . . . . . . . . . . . . . . . . . . 51
4.4.8 Example: Why to enlarge the Hilbert space . . . . . . . . . . . . . . 54

5 The Completeness of Quantum Theory 56


5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Preliminary definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 The informal theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4 Proof sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6 Basic Protocols 65
6.1 Teleportation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2 Superdense coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.3 Entanglement conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.3.1 Majorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7 Entropy of Quantum States 72


7.1 Motivation and definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2 Properties of the von Neumann entropy . . . . . . . . . . . . . . . . . . . . 73
7.3 The conditional entropy and its properties . . . . . . . . . . . . . . . . . . . 77
7.4 The mutual information and its properties . . . . . . . . . . . . . . . . . . . 81

7.5 Conditional min-entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

8 Resources Inequalities 85
8.1 Resources and inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2 Monotones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.3 Teleportation is optimal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.4 Superdense coding is optimal . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.5 Entanglement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.6 Cloning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

9 Quantum Key Distribution 92


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2 Classical message encryption . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.3 Quantum cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9.4 QKD protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.4.1 BB84 protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.4.2 Security proof of BB84 . . . . . . . . . . . . . . . . . . . . . . . . . . 97

1 Introduction
The very process of doing physics is to acquire information about the world around us.
At the same time, the storage and processing of information is necessarily a physical
process. It is thus not surprising that physics and the theory of information are inherently
connected.1 Quantum information theory is an interdisciplinary research area whose goal
is to explore this connection.
As the name indicates, the information carriers in quantum information theory are
quantum-mechanical systems (e.g., the spin of a single electron). This is in contrast to
classical information theory where information is assumed to be represented by systems
that are accurately characterized by the laws of classical mechanics and electrodynamics
(e.g., a classical computer, or simply a piece of paper). Because any such classical system
can in principle be described in the language of quantum mechanics, classical information
theory is a (practically significant) special case of quantum information theory.
The course starts with a quick introduction to classical probability and information the-
ory. Many of the relevant concepts, e.g., the notion of entropy as a measure of uncertainty,
can already be defined in the purely classical case. I thus consider this classical part as a
good preparation as well as a source of intuition for the more general quantum-mechanical
treatment.
We will then move on to the quantum setting, where we will spend a considerable
amount of time to introduce a convenient framework for representing and manipulating
quantum states and quantum operations. This framework will be the prerequisite for
formulating and studying typical information-theoretic problems such as information stor-
age and transmission (with possibly noisy devices). Furthermore, we will learn in what
sense information represented by quantum systems is different from information that is
represented classically. Finally, we will have a look at applications such as quantum key
distribution.
I would like to emphasize that it is not an intention of this course to give a complete
treatment of quantum information theory. Instead, the goal is to focus on certain key
concepts and to study them in more detail. For further reading, I recommend the standard
textbook by Nielsen and Chuang [1].

1 This connection has been noticed by numerous famous scientists over the past fifty years, among them
Rolf Landauer with his claim “information is physical.”

2 Probability Theory
Information theory is largely based on probability theory. Therefore, before introducing
information-theoretic concepts, we need to recall some key notions of probability theory.
The following section is, however, not intended as an introduction to probability theory.
Rather, its main purpose is to summarize some basic facts as well as the notation we are
going to use in this course.

2.1 What is probability?


This is actually a rather philosophical question and it is not the topic of this course to an-
swer it.1 Nevertheless, it might be useful to spend some thought on how probabilities
are related to actual physical quantities.
For the purpose of this course, it might make sense to take a Bayesian point of view,
meaning that probability distributions are generally interpreted as a state of knowledge.
To illustrate the Bayesian approach, consider a game where a quizmaster hides a prize
behind one of three doors, and where the task of a candidate is to find the prize. Let X
be the number of the door (1, 2, or 3) which hides the prize. Obviously, as long as the
candidate does not get any additional information, each of the doors is equally likely to
hide the prize. Hence, the probability distribution P_X^cand that the candidate would assign
to X is uniform,

P_X^cand(1) = P_X^cand(2) = P_X^cand(3) = 1/3 .

On the other hand, the quizmaster knows where he has hidden the prize, so he would assign
a deterministic value to X. For example, if the prize is behind door 1, the probability
distribution P_X^mast the quizmaster would assign to X has the form

P_X^mast(1) = 1 and P_X^mast(2) = P_X^mast(3) = 0 .

The crucial thing to note here is that, although the distributions P_X^cand and P_X^mast are
referring to the same physical value X, they are different because they correspond to
different states of knowledge.
We could extend this example arbitrarily. For instance, the quizmaster could open one
of the doors, say 3, to reveal that the prize is not behind it. This additional information, of
course, changes the state of knowledge of the candidate, resulting in yet another probability
distribution P_X^cand' associated with X,2

P_X^cand'(1) = P_X^cand'(2) = 1/2 and P_X^cand'(3) = 0 .

1 For a nice introduction to the philosophy of probability theory, I recommend the book [2].
2 The situation becomes more intriguing if the quizmaster opens a door after the candidate has already
made a guess. The problem of determining the probability distribution that the candidate assigns to
X in this case is known as the Monty Hall problem.
When interpreting a probability distribution as a state of knowledge and, hence, as
subjective quantity, we need to carefully specify whose state of knowledge we are referring
to. This is particularly relevant for the analysis of information-theoretic settings, which
usually involve more than one party. For example, in a communication scenario, we might
have a sender who intends to transmit a message M to a receiver. Clearly, before M is
sent, the sender and the receiver have different knowledge about M and, consequently,
would assign different probability distributions to M . In the following, when describing
such settings, we will typically understand all distributions as states of knowledge of an
outside observer.

2.2 Definition of probability spaces and random variables


The concept of random variables is important in both physics and information theory.
Roughly speaking, one can think of a random variable as the state of a classical proba-
bilistic system. Hence, in classical information theory, it is natural to think of data as
being represented by random variables.
In this section, we define random variables and explain a few related concepts. For
completeness, we first give the general mathematical definition based on probability spaces.
Later, we will restrict to discrete random variables (i.e., random variables that only take
countably many values). These are easier to handle than general random variables but
still sufficient for our information-theoretic considerations.

2.2.1 Probability space


A probability space is a triple (Ω, E, P ), where (Ω, E) is a measurable space, called sample
space, and P is a probability measure. The measurable space consists of a set Ω and a
σ-algebra E of subsets of Ω, called events.
By definition, the σ-algebra E must contain at least one event, and be closed under
complements and countable unions. That is, (i) E ≠ ∅, (ii) if E is an event then so is its
complement E^c := Ω\E, and (iii) if (E_i)_{i∈N} is a family of events then ∪_{i∈N} E_i is an event.
In particular, Ω and ∅ are events, called the certain event and the impossible event.
The probability measure P on (Ω, E) is a function

P : E → R+

that assigns to each event E ∈ E a nonnegative real number P[E], called the probability
of E. It must satisfy the probability axioms P[Ω] = 1 and P[∪_{i∈N} E_i] = Σ_{i∈N} P[E_i] for
any family (E_i)_{i∈N} of pairwise disjoint events.

2.2.2 Random variables


Let (Ω, E, P ) be a probability space and let (X , F) be a measurable space. A random
variable X is a function from Ω to X which is measurable with respect to the σ-algebras

E and F. This means that the preimage of any F ∈ F is an event, i.e., X^{-1}(F) ∈ E.
The probability measure P on (Ω, E) induces a probability measure P_X on the measurable
space (X , F), which is also called the range of X,

P_X[F] := P[X^{-1}(F)]   ∀F ∈ F . (2.1)

A pair (X, Y ) of random variables can obviously be seen as a new random variable. More
precisely, if X and Y are random variables with range (X , F) and (Y, G), respectively, then
(X, Y ) is the random variable with range (X × Y, F × G) defined by3

(X, Y ) : ω ↦ (X(ω), Y (ω)) .

We will typically write PXY to denote the joint probability measure P(X,Y ) on (X × Y, F × G)
induced by (X, Y ). This convention can, of course, be extended to more than two random
variables in a straightforward way. For example, we will write PX1 ···Xn for the probability
measure induced by an n-tuple of random variables (X1 , . . . , Xn ).
In a context involving only finitely many random variables X1 , . . . , Xn , it is usually
sufficient to specify the joint probability measure PX1 ···Xn , while the underlying probability
space (Ω, E, P ) is irrelevant. In fact, as long as we are only interested in events defined in
terms of the random variables X1 , . . . , Xn (see Section 2.2.3 below), we can without loss
of generality identify the sample space (Ω, E) with the range of the tuple (X1 , . . . , Xn ) and
define the probability measure P to be equal to PX1 ···Xn .

2.2.3 Notation for events


Events are often defined in terms of random variables. For example, if the range of X
is (a subset of) the set of real numbers R then E := {ω ∈ Ω : X(ω) > x0 } is the event
that X takes a value larger than x0 . To denote such events, we will usually drop ω, i.e.,
we simply write E = {X > x0 }. If the event is given as an argument to a function, we
also omit the curly brackets. For instance, we write P [X > x0 ] instead of P [{X > x0 }] to
denote the probability of the event {X > x0 }.

2.2.4 Conditioning on events


Let (Ω, E, P ) be a probability space. Any event E′ ∈ E such that P[E′] > 0 gives rise to
a new probability measure P[·|E′] on (Ω, E) defined by

P[E|E′] := P[E ∩ E′] / P[E′]   ∀E ∈ E .

P[E|E′] is called the probability of E conditioned on E′ and can be interpreted as the
probability that the event E occurs if we already know that the event E′ has occurred.
In particular, if E and E′ are mutually independent, i.e., P[E ∩ E′] = P[E]P[E′], then
P[E|E′] = P[E].

3F × G denotes the set {F × G : F ∈ F , G ∈ G}. It is easy to see that F × G is a σ-algebra over X × Y.

Similarly, we can define P_{X|E′} as the probability measure of a random variable X con-
ditioned on E′. Analogously to (2.1), it is the probability measure induced by P[·|E′],
i.e.,

P_{X|E′}[F] := P[X^{-1}(F)|E′]   ∀F ∈ F .

2.3 Probability theory with discrete random variables


2.3.1 Discrete random variables
In the remainder of this script, if not stated otherwise, all random variables are assumed
to be discrete. This means that their range (X , F) consists of a countably infinite or even
finite set X . In addition, we will assume that the σ-algebra F is the power set of X , i.e.,
F := {F ⊆ X }.4 Furthermore, we call X the alphabet of X. The probability measure PX
is then defined for any singleton set {x}. Setting PX (x) := PX [{x}], we can interpret PX
as a probability mass function, i.e., a positive function
PX : X → R+
that satisfies the normalization condition
Σ_{x∈X} P_X(x) = 1 . (2.2)

More generally, for an event E′ with P[E′] > 0, the probability mass function of X con-
ditioned on E′ is given by P_{X|E′}(x) := P_{X|E′}[{x}], and also satisfies the normalization
condition (2.2).

2.3.2 Marginals and conditional distributions


Although the following definitions and statements apply to arbitrary n-tuples of random
variables, we will formulate them only for pairs (X, Y ) in order to keep the notation
simple. In particular, it suffices to specify a bipartite probability distribution PXY , i.e.,
a positive function on X × Y satisfying the normalization condition (2.2), where X and
Y are the alphabets of X and Y , respectively. The extension to arbitrary n-tuples is
straightforward.5
Given PXY , we call PX and PY the marginal distributions. It is easy to verify that
P_X(x) = Σ_{y∈Y} P_{XY}(x, y)   ∀x ∈ X , (2.3)

and likewise for P_Y. Furthermore, for any y ∈ Y with P_Y(y) > 0, the distribution P_{X|Y=y}
of X conditioned on the event Y = y obeys

P_{X|Y=y}(x) = P_{XY}(x, y) / P_Y(y)   ∀x ∈ X . (2.4)
4 It is easy to see that the power set of X is indeed a σ-algebra over X .
5 Note that X and Y can themselves be tuples of random variables.

2.3.3 Special distributions
Certain distributions are important enough to be given a name. We call PX flat if all
non-zero probabilities are equal, i.e.,
P_X(x) ∈ {0, q}   ∀x ∈ X

for some q ∈ [0, 1]. Because of the normalization condition (2.2), we have q = 1/|supp P_X|,
where supp P_X := {x ∈ X : P_X(x) > 0} is the support of the function P_X. Furthermore,
if P_X is flat and has no zero probabilities, i.e.,

P_X(x) = 1/|X |   ∀x ∈ X ,

we call it uniform.

2.3.4 Independence and Markov chains


Two discrete random variables X and Y are said to be mutually independent if the events
{X = x} and {Y = y} are mutually independent for any (x, y) ∈ X × Y. Their joint
probability mass function then satisfies PXY = PX × PY .6
Related to this is the notion of Markov chains. A sequence of random variables X1 , X2 , . . .
is said to have the Markov property, denoted X1 ↔ X2 ↔ · · · ↔ Xn , if for all i ∈
{1, . . . , n − 1}
P_{X_{i+1}|X_1=x_1,...,X_i=x_i} = P_{X_{i+1}|X_i=x_i}   ∀x_1, . . . , x_i .
This expresses the fact that, given any fixed value of Xi , the random variable Xi+1 is
completely independent of all previous random variables X1 , . . . , Xi−1 . In particular,
Xi+1 can be computed given only Xi .

2.3.5 Functions of random variables, expectation values, and Jensen’s inequality
Let X be a random variable with alphabet X and let f be a function from X to Y. We
denote by f (X) the random variable defined by the concatenation f ◦ X. Obviously, f (X)
has alphabet Y and, in the discrete case we consider here, the corresponding probability
mass function Pf (X) is given by
P_{f(X)}(y) = Σ_{x∈f^{-1}({y})} P_X(x) .

For a random variable X whose alphabet X is a module over the reals R (i.e., there is a
notion of addition and multiplication with reals), we define the expectation value of X by
⟨X⟩_{P_X} := Σ_{x∈X} P_X(x) x .
6 P_X × P_Y denotes the function (x, y) ↦ P_X(x)P_Y(y).

If the distribution PX is clear from the context, we sometimes omit the subscript.
For a convex real function f on a convex set X , the expectation values of X and f (X)
are related by Jensen’s inequality
⟨f(X)⟩ ≥ f(⟨X⟩) .
The inequality is essentially a direct consequence of the definition of convexity (see Fig. 2.1).

Figure 2.1: Jensen’s inequality for a convex function

2.3.6 Trace distance


Let P and Q be two probability mass functions7 on an alphabet X . The trace distance δ
between P and Q is defined by
δ(P, Q) = (1/2) Σ_{x∈X} |P(x) − Q(x)| .

In the literature, the trace distance is also called statistical distance, variational distance,
or Kolmogorov distance.8 It is easy to verify that δ is indeed a metric, that is, it is
symmetric, nonnegative, zero if and only if P = Q, and it satisfies the triangle inequality.
Furthermore, δ(P, Q) ≤ 1, with equality if and only if P and Q have disjoint supports.
Because P and Q satisfy the normalization condition (2.2), the trace distance can equiv-
alently be written as
δ(P, Q) = 1 − Σ_{x∈X} min[P(x), Q(x)] . (2.5)
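As a quick numerical illustration (added here as a sketch, not part of the notes), the following Python snippet evaluates the trace distance both from its definition and from the equivalent form (2.5) and checks that the two agree; the distributions P and Q are arbitrary examples.

import numpy as np

rng = np.random.default_rng(0)

def trace_distance(P, Q):
    # definition: delta(P, Q) = (1/2) * sum_x |P(x) - Q(x)|
    return 0.5 * np.abs(P - Q).sum()

def trace_distance_min_form(P, Q):
    # equivalent form (2.5): delta(P, Q) = 1 - sum_x min[P(x), Q(x)]
    return 1.0 - np.minimum(P, Q).sum()

P = rng.random(5); P /= P.sum()   # random probability mass function on a 5-letter alphabet
Q = rng.random(5); Q /= Q.sum()
print(trace_distance(P, Q), trace_distance_min_form(P, Q))   # the two values coincide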

The trace distance between the probability mass functions Q_X and Q_{X′} of two random
variables X and X′ has a simple interpretation. It can be seen as the minimum probability
that X and X′ take different values.
7 Thedefinition can easily be generalized to probability measures.
8 We use the term trace distance because, as we shall see, it is a special case of the trace distance for
density operators.

Lemma 2.3.1. Let Q_X and Q_{X′} be probability mass functions on X . Then

δ(Q_X, Q_{X′}) = min_{P_{XX′}} P_{XX′}[X ≠ X′]

where the minimum ranges over all joint probability mass functions P_{XX′} with marginals
P_X = Q_X and P_{X′} = Q_{X′}.

Proof. To prove the inequality δ(Q_X, Q_{X′}) ≤ min_{P_{XX′}} P_{XX′}[X ≠ X′], we use (2.5) and the
fact that, for any joint probability mass function P_{XX′}, min[P_X(x), P_{X′}(x)] ≥ P_{XX′}(x, x),
which gives

δ(P_X, P_{X′}) = 1 − Σ_{x∈X} min[P_X(x), P_{X′}(x)] ≤ 1 − Σ_{x∈X} P_{XX′}(x, x) = P_{XX′}[X ≠ X′] .

We thus have δ(P_X, P_{X′}) ≤ P_{XX′}[X ≠ X′], for any probability mass function P_{XX′}.
Taking the minimum over all P_{XX′} with P_X = Q_X and P_{X′} = Q_{X′} gives the desired
inequality.
The proof of the opposite inequality is given in the exercises.
An important property of the trace distance is that it can only decrease under the
operation of taking marginals.
Lemma 2.3.2. For any two probability mass functions P_{XY} and Q_{XY},

δ(P_{XY}, Q_{XY}) ≥ δ(P_X, Q_X) .

Proof. Applying the triangle inequality for the absolute value, we find

(1/2) Σ_{x,y} |P_{XY}(x, y) − Q_{XY}(x, y)| ≥ (1/2) Σ_x |Σ_y (P_{XY}(x, y) − Q_{XY}(x, y))|
                                           = (1/2) Σ_x |P_X(x) − Q_X(x)| ,

where the equality follows from (2.3). The assertion then follows from the definition of the
trace distance.

2.3.7 I.i.d. distributions and the law of large numbers


An n-tuple of random variables X1 , . . . , Xn with alphabet X is said to be independent and
identically distributed (i.i.d.) if their joint probability mass function has the form
P_{X_1···X_n} = P_X^{×n} := P_X × · · · × P_X .

The i.i.d. property thus characterizes situations where a certain process is repeated n times
independently. In the context of information theory, the i.i.d. property is often used to
describe the statistics of noise, e.g., in repeated uses of a communication channel (see
Section 3.2).

The law of large numbers characterizes the “typical behavior” of real-valued i.i.d. ran-
dom variables X1 , . . . , Xn in the limit of large n. It usually comes in two versions, called
the weak and the strong law of large numbers. As the name suggests, the latter implies
the first.
Let µ = ⟨X_i⟩ be the expectation value of X_i (which, by the i.i.d. assumption, is the
same for all X_1, . . . , X_n), and let

Z_n := (1/n) Σ_{i=1}^n X_i

be the sample mean. Then, according to the weak law of large numbers, the probability
that Z_n is ε-close to µ for any positive ε converges to one, i.e.,

lim_{n→∞} P[|Z_n − µ| < ε] = 1   ∀ε > 0 . (2.6)

The weak law of large numbers will be sufficient for our purposes. However, for com-
pleteness, we mention the strong law of large numbers which says that Zn converges to µ
with probability 1,
 
P[lim_{n→∞} Z_n = µ] = 1 .
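The following Python sketch (an illustration added here, not part of the notes) estimates the probability in (2.6) empirically for a biased coin with µ = 0.3: for each n it runs many independent sequences and reports the fraction of runs in which the sample mean Z_n is ε-close to µ. The parameters are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
p, eps, runs = 0.3, 0.05, 2000               # P_X(1) = 0.3, so mu = 0.3
for n in [10, 100, 1000, 10000]:
    samples = rng.random((runs, n)) < p      # i.i.d. Bernoulli(p) samples, one row per run
    Z_n = samples.mean(axis=1)               # sample means Z_n
    print(n, np.mean(np.abs(Z_n - p) < eps)) # empirical P[|Z_n - mu| < eps], approaches 1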

2.3.8 Channels
A channel p is a probabilistic mapping that assigns to each value of an input alphabet X
a value of the output alphabet. Formally, p is a function

p: X × Y → R+
(x, y) 7→ p(y|x)

such that p(·|x) is a probability mass function for any x ∈ X .


Given a random variable X with alphabet X , a channel p from X to Y naturally defines
a new random variable Y via the joint probability mass function PXY given by9

PXY (x, y) := PX (x)p(y|x) . (2.7)
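As a small illustration (added here as a sketch, not part of the notes), the joint distribution (2.7) can be obtained by an elementwise product; the input distribution and the channel below are arbitrary toy choices (the channel is the same one as in Example 2.3.3 below).

import numpy as np

P_X = np.array([0.5, 0.5])              # uniform input distribution on {0, 1}
p = np.array([[0.5, 0.5, 0.0],          # p(y|x=0): input 0 goes to 0 or 1 with probability 1/2
              [0.0, 0.0, 1.0]])         # p(y|x=1): input 1 is always mapped to 2
P_XY = P_X[:, None] * p                 # joint distribution P_XY(x, y) = P_X(x) p(y|x), cf. (2.7)
P_Y = P_XY.sum(axis=0)                  # output marginal, cf. (2.3)
print(P_XY)
print(P_Y)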

Note also that channels can be seen as generalizations of functions. Indeed, if f is a


function from X to Y, its description as a channel p is given by

p(y|x) = δ_{y,f(x)} .

Channels can be seen as abstractions of any (classical) physical device that takes an
input X and outputs Y . A typical example for such a device is, of course, a communication
channel, e.g., an optical fiber, where X is the input provided by a sender and where Y is
the (possibly noisy) version of X delivered to a receiver. A practically relevant question
9 It is easy to verify that PXY is indeed a probability mass function.

then is how much information one can transmit reliably over such a channel, using an
appropriate encoding.
But channels do not only carry information over space, but also over time. Typical
examples are memory devices, e.g., a hard drive or a CD (where one wants to model the
errors introduced between storage and reading out of data). Here, the question is how
much redundancy we need to introduce in the stored data in order to correct these errors.
The notion of channels is illustrated by the following two examples.

Figure 2.2: Example 1. A reliable channel

Example 2.3.3. The channel depicted in Fig. 2.2 maps the input 0 with equal probability
to either 0 or 1; the input 1 is always mapped to 2. The channel has the property that its
input is uniquely determined by its output. As we shall see later, such a channel would
allow one to reliably transmit one classical bit of information.

Figure 2.3: Example 2. An unreliable channel

Example 2.3.4. The channel shown in Fig. 2.3 maps each possible input with equal
probability to either 0 or 1. The output is thus completely independent of the input. Such
a channel is obviously not useful to transmit information.
The notion of i.i.d. random variables naturally translates to channels. A channel p^n from
X × · · · × X to Y × · · · × Y is said to be i.i.d. if it can be written as p^n = p^{×n} := p × · · · × p.

3 Information Theory
3.1 Quantifying information
The main object of interest in information theory, of course, is information and the way it
is processed. The quantification of information thus plays a central role. The aim of this
section is to introduce some notions and techniques that are needed for the quantitative
study of classical information, i.e., information that can be represented by the state of a
classical (in contrast to quantum) system.

3.1.1 Approaches to define information and entropy


Measures of information and measures of uncertainty, also called entropy measures, are
closely related. In fact, the information contained in a message X can be seen as the
amount by which our uncertainty (measured in terms of entropy) decreases when we
learn X.
There are, however, a variety of approaches to defining entropy measures. The decision
what approach to take mainly depends on the type of questions we would like to answer.
Let us thus consider a few examples.
Example 3.1.1 (Data transmission). Given a (possibly noisy) communication channel
connecting a sender and a receiver (e.g., an optical fiber), we are interested in the time it
takes to reliably transmit a certain document (e.g., the content of a textbook).
Example 3.1.2 (Data storage). Given certain data (e.g., a movie), we want to determine
the minimum space (e.g., on a hard drive) needed to store it.
The latter question is related to data compression, where the task is to find a space-
saving representation Z of given data X. In some sense, this corresponds to finding the
shortest possible description of X. An elegant way to make this more precise is to view
the description of X as an algorithm that generates X. Applied to the problem of data
storage, this would mean that, instead of storing data X directly, one would store an (as
small as possible) algorithm Z which can reproduce X.
The definition of algorithmic entropy, also known as Kolmogorov complexity, is exactly
based on this idea. The algorithmic entropy of X is defined as the minimum length of an
algorithm that generates X. For example, a bitstring X = 00 · · · 0 consisting of n ≫ 1
zeros has small algorithmic entropy because it can be generated by a short program (the
program that simply outputs a sequence of zeros). The same is true if X consists of the
first n digits of π, because there is a short algorithm that computes the circular constant
π. In contrast, if X is a sequence of n bits chosen at random, its algorithmic entropy
will, with high probability, be roughly equal to n. This is because the shortest program

generating the exact sequence of bits X is, most likely, simply the program that has the
whole sequence already stored.1
Despite the elegance of its definition, the algorithmic entropy has a fundamental disad-
vantage when being used as a measure for uncertainty: it is not computable. This means
that there cannot exist a method (e.g., a computer program) that estimates the algorith-
mic complexity of a given string X. This deficiency as well as its implications2 render the
algorithmic complexity unsuitable as a measure of entropy for most practical applications.
In this course, we will consider a different approach which is based on ideas developed in
thermodynamics. The approach has been proposed in 1948 by Shannon [3] and, since then,
has proved highly successful, with numerous applications in various scientific disciplines
(including, of course, physics). It can also be seen as the theoretical foundation of modern
information and communication technology. Today, Shannon’s theory is viewed as the
standard approach to information theory.
In contrast to the algorithmic approach described above, where the entropy is defined
as a function of the actual data X, the information measures used in Shannon’s theory
depend on the probability distribution of the data. More precisely, the entropy of a value
X is a measure for the likelihood that a particular value occurs. Applied to the above
compression problem, this means that one needs to assign a probability mass function to
the data to be compressed. The method used for compression might then be optimized
for the particular probability mass function assigned to the data.

3.1.2 Entropy of events


We take an axiomatic approach to motivate the definition of the Shannon entropy and
related quantities. In a first step, we will think of the entropy as a property of events E.
More precisely, given a probability space (Ω, E, P ), we consider a function H that assigns
to each event E a real value H(E),

H: E → R ∪ {∞}
E 7→ H(E) .

For the following, we assume that the events are defined on a probability space with
probability measure P . The function H should then satisfy the following properties.
1. Independence of the representation: H(E) only depends on the probability P [E] of
the event E.
2. Continuity: H is continuous in the probability measure P (relative to the topology
induced by the trace distance).
3. Additivity: H(E ∩ E 0 ) = H(E) + H(E 0 ) for two independent events E and E 0 .
4. Normalization: H(E) = 1 for E with P[E] = 1/2 .
1 Infact, a (deterministic) computer can only generate pseudo-random numbers, i.e., numbers that cannot
be distinguished (using any efficient method) from true random numbers.
2 An immediate implication is that there cannot exist a compression method that takes as input data X

and outputs a short algorithm that generates X.

The axioms appear natural if we think of H as a measure of uncertainty. Indeed,
Axiom 3 reflects the idea that our total uncertainty about two independent events is
simply the sum of the uncertainty about the individual events. We also note that the
normalization imposed by Axiom 4 can be chosen arbitrarily; the convention, however, is
to assign entropy 1 to the event corresponding to the outcome of a fair coin flip.
The axioms uniquely define the function H.
Lemma 3.1.3. The function H satisfies the above axioms if and only if it has the form

H: E 7−→ − log2 P [E] .

Proof. It is straightforward that H as defined in the lemma satisfies all the axioms. It
thus remains to show that the definition is unique. For this, we make the ansatz

H(E) = f (− log2 P [E])

where f is an arbitrary function from R+ ∪ {∞} to R ∪ {∞}. We note that, apart from
taking into account the first axiom, this is no restriction of generality, because any possible
function of P [E] can be written in this form.
From the continuity axiom, it follows that f must be continuous. Furthermore, inserting
the additivity axiom for events E and E′ with probabilities p and p′, respectively, gives

f(− log2 p) + f(− log2 p′) = f(− log2 pp′) .

Setting a := − log2 p and a′ := − log2 p′, this can be rewritten as

f(a) + f(a′) = f(a + a′) .

Together with the continuity axiom, we conclude that f is linear, i.e., f (x) = γx for some
γ ∈ R. The normalization axiom then implies that γ = 1.

3.1.3 Entropy of random variables


We are now ready to define entropy measures for random variables. Analogously to the
entropy of an event E, which only depends on the probability P [E] of the event, the
entropy of a random variable X only depends on the probability mass function PX .
We start with the most standard measure in classical information theory, the Shannon
entropy, in the following denoted by H. Let X be a random variable with alphabet X and
let h(x) be the entropy of the event Ex := {X = x}, for any x ∈ X , that is,

h(x) := H(Ex ) = − log2 PX (x) . (3.1)

Then the Shannon entropy is defined as the expectation value of h(x), i.e.,
H(X) := ⟨h(X)⟩ = − Σ_{x∈X} P_X(x) log2 P_X(x) .

If the probability measure P is unclear from the context, we will include it in the notation
as a subscript, i.e., we write H(X)P .

Similarly, the min-entropy, denoted Hmin , is defined as the minimum entropy H(Ex ) of
the events Ex , i.e.,

H_min(X) := min_{x∈X} h(x) = − log2 max_{x∈X} P_X(x) .

A slightly different entropy measure is the max-entropy, denoted Hmax . Despite the
similarity of its name to the above measure, the definition does not rely on the entropy of
events, but rather on the cardinality of the support suppPX := {x ∈ X : PX (x) > 0} of
PX ,

H_max(X) := log2 |supp P_X| .

It is easy to verify that the entropies defined above are related by

Hmin (X) ≤ H(X) ≤ Hmax (X) , (3.2)

with equality if the probability mass function PX is flat. Furthermore, they have various
properties in common. The following holds for H, Hmin , and Hmax ; to keep the notation
simple, however, we only write H.

1. H is invariant under permutations of the elements, i.e., H(X) = H(π(X)), for any
permutation π.

2. H is nonnegative.3
3. H is upper bounded by the logarithm of the alphabet size, i.e., H(X) ≤ log2 |X |.
4. H equals zero if and only if exactly one of the entries of PX equals one, i.e., if
|suppPX | = 1.
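The following Python sketch (an added illustration, not part of the notes) evaluates the three entropy measures for an arbitrary example distribution and confirms relation (3.2), including the case of equality for a flat distribution.

import numpy as np

def shannon_entropy(P):
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

def min_entropy(P):
    return -np.log2(P.max())

def max_entropy(P):
    return np.log2(np.count_nonzero(P))      # log2 of the support size

P_X = np.array([0.5, 0.25, 0.125, 0.125, 0.0])
print(min_entropy(P_X), shannon_entropy(P_X), max_entropy(P_X))            # 1.0 <= 1.75 <= 2.0

P_flat = np.array([0.25, 0.25, 0.25, 0.25])                                # flat distribution
print(min_entropy(P_flat), shannon_entropy(P_flat), max_entropy(P_flat))   # all equal 2.0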

3.1.4 Conditional entropy


In information theory, one typically wants to quantify the uncertainty about some data X,
given that one already has information Y . To capture such situations, we need to generalize
the entropy measures introduced in Section 3.1.3.
Let X and Y be random variables with alphabet X and Y, respectively, and define,
analogously to (3.1),

h(x|y) := − log2 PX|Y =y (x) , (3.3)

for any x ∈ X and y ∈ Y. Then the Shannon entropy of X conditioned on Y is again
defined as an expectation value,

H(X|Y ) := ⟨h(X|Y )⟩ = − Σ_{x∈X, y∈Y} P_{XY}(x, y) log2 P_{X|Y=y}(x) .

3 Note that this will no longer be true for the conditional entropy of quantum states.

For the definition of the min-entropy of X given Y , the expectation value is replaced
by a minimum, i.e.,
H_min(X|Y ) := min_{x∈X, y∈Y} h(x|y) = − log2 max_{x∈X, y∈Y} P_{X|Y=y}(x) .

Finally, the max-entropy of X given Y is defined by

H_max(X|Y ) := max_{y∈Y} log2 |supp P_{X|Y=y}| .

The conditional entropies H, Hmin , and Hmax satisfy the rules listed in Section 3.1.3.
Furthermore, the entropies can only decrease when conditioning on an additional random
variable Z, i.e.,
H(X|Y ) ≥ H(X|Y Z) . (3.4)
This relation is also known as strong subadditivity and we will prove it in the more general
quantum case.
Finally, it is straightforward to verify that the Shannon entropy H satisfies the chain
rule
H(X|Y Z) = H(XY |Z) − H(Y |Z) .
In particular, if we omit the random variable Z, we get
H(X|Y ) = H(XY ) − H(Y )
that is, the uncertainty of X given Y can be seen as the uncertainty about the pair (X, Y )
minus the uncertainty about Y . We note here that a slightly modified version of the chain
rule also holds for Hmin and Hmax , but we will not go further into this.
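As a quick numerical check (added here; the joint distribution is an arbitrary example), the chain rule H(X|Y ) = H(XY ) − H(Y ) can be verified directly from a joint probability table:

import numpy as np

def H(P):                                   # Shannon entropy of a pmf given as an array
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

P_XY = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.20, 0.10]])             # rows: values of X, columns: values of Y
P_Y = P_XY.sum(axis=0)

# H(X|Y) evaluated directly from the definition in Section 3.1.4
H_X_given_Y = -np.sum(P_XY * np.log2(P_XY / P_Y))
print(np.isclose(H_X_given_Y, H(P_XY.flatten()) - H(P_Y)))   # True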

3.1.5 Mutual information


Let X and Y be two random variables. The (Shannon) mutual information between X
and Y , denoted I(X : Y ), is defined as the amount by which the Shannon entropy of X
decreases when one learns Y ,
I(X : Y ) := H(X) − H(X|Y ) .
More generally, given an additional random variable Z, the (Shannon) mutual information
between X and Y conditioned on Z, I(X : Y |Z), is defined by
I(X : Y |Z) := H(X|Z) − H(X|Y Z) .
It is easy to see that the mutual information is symmetric under exchange of X and Y ,
i.e.,
I(X : Y |Z) = I(Y : X|Z) .
Furthermore, because of the strong subadditivity (3.4), the mutual information cannot be
negative, and I(X : Y ) = 0 holds if and only if X and Y are mutually independent. More
generally, I(X : Y |Z) = 0 if and only if X ↔ Z ↔ Y is a Markov chain.

3.1.6 Smooth min- and max- entropies
The dependency of the min- and max-entropy of a random variable on the underlying
probability mass functions is discontinuous. To see this, consider a random variable X
with alphabet {1, . . . , 2^ℓ} and probability mass function P_X^ε given by

P_X^ε(1) = 1 − ε
P_X^ε(x) = ε / (2^ℓ − 1)   if x > 1 ,

where ε ∈ [0, 1]. It is easy to see that, for ε = 0,

H_max(X)_{P_X^0} = 0

whereas, for any ε > 0,

H_max(X)_{P_X^ε} = ℓ .

Note also that the trace distance between the two distributions satisfies δ(P_X^0, P_X^ε) = ε.
That is, an arbitrarily small change in the distribution can change the entropy Hmax (X)
by an arbitrary amount. In contrast, a small change of the underlying probability mass
function is often irrelevant in applications. This motivates the following definition of
smooth min- and max-entropies, which extends the above definition.
Let X and Y be random variables with joint probability mass function PXY , and let
ε ≥ 0. The ε-smooth min-entropy of X conditioned on Y is defined as
H_min^ε(X|Y ) := max_{Q_{XY} ∈ B^ε(P_{XY})} H_min(X|Y )_{Q_{XY}}

where the maximum ranges over the ε-ball B^ε(P_{XY}) of probability mass functions Q_{XY}
satisfying δ(P_{XY}, Q_{XY}) ≤ ε. Similarly, the ε-smooth max-entropy of X conditioned on Y
is defined as

H_max^ε(X|Y ) := min_{Q_{XY} ∈ B^ε(P_{XY})} H_max(X|Y )_{Q_{XY}} .

Note that the original definitions of Hmin and Hmax can be seen as the special case
where ε = 0.

3.1.7 Shannon entropy as a special case of min- and max-entropy


We have already seen that the Shannon entropy always lies between the min- and the
max-entropy (see (3.2)). In the special case of n-tuples of i.i.d. random variables, the gap
between H_min^ε and H_max^ε approaches zero with increasing n, which means that all entropies
become identical. This is expressed by the following lemma.
Lemma 3.1.4. For any n ∈ N, let (X_1, Y_1), . . . , (X_n, Y_n) be a sequence of i.i.d. pairs of
random variables, i.e., P_{X_1Y_1···X_nY_n} = P_{XY}^{×n}. Then

H(X|Y )_{P_{XY}} = lim_{ε→0} lim_{n→∞} (1/n) H_min^ε(X_1 · · · X_n |Y_1 · · · Y_n)
H(X|Y )_{P_{XY}} = lim_{ε→0} lim_{n→∞} (1/n) H_max^ε(X_1 · · · X_n |Y_1 · · · Y_n) .

Proof. The lemma is a consequence of the law of large numbers (see Section 2.3.7), applied
to the random variables Zi := h(Xi |Yi ), for h(x|y) defined by (3.3). More details are given
in the exercises.

3.2 An example application: channel coding


3.2.1 Definition of the problem
Consider the following scenario. A sender, traditionally called Alice, wants to send a
message M to a receiver, Bob. They are connected by a communication channel p that
takes inputs X from Alice and outputs Y on Bob’s side (see Section 2.3.8). The channel
might be noisy, which means that Y can differ from X. The challenge is to find an
appropriate encoding scheme that allows Bob to retrieve the correct message M , except
with a small error probability ε. As we shall see, ε can always be made arbitrarily small
(at the cost of the amount of information that can be transmitted), but it is generally
impossible to reach ε = 0, i.e., Bob cannot retrieve M with absolute certainty.
To describe the encoding and decoding process, we assume without loss of generality4
that the message M is represented as an ℓ-bit string, i.e., M takes values from the set
{0, 1}^ℓ. Alice then applies an encoding function enc_ℓ : {0, 1}^ℓ → X that maps M to a
channel input X. On the other end of the line, Bob applies a decoding function dec_ℓ :
Y → {0, 1}^ℓ to the channel output Y in order to retrieve M′.

M −−enc_ℓ−→ X −−p−→ Y −−dec_ℓ−→ M′ . (3.5)

The transmission is successful if M = M′. More generally, for any fixed encoding and
decoding procedures enc_ℓ and dec_ℓ, and for any message m ∈ {0, 1}^ℓ, we can define

p_err^{enc_ℓ,dec_ℓ}(m) := P[dec_ℓ ◦ p ◦ enc_ℓ(M) ≠ M | M = m]

as the probability that the decoded message M′ := dec_ℓ ◦ p ◦ enc_ℓ(M) generated by the
process (3.5) does not coincide with M.5
In the following, we analyze the maximum number of message bits ℓ that can be trans-
mitted in one use of the channel p if we tolerate a maximum error probability ε,

ℓ^ε(p) := max{ℓ ∈ N : ∃ enc_ℓ, dec_ℓ : max_m p_err^{enc_ℓ,dec_ℓ}(m) ≤ ε} .

3.2.2 The general channel coding theorem


The channel coding theorem provides a lower bound on the quantity ℓ^ε(p). It is easy to
see from the formula below that reducing the maximum tolerated error probability by
a factor of 2 comes at the cost of reducing the number of bits that can be transmitted
reliably by 1. It can also be shown that the bound is almost tight (up to terms log2(1/ε)).
4 Note that all our statements will be independent of the actual representation of M . The only quantity
that matters is the alphabet size of M , i.e., the total number of possible values.
5 Here we interpret a channel as a probabilistic function from the input to the output alphabets.

Theorem 3.2.1. For any channel p and any ε ≥ 0,

ℓ^ε(p) ≥ max_{P_X} [H_min(X) − H_max(X|Y )] − log2(1/ε) − 3 ,

where the entropies on the right hand side are evaluated for the random variables X and
Y jointly distributed according to P_{XY} = P_X p.6
The proof idea is illustrated in Fig. 3.1.

Figure 3.1: The figure illustrates the proof idea of the channel coding theorem. The range
of the encoding function enc_ℓ is called the code and its elements are the codewords.

Proof. The argument is based on a randomized construction of the encoding function. Let
PX be the distribution that maximizes the right hand side of the claim of the theorem
and let ℓ be

ℓ = ⌊H_min(X) − H_max(X|Y ) − log2(2/ε)⌋ . (3.6)

In a first step, we consider an encoding function enc_ℓ chosen at random by assigning to
each m ∈ {0, 1}^ℓ a value enc_ℓ(m) := X where X is chosen according to P_X. We then show
that for a decoding function dec_ℓ that maps y ∈ Y to an arbitrary value m′ ∈ {0, 1}^ℓ that
is compatible with y, i.e., enc_ℓ(m′) ∈ supp P_{X|Y=y}, the error probability for a message M
chosen uniformly at random satisfies

p_err^{enc_ℓ,dec_ℓ}(M) = P[dec_ℓ ◦ p ◦ enc_ℓ(M) ≠ M] ≤ ε/2 . (3.7)

In a second step, we use this bound to show that there exist enc′_{ℓ−1} and dec′_{ℓ−1} such that

p_err^{enc′_{ℓ−1},dec′_{ℓ−1}}(m) ≤ ε   ∀m ∈ {0, 1}^{ℓ−1} . (3.8)

6 See also (2.7).

We then have

ℓ^ε(p) ≥ ℓ − 1
       = ⌊H_min(X) − H_max(X|Y ) − log2(2/ε)⌋ − 1
       ≥ H_min(X) − H_max(X|Y ) − log2(1/ε) − 3 .

To prove (3.7), let enc_ℓ and M be chosen at random as described, let Y := p ◦ enc_ℓ(M)
be the channel output, and let M′ := dec_ℓ(Y ) be the decoded message. We then consider
any pair (m, y) such that P_{MY}(m, y) > 0. It is easy to see that, conditioned on the
event that (M, Y ) = (m, y), the decoding function dec_ℓ described above can only fail, i.e.,
produce an M′ ≠ M, if there exists m′ ≠ m such that enc_ℓ(m′) ∈ supp P_{X|Y=y}. Hence,
the probability that the decoding fails is bounded by

P[M ≠ M′ | M = m, Y = y] ≤ P[∃ m′ ≠ m : enc_ℓ(m′) ∈ supp P_{X|Y=y}] . (3.9)

Furthermore, by the union bound, we have

P[∃ m′ ≠ m : enc_ℓ(m′) ∈ supp P_{X|Y=y}] ≤ Σ_{m′≠m} P[enc_ℓ(m′) ∈ supp P_{X|Y=y}] .

Because, by construction, enc_ℓ(m′) is a value chosen at random according to the distri-
bution P_X, the probability in the sum on the right hand side of the inequality is given
by

P[enc_ℓ(m′) ∈ supp P_{X|Y=y}] = Σ_{x∈supp P_{X|Y=y}} P_X(x)
                              ≤ |supp P_{X|Y=y}| max_x P_X(x)
                              ≤ 2^{−(H_min(X)−H_max(X|Y))} ,

where the last inequality follows from the definitions of H_min and H_max. Combining this
with the above and observing that there are only 2^ℓ − 1 values m′ ≠ m, we find

P[M ≠ M′ | M = m, Y = y] ≤ 2^{ℓ−(H_min(X)−H_max(X|Y))} ≤ ε/2 .

Because this holds for any m and y, we have

P[M ≠ M′] ≤ max_{m,y} P[M ≠ M′ | M = m, Y = y] ≤ ε/2 .
This immediately implies that (3.7) holds on average over all choices of enc_ℓ. But this
also implies that there exists at least one specific choice for enc_ℓ such that (3.7) holds.
It remains to show inequality (3.8). For this, we divide the set of messages {0, 1}^ℓ into
two equally large sets M and M̄ such that p_err^{enc_ℓ,dec_ℓ}(m) ≤ p_err^{enc_ℓ,dec_ℓ}(m̄) for any m ∈ M
and m̄ ∈ M̄. We then have

max_{m∈M} p_err^{enc_ℓ,dec_ℓ}(m) ≤ min_{m̄∈M̄} p_err^{enc_ℓ,dec_ℓ}(m̄) ≤ 2^{−(ℓ−1)} Σ_{m̄∈M̄} p_err^{enc_ℓ,dec_ℓ}(m̄) .

Using (3.7), we conclude

max_{m∈M} p_err^{enc_ℓ,dec_ℓ}(m) ≤ 2 Σ_{m∈{0,1}^ℓ} 2^{−ℓ} p_err^{enc_ℓ,dec_ℓ}(m) = 2 p_err^{enc_ℓ,dec_ℓ}(M) ≤ ε .

Inequality (3.8) then follows by defining enc′_{ℓ−1} as the encoding function enc_ℓ restricted
to M, and adapting the decoding function accordingly.

3.2.3 Channel coding for i.i.d. channels


Realistic communication channels (e.g., an optical fiber) can usually be used repeatedly.
Moreover, such channels often are accurately described by an i.i.d. noise model. In this
case, the transmission of n subsequent signals over the physical channel corresponds to
a single use of a channel of the form p^{×n} = p × · · · × p. The amount of
information that can be transmitted from a sender to a receiver using the physical channel
n times is thus given by Theorem 3.2.1 applied to p^{×n}.
In applications, the number n of channel uses is typically large. It is thus convenient to
measure the capacity of a channel in terms of the asymptotic rate
rate(p) = lim_{ε→0} lim_{n→∞} (1/n) ℓ^ε(p^{×n}) . (3.10)
The computation of the rate will rely on the following corollary, which follows from
Theorem 3.2.1 and the definition of smooth entropies.
Corollary 3.2.2. For any channel p and any ε, ε′, ε″ ≥ 0,

ℓ^{ε+ε′+ε″}(p) ≥ max_{P_X} [H_min^{ε′}(X) − H_max^{ε″}(X|Y )] − log2(1/ε) − 3

where the entropies on the right hand side are evaluated for P_{XY} := P_X p.
Combining this with Lemma 3.1.4, we get the following lower bound for the rate of a
channel.
Theorem 3.2.3. For any channel p,

rate(p) ≥ max_{P_X} [H(X) − H(X|Y )] = max_{P_X} I(X : Y ) ,

where the entropies on the right hand side are evaluated for P_{XY} := P_X p.
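As an illustration (added here as a sketch), the right hand side of Theorem 3.2.3 can be evaluated numerically for a binary symmetric channel with flip probability f, an assumed example channel; the helper function below uses the symmetry I(X : Y ) = H(Y ) − H(Y |X) from Section 3.1.5 and maximizes over input distributions on a grid.

import numpy as np

def mutual_information(P_X, p):
    # I(X:Y) = H(Y) - H(Y|X) for a channel matrix p[x, y] = p(y|x)
    H = lambda P: -np.sum(P[P > 0] * np.log2(P[P > 0]))
    P_Y = (P_X[:, None] * p).sum(axis=0)
    H_Y_given_X = np.sum(P_X * np.array([H(p[x]) for x in range(len(P_X))]))
    return H(P_Y) - H_Y_given_X

f = 0.11                                      # flip probability of the binary symmetric channel
p = np.array([[1 - f, f],
              [f, 1 - f]])
rates = [mutual_information(np.array([q, 1 - q]), p) for q in np.linspace(0.01, 0.99, 99)]
print(max(rates))   # approximately 1 - h(f), about 0.50 bits per channel use, attained at q = 1/2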

3.2.4 The converse


We conclude our treatment of channel coding with a proof sketch which shows that the
bound given in Theorem 3.2.3 is tight. The main ingredient to the proof is the information
processing inequality

I(U : W ) ≤ I(U : V )

which holds for any random variables such that U ↔ V ↔ W is a Markov chain. The
inequality is proved by

I(U : W ) ≤ I(U : W ) + I(U : V |W ) = I(U : V W ) = I(U : V ) + I(U : W |V ) = I(U : V ) ,

where the first inequality holds because the mutual information cannot be negative and
the last equality follows because I(U : W |V ) = 0 (see end of Section 3.1.5). The remaining
equalities are essentially rewritings of the chain rule (for the Shannon entropy).
Let now M, X, Y , and M′ be defined as in (3.5). If the decoding is successful then
M = M′, which implies

H(M ) = I(M : M′) . (3.11)

Applying the information processing inequality first to the Markov chain M ↔ Y ↔ M′
and then to the Markov chain M ↔ X ↔ Y gives

I(M : M′) ≤ I(M : Y ) ≤ I(X : Y ) .

Combining this with (3.11) and assuming that the message M is uniformly distributed
over the set {0, 1}^ℓ of bitstrings of length ℓ gives

ℓ = H(M ) ≤ max_{P_X} I(X : Y ) .

It is straightforward to verify that the statement still holds approximately if ℓ on the left
hand side is replaced by ℓ^ε, for some small decoding error ε > 0. Taking the limits as
in (3.10) finally gives

rate(p) ≤ max_{P_X} I(X : Y ) .

4 Quantum States and Operations
The mathematical formalism used in quantum information theory to describe quantum
mechanical systems is in many ways more general than that of typical introductory books
on quantum mechanics. This is why we devote a whole chapter to it. The main concepts
to be treated in the following are density operators, which represent the state of a system,
as well as positive operator valued measures (POVMs) and completely positive maps (CPMs), which
describe measurements and, more generally, the evolution of a system.

4.1 Preliminaries
4.1.1 Hilbert spaces and operators on them
An inner product space is a vector space (over R or C) equipped with an inner product
(·, ·). A Hilbert space H is an inner product space such that the metric defined by the
norm ‖α‖ ≡ √(α, α) is complete, i.e., every Cauchy sequence is convergent. We will often
deal with finite-dimensional spaces, where the completeness condition always holds, i.e.,
inner product spaces are equivalent to Hilbert spaces.
We denote the set of homomorphisms (i.e., the linear maps) from a Hilbert space H to a
Hilbert space H′ by Hom(H, H′). Furthermore, End(H) is the set of endomorphisms (i.e.,
the homomorphisms from a space to itself) on H, that is, End(H) = Hom(H, H). The
identity operator α 7→ α that maps any vector α ∈ H to itself is denoted by id.
The adjoint of a homomorphism S ∈ Hom(H, H′), denoted S*, is the unique operator
in Hom(H′, H) such that
(α′, Sα) = (S*α′, α) ,
for any α ∈ H and α′ ∈ H′. In particular, we have (S*)* = S. If S is represented as a
matrix, then the adjoint operation can be thought of as the conjugate transpose.
In the following, we list some properties of endomorphisms S ∈ End(H).
• S is normal if SS ∗ = S ∗ S.
• S is unitary if SS ∗ = S ∗ S = id. Unitary operators S are always normal.
• S is Hermitian if S ∗ = S. Hermitian operators are always normal.
• S is positive if (α, Sα) ≥ 0 for all α ∈ H. Positive operators are always Hermitian.
We will sometimes write S ≥ 0 to express that S is positive.
• S is a projector if SS = S. Projectors are always positive.
Given an orthonormal basis {ei }i of H, we also say that S is diagonal with respect to {ei }i
if the matrix (Si,j ) defined by the elements Si,j = (ei , Sej ) is diagonal.

4.1.2 The bra-ket notation
In this script, we will make extensive use of a variant of Dirac’s bra-ket notation, where
vectors are interpreted as operators. More precisely, we identify any vector α ∈ H with
a homomorphism |α⟩ ∈ Hom(C, H), called ket, and defined as

|α⟩ : γ ↦ αγ

for any γ ∈ C. The adjoint |α⟩* of this mapping is called bra and denoted by ⟨α|. It is
easy to see that ⟨α| is an element of the dual space H* := Hom(H, C), namely the linear
functional defined by

⟨α| : β ↦ (α, β)

for any β ∈ H.
Using this notation, the concatenation ⟨α||β⟩ of a bra ⟨α| ∈ Hom(H, C) with a ket
|β⟩ ∈ Hom(C, H) results in an element of Hom(C, C), which can be identified with C. It
follows immediately from the above definitions that, for any α, β ∈ H,

⟨α||β⟩ ≡ (α, β) .

We will thus in the following denote the scalar product by ⟨α|β⟩.


Conversely, the concatenation |β⟩⟨α| is an element of End(H) (or, more generally, of
Hom(H, H′) if α ∈ H and β ∈ H′ are defined on different spaces). In fact, any endo-
morphism S ∈ End(H) can be written as a linear combination of such concatenations,
i.e.,

S = Σ_i |β_i⟩⟨α_i|

for some families of vectors {α_i}_i and {β_i}_i. For example, the identity id ∈ End(H) can
be written as

id = Σ_i |e_i⟩⟨e_i|

for any basis {e_i} of H.
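A short numerical illustration of this calculus (added here, with arbitrary example vectors): |β⟩⟨α| acts as an operator sending a ket to ⟨α|·⟩ |β⟩, and summing |e_i⟩⟨e_i| over any orthonormal basis yields the identity.

import numpy as np

rng = np.random.default_rng(2)
d = 3
alpha = np.array([1.0, 1.0j, 0.0]) / np.sqrt(2)      # a normalized ket |alpha>
beta = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)        # a normalized ket |beta>

ket_bra = np.outer(beta, alpha.conj())               # the operator |beta><alpha|
print(np.vdot(alpha, beta))                          # the scalar <alpha|beta>
print(np.allclose(ket_bra @ alpha, beta))            # (|beta><alpha|)|alpha> = <alpha|alpha>|beta> = |beta>

# resolution of the identity, id = sum_i |e_i><e_i|, in a random orthonormal basis
basis, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
identity = sum(np.outer(basis[:, i], basis[:, i].conj()) for i in range(d))
print(np.allclose(identity, np.eye(d)))              # True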

4.1.3 Tensor products


Given two Hilbert spaces HA and HB , the tensor product HA ⊗ HB is defined as the
Hilbert space spanned by elements of the form |α⟩ ⊗ |β⟩, where α ∈ H_A and β ∈ H_B, such
that the following equivalences hold
• (α + α′) ⊗ β = α ⊗ β + α′ ⊗ β
• α ⊗ (β + β′) = α ⊗ β + α ⊗ β′
• 0 ⊗ β = α ⊗ 0 = 0

for any α, α′ ∈ H_A and β, β′ ∈ H_B, where 0 denotes the zero vector. Furthermore, the
inner product of H_A ⊗ H_B is defined by the linear extension (and completion) of

⟨α ⊗ β|α′ ⊗ β′⟩ = ⟨α|α′⟩⟨β|β′⟩ .


For two homomorphisms S ∈ Hom(H_A, H_A′) and T ∈ Hom(H_B, H_B′), the tensor product
S ⊗ T is defined as

(S ⊗ T )(α ⊗ β) ≡ (Sα) ⊗ (T β) (4.1)

for any α ∈ HA and β ∈ HB . The space spanned by the products S ⊗ T can be canonically
identified1 with the tensor product of the spaces of the homomorphisms, i.e.,
Hom(H_A, H_A′) ⊗ Hom(H_B, H_B′) ≅ Hom(H_A ⊗ H_B, H_A′ ⊗ H_B′) . (4.2)

This identification allows us to write, for instance,

|α⟩ ⊗ |β⟩ = |α ⊗ β⟩ ,

for any α ∈ H_A and β ∈ H_B.

4.1.4 Trace and partial trace


The trace of an endomorphism S ∈ End(H) over a Hilbert space H is defined by2
tr(S) ≡ Σ_i ⟨e_i|S|e_i⟩

where {ei }i is any orthonormal basis of H. The trace is well defined because the above
expression is independent of the choice of the basis, as one can easily verify.
The trace operation tr is obviously linear, i.e.,

tr(uS + vT ) = utr(S) + vtr(T ) ,

for any S, T ∈ End(H) and u, v ∈ C. It also commutes with the operation of taking the
adjoint,3

tr(S ∗ ) = tr(S)∗ .

Furthermore, the trace is cyclic, i.e.,

tr(ST ) = tr(T S) .

1 That is, the mapping defined by (4.1) is an isomorphism between these two vector spaces.
2 More precisely, the trace is only defined for trace class operators over a separable Hilbert space. However,
all endomorphisms on a finite-dimensional Hilbert space are trace class operators.
3 The adjoint of a complex number γ ∈ C is simply its complex conjugate.

Also, it is easy to verify4 that the trace tr(S) of a positive operator S ≥ 0 is positive.
More generally

(S ≥ 0) ∧ (T ≥ 0) =⇒ tr(ST ) ≥ 0 . (4.3)

The partial trace 5 trB is a mapping from the endomorphisms End(HA ⊗ HB ) on a


product space HA ⊗ HB onto the endomorphisms End(HA ) on HA . It is defined by the
linear extension of the mapping.6

trB : S ⊗ T 7→ tr(T )S ,

for any S ∈ End(HA ) and T ∈ End(HB ).


Similarly to the trace operation, the partial trace trB is linear and commutes with
the operation of taking the adjoint. Furthermore, it commutes with the left and right
multiplication with an operator of the form TA ⊗ idB where TA ∈ End(HA ).7 That is, for
any operator SAB ∈ End(End(HA ⊗ HB )),

trB SAB (TA ⊗ idB ) = trB (SAB )TA (4.4)

and

trB (TA ⊗ idB )SAB = TA trB (SAB ) . (4.5)

We will also make use of the property that the trace on a bipartite system can be
decomposed into partial traces on the individual subsystems, i.e.,

tr(SAB ) = tr(trB (SAB )) (4.6)

or, more generally, for an operator SABC ∈ End(HA ⊗ HB ⊗ HC ),

trAB (SABC ) = trA (trB (SABC )) .
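The partial trace is straightforward to realize numerically. The following sketch (numpy assumed; not part of the notes) computes trB for the standard Kronecker ordering of indices and checks the defining property trB (S ⊗ T ) = tr(T ) S:

    import numpy as np

    def partial_trace_B(S_AB, dim_A, dim_B):
        # reshape into indices (a, b, a', b') and sum over the H_B indices
        S = S_AB.reshape(dim_A, dim_B, dim_A, dim_B)
        return np.trace(S, axis1=1, axis2=3)

    S = np.random.rand(2, 2) + 1j * np.random.rand(2, 2)
    T = np.random.rand(3, 3) + 1j * np.random.rand(3, 3)
    # defining property: tr_B(S ⊗ T) = tr(T) S
    assert np.allclose(partial_trace_B(np.kron(S, T), 2, 3), np.trace(T) * S)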

4.1.5 Decompositions of operators and vectors


Spectral decomposition. Let S ∈ End(H) be normal and let {ei }i be an orthonormal
basis of H. Then there exists a unitary U ∈ End(H) and an operator D ∈ End(H) which
is diagonal with respect to {ei }i such that

S = U DU ∗ .

4 The assertion can, for instance, be proved using the spectral decomposition of S and T (see below for
a review of the spectral decomposition).
5 Here and in the following, we will use subscripts to indicate the space on which an operator acts.
6 Alternatively, the partial trace trB can be defined as a product mapping I ⊗ tr where I is the identity operation on End(HA ) and tr is the trace mapping elements of End(HB ) to End(C). Because the trace is a completely positive map (see definition below) the same is true for the partial trace.
7 More generally, the partial trace commutes with any mapping that acts like the identity on End(HB ).

The spectral decomposition implies that, for any normal S ∈ End(H), there exists a
basis {ei }i of H with respect to which S is diagonal. That is, S can be written as
S = Σi αi |ei ihei | (4.7)

where αi are the eigenvalues of S.


Expression (4.7) can be used to give a meaning to a complex function f : C → C applied
to a normal operator S. We define f (S) by
f (S) ≡ Σi f (αi ) |ei ihei | .
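A small numerical sketch (numpy assumed, Hermitian S; not part of the notes) of how a function is applied to a normal operator via its spectral decomposition:

    import numpy as np

    def apply_function(S, f):
        # f(S) = sum_i f(alpha_i) |e_i><e_i| for Hermitian S
        alphas, E = np.linalg.eigh(S)          # columns of E are the eigenvectors e_i
        return (E * f(alphas)) @ E.conj().T    # E diag(f(alpha)) E^*

    S = np.array([[2.0, 1.0], [1.0, 2.0]])
    # the square root defined this way squares back to S
    assert np.allclose(apply_function(S, np.sqrt) @ apply_function(S, np.sqrt), S)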

Polar decomposition. Let S ∈ End(H). Then there exists a unitary U ∈ End(H) such that

S = √(SS ∗ ) U

and

S = U √(S ∗ S) .

Singular decomposition. Let S ∈ End(H) and let {ei }i be an orthonormal basis of H.


Then there exist unitaries U, V ∈ End(H) and an operator D ∈ End(H) which is diagonal
with respect to {ei }i such that

S = V DU .

In particular, for any S ∈ Hom(H, H0 ), there exist bases {ei }i of H and {e0i }i of H0 such
that the matrix defined by the elements (e0i , Sej ) is diagonal.

Schmidt decomposition. The Schmidt decomposition can be seen as a version of the


singular decomposition for vectors. The statement is that any vector Ψ ∈ HA ⊗ HB can
be written in the form
Ψ = Σi γi ei ⊗ e0i

where ei ∈ HA and e0i ∈ HB are eigenvectors of the operators ρA := trB (|ΨihΨ|) and
ρB := trA (|ΨihΨ|), respectively, and where γi2 are the corresponding eigenvalues. In
particular, the existence of the Schmidt decomposition implies that ρA and ρB have the
same nonzero eigenvalues.
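Numerically, the Schmidt decomposition of a bipartite vector is obtained from the singular value decomposition of its coefficient matrix. The following sketch (numpy assumed; not part of the notes) illustrates this for the state (4.11):

    import numpy as np

    def schmidt(Psi, dim_A, dim_B):
        # Schmidt coefficients gamma_i and the vectors e_i (columns of U), e'_i (rows of Vh)
        C = Psi.reshape(dim_A, dim_B)          # coefficient matrix in a product basis
        U, gammas, Vh = np.linalg.svd(C)
        # Psi = sum_i gammas[i] * np.kron(U[:, i], Vh[i, :])
        return gammas, U, Vh

    Psi = np.array([1, 0, 0, 1]) / np.sqrt(2)  # the state (4.11) on two qubits
    gammas, U, Vh = schmidt(Psi, 2, 2)
    rebuilt = sum(g * np.kron(U[:, i], Vh[i, :]) for i, g in enumerate(gammas))
    assert np.allclose(rebuilt, Psi) and np.allclose(gammas**2, [0.5, 0.5])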

4.1.6 Operator norms and the Hilbert-Schmidt inner product
The Hilbert-Schmidt inner product between two operators S, T ∈ End(H) is defined by

(S, T ) := tr(S ∗ T ) .
The induced norm kSk2 := √((S, S)) is called Hilbert-Schmidt norm. If S is normal with spectral decomposition S = Σi αi |ei ihei | then

kSk2 = √( Σi |αi |2 ) .

An important property of the Hilbert-Schmidt inner product (S, T ) is that it is positive


whenever S and T are positive.
Lemma 4.1.1. Let S, T ∈ End(H). If S ≥ 0 and T ≥ 0 then

tr(ST ) ≥ 0 .
Proof. If S is positive we have (√S)2 = S and (√T )2 = T . Hence, using the cyclicity of the trace, we have

tr(ST ) = tr(V ∗ V )

where V = √S √T . Because the trace of a positive operator is positive, it suffices to show
that V ∗ V ≥ 0. This, however, follows from the fact that, for any φ ∈ H,

hφ|V ∗ V |φi = kV φk2 ≥ 0 .

The trace norm of S is defined by

kSk1 := tr|S|

where

|S| := √(S ∗ S) .

If S is normal with spectral decomposition S = Σi αi |ei ihei | then

kSk1 = Σi |αi | .

The following lemma provides a useful characterization of the trace norm.


Lemma 4.1.2. For any S ∈ End(H),

kSk1 = max_U |tr(U S)|

where U ranges over all unitaries on H.

Proof. We need to show that, for any unitary U ,

|tr(U S)| ≤ tr|S| (4.8)

with equality for some appropriately chosen U .


Let S = V |S| be the polar decomposition of S. Then, using the Cauchy-Schwarz
inequality

|tr(Q∗ R)| ≤ kQk2 kRk2


with Q := √|S| V ∗ U ∗ and R := √|S| we find

|tr(U S)| = |tr(U V |S|)| = |tr(U V √|S| √|S|)| ≤ √( tr(U V |S|V ∗ U ∗ ) tr(|S|) ) = tr(|S|) ,

which proves (4.8). Finally, it is easy to see that equality holds for U := V ∗ .

4.1.7 The vector space of Hermitian operators


The set of Hermitian operators on a Hilbert space H, in the following denoted Herm(H),
forms a real vector space. Furthermore, equipped with the Hilbert Schmidt inner product
defined in the previous section, Herm(H) is an inner product space.
If {ei }i is an orthonormal basis of H then the set of operators Ei,j defined by
 1
√1 |ej ihei |
 √2 |ei ihej | +
 2
if i < j
Ei,j := √i2 |ei ihej | − √i |ej ihei |
2
if i > j

|ei ihei | otherwise

forms an orthonormal basis of Herm(H). We conclude from this that

dim Herm(H) = (dim H)2 . (4.9)

For two Hilbert spaces HA and HB , we have in analogy to (4.2)

Herm(HA ) ⊗ Herm(HB ) ∼= Herm(HA ⊗ HB ) . (4.10)

To see this, consider the canonical mapping from Herm(HA )⊗Herm(HB ) to Herm(HA ⊗ HB )
defined by (4.1). It is easy to verify that this mapping is injective. Furthermore, because
by (4.9) the dimension of both spaces equals dim(HA )2 dim(HB )2 , it is a bijection, which
proves (4.10).

4.2 Postulates of quantum mechanics


Despite more than one century of research, numerous questions related to the foundations
of quantum mechanics are still unsolved (and highly disputed). For example, no fully
satisfying explanation for the fact that quantum mechanics has its particular mathematical
structure has been found so far. As a consequence, some of the aspects to be discussed

in the following, e.g., the postulates of quantum mechanics, might appear to lack a clear
motivation.
In this section, we pursue one of the standard approaches to quantum mechanics. It
is based on a number of postulates about the states of physical systems as well as their
evolution. (For more details, we refer to Section 2 of [1], where an equivalent approach is
described.) The postulates are as follows:

1. States: The set of states of an isolated physical system is in one-to-one correspon-


dence to the projective space of a Hilbert space H. In particular, any physical state
can be represented by a normalized vector φ ∈ H which is unique up to a phase
factor. In the following, we will call H the state space of the system.8
2. Composition: For two physical systems with state spaces HA and HB , the state space
of the product system is isomorphic to HA ⊗ HB . Furthermore, if the individual
systems are in states φ ∈ HA and φ0 ∈ HB , then the joint state is

Ψ = φ ⊗ φ0 ∈ HA ⊗ HB .

3. Evolutions: For any possible evolution of an isolated physical system with state
space H and for any fixed time interval [t0 , t1 ] there exists a unitary U describing
the mapping of states φ ∈ H at time t0 to states

φ0 = U φ

at time t1 . The unitary U is unique up to a phase factor.


4. Measurements: For any measurement on a physical system with state space H there
exists an observable O with the following properties. O is a Hermitian operator on
H such that each eigenvalue x of O corresponds to a possible measurement outcome.
If the system is in state φ ∈ H, then the probability of observing outcome x when
applying the measurement is given by

PX (x) = tr(Px |φihφ|)

where Px denotes the projector onto the eigenspace belonging to the eigenvalue x, i.e., O = Σx x Px . Finally, the state φ0x of the system after the measurement, conditioned on the event that the outcome is x, equals

φ0x := (1 / √(PX (x))) Px φ .

4.3 Quantum states


In quantum information theory, one often considers situations where the state or the
evolution of a system is only partially known. For example, we might be interested in
8 In quantum mechanics, the elements φ ∈ H are also called wave functions.

a scenario where a system might be in two possible states φ0 or φ1 , chosen according to
a certain probability distribution. Another simple example is a system consisting of two
correlated parts A and B in a state
Ψ = (1/√2) ( e0 ⊗ e0 + e1 ⊗ e1 ) ∈ HA ⊗ HB , (4.11)
where {e0 , e1 } are orthonormal vectors in HA = HB . From the point of view of an
observer that has no access to system B, the state of A does not correspond to a fixed
vector φ ∈ HA , but is rather described by a mixture of such states. In this section,
we introduce the density operator formalism, which allows for a simple and convenient
characterization of such situations.

4.3.1 Density operators — definition and properties


The notion of density operators has been introduced independently by von Neumann and
Landau in 1927. Since then, it has been widely used in quantum statistical mechanics
and, more recently, in quantum information theory.
Definition 4.3.1. A density operator ρ on a Hilbert space H is a normalized positive
operator on H, i.e., ρ ≥ 0 and tr(ρ) = 1. The set of density operators on H is denoted
by S(H). A density operator is said to be pure if it has the form ρ = |φihφ|. If H is d-dimensional and ρ has the form ρ = (1/d) · id then it is called fully mixed.
It follows from the spectral decomposition theorem that any density operator can be
written in the form
ρ = Σx PX (x)|ex ihex |

where PX is the probability mass function defined by the eigenvalues PX (x) of ρ and
{ex }x are the corresponding eigenvectors. Given this representation, it is easy to see that
a density operator is pure if and only if exactly one of the eigenvalues equals 1 whereas
the others are 0. In particular, we have the following lemma.
Lemma 4.3.2. A density operator ρ is pure if and only if tr(ρ2 ) = 1.
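The defining properties of a density operator and the purity criterion of Lemma 4.3.2 are easy to check numerically. A minimal sketch (numpy assumed; not part of the notes):

    import numpy as np

    def is_density_operator(rho, tol=1e-9):
        herm = np.allclose(rho, rho.conj().T, atol=tol)
        pos = np.all(np.linalg.eigvalsh(rho) > -tol)
        normalized = abs(np.trace(rho) - 1) < tol
        return herm and pos and normalized

    def is_pure(rho, tol=1e-9):
        return abs(np.trace(rho @ rho) - 1) < tol      # Lemma 4.3.2

    phi = np.array([1, 1j]) / np.sqrt(2)
    pure = np.outer(phi, phi.conj())                   # |phi><phi|
    mixed = np.eye(2) / 2                              # fully mixed qubit state
    assert is_density_operator(pure) and is_pure(pure) and not is_pure(mixed)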

4.3.2 Quantum-mechanical postulates in the language of density


operators
In a first step, we adapt the postulates of Section 4.2 to the notion of density operators.
At the same time, we generalize them to situations where the evolution and measurements
only act on parts of a composite system.

1. States: The states of a physical system are represented as density operators on a


state space H. For an isolated system whose state, represented as a vector, is φ ∈ H,
the corresponding density operator is defined by ρ := |φihφ|.9
9 Note that this density operator is pure.

2. Composition: The states of a composite system with state spaces HA and HB are
represented as density operators on HA ⊗ HB . Furthermore, if the states of the
individual subsystems are independent of each other and represented by density
operators ρA and ρB , respectively, then the state of the joint system is ρA ⊗ ρB .
3. Evolution: Any isolated evolution of a subsystem of a composite system over a
fixed time interval [t0 , t1 ] corresponds to a unitary on the state space H of the
subsystem. For a composite system with state space HA ⊗HB and isolated evolutions
on both subsystems described by UA and UB , respectively, any state ρAB at time t0
is transformed into the state10

ρ0AB = (UA ⊗ UB )(ρAB )(UA∗ ⊗ UB∗ ) (4.12)

at time t1 .11

4. Measurement: Any isolated measurement on a subsystem of a composite system is


specified by a Hermitian operator, called observable. When applying a measurement OA = Σx x Px on the first subsystem of a composite system HA ⊗ HB whose state
is ρAB , the probability of observing outcome x is

PX (x) = tr(Px ⊗ idB ρAB ) (4.13)

and the post-measurement state conditioned on this outcome is


ρ0AB,x = (1/PX (x)) (Px ⊗ idB )ρAB (Px ⊗ idB ) . (4.14)

It is straightforward to verify that these postulates are indeed compatible with those of
Section 4.2. What is new is merely the fact that the evolution and measurements can be
restricted to individual subsystems of a composite system. As we shall see, this extension
is, however, very powerful because it allows us to examine parts of a subsystem without
the need of keeping track of the state of the entire system.

4.3.3 Partial trace and purification


Let HA ⊗HB be a composite quantum system which is initially in a state ρAB = |ΨihΨ| for
some Ψ ∈ HA ⊗HB . Consider now an experiment which is restricted to the first subsystem.
More precisely, assume that subsystem A undergoes an isolated evolution, described by
a unitary UA , followed by an isolated measurement, described by an observable OA = Σx x Px .
According to the above postulates, the probability of observing an outcome x is then
given by

PX (x) = tr (Px ⊗ idB )(UA ⊗ UB )ρAB (UA∗ ⊗ UB∗ )




10 In particular, if HB = C is trivial, this expression equals ρ0A = UA ρA UA∗ .
11 By induction, this postulate can be readily generalized to composite systems with more than two parts.

where UB is an arbitrary isolated evolution on HB . Using rules (4.6) and (4.4), this can
be transformed into
PX (x) = tr( Px UA trB (ρAB ) UA∗ ) ,


which is independent of UB . Observe now that this expression could be obtained equiv-
alently by simply applying the above postulates to the reduced state ρA := trB (ρAB ). In
other words, the reduced state already fully characterizes all observable properties of the
subsystem HA .
This principle, which is sometimes called locality, plays a crucial role in many information-
theoretic considerations. For example, it implies that it is impossible to influence system
HA by local actions on system HB . In particular, communication between the two subsys-
tems is impossible as long as their evolution is determined by local operations UA ⊗ UB .
In this context, it is important to note that the reduced state ρA of a pure joint state
ρAB is not necessarily pure. For instance, if the joint system is in state ρAB = |ΨihΨ| for
Ψ defined by (4.11) then
ρA = (1/2)|e0 ihe0 | + (1/2)|e1 ihe1 | , (4.15)
i.e., the density operator ρA is fully mixed. In the next section, we will give an interpre-
tation of non-pure, or mixed, density operators.
Conversely, any mixed density operator can be seen as part of a pure state on a larger
system. More precisely, given ρA on HA , there exists a pure density operator ρAB on a
joint system HA ⊗ HB (where the dimension of HB is at least as large as the rank of ρA )
such that
ρA = trB (ρAB ) (4.16)
A pure density operator ρAB for which (4.16) holds is called a purification of ρA .
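Both directions can be illustrated numerically. The following sketch (numpy assumed; not part of the notes) verifies that the reduced state of the state (4.11) is fully mixed and constructs a purification of a given mixed state from its eigendecomposition:

    import numpy as np

    Psi = np.array([1, 0, 0, 1]) / np.sqrt(2)            # the state (4.11)
    rho_AB = np.outer(Psi, Psi.conj())
    rho_A = rho_AB.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)   # tr_B
    assert np.allclose(rho_A, np.eye(2) / 2)             # Eq. (4.15): fully mixed

    def purify(rho_A):
        # build Psi = sum_i sqrt(p_i) e_i ⊗ f_i from the eigendecomposition of rho_A
        p, E = np.linalg.eigh(rho_A)
        d = len(p)
        return sum(np.sqrt(max(pi, 0)) * np.kron(E[:, i], np.eye(d)[:, i])
                   for i, pi in enumerate(p))

    Psi2 = purify(rho_A)
    rho_AB2 = np.outer(Psi2, Psi2.conj())
    assert np.allclose(rho_AB2.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3), rho_A)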

4.3.4 Mixtures of states


Consider a quantum system HA whose state depends on a classical value Z and let ρzA ∈
S(HA ) be the state of the system conditioned on the event Z = z. Furthermore, consider
an observer who does not have access to Z, that is, from his point of view, Z can take
different values distributed according to a probability mass function PZ .
Assume now that the system HA undergoes an evolution UA followed by a measurement OA = Σx x Px as above. Then, according to the postulates of quantum mechanics, the
probability mass function of the measurement outcomes x conditioned on the event Z = z
is given by
PX|Z=z (x) = tr(Px UA ρzA UA∗ ) .
Hence, from the point of view of the observer who is unaware of the value Z, the probability
mass function of X is given by
PX (x) = Σz PZ (z) PX|Z=z (x) .

By linearity, this can be rewritten as
PX (x) = tr(Px UA ρA UA∗ ) . (4.17)
where
ρA := Σz PZ (z) ρzA .

Alternatively, expression (4.17) can be obtained by applying the postulates of Sec-


tion 4.3.2 directly to the density operator ρA defined above. In other words, from the
point of view of an observer not knowing Z, the situation is consistently characterized by
ρA .
We thus arrive at a new interpretation of mixed density operators. For example, the
density operator
ρA = (1/2)|e0 ihe0 | + (1/2)|e1 ihe1 | (4.18)
defined by (4.15) corresponds to a situation where either state e0 or e1 is prepared, each with probability 1/2. The decomposition according to (4.18) is, however, not unique. In fact, the same state could be written as

ρA = (1/2)|ẽ0 ihẽ0 | + (1/2)|ẽ1 ihẽ1 |

where ẽ0 := (1/√2)(e0 + e1 ) and ẽ1 := (1/√2)(e0 − e1 ). That is, the system could equivalently be interpreted as being prepared either in state ẽ0 or ẽ1 , each with probability 1/2.
It is important to note, however, that any predictions one can possibly make about
observations restricted to system HA are fully determined by the density operator ρA ,
and, hence do not depend on the choice of the interpretation. That is, whether we see
the system HA as a part of a larger system HA ⊗ HB which is in a pure state (as in
Section 4.3.3) or as a mixture of pure states (as proposed in this section) is irrelevant as
long as we are only interested in observable quantities derived from system HA .

4.3.5 Hybrid classical-quantum states


We will often encounter situations where parts of a system are quantum mechanical
whereas others are classical. A typical example is the scenario described in Section 4.3.4,
where the state of a quantum system HA depends on the value of a classical random
variable Z.
Since a classical system can be seen as a special type of a quantum system, such sit-
uations can be described consistently using the density operator formalism introduced
above. More precisely, the idea is to represent the states of classical values Z by mutually
orthogonal vectors on a Hilbert space. For example, the density operator describing the
scenario of Section 4.3.4 would read
ρAZ = Σz PZ (z) ρzA ⊗ |ez ihez | ,

where {ez }z is a family of orthonormal vectors on HZ .
More generally, we use the following definition of classicality.

Definition 4.3.3. Let HA and HZ be Hilbert spaces and let {ez }z be a fixed orthonormal
basis of HZ . Then a density operator ρAZ ∈ S(HA ⊗ HZ ) is said to be classical on HZ
(with respect to {ez }z ) if12

ρAZ ∈ S(HA ) ⊗ span{|ez ihez |}z

4.3.6 Distance between states


Given two quantum states ρ and σ, we might ask how well we can distinguish them from
each other. The answer to this question is given by the trace distance, which can be seen
as a generalization of the corresponding distance measure for classical probability mass
functions as defined in Section 2.3.6.
Definition 4.3.4. The trace distance between two density operators ρ and σ on a Hilbert
space H is defined by
δ(ρ, σ) := (1/2) kρ − σk1 .
It is straightforward to verify that the trace distance is a metric on the space of density
operators. Furthermore, it is unitarily invariant, i.e., δ(U ρU ∗ , U σU ∗ ) = δ(ρ, σ), for any
unitary U .
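As a small numerical illustration (numpy assumed; not part of the notes), the trace distance can be evaluated from the eigenvalues of the Hermitian operator ρ − σ:

    import numpy as np

    def trace_distance(rho, sigma):
        return 0.5 * np.sum(np.abs(np.linalg.eigvalsh(rho - sigma)))

    rho = np.diag([1.0, 0.0])                  # |e_0><e_0|
    sigma = np.eye(2) / 2                      # fully mixed qubit state
    assert np.isclose(trace_distance(rho, sigma), 0.5)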
The above definition of trace distance between density operators is consistent with the corresponding classical definition of Section 2.3.6. In particular, for two classical states ρ = Σz P (z)|ez ihez | and σ = Σz Q(z)|ez ihez | defined by probability mass functions P
and Q, we have

δ(ρ, σ) = δ(P, Q) .

More generally, the following lemma implies that for any (not necessarily classical) ρ
and σ there is always a measurement O that “conserves” the trace distance.
Lemma 4.3.5. Let ρ, σ ∈ S(H). Then

δ(ρ, σ) = max_O δ(P, Q)

where the maximum ranges over all observables O ∈ Herm(H) and where P and Q are the
probability mass functions of the outcomes when applying the measurement described by O
to ρ and σ, respectively.

12 Ifthe classical system HZ itself has a tensor product structure (e.g., HZ = HZ 0 ⊗ HZ 00 ) we typically
assume that the basis used for defining classical states has the same product structure (i.e., the basis
vectors are of the form e = e0 ⊗ e00 with e0 ∈ HZ 0 and e00 ∈ HZ 00 ).

Proof. Define ∆ := ρ − σ and let ∆ = Σi αi |ei ihei | be a spectral decomposition. Furthermore, let R and S be positive operators defined by

R = Σ_{i: αi ≥0} αi |ei ihei |
S = − Σ_{i: αi <0} αi |ei ihei | ,

that is,

∆=R−S (4.19)
|∆| = R + S . (4.20)
Finally, let O = Σx x Px be a spectral decomposition of O, where each Px is a projector onto the eigenspace corresponding to the eigenvalue x. Then

δ(P, Q) = (1/2) Σx |P (x) − Q(x)| = (1/2) Σx |tr(Px ρ) − tr(Px σ)| = (1/2) Σx |tr(Px ∆)| . (4.21)

Now, using (4.19) and (4.20),



|tr(Px ∆)| = |tr(Px R) − tr(Px S)| ≤ tr(Px R) + tr(Px S) = tr(Px |∆|) , (4.22)

where the inequality holds because, by (4.3), tr(Px R) and tr(Px S) are both nonnegative. Inserting this into (4.21) and using Σx Px = id gives

δ(P, Q) ≤ (1/2) Σx tr(Px |∆|) = (1/2) tr(|∆|) = (1/2) k∆k1 = δ(ρ, σ) .

This proves that the maximum maxO δ(P, Q) on the right hand side of the assertion of
the lemma cannot be larger than δ(ρ, σ). To see that equality holds, it suffices to verify
that the inequality in (4.22) becomes an equality if for any x the projector Px either lies in
the support of R or in the support of S. Such a choice of the projectors is always possible
because R and S have mutually orthogonal support.
An implication of Lemma 4.3.5 is that the trace distance between two states ρ and σ can
be interpreted as the maximum distinguishing probability, i.e., the maximum probability
by which a difference between ρ and σ can be detected (see Lemma 2.3.1). Another
consequence of Lemma 4.3.5 is that the trace distance cannot increase under the partial
trace, as stated by the following lemma.
Lemma 4.3.6. Let ρAB and σAB be bipartite density operators and let ρA := trB (ρAB )
and σA := trB (σAB ) be the reduced states on the first subsystem. Then

δ(ρA , σA ) ≤ δ(ρAB , σAB ) .

Proof. Let P and Q be the probability mass functions of the outcomes when applying a
measurement OA to ρA and σA , respectively. Then, for an appropriately chosen OA , we
have according to Lemma 4.3.5

δ(ρA , σA ) = δ(P, Q) . (4.23)

Consider now the observable OAB on the joint system defined by OAB := OA ⊗ idB .
It follows from property (4.4) of the partial trace that, when applying the measurement
described by OAB to the joint states ρAB and σAB , we get the same probability mass
functions P and Q. Now, using again Lemma 4.3.5,

δ(ρAB , σAB ) ≥ δ(P, Q) . (4.24)

The assertion follows by combining (4.23) and (4.24).


The significance of the trace distance comes mainly from the fact that it is a bound
on the probability that a difference between two states can be seen. However, in certain
situations, it is more convenient to work with an alternative notion of distance, called
fidelity.
Definition 4.3.7. The fidelity between two density operators ρ and σ on a Hilbert space
H is defined by
F (ρ, σ) := kρ1/2 σ 1/2 k1

where kSk1 := tr(√(S ∗ S)) .
To abbreviate notation, for two vectors φ, ψ ∈ H, we sometimes write F (φ, ψ) instead of
F (|φihφ|, |ψihψ|), and, similarly, δ(φ, ψ) instead of δ(|φihφ|, |ψihψ|). Note that the fidelity
is always between 0 and 1, and that F (ρ, ρ) = 1.
The fidelity is particularly easy to compute if one of the operators, say σ, is pure. In
fact, if σ = |ψihψ|, we have
F (ρ, |ψihψ|) = kρ1/2 σ 1/2 k1 = tr(√(σ 1/2 ρ σ 1/2 )) = tr(√(|ψihψ|ρ|ψihψ|)) = √(hψ|ρ|ψi) .

In particular, if ρ = |φihφ|, we find

F (φ, ψ) = |hφ|ψi| . (4.25)

The fidelity between pure states thus simply corresponds to the (absolute value of the)
scalar product between the states.
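Numerically, the fidelity can be computed from the singular values of ρ1/2 σ 1/2. A small sketch (numpy assumed; the square root is taken via an eigendecomposition, which is an implementation choice, not part of the notes):

    import numpy as np

    def psd_sqrt(M):
        vals, vecs = np.linalg.eigh(M)
        vals = np.clip(vals, 0, None)
        return (vecs * np.sqrt(vals)) @ vecs.conj().T

    def fidelity(rho, sigma):
        A = psd_sqrt(rho) @ psd_sqrt(sigma)
        return np.sum(np.linalg.svd(A, compute_uv=False))   # trace norm of A

    phi = np.array([1, 0]); psi = np.array([1, 1]) / np.sqrt(2)
    F = fidelity(np.outer(phi, phi), np.outer(psi, psi))
    assert np.isclose(F, abs(phi.conj() @ psi))              # Eq. (4.25) for pure states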
The following statement from Uhlmann generalizes this statement to arbitrary states.
Theorem 4.3.8 (Uhlmann). Let ρA and σA be density operators on a Hilbert space HA .
Then

F (ρA , σA ) = max F (ρAB , σAB ) .


ρAB ,σAB

where the maximum ranges over all purifications ρAB and σAB of ρA and σA , respectively.

Proof. Because any finite-dimensional Hilbert space can be embedded into any other
Hilbert space with higher dimension, we can assume without loss of generality that HA
and HB have equal dimension.
Let {ei }i and {fi }i be orthonormal bases of HA and HB , respectively, and define
Θ := Σi ei ⊗ fi .

Furthermore, let W ∈ Hom(HA , HB ) be the transformation of the basis {ei }i to the basis
{fi }i , that is,
W : ei 7→ fi .
Writing out the definition of Θ, it is easy to verify that, for any SB ∈ End(HB ),
(idA ⊗ SB )Θ = (SA0 ⊗ idB )Θ (4.26)

where SA0 := W −1 SBT W , and where SBT denotes the transpose of SB with respect to the basis {fi }i .
Let now ρAB = |ΨihΨ| and let
Ψ = Σi αi e0i ⊗ fi0

be a Schmidt decomposition of Ψ. Because the coefficients αi are the square roots of the
eigenvalues of ρA , we have

Ψ = (√ρA ⊗ idB )(UA ⊗ UB )Θ
where UA is the transformation of {ei }i to {e0i }i and, likewise, UB is the transformation
of {fi }i to {fi0 }i . Using (4.26), this can be rewritten as

Ψ = (√ρA V ⊗ idB )Θ
for V := UA W −1 UBT W unitary. Similarly, for σAB = |Ψ0 ihΨ0 |, we have

Ψ0 = (√σA V 0 ⊗ idB )Θ
for some appropriately chosen unitary V 0 . Thus, using (4.25), we find
F (ρAB , σAB ) = |hΨ|Ψ0 i| = |hΘ|V ∗ √ρA √σA V 0 |Θi| = |tr(V ∗ √ρA √σA V 0 )| ,
where the last equality is a consequence of the definition of Θ. Using the fact that any
unitary V 0 can be obtained by an appropriate choice of the purification σAB , this can be
rewritten as
F (ρAB , σAB ) = max_U |tr(U √ρA √σA )| .

The assertion then follows because, by Lemma 4.1.2,


F (ρA , σA ) = k√ρA √σA k1 = max_U |tr(U √ρA √σA )| .

Uhlmann’s theorem is very useful for deriving properties of the fidelity, as, e.g., the
following lemma.
Lemma 4.3.9. Let ρAB and σAB be bipartite states. Then
F (ρAB , σAB ) ≤ F (ρA , σA ) .
Proof. According to Uhlmann’s theorem, there exist purifications ρABC and σABC of ρAB
and σAB such that
F (ρAB , σAB ) = F (ρABC , σABC ) . (4.27)
Trivially, ρABC and σABC are also purifications of ρA and σA , respectively. Hence, again
by Uhlmann’s theorem,
F (ρA , σA ) ≥ F (ρABC , σABC ) . (4.28)
Combining (4.27) and (4.28) concludes the proof.
The trace distance and the fidelity are related to each other. In fact, for pure states,
represented by normalized vectors φ and ψ, we have
δ(φ, ψ) = √(1 − F (φ, ψ)2 ) . (4.29)
To see this, let φ⊥ be a normalized vector orthogonal to φ such that ψ = αφ + βφ⊥ , for
some α, β ∈ R+ such that α2 +β 2 = 1. (Because the phases of both φ, φ⊥ , ψ are irrelevant,
the coefficients α and β can without loss of generality assumed to be real and positive.)
The operators |φihφ| and |ψihψ| can then be written as matrices with respect to the basis
{φ, φ⊥ },
 
|φihφ| = ( 1 0 ; 0 0 )    and    |ψihψ| = ( |α|2 αβ ∗ ; α∗ β |β|2 ) ,

where the semicolon separates the rows of the matrices.
In particular, the trace distance takes the form
δ(φ, ψ) = (1/2) k |φihφ| − |ψihψ| k1 = (1/2) k ( 1 − |α|2 −αβ ∗ ; −α∗ β −|β|2 ) k1 .
The eigenvalues of the matrix on the right hand side are α0 = β and α1 = −β. We thus
find
δ(φ, ψ) = (1/2)( |α0 | + |α1 | ) = β .
Furthermore, by the definition of β, we have
β = √(1 − |hφ|ψi|2 ) .
The assertion (4.29) then follows from (4.25).
Equality (4.29) together with Uhlmann’s theorem are sufficient to prove one direction
of the following lemma.

Lemma 4.3.10. Let ρ and σ be density operators. Then
1 − F (ρ, σ) ≤ δ(ρ, σ) ≤ √(1 − F (ρ, σ)2 ) .

Proof. We only prove the second inequality. For a proof of the first, we refer to [1].
Consider two density operators ρA and σA and let ρAB and σAB be purifications such
that

F (ρA , σA ) = F (ρAB , σAB )

as in Uhlmann’s theorem. Combining this with equality (4.29) and Lemma 4.3.6, we find
√(1 − F (ρA , σA )2 ) = √(1 − F (ρAB , σAB )2 ) = δ(ρAB , σAB ) ≥ δ(ρA , σA ) .

A set of states which are often used in quantum information are the Bell states. As we
will use them later in the course we state here the definition.
Definition 4.3.11. The Bell states or EPR pairs are four specific two-qubit states
β0 , ..., β3 defined by
|βµ i := (1/√2) Σ_{a,b∈{0,1}} (σµ )ab |a, bi .

Having defined what Bell states are we can define the ebit as follows.
Definition 4.3.12. An ebit is one unit of bipartite entanglement, the amount of entan-
glement that is contained in a maximally entangled two-qubit state, i.e. a Bell state.

In other words this means that if we speak about an ebit, we mean one of the four Bell
states.
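The four Bell states are easy to generate numerically. In the following sketch (numpy assumed) the operators σµ are taken to be the identity and the three Pauli matrices; this convention is an assumption about the notation used in Definition 4.3.11:

    import numpy as np

    paulis = [np.eye(2), np.array([[0, 1], [1, 0]]),
              np.array([[0, -1j], [1j, 0]]), np.array([[1, 0], [0, -1]])]

    def bell_state(mu):
        # |beta_mu> = (1/sqrt(2)) sum_{a,b} (sigma_mu)_{ab} |a,b>
        return paulis[mu].reshape(-1) / np.sqrt(2)

    assert np.allclose(bell_state(0), np.array([1, 0, 0, 1]) / np.sqrt(2))
    # the four Bell states form an orthonormal basis of C^2 ⊗ C^2
    G = np.array([[abs(np.vdot(bell_state(m), bell_state(n))) for n in range(4)]
                  for m in range(4)])
    assert np.allclose(G, np.eye(4))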

4.4 Evolution and measurements


Let HA ⊗ HB be a composite system. We have seen in the previous sections that, as long
as we are only interested in the observable quantities of subsystem HA , it is sufficient
to consider the corresponding reduced state ρA . So far, however, we have restricted our
attention to scenarios where the evolution of this subsystem is isolated.
In the following, we introduce tools that allow us to consistently describe the behavior of
subsystems in the general case where there is interaction between HA and HB . The basic
mathematical objects to be introduced in this context are completely positive maps (CPMs)
and positive operator valued measures (POVMs), which are the topic of this section.

4.4.1 Completely positive maps (CPMs)
Let HA and HB be the Hilbert spaces describing certain (not necessarily disjoint) parts
of a physical system. The evolution of the system over a time interval [t0 , t1 ] induces a
mapping E from the set of states S(HA ) on subsystem HA at time t0 to the set of states
S(HB ) on subsystem HB at time t1 . This and the following sections are devoted to the
study of this mapping.
Obviously, not every function E from S(HA ) to S(HB ) corresponds to a physically
possible evolution. In fact, based on the considerations in the previous sections, we have
the following requirement. If ρ is a mixture of two states ρ0 and ρ1 , then we expect that
E(ρ) is the mixture of E(ρ0 ) and E(ρ1 ). In other words, a physical mapping E needs to
conserve the convex structure of the set of density operators, that is,

E( pρ0 + (1 − p)ρ1 ) = pE(ρ0 ) + (1 − p)E(ρ1 ) , (4.30)

for any ρ0 , ρ1 ∈ S(HA ) and any p ∈ [0, 1].


As we shall see, any mapping from S(HA ) to S(HB ) that satisfies (4.30) corresponds
to a physical process (and vice versa). In the following, we will thus have a closer look at
these mappings.
For our considerations, it will be convenient to embed the mappings from S(HA ) to
S(HB ) into the space of mappings from End(HA ) to End(HB ). The convexity require-
ment (4.30) then turns into the requirement that the mapping is linear. In addition, the
requirement that density operators are mapped to density operators will correspond to
two properties, called complete positivity and trace preservation.
The definition of complete positivity is based on the definition of positivity.

Definition 4.4.1. A linear map E ∈ Hom(End(HA ), End(HB )) is said to be positive if


E(S) ≥ 0 for any S ≥ 0.
A simple example of a positive map is the identity map on End(HA ), in the following
denoted IA . A more interesting example is TA defined by

TA : S 7→ S T ,

where S T denotes the transpose with respect to some fixed basis. To see that TA is
positive, note that S ≥ 0 implies hφ̄|S|φ̄i ≥ 0 for any vector φ̄. Hence hφ|S T |φi =
hφ|S̄ ∗ |φi = hφ|S̄|φi = hφ̄|S|φ̄i ≥ 0, from which we conclude S T ≥ 0.
Remarkably, positivity of two maps E and F does not necessarily imply positivity of the
tensor map E ⊗ F defined by

(E ⊗ F)(S ⊗ T ) := E(S) ⊗ F(T ) .

In fact, it is straightforward to verify that the map IA ⊗TA0 applied to the positive operator
ρAA0 := |ΨihΨ|, for Ψ defined by (4.11), results in a non-positive operator.
To guarantee that tensor products of mappings such as E ⊗ F are positive, a stronger
requirement is needed, called complete positivity.
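The failure of the transpose map to remain positive under extension can be checked numerically. The following sketch (numpy assumed; not part of the notes) applies IA ⊗ TA0 to the state (4.11) and finds a negative eigenvalue:

    import numpy as np

    Psi = np.array([1, 0, 0, 1]) / np.sqrt(2)
    rho = np.outer(Psi, Psi)
    # transpose only the second tensor factor of rho
    pt = rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)
    assert np.min(np.linalg.eigvalsh(pt)) < 0     # one eigenvalue equals -1/2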

Definition 4.4.2. A linear map E ∈ Hom(End(HA ), End(HB )) is said to be completely
positive if for any Hilbert space HR , the map E ⊗ IR is positive.
Definition 4.4.3. A linear map E ∈ Hom(End(HA ), End(HB )) is said to be trace pre-
serving if tr(E(S)) = tr(S) for any S ∈ End(HA ).
We will use the abbreviation CPM to denote completely positive maps. Moreover, we
denote by TPCPM(HA , HB ) the set of trace-preserving completely positive maps from
End(HA ) to End(HB ).

4.4.2 The Choi-Jamiolkowski isomorphism


The Choi-Jamiolkowski isomorphism is a mapping that relates CPMs to density operators.
Its importance results from the fact that it essentially reduces the study of CPMs to
the study of density operators. In other words, it allows us to translate mathematical
statements that hold for density operators to statements for CPMs (and vice versa).
Let HA and HB be Hilbert spaces, let HA0 be isomorphic to HA , and define the nor-
malized vector Ψ = ΨA0 A ∈ HA0 ⊗ HA by
Ψ = (1/√d) Σ_{i=1}^{d} ei ⊗ ei

where {ei }i=1,...,d is an orthonormal basis of HA ∼= HA0 and d = dim(HA ).
Definition 4.4.4. The Choi-Jamiolkowski mapping (relative to the basis {ei }i ) is the
linear function τ from Hom(End(HA ), End(HB )) to End(HA0 ⊗ HB ) defined by

τ : E 7→ (IA0 ⊗ E)(|ΨihΨ|) .

Lemma 4.4.5. The Choi-Jamiolkowski mapping

τ : Hom(End(HA ), End(HB )) −→ End(HA0 ⊗ HB )

is an isomorphism. Its inverse τ −1 maps any ρA0 B to

τ −1 (ρA0 B ) : SA 7→ d · trA0 ( (TA→A0 (SA ) ⊗ idB ) ρA0 B ) ,

where TA→A0 : End(HA ) → End(HA0 ) is defined by

TA→A0 (SA ) := Σ_{i,j} |ei iA0 hej |A SA |ei iA hej |A0 .

Proof. It suffices to verify that the mapping τ −1 defined in the lemma is indeed an inverse
of τ . We first check that τ ◦ τ −1 is the identity on End(HA0 ⊗ HB ). That is, we show that
for any operator ρA0 B ∈ End(HA0 ⊗ HB ), the operator
 
τ (τ −1 (ρA0 B )) := d · (IA0 ⊗ trA0 )( ( (IA0 ⊗ TA→A0 )(|ΨihΨ|) ⊗ idB ) (idA0 ⊗ ρA0 B ) ) (4.31)

equals ρA0 B (where we have written IA0 ⊗ trA0 instead of trA0 to indicate that the trace only acts on the second subsystem HA0 ). Inserting the definition of Ψ, we find

τ (τ −1 (ρA0 B )) = (IA0 ⊗ trA0 ) Σ_{i,j} (|ei ihej |A0 ⊗ |ej ihei |A0 ⊗ idB )(idA0 ⊗ ρA0 B )
               = Σ_{i,j} (|ei ihei |A0 ⊗ idB ) ρA0 B (|ej ihej |A0 ⊗ idB ) = ρA0 B ,

which proves the claim that τ ◦ τ −1 is the identity.


It remains to show that τ is injective. For this, let SA ∈ End(HA ) be arbitrary and
note that

(TA→A0 (SA ) ⊗ idA )Ψ = (idA0 ⊗ SA )Ψ .

Together with the fact that trA0 (|ΨihΨ|) = (1/d) idA this implies

E(SA ) = d · E( SA trA0 (|ΨihΨ|) )
       = d · trA0 ( (IA0 ⊗ E)( (idA0 ⊗ SA )|ΨihΨ| ) )
       = d · trA0 ( (IA0 ⊗ E)( (TA→A0 (SA ) ⊗ idA )|ΨihΨ| ) )
       = d · trA0 ( (TA→A0 (SA ) ⊗ idA ) (IA0 ⊗ E)(|ΨihΨ|) ) .

Assume now that τ (E) = 0. Then, by definition, (IA0 ⊗ E)(|ΨihΨ|) = 0. By virtue of the
above equality, this implies E(SA ) = 0 for any SA and, hence, E = 0. In other words,
τ (E) = 0 implies E = 0, i.e., τ is injective.
In the following, we focus on trace-preserving CPMs. The set TPCPM(HA , HB ) ob-
viously is a subset of Hom(End(HA ), End(HB )). Consequently, τ (TPCPM(HA , HB )) is
also a subset of End(HA0 ⊗ HB ). It follows immediately from the complete positivity
property that τ (TPCPM(HA , HB )) only contains positive operators. Moreover, by the
trace-preserving property, any ρA0 B ∈ τ (TPCPM(HA0 , HB )) satisfies
trB (ρA0 B ) = (1/d) idA0 . (4.32)
In particular, ρA0 B is a density operator.
Conversely, the following lemma implies13 that any density operator ρA0 B that satis-
fies (4.32) is the image of some trace-preserving CPM. We therefore have the following
characterization of the image of TPCPM(HA , HB ) under the Choi-Jamiolkowski isomor-
phism,

τ (TPCPM(HA , HB )) = {ρA0 B ∈ S(HA0 ⊗ HB ) : trB (ρA0 B ) = (1/d) idA0 } .

13 See the argument in Section 4.4.3.

Lemma 4.4.6. Let Φ ∈ HA0 ⊗ HB be such that trB (|ΦihΦ|) = (1/d) idA0 . Then the mapping
0 Then the mapping
E := τ −1 (|ΦihΦ|) has the form

E : SA 7→ U SA U ∗

where U ∈ Hom(HA , HB ) is an isometry, i.e., U ∗ U = idA .


Proof. Using the expression for E := τ −1 (|ΦihΦ|) provided by Lemma 4.4.5, we find, for
any SA ∈ End(HA ),

E(SA ) = d · trA0 ( (TA→A0 (SA ) ⊗ idB )|ΦihΦ| )
       = d · Σ_{i,j} hei |SA |ej i (hei | ⊗ idB )|ΦihΦ|(|ej i ⊗ idB )
       = Σ_{i,j} Ei SA Ej∗ ,

where Ei := √d · (hei | ⊗ idB )|Φihei |. Defining U := Σi Ei , we conclude that E has the desired form, i.e., E(SA ) = U SA U ∗ .
To show that U is an isometry, let
Φ = (1/√d) Σi ei ⊗ fi

be a Schmidt decomposition of Φ. (Note that, because trB (|ΦihΦ|) is fully mixed, the basis {ei } can be chosen to coincide with the basis used for the definition of τ .) Then (hei | ⊗ idB )|Φi = (1/√d)|fi i and, hence,

U ∗ U = d Σ_{i,j} |ej ihΦ|(|ej i ⊗ idB )(hei | ⊗ idB )|Φihei | = Σ_{i,j} |ej ihfj |fi ihei | = idA .

Motivated by the Choi-Jamiolkowski isomorphism we sometimes use for a CPM the notation adS : ρ 7→ SρS ∗ , for some arbitrary operator S.
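The Choi-Jamiolkowski state of a map is simple to compute numerically. The following sketch (numpy assumed) does so for a map given in the form SA 7→ Σx Ex SA Ex∗ (the operator-sum form discussed in Section 4.4.4); the specific qubit depolarizing channel used here is an illustrative assumption, not taken from the notes:

    import numpy as np

    def choi_state(kraus_ops, d):
        # tau(E) = (I_{A'} ⊗ E)(|Psi><Psi|) with Psi = (1/sqrt(d)) sum_i e_i ⊗ e_i
        Psi = np.eye(d).reshape(-1) / np.sqrt(d)
        rho = np.outer(Psi, Psi.conj())
        return sum(np.kron(np.eye(d), E) @ rho @ np.kron(np.eye(d), E).conj().T
                   for E in kraus_ops)

    p = 0.5
    X = np.array([[0, 1], [1, 0]]); Y = np.array([[0, -1j], [1j, 0]]); Z = np.diag([1.0, -1.0])
    kraus = [np.sqrt(1 - 3 * p / 4) * np.eye(2), np.sqrt(p / 4) * X,
             np.sqrt(p / 4) * Y, np.sqrt(p / 4) * Z]

    tau_E = choi_state(kraus, 2)
    tau_A = tau_E.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)   # tr_B of the Choi state
    assert np.allclose(tau_A, np.eye(2) / 2)                    # Eq. (4.32)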

4.4.3 Stinespring dilation


The following lemma will be of crucial importance for the interpretation of CPMs as
physical maps.
Lemma 4.4.7 (Stinespring dilation). Let E ∈ TPCPM(HA , HB ). Then there exists an
isometry U ∈ Hom(HA , HB ⊗ HR ), for some Hilbert space HR , such that

E : SA 7→ trR (U SA U ∗ ) .

Proof. Let EA→B := E, define ρAB := τ (E), and let ρABR be a purification of ρAB . We then define E 0 = E 0A→(B,R) := τ −1 (ρABR ). According to Lemma 4.4.6, because trBR (ρABR ) is fully mixed, E 0A→(B,R) has the form

E 0A→(B,R) : SA 7→ U SA U ∗ ,

where U is an isometry. The assertion then follows from the fact that the diagram below
commutes, which can be readily verified from the definition of the Choi-Jamiolkowski
isomorphism. (Note that the arrow on the top corresponds to the operation E 0 7→ trR ◦ E 0 .)
                 tr_R ◦ ( · )
   EA→B   ←−−−−−−−−−−−−   E 0A→(B,R)
     |                          ↑
   τ |                          |  τ −1
     ↓                          |
   ρA0 B   −−−−−−−−−−−−→   ρA0 BR
               purification

We can use Lemma 4.4.7 to establish a connection between general trace-preserving


CPMs and the evolution postulate of Section 4.3.2. Let E ∈ TPCPM(HA , HA ) and let U ∈
Hom(HA , HA ⊗HR ) be the corresponding Stinespring dilation, as defined by Lemma 4.4.7.
Furthermore, let Ũ ∈ Hom(HA ⊗ HR , HA ⊗ HR ) be a unitary embedding of U in HA ⊗ HR ,
i.e., Ũ is unitary and, for some fixed w0 ∈ HR , satisfies

Ũ : v ⊗ w0 7→ U v .

Using the fact that U is an isometry, it is easy to see that there always exists such a Ũ .
By construction, the unitary Ũ satisfies

E(SA ) = trR Ũ (SA ⊗ |w0 ihw0 |)Ũ ∗




for any operator SA on HA . Hence, the mapping E on HA can be seen as a unitary on


an extended system HA ⊗ HR (with HR being initialized with a state w0 ) followed by a
partial trace over HR . In other words, any possible mapping from density operators to
density operators that satisfies the convexity criterion (4.30) (this is exactly the set of
trace-preserving CPMs) corresponds to a unitary evolution of a larger system.

4.4.4 Operator-sum representation


As we have seen in the previous section, CPMs can be represented as unitaries on a larger
system. In the following, we consider an alternative and somewhat more economic14
description of CPMs.

14 In the sense that there is less redundant information in the description of the CPM.

Lemma 4.4.8 (Operator-sum representation). For any E ∈ TPCPM(HA , HB ) there ex-
ists a family {Ex }x of operators Ex ∈ Hom(HA , HB ) such that
E : SA 7→ Σx Ex SA Ex∗ (4.33)

and Σx Ex∗ Ex = idA .
Conversely, any mapping E of the form (4.33) is contained in TPCPM(HA , HB ).
Proof. By Lemma 4.4.7, there exists an isometry U ∈ Hom(HA , HB ⊗ HR ) such that
X
E(SA ) = trR (U SA U ∗ ) = (idB ⊗ hfx |)U SA U ∗ (idB ⊗ |fx i) ,
x

where {fx }x is an orthonormal basis of HR . Defining

Ex := (idB ⊗ hfx |)U ,

the direct assertion follows from the fact that


X X
Ex∗ Ex = U ∗ (idB ⊗ |fx i)(idB ⊗ hfx |)U = U ∗ U = id ,
x x

which holds because U is an isometry.


The converse assertion can be easily verified as follows. The fact that any mapping of
the form (4.33) is positive follows from the observation that Ex SA Ex∗ is positive whenever
SA is positive. To show that the mapping is trace-preserving, we use
X X
tr(E(SA )) = tr(Ex SA Ex∗ ) = tr(Ex∗ Ex SA ) = tr(idA SA ) .
x x

Note that the family {Ex }x is not uniquely determined by the CPM E. This is easily
seen by the following example. Let E be the trace-preserving CPM from End(HA ) to
End(HB ) defined by

E : SA 7→ tr(SA )|wihw|

for any operator SA ∈ End(HA ) and some fixed w ∈ HB . That is, E maps any density
operator to the state |wihw|. It is easy to verify that this CPM can be written in the
form (4.33) for

Ex := |wihex |

where {ex }x is an arbitrary orthonormal basis of HA .
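Applying a channel in the operator-sum form (4.33) amounts to a short sum of conjugations. The following sketch (numpy assumed) illustrates this for the qubit depolarizing channel E(ρ) = (1 − p)ρ + (p/2) id; this particular channel, and the choice of Kraus operators for it, are illustrative assumptions (it also reappears in Section 4.4.8):

    import numpy as np

    def apply_channel(kraus_ops, rho):
        # E(rho) = sum_x E_x rho E_x^*
        return sum(E @ rho @ E.conj().T for E in kraus_ops)

    p = 0.5
    X = np.array([[0, 1], [1, 0]]); Y = np.array([[0, -1j], [1j, 0]]); Z = np.diag([1.0, -1.0])
    kraus = [np.sqrt(1 - 3 * p / 4) * np.eye(2), np.sqrt(p / 4) * X,
             np.sqrt(p / 4) * Y, np.sqrt(p / 4) * Z]

    assert np.allclose(sum(E.conj().T @ E for E in kraus), np.eye(2))   # trace preservation
    rho = np.array([[1, 0], [0, 0]], dtype=complex)
    assert np.allclose(apply_channel(kraus, rho), (1 - p) * rho + (p / 2) * np.eye(2))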

4.4.5 Measurements as CPMs
An elegant approach to describe measurements is to use the notion of classical states. Let ρAB be a density operator on HA ⊗ HB and let O = Σx x Px be an observable on
HA . Then, according to the measurement postulate of Section 4.3.2, the measurement
process produces a classical value X distributed according to the probability distribution
PX specified by (4.13), and the post-measurement state ρ0AB,x conditioned on the outcome
x is given by (4.14). This situation is described by a density operator
X
ρ0XAB := PX (x)|ex ihex | ⊗ ρ0AB,x .
x

on HX ⊗ HA ⊗ HB which is classical on HX (with respect to some orthonormal basis


{ex }x ). Inserting the expressions for PX and ρ0AB,x , this operator can be rewritten as
X
ρ0XAB = |ex ihex | ⊗ (Px ⊗ idB )ρAB (Px ⊗ idB ) .
x

Note that the mapping E from ρAB to ρ0XAB can be written in the operator-sum repre-
sentation (4.33) with

Ex := |xi ⊗ Px ⊗ idB ,

where
X X
Ex∗ Ex = Px ⊗ idB = idAB .
x x

It thus follows from Lemma 4.4.8 that the mapping

E : ρAB 7→ ρ0XAB

is a trace-preserving CPM.
This is a remarkable statement. According to the Stinespring dilation theorem, it tells
us that any measurement can be seen as a unitary on a larger system. In other words, a
measurement is just a special type of evolution of the system.

4.4.6 Positive operator valued measures (POVMs)


When analyzing a physical system, one is often only interested in the probability distri-
bution of the observables (but not in the post-measurement state). Consider a system
that first undergoes an evolution characterized by a CPM and, after that, is measured.
Because, as argued above, a measurement can be seen as a CPM, the concatenation of
the evolution and the measurement is again a CPM E ∈ TPCPM(HA , HX ⊗ HB ). If the
measurement outcome X is represented by orthogonal vectors {ex }x of HX , this CPM has
the form
X
E : SA 7→ |ex ihex | ⊗ Ex SA Ex∗ .
x

In particular, if we apply the CPM E to a density operator ρA , the distribution PX of the
measurement outcome X is given by

PX (x) = tr(Ex ρA Ex∗ ) = tr(Mx ρA ) ,

where Mx := Ex∗ Ex .
From this we conclude that, as long as we are only interested in the probability distri-
bution of X, it suffices to characterize the evolution and the measurement by the family
of operators Mx . Note, however, that the operators Mx do not fully characterize the full
evolution. In fact, distinct operators Ex can give rise to the same operator Mx = Ex∗ Ex .
It is easy to see from Lemma 4.4.8 that the family {Mx }x of operators defined as above
satisfies the following definition.
Definition 4.4.9. A positive operator valued measure (POVM) (on H) is a family {Mx }x
of positive operators Mx ∈ Herm(H) such that
Σx Mx = idH .

Conversely, any POVM {Mx }x corresponds to a (not unique) physically possible evo-
lution followed by a measurement. This can easily be seen by defining a CPM by the operator-sum representation with operators Ex := √Mx .
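Outcome statistics are obtained from a POVM by PX (x) = tr(Mx ρ). A minimal sketch (numpy assumed); the two-element POVM corresponding to a measurement in the basis {(|0i + |1i)/√2, (|0i − |1i)/√2} is an illustrative assumption:

    import numpy as np

    plus = np.array([1, 1]) / np.sqrt(2)
    minus = np.array([1, -1]) / np.sqrt(2)
    povm = [np.outer(plus, plus.conj()), np.outer(minus, minus.conj())]
    assert np.allclose(sum(povm), np.eye(2))                 # completeness

    def outcome_distribution(povm, rho):
        return np.real([np.trace(M @ rho) for M in povm])

    rho = np.array([[1, 0], [0, 0]])                          # state |0><0|
    assert np.allclose(outcome_distribution(povm, rho), [0.5, 0.5])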

4.4.7 The diamond norm of CPMs


Let E and F be arbitrary CPMs from S(H) to S(H0 ). The defining requirement on the distance measure d(·, ·) between the CPMs E and F is that it be proportional to the maximal probability of distinguishing the maps E and F in an experiment. After our discussion of the trace distance between states in an earlier chapter it is natural to propose the distance measure

d̃(E, F) := max_{ρ∈S(H)} kE(ρ) − F(ρ)k1

if one recalls the ”maximal distinguishing probability property” of the trace distance. Up
to a factor 1/2 this is the maximal probability to distinguish the CPMs E and F in an
experiment which works with initial states in the Hilbert space H. But this is not the best
way to distinguish the CPMs E and F in an experiment! Note that in our naive definition
above we have excluded the possibility to consider initial states in ”larger” Hilbert spaces
in the maximization-procedure. The probability to distinguish the CPMs E and F in an
experiment may increase if we ”enlarge” the input Hilbert space H by an additional tensor
space factor;
H −→ H ⊗ HE ;
and apply the CPMs E and F as E ⊗ IE and F ⊗ IE to states in S(H ⊗ HE ). These
replacements lead to a simultaneous replacement of the output Hilbert space:

H0 −→ H0 ⊗ HE .

In Section 4.4.8, an explicit example is discussed which shows that there exist situations in which

d̃(E, F) < d̃(E ⊗ IE , F ⊗ IE )

for some Hilbert space HE . This shows why we discard the immediate use of d̃(E, F) but use a distance measure of the form d̃(E ⊗ IE , F ⊗ IE ) instead. We will still have to figure out the optimal choice for the Hilbert space HE , which will lead to the definition of the so-called "diamond norm".
As motivated above we consider
d(E, F) := max_{ρ∈S(H⊗HE )} k(E ⊗ IE )(ρ) − (F ⊗ IE )(ρ)k1

instead of our naive approach. Next one asks how the distinguishing probability depends
on the choice of the Hilbert space HE . To that purpose we are stating and proving
Lemma 4.4.10. In the final definition of distance between CPMs we will then use a Hilbert
space HE which maximizes the probability for distinguishing the CPMs E and F.
Lemma 4.4.10. Let ρAB be a pure state on a Hilbert space HA ⊗ HB and let ρ0AB 0 be an
arbitrary state on a Hilbert space HA ⊗ HB 0 , such that
trB ρAB = trB 0 ρ0AB 0 .
Then there exists a CPM E : S(HB ) → S(HB 0 ), such that
ρ0AB 0 = IA ⊗ E(ρAB ).
Proof. Assume that ρ0AB 0 is pure. Since ρAB is pure (and by the assumption made in the
lemma) there exist states ψ ∈ HA ⊗ HB and ψ 0 ∈ HA ⊗ HB 0 , such that ρAB = |ψihψ| and
ρ0AB 0 = |ψ 0 ihψ 0 |. Let

|ψi = Σi √λi |vi iA ⊗ |wi iB    and    |ψ 0 i = Σi √λ0i |vi0 iA ⊗ |wi0 iB 0
be the Schmidt decompositions of |ψi and |ψ 0 i. Without loss of generality we assume
|vi iA = |vi0 iA because vi and vi0 are both eigenvectors of the operator ρA := trB ρAB =
trB 0 ρ0AB 0 . Define the map

U := Σi |wi0 iB 0 hwi |B .
This map U is an isometry because {wi }i and {wi0 }i are orthonormal systems in HB and
HB 0 , respectively. Consequently,
ψ 0 = (idA ⊗ U )(ψ)
which proves the lemma for ρ0AB 0 being pure.
Now let’s assume that ρ0AB 0 isn’t pure and consider the purification ρ0AB 0 R of ρ0AB 0 . Then
(according to the statement proved so far) there exists a map
U : HB → HB 0 ⊗ HR ,

such that
ρ0AB 0 R = (idA ⊗ U )ρAB (idA ⊗ U ∗ ).
Now we simply define E := trR ◦ adidA ⊗U and thus

ρ0AB 0 = IA ⊗ E(ρAB )

which concludes the proof.


Let us come back to the question about the best choice for the Hilbert space HE ap-
pearing in the definition of the distance measure in the space of CPMs. Let E1 and E2 be
two CPMs from S(HA ) to S(HA0 ) and let ρA be a state in S(HA ), ρAR ∈ S(HA ⊗ HR ) be
the purification of ρA with dim(R) = dim(A), ρ0AB be a state in S(HA ⊗ HB ) such that
ρA = trB ρ0AB . According to the lemma we just proved there exists a CPM G : S(HR ) →
S(HB ) such that
ρ0AB = IA ⊗ G(ρAR ).
The CPMs E1 and E2 act only on states in S(HA ) and thus they act on the states ρAR
and ρ0AB as

E1 ⊗ IB (ρ0AB ) = (IA ⊗ G) ◦ (E1 ⊗ IA )(ρAR )


E2 ⊗ IB (ρ0AB ) = (IA ⊗ G) ◦ (E2 ⊗ IA )(ρAR ).

We have proved in an earlier chapter about quantum states and operations that trace
preserving CPMs can never increase the distance between states. We thus get

kE1 ⊗ IB (ρ0AB ) − E2 ⊗ IB (ρ0AB )k1 ≤ kE1 ⊗ IA (ρAR ) − E2 ⊗ IA (ρAR )k1 .

This inequality holds for any choice of HB and states in S(HA ⊗ HB ). We conclude that
the right-hand side of the inequality describes the best way to distinguish the CPMs
E1 and E2 in an experiment. Consequently, this is the best choice for the distance measure
between CPMs. This distance measure is induced by the following norm.
Definition 4.4.11 (Diamond norm for CPMs). Let H and G be two Hilbert spaces and
let
E : S(H) → S(G)
be a CPM. Then the diamond norm kEk♦ of E is defined as

kEk♦ := kE ⊗ IH k1 ,

where k · k1 denotes the so called trace norm for resources which is defined as

kΨk1 := max kΨ(ρ)k1


ρ∈S(L1 ⊗L2 )

where Ψ : S(L1 ) → S(L2 ) denotes an arbitrary CPM.

4.4.8 Example: Why to enlarge the Hilbert space
We consider an explicit example to recognize that situations occur in which
˜ F) < d(E
d(E, ˜ ⊗ IE , F ⊗ IE )

for some Hilbert space HE . Let H ∼= H0 ∼= HE ∼= C2 , define

E : S(C2 ) → S(C2 ),   ρ 7→ E(ρ) = (1 − p)ρ + (p/2) IC2

and set F := I := IC2 . We are trying to show that


d̃(E, I) < d̃(E ⊗ IE , I ⊗ IE ).

We first compute the left hand side explicitly and prove the inequality afterwards building
on the explicit result derived for the left hand side.
According to the proposed distance measure d̃(·, ·),

d̃(E, I) = max_{ρ∈S(H)} kE(ρ) − I(ρ)k1 .

To compute this expression we first prove two claims.


Claim 4.4.12. The distance kE(ρ)−F(ρ)k1 is maximal for pure states ρ = |ψihψ|, ψ ∈ H.
Proof. The state ρ can be written in the form

ρ = pρ1 + (1 − p)ρ2 ,

where ρ1 and ρ2 have support on orthogonal subspaces. Therefore, we observe

kE(ρ) − F(ρ)k1 ≤ pkE(ρ1 ) − F(ρ1 )k1 + (1 − p)kE(ρ2 ) − F(ρ2 )k1


≤ max{kE(ρ1 ) − F(ρ1 )k1 , kE(ρ2 ) − F(ρ2 )k1 },

where we have used the linearity of CPMs and the triangle inequality in the first step.
The application of this to smaller and smaller subsystems leads to pure states in the end.
This proves the claim.
Claim 4.4.13. The distance kE(ρ) − I(ρ)k1 is invariant under unitary transformations
of ρ, i.e.,
kE(ρ) − ρk1 = kE(U ρU ∗ ) − U ρU ∗ k1 .
Proof. Because of the invariance of the trace norm under unitaries,

kE(ρ) − ρk1 = kU E(ρ)U ∗ − U ρU ∗ k1


= kE(U ρU ∗ ) − U ρU ∗ k1 ,

where we have used the explicit definition of the map E in the second step. This proves
the claim.

Together, these two claims imply that we can use any pure state ρ = |ψihψ| to maximize kE(ρ) − ρk1 . We choose |ψi = |0i where {|0i, |1i} is the computational basis of C2 . We get

d̃(E, I) = k E(|0ih0|) − |0ih0| k1 = k ( −p/2 0 ; 0 p/2 ) k1 = p .
Now that we have computed d̃(E, I) we have a closer look at an experiment where the experimentalist implements the maps E and I as E ⊗ IE = E ⊗ I and I ⊗ IE = I ⊗ I, respectively. We thus have to show that

p < d̃(E ⊗ IE , I ⊗ IE ).

According to the definition of d̃(·, ·) it is sufficient to find a state ρ ∈ S(C2 ⊗ C2 ) such that

k(E ⊗ I)(ρ) − (I ⊗ I)(ρ)k1 > d̃(E, I) = p.

For simplicity, we assume p = 1/2. Our ansatz for ρ is the Bell state |β0 ihβ0 |, where
|β0 i = (1/√2)(|00i + |11i),
as introduced in Definition 4.3.11. Applying E to the first qubit gives E ⊗ I(ρ) = (1 − p)ρ + (p/2) IC2 ⊗ trA (ρ) = (1 − p)ρ + (p/4) id, and hence

E ⊗ I(ρ) − I ⊗ I(ρ) = −p |β0 ihβ0 | + (p/4) id .

This operator has the eigenvalue −3p/4 (on |β0 i) and the eigenvalue p/4 (with multiplicity three). From there it is easy to compute

kE ⊗ I(ρ) − I ⊗ I(ρ)k1 = 3p/2 = 3/4 > 1/2 = d̃(E, I),

which completes the example.
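The computation can be checked numerically. The following sketch (numpy assumed; not part of the notes) evaluates the trace norm directly and confirms the value 3p/2, which indeed exceeds the single-system value p:

    import numpy as np

    p = 0.5
    beta0 = np.array([1, 0, 0, 1]) / np.sqrt(2)
    rho = np.outer(beta0, beta0)

    tr_A = rho.reshape(2, 2, 2, 2).trace(axis1=0, axis2=2)        # reduced state on B
    E_ext = (1 - p) * rho + (p / 2) * np.kron(np.eye(2), tr_A)    # (E ⊗ I)(rho)
    diff = E_ext - rho
    trace_norm = np.sum(np.abs(np.linalg.eigvalsh(diff)))
    assert np.isclose(trace_norm, 3 * p / 2) and trace_norm > p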

5 The Completeness of Quantum
Theory
In this section we prove that based on two weak assumptions,
• that quantum theory is correct and

• the compatibility with free choices,


quantum theory is complete. In other words this means that given these two assumptions
there cannot exist a theory that makes better predictions than quantum theory does. For
a more rigorous treatment of the material presented in this chapter consider [4].

5.1 Motivation

[Figure: a Stern-Gerlach setup — an incoming particle in state |φi is deflected towards one of two detectors, one above and one below.]

We consider the Stern-Gerlach experiment. If we input a state | ↓i quantum mechanics


predicts that we will measure a hit on the detector below, whereas for an input state | ↑i
we will measure a hit on the detector at the top. However if we input a state (1/√2)(| ↑i + | ↓i), according to quantum mechanics we will detect a hit at each detector with probability 1/2. Therefore it is legitimate to ask whether there exists a theory that makes more precise predictions. Imagine there is an additional parameter λ ∈ {up, down} such that for an input (1/√2)(| ↑i + | ↓i) and λ = up always the detector on the top registers a hit, and for λ = down the detector on the bottom always detects the particle. If λ is uniformly distributed but hidden, this would not contradict quantum theory. Nevertheless, the existence of λ would mean that quantum theory is not complete.

5.2 Preliminary definitions
Before stating the result we have to introduce some notation. Note that we do not make
any restrictions or assumptions in this section.
A physical theory usually takes some parameters A and then makes a prediction about
some observable quantity X.
Example 5.2.1. Classical mechanics can predict how long it takes until a mass hits the
ground when falling freely from a certain altitude. In this case A is the altitude and X is
the time that it takes to hit the ground.

Example 5.2.2. In the Stern-Gerlach experiment above, A would be the state in which
the particle is prepared and the angle of the apparatus. X denotes the coordinate where
the particle hits the screen.
More generally, we may have a set of parameters and outcomes. A physical theory
generally imposes a causal structure on these values. Mathematically, we will look at a set
of random variables {A, B, X, Y } with a partial order A → X. In the following we often
consider the following example situation.

[Diagram: a causal structure with A → X and B → Y .]
If A → X, we say that A and X are causally ordered, or that X is in the causal future of
A. If A → X does not hold we write A 6→ X.

[Figure: two spacetime diagrams (t versus x) showing the random variables A, B, X, Y placed at spacetime points.]

Assume that any random variable is associated to a space-time point (the point where a
parameter is chosen or where it is observed).

Definition 5.2.3. A causal structure is called compatible with relativity theory if A → X


holds if and only if X lies in the future light cone of A.
Note that the causal order is compatible with relativity theory. Let now Γ = {A, B, X, Y, Z}
be a set of random variables with an arbitrary causal order.

Definition 5.2.4. A parameter A is called freely random if and only if PAΓA = PA × PΓA ,
where ΓA := {W ∈ Γ : A 6→ W }.

[Figure: the same spacetime diagrams as above, with dashed ellipses marking the set ΓA .]

Example 5.2.5. For the same scenario as above, the dashed ellipses denote the set ΓA .

5.3 The informal theorem


Consider a physical theory, i.e. a law that, given a set of parameters (A, Λ) allows us to
compute (probabilistic) predictions PX|AΛ about the outcomes X. If this theory is

1. compatible with quantum theory, i.e. if the parameter Λ is dropped, we retrieve the predictions of quantum theory, i.e. PX|A = Σλ PX|AΛ=λ PΛ|A (λ) corresponds to quantum theory.
2. compatible with free randomness with respect to a causal structure compatible with
relativity theory.
then PX|AΛ=λ = PX|A for all λ.
In other words the theorem tells us that it is not possible to improve quantum theory
without giving up the idea of free choice. The precise theorem can be found in [4].

5.4 Proof sketch


Note that we prove the theorem for the special case where a measurement on a maximally
entangled state is carried out. A general proof can be found in [4].
We consider the special case of an experiment with two choices A and B and two
outcomes X and Y , where we have the following causal structure:

[Diagram: Alice chooses A and observes X, Bob chooses B and observes Y ; both outcomes may depend on Λ.]

Let A and B be freely random. Note that

PBX|AΛ = PX|BAΛ PB|AΛ and (5.1)


PBX|AΛ = PX|AΛ PB|XAΛ . (5.2)

Because B is freely random PBAΛX = PB PAΛX . In particular we have PB|AΛX = PB|AΛ = PB . Hence (5.1) and (5.2) are equivalent to
PBX|AΛ = PX|BAΛ PB and (5.3)
PBX|AΛ = PX|AΛ PB . (5.4)
This allows us to conclude that
PX|BAΛ = PX|AΛ . (5.5)
Note that (5.5) is called a non-signalling condition. (It implies that an agent choosing B
cannot use his choice to communicate to an agent that chooses A and sees X.)
By symmetry we also have a second non-signalling condition
PY |BΛA = PY |BΛ . (5.6)
Consider now an experiment where the two systems (Alice and Bob) are prepared in a
state |Ψi = (1/√2)(| ↑↑i + | ↓↓i) and N denotes a large integer. On Alice’s side, we perform
a measurement in the basis
{|αi, |α⊥ i}, where |αi = cos(α/2) | ↑i + sin(α/2) | ↓i (5.7)

and where α = A · π/(2N ) with A ∈ {0, 2, 4, . . . , 2N − 2}. Bob measures with respect to the basis

{|βi, |β ⊥ i}, where |βi = cos(β/2) | ↑i + sin(β/2) | ↓i (5.8)

and where β = B · π/(2N ) with B ∈ {1, 3, 5, . . . , 2N − 1}. Quantum theory prescribes the following:

Pr[X = Y |A = a, B = b] = cos²( (α − β)/2 ) . (5.9)

In particular, if |A − B| = 1 then

Pr[X ≠ Y |A = a, B = b] = sin²( π/(4N ) ) ≤ ( π/(4N ) )² . (5.10)

If a = 0, b = 2N − 1, then

Pr[X = Y |A = a, B = b] ≤ ( π/(4N ) )² . (5.11)
Exercise 5.4.1. Using the setup introduced above, we define

IN := Σ_{|a−b|=1} Pr[X ≠ Y |A = a, B = b] + Pr[X = Y |A = 0, B = 2N − 1] (5.12)
    ≤ 2N ( π/(4N ) )² (5.13)
    = O(1/N ) → 0, for N → ∞. (5.14)

If you try to reproduce the outcome of these experiments with two local classical computers, what is the minimum value of IN that can be achieved? [Hint: You will not be able to achieve IN = O(1/N ).]
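As a quick numerical cross-check (not part of the exercise; numpy assumed), the quantum-mechanical value of IN can be evaluated directly from the prediction (5.9) and indeed decreases like O(1/N ):

    import numpy as np

    def I_N(N):
        def p_equal(a, b):                 # Pr[X = Y | A = a, B = b], Eq. (5.9)
            alpha, beta = a * np.pi / (2 * N), b * np.pi / (2 * N)
            return np.cos((alpha - beta) / 2) ** 2

        A = range(0, 2 * N, 2)
        B = range(1, 2 * N, 2)
        total = sum(1 - p_equal(a, b) for a in A for b in B if abs(a - b) == 1)
        return total + p_equal(0, 2 * N - 1)

    for N in (1, 10, 100):
        assert np.isclose(I_N(N), 2 * N * np.sin(np.pi / (4 * N)) ** 2)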

Let Z be an arbitrary value computed from Λ, i.e. Z = f (Λ) for some function f (·).
The intuition behind Z is that it can be viewed as a guess for the outcome X if A = 0
and B = 1. We define

p := Pr[Z = X|A = 0, B = 1] (5.15)


= Pr[Z = X|A = 0], (5.16)

where the equation follows by the non-signalling condition (5.5).


We now use that since X = Y holds with high probability, Z is also a good guess for Y, i.e.,

Pr[Z = Y | A = 0, B = 1] ≥ p − ε   with   ε = O(1/N²),   (5.17)

where the inequality can be justified by the following probability table, constructed using (5.10) and (5.15).¹

                        X = Y                  X ≠ Y (prob. ≤ ε)
  Z = X (prob. p)       ⇒ Z = Y                ⇒ Z ≠ Y
  Z ≠ X                 ⇒ Z ≠ Y                ⇒ Z = Y
From the non-signalling condition and (5.17) we obtain

Pr[Z = Y | B = 1] ≥ p − ε.   (5.18)

Using once more the non-signalling condition we can write Pr[Z = Y | A = 2, B = 1] ≥ p − ε. We next apply the same step recursively. By the same argument as above (cf. the probability table) we obtain

Pr[Z = Y | A = 1, B = 2] ≥ p − 2ε.   (5.19)

Continuing in the same manner gives

Pr[Z = Y | B = 2N − 1] ≥ p − (2N − 1)ε.   (5.20)

Using the non-signalling condition leads to

Pr[Z = Y | A = 0, B = 2N − 1] ≥ p − (2N − 1)ε.   (5.21)

¹ If the dotted sets are denoted by A and B, basic probability theory tells us that P(A ∪ B) = P(A) + P(B\A) ≥ P(A) ≥ p − ε.
We can also find an upper bound for Pr[Z = Y | A = 0, B = 2N − 1] using (5.11). We can use a similar argument as before with the probability table, but now the left column of the table is upper bounded by ε, as dictated by (5.11). The top row is still equal to p. The bottom row clearly is (using the non-signalling condition and (5.15))

Pr[Z ≠ X | A = 0, B = 2N − 1] = 1 − p.   (5.22)

Now we just need to find the maximum probability for the top-left and bottom-right cells together. This occurs when the top-left cell's probability is ε and the bottom-left cell's probability is 0. This gives an upper bound of

Pr[Z = Y | A = 0, B = 2N − 1] ≤ 1 − p + ε.   (5.23)

Finally, combining (5.21) and (5.23) we obtain

1 − p + ε ≥ p − (2N − 1)ε,   (5.24)

which is equivalent to

2p ≤ 1 + 2Nε.   (5.25)

Since this must hold for arbitrary values of N and ε = O(1/N²), we conclude that p ≤ 1/2. Ergo we have found that for any Z (computed from Λ), Pr[Z = X | A = 0, B = 1] ≤ 1/2. This means we cannot guess X with a success probability better than 1/2. If one took X itself as the guess, the prediction would be defined via the measurement outcomes, which contradicts the free choice of Λ.
In particular note that PXΛ|A = PX × PΛ|A . This can be proven by contradiction. The
main idea is that if we assume that Λ depends on X, then there exists a function f such
that Z = f (Λ) depends on X. This proves that Λ is useless for predicting the outcome X
for the particular measurement that we considered.
Note that with an additional proof step (cf. [4]) the statement can be extended to
arbitrary measurements on arbitrary states.

5.5 Concluding remarks


An interesting question in quantum mechanics is whether the mathematical objects used to characterize states (e.g., some wave function Ψ) and the "elements of reality" are in a one-to-one relation. The theorem described in this section implies that a wave function uniquely describes reality. However, it remained an open question until 2011 whether the opposite direction is also true, i.e. whether reality uniquely determines the wave function. This is indeed the case, as first described in [5] under additional assumptions, and later also in [6] without these additional assumptions. Recall that all this is based on the assumption that quantum mechanics is correct. More information about this can be found in [6].

[Diagram: the mathematical model (wave function Ψ) and reality are in one-to-one correspondence; one direction is the theorem above, the other direction is shown in [5, 6].]

In the following we give a very brief overview of earlier work about the question whether
quantum theory is complete or not.

Einstein, Podolsky and Rosen (1935) In 1935 Einstein, Podolsky and Rosen tried to
come up with an argument for the incompleteness of quantum theory [7]. They considered
an entangled state between two systems A and B, (1/√2)(|00⟩_AB + |11⟩_AB).² Furthermore they
considered to have a measurement device that measures the state with respect to a certain
basis {|αi, |α⊥ i} (as introduced earlier in this chapter) and outputs the measurement
outcome X. Assume that we perform such a measurement on the system A. Let ϕα,X
denote the post measurement state in B. We obtain that if X = 0, we have ϕα,0 = |αi
and if X = 1, we get ϕα,1 = |α⊥ i. If we now (after having measured on A) measure on B
in a basis rotated by α, with an outcome Y , we can predict the outcome perfectly since
we have Y = X. Their argumentation now consists of three steps.
1. Since Y can be predicted with certainty, it is considered as an element of reality.

2. A theory can only be considered to be complete if each element of reality is repre-


sented by a parameter.
3. In quantum theory, when preparing the entangled state, there is no such parameter.
Hence they concluded that quantum theory is incomplete. From today's viewpoint, it can be said that their notion of "element of reality" may be ambiguous, because it does not specify when the value should be predictable (before or after the measurement at point A).

Kochen and Specker (1967) Kochen and Specker [8] considered a natural property
called non-contextuality.

² In their original work they considered a different state. However it is convenient to use the Bell state for their argumentation.

[Figure: two measurement bases a and a′, related by a rotation in the xy-plane; the basis vector labelled 3 (the z-direction) is common to both.]

Let Λ be an additional parameter in an extended theory. The theory is said to be non-


contextual if the probability of an outcome depends only on the measured vector of that
outcome, e.g. P_{X|A=a}(3) = P_{X|A=a′}(3). They only considered deterministic theories, i.e.
PX|AΛ (x) ∈ {0, 1}.
Informally, their theorem states that there cannot exist a theory with the following
properties:
1. non-contextual

2. deterministic
3. free randomness
4. compatible with quantum theory.
Note that the theorem we discussed in this chapter tells us that Properties 3 and 4 imply
that Λ is independent of X and therefore cannot determine X.

Bell (1964) In 1964, Bell published a theorem [9] which tells us that there cannot exist
a theory with the following properties:
1. non-signalling
2. deterministic
3. free randomness

4. compatible with quantum theory.


Note that the theorem we have seen in this chapter is a strict generalization of this state-
ment (in particular, non-signalling is no longer an assumption). His proof idea was the
following. He showed that there exists an inequality (called Bell’s inequality) involving
the outcomes of two separate measurements Xα , Yβ such that I2 ≥ 1 for any theory that
satisfies the non-signalling assumption and is deterministic. (Note that I2 is as defined in
(5.12).) A simple calculation then shows that for quantum theory I2 < 1. Therefore he
could conclude that the assumptions of non-signalling, determinism and free randomness together contradict quantum theory.

Aspect (see also Zeilinger & Gisin) Since the 1980s, experimentalists have been trying to provide experimental evidence for the theoretical theorems we have seen [10]. Note that all theorems we have seen in this chapter assume compatibility with quantum theory. This assumption can be replaced by actual experimental data. It has been shown that I2 < 1 holds experimentally, without assuming the correctness of quantum theory. It then follows
from Bell’s argument that these experimental data cannot be explained by any theory
(not necessarily compatible with quantum theory) that is non-signalling, deterministic
and compatible with free randomness.

6 Basic Protocols
6.1 Teleportation
Bennett, Brassard, Crépeau, Jozsa, Peres, Wootters, 1993.
“An unknown quantum state |φi can be disassembled into, then later recon-
structed from, purely classical information and purely nonclassical Einstein-
Podolsky-Rosen (EPR) correlations. To do so the sender, Alice, and the re-
ceiver, Bob, must prearrange the sharing of an EPR-correlated pair of particles.
Alice makes a joint measurement on her EPR particle and the unknown quan-
tum system, and sends Bob the classical result of this measurement. Knowing
this, Bob can convert the state of his EPR particle into an exact replica of the
unknown state |φi which Alice destroyed.”
With EPR correlations, Bennett et al. mean our familiar ebit (1/√2)|00 + 11⟩. In more precise terms, we are interested in performing the following task:

Task: Alice wants to communicate the unknown state ρ of one qubit in system S to Bob.
They share one Bell state. She can also send him two classical bits.

The protocol that achieves this makes integral use of the Bell measurement. This is a measurement of two qubits and consists of projectors onto the four Bell states

|ψ^{00}⟩ = (1/√2)(|00⟩ + |11⟩)
|ψ^{01}⟩ = (1/√2)(|00⟩ − |11⟩)
|ψ^{10}⟩ = (1/√2)(|01⟩ + |10⟩)
|ψ^{11}⟩ = (1/√2)(|01⟩ − |10⟩).

More compactly, we can write

|ψ^{ij}⟩ = (id ⊗ σ^{ij}) |ψ^{00}⟩,

where σ^{ij} = σ_x^i σ_z^j. For simplicity of the exposition, let ρ = |φ⟩⟨φ| be a pure state, |φ⟩ = α|0⟩ + β|1⟩ (the more general case of mixed ρ then follows by linearity of the protocol). The global state before the protocol is therefore given by |φ⟩_S ⊗ |ψ^{00}⟩_{AB}. The protocol is as follows:

Protocol
1. Alice measures S and A (her half of the entangled state) in the Bell basis.

   Alice's outcome       Global projector                       Resulting global state
   00 : |ψ^{00}⟩_{SA}    |ψ^{00}⟩⟨ψ^{00}|_{SA} ⊗ id_B           |ψ^{00}⟩_{SA} ⊗ (α|0⟩ + β|1⟩)_B
   01 : |ψ^{01}⟩_{SA}    |ψ^{01}⟩⟨ψ^{01}|_{SA} ⊗ id_B           |ψ^{01}⟩_{SA} ⊗ (α|0⟩ − β|1⟩)_B
   10 : |ψ^{10}⟩_{SA}    |ψ^{10}⟩⟨ψ^{10}|_{SA} ⊗ id_B           |ψ^{10}⟩_{SA} ⊗ (β|0⟩ + α|1⟩)_B
   11 : |ψ^{11}⟩_{SA}    |ψ^{11}⟩⟨ψ^{11}|_{SA} ⊗ id_B           |ψ^{11}⟩_{SA} ⊗ (β|0⟩ − α|1⟩)_B

2. Alice sends the classical bits that describe her outcome, i, j, to Bob.
3. Bob applies σ ij on his qubit.
The resulting state is |φi as one easily verifies.

[Figure: teleportation circuit. Alice performs a Bell measurement on ρ and her half of |ψ^{00}⟩, sends the outcome (i, j) to Bob, and Bob applies σ^{ij} to his half, recovering ρ.]

Note that each outcome is equally probable and that entanglement between ρ and the
rest of the universe is preserved.
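As a quick numerical sanity check (an illustration added to these notes, not part of the original text; it assumes NumPy and the helper functions are our own), the following sketch reproduces the table above: every outcome occurs with probability 1/4 and Bob ends up with |φ⟩⟨φ| after the correction σ^{ij}.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron(*ops):
    out = np.array([1.0], dtype=complex)
    for op in ops:
        out = np.kron(out, op)
    return out

# |psi^{ij}> = (id ⊗ sigma^{ij}) |psi^{00}>  with  sigma^{ij} = sigma_x^i sigma_z^j
psi00 = (kron(np.array([1, 0]), np.array([1, 0])) + kron(np.array([0, 1]), np.array([0, 1]))) / np.sqrt(2)
sigma = lambda i, j: np.linalg.matrix_power(X, i) @ np.linalg.matrix_power(Z, j)
bell = lambda i, j: kron(I2, sigma(i, j)) @ psi00

# Random (unknown) pure state |phi> = alpha|0> + beta|1> on system S
rng = np.random.default_rng(1)
phi = rng.normal(size=2) + 1j * rng.normal(size=2)
phi = phi / np.linalg.norm(phi)

state = kron(phi, psi00)                       # global state on S ⊗ A ⊗ B

for i in (0, 1):
    for j in (0, 1):
        proj = np.kron(np.outer(bell(i, j), bell(i, j).conj()), I2)   # on SA, identity on B
        post = proj @ state
        prob = np.vdot(post, post).real        # each outcome occurs with probability 1/4
        post = post / np.sqrt(prob)
        post = kron(np.eye(4), sigma(i, j)) @ post   # Bob's correction sigma^{ij}
        M = post.reshape(4, 2)                 # rows: SA, columns: B
        rho_B = M.T @ M.conj()                 # partial trace over SA
        ok = np.allclose(rho_B, np.outer(phi, phi.conj()))
        print(f"outcome {i}{j}: prob = {prob:.2f}, Bob holds |phi><phi|: {ok}")
```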
Diagrammatically, we can summarise the teleportation as the following conversion of
resources:
2 × →  +  1 × ∿  ≥  1 × ⇝
where the straight arrow represents the sending of a classical bit, the wiggly line an
ebit and the wiggly arrow the sending of a qubit. The inequality sign means that there
exists a protocol that can transform the resources of one ebit and two bits of classical
communication into the resource of sending one qubit.

6.2 Superdense coding
Superdense coding answers the question of how many classical bits we can send with one
use of a quantum channel if we are allowed to use preshared ebits.

Task Alice wants to send two classical bits, i and j, to Bob. They share one Bell state.
She can also send him one qubit.

Protocol
1. Alice applies a local unitary operation, σ^{ij}, on her half of the entangled state.

   i, j    Global operation     Resulting state (applied to (|00⟩ + |11⟩)/√2)
   00      id_A ⊗ id_B          (|00⟩ + |11⟩)/√2 = |ψ^{00}⟩
   01      σ_A^x ⊗ id_B         (|01⟩ + |10⟩)/√2 = |ψ^{10}⟩
   10      σ_A^y ⊗ id_B         (|01⟩ − |10⟩)/√2 = |ψ^{11}⟩
   11      σ_A^z ⊗ id_B         (|00⟩ − |11⟩)/√2 = |ψ^{01}⟩

Recall, that the states |ψ ij i form a basis for two qubits: the Bell basis.
2. Alice sends her qubit to Bob.
3. Bob measures the two qubits in the Bell basis. Outcome of his measurement: i, j.
[Figure: superdense coding circuit. Alice encodes (i, j) by applying σ^{ij} to her half of |ψ^{00}⟩ and sends her qubit to Bob; Bob performs a Bell measurement on both qubits and reads off the outcome.]
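A short numerical sketch of the protocol (an illustration assuming NumPy, not part of the original notes): it applies each of Alice's encoding operations from the table and checks that Bob's Bell measurement identifies the encoded pair with certainty. The map from (i, j) to the measured Bell label is the fixed bijection shown in the table, so Bob can decode.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

psi00 = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)   # (|00> + |11>)/sqrt(2)

# Bell basis |psi^{kl}> = (id ⊗ X^k Z^l)|psi^{00}>, used for Bob's measurement
bell = {(k, l): np.kron(I2, np.linalg.matrix_power(X, k) @ np.linalg.matrix_power(Z, l)) @ psi00
        for k in (0, 1) for l in (0, 1)}

# Alice's encoding operations, exactly as in the table above
encode = {(0, 0): I2, (0, 1): X, (1, 0): Y, (1, 1): Z}

for bits, U in encode.items():
    state = np.kron(U, I2) @ psi00             # Alice acts locally, then sends her qubit
    probs = {kl: abs(np.vdot(b, state)) ** 2 for kl, b in bell.items()}
    outcome = max(probs, key=probs.get)        # Bob's Bell measurement is deterministic here
    print(bits, "->", outcome, " (prob %.2f)" % probs[outcome])
```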

We can summarise the task of superdense coding in the following diagram:


1 × ⇝  +  1 × ∿  ≥  2 × →
In order to show that this inequality is tight, i.e. that we cannot send more than two
classical bits with one ebit and one use of a qubit channel, we will need some more
technology - in particular the concept of quantum entropy.

6.3 Entanglement conversion
With teleportation and superdense coding we have seen two tasks that can be solved nicely
when we have access to ebits. In a realistic scenario, unfortunately, it is difficult to obtain
or generate ebits exactly. It is therefore important to understand when and how we can
distill ebits from other quantum correlations or more generally, how to convert one type
of quantum correlation into another one. In this section, we will consider the simplest
instance of this problem, namely the conversion of one bipartite pure state into another
one. Before we state the main result, we need to do some preparatory work and introduce
the concept of majorisation.

6.3.1 Majorisation
Given two d-dimensional real vectors x and y with entries in non-increasing order (i.e. x_i ≥ x_{i+1} and y_i ≥ y_{i+1}) which satisfy Σ_i x_i = Σ_i y_i, we say that y majorises x, and write x ≺ y, if

Σ_{i=1}^{k} x_i ≤ Σ_{i=1}^{k} y_i

for all k ∈ {1, . . . , d}.
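The following small Python helper (added as an illustration; the function names are our own) checks the defining inequalities and numerically illustrates the converse direction of the lemma below, which also holds: a mixture of permutations of y is always majorised by y.

```python
import numpy as np
from itertools import permutations

def majorises(y, x):
    """Return True if x ≺ y, i.e. y majorises x (inputs need not be sorted)."""
    xs = np.sort(x)[::-1]
    ys = np.sort(y)[::-1]
    if not np.isclose(xs.sum(), ys.sum()):
        return False
    return bool(np.all(np.cumsum(xs) <= np.cumsum(ys) + 1e-12))

# If x is a mixture of permutations of y, then x ≺ y.
rng = np.random.default_rng(0)
y = np.sort(rng.random(4))[::-1]
perms = [np.eye(4)[list(p)] for p in permutations(range(4))]
p = rng.random(len(perms)); p /= p.sum()
x = sum(pi * P @ y for pi, P in zip(p, perms))
print(majorises(y, x))   # True
```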


Lemma 6.3.1. If y majorises x, then there exists a set of permutation matrices with associated probabilities {π_i, p_i} such that

x = Σ_i p_i π_i y.

Proof. We prove the lemma inductively. Clearly the case d = 1 is true and we will therefore focus on the inductive step d − 1 ↦ d.
y ≻ x implies that x_1 ≤ y_1, which in turn implies that there exists j such that y_j ≤ x_1 ≤ y_{j−1} ≤ y_1. Consequently, there is a t ∈ [0, 1] such that x_1 = t y_1 + (1 − t) y_j. Let T be the transposition that interchanges places 1 and j and let P = t·id + (1 − t)T. Then P y = (x_1, y_2, . . . , y_{j−1}, (1 − t)y_1 + t y_j, y_{j+1}, . . .). It remains to show that ỹ ≻ x̃, where ỹ is P y without its first entry and x̃ is just x without x_1, since then the result follows by applying the inductive hypothesis to x̃ and ỹ. This is shown as follows. For k < j:

Σ_{i=1}^{k−1} x̃_i = Σ_{i=2}^{k} x_i ≤ Σ_{i=2}^{k} x_1 ≤ Σ_{i=2}^{k} y_{j−1} ≤ Σ_{i=2}^{k} y_i = Σ_{i=1}^{k−1} ỹ_i.

For k ≥ j:

Σ_{i=1}^{k−1} x̃_i = Σ_{i=2}^{k} x_i
              ≤ ( Σ_{i=2}^{k} y_i ) + (y_1 − x_1)
              = ( Σ_{i=2, i≠j}^{k} y_i ) + (y_1 − (1 − t)y_j − t y_1 + y_j)
              = ( Σ_{i=2, i≠j}^{k} y_i ) + ((1 − t)y_1 + t y_j)
              = Σ_{i=1}^{k−1} ỹ_i.

Lemma 6.3.2. Let A and B and C = A + B be Hermitian operators with eigenvalues a, b and c ordered non-increasingly. Then c ≺ a + b.

Proof.

Σ_{i=1}^{k} c_i = max_{V : |V|=k} tr P_V (A + B)
             ≤ max_{V : |V|=k} tr P_V A + max_{W : |W|=k} tr P_W B
             = Σ_{i=1}^{k} a_i + Σ_{i=1}^{k} b_i,

where we used Ky Fan’s principle which characterises the largest (and also the largest k)
eigenvalues in a variational way.
Corollary 6.3.3. Let r and s be the eigenvalues (incl. multiplicities) of density matrices ρ and σ in non-increasing order. Then r ≺ s iff there exists a finite set of unitaries and associated probabilities {U_i, p_i} such that

ρ = Σ_i p_i U_i σ U_i^{−1}.

Proof. If s ≻ r, then according to Lemma 6.3.1 there exists a set of permutation matrices π_i (which are in particular unitary) and probabilities p_i such that r = Σ_i p_i π_i s. Inserting U ρ U^{−1} = diag(r) and V σ V^{−1} = diag(s), for unitaries U and V arising from the spectral decomposition, we find

U ρ U^{−1} = Σ_i p_i π_i V σ V^{−1} π_i^{−1},

which is equivalent to the claim for U_i := U^{−1} π_i V.
Conversely, Lemma 6.3.2 applied to ρ = Σ_i p_i U_i σ U_i^{−1} implies

s = EV(σ) = Σ_i p_i EV(U_i σ U_i^{−1}) ≻ EV(ρ) = r,

where EV(σ) denotes the non-increasingly ordered vector containing the eigenvalues of σ.
We now want to argue that any measurement on Bob’s side of the state |ψi can be
replaced by a measurement on Alice’s side and a unitary on Bob’s side dependent on
Alice’s measurement outcome. Note that this is only possible since we know the state on
which the measurement will be applied – without this knowledge this is impossible. In
order to see how it works, we write |ψi in its Schmidt decomposition
|ψ⟩ = Σ_i ψ_i |i⟩_A |i⟩_B

and express the Kraus operators of Bob's measurement B_k (i.e. Σ_k B_k^† B_k = id) in his Schmidt basis,

B_k = Σ_{ij} b_{k,ji} |j⟩⟨i|_B.

We now define measurement operators for Alice with respect to her Schmidt basis,

A′_k = Σ_{ij} b_{k,ji} |j⟩⟨i|_A,

and note that

(id ⊗ B_k)|ψ⟩ = F (A′_k ⊗ id)|ψ⟩,

where F is the operator exchanging the two systems.¹ This shows in particular that the Schmidt coefficients of (id ⊗ B_k)|ψ⟩ and (A′_k ⊗ id)|ψ⟩ are identical. Therefore, there exist unitaries U_k and V_k such that

(id ⊗ B_k)|ψ⟩ = (U_k ⊗ V_k)(A′_k ⊗ id)|ψ⟩,

which means that we can simulate the measurement on Bob's side on |ψ⟩ by a measurement on Alice's side (with Kraus operators A_k = U_k A′_k) followed by a unitary V_k on Bob's side.
This way we can reduce an arbitrary LOCC² protocol between Alice and Bob (applied to |ψ⟩) to a measurement on Alice's side followed by a unitary on Bob's side conditioned on Alice's measurement outcome.
This preparation allows us to prove the following result due to Nielsen.
Theorem 6.3.4. |ψ⟩ can be transformed into |φ⟩ by LOCC iff r ≺ s, where r and s are the local eigenvalues of |ψ⟩ and |φ⟩, respectively.
¹ F = Σ_{ij} |j⟩⟨i| ⊗ |i⟩⟨j|
² Local operations and classical communication.
Proof. Define ρAB = |ψihψ|AB and σAB = |φihφ|AB with reduced states ρA and σA . By
the above it suffices to consider protocols where Alice performs a measurement with Kraus
operators Ak followed by a unitary Vk on Bob’s side. Since the protocol must transform
Alice’s local state for each measurement outcome into the local part of the final state, we
have

Ak ρA A†k = pk σA (6.1)

for all k, where pk is the probability to obtain outcome k. Let


A_k √ρ_A = |A_k √ρ_A| U_k = √(A_k ρ_A A_k^†) U_k

be the polar decomposition of the LHS. Multiplying this equation with its hermitian conjugate and using (6.1) we find

√ρ_A A_k^† A_k √ρ_A = p_k U_k^† σ_A U_k.

Summing over k yields

ρ_A = Σ_k p_k U_k^† σ_A U_k   (6.2)

which by Corollary 6.3.3 implies that r ≺ s.


In order to see the opposite direction, note that r ≺ s implies that there exist probabilities p_k and unitaries U_k such that (6.2) holds. We then define

A_k := √p_k √σ_A U_k (√ρ_A)^{−1},

where we assume for simplicity that ρ_A is invertible (the other case can be considered as a limiting case). It is easy to verify that Σ_k A_k^† A_k = id. Clearly

A_k ρ_A A_k^† = p_k σ_A

and therefore there exist unitaries Vk on Bob’s side such that the final state is |φi.
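As an illustration of the criterion (added here, not part of the original notes; it assumes NumPy), the local eigenvalues can be obtained from the Schmidt coefficients via a singular value decomposition, and the LOCC condition reduces to a cumulative-sum comparison.

```python
import numpy as np

def local_eigenvalues(state, dA, dB):
    """Non-increasing eigenvalues of the reduced state of a bipartite pure state."""
    M = np.asarray(state, dtype=complex).reshape(dA, dB)
    return np.sort(np.linalg.svd(M, compute_uv=False) ** 2)[::-1]

def locc_convertible(psi, phi, dA, dB):
    """Nielsen's criterion: |psi> can be transformed into |phi> by LOCC iff
    the local eigenvalues r of |psi> are majorised by those s of |phi>."""
    r = local_eigenvalues(psi, dA, dB)
    s = local_eigenvalues(phi, dA, dB)
    return bool(np.all(np.cumsum(r) <= np.cumsum(s) + 1e-12))

ebit = np.array([1, 0, 0, 1]) / np.sqrt(2)                # maximally entangled
weak = np.array([np.sqrt(0.9), 0, 0, np.sqrt(0.1)])       # less entangled
print(locc_convertible(ebit, weak, 2, 2))   # True: an ebit can be degraded by LOCC
print(locc_convertible(weak, ebit, 2, 2))   # False: entanglement cannot be increased by LOCC
```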

7 Entropy of Quantum States
In Chapter 3 we have discussed the definitions and properties of classical entropy mea-
sures and we have learned about their usefulness in the discussion of the channel coding
theorem. After the introduction of the quantum mechanical basics in Chapter 4 and after
Chapter 5 about the completeness of quantum theory, we are ready to introduce the notion
of entropy in the quantum mechanical context. Textbooks usually start the discussion of
quantum mechanical entropy with the definition of the so called von Neumann entropy
and justify the explicit expression as being the most natural analog of the classical Shan-
non entropy for quantum systems. But this explanation is not completely satisfactory. Hence a lot of effort is being made to replace the von Neumann entropy by the smooth min- and max-entropies, which can be justified by their profound operational interpretation (recall for example the discussion of the channel coding theorem, where we worked with the min-entropy and where the Shannon entropy only appeared as a special case).

One can prove that the smooth min-entropy of a product state ρ⊗n converges for large
n to n-times the von Neumann entropy of the state ρ. The quantum mechanical min-
entropy thus generalizes the von Neumann entropy in some sense. But since this work is
still in progress we forgo this modern point of view and begin with the definition of the von
Neumann entropy and only indicate at the end of the chapter these new developments.

7.1 Motivation and definitions


Let HZ be a Hilbert space of dimension n which is spanned by the linearly independent
family {|zi}z and consider an arbitrary state ρ on HZ which is classical with respect to
{|z⟩}_z. Hence,

ρ = Σ_z P_Z(z) |z⟩⟨z|,

where PZ (z) is the probability distribution for measuring |zi in a measurement of ρ in the
basis {|zi}z . Our central demand on the definition of the entropy measures of quantum
states is that they generalize the classical entropies. More precisely, we demand that the
evaluation of the quantum entropy on ρ yields the corresponding classical entropy of the
distribution PZ (z). The following definitions meet these requirements as we will see below.

Definition 7.1.1. Let ρ be an arbitrary state on a Hilbert space H_A. Then the von Neumann entropy H is the quantum mechanical generalization of the Shannon entropy. It is defined by

H(A)_ρ := −tr(ρ log ρ).

The quantum mechanical min-entropy H_min generalizes the classical min-entropy. It is defined by

H_min(A)_ρ := −log₂ ‖ρ‖_∞.

The quantum mechanical max-entropy H_max generalizes the classical max-entropy. It is defined by

H_max(A)_ρ := log₂ |supp(ρ)|,

where supp(ρ) denotes the support of the operator ρ.
Now, we check whether our requirement from above really is fulfilled. To that purpose we consider again the state

ρ_Z = Σ_z P_Z(z) |z⟩⟨z|.

Since the map ρ ↦ ρ log ρ is defined through the eigenvalues of ρ,

H(Z)_ρ = −tr(ρ log ρ) = −Σ_z P_Z(z) log₂ P_Z(z),

which reproduces the Shannon entropy as demanded. Recall that ‖ρ‖_∞ is the operator norm, which equals the greatest eigenvalue of the operator ρ. Thus, the quantum mechanical min-entropy reproduces the classical min-entropy:

H_min(Z)_ρ = −log₂ ‖ρ‖_∞ = −log max_{z∈Z} P_Z(z).

To show that the classical max-entropy emerges as a special case from the quantum mechanical max-entropy we make the simple observation

H_max(Z)_ρ = log₂ |supp ρ| = log₂ |supp P_Z|.
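For concreteness, here is a small Python sketch (an addition for illustration, assuming NumPy) of the three entropies; applied to a state that is diagonal in the basis {|z⟩} it reproduces the classical quantities, as discussed above.

```python
import numpy as np

def von_neumann_entropy(rho):
    """H(rho) = -tr(rho log2 rho), computed from the eigenvalues."""
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]                 # convention: 0 log 0 = 0
    return float(-np.sum(eig * np.log2(eig)))

def min_entropy(rho):
    """H_min(rho) = -log2 ||rho||_inf = -log2 (largest eigenvalue)."""
    return float(-np.log2(np.max(np.linalg.eigvalsh(rho))))

def max_entropy(rho):
    """H_max(rho) = log2 |supp(rho)| = log2 rank(rho)."""
    return float(np.log2(np.sum(np.linalg.eigvalsh(rho) > 1e-12)))

# A state classical with respect to {|z>}: the quantum entropies reduce to the
# classical ones of the distribution P_Z = (1/2, 1/4, 1/4, 0).
P = np.array([0.5, 0.25, 0.25, 0.0])
rho_Z = np.diag(P)
print(von_neumann_entropy(rho_Z))   # 1.5  = Shannon entropy of P
print(min_entropy(rho_Z))           # 1.0  = -log2 max_z P(z)
print(max_entropy(rho_Z))           # log2(3), the size of the support of P
```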
Notation. Let ρAB be a density operator on the Hilbert space HA ⊗ HB and let ρA
and ρB be defined as the partial traces
ρA := trB ρAB , ρB := trA ρAB .
Then the entropies of the states ρAB ∈ S(HA ⊗ HB ), ρA ∈ S(HA ) and ρB ∈ S(HB ) are
denoted by
H(AB)ρ := H(AB)ρAB , H(A)ρ := H(A)ρA , H(B)ρ := H(B)ρB .

7.2 Properties of the von Neumann entropy


In the present section we state and prove some basic properties of the von Neumann
entropy.
Lemma 7.2.1. Let ρ be an arbitrary state on S(HA ). Then,
H(A)ρ ≥ 0,
with equality iff ρ is pure.

Proof. Let {|j⟩}_j be a complete orthonormal system which diagonalizes ρ, i.e.,

ρ = Σ_j p_j |j⟩⟨j|,

with Σ_j p_j = 1. Therefore,

H(A)_ρ = −Σ_j p_j log p_j.   (7.1)

The function −x log x is positive on [0, 1]. Consequently, the RHS above is positive which
shows that the entropy is non-negative. It is left to show that H(A)ρ = 0 iff ρ is pure.

Assume H(A)_ρ = 0. Since the function −x log x is non-negative on [0, 1], each term in the summation in (7.1) has to vanish separately. Thus, either p_k = 0 or p_k = 1 for all k. Because of the constraint Σ_j p_j = 1, exactly one coefficient p_m is equal to one whereas all the others vanish. We conclude that ρ describes the pure state |m⟩.

Assume ρ is the pure state |φi. Hence,

ρ = |φihφ|

which yields H(A)ρ = 0.


Lemma 7.2.2. The von Neumann entropy is invariant under similarity transformations,
i.e.,
H(A)ρ = H(A)U ρU −1
for U ∈ GL(HA ).
Proof. Let f : R → R be a function and let M be an operator on a Hilbert space H.
Recall that
f (M ) := V −1 f (V M V −1 )V,
where V ∈ GL(H) diagonalizes M . Now we show that

f (U M U −1 ) = U f (M )U −1

for U ∈ GL(H) arbitrary. Let D denote the diagonal matrix similar to M . The operator
V U −1 diagonalizes U M U −1 . According to the definition above,

f (U M U −1 ) = U V −1 f (V U −1 U M U −1 U V −1 )V U −1 = U V −1 f (V M V −1 )V U −1 .

On the other hand


U f (M )U −1 = U V −1 f (V M V −1 )V U −1 .
This proves the assertion from above. Since the trace is unaffected by similarity transformations, we conclude the proof by setting M = ρ and f(x) = −x log(x).

Lemma 7.2.3. Let H_A and H_B be Hilbert spaces, let |ψ⟩ be a pure state on H_A ⊗ H_B and let ρ_AB := |ψ⟩⟨ψ|. Then,

H(A)_ρ = H(B)_ρ.

Proof. According to the Schmidt decomposition there exist orthonormal families {|i_A⟩} and {|i_B⟩} in H_A and H_B, respectively, and positive real numbers {λ_i} with the property Σ_i λ_i² = 1 such that

|ψ⟩ = Σ_i λ_i |i_A⟩ ⊗ |i_B⟩.

Hence, tr_B(ρ_AB) and tr_A(ρ_AB) have the same eigenvalues and thus H(A)_{ρ_AB} = H(B)_{ρ_AB}.

Lemma 7.2.4. Let ρA and ρB be arbitrary states. Then,

H(AB)ρA ⊗ρB = H(A)ρA + H(B)ρB .

Proof. Let {p_i^A}_i ({p_j^B}_j) and {|i_A⟩}_i ({|j_B⟩}_j) be the eigenvalues and eigenvectors of the operator ρ_A (ρ_B). Hence,

ρ_A ⊗ ρ_B = Σ_{ij} p_i^A p_j^B |i_A⟩⟨i_A| ⊗ |j_B⟩⟨j_B|.

We deduce

H(AB)_{ρ_A ⊗ ρ_B} = −Σ_{ij} p_i^A p_j^B log(p_i^A p_j^B)
                  = H(A)_{ρ_A} + H(B)_{ρ_B}.

Lemma 7.2.5. Let ρ be a state on a Hilbert space H_A of the form

ρ = p_1 ρ_1 + ... + p_n ρ_n

with density operators {ρ_i}_i having support on pairwise orthogonal subspaces of H_A and with Σ_j p_j = 1. Then,

H(A)_ρ = H_class({p_i}_i) + Σ_j p_j H(A)_{ρ_j},

where H_class({p_i}_i) denotes the Shannon entropy of the probability distribution {p_i}_i.
Proof. Let {λ_j^{(i)}} and {|j^{(i)}⟩} be the eigenvalues and eigenvectors of the density operators {ρ_i}. Thus,

ρ = Σ_{i,j} p_i λ_j^{(i)} |j^{(i)}⟩⟨j^{(i)}|

and consequently,

H(A)_ρ = −Σ_{i,j} p_i λ_j^{(i)} log(p_i λ_j^{(i)})
       = −Σ_i ( Σ_j λ_j^{(i)} ) p_i log(p_i) − Σ_i p_i Σ_j λ_j^{(i)} log(λ_j^{(i)})
       = H_class({p_i}) + Σ_i p_i H(A)_{ρ_i}.

A consequence of this lemma is that the entropy is concave. More precisely, let ρ_1, ..., ρ_n be density operators on the same Hilbert space H_A and consider a mixture of those density operators according to a probability distribution {p_j}_j on {1, ..., n}, ρ = Σ_j p_j ρ_j. Then

H(A)_ρ ≥ Σ_j p_j H(A)_{ρ_j}.

Proof. Let H_Z be an auxiliary Hilbert space of dimension n which is spanned by the linearly independent family {|i⟩}_i and let ρ̃ be the state

ρ̃ := Σ_j p_j |j⟩⟨j| ⊗ ρ_A^{(j)}

on H_Z ⊗ H_A, which is classical on H_Z with respect to {|i⟩}_i. According to the strong subadditivity property,

H(Z|A)_ρ̃ ≤ H(Z)_ρ̃,

or equivalently,

H(ZA)_ρ̃ ≤ H(Z)_ρ̃ + H(A)_ρ̃.

Using Lemma 7.2.6, we get

H(ZA)_ρ̃ = H({p_j}_j) + Σ_j p_j H(ρ_A^{(j)}),
H(Z)_ρ̃ = H({p_j}_j),
H(A)_ρ̃ = H(p_1 ρ_A^{(1)} + ... + p_n ρ_A^{(n)}),

and thus

p_1 H(ρ_A^{(1)}) + ... + p_n H(ρ_A^{(n)}) ≤ H(p_1 ρ_A^{(1)} + ... + p_n ρ_A^{(n)}).

Lemma 7.2.6. Let HA and HZ be Hilbert spaces and let ρAZ be a state on HA ⊗ HZ
which is classical on HZ with respect to the basis {|zi}z of HZ , i.e., ρAZ is of the form
ρ_AZ = Σ_z P_Z(z) ρ_A^{(z)} ⊗ |z⟩⟨z|.

Then

H(AZ)_ρ = H_class({P_Z(z)}_z) + Σ_z P_Z(z) H(A)_{ρ_A^{(z)}}.

Proof. Define ρ̃_z := ρ_A^{(z)} ⊗ |z⟩⟨z|, apply Lemma 7.2.5 with ρ_i replaced by ρ̃_z, use Lemma 7.2.4 and apply Lemma 7.2.1.

7.3 The conditional entropy and its properties


We have encountered the identity

Hclass (X|Y ) = Hclass (XY ) − Hclass (Y )

for classical entropies in the chapter about classical information theory. We use exactly
this identity to define conditional entropy in the context of quantum information theory.
Definition 7.3.1. Let HA and HB be two Hilbert spaces and let ρAB be a state on
HA ⊗ HB . Then, the conditional entropy H(A|B)ρ is defined by

H(A|B)ρAB := H(AB)ρAB − H(B)ρAB .

Recasting this defining equation leads immediately to the so called chain rule:

H(AB)ρAB = H(A|B)ρAB + H(B)ρAB .

Lemma 7.3.2. Let ρ_AB be a pure state on a Hilbert space H_A ⊗ H_B. Then H(A|B)_{ρ_AB} < 0 iff ρ_AB is entangled, i.e. H(AB)_{ρ_AB} ≠ H(A)_{ρ_AB} + H(B)_{ρ_AB}.
Proof. Observe that
H(A|B)ρAB = H(AB)ρAB − H(B)ρAB .
Recall from Lemma 7.2.1 that the entropy of a state is zero iff it is pure. The state
trA (ρAB ) is pure iff ρAB is not entangled. Thus, indeed H(A|B)ρAB is negative iff ρAB is
entangled.
Hence, the conditional entropy can be negative.
Lemma 7.3.3. Let HA , HB and HC be Hilbert spaces and let ρABC be a pure state on
HA ⊗ HB ⊗ HC . Then,
H(A|B)ρABC = −H(A|C)ρABC .
Proof. We have seen in Lemma 7.2.3 that ρABC pure implies that

H(AB)ρ = H(C)ρ , H(AC)ρ = H(B)ρ , H(BC)ρ = H(A)ρ .

Thus,
H(A|B)ρ = H(AB)ρ − H(B)ρ = H(C)ρ − H(AC)ρ = −H(A|C)ρ.

Lemma 7.3.4. Let HA and HZ be Hilbert spaces, let {|zi}z be a complete orthonormal
basis in HZ and let ρAZ be classical on HZ with respect to the basis {|zi}z , i.e.,
ρ_AZ = Σ_z P_Z(z) ρ_A^{(z)} ⊗ |z⟩⟨z|.

Then the entropy conditioned on Z is

H(A|Z)_ρ = Σ_z P_Z(z) H(ρ_A^{(z)}).

Moreover,

H(A|Z)_ρ ≥ 0.

Proof. Apply Lemma 7.2.6 to get

H(A|Z)_ρ = H(AZ)_ρ − H(Z)_ρ
         = H_class(P_Z(z)) + Σ_z P_Z(z) H(ρ_A^{(z)}) − H_class(P_Z(z))
         = Σ_z P_Z(z) H(ρ_A^{(z)}).

In Lemma 7.2.1 we have seen that H(ρ) ≥ 0 for all states ρ. Hence, H(A|Z)_ρ ≥ 0.

Now it’s time to state one of the central identities in quantum information theory: the
so called strong subadditivity.
Theorem 7.3.5. Let ρABC be a state on HA ⊗ HB ⊗ HC . Then,

H(A|B)ρABC ≥ H(A|BC)ρABC .

In textbooks you presently find complex proofs of this theorem based on the Araki-Lieb
inequality (see for example [1]) . An alternative shorter proof can be found in [11].
Lemma 7.3.6. Let ρ be an arbitrary state on a d-dimensional Hilbert space H. Then,

H(ρ) ≤ log₂ d,

with equality iff ρ is the completely mixed state, i.e., the state (1/d)·id_H.

Proof. Let ρ be a state on H which maximizes the entropy and let {|j⟩} be the diagonalizing basis, i.e.,

ρ = Σ_j p_j |j⟩⟨j|.

The entropy only depends on the state's eigenvalues; thus, in order to maximize the entropy, we are allowed to consider the entropy H as a function mapping ρ's eigenvalues (p_1, ..., p_d) ∈ [0, 1]^d to R. Consequently, we have to maximize the function H(p_1, ..., p_d) under the constraint p_1 + ... + p_d = 1. This is usually done using Lagrange multipliers. One gets p_j = 1/d for all j = 1, ..., d and therefore

ρ = (1/d)·id_H

(this is the completely mixed state). This description of the state uniquely characterizes the state independently of the choice of the basis the matrix above refers to, since the identity id_H is unaffected by similarity transformations. This proves that ρ is the only state that maximizes the entropy. The immediate observation that

H(ρ) = log₂ d

concludes the proof.


Lemma 7.3.7. Let HA and HB be two Hilbert spaces and let d := dim HA . Then,

|H(A|B)ρ | ≤ log2 (d).

Proof. Use Lemma 7.3.6 to get

H(A|B)ρ ≤ H(A)ρ ≤ log2 (d)

and Lemma 7.3.3 to get

H(A|B)ρAB = H(A|B)ρABC = −H(A|C)ρABC ≥ − log(d),

where ρABC is a purification of ρAB .


Lemma 7.3.8. Let H_X and H_B be Hilbert spaces, let {|x⟩}_x be a complete orthonormal basis in H_X and let ρ_XB be a state on H_X ⊗ H_B which is classical with respect to {|x⟩}_x. Then,

H(X|B)ρ ≥ 0

which means that the entropy of a classical system is non-negative.


Proof. Let HX 0 be a Hilbert space isomorphic to HX and let ρBXX 0 be a state on HB ⊗
HX ⊗ HX 0 defined by
ρ_BXX′ := Σ_x P_X(x) ρ_B^{(x)} ⊗ |x⟩⟨x| ⊗ |x⟩⟨x|.

Hence,
H(X|B)ρBXX 0 = H(BX)ρBXX 0 − H(B)ρBXX 0
and
H(X|BX 0 )ρBXX 0 = H(BXX 0 )ρBXX 0 − H(BX 0 )ρBXX 0 .
According to the strong subadditivity

H(X|B)ρBXX 0 ≥ H(X|BX 0 )ρBXX 0 .

To prove the assertion we have to show that the RHS vanishes or equivalently that
H(BXX 0 )ρBXX 0 is equal to H(BX 0 )ρBXX 0 . Let ρBX 0 denote the state which emerges from
ρBXX 0 after the application of trX (·). Hence, H(BX 0 )ρBXX 0 = H(BX 0 )ρBX 0 . Further,

H(BX′)_{ρ_BX′} = H(BX′)_{ρ_BX′ ⊗ |0⟩⟨0|},

where |0⟩ is a state in the basis {|x⟩}_x of the Hilbert space H_X. Define the map

S : HX ⊗ HX 0 → HX ⊗ HX 0

by

S(|z0i) := |zzi
S(|zzi) := |z0i
S(|xyi) := |xyi, (otherwise).

We observe,
[IB ⊗ S]ρBX 0 ⊗ |0ih0|[IB ⊗ S]−1 = ρBXX 0 .
Obviously, [IB ⊗ S] ∈ GL(HX ⊗ HX 0 ) (the general linear group) and thus does not change
the entropy:

H(BX 0 )ρBXX 0 = H(BX 0 )ρBX 0 ⊗|0ih0| = H(BXX 0 )ρBXX 0 .

Lemma 7.3.9. Let HA , HB and HB 0 be Hilbert spaces, let ρAB be a state on HA ⊗ HB ,


let
E : HB → HB 0
be a TPCPM(HB , HB 0 ) and let

ρAB 0 = [IA ⊗ E](ρAB )

be a state on HA ⊗ HB 0 . Then,

H(A|B)ρAB ≤ H(A|B 0 )ρAB0 .

Proof. Let |0i be a state in an auxiliary Hilbert space HR . Then

H(A|B)ρAB = H(AB)ρAB − H(B)ρAB


= H(ABR)ρAB ⊗|0ih0| − H(BR)ρAB ⊗|0ih0| .

According to the Stinespring dilation the Hilbert space HR can be chosen such that there
exists a unitary U with the property

trR ◦ adU (ξ ⊗ |0ih0|) = E(ξ),

where adU (·) := U (·)U −1 and ξ ∈ S(HB ). Since the entropy is invariant under similarity
transformations we can use this transformation U to get

H(A|B)ρAB = H(AB 0 R)[IA ⊗adU ](ρAB ⊗|0ih0|) − H(B 0 R)[IA ⊗adU ](ρAB ⊗|0ih0|)
= H(A|B 0 R)[IA ⊗adU ](ρAB ⊗|0ih0|)
≤ H(A|B 0 )[IA ⊗trR ◦adU ](ρAB ⊗|0ih0|)
= H(A|B 0 )[IA ⊗E](ρAB )
= H(A|B 0 )ρAB0 ,

where we have used the strong subadditivity and the Stinespring dilation. We get

H(A|B)ρAB ≤ H(A|B 0 )ρAB0 ,

which concludes the proof.

7.4 The mutual information and its properties


Definition 7.4.1. Let ρ_AB be a state on a Hilbert space H_A ⊗ H_B. Then the so-called mutual information I(A : B) is defined by

I(A : B) := H(A)_{ρ_AB} + H(B)_{ρ_AB} − H(AB)_{ρ_AB} = H(A)_{ρ_AB} − H(A|B)_{ρ_AB}.

Let ρ_ABC be a state on a Hilbert space H_A ⊗ H_B ⊗ H_C. Then the so-called conditional mutual information I(A : B|C) is defined by

I(A : B|C) := H(A|C)_{ρ_ABC} − H(A|BC)_{ρ_ABC}.

We observe that the definition of quantum mutual information and the definition of
classical mutual information are formally identical. Next we prove a small number of
properties of the mutual information.
Lemma 7.4.2. Let ρ_ABC be a state on a Hilbert space H_A ⊗ H_B ⊗ H_C. Then,

I(A : B|C) ≥ 0.

This Lemma is a direct corollary of the strong subadditivity property of conditional


entropy.
Lemma 7.4.3. Let H_A, H_B, H_B′ be Hilbert spaces, let ρ_AB be a state on a Hilbert space H_A ⊗ H_B and let
E : HB → HB 0
be a TPCPM. Then,
I(A : B) ≥ I(A : B 0 ).
This is an immediate consequence of Lemma 7.3.9.

Lemma 7.4.4. Let HA , HB , HC be Hilbert spaces and let ρABC be a state on a Hilbert
space HA ⊗ HB ⊗ HC . Then,
I(A : BC) = I(A : B) + I(A : C|B).
To prove this statement we simply have to plug in the definition of mutual information
and conditional mutual information.

Exercise (Bell state). Compute the mutual information I(A : B) of a Bell state ρAB .
You should get H(A) = 1, H(A|B) = −1 and thus I(A : B) = 2.

Exercise (Cat state). Let HA , HB , HC and HD be Hilbert spaces of quantum


mechanical 2-level systems which are spanned by {|0iA , |1iA }, {|0iB , |1iB }, {|0iC , |1iC }
and {|0⟩_D, |1⟩_D}, respectively. Then, the so-called cat state is defined as the pure state
1
|ψi := √ (|0iA |0iB |0iC |0iD + |1iA |1iB |1iC |1iD ).
2
Hence ρABCD = |ψihψ| is the corresponding density matrix. Compute the expressions
I(A : B), I(A : B|C), I(A : B|CD) and I(A : BCD). During your calculations you should
get
H(A)ρ = H(B)ρ = H(C)ρ = H(D)ρ = 1,
H(AB)ρ = H(AC)ρ = 1,
H(ABC)ρ = H(D)ρ = 1,
H(ABCD)ρ = 0.
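The following Python sketch (an illustration added to these notes; the helper functions are our own and assume NumPy) computes entropies of reduced states by partial tracing and reproduces the values quoted in the two exercises.

```python
import numpy as np

def entropy(rho):
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]
    return float(-np.sum(eig * np.log2(eig)))

def partial_trace(rho, dims, keep):
    """Reduced density matrix on the subsystems listed in `keep`."""
    n = len(dims)
    rho = rho.reshape(dims + dims)
    for i in sorted((i for i in range(n) if i not in keep), reverse=True):
        m = rho.ndim // 2
        rho = np.trace(rho, axis1=i, axis2=i + m)
    d = int(np.prod([dims[i] for i in keep]))
    return rho.reshape(d, d)

def H(state, dims, keep):
    rho = state if state.ndim == 2 else np.outer(state, state.conj())
    return entropy(partial_trace(rho, dims, keep))

def MI(state, dims, A, B, C=()):
    """Conditional mutual information I(A:B|C) = H(AC)+H(BC)-H(ABC)-H(C)."""
    A, B, C = list(A), list(B), list(C)
    return (H(state, dims, A + C) + H(state, dims, B + C)
            - H(state, dims, A + B + C) - (H(state, dims, C) if C else 0.0))

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
print(MI(bell, [2, 2], [0], [1]))            # I(A:B) = 2 for a Bell state

cat = np.zeros(16); cat[0] = cat[15] = 1 / np.sqrt(2)
d = [2, 2, 2, 2]
print(MI(cat, d, [0], [1]))                  # I(A:B)    = 1
print(MI(cat, d, [0], [1], [2]))             # I(A:B|C)  = 0
print(MI(cat, d, [0], [1], [2, 3]))          # I(A:B|CD) = 1
print(MI(cat, d, [0], [1, 2, 3]))            # I(A:BCD)  = 2
```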

7.5 Conditional min-entropy


In this section, we will introduce the conditional min-entropy and discuss some of its properties and uses. The definition is a quantum generalisation of the classical conditional min-entropy which we discussed earlier: the maximum value of a conditional probability distribution is replaced by the maximum eigenvalue of a conditional operator; the only change is that we have to maximise over different versions of the conditional operator:

H_min(A|B)_ρ = max_{σ_B} ( −log λ_max( (id_A ⊗ σ_B^{−1/2}) ρ_AB (id_A ⊗ σ_B^{−1/2}) ) ),

where the maximisation is taken over density operators σ_B. Here σ_B^{−1} denotes the pseudo-inverse of σ_B, i.e. the operator

σ_B^{−1} = U diag(λ_1^{−1}, . . . , λ_ℓ^{−1}, 0, . . . , 0) U†,

where σ_B = U diag(λ_1, . . . , λ_ℓ, 0, . . . , 0) U† with λ_1 ≥ · · · ≥ λ_ℓ > 0 is the spectral decomposition of σ_B. There is an alternative way of writing the conditional min-entropy which often comes in handy when doing computations:

H_min(A|B)_ρ = max_{σ_B} ( −log min{λ : λ·id_A ⊗ σ_B ≥ ρ_AB} ).
The following lemma shows that the conditional min-entropy characterises the maximum probability of guessing a value X correctly given access to quantum information in a register B.

Lemma 7.5.1. Let ρ_XB = Σ_x |x⟩⟨x| ⊗ ρ_x. Then

H_min(X|B) = −log p_guess(X|B),

where

p_guess(X|B) = max_{{E_x} POVM} Σ_x tr[ρ_x E_x]

is the maximum probability of guessing X correctly given access to B.
Proof. The proof uses semidefinite programming, an extension of linear programming. For a review see [12]. Defining C = −Σ_x |x⟩⟨x| ⊗ ρ_x, X̃ = Σ_x |x⟩⟨x| ⊗ E_x, A_ij = id ⊗ e_ij, where e_ij denotes a matrix with a one in column i and row j, and b_ij := δ_ij, p_guess takes the classic form of a primal semidefinite programme:

−min { tr C X̃ : X̃ ≥ 0, tr A_ij X̃ = b_ij }.

The dual programme is

max { Σ_ij b_ij y_ij : Σ_ij y_ij A_ij ≤ C }.

Setting y_ij := −σ_ij this SDP reads

max { −tr σ : id ⊗ σ ≥ Σ_x |x⟩⟨x| ⊗ ρ_x }.

Both programmes are strictly feasible, since the points X = id and σ = id are feasible
points, respectively. By semidefinite programming duality, the two programmes therefore
have the same value. This proves the claim.
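As a numerical illustration (not part of the notes), the dual SDP can be handed to an off-the-shelf solver. The snippet below assumes the external cvxpy package and, because the X register is classical, uses the equivalent block-wise form σ ≥ p_x ρ_x. For two states, the Helstrom formula p_guess = ½(1 + ‖p₀ρ₀ − p₁ρ₁‖₁) provides an independent check.

```python
import numpy as np
import cvxpy as cp    # external SDP package; an assumption, not used in the notes

def p_guess(rhos, probs):
    """Guessing probability via the dual SDP: minimise tr(sigma) subject to
    id ⊗ sigma >= sum_x |x><x| ⊗ p_x rho_x, i.e. sigma >= p_x rho_x for every x."""
    d = rhos[0].shape[0]
    sigma = cp.Variable((d, d), hermitian=True)
    constraints = [sigma >> p * r for p, r in zip(probs, rhos)]
    problem = cp.Problem(cp.Minimize(cp.real(cp.trace(sigma))), constraints)
    problem.solve()
    return problem.value

# Two equiprobable, non-orthogonal pure states
theta = np.pi / 8
v0 = np.array([1.0, 0.0])
v1 = np.array([np.cos(theta), np.sin(theta)])
rhos = [np.outer(v0, v0), np.outer(v1, v1)]
pg = p_guess(rhos, [0.5, 0.5])
helstrom = 0.5 * (1 + np.sum(np.abs(np.linalg.eigvalsh(0.5 * rhos[0] - 0.5 * rhos[1]))))
print(pg, helstrom, -np.log2(pg))   # the SDP and Helstrom agree; the last value is H_min(X|B)
```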
Recall that by definition the conditional von Neumann entropy satisfies H(A|B) = H(AB) − H(B). For the conditional min-entropy such an identity is certainly non-obvious, and indeed false when taken literally. For most purposes, a set of inequalities replaces this important equality (which is often known as a chain rule). To give you the flavour of such inequalities we will prove the most basic one:
Lemma 7.5.2.
Hmin (A|B) ≥ Hmin (AB) − Hmax (B)
Proof.

H_min(A|B)_ρ = max_{σ_B} (−log min{λ : λ·id_A ⊗ σ_B ≥ ρ_AB})   (7.2)
            ≥ −log min{λ : λ·id_A ⊗ ρ′_B/|supp ρ_B| ≥ ρ_AB}   (7.3)
            = −log min{µ·|supp ρ_B| : µ·id_A ⊗ id_B ≥ ρ_AB}   (7.4)
            = −log min{µ : µ·id_A ⊗ id_B ≥ ρ_AB} − log |supp ρ_B|   (7.5)
            = H_min(AB)_ρ − H_max(B)_ρ,   (7.6)

where ρ′_B denotes the projector onto the support of ρ_B.
Strong subadditivity of von Neumann entropy is the inequality:

H(AB) + H(BC) ≥ H(ABC) + H(B).

Using the definition of the conditional von Neumann entropy, this is equivalent to the
inequality
H(A|B) ≥ H(A|BC)
which is often interpreted as “conditioning reduces entropy”. In this form, it has a direct
analog for conditional min entropy:

Lemma 7.5.3.
Hmin (A|B) ≥ Hmin (A|BC)
Proof. Since λ·id_A ⊗ σ_BC ≥ ρ_ABC implies λ·id_A ⊗ σ_B ≥ ρ_AB, we find, for the σ_BC that maximises the expression for H_min(A|BC),

H_min(A|BC)_ρ = −log min{λ : λ·id_A ⊗ σ_BC ≥ ρ_ABC}   (7.7)
             ≤ max_{σ_B} (−log min{λ : λ·id_A ⊗ σ_B ≥ ρ_AB}) = H_min(A|B)_ρ.   (7.8)

In the exercises, you will show how these two lemmas also hold for the smooth min- and max-entropy. Combined with the asymptotic equipartition property that we discussed in the part on classical information theory, you will then prove strong subadditivity of the von Neumann entropy. This very fundamental result by the mathematical physicists Mary Beth Ruskai and Elliott Lieb was proven in 1973 and remains essentially the only known inequality for the von Neumann entropy; there may be more, we just haven't discovered them yet!

8 Resource Inequalities
We have seen that ebits, classical communication and quantum communication can be seen
as valuable resources with which we can achieve certain tasks. An important example was
the teleportation protocol, which shows that one ebit and two bits of classical communication can simulate the transmission of one qubit. In the following we will develop a framework for the transformation of resources and present a technique that allows us to show the optimality of certain transformations.

8.1 Resources and inequalities


We will consider a setup with two parties, Alice and Bob, who wish to convert one type
of resource to another (one may also consider more than two parties, but this is a little
outside the scope of this course). The resources we consider are:
• n × ⇝ : perfect quantum channel (Alice sends n qubits to Bob)
• n × → : perfect classical channel (Alice sends n bits to Bob)
• n × ∿ : shared entanglement, or ebits (Alice and Bob share n Bell pairs)
• n × – : shared bits

A resource inequality is a relation X ≥ Y which is to be interpreted as “we can obtain


Y using X”. Formally, there exists a protocol to simulate resources Y using only resources
X and local operations. The example to keep in mind is the teleportation protocol which
achieves:
2 × →  +  1 × ∿  ≥  1 × ⇝
Sometimes our resources are noisy and we do not require the resource conversion to be perfect. We can then still use resource inequalities to formulate our results, as in the case of Shannon's channel coding theorem for a channel P_{Y|X}:

n × (noisy channel P_{Y|X})  ≥  n(max_{P_X} I(X; Y) − ε) × →,

for all ε > 0 and n large enough.
In the remainder we will only be concerned with an exact conversion of perfect resources
with the main goal to show that the teleportation and superdense coding protocols are
optimal.

8.2 Monotones
Given a class of quantum operations, a monotone M is a function from states into the real
numbers that has the property that it does not increase under any operations from the
class. Rather than making this definition too formal (e.g. by specifying exactly on which
systems the operations act), we will consider a few characteristic examples.
Example 8.2.1. For bipartite states, the quantum mutual information is a monotone
for the class of local operations. More precisely, given a bipartite state ρAB and a local
quantum operation (CPTP map), say on Bob’s side, Λ : End(B) 7→ End(B 0 )

I(A : B) ≥ I(A : B 0 ).

This can be verified as follows. Let UB→B 0 B 00 be a Stinespring dilation of Λ. Since an


isometry does not change the entropy, we have

I(A : B) = I(A : B 0 B 00 )

The RHS can be expanded as

I(A : B 0 B 00 ) = I(A : B 0 ) + I(A : B 00 |B 0 ).

Strong subadditivity implies that the second term is nonnegative which leads us to the
desired conclusion.
A similar argument shows that

I(A : B|E) ≥ I(A : B 0 |E).

where ρABE is an arbitrary extension of ρAB , i.e. satisfies trE ρABE = ρAB .
Example 8.2.2 (Squashed entanglement). The squashed entanglement of a state ρ_AB is given by

E_sq(A : B) := (1/2) inf_E I(A : B|E),
where the minimisation extends over all extensions ρABE of ρAB . Note that we do not
impose a limit on the dimension of E. (That is why we do not know whether the mini-
mum is achieved and write inf rather than min.) Squashed entanglement is a monotone
under local operations and classical communication (often abbreviated as LOCC). That
squashed entanglement is monotone under local operations follows immediately from the
previous example. We just only need to verify that it does not increase under classical
communication.
Consider the case where Alice sends a classical system C to Bob (e.g. a bit string).

We want to compare E_sq(AC : B) and E_sq(A : BC). For any extension E, we have

I(B : AC|E) = H(B|E) − H(B|ACE)
           ≥ H(B|EC) − H(B|AEC)   (strong subadditivity)
           = I(B : A|EC)
           = I(BC : A|EC)   (set EC =: E′)
           ≥ inf_{E′} I(BC : A|E′).

This shows that Esq (AC : B) ≥ Esq (A : BC). By symmetry Esq (AC : B) = Esq (A : BC)
follows.

8.3 Teleportation is optimal


We will first show how to use monotones in order to prove that any protocol that teleports m qubits using n ebits must have n ≥ m, regardless of how much classical communication the protocol uses. In our graphical notation this reads:

n × ∿ + classical communication ≥ m × ⇝   implies   n ≥ m.

Note first that by sending m halves of ebits down the quantum channel on the RHS of this inequality we obtain

n × ∿ + classical communication ≥ m × ∿,

so we only need to show that we cannot increase the number of ebits by classical commu-
nication. This sounds easy, but in fact needs our monotone squashed entanglement. Since
every possible extension ρABE of a pure state ρAB (for instance the n ebits) is of the form
ρABE = ρAB ⊗ ρE we find
2 E_sq(A : B) = inf_E I(A : B|E) = I(A : B) = 2n.   (8.1)

According to (8.1), having n ebits can be expressed in terms of squashed entanglement.


Since local operations and classical communication cannot increase the squashed entan-
glement as shown in Example 8.2.2, we conclude using again (8.1) that it is impossible to
increase the number of ebits by LOCC.
In fact, the statement also holds if one requires the transformation to only work approx-
imately. The proof is then a little more technical and needs a result about the continuity
of squashed entanglement.
One can also prove that one needs at least two bits of classical communication in order
to teleport one qubit, regardless of how many ebits one has available. But we will leave
this to the exercises.

8.4 Superdense coding is optimal
We want to prove that we need at least one qubit channel in order to send two classical
bits, regardless of how many ebits we have available:
n × ⇝ + ∞ × ∿ ≥ 2m × →   implies   m ≤ n.

Note that concatenation of

n × ⇝ + ∞ × ∿ ≥ 2m × →

with teleportation yields

n × ⇝ + ∞ × ∿ ≥ m × ⇝.

Now we have to prove that this implies n ≥ m, i.e. entanglement does not help us to
send more qubits. For this, we consider an additional player Charlie who holds system
C and shares ebits with Alice. Let Bi be Bob’s initial system, Q an n qubit system that
Alice sends to Bob, Λ Bob’s local operation and Bf Bob’s final system. Clearly, if an n
qubit channel could simulate an m qubit channel for m > n, then Alice could send m
fresh halves of ebits that she shares with Charlie to Bob, thereby increasing the quantum
mutual information between Charlie and Bob by 2m.

[Figure: Charlie shares ebits with Alice, and Alice sends n qubits to Bob.]

We are now going to show that the amount of quantum mutual information that Bob
and Charlie share cannot increase by more than two times the number of qubits that he
receives from Alice, i.e. by 2n. For this we bound Bob’s final quantum mutual information
with Charlie by

I(C : Bf ) ≤ I(C : Bi Q)
= I(C : Bi ) + I(C : Q|Bi )
≤ I(C : Bi ) + 2n

Therefore m ≤ n. This concludes our proof that the superdense coding protocol is optimal.
Interestingly, for this argument, we did not use a monotone such as squashed entan-
glement from above. We merely used the property that the quantum mutual information
cannot increase by too much under communication. Quantities that have the opposite
behaviour (i.e. can increase sharply when only few qubits are communicated) are known
as lockable quantities and have been in the focus of the attention in quantum information
theory in recent years. So, we might also say that the quantum mutual information is
nonlockable.

8.5 Entanglement
We have already encountered the word entanglement many times. Formally, we say that
a quantum state ρAB is separable if it can be written as a convex combination of product
states, i.e.

ρ_AB = Σ_k p_k τ_k ⊗ σ_k,

where the p_k form a probability distribution, the τ_k are states on A and the σ_k are states on B. A state is then called entangled if it is not separable.
Characteristic examples of separable states are
• ρ_AB = |φ⟩⟨φ|_A ⊗ |ψ⟩⟨ψ|_B
• ρ_AB = id_AB/4 = (1/4) id_A ⊗ id_B
• ρ_AB = (1/2)(|00⟩⟨00| + |11⟩⟨11|)


Characteristic examples of entangled states are
• In most situations (e.g. teleportation), ebits are the most useful entangled states. They are therefore also known as maximally entangled states (as are all pure states of the form (U ⊗ V)(1/√d) Σ_i |ii⟩_AB with |A| = |B| = d).
• Non-maximally entangled pure states of the form Σ_i α_i |ii⟩, where the α_i are not all of equal magnitude. In certain cases they can be converted (distilled) into maximally entangled states (of lower dimension) using Nielsen's majorisation criterion [13].
• The totally antisymmetric state ρ_AB = (1/(d(d−1))) Σ_{i<j} |ij − ji⟩⟨ij − ji|_AB can be seen to be entangled, since every pure state supported on the antisymmetric subspace is entangled.
Theorem 8.5.1. For any state ρAB we have that Esq (A : B) = 0 iff ρAB is separable.

Proof. We only prove here that a separable state ρ_AB implies that E_sq(A : B) = 0. The converse is beyond the scope of this course and has been proven recently [14]. We consider the following separable classical-quantum state

ρ_ABC = Σ_i p_i ρ_A^i ⊗ ρ_B^i ⊗ |i⟩⟨i|_C,

with p_i a probability distribution (i.e. p_i ≥ 0 and Σ_i p_i = 1). Using the definition of the mutual information we can write

I(A : B|C) = H(A|C) − H(A|BC)
           = E_i[H(A)_{ρ_A^i}] − E_i[H(A|B)_{ρ_A^i ⊗ ρ_B^i}]
           = 0.

The first two equalities follow by definition and the final step can be verified by the chain rule, which gives

H(A|B)_{ρ_A ⊗ ρ_B} = H(AB)_{ρ_A ⊗ ρ_B} − H(B)_{ρ_B} = H(A)_{ρ_A}.
Since ebits are so useful, we can ask ourselves how many ebits we can extract per given
copy of ρAB , as the number of copies approaches infinity. Formally, this number is known
as the distillable entanglement of ρAB :
E_D(ρ_AB) = lim_{ε→0} lim_{n→∞} sup_{Λ LOCC} { m/n : ⟨ebit|^{⊗m} Λ(ρ_AB^{⊗n}) |ebit⟩^{⊗m} ≥ 1 − ε }.

This number is obviously very difficult to compute, but there is a whole theory of
entanglement measures out there with the aim to provide upper bounds on distillable
entanglement. A particularly easy upper bound is given by the squashed entanglement.

Esq (ρAB ) ≥ ED (ρAB ).

The proof uses only the monotonicity of squashed entanglement under LOCC operations
and the fact that the squashed entanglement of a state that is close to n ebits (in the
purified distance) is close to n. In the exercise you will show that squashed entanglement
of separable state is zero. This then immediately implies that one cannot extract any ebits
from separable states.

8.6 Cloning
The very important no-cloning theorem [15, 16] states that there cannot exist a quantum
operation that takes a state |ψi to |ψi ⊗ |ψi for all states |ψi. It has far-reaching con-
sequences and there exist several different proofs. It is desirable to have a proof that is as independent of the underlying theory as possible. For example, a proof based on the linearity of quantum mechanics is problematic, as the proof would become invalid if someone detects non-linear quantum effects, which in principle could exist.
We next present two different proofs of the non-cloning theorem. Recall that for any
state ρABC we have1
H(A|B) + H(A|C) ≥ 0. (8.2)
Assume that we have a machine that takes some system Q and outputs two copies Q1
and Q2 . Furthermore let R denote a reference system (e.g. a qubit). Let I(R : Q) = 2,
then after the cloning we must have I(R : Q1 ) = I(R : Q2 ) = 2. Using the definition of
the mutual information we obtain H(R) + H(R|Q1 ) = 2 and H(R) + H(R|Q2 ) = 2. Let
H(R) = 1, we then have H(R|Q1 ) = H(R|Q2 ) = −1 which contradicts (8.2) and hence
proves that such a cloning machine cannot exist.
We next present an even more theory independent proof. Consider the following exper-
iment.
[Figure: Alice chooses an angle α and obtains an outcome X; the system R is fed into a cloning machine whose outputs Q1 and Q2 are given to Bob1 and Bob2.]

The state of Bob is assumed to be |α⟩ with probability 1/2 and |α⊥⟩ with probability 1/2. The state of Bob is therefore

ρ_B = (1/2)|α⟩⟨α| + (1/2)|α⊥⟩⟨α⊥| = (1/2) id.

As this state is independent of α, we have I(B : α) = 0. The joint state of Bob1 and Bob2 is

ρ_{B1B2} = (1/2)|α⟩⟨α|_{B1} ⊗ |α⟩⟨α|_{B2} + (1/2)|α⊥⟩⟨α⊥|_{B1} ⊗ |α⊥⟩⟨α⊥|_{B2},

which is not maximally mixed and depends on α. We thus have I(α : B1B2) > 0,
which contradicts relativity, as it would allow us to communicate faster than the speed of
light.

¹ Let ρ_ABCD be a purification of ρ_ABC; then H(A|B) + H(A|CD) = 0. Using the data processing inequality gives H(A|B) + H(A|C) ≥ 0.

9 Quantum Key Distribution
9.1 Introduction
In this chapter, we introduce the concept of quantum key distribution. Traditionally,
cryptography is concerned with the problem of securely sending a secret message from A
to B. Note however that secure message transmission is only one branch of cryptography.
Another example for a problem studied in cryptography is coin tossing. There the problem
is that two parties, Alice and Bob, who are physically separated and do not trust each other, want to toss a coin over the telephone. Blum showed that this problem cannot be solved as long as one does not introduce additional assumptions [17]. Note that coin tossing is possible using quantum communication.
To start, we introduce the concept of cryptographic resources. A classical insecure
communication channel is denoted by
[Diagram: insecure classical channel from Alice to Bob, with arrows indicating that Eve can read and modify the transmitted messages.]

The arrows to the adversary, Eve, indicate that she can receive all messages sent by Alice.
Furthermore Eve is able to modify the message which Bob finally receives. This channel
does not provide any guarantees. It can be used to model for example email traffic.
A classical authentic channel is denoted by

and guarantees that messages received by Bob were indeed sent by Alice. It can be used to describe e.g. a telephone conversation with voice authentication.
The most restrictive classical channel model we consider is the so-called secure channel
which has the same guarantees as the authentic channel and ensures in addition that no
information leaks. It is denoted by

In the quantum setup, an insecure quantum channel that has no guarantees is repre-
sented by

Note that an authentic quantum channel is automatically also a secure quantum channel
since reading out a message always changes the message.
The following symbol

k

denotes k classical secret bits, i.e. k bits that are uniformly distributed and maximally
correlated between Alice and Bob.
A desirable goal of quantum cryptography would be to have a protocol that simulates
a secure classical channel using an insecure quantum channel, i.e.,
?

However, such a protocol cannot exist since this scenario has complete symmetry between
Alice and Eve, which makes it impossible for Bob to distinguish between them. If we add
a classical authentic channel in addition to the insecure quantum channel, it is possible as
we shall see to simulate a classical secret channel, i.e.,
n × (insecure quantum channel) + (classical authentic channel) ≥ n × (secure classical channel)
is possible.
In classical cryptography there exists a protocol [18], called authentication protocol, that
achieves the following
k × (secret key bits) + n × (insecure classical channel) ≥ n × (authentic classical channel),   for n ≫ k.

Thus, if Alice and Bob have a (short) password, they can use an insecure channel to simulate an authentic channel. This implies that the classical authentic channel assumed above can itself be obtained from an insecure channel together with a short initial key (again with n ≫ k).

9.2 Classical message encryption


The one-time pad protocol achieves the following using purely classical technology.
k × (secret key bits) + (classical authentic channel) ≥ k × (secure classical channel)
Let M be a message bit and S a secret key bit. The operation ⊕ denotes an addition
modulo 2. Alice first computes C = M ⊕ S and sends C over a classical authentic channel
to Bob. Bob then computes M 0 = C ⊕ S. The protocol is correct as

M 0 = C ⊕ S = (M ⊕ S) ⊕ S = M ⊕ (S ⊕ S) = M.

To prove secrecy of the protocol, we have to show that PM = PM |C which is equivalent


to PM C = PM × PC . In information theoretic terms this condition can be expressed as
I(M : C) = 0 which means that the bit C which is sent to Bob and may be accessible
to Eve does not have any information about the message bit M . This follows from the
observation that PC|M =m is uniform for all m ∈ {0, 1}. Therefore, PC|M = PC which is
equivalent to PCM = PC × PM and thus proves that the protocol is secret.
As the name one-time pad suggests, a secret bit can only be used once. For example
consider the scenario where someone uses a single secret bit to encrypt 7 message bits such
that we have e.g. C = 0010011. Eve then knows that M = 0010011 or M = 1101100.
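A minimal Python sketch of the one-time pad (an illustration; the function names are our own): encryption and decryption are the same XOR, and re-using a single key bit, as in the example above, leaves Eve with only two candidate messages.

```python
import secrets

def otp_encrypt(message_bits, key_bits):
    """One-time pad: C = M xor S, bitwise; the key must be as long as the message."""
    assert len(message_bits) == len(key_bits)
    return [m ^ s for m, s in zip(message_bits, key_bits)]

otp_decrypt = otp_encrypt          # decryption is the same XOR: M = C xor S

message = [0, 1, 1, 0, 1, 0, 0]
key = [secrets.randbits(1) for _ in message]     # fresh uniform key, used once
cipher = otp_encrypt(message, key)
print(cipher, otp_decrypt(cipher, key) == message)

# Re-using a single key bit for all 7 message bits (as in the example above)
# leaves Eve with only two candidates for M: C itself or its bitwise complement.
s = secrets.randbits(1)
cipher_bad = [m ^ s for m in message]
candidates = [cipher_bad, [1 - c for c in cipher_bad]]
print(message in candidates)       # True: the message is one of two known strings
```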
Shannon proved in 1949 that in a classical scenario, to have a secure protocol the key
must be as long as the message [19], i.e.
Theorem 9.2.1.

k × (secret key bits) + (classical authentic channel) ≥ n × (secure classical channel)   implies   k ≥ n.

Proof. Let M ∈ {0, 1}n be the message which should be sent secretly from Alice to Bob.
Alice and Bob share a secret key S ∈ {0, 1}k . Alice first encrypts the message M and
sends a string C over a public channel to Bob. Bob decrypts the message, i.e. he computes
a string M 0 out of C and his key S. We assume that the protocol fulfills the following two
requirements.
1. Reliability: Bob should be able to reproduce M (i.e. M 0 = M ).
2. Secrecy: Eve does not gain information about M .
We consider a message that is uniformly distributed on {0, 1}n . The secrecy requirement
can be written as I(M : C) = 0 which implies that H(M |C) = H(M ) = n. We thus
obtain
I(M : S|C) = H(M |C) − H(M |CS) = n, (9.1)
where we also used the reliability requirement H(M |CS) = 0 in the last equality. Using
the data processing inequality and the non negativity of the Shannon entropy, we can
write
I(M : S|C) = H(S|C) − H(S|CM ) ≤ H(S). (9.2)
Combining (9.1) and (9.2) gives n ≤ H(S) which implies that k ≥ n.
Shannon’s result shows that information theoretic secrecy (i.e. I(M : C) ≈ 0) cannot
be achieved unless one uses very long keys (as long as the message).

In computational cryptography, one relaxes the security criterion. More precisely, the
mutual information I(M : C) is no longer small, but it is still computationally hard (i.e.
it takes a lot of time) to compute M from C. In other words, we no longer have the
requirement that H(M |C) is large. In fact, for public key cryptosystems (such as RSA
and DH), we have H(M |C) = 0. This implies that there exists a function f such that
M = f (C), which means that it is in principle possible to compute M from C. Security
is obtained because f is believed1 to be hard to compute. Note, however, that for the
protocol to be practical, one requires that there exists an efficiently computable function
g, such that M = g(C, S).

9.3 Quantum cryptography


In this section, we explain why Theorem 9.2.1 does not hold in the quantum setup. As we
will prove later, having a quantum channel we can achieve
n × (insecure quantum channel) + 2n × (classical authentic channel) ≥ ≈n × (secure classical channel)   (⋆)

Note that this does not contradict Shannon’s proof of Theorem 9.2.1, since in the quantum
regime the no-cloning theorem (cf. Section 8.6) forbids that Bob and Eve receive the same
state, i.e., the ciphertext C is not generally available to both of them. Therefore, Shannon’s
proof is not valid in the quantum setup, which allows quantum cryptography to go beyond
classical cryptography.
As we will see in the following, it is sufficient to consider quantum key distribution
(QKD), which does the following.
n × (insecure quantum channel) + n × (classical authentic channel) ≥ ≈n × (secret key bits)   (⋆⋆)

The protocol (⋆⋆) implies (⋆), as we can concatenate it with the one-time pad encryption. More precisely,

n × (quantum channel) + 2n × (authentic channel)  ≥_QKD  ≈n × (secret key bits) + n × (authentic channel)  ≥_OTP  ≈n × (secure classical channel),

¹ In classical cryptography one usually makes statements of the following form: if f were easy to compute, then some other function F would also be easy to compute. For example, F could be the decomposition of a number into its prime factors.

which is the justification that we can focus on the task of QKD in the following.
We next define more precisely what we mean by a secret key, as denoted by S_A and S_B. In quantum cryptography, we generally consider the following three requirements, where ε ≥ 0:
1. Correctness: Pr[S_A ≠ S_B] ≤ ε.
2. Robustness: if the adversary is passive, then² Pr[S_A = ⊥] ≤ ε.
3. Secrecy: ‖ρ_{S_A E} − (p ρ_⊥ ⊗ ρ_E^⊥ + (1 − p) ρ_k ⊗ ρ_E^k)‖₁ ≤ ε, where ρ_E^⊥, ρ_E^k are arbitrary density operators, ρ_⊥ = |⊥⟩⟨⊥| and ρ_k is the completely mixed state on {0, 1}ⁿ, i.e. ρ_k = 2^{−n} Σ_{s∈{0,1}ⁿ} |s⟩⟨s|. The cq-state ρ_{S_A E} describes the key S_A together with the system E held by the adversary after the protocol execution. The parameter p can be viewed as the failure probability of the protocol.

The secrecy condition thus demands that, up to an error ε, either the key is uniformly
distributed and uncorrelated with E, or no key is produced at all.
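To make the secrecy condition concrete, the following minimal sketch (Python with NumPy; the function names are purely illustrative) evaluates the trace-norm distance for a toy one-bit key, with a third basis vector representing the abort symbol ⊥.

import numpy as np

def trace_norm(A):
    return np.sum(np.linalg.svd(A, compute_uv=False))   # sum of singular values

# One-bit key with an abort symbol: basis |0>, |1>, |perp> (dimension 3).
rho_k    = np.diag([0.5, 0.5, 0.0])    # completely mixed state on {0,1}
rho_perp = np.diag([0.0, 0.0, 1.0])    # |perp><perp|
rho_E    = np.eye(2) / 2               # some state of Eve's system

def secrecy_distance(rho_SAE, p, rho_E_perp, rho_E_key):
    ideal = p * np.kron(rho_perp, rho_E_perp) + (1 - p) * np.kron(rho_k, rho_E_key)
    return trace_norm(rho_SAE - ideal)

# A uniform key that is uncorrelated with E satisfies the criterion with epsilon = 0:
print(secrecy_distance(np.kron(rho_k, rho_E), p=0.0,
                       rho_E_perp=rho_E, rho_E_key=rho_E))          # -> 0.0
# A key that Eve knows perfectly is far from this ideal state:
rho_bad = 0.5 * np.kron(np.diag([1.0, 0, 0]), np.diag([1.0, 0.0])) \
        + 0.5 * np.kron(np.diag([0, 1.0, 0]), np.diag([0.0, 1.0]))
print(secrecy_distance(rho_bad, p=0.0, rho_E_perp=rho_E, rho_E_key=rho_E))  # -> 1.0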

9.4 QKD protocols


9.4.1 BB84 protocol
In the seventies, Wiesner had the idea to construct unforgeable money based on the fact
that quantum states cannot be cloned [20]. However, the technology at that time was not
mature enough to put his idea into practice. In 1984, Bennett and Brassard presented the BB84 protocol
for QKD [21] which is based on Wiesner’s ideas and will be explained next.
In the BB84 protocol, Alice and Bob generate a secret key in four steps. In the following,
we fix the standard basis {|0i, |1i} and the basis {|0̄i, |1̄i}, where
|0̄i := (|0i + |1i)/√2 and |1̄i := (|0i − |1i)/√2.

BB84 Protocol:
Distribution step Alice and Bob perform the following task N times; let i =
1, . . . , N . Alice first chooses Bi , Xi ∈ {0, 1} uniformly at random and prepares a
qubit Qi according to
Bi  Xi  Qi
0   0   |0i
0   1   |1i
1   0   |0̄i
1   1   |1̄i
Alice then sends Qi to Bob.
Bob next chooses Bi' ∈ {0, 1} uniformly at random and measures Qi either in
the basis {|0i, |1i} (if Bi' = 0) or in the basis {|0̄i, |1̄i} (if Bi' = 1), storing the
result in Xi'. Recall that all the steps so far are repeated N times.

2 The symbol ⊥ indicates that no key has been produced.

Sifting step Alice sends B1 , . . . , BN to Bob and vice versa, using the classical au-
thentic channel. Bob discards all outcomes for which Bi ≠ Bi' , and Alice does the
same. For illustration, consider the following example:
Q |1i |1i |1̄i |0̄i |0i |1̄i |0̄i |1i |1̄i
B 0 0 1 1 0 1 1 0 1
X 1 1 1 0 0 1 0 1 1
B’ 0 0 0 1 1 0 1 1 0
X’ 1 1 1 0 1 1 0 0 1
no. 1 2 3 4 5 6 7 8 9
Hence, Alice and Bob discard columns 3, 5, 6, 8 and 9.
Checking step Alice and Bob compare (via communication over the classical au-
thentic channel) Xi and Xi' for a randomly chosen sample of positions i of size √n.
If there is any disagreement, the protocol aborts, i.e. SA = SB = ⊥.
Extraction step We consider here the simplest case, where we assume that there are
no errors (due to noise). The key SA consists of the remaining bits of X1 , . . . , Xn
and the key SB consists of the remaining bits of X1' , . . . , Xn' . Note that the protocol
can be generalized such that it also works in the presence of noise.
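For illustration, here is a simplified, noise-free simulation of the distribution and sifting steps (a sketch in Python; there is no eavesdropper and no error correction, and the function name is purely illustrative).

import random

def bb84_sift(N, seed=0):
    rng = random.Random(seed)
    B  = [rng.randint(0, 1) for _ in range(N)]    # Alice's basis choices
    X  = [rng.randint(0, 1) for _ in range(N)]    # Alice's raw bits
    Bp = [rng.randint(0, 1) for _ in range(N)]    # Bob's basis choices
    Xp = []
    for i in range(N):
        if Bp[i] == B[i]:
            Xp.append(X[i])               # matching basis: Bob recovers X_i exactly
        else:
            Xp.append(rng.randint(0, 1))  # wrong basis: the outcome is uniformly random
    kept = [i for i in range(N) if B[i] == Bp[i]]      # sifting
    return [X[i] for i in kept], [Xp[i] for i in kept]

key_A, key_B = bb84_sift(N=16)
print(key_A)
print(key_B)     # identical to key_A in this noise-free, eavesdropper-free setting

In a fuller simulation, the checking step would additionally compare a random subsample of the sifted bits and abort on any disagreement, and the extraction step would output the remaining bits as the key.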

9.4.2 Security proof of BB84


It took almost 20 years until the security of BB84 could be proven [22, 23, 24, 11]. We
present in the following a proof sketch. The idea is to first consider an entanglement-
based protocol (called Ekert91 [25]) and prove that this protocol is equivalent to the BB84
protocol in terms of secrecy. Therefore, it is sufficient to prove security of the Ekert91
protocol which turns out to be easier to achieve. For this, we will use a generalized
uncertainty relation 3 [26] which states that

H(Z|E) + H(X|B) ≥ 1, (9.3)

where Z denotes a measurement in the basis {|0i, |1i}, X denotes a measurement in the
basis {|0̄i, |1̄i} and where B and E are arbitrary quantum systems.
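To get a feeling for (9.3), the following sketch (Python with NumPy; the helper names are purely illustrative) evaluates both entropies in a simple special case, namely when Alice and Bob share a maximally entangled pair and Eve is decoupled; the bound is then saturated.

import numpy as np

ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
ket0bar, ket1bar = (ket0 + ket1) / np.sqrt(2), (ket0 - ket1) / np.sqrt(2)

def proj(v):
    return np.outer(v, v.conj())

def vn_entropy(rho):                       # von Neumann entropy in bits
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]
    return float(-np.sum(ev * np.log2(ev)))

def measured_cq_state(rho_AB, basis):
    # Measure qubit A of a two-qubit state in 'basis' and keep the outcome in a
    # classical register: returns the cq-state rho_{outcome, B}.
    out = np.zeros((4, 4))
    for k, vec in enumerate(basis):
        M = np.kron(proj(vec), np.eye(2))
        cond_B = (M @ rho_AB @ M).reshape(2, 2, 2, 2).trace(axis1=0, axis2=2)
        out += np.kron(proj([ket0, ket1][k]), cond_B)
    return out

def cond_entropy(rho_XB):                  # H(X|B) = H(XB) - H(B)
    rho_B = rho_XB.reshape(2, 2, 2, 2).trace(axis1=0, axis2=2)
    return vn_entropy(rho_XB) - vn_entropy(rho_B)

Phi = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)
H_X_given_B = cond_entropy(measured_cq_state(proj(Phi), [ket0bar, ket1bar]))
# Eve is decoupled here, so H(Z|E) = H(Z) = 1 and (9.3) reads 1 + H(X|B) >= 1.
print(H_X_given_B)     # ~ 0.0: perfect correlation between X and Bob's system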

Ekert91 protocol: Similarly to the BB84 protocol this scheme also consists of four different
steps.
Distribution step (repeated N times) Alice prepares entangled qubit pairs and sends
one half of each pair to Bob (over the insecure quantum channel). Alice and
Bob then measure their qubits in randomly chosen bases Bi (for Alice)4 and Bi' (for
Bob), and record the outcomes Xi (for Alice) and Xi' (for Bob).
Sifting step Alice and Bob discard all (Xi , Xi0 ) for which Bi 6= Bi0 .

3 The uncertainty relation was topic of one exercise.


4 Recall that Bi = 0 means that we measure in the {|0i, |1i} basis and if Bi = 1 we measure in the
{|0̄i, |1̄i} basis.

Checking step For a random sample of positions i Alice and Bob check whether
Xi = Xi0 . If the test fails they abort the protocol by outputting ⊥.
Extracting step Alice’s key SA consists of the remaining bits of X1 , X2 , . . .. Bob’s
key SB consists of the remaining bits X10 , X20 , . . ..

We next show that Ekert91 is equivalent to BB84. On Bob’s side it is easy to verify
that the two protocols are equivalent since Bob has to perform exactly the same tasks
for both. The following schematic figure summarizes Alice’s task in the Ekert91 and the
BB84 protocol.

[Figure: Alice’s laboratory in the two protocols. Ekert91: an entanglement source emits Qi
(sent to Bob) together with a qubit that Alice measures in a basis Bi chosen by a random
generator, yielding the outcome Xi. BB84: random generators produce Bi and Xi, and a
preparation device emits Qi accordingly. In both cases, the outputs of the laboratory are
the registers Bi, Xi and the qubit Qi.]

In the Ekert91 protocol, Alice’s task is described by the state

ρ_{Bi Xi Qi}^{Ekert91} = (1/2) Σ_{b∈{0,1}} (1/2) Σ_{x∈{0,1}} |bihb|_{Bi} ⊗ |xihx|_{Xi} ⊗ |ϕ_{b,x}ihϕ_{b,x}|_{Qi} ,   (9.4)

where |ϕ_{0,0}i = |0i, |ϕ_{0,1}i = |1i, |ϕ_{1,0}i = |0̄i, and |ϕ_{1,1}i = |1̄i. The BB84 protocol leads to
the same state,

ρ_{Bi Xi Qi}^{BB84} = (1/2) Σ_{b∈{0,1}} (1/2) Σ_{x∈{0,1}} |bihb|_{Bi} ⊗ |xihx|_{Xi} ⊗ |ϕ_{b,x}ihϕ_{b,x}|_{Qi} .   (9.5)

We thus conclude that, viewed from outside Alice’s laboratory, the two protocols are equivalent
in terms of security, and hence to prove security for BB84 it is sufficient to prove security
for Ekert91. Note that both protocols have their advantages and drawbacks: while for
Ekert91 it is easier to prove security, the BB84 protocol is technologically simpler to
implement.
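The equality of (9.4) and (9.5) can also be verified numerically. The following sketch (Python with NumPy; the function names are purely illustrative) builds Alice’s cq-state once by direct preparation (BB84) and once by measuring half of a maximally entangled pair (Ekert91).

import numpy as np

ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
ket0bar, ket1bar = (ket0 + ket1) / np.sqrt(2), (ket0 - ket1) / np.sqrt(2)
phi = {(0, 0): ket0, (0, 1): ket1, (1, 0): ket0bar, (1, 1): ket1bar}
basis01 = [ket0, ket1]

def proj(v):
    return np.outer(v, v.conj())

def cq_state(dist):
    # dist maps (b, x) to (probability, state of Q); returns the 8x8 state rho_{B X Q}.
    rho = np.zeros((8, 8))
    for (b, x), (p, rho_Q) in dist.items():
        rho += p * np.kron(np.kron(proj(basis01[b]), proj(basis01[x])), rho_Q)
    return rho

# BB84: Alice picks (b, x) uniformly at random and prepares |phi_{b,x}>.
bb84 = {(b, x): (0.25, proj(phi[(b, x)])) for b in (0, 1) for x in (0, 1)}

# Ekert91: Alice measures her half of |Phi+> = (|00> + |11>)/sqrt(2) in basis b.
Phi_plus = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)

def measure_alice(rho_AB, vec):
    # Project Alice's qubit onto |vec>; return (outcome probability, conditional state of Q).
    M = np.kron(proj(vec), np.eye(2))
    p = np.trace(M @ rho_AB).real
    post = (M @ rho_AB @ M) / p
    return p, post.reshape(2, 2, 2, 2).trace(axis1=0, axis2=2)   # trace out Alice

ekert = {}
for b in (0, 1):
    for x in (0, 1):
        p_out, rho_Q = measure_alice(proj(Phi_plus), phi[(b, x)])
        ekert[(b, x)] = (0.5 * p_out, rho_Q)     # extra 1/2 for the random basis choice

print(np.allclose(cq_state(bb84), cq_state(ekert)))   # True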

[Figure: the tripartite state ρABE shared by Alice (A), Bob (B) and Eve (E). Measuring A
in the basis {|0̄i, |1̄i} yields X and in the basis {|0i, |1i} yields Z; measuring B in the same
two bases yields X' and Z', respectively.]

It remains to prove that the Ekert91 protocol is secure. The idea is to consider the state
of the entire system (i.e. Alice, Bob and Eve) after the distribution of the

entangled qubit pairs over the insecure channel (which may be arbitrarily modified by
Eve) but before Alice and Bob have measured. The state ρABE is arbitrary except that
the subsystem A is a fully mixed state (i.e. ρA is maximally mixed). At this point the
completeness of quantum theory (cf. Chapter 5) shows up again. Since quantum theory is
complete, we know that anything Eve could possibly do is described within our framework.
We now consider two alternative measurements for Alice (B = 0, B = 1). Call the
outcome of the measurement Z if B = 0 and X if B = 1. The uncertainty relation (9.3)5
now implies that
H(Z|E) + H(X|B) ≥ 1, (9.6)
which holds for arbitrary states ρABE where the first term is evaluated for ρZBE and the
second term is evaluated for ρXBE . The state ρXBE is defined as the post-measurement
state when measuring ρABE in the basis {|0̄i, |1̄i}, and the state ρZBE is defined as the
post-measurement state when measuring ρABE in the basis {|0i, |1i}. Using (9.6), we can
bound Eve’s information as H(Z|E) ≥ 1 − H(X|B). We next argue that H(X|B) ≈ 0,
which implies that H(Z|E) ≈ 1, i.e. Eve has essentially no information about Alice’s
outcome. Since the outcome X' is obtained from the system B by a measurement, the data
processing inequality implies H(X|B) ≤ H(X|X') and hence H(Z|E) ≥ 1 − H(X|X').
In the protocol, the checking step (testing phase) distinguishes two cases:
• if Pr[X ≠ X'] is large, then Alice and Bob will, with high probability, detect a
disagreement in their sample and abort the protocol;
• if Pr[X = X'] ≈ 1, Alice and Bob output a key.
Let us therefore assume that Pr[X ≠ X'] = δ for some small δ. Since X is binary, we have
H(X|X') ≤ h(δ), where h(δ) := −δ log2 δ − (1 − δ) log2 (1 − δ) denotes the binary entropy
function, and hence H(Z|E) ≥ 1 − h(δ), which is close to 1 for small δ. Note that also
H(Z) = 1, which implies that I(Z : E) = H(Z) − H(Z|E) ≤ h(δ). Recall that
I(Z : E) = D(ρZE ||ρZ ⊗ ρE ). Thus, for δ → 0 we have I(Z : E) → 0 and hence
D(ρZE ||ρZ ⊗ ρE ) → 0, which implies that ρZE ≈ ρZ ⊗ ρE and completes
the security proof.6
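As a numerical illustration of how this bound degrades with the observed error rate, the following snippet (Python) tabulates 1 − h(δ) for a few values of δ.

from math import log2

def h(delta):                     # binary entropy, with h(0) = h(1) = 0
    if delta in (0.0, 1.0):
        return 0.0
    return -delta * log2(delta) - (1 - delta) * log2(1 - delta)

for delta in (0.0, 0.01, 0.05, 0.11):
    print(f"delta = {delta:4.2f}:  H(Z|E) >= {1 - h(delta):.3f},  I(Z:E) <= {h(delta):.3f}")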

Important remarks to the security proof The proof given above establishes security
under the assumption that there are no correlations between the rounds of the protocol.
Note that if the state involved in the i-th round is described by ρAi Bi Ei we have in general

ρA1 A2 ...An B1 B2 ...Bn E1 E2 ...En ≠ ρA1 B1 E1 ⊗ ρA2 B2 E2 ⊗ . . . ⊗ ρAn Bn En . (9.7)

Therefore, it is not sufficient to analyze the rounds individually and hence we so far only
proved security against i.i.d. attacks, but not against general attacks. Fortunately, there
is a solution to this problem: a de Finetti theorem shows that the proof for individual
attacks also implies security for general attacks. A rigorous proof of this statement is
beyond the scope of this course; it can be found in [11] and in [27], which uses a
post-selection technique.
5 The uncertainty relation was topic of one exercise.
6 In principle, we have to repeat the whole argument in the complementary basis, i.e. using the uncertainty
relation H(X|E) + H(Z|B) ≥ 1 (cf. (9.6)).

Bibliography
[1] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum In-
formation. Cambridge University Press, 2000.
[2] David H. Mellor. Probability: A Philosophical Introduction. Routledge, 2005.
[3] Claude E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27:379–423 and 623–656, 1948.
[4] Roger Colbeck and Renato Renner. The completeness of quantum theory for predict-
ing measurement outcomes. arXiv:1208.4123.
[5] Matthew F. Pusey, Jonathan Barrett, and Terry Rudolph. On the reality of the
quantum state. DOI:10.1038/nphys2309, arXiv:1111.3328.

[6] Roger Colbeck and Renato Renner. Is a system’s wave function in one-to-one cor-
respondence with its elements of reality? DOI:10.1103/PhysRevLett.108.150402,
arXiv:1111.6597.
[7] Albert Einstein, Boris Podolsky, and Nathan Rosen. Can quantum-mechanical de-
scription of physical reality be considered complete? Phys. Rev., 47:777–780, 1935. DOI:10.1103/PhysRev.47.777.
[8] Simon B. Kochen and Ernst P. Specker. The problem of hidden variables in quantum
mechanics. Journal of Mathematics and Mechanics, 17, 1967.
[9] John S. Bell. On the Einstein Podolsky Rosen Paradox. Physics, 1, 1964.

[10] Alain Aspect, Jean Dalibard, and Gérard Roger. Experimental test of Bell’s inequal-
ities using time-varying analyzers. Phys. Rev. Lett., 49:1804–1807, 1982. DOI:10.1103/PhysRevLett.49.1804.
[11] Renato Renner. Security of quantum key distribution. PhD thesis, ETH Zurich,
December 2005. arXiv:0512258.
[12] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge Univer-
sity Press, 2004. Available at: http://www.stanford.edu/~boyd/cvxbook/.
[13] Michael A. Nielsen. Conditions for a class of entanglement transformations. Phys.
Rev. Lett., 83:436–439, Jul 1999. DOI:10.1103/PhysRevLett.83.436.
[14] Fernando Brandao, Matthias Christandl, and Jon Yard. Faithful squashed
entanglement. Communications in Mathematical Physics, 306:805–830, 2011.
DOI:10.1007/s00220-011-1302-1.

[15] William K. Wootters and Wojciech H. Zurek. A single quantum cannot be cloned.
Nature, 299:802–803, 1982. DOI:10.1038/299802a0.

[16] Dennis Dieks. Communication by EPR devices. Physics Letters A, 92(6):271–272,
1982. DOI:10.1016/0375-9601(82)90084-6.
[17] Manuel Blum. Coin flipping by telephone a protocol for solving impossible problems.
SIGACT News, 15(1):23–27, January 1983. DOI:10.1145/1008908.1008911.

[18] Douglas R. Stinson. Cryptography: Theory and Practice. CRC Press, 2005.
[19] Claude E. Shannon. Communication theory of secrecy systems. Bell System Technical
Journal, 28:656–715, 1949.
[20] Stephen Wiesner. Conjugate coding. SIGACT News, 15(1):78–88, January 1983.
DOI:10.1145/1008908.1008920.

[21] Charles H. Bennett and Gilles Brassard. Quantum cryptography: Public key distribu-
tion and coin tossing. Proceedings International Conference on Computers, Systems
& Signal Processing, 1984.
[22] Dominic Mayers. Unconditionally secure quantum bit commitment is impossible.
Phys. Rev. Lett., 78:3414–3417, Apr 1997. DOI:10.1103/PhysRevLett.78.3414.
[23] Peter W. Shor and John Preskill. Simple proof of security of the BB84
quantum key distribution protocol. Phys. Rev. Lett., 85:441–444, Jul 2000.
DOI:10.1103/PhysRevLett.85.441.
[24] Eli Biham, Michel Boyer, P. Oscar Boykin, Tal Mor, and Vwani Roychowdhury. A
proof of the security of quantum key distribution. Journal of Cryptology, 19:381–439,
2006. DOI:10.1007/s00145-005-0011-3.
[25] Artur K. Ekert. Quantum cryptography based on Bell’s theorem. Phys. Rev. Lett.,
67:661–663, Aug 1991. DOI:10.1103/PhysRevLett.67.661.

[26] Mario Berta, Matthias Christandl, Roger Colbeck, Joseph M. Renes, and Renato
Renner. The uncertainty principle in the presence of quantum memory. Nature
Physics, 6, 2010. DOI:10.1038/nphys1734.
[27] Matthias Christandl, Robert König, and Renato Renner. Postselection technique
for quantum channels with applications to quantum cryptography. Phys. Rev. Lett.,
102:020504, Jan 2009. DOI:10.1103/PhysRevLett.102.020504.

