
E178/ME292b

Statistics and Data Science for Engineers


Reader

Gabriel Gomes

August 27, 2024


Contents

1 Systems and Models

2 Probability theory
  2.1 Random variables
    2.1.1 Sample space
    2.1.2 Event space
    2.1.3 Probability measure
    2.1.4 Probability density function
    2.1.5 Cumulative distribution function
  2.2 Expected value
    2.2.1 Properties of the expected value
  2.3 Variance
  2.4 Standardization of random variables
  2.5 Named distributions
    2.5.1 Bernoulli distribution B(p)
    2.5.2 Binomial distribution Bin(n, p)
    2.5.3 Poisson process
    2.5.4 Exponential distribution E(λ)
    2.5.5 Uniform distribution U(a, b)
    2.5.6 Gaussian (a.k.a. normal) distribution N(µ, σ²)
  2.6 Multivariate distributions
    2.6.1 Correlation coefficient
    2.6.2 Marginalization
    2.6.3 Conditional probability
    2.6.4 Bayes' theorem
    2.6.5 Independence

3 Optimization theory
  3.1 Problem formulation
    3.1.1 Global vs. local solutions
  3.2 Types of feasible points
    3.2.1 First order optimality condition
  3.3 Convex optimization problems
    3.3.1 Properties of convex optimization problems
    3.3.2 Examples of convex sets
    3.3.3 Convex functions
  3.4 Gradient descent
    3.4.1 Stochastic gradient descent (SGD)

4 Statistical inference

5 Supervised learning

Symbols
Sets

{}              ... the empty set
{a, b, c}       ... a set with elements a, b, and c
{a_i}_n         ... the set {a_1, a_2, . . . , a_n}, indexed with i
{a, . . . , b}  ... the set of integers a through b (inclusive)
[a, b]          ... the closed interval of the real numbers from a to b
A ∪ B           ... the union of sets A and B
A ∩ B           ... the intersection of sets A and B
A − B           ... the set that results from removing from A all of the elements of B
∪_{i=1}^n e_i   ... the union of sets e_1, e_2, . . . , e_n
a ∈ A           ... assert a is a member of the set A
A ⊆ B           ... assert A is a subset of B, and possibly A equals B
∀a ∈ A          ... assert a condition holds for all elements a in the set A

Named sets

N     ... the natural numbers, not including zero
N_0   ... the natural numbers including zero
Z     ... the integers
R     ... the real numbers
R_+   ... the positive real numbers, not including zero
R^D   ... the D-dimensional vector space of real numbers

Functions

f : A → B   ... f is a function with domain A and codomain B
f(x; θ)     ... f is a function with inputs x and parameters θ

Probability

Y              ... A random variable
Ω_Y            ... Sample space of Y
E_Y            ... Event space of Y
P_Y            ... Probability measure of Y
p_Y            ... Probability density function of Y
F_Y            ... Cumulative distribution function of Y
E[Y], µ_Y      ... Expected value (mean) of Y
Var[Y], σ²_Y   ... Variance of Y
σ_Y            ... Standard deviation of Y
Cov(X, Y)      ... Covariance of X and Y
y ∼ Y          ... y is a sample of Y
B(α)           ... Bernoulli distribution with parameter α
U(a, b)        ... Uniform distribution on the interval [a, b]
N(µ, σ²)       ... Normal distribution with mean µ and variance σ²
{y_i}_N ∼iid Y ... {y_i}_N is iid sampled from Y
{Y_i}_N ∼iid Y ... {Y_i}_N are iid copies of Y
Y | X = x      ... The random variable Y conditioned on X = x
ρ_XY           ... Correlation between random variables X and Y

Statistical learning

H        ... A family of prediction functions
P        ... The number of parameters that characterize H
θ        ... The vector of parameters, θ ∈ R^P
h(x; θ)  ... A prediction function with input x and parameters θ

Chapter 1

Systems and Models

Much of the activity of engineers involves building models to predict and control the behavior
of physical systems. For example, we use models of solar panels and batteries to design
solar farms and predict their production. We use models of the drivetrain of a car to
design cruise control systems. We embed models in feedback control loops to guide robots
through uncertain environments. These models consist of mathematical equations or code
that capture the aspects of the system that we are interested in predicting or controlling.

To build a model of a system is to specify a function (mathematical or in code) that


maps the system's "inputs" to its "outputs". The outputs are the quantities of interest; the
ones that we wish to predict or control. The inputs are quantities that we can measure (and
perhaps manipulate) and which affect the outputs through the influence of the system. This
is illustrated in Figure 1.1.

Figure 1.1: A model transforms inputs to outputs.

Figure 1.2: Models range from purely data-based to purely mechanistic.

As an example of a system, consider a quadcopter or “drone”. The measurable quantities


include the voltages applied to each of its motors, its position, its linear and angular speeds,
and the masses and moments of inertia of its parts. If our goal is to steer the drone, then
the relevant model will be one that provides its position and speed (outputs) as a function
of the applied motor voltages (inputs).
The process of building a model begins with the specification of the type or family of
models that we wish to work with. Any given system may be described by many different
types of models, and which one we choose will depend on our goal. For the quadcopter,
the type of model used when our aim is simply to predict travel time will be different from
the type needed for real-time steering. The former requires only the average velocity and
approximate path of the drone, while the latter requires detailed knowledge of its position,
orientation, and speed, as well as its maneuvering capabilities.
Figure 1.2 illustrates a range of model types, organized from left to right with respect
to their “a-priori structure”. On the right-hand side we see the so-called “mechanistic”
or “open-box” models. These are ones with lots of a-priori structure, usually obtained by
applying the principles of science, such as Newton’s laws, the physical laws of electricity,
thermodynamics, fluid mechanics, etc. to the system. They are “open-box” because they
require that we understand the inner workings of the system. We build open-box models

by breaking the system down to its individual parts, then modeling each one of those us-
ing scientific or engineering principles (or, if not principles, some established techniques),
and then assembling these into a single model for the entire system. As an example, the
coupled system of ordinary differential equations shown below captures the rotational
dynamics of a quadcopter in terms of the torques exerted by its propellers and its principal
moments of inertia. It is obtained by applying Euler's equations of rotational dynamics to
the configuration of the quadcopter.
ṗ = n_1/J_xx + ((J_xx − J_zz)/J_xx) q r
q̇ = n_2/J_xx − ((J_xx − J_zz)/J_xx) p r          (1.1)
ṙ = n_3/J_zz

In these equations, n_1, n_2, n_3 are net torques generated by the propellers, about principal
axes going through the quadcopter's center of mass; p, q, r are angular speeds about those
axes; and J_xx and J_zz are moments of inertia. These equations model only one aspect of
the movement of the quadcopter – how the net torques influence the angular speeds. If we
wished to control the trajectory of the quadcopter, we would need to model other aspects
as well, such as the influence of motor voltages on propeller thrust, and the influence of
propeller thrust on speed.
Let’s keep our focus on the rotational “sub-model” of Eq. 1.1. These equations apply to
a large number of quadcopters – all those that comply with the assumptions made in the
derivation of the equations. Of all of the properties of a drone, the only ones that affect its
rotational dynamics (according to this model) are J_xx and J_zz – the moments of inertia about
its two main axes [1]. These are the parameters of the model. The parameters are quantities
that distinguish two different systems that are both covered by the model. When we assign
a particular numerical value to the parameters (i.e. we tune or train the model), we are
selecting one from a family of models for a particular class of systems. We are selecting the
one that corresponds to our system.
[1] The details of the drone model don't matter much here, but if you are interested, please check out
MEXXX.

Figure 1.3: The propellers of the quadcopter produce torques with components n1 , n2 , and
n3

The number of parameters of a model can be taken as a measure of the size of its family.
The model family of Eq. 1.1 is fairly small – it has only two parameters. The space that we
have to search for our model (the one that best fits our system) is relatively small. This is
because the model equations contain a lot of a-priori structure (or knowledge, or information)
about the system. And this case is particularly simple because the two parameters, J_xx and
J_zz, can be directly measured. It is more difficult when the parameters are not measurable,
and must be inferred.

Moving from right to left in Figure 1.2 we encounter models with less a-priori structure.
On the extreme left-hand side we find the “data-centric” or “closed-box” models. Amongst
these we find the “machine learning” models. These are models that make very few assump-
tions about the system. They contain very little a-priori structure, and therefore apply to a
very wide range of systems. Such models typically contain many parameters, and require a
large amount of data to calibrate. They are most useful for “closed-box” systems, meaning
ones that cannot be broken down into manageable pieces, as we did with the quadcopter.
Below are two examples of closed-box systems, illustrated in Figure 1.4.

Humans are very good at recognizing cats in photographs. Even though it may seem like
a simple task, our current understanding of the visual cortex and other relevant parts of the
brain is insufficient to build an open-box (i.e. mechanistic) model of this system. Although
we understand the basic physics of the brain at the neuronal level, that understanding is not

Figure 1.4: Two closed-box models

helpful to the task of building a cat-recognition algorithm. Machine learning models on the
other hand, specifically convolutional neural networks, have been tremendously successful
with this task. These models have very little a-priori structure. They assume only that
the input is an image and the output is a binary “yes cat” or “no cat”. They only learn
about cats – that they consist of two eyes, a triangular nose with whiskers and pointy ears
– through a training process that involves many examples of images with and without cats.

The roll of a die is another example. In this case we have an excellent understanding of
the physics involved, including the 3D rigid-body dynamics of the die, the fluid dynamics of
the air, and the interaction of the die with the floor. However all of these are plagued with
uncertainties. There are uncertainties in the initial position and velocities of the die, the
temperature and viscosity of the air, the restitution coefficient of the ground, etc. These
uncertainties are amplified through the dynamics of the die, producing a wide range of
possible outputs. The closed-box approach is more straightforward. It acknowledges that
the uncertainties are large and states simply that the output is a randomly chosen number
between 1 and 6!

In this course our focus will be on the left-hand side of Figure 1.2, with the data-centric
models. We will cover many techniques for building such models. These techniques invariably
begin with an “untrained” model whose parameter values are unspecified. As the model is
presented with input data, we compare its predicted output to the measured output, and
use the difference (i.e. the prediction error) to adjust the model parameters. In this way the
model "learns" about the relationship between inputs and outputs in the real system. With

enough data, this process can often produce reliable models of systems whose inner function
is too complex for physics-based analysis.
Similar to humans, data-centric models begin life in a somewhat amorphous and untrained
state. At this stage, their future is uncertain. Given an input, they can produce a
wide variety of outputs. We can think of the training process as one of progressively reducing
the uncertainty in the output by presenting the model with stimuli and showing it the
correct response. The field of mathematics that studies uncertainty is probability theory,
and so that is where we will start.

Chapter 2

Probability theory

Probability theory concerns the study of uncertainty. Most things in life are uncertain.
Predictions about the future are uncertain because they are subject to many factors that
we do not fully know or control. Statements about the present are also uncertain, due to
the finite precision of our measurement devices. Probability theory allows us to quantify,
combine, and evolve interacting uncertainties. This helps us to get a sense of the confidence
that we can place on statements about the world. Large uncertainty means low confidence;
low uncertainty means high confidence.

To begin the discussion, consider the following statement: “The temperature outside my
office is between 60°F and 65°F”. This statement is either true or false, but I do not know
which because I have not looked at the thermometer. However my belief is that it is false,
based on what I see through my window. If asked to rate my belief on a scale from 0 to
1 – where 0 means complete certainty that the statement is false (the temperature is not
between 60°F and 65°F) and 1 means complete certainty that it is true – I would give it a
0.2. This is the so-called Bayesian interpretation of probability. Under this interpretation,
the probability quantifies our subjective belief in a proposition, with 0 signifying complete
certainty that the proposition is false and 1 signifying complete certainty that it is true.
Whenever a belief is based on measurements or perceptual experience, its probability (or
credence in the Bayesian terminology) may only take values in the open interval (0,1). The

extreme values of 0 and 1 are not allowed. When I look at the thermometer and see that
it reads 59°F, I obtain evidence that serves to decrease my credence in the statement, from
0.2 to perhaps 0.1. However no amount of evidence can bring the credence value to 0 or
1. There are always alternative possibilities with small but nonzero probability, such as the
possibility that the thermometer is broken, or that I have lost my mind. The extreme values
of 0 and 1 are reserved for statements that are defined or can be proven to be true within
some system of axioms. For example, the statement "there is no largest prime number" is
true with probability 1 within the rules of arithmetic.

Aside from quantifying belief, we can also use probabilities to gauge the uncertainty in
the outcome of a process or measurement. Take for example the tossing of a coin. When
we say that the “probability of heads is 0.5”, we mean that, if the coin were tossed a large
number of times, we expect that it would turn up heads about half of the time. More
precisely we mean that as the number of trials is increased, the ratio of heads to the total
number of tosses will converge to 0.5. We cannot say with any certainty what the sequence
of heads and tails will be, but we are certain that the ratio will approach to 0.5 – provided
the coin is fair. This is the frequentist interpretation of probability.

In both the Bayesian and frequentist interpretations, a probability is a real number in


the interval [0,1] which quantifies the uncertainty of a statement or event. However neither
of the interpretations is completely satisfactory, because they do not suggest a practical
method for measuring probabilities. Psychologists have made progress in the design of
experiments that measure subjective beliefs, however it remains a difficult problem. The
frequentist definition, on the other hand, relies on an infinite experiment, which is also
difficult! Despite this, the theory of probability has been extremely successful in modeling
real-world uncertainties of both types. Furthermore, the mathematics of probability theory
applies equally to both interpretations. Thus, we can proceed without worrying too much
about whether our probabilities are “Bayesian” or “frequentist”, but keeping in mind that
both interpretations are available.

2.1 Random variables
We begin with an informal definition of a random variable as a symbol that represents some
uncertain measurement or outcome. We typically use upper case letters for random variables.
Here are three examples,

• T : the temperature measured by a thermometer outside of my office window in degrees


Fahrenheit. The thermometer ranges from -40°F to 120°F in increments of 1°F. Hence
T can take integer values from -40 to 120.

• D: the outcome of the roll of a die. D can take integer values 1 through 6.

• C: The response given by a person when they are given a photograph and asked
whether it shows a cat. They say either “yes” or “no”.

Apart from its symbol, a random variable has three components: a sample space Ω, an event
space E, and a probability measure P. We often identify these with their corresponding
random variable using a subscript. Hence we have:

D = (Ω_D, E_D, P_D)    (2.1)
T = (Ω_T, E_T, P_T)    (2.2)
C = (Ω_C, E_C, P_C)    (2.3)

This notation states for example that the random variable D consists of a sample space Ω_D,
an event space E_D, and a probability measure P_D. We will define these three quantities next.

2.1.1 Sample space

The sample space of a random variable is the set of all of the values that it can take. In the
case of the die, the sample space is the integers from 1 to 6:

Ω_D = {1, 2, 3, 4, 5, 6}    (2.4)

The thermometer is a little trickier. Strictly speaking, Ω_T is the integers from -40 to 120:

Ω_T = {−40, . . . , 120}    (2.5)

However we are free to be pragmatic and choose a simpler option, such as the interval of
real numbers:

Ω_T = [−40, 120]    (2.6)

or even the entire real line:

Ω_T = R    (2.7)

It will become clear later, when we introduce probability distributions, why it can be preferable
to work with real-valued sample spaces such as Eq. 2.6 or Eq. 2.7 instead of discrete-valued
sample spaces such as Eq. 2.5. We will take Ω_T = R as the sample space for T from now on.

Both Ω_D and Ω_T are examples of numerical sample spaces. Ω_C, on the other hand, is a
categorical sample space because it consists of the labels "yes" and "no".

Ω_C = {"yes", "no"}    (2.8)

2.1.2 Event space

An event e is any subset of a sample space: e ⊆ Ω. Here are some examples.

• {2, 4, 6} is an event for the random variable D. This event can be expressed verbally
as "roll an even number".

• [60, 65] is the event for the random variable T corresponding to the statement "the
temperature is between 60°F and 65°F".

• {yes} is the event for the random variable C corresponding to the statement "the
picture shows a cat".

• The empty set {} and the entire sample space Ω are events.

The event space E of a random variable is the set of all of its events, i.e. all subsets of the
sample space [1]. This is a much larger set than Ω, known as its power set. If we use |Ω| for
the size of (i.e. the number of elements in) the sample space, and |E| for the size of the event
space, then

|E| = 2^|Ω|    (2.9)

For example, there are 2^6 = 64 possible events for the roll of a die, since Ω_D contains 6
elements.

[1] Actually we do not have to include all of the subsets in the event space, only enough to form a
"σ-algebra". That is, E must contain the complements of each of its elements, as well as the intersections
and unions of any number of its elements.
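To make the power-set idea concrete, here is a minimal Python sketch (an illustration, not
part of the reader) that enumerates every event of the die and confirms |E| = 2^6 = 64:

from itertools import chain, combinations

omega = [1, 2, 3, 4, 5, 6]   # sample space of the die

# The event space is the power set of omega: all subsets, of every size.
events = list(chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1)))

print(len(events))   # 64 = 2**6
print(events[:4])    # (), (1,), (2,), (3,)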

2.1.3 Probability measure

A probability measure P of a random variable is a function that assigns a real number to
each event in its event space.

P : E → R    (2.10)

P (e) is the probability of the event e. To qualify as a probability measure, the function P
must satisfy the following three properties, known as the axioms of probability.

A1. All probabilities are non-negative.

P(e) ≥ 0    ∀e ∈ E    (2.11)

A2. The probability of the sample space is 1.

P(Ω) = 1    (2.12)

A3. For any disjoint set of n events {e_i}_n, meaning that no two events intersect (e_i ∩ e_j = {}
whenever i ≠ j), the probability of the union of the events equals the sum of the
probabilities of the individual events.

P(∪_{i=1}^n e_i) = Σ_{i=1}^n P(e_i)    (2.13)

In the frequentist interpretation, disjoint events represent outcomes that cannot happen
simultaneously. For example, the roll of a die cannot be both even and odd. A photo either
has a cat or it doesn’t. Axiom A3 states that to find the probability that any of a set of
disjoint events occurs, we must add up the probabilities that each of them occur.
These axioms were stated in the 1930’s, long after the birth of probability theory in the
sixteenth century. They capture our intuitions for both the Bayesian and frequentist notions,
and they are a sufficient foundation for the full development of the theory. The following
properties are easily deduced from the axioms.

1. Nothing can't happen:

P({}) = 0    (2.14)

2. If e′ happens whenever e happens, then the probability of e cannot exceed that of e′:

e ⊆ e′ ⇒ P(e) ≤ P(e′)    ∀e, e′ ∈ E    (2.15)

3. Everything either happens or it doesn't:

P(e) + P(Ω − e) = 1    ∀e ∈ E    (2.16)

4. The probability of either e or e′ happening equals the sum of their probabilities, minus
the probability that they both happen:

P(e ∪ e′) = P(e) + P(e′) − P(e ∩ e′)    ∀e, e′ ∈ E    (2.17)

Example 2.1.1. A scale is used on a production line to monitor the weight of the widgets

produced. It finds that 30% weigh less than 120 g, 40% weigh more than 200 g, and 50%
weigh between 120 g and 250 g. Describe this situation using a random variable and a
probability measure. What percentage of widgets weigh between 200 g and 250 g?

Solution. We define a random variable W to represent the weight of a randomly chosen
widget. The sample space for W is the real line: Ω_W = R. The problem statement gives the
probabilities for three events:

e_1 = (−∞, 120)    P_W(e_1) = 0.3    (2.18)
e_2 = (120, 250)   P_W(e_2) = 0.5    (2.19)
e_3 = (200, ∞)     P_W(e_3) = 0.4    (2.20)

Figure 2.1: Example 2.1.1

Figure 2.1 illustrates the three events. Notice that they are not disjoint, and hence their
probabilities need not add up to one. Our goal is to find the probability of event e_4 =
(200, 250), which can be expressed as e_4 = e_2 ∩ e_3. Noting that e_1, e_2, and e_3 cover the entire
sample space, we have

P_W(e_1 ∪ e_2 ∪ e_3) = P_W(Ω_W) = 1    (2.21)

On the other hand, using Eq. 2.17 with e = e_1 ∪ e_2 and e′ = e_3 we find,

P_W(e_1 ∪ e_2 ∪ e_3) = P_W(e_1 ∪ e_2) + P_W(e_3) − P_W((e_1 ∪ e_2) ∩ e_3)    (2.22)

Since e_1 and e_3 do not intersect, we have (e_1 ∪ e_2) ∩ e_3 = e_2 ∩ e_3 = e_4, and so,

1 = P_W(e_1 ∪ e_2) + P_W(e_3) − P_W(e_4)    (2.23)
  = P_W(e_1) + P_W(e_2) + P_W(e_3) − P_W(e_4)    (2.24)
  = 0.3 + 0.5 + 0.4 − P_W(e_4)    (2.25)

which gives P_W(e_4) = 0.2. The second equality above is an application of the third axiom.

A note on notation

We can use P(X ∈ e) instead of P_X(e) to denote the probability that the random variable
X takes a value from the event e. If X is discrete-valued and e consists of a single item,
then we can use P(X = e). When e is a semi-infinite interval, we can use P(X ≥ x) instead
of P_X([x, ∞)) and P(X ≤ x) instead of P_X((−∞, x]).

2.1.4 Probability density function

The probability measure fully specifies the probabilities associated with the possible out-
comes of an experiment. However it is not a convenient mathematical object for practical
use. To program a probability measure, one would have to write a function that accepts
every possible subset of the sample space and returns a number for each one. Without some
simplifying property or rule, this would require storing the set of all possible events, which
as we have seen, grows exponentially with the size of the sample space.

Fortunately the axioms of probability ensure the existence of another function that cap-
tures the same information but is simpler to use. This is the probability density function
(pdf), or the distribution of the random variable. The pdf is simpler because it maps the
sample space (as opposed to the event space) to the reals. We denote the pdf with lower

case p, and with a subscript indicating its random variable:

p_D : Ω_D → R    (2.26)
p_T : Ω_T → R    (2.27)
p_C : Ω_C → R    (2.28)

The defining property of the probability density function is that its integral over any event
e equals the probability of e.
∫_e p(ω) dω = P(e)    ∀e ∈ E    (2.29)

Or, if the sample space is discrete, it is the sum over the elements of e:

P(e) = Σ_{ω∈e} p(ω)    ∀e ∈ E    (2.30)

Although we do not do it here, it can be shown that this property uniquely defines the
function p.
Axiomatic definition of the probability density function
Alternatively, the probability density function can be defined axiomatically, without reference
to the probability measure. Below are the two axioms (a.k.a. properties) that characterize
a pdf.

1. Non-negativity:

p(ω) ≥ 0    ∀ω ∈ Ω    (2.31)

2. Sum to one:

∫_Ω p(ω) dω = 1    (continuous sample space)    (2.32)

Σ_{ω∈Ω} p(ω) = 1    (discrete sample space)    (2.33)

With this definition of the probability density function p, we can remove the more
cumbersome probability measure P from the definition of a random variable. A random variable
then becomes the collection of sample space, event space, and pdf.

D = (Ω_D, E_D, p_D)    (2.34)
T = (Ω_T, E_T, p_T)    (2.35)
C = (Ω_C, E_C, p_C)    (2.36)

Figure 2.2: Example probability density functions.

A note on integrals and sums


Figure 2.2 shows probability density functions over continuous and discrete sample spaces.
Many texts on probability theory refer to the discrete version as a probability mass function.
However here we will dispense with this distinction and refer to both as probability density
functions. We will also only use integrals in our equations; no sums. This eliminates a lot of
repetition (such as in Eqs. 2.32 and 2.33 above) and should not create confusion. Whenever
the sample space is discrete, the integral symbol should simply be interpreted as a sum.

Example 2.1.2. Let Y be a random variable with Ω_Y = [1, ∞). Consider the function
p_Y(y) = a y^b, where a and b are real numbers. Find conditions that a and b must satisfy for
p_Y to be a valid pdf for Y.

Solution. Non-negativity requires that a y^b ≥ 0 for all y ∈ [1, ∞). Thus we conclude a ≥ 0.
Secondly, we require that the integral of p_Y(y) over the sample space equal 1. The integral
is only defined when b < −1, so we adopt that assumption.

∫_1^∞ a y^b dy = (a/(b+1)) y^(b+1) |_1^∞ = (a/(b+1)) (0 − 1) = −a/(b+1) = 1    (2.37)

From which we obtain a third condition: a + b = −1.
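The normalization condition can also be checked symbolically. Below is a small sketch using
Python's sympy (the same package the reader invokes in Example 2.6.2); the conds='none'
option asks sympy to assume convergence of the integral, which holds only for b < −1:

import sympy as sp

y = sp.symbols('y', positive=True)
a, b = sp.symbols('a b', real=True)

# Normalization: the integral of a*y**b over [1, oo) must equal 1.
total = sp.integrate(a * y**b, (y, 1, sp.oo), conds='none')
print(sp.simplify(total))            # -a/(b + 1)
print(sp.solve(sp.Eq(total, 1), a))  # [-b - 1], i.e. a + b = -1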

2.1.5 Cumulative distribution function

The cumulative distribution function (cdf) of a random variable Y is a function from the
sample space Ω_Y to the interval [0,1]. It is denoted with F_Y.

F_Y : Ω_Y → [0, 1]    (2.38)

F_Y(t) is defined as the probability of obtaining a sample that is less than or equal to t:

F_Y(t) = P(Y ≤ t)    (2.39)

which can be computed as the integral of the pdf from −∞ to t.

F_Y(t) = ∫_{−∞}^t p_Y(y) dy    (2.40)

Notice that this definition only makes sense for numerical random variables, and not for
label-based random variables, since it utilizes the "≤" concept, which relies on the order
of the elements of Ω_Y. As always, the integral should be interpreted as a sum when Y is
discrete-valued, and in this case the sum must include p_Y(t). Figure 2.3 shows examples of
discrete and continuous pdfs with their respective cdfs.

Figure 2.3: F_Y(t) is the probability of the event (−∞, t], which is obtained by integrating the
pdf from −∞ to t (inclusive). Integrating over the discrete pdf on the left produces a series
of positive jumps. For the continuous distribution, the cdf is a continuous non-decreasing
function. Its value at t equals the area under the pdf to the left of t.

The cdf is very useful for the computation of probabilities, since the probability of an
interval [a, b] equals the difference between two values of the cdf:

P_Y([a, b]) = ∫_a^b p_Y(y) dy    (2.41)
            = ∫_{−∞}^b p_Y(y) dy − ∫_{−∞}^a p_Y(y) dy    (2.42)
            = F_Y(b) − F_Y(a)    (2.43)

In the absence of a computer, one often uses lookup tables of cdfs to compute probabilities.
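As a concrete illustration, Eq. 2.43 reduces an interval probability to two cdf evaluations.
A minimal sketch with scipy's standard normal cdf (scipy is an assumption here; the
Gaussian distribution itself is introduced in Section 2.5.6):

from scipy.stats import norm

# P(a <= Y <= b) = F(b) - F(a), with Y a standard normal random variable.
a, b = -1.0, 1.0
print(norm.cdf(b) - norm.cdf(a))   # ~0.683, the classic one-sigma probability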

2.2 Expected value

The expected value of a random variable Y, also known as the expectation or the mean of
Y, is denoted with E[Y] or µ_Y and is defined as follows,

E[Y] = ∫_{Ω_Y} y p_Y(y) dy    (2.44)

Figure 2.4: Mean vs. median. p_B is obtained by moving a portion of distribution p_A to the
right. This action affects the mean but not the median.

Graphically, the expected value can be understood as the “balance point” of the pdf. If
we cut a piece of cardboard into the shape of the pdf, then this shape will be balanced at
E[Y], as shown in Figure 2.4. The figure also illustrates the "median", which is any value
y_m that satisfies P(Y < y_m) = P(Y > y_m). That is, a value y_m is a median value if there
is equal chance that a sample of Y will be greater than or less than y_m. For a symmetric
distribution over a continuous space, such as distribution p_A on the left hand side of Figure
2.4, the median coincides with the mean, and they are both at the point of symmetry. We
can appreciate a difference between the median and the mean if we convert distribution p_A
into distribution p_B by taking a portion of the high-value outcomes and moving them further
to the right. Distribution p_B is said to be "positively skewed", or "right skewed", since the
action causes the mean (the balance point) to also move to the right. However it does not
affect the median, since the areas to its left and right remain unchanged.

2.2.1 Properties of the expected value

The properties below follow from the definition of the expected value.

1. E[·] is a linear operation. This means that the expected value of a linear combination of
random variables {Y_i}_n (all over the same sample space) equals the linear combination
of their expected values.

E[Σ_{i=1}^n α_i Y_i] = Σ_{i=1}^n α_i E[Y_i]    (2.45)

2. The expected value of a fixed number equals that number: E[α] = α.

3. The expected value of a function g of a random variable Y is computed with,

E[g(Y)] = ∫_{Ω_Y} g(y) p_Y(y) dy    (2.46)

Example 2.2.1. Find the expected value of the distribution of Example 2.1.2.

Solution. In the example we found that a + b = −1, so the pdf is of the form p_Y(y) = a y^(−1−a).
Next we apply the definition of the expected value.

E[Y] = ∫_1^∞ y a y^(−1−a) dy    (2.47)
     = a ∫_1^∞ y^(−a) dy    (2.48)

The integral exists only if a > 1. Then

E[Y] = (a/(1−a)) y^(1−a) |_1^∞    (2.49)
     = a/(a−1)    (2.50)

2.3 Variance
The variance of a random variable Y is denoted with Var[Y] or σ²_Y, and is defined with,

Var[Y] = E[(Y − µ_Y)²] = ∫_{Ω_Y} (y − µ_Y)² p_Y(y) dy    (2.51)

Here, (Y − µ_Y)² is a random variable, a sample of which is obtained by squaring the distance
from a sample of Y to the fixed mean µ_Y. The expected value of this squared distance is
the variance of Y. An alternate formula for the variance can be derived by expanding the

Figure 2.5: Small vs large variance

square and using properties 1 and 2 of the expected value.

Var[Y] = E[(Y − µ_Y)²]
       = E[Y² − 2µ_Y Y + µ_Y²]
       = E[Y²] − 2µ_Y E[Y] + µ_Y²    (2.52)
       = E[Y²] − µ_Y²

The variance of Y is therefore the difference between the mean of Y² and the squared mean
of Y. The variance is a measure of the spread of a distribution (see Figure 2.5). When we
sample a random variable with small variance, we can be fairly sure that the outcome will
be close to the expected value. High-variance random variables on the other hand produce a
wide range of outcomes. Hence the variance quantifies the uncertainty captured by a random
variable.

The unit of variance is the square of the unit of the outcome. For example, if the
temperature is measured in °F, then its variance has units (°F)². For this reason we often
report the square root of the variance, known as the standard deviation of Y, and denoted
with σ_Y.

Example 2.3.1. Find the variance of the random variable of Example 2.1.2.

Solution.

Var[Y] = E[(Y − E[Y])²] = E[(Y − a/(a−1))²]    (2.53)

We apply Eq. 2.46 for the expected value of a function.

Var[Y] = ∫_1^∞ (y − a/(a−1))² a y^(−1−a) dy    (2.54)
       = a ∫_1^∞ (y² − (2a/(a−1)) y + a²/(a−1)²) y^(−1−a) dy    (2.55)
       = ...
       = a/((a−2)(a−1)²)    provided a > 2    (2.56)
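Both results can be verified symbolically. Here is a sympy sketch of Examples 2.2.1 and
2.3.1 (the conds='none' option assumes the convergence conditions a > 1 and a > 2):

import sympy as sp

y = sp.symbols('y', positive=True)
a = sp.symbols('a', positive=True)
pdf = a * y**(-1 - a)   # the pdf of Example 2.1.2 with a + b = -1

mean = sp.integrate(y * pdf, (y, 1, sp.oo), conds='none')
second = sp.integrate(y**2 * pdf, (y, 1, sp.oo), conds='none')

print(sp.simplify(mean))            # a/(a - 1), requires a > 1
print(sp.factor(second - mean**2))  # a/((a - 2)*(a - 1)**2), requires a > 2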

In contrast with the expected value, the variance is not a linear function. Rather, the
variance of a linear combination of random variables {Y_i}_n is given by this more complicated
nonlinear formula (provided without proof):

Var[Σ_{i=1}^n α_i Y_i] = Σ_{i=1}^n α_i² Var[Y_i] + 2 Σ_{i=1}^n Σ_{j=i+1}^n α_i α_j Cov(Y_i, Y_j)    (2.57)

This formula includes the covariance of two random variables Cov(Y_i, Y_j), which is defined
later in part 2.6. We provide the formula here only to point out two sources of nonlinearity:
the squares on the α_i's in the first term, and the cross-products α_i α_j in the second term. We
can then come up with a particular case in which the variance is linear: the variance of a
sum of uncorrelated random variables. We will see later that "uncorrelated" implies that the
covariance is zero, so the second term is removed. With all α_i's set to 1, we obtain linearity:

Var[Σ_{i=1}^n Y_i] = Σ_{i=1}^n Var[Y_i] + 2 Σ_{i=1}^n Σ_{j=i+1}^n Cov(Y_i, Y_j)    (2.58)
                   = Σ_{i=1}^n Var[Y_i]    (2.59)

Figure 2.6: Normalization of a random variable X.

2.4 Standardization of random variables


We "standardize" a continuous random variable X when we define a new random variable
X̃ with:

X̃ = (X − E[X]) / σ_X    (2.60)

X̃ is standardized in the sense that it necessarily has zero mean and unit variance.
Proof:

E[X̃] = E[(X − E[X])/σ_X] = (E[X] − E[X])/σ_X = 0    (2.61)

Var[X̃] = Var[(X − E[X])/σ_X] = Var[X]/σ_X² = 1    (2.62)

Graphically, the operation has the effect of shifting the distribution of X to the origin, and
squeezing or expanding it so that its variance becomes 1. This is illustrated in Figure 2.6.
Standardization (sometimes referred to as normalization) of random variables has
practical benefits. It can improve the performance of numerical algorithms by bringing all of
the numbers into a common range. Furthermore, lookup tables used in statistics are usually
provided in terms of a standard distribution, and hence one must standardize the data before
consulting them.
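In practice, standardization is a one-line operation on data. A minimal numpy sketch
(an illustration, not part of the reader; the gamma data is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=100_000)   # skewed, non-standard data

x_std = (x - x.mean()) / x.std()    # Eq. 2.60 applied to samples

print(round(x_std.mean(), 6))   # ~0.0
print(round(x_std.var(), 6))    # ~1.0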

2.5 Named distributions


Here we introduce a few important families of pdfs with proper names. You can find a long
list of such distributions in the article titled “List of probability distributions” in Wikipedia.

These are "parametric" families, in the sense that they are expressed with mathematical
formulas that include some number of tunable parameters. A "member" of a family is a
distribution obtained by setting the parameters to particular values. We denote this with
p(y; θ_1, θ_2, . . . , θ_P). Here θ_i is a value assigned to the i'th parameter of a family with P
parameters. The semi-colon in the notation separates the arguments of the pdf (values in
the sample space) from its parameters.

2.5.1 Bernoulli distribution B(p)

The Bernoulli distribution is the simplest model for a discrete-valued random variable. Its
sample space consists of just two outcomes: "yes" and "no", 0 and 1, "left" and "right",
"true" and "false", etc. Call them "success" and "failure", and denote them with ✓ and ✗.

Ω = {✓, ✗}    (2.63)

The Bernoulli distribution has only one parameter, which is the probability p of "success".
Then the probability of "failure" is 1 − p.

p(k; p) = { p        k = ✓
          { 1 − p    k = ✗    (2.64)

We use symbol B for the Bernoulli family, and B(p) for the particular distribution with
parameter value p. Y ∼ B(p) means that Y is a Bernoulli random variable with parameter
value p.
To do computations with Bernoulli random variables, we need to assign numbers to
outcomes ✓ and ✗. There are two commonly used options: Ω = {0, 1} and Ω = {−1, 1}.
Which one to use is purely a matter of mathematical convenience.

{0, 1} encoding:

p(k; p) = { p        k = 1
          { 1 − p    k = 0    (2.65)

{−1, 1} encoding:

p(k; p) = { p        k = 1
          { 1 − p    k = −1    (2.66)

Later, for the purpose of taking derivatives, it will be convenient to define a smooth extension
of p(k; p). This is a function that passes through the two points of the discrete distribution
and also has a continuous first derivative. Figure 2.7 shows two options: linear and
exponential. Below are the formulas for both, in each of the two encodings. It is important to
note that these functions are not distributions, since they do not integrate to 1. They are
merely functions that coincide with p(k; p) at the discrete points of Ω_Y, and that also have
the convenient property of being differentiable. We denote them with p̄(k; p) to emphasize
this distinction.

Extension functions for the {0, 1} encoding:

p̄(k; p) = p^k (1 − p)^(1−k)        exponential    (2.67)

p̄(k; p) = p k + (1 − p)(1 − k)     linear    (2.68)

Extension functions for the {−1, 1} encoding:

p̄(k; p) = p^((1+k)/2) (1 − p)^((1−k)/2)        exponential    (2.69)

p̄(k; p) = ((1+k)/2) p + ((1−k)/2) (1 − p)      linear    (2.70)

2.5.2 Binomial distribution Bin(n, p)

The binomial distribution applies to the total number of successes in an independent set of
n Bernoulli trials B(p). For example, it applies to the number k of people with a particular
illness in a random sample of n people, when each person has a probability p of having the
disease.

Figure 2.7: Smooth extensions of the Bernoulli pdf, in each of two numerical encodings.

Figure 2.8: A sample sequence of Bernoulli trials with p = 0.25.

The sample space for the binomial distribution is Ω = {0, . . . , n}, and its pdf is:

p(k; n, p) = (n choose k) p^k (1 − p)^(n−k)    (2.71)

n is the total number of trials, and p is the probability of success in each trial. The pdf then
returns the probability of observing k successes. Notice that the formula reduces to Eq. 2.67
when n = 1 (use 0! = 1).

Figure 2.8 provides an example. It shows an outcome for a sequence of 32 Bernoulli trials
with a probability of success of 0.25 for each one. Of the 32 trials, 6 are successes (green
1's). This is fewer than expected, since np = 32 × 0.25 = 8. The probability of the outcome
shown in Figure 2.8 is obtained by evaluating the binomial pdf with parameters n = 32 and
p = 0.25 at k = 6:

p(6; 32, 0.25) = (32 choose 6) (1/4)^6 (3/4)^26 ≈ 9.06×10^5 × (2.54×10^12 / 1.84×10^19) ≈ 0.125    (2.72)
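This value is easy to reproduce numerically. A small sketch, first from the formula directly
and then with scipy's binomial pmf (scipy is an assumption here):

from math import comb
from scipy.stats import binom

# Probability of exactly 6 successes in 32 trials with p = 0.25.
print(comb(32, 6) * 0.25**6 * 0.75**26)   # ~0.1247
print(binom.pmf(6, n=32, p=0.25))         # same value from scipy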

Figure 2.9: A Poisson process.

2.5.3 Poisson process

A stochastic process is a mathematical object for representing a system that generates a
stream of events over time. This is a similar concept to the sequence of events of Figure 2.8,
except that now they occur on the real line (the axis of time), as shown in Figure 2.9.
Stochastic processes are an interesting and large topic, and we only touch upon them briefly
in the chapter about time series data. The Poisson process is a very simple case that assumes
only that all events are independent, and that the expected number of events in any two
equally-sized intervals is the same. This implies that there is a positive number λ such that
the expected number of events in an interval of length t is λt. We call this number λ the
rate of the process, and it is measured in events per unit of time.

The Poisson random variable or Poisson distribution counts the number of events in a
Poisson process during one unit of time. The expected value of a Poisson random variable
is clearly λ, by definition. Its sample space is all of the natural numbers, including 0:

Ω = N_0    (2.73)

Its pdf is

p(k; λ) = λ^k e^(−λ) / k!    (2.74)

This formula can be obtained as the limit of a binomial distribution when the number n of
trials in one unit of time goes to infinity.

Proof. We imagine the Bernoulli trials of Figure 2.8 as occurring in time, one after the
other, over a period of one time unit (one second, for example). Define λ as the expected
number of successes in that period: λ = np. Then, applying Eq. 2.71,

p(k; n, p) = (n choose k) p^k (1 − p)^(n−k)    (2.75)
           = (n!/(k!(n−k)!)) (λ/n)^k (1 − λ/n)^(n−k)    (2.76)
           = (n(n−1)···(n−k+1)/k!) (λ^k/n^k) (1 − λ/n)^(n−k)    (2.77)
           = (λ^k/k!) (n(n−1)···(n−k+1))/(n × n × ··· × n) (1 − λ/n)^(n−k)    (2.78)

where the denominator of the middle factor contains k copies of n. Take the limit as n → ∞.

lim_{n→∞} p(k; n, p) = (λ^k/k!) lim_{n→∞} (1)(1 − 1/n) ··· (1 − (k−1)/n) (1 − λ/n)^(−k) (1 − λ/n)^n    (2.79)

Notice that all of the terms except the last tend to 1 as n → ∞, so we can remove them
from the limit. The limit of the last term is a common identity in calculus which we will
not derive here. It equals e^(−λ). Hence,

lim_{n→∞} p(k; n, p) = (λ^k/k!) e^(−λ)    (2.80)
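The convergence in the proof can also be observed numerically: holding λ = np fixed and
growing n, the binomial pmf approaches the Poisson pmf. A small sketch (with illustrative
values λ = 4, k = 2):

from math import comb, exp, factorial

lam, k = 4.0, 2
for n in (10, 100, 10_000):
    p = lam / n
    print(n, comb(n, k) * p**k * (1 - p)**(n - k))   # Bin(n, lam/n) pmf at k

print('limit:', lam**k * exp(-lam) / factorial(k))   # Poisson pmf, ~0.1465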

2.5.4 Exponential distribution E(λ)

The exponential distribution models the waiting time between events in a Poisson process.
A sample waiting time is shown as w in Figure 2.9. The sample space for the exponential
distribution is the positive real numbers R_+ (not including zero). The pdf for the waiting
time is:

p(w; λ) = λ e^(−λw)    (2.81)
Figure 2.10: Continuous and discrete uniform distributions

2.5.5 Uniform distribution U (a, b)

A uniform distribution is one whose sample space is an interval of real numbers or integers,
and whose pdf is a constant C on that interval. The uniform distribution is parameterized
by the limits a and b of the interval.
p(y; a, b) = { C    if y ∈ Ω
             { 0    otherwise    (2.82)

The value of C can be computed from the limits by noting that the integral must equal
1. For the continuous case, with Ω = [a, b], we obtain C = 1/(b − a) (from the area of a
rectangle). For the discrete case, with Ω = {a, . . . , b}, we get C = 1/(b − a + 1), since there
are b − a + 1 elements in Ω.

Figure 2.10 shows the continuous and discrete uniform distributions, along with their
respective means and variances. Examples of discrete uniform random quantities abound:
rolling a single die, flipping a fair coin, picking a card from a well-shuffled deck, etc. An
example of a continuous uniform variable is the angle of the seconds-hand on a clock when
you observe it at an arbitrary point in the day.

We use the notation U[a, b] for the continuous uniform random variable on the interval
[a, b], and U{a, b} for the discrete uniform random variable on the set {a, . . . , b}.

Figure 2.11: Normal probability density functions

2.5.6 Gaussian (a.k.a. normal) distribution N(µ, σ²)

The Gaussian or normal distribution is widely used in engineering and the sciences to model
quantities whose value can in principle be any real number, but is expected to fall near a
particular value. An important example is measurements of real-valued quantities
taken with measurement devices. These values sometimes fall above the "true value",
sometimes below, and collectively they form a histogram that has a "bell shape" that is similar
to the Gaussian distribution.

The sample space for the Gaussian distribution is all of the real numbers (Ω = R). The
Gaussian family of distributions is parameterized by two numbers: µ and σ². µ is allowed
to be any real number, while σ² is required to be positive and non-zero. Here is the formula
for a Gaussian pdf:

p(y; µ, σ²) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²))    (2.83)

Examples of this function are shown in Figure 2.11 for three different settings of the
parameters. We use Y = N(µ, σ²) to designate Y as a Gaussian variable with parameters µ and
σ². Next we prove that µ and σ² turn out to be the mean and variance of N(µ, σ²).

Theorem. With Y = N(µ, σ²), E[Y] = µ and Var[Y] = σ².

Proof. Applying the definition from Eq. 2.44 to Eq. 2.83,

E[Y] = ∫_{−∞}^∞ y (1/√(2πσ²)) exp(−(y − µ)²/(2σ²)) dy    (2.84)

Change of variables: z = (y − µ)/√(2σ²). Then dz = dy/√(2σ²) and y = √(2σ²) z + µ.

E[Y] = ∫_{−∞}^∞ (√(2σ²) z + µ) (1/√(2πσ²)) exp(−z²) √(2σ²) dz    (2.85)
     = (1/√π) ∫_{−∞}^∞ (√(2σ²) z + µ) exp(−z²) dz    (2.86)
     = (1/√π) ( √(2σ²) ∫_{−∞}^∞ z exp(−z²) dz + µ ∫_{−∞}^∞ exp(−z²) dz )    (2.87)

Noting that −(1/2) exp(−z²) is the antiderivative of z exp(−z²), we find that the first term
in Eq. 2.87 equals −(1/2) exp(−z²)|_{−∞}^∞ = 0. For the second term, we use the result (without
proving it) that ∫_{−∞}^∞ exp(−z²) dz = √π. Therefore,

E[Y] = (1/√π)(0 + µ√π) = µ    (2.88)

For Var[Y] = σ², see https://proofwiki.org/wiki/Variance_of_Gaussian_Distribution.
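Both claims of the theorem can also be checked numerically by integrating against the pdf
of Eq. 2.83. A sketch using scipy's quad integrator (an assumption; any quadrature routine
works), with illustrative parameters µ = 2.5 and σ² = 4:

import numpy as np
from scipy.integrate import quad

mu, sig2 = 2.5, 4.0
pdf = lambda y: np.exp(-(y - mu)**2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)

mean, _ = quad(lambda y: y * pdf(y), -np.inf, np.inf)
var, _ = quad(lambda y: (y - mu)**2 * pdf(y), -np.inf, np.inf)
print(mean, var)   # ~2.5, ~4.0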

2.6 Multivariate distributions

So far we have considered the individual random variable as a model for a single
measurement from a system. However most physical systems are not well described by a single
measurement. A real system may be characterized by multiple measurements, and therefore
multiple random variables are required. The main question that arises regarding multiple
measurements is whether they are related in some way. For example, measurements of
ambient temperature and humidity exhibit a decreasing relationship: the higher the temperature,
the lower the humidity. However the roll of a die is not related to the measurement of

Figure 2.12: Joint distribution of temperature and humidity. The event that T ∈ [60, 65]
and H ∈ [69, 72] is a rectangle in R². The probability of this event is the integral of p_Z over
the rectangle.

temperature. These relationships must be encoded in the random variables that we use to
model the system.

Let's again use T for temperature, with sample space Ω_T = R and distribution p_T, and
H for humidity with sample space Ω_H = R and distribution p_H. We form a multivariate
random variable Z by grouping T and H into an array: Z = (T, H). The sample space of
Z is the two-dimensional plane: Ω_Z = Ω_T × Ω_H = R², and its distribution is, as expected,
a function from the sample space to the real numbers: p_Z : R² → R. Figure 2.12 shows
a possible distribution of Z. The notion of an event also generalizes: an event of Z is any
subset of Ω_Z. Take for example the event that the temperature is between 60°F and 65°F,
while humidity is between 69% and 72%:

e = {(t, h) : t ∈ [60, 65], h ∈ [69, 72]}    (2.89)

This event is depicted as a dark green rectangle in Figure 2.12. The probability of an event

is, as before, the integral of the pdf over the event. In the case of e it is a double integral:

P_Z(e) = ∫_e p_Z dz = ∫_60^65 ∫_69^72 p_Z(t, h) dh dt = 0.02    (2.90)

In general, a multivariate random variable may have D components, and its sample space is
the composition of the component sample spaces.

Z = (Z_1, Z_2, . . . , Z_D)    . . . multivariate random variable    (2.91)

Ω_Z = Ω_{Z_1} × . . . × Ω_{Z_D}    . . . multivariate sample space    (2.92)

The expected value of Z is defined as the vector of expected values of the components:

E[Z] = (E[Z_1], E[Z_2], . . . , E[Z_D]) ∈ R^D    (2.93)

The variance of a multivariate random variable is a D × D matrix, and is referred to as the
covariance matrix.

Var[Z] = E[(Z − E[Z])ᵀ (Z − E[Z])]    (2.94)

Here we have assumed that Z and E[Z] are arranged as row vectors, so that (Z − E[Z])ᵀ
is a column vector and Var[Z] is a square matrix. The diagonal entries in this matrix turn
out to be the variances of the individual Z_i's. The non-diagonal entries are the covariances
Cov(Z_i, Z_j) of the pairs (Z_i, Z_j):

Cov(Z_i, Z_j) = E[(Z_i − E[Z_i])(Z_j − E[Z_j])]    (2.95)

Denoting the variance of Z_i with σ²_i and the covariance of Z_i and Z_j with σ²_{i,j}, we have,

Var[Z] = ⎡ σ²_1      σ²_{1,2}   . . .   σ²_{1,D} ⎤
         ⎢ σ²_{2,1}  σ²_2       . . .   σ²_{2,D} ⎥    (2.96)
         ⎢   ...       ...       ...      ...    ⎥
         ⎣ σ²_{D,1}  σ²_{D,2}   . . .   σ²_D     ⎦

Notice from the definition that σ²_{i,j} = σ²_{j,i}, and therefore the covariance matrix is symmetric.
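Numerically, the mean vector of Eq. 2.93 and the covariance matrix of Eq. 2.96 can be
estimated from samples. A numpy sketch with an illustrative bivariate Z (the 0.5 coupling
between the components is an assumption chosen for the example):

import numpy as np

rng = np.random.default_rng(2)
z1 = rng.normal(0.0, 1.0, size=50_000)
z2 = 0.5 * z1 + rng.normal(0.0, 1.0, size=50_000)   # coupled to z1

Z = np.stack([z1, z2])    # rows are components, columns are samples
print(Z.mean(axis=1))     # ~[0.0, 0.0], the mean vector of Eq. 2.93
print(np.cov(Z))          # ~[[1.0, 0.5], [0.5, 1.25]]; symmetric, as Eq. 2.96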

Figure 2.13: Sample space for Example 2.6.1

Example 2.6.1. Consider the multivariate variable Z with the triangular sample space Ω_Z
shown in Figure 2.13, and distribution p_Z(x, y) = c (1 − x − y).
a) Compute c.
b) Compute the probability of the event e = {(x, y) : x ≤ 0.5}.

Solution.

Figure 2.14: Example 2.6.1

a) p_Z(x, y) is the gray triangular plane shown on the left side of Figure 2.14. We must
find the value of c such that the volume under the triangle equals 1. The volume of any
pyramid is one third its base times its height. Therefore we require:

(1/3) (1/2) c = 1    (2.97)

which implies c = 6.
b) The event e is shown in Figure 2.13. The probability of its complement Ω_Z − e is the
volume of the green-shaded region in Figure 2.14. Again, we can compute this quantity using
the formula for the volume of a pyramid: one third the base times the height. In this case
the base has area (1/2)(0.5)(0.5) = 1/8. The height of the pyramid is c/2 = 3. The probability
of Ω_Z − e is therefore:

P_Z(Ω_Z − e) = (1/3) (1/8) (3) = 1/8    (2.98)

Therefore P_Z(e) = 7/8.
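Both answers can be confirmed by direct integration over the triangle. A sympy sketch
(an illustration, not part of the reader):

import sympy as sp

x, y, c = sp.symbols('x y c')
p = c * (1 - x - y)   # joint pdf on the triangle 0 <= y <= 1 - x, 0 <= x <= 1

total = sp.integrate(p, (y, 0, 1 - x), (x, 0, 1))
c_val = sp.solve(sp.Eq(total, 1), c)[0]
print(c_val)   # 6

e_prob = sp.integrate(p.subs(c, c_val), (y, 0, 1 - x), (x, 0, sp.Rational(1, 2)))
print(e_prob)  # 7/8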

2.6.1 Correlation coefficient

The covariance between two random variables has been defined in Eq. 2.95. As the name
suggests, the covariance captures the degree to which the samples of the two quantities tend
to move together. As with the variance, the covariance can be difficult to interpret because
its unit is the product of the units of the two variables. So, for example, the covariance of
temperature and humidity has SI units of K·Pa. This deficiency was corrected in the case
of variance by defining the standard deviation. For the covariance we define the correlation
coefficient ρ_XY of two random variables X and Y.

ρ_XY = Cov(X, Y) / (σ_X σ_Y)    (2.99)

Notice that ρ_XY is dimensionless, which means that it is insensitive to the units with
which we measure X and Y. Indeed, it can be shown (using Eq. 2.58) that ρ_XY must lie in
Figure 2.15: Scatter plot of the normalized measurements.

the interval [−1, 1].

Notice that the correlation coefficient can be expressed in terms of normalized versions
of X and Y, which we will denote with X_n and Y_n.

ρ_XY = E[(X − µ_X)(Y − µ_Y)] / (σ_X σ_Y) = E[((X − µ_X)/σ_X) ((Y − µ_Y)/σ_Y)] = E[X_n Y_n]    (2.100)

Figure 2.15 shows an example of possible samples of X_n and Y_n. The value of X_n Y_n for a
particular sample is the area of the rectangle with one corner at the origin and another at
the sample, with positive sign if the sample is in the first or third quadrant, and negative
sign if it is in the second or fourth quadrant. The mean of these areas, known as the sample
correlation coefficient, converges to ρ_XY as the number of samples increases. If there tend
to be more and larger such rectangles in quadrants two and four, as in the figure, then the
correlation coefficient will be negative. If there are more and larger rectangles in quadrants
one and three, then it will be positive. If the rectangles in quadrants two and four balance
out with rectangles in one and three, then the correlation coefficient is zero.

Figure 2.16 provides several examples. Each subplot shows a scatter plot of data sampled
from a joint distribution pXY . The top row shows Gaussian distributions. Positive correlation

Figure 2.16: Correlation coefficient

coefficients indicate data that has an increasing tendency. Negative correlation coefficients
indicate data that decreases. The middle plot on the top row shows uncorrelated Gaussian
data; here there is no discernible pattern.
The middle row shows examples of perfectly correlated data – that is, data from
distributions with correlation coefficient equal to 1 or −1. The exception is the middle plot, which
lacks a correlation coefficient since σ_Y = 0 in this case, and so ρ_XY is undefined.
The bottom row shows examples where the correlation coefficient is zero, even though
there is a discernible pattern in the data. Notice however that all of these cases present a
symmetry that balances rectangles in quadrants one and three with those in quadrants two
and four.

Example 2.6.2. Find the correlation coefficient for the distribution of Example 2.6.1.

Solution. The standard deviations were found in Example 2.6.5 to be σ_X = σ_Y = √(3/80).
We can plug these into Eq. 2.99 and use the definition of the expectation to obtain the
answer. The integral is too tedious to do by hand, and so we resort to Python's sympy,
which returns ρ_XY = −1/3.
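A sketch of that sympy computation (reconstructed here; the reader only quotes the
result):

import sympy as sp

x, y = sp.symbols('x y')
p = 6 * (1 - x - y)   # joint pdf of Example 2.6.1

def E(expr):
    # Expectation over the triangle 0 <= y <= 1 - x, 0 <= x <= 1.
    return sp.integrate(expr * p, (y, 0, 1 - x), (x, 0, 1))

mu_x, mu_y = E(x), E(y)              # both equal 1/4
cov = E((x - mu_x) * (y - mu_y))     # -1/80
rho = cov / sp.sqrt(E((x - mu_x)**2) * E((y - mu_y)**2))
print(sp.simplify(rho))              # -1/3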

2.6.2 Marginalization

The joint distribution of a multivariate random variable Z = (Z_1, . . . , Z_D) encodes
information about all of the individual quantities Z_i, as well as their correlations. We can extract the
individual distributions from the joint distribution through the process of marginalization.

As an example, take the joint distribution of temperature and humidity p_Z(t, h), with
Z = (T, H). We obtain the distribution of T alone by integrating p_Z over Ω_H:

p_T(t) = ∫_{Ω_H} p_Z(t, h) dh    (2.101)

Similarly, we obtain the distribution of H alone by integrating p_Z over Ω_T:

p_H(h) = ∫_{Ω_T} p_Z(t, h) dt    (2.102)

Both of these formulas yield valid probability distributions in the sense that they are
non-negative and have unit integral. More generally, we can compute the distribution of any
number of components d < D (not just one) of a multivariate random variable with D
components, by integrating its joint distribution over the sample spaces of the other D − d
components.

Example 2.6.3. For Example 2.6.1, compute the marginal distributions of X and Y .

Solution. For each value of x, the marginal probability p_X(x) is found by integrating over
Ω_Y. We use the fact that p_Z(x, y) is only non-zero for y between 0 and 1 − x.

p_X(x) = ∫_{Ω_Y} p_Z(x, y) dy    (2.103)
       = ∫_0^{1−x} 6(1 − x − y) dy    (2.104)
       = ∫_0^{1−x} 6(1 − x) dy − 6 ∫_0^{1−x} y dy    (2.105)
       = 6(1 − x)² − 6 (y²/2)|_0^{1−x}    (2.106)
       = 3(1 − x)²    (2.107)

We can argue by symmetry that p_Y(y) = 3(1 − y)².

Example 2.6.4. For Example 2.6.1, compute E[X].

Solution.

E[X] = ∫_{Ω_X} x p_X(x) dx
     = ∫_0^1 3x(1 − x)² dx
     = 3 ∫_0^1 x(1 − 2x + x²) dx
     = 3 ((1/2)x² − (2/3)x³ + (1/4)x⁴)|_0^1
     = 1/4

Example 2.6.5. For Example 2.6.1, compute V ar[X].

Solution.

Var[X] = E[(X − E[X])²]
       = ∫_{Ω_X} (x − 1/4)² p_X(x) dx
       = ∫_0^1 (x − 1/4)² 3(1 − x)² dx
       = ...
       = 3/80

As has been mentioned, the important question regarding pairs of random variables is
whether knowing the value of one of them has an influence on our belief about the other.
The correlation coefficient gave a partial answer to this question: knowledge of one influences
belief about the other when ρ_XY ≠ 0. Specifically, if ρ_XY > 0, then larger values of Y imply
larger expected values of X (and vice-versa). Conversely, when ρ_XY < 0, larger values of Y
imply smaller expected values of X (and vice-versa). However, ρ_XY = 0 does not imply that
the two are unrelated, as exemplified in the third row of Figure 2.16.
Upon further reflection we realize that a full answer to this question cannot possibly be a
single number such as the correlation coefficient, since it may depend on the value obtained
in the measurement. For example, the distribution of temperature may depend on humidity
when humidity is low, but not when it is high.
A full answer to the question of the relation between random variables must, in the
language of probability theory, tell us how the distribution of one random variable changes,
when something is known about another random variable. The function that gives us this
answer is the conditional probability.

2.6.3 Conditional probability

Consider two random variables X and Y, both categorical, with Ω_X = {a, b, c} and Ω_Y =
{α, β}. The joint pdf of Z = (X, Y) is provided in tabular form in Figure 2.17.

Figure 2.17: Joint distribution of two categorical random variables.

Figure 2.18: Eq. 2.108

From this table we see that the marginal probability of the event (Y = α) is 0.55, and for (Y = β) it is 0.45. However, the odds change if we fix the value of X. If X = a, then the (Y = α)-to-(Y = β) odds are 1:3. The odds flip to 3:1 if X = b or X = c. These are examples of conditional probabilities.

More generally, let Z be a (possibly multivariate) random variable with sample space ΩZ and probability measure PZ. Then for any two events e and e′ in ΩZ, we define the conditional probability P(e′|e) as the probability that e′ occurs, provided e occurs. It's good here to keep an open mind about what is meant by an event "occurring", as well as with the conjunctive "provided". For example, we have not made any assumptions about the flow of causality: whether e causes e′, or e′ causes e, or neither. We have also not specified what the events represent. They may be measurements from a system; for example e may be an input event (i.e. a possible measurement of the inputs), and e′ an output event. In a more Bayesian setting they could be hypotheses about the world, with P(e) being our credence value for hypothesis e. The possibilities are vast. In any case, the numerical value of the

conditional probability is obtained with:

    P(e′ | e) = PZ(e′ ∩ e) / PZ(e)        (2.108)

provided PZ(e) ≠ 0. We can apply this formula to the example from Figure 2.17 to compute several other conditional probabilities.

1. The probability that Y = α, provided X = a:

    P(Y = α | X = a) = PZ(Y = α, X = a) / PZ(X = a) = 0.1 / 0.4 = 1/4        (2.109)

2. The probability that X = c, provided Y = β:

    P(X = c | Y = β) = PZ(X = c, Y = β) / PZ(Y = β) = 0.05 / 0.45 = 1/9        (2.110)

3. The probability that Y = β, provided ((Y = β) and (X = a)) or ((Y = α) and (X = c)). Let's use e for the conditioning event; e comprises the lower-left and upper-right corners of the table in Figure 2.17.

    P(Y = β | e) = PZ((Y = β) ∩ e) / PZ(e) = 0.3 / 0.45 = 2/3        (2.111)
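These computations are mechanical once the joint table is stored as an array. In the sketch below the entries of Figure 2.17 are inferred from the marginals and odds quoted above, so treat the array values as an assumption about the figure:

import numpy as np

# Rows: Y in {alpha, beta}; columns: X in {a, b, c}
pZ = np.array([[0.10, 0.30, 0.15],   # Y = alpha
               [0.30, 0.10, 0.05]])  # Y = beta

pX = pZ.sum(axis=0)   # marginal of X: [0.40, 0.40, 0.20]
pY = pZ.sum(axis=1)   # marginal of Y: [0.55, 0.45]

print(pZ[0, 0] / pX[0])                   # P(Y=alpha | X=a) = 0.25
print(pZ[1, 2] / pY[1])                   # P(X=c | Y=beta) ~= 0.111
print(pZ[1, 0] / (pZ[1, 0] + pZ[0, 2]))   # P(Y=beta | e) ~= 0.667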

The concept of conditional probability applies equally well to continuous random variables. For example, we can define the probability that the temperature is between 62°F and 65°F given that humidity is between 70% and 72%:

    P(T ∈ [62, 65] | H ∈ [70, 72]) = PZ(T ∈ [62, 65] and H ∈ [70, 72]) / PZ(H ∈ [70, 72])        (2.112)

You may have noticed that we have not identified the conditional probabilities (left-hand
sides of the equations in this section so far) with a random variable. This was in order to
delay the question of the status of the conditional probability function P . Does it satisfy

the axioms of probability? The answer to this question is “yes” (proof left to the reader).
Since it is a probability measure, we can use it to define a random variable, which we will
call a conditioned random variable.

Let's be more precise with the definition of the conditioned random variable. Again, let Z be a (possibly multivariate) random variable with sample space ΩZ and probability measure PZ, and let e be an event of Z: e ⊆ ΩZ. e determines a new random variable Z|e with the same sample space:

    ΩZ|e = ΩZ        (2.113)

The probability measure for Z|e is defined with:

    PZ|e(e′) = PZ(e′ ∩ e) / PZ(e)    ∀ e′ ⊆ ΩZ        (2.114)

The simpler notation of Eq. 2.108 is often used. It is even acceptable to remove subscripts
altogether, as long as the meaning is unambiguous:

    P(e′|e) = P(e′ ∩ e) / P(e)    ∀ e′ ⊆ ΩZ        (2.115)

Figure 2.19: Conditional probability density function

What is the probability density function for the random variable Z|e? Noting that the probability of every event e′ ⊆ ΩZ equals the probability of its intersection with e divided by PZ(e), we find that the conditional pdf is:

    pZ|e(z) = pZ(z)/PZ(e)    if z ∈ e
              0              if z ∉ e        (2.116)

This is illustrated in Figure 2.19. The portion of the joint pdf pZ that is within e is scaled by 1/PZ(e), and the rest is "chopped off". We can also use the simplified notation p(z|e) in place of pZ|e(z), if the meaning is clear from context:

    p(z|e) = p(z)/P(e)    if z ∈ e
             0            if z ∉ e        (2.117)

This formula only applies when PZ(e) ≠ 0, and hence it does not cover a case that is very common in the sciences and engineering: the conditioning of a continuous random variable on a measurement event. For example, when we seek the distribution of temperature given that the humidity is known to be 70%: T | H = 70.

Conditioning a continuous pdf on a measurement.

Figure 2.20: Conditioning a continuous pdf on a measurement

Consider a joint random variable Z with components X and Y: Z = (X, Y), both real-valued. The joint sample space ΩZ is the real plane, and the joint pdf pZ(x, y) is a function from R² to R. Now suppose that we take a measurement of the quantity X and obtain the value xo. The conditioning event e is {(x, y) | x = xo}. Visually, e is a straight line through

ΩZ, as shown in Figure 2.20. Being a line, the event has no area, and therefore its probability is zero: PZ(e) = 0. This presents a problem, because the formula of Eq. 2.116 for the pdf does not apply. We need an alternative definition for conditional probabilities of this type.

To this end we define the conditional random variable Y | X = xo, as opposed to Z | X = xo, with sample space ΩY instead of ΩZ. The pdf for Y | X = xo is computed with:

    pY|X=xo(y) = pZ(xo, y) / pX(xo)    ∀ y ∈ ΩY        (2.118)

or, using the simplified notation:

    p(y | X = xo) = pZ(xo, y) / pX(xo)    ∀ y ∈ ΩY        (2.119)

Let's note the differences between Eqs. 2.119 and 2.117. First, the domains are different. The domain of p(z|e) in Eq. 2.117 is all of ΩZ. The formula therefore must account for values of z that are outside of e – hence the two cases. The domain of p(y | X = xo) in Eq. 2.119, on the other hand, is only ΩY. We do not need to specify two cases since all values of (xo, y) fall within the event (X = xo). Another difference is the denominator on the right-hand side. The denominator in Eq. 2.117 is PZ(e) – the probability of event e. In Eq. 2.119 it is pX(xo) – the marginal pdf of X evaluated at xo. Next, we justify Eq. 2.119 by showing that it corresponds to the limit of Eq. 2.117 as e becomes a line.

Consider the event e = {(x, y) : x ∈ [xo, xo + δ]}. This is the vertical blue band in Figure 2.21. Define p(z|e) using Eq. 2.117, and consider the case when z ∈ e. We will transform e into a line by letting δ → 0.

    lim_{δ→0} p(z|e) = lim_{δ→0} pZ(z) / PZ(e)        (2.120)

Hence we need to find the limits of pZ(z) and PZ(e) as δ → 0. Start with pZ(z). We can write z in (x, y) coordinates as z = (xo + ε, y) for some ε ∈ [0, δ]. Then, assuming pZ is continuous on the line X = xo, we have,

    lim_{δ→0} pZ(xo + ε, y) = lim_{ε→0} pZ(xo + ε, y) = pZ(xo, y)        (2.121)

Next, we note that PZ(e) can be expressed in terms of the marginal cdf of X.

    PZ(e) = ∫_0^δ ∫_{−∞}^{∞} pZ(xo + x, y) dy dx        (2.122)
          = ∫_0^δ pX(xo + x) dx                          (2.123)
          = FX(xo + δ) − FX(xo)                          (2.124)

PZ(e) clearly vanishes as δ → 0, but PZ(e)/δ becomes pX(xo):

    lim_{δ→0} PZ(e)/δ = lim_{δ→0} (FX(xo + δ) − FX(xo)) / δ        (2.125)
                      = F′X(xo)                                     (2.126)
                      = pX(xo)                                      (2.127)

Hence, although p(z|e) explodes as δ → 0, the limit of δ p(z|e) does exist:

    lim_{δ→0} δ p(z|e) = lim_{δ→0} pZ(z) / (PZ(e)/δ) = pZ(xo, y) / pX(xo)        (2.128)

pZ(xo, y)/pX(xo) has the same shape as p(z|e) in the limit. It is scaled by δ, which gives it unit integral and thus makes it a valid pdf.

Figure 2.21: Derivation of the conditioning formula, Eq. 2.119

Example 2.6.6. For Example 2.6.1, compute p(y|X = 1/2).

Figure 2.22: Joint distribution of weight and vehicle type

Solution. We apply the formula for the pdf of a conditional random variable:

    p(y | X = x) = pZ(x, y) / pX(x) = 6(1 − x − y) / (3(1 − x)²)        (2.129)

With x = 1/2 this becomes:

    p(y | X = 1/2) = 6(1 − 1/2 − y) / (3(1 − 1/2)²) = 4(1 − 2y)        (2.130)
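As a quick sanity check (a sympy sketch, not in the original text), the conditional pdf is supported on y ∈ [0, 1/2] and integrates to one there:

import sympy as sp

y = sp.symbols('y')
half = sp.Rational(1, 2)
p_cond = 6*(1 - half - y) / (3*(1 - half)**2)
print(sp.expand(p_cond))                    # 4 - 8*y, i.e. 4(1 - 2y)
print(sp.integrate(p_cond, (y, 0, half)))   # 1, so it is a valid pdf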

Example 2.6.7. A college campus has three types of vehicles: scooters, bicycles, and mopeds. The scooters are the lightest of the three, and also the most popular, accounting for 50% of the total. Bicycles are heavier than scooters and account for 40%, while the remaining 10% are mopeds, which are the heaviest.

We define random variables V for the vehicle type and W for the vehicle weight. The sample spaces are respectively ΩV = {scooter, bicycle, moped} and ΩW = R. The joint distribution pZ of Z = (V, W) is shown in Figure 2.22. It consists of three lines; if both variables were continuous, pZ would be a surface. We can integrate pZ over its sample space to confirm

Figure 2.23: Marginal distributions

that it is a valid pdf.


    ∫_{ΩZ} pZ(z) dz = Σ_{v∈ΩV} ∫_{R} pZ(v, w) dw
                    = ∫_{−∞}^{∞} p(scooter, w) dw + ∫_{−∞}^{∞} p(bicycle, w) dw + ∫_{−∞}^{∞} p(moped, w) dw
                    = 0.5 + 0.4 + 0.1
                    = 1

Figure 2.23 shows the marginal distributions of W and V, obtained by summing the joint distribution over ΩV and integrating it over ΩW, respectively. For each w ∈ R,

    pW(w) = Σ_{v∈ΩV} p(v, w)                               (2.131)
          = p(scooter, w) + p(bicycle, w) + p(moped, w)    (2.132)

Figure 2.24: Distributions of weight conditioned on vehicle type.

pV is obtained by integrating p(v, w) over ΩW.

    pV(scooter) = ∫_{−∞}^{∞} p(scooter, w) dw = 0.5        (2.133)
    pV(bicycle) = ∫_{−∞}^{∞} p(bicycle, w) dw = 0.4        (2.134)
    pV(moped)   = ∫_{−∞}^{∞} p(moped, w) dw = 0.1          (2.135)

The conditional probability of weight given vehicle type is obtained by dividing the joint pdf by the marginal of each vehicle class. This is shown in Figure 2.24. For each w ∈ R,

    p(w | V = scooter) = p(scooter, w) / pV(scooter) = p(scooter, w) / 0.5        (2.136)
    p(w | V = bicycle) = p(bicycle, w) / pV(bicycle) = p(bicycle, w) / 0.4        (2.137)
    p(w | V = moped)   = p(moped, w) / pV(moped)     = p(moped, w) / 0.1          (2.138)

2.6.4 Bayes’ theorem

Bayes' theorem gives a formula for swapping the roles of the two events in a conditional probability. For any two events e and e′ in ΩZ,

    P(e | e′) = P(e′ | e) P(e) / P(e′)        (2.139)

This formula is easy to derive from the symmetry of set intersections. Since P(e ∩ e′) = P(e′ ∩ e), applying the definition of Eq. 2.108 we get P(e|e′)P(e′) = P(e′|e)P(e), from which Eq. 2.139 immediately follows, provided P(e′) ≠ 0.
Despite its simplicity, Bayes’ rule has many useful applications. It is often used as a rule
for updating our belief in a hypothesis or statement h when an observation or measurement
m has been made.
    P(h|m) = P(m|h) P(h) / P(m)        (2.140)
On the right-hand side, P (h) is our belief in hypothesis h prior to observing m – we call this
the prior belief. P (m|h) is the probability of observing m assuming that the hypothesis h is
correct – we call this the likelihood of m. P (m) is the probability of observing m when we
make no assumptions about the veracity of h. This can be computed as long as there is a
finite number of alternatives to h, whose probabilities are known.

On the computation of P (m)

Suppose we have a finite set of hypotheses {h1, . . . , hn}. P(m) can then be obtained by marginalization: P(m) = Σ_{i=1}^{n} P(m, hi). Using the definition of the conditional probability we obtain a useful formula, known as the law of total probability:

    P(m) = Σ_{i=1}^{n} P(m|hi) P(hi)        (2.141)

This formula states more generally that, given n mutually exclusive and exhaustive events {h1, . . . , hn}, the probability of any other event m can be obtained with Eq. 2.141.

Example 2.6.8. Take h to represent the event that my car starts the next time I turn the
ignition. Whether or not it starts will depend on many factors, including the amount of gas
in the tank. We will denote with m the observation that there is sufficient gas in the tank
to start the car. Suppose the car is fairly old, and it only starts about 90% of the time. My
prior belief that it will start is P (h) = 0.9. Furthermore, I sometimes forget to fill the tank,
so there is only a 95% chance that the tank has gas, regardless of whether the car starts or

Figure 2.25: Two urns with black and white marbles.

not: P(m) = 0.95. P(m|h) is the probability that the car has gas given that it has been observed to start. This must equal 1, since a car with no gas can never start. Applying Bayes' rule we get,

    P(h|m) = (1 × 0.9) / 0.95 = 0.947        (2.142)
Upon observing that the car has gas, my belief that it will start increases from 0.9 to 0.947.

Example 2.6.9. One of the two urns shown in Figure 2.25 is placed in front of you, but you
do not know which. You are asked to pick a marble. Before looking at the marble, what is
the probability that you’ve picked from urn A? Urn B? How do these change if the marble
turns out to be white? Black?

The prior beliefs for urns A and B are both 0.5.

P (A) = P (B) = 0.5 (2.143)

The fact that A has half white and half black marbles, while B has all white is captured
with conditional probabilities.

P (white|A) = P (black|A) = 0.5 (2.144)

P (white|B) = 1 (2.145)

P (black|B) = 0 (2.146)

Apply Bayes’ rule:

    P(A|white) = P(white|A) P(A) / P(white)                                  (2.147)
               = P(white|A) P(A) / [P(white|A) P(A) + P(white|B) P(B)]       (2.148)
               = (0.5 × 0.5) / (0.5 × 0.5 + 1 × 0.5)                         (2.149)
               = 1/3                                                         (2.150)

P (B|white) is therefore 2/3. We can repeat the computation for the case that we choose a
black marble.

    P(A|black) = P(black|A) P(A) / P(black)                                  (2.151)
               = P(black|A) P(A) / [P(black|A) P(A) + P(black|B) P(B)]       (2.152)
               = (0.5 × 0.5) / (0.5 × 0.5 + 0 × 0.5)                         (2.153)
               = 1                                                           (2.154)

And therefore P (B|black) = 0.
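The same computation in code makes the interplay of Bayes' rule and the law of total probability explicit. The helper function below is hypothetical (not from the text), but the numbers match Eqs. 2.147–2.154:

def posterior(prior, likelihood):
    # prior[h] and likelihood[h] for each hypothesis h; returns P(h | observation)
    evidence = sum(likelihood[h] * prior[h] for h in prior)   # law of total probability
    return {h: likelihood[h] * prior[h] / evidence for h in prior}

prior = {'A': 0.5, 'B': 0.5}
print(posterior(prior, {'A': 0.5, 'B': 1.0}))   # white marble: {'A': 1/3, 'B': 2/3}
print(posterior(prior, {'A': 0.5, 'B': 0.0}))   # black marble: {'A': 1.0, 'B': 0.0}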

Bayes’ theorem for inference

Suppose X and Y are respectively the input and output of a system. Bayes' rule says that for all pairs (x, y) ∈ ΩZ:

    p(x | Y = y) = p(y | X = x) pX(x) / pY(y)        (2.155)
For input-output systems, we can often say that the input “causes” the output. We can also
often gather data to estimate the distributions on the right-hand side of this equation. We
can then use the equation to build a distribution of the inputs that result in output y. Thus
we can infer something about the cause x of an observed output y.

Example 2.6.10. In Example 2.6.7 we computed the probability of the vehicle’s weight

Figure 2.26: Vehicle type conditioned on the weight, as a function of the weight.

conditioned on the vehicle’s type. We can use Eq. 2.155 to obtain the distribution of a
vehicle’s type given its weight:

    p(v | W = w) = (p(w | V = v) / pW(w)) pV(v)        (2.156)

Figure 2.26 shows the result of this computation for each of the three vehicle types, and
expressed as a function of weight. Notice that for each weight, the sum of the three lines
equals 1. The plot suggests the following algorithm for guessing a vehicle’s type based on
its weight:
def guess_vehicle_type(w):
    # Pick the type with the largest p(v | W = w) in Figure 2.26
    if w < 20:
        return "scooter"
    elif w < 43:
        return "bicycle"
    elif w < 66:
        return "moped"
    else:
        return "bicycle"

It is surprising that, even though mopeds are on average heavier than bicycles, we should expect that very heavy vehicles (w > 66) are bicycles. This is due to the fact that the variance in the weight of bicycles is larger than that of mopeds. If the three conditional random variables W | V = bicycle, W | V = scooter, and W | V = moped had equal variance, then this sort of reversal would not occur. We will return to this issue when we study logistic

regression.

2.6.5 Independence

Two random variables are independent if learning the outcome of one does not change our
belief about the other. For example, the roll of a die and the temperature of the room
are independent random variables: knowing the temperature does not influence what we
believe about the die, and vice-versa. Conversely, two random variables are dependent when
observing one of them can inform our belief about the other. This is true of temperature and
humidity. We can also generalize this notion to more than two random variables. Multiple
random variables are pairwise independent when every pair is independent. And even more
strongly, multiple random variables are jointly independent when no measurements from
any subset of the variables provide information about any other variable. The difference between pairwise independence and joint independence is subtle, and in this course we will use the term "independent" to mean "jointly independent". The question of independence is of crucial importance when working with several random quantities. If the quantities are independent, then they can be treated separately. When they are dependent, the goal is often to infer the nature of their relationship from the data.

Before getting to independent random variables, let's first define the notion of two independent events of a (possibly multivariate) random variable Z. Two events e ⊆ ΩZ and e′ ⊆ ΩZ are independent if the probability that they both occur equals the product of the probabilities that they each occur:

    P(e ∩ e′) = P(e) P(e′)        (2.157)

Figure 2.27: Independent random variables.

Using Eq. 2.108 we find that independent events satisfy:

    P(e|e′) = P(e)         (2.158)
    P(e′|e) = P(e′)        (2.159)

This form makes clear that the prior and posterior probabilities are the same, and hence no
information about one event is gained by observing the other.

Next we will build a mathematical definition of independent random variables. Roughly speaking, two random variables X and Y are independent (denoted X ⊥ Y) if no event in one can inform any event in the other. To make this precise, let's define Z = (X, Y). The sample space of Z is the product of the sample spaces of X and Y: ΩZ = ΩX × ΩY. Now, consider two arbitrary events in X and Y:

    αx ⊆ ΩX        (2.160)
    αy ⊆ ΩY        (2.161)

From these we can construct events ex and ey:

    ex = {(x, y) : x ∈ αx}        (2.162)
    ey = {(x, y) : y ∈ αy}        (2.163)

This is illustrated in Figure 2.27. ex and ey are the projections of αx and αy into the joint sample space ΩZ. We say that X and Y are independent if ex and ey are independent events in Z for any choice of αx and αy:

    PZ(ex ∩ ey) = PZ(ex) PZ(ey) = PX(αx) PY(αy)        (2.164)

For independent random variables, the joint pdf decomposes into the product of the marginal pdfs. For all (x, y) ∈ ΩZ:

    pZ(x, y) = pX(x) pY(y)        (2.165)

Proof: In Eq. 2.164, evaluate the probabilities using the respective pdfs:

    ∫_{ex ∩ ey} pZ(x, y) dx dy = ( ∫_{αx} pX(x) dx )( ∫_{αy} pY(y) dy )        (2.166)

Both sides of this equation can be written as double integrals over x and y:

    ∫_{αy} ∫_{αx} pZ(x, y) dx dy = ∫_{αy} ∫_{αx} pX(x) pY(y) dx dy        (2.167)

Since the events αx and αy were arbitrary, this implies that the integrands must be equal to each other. □

The pdf of a random variable remains unchanged when it is conditioned on an event in a variable of which it is independent. With X ⊥ Y,

p(y | X = x) = pY (y) (2.168)

p(x | Y = y) = pX (x) (2.169)

or more compactly:

p(y|x) = p(y) (2.170)

p(x|y) = p(x) (2.171)

Proof

The result is an immediate consequence of Eq. 2.119:

    p(y | X = x) = pZ(x, y) / pX(x) = pX(x) pY(y) / pX(x) = pY(y)        (2.172)

The covariance of two independent random variables is zero:

    X ⊥ Y  ⇒  Cov(X, Y) = 0        (2.173)

Proof: Start with the expected value of the product of X and Y:

    E[XY] = ∫_{ΩZ} x y pZ(x, y) dx dy        (2.174)

Since X and Y are independent, we can apply Eq. 2.165:

    E[XY] = ∫_{ΩZ} x y pX(x) pY(y) dx dy        (2.175)

The double integral can be separated into two simple integrals:

    E[XY] = ( ∫_{ΩX} x pX(x) dx )( ∫_{ΩY} y pY(y) dy ) = E[X] E[Y]        (2.176)

This is actually an interesting finding: the expected value of a product of independent random variables is the product of their expected values. Next we define µX = E[X] and µY = E[Y], and note the following identity, obtained using the linearity of the expected value (Eq. 2.45):

    2µX µY = E[µX Y + µY X]        (2.177)

Combining equations 2.176 and 2.177:

    E[XY] + 2µX µY = µX µY + E[µX Y + µY X]        (2.178)

Rearranging and again using the linearity of the expectation we get:

    E[XY + µX µY − µX Y − µY X] = 0        (2.179)

which under further rearrangement becomes the covariance:

    E[(X − µX)(Y − µY)] = 0        (2.180)

Independence vs. correlation

Independence is a binary (true/false) property that tells us whether two random variables have any relation to each other. When two random variables are independent, then they are also uncorrelated (correlation coefficient equal to zero), and there is nothing else to be said
Figure 2.28: Multivariate normal pdf.

about their relationship. However, when they are not independent, the correlation coefficient tells us something about the manner in which they are related. Specifically, it tells us the degree to which their relationship is linear. With ρXY = 1 or ρXY = −1, the relationship is perfectly linear. When ρXY = 0, the relationship is not at all linear. However, this does not preclude many nonlinear relationships, such as the ones exhibited in the bottom row of Figure 2.16.

Multivariate Gaussian variables

A multivariate Gaussian random variable Y = (Y1, . . . , Yn) is one whose univariate marginals are all Gaussian: Yi ∼ N(µi, σi²) for i = 1, . . . , n.

    Y = (Y1, . . . , Yn) ∼ N(µ, Σ²)        (2.181)

Here µ ∈ R^n is the mean, and Σ² ∈ R^{n×n} is the covariance matrix, which, like all covariance matrices, is symmetric and positive-definite. Figure 2.28 shows the bell-shaped pdf of a two-dimensional normal random variable. The level sets (horizontal slices) of the pdf are

concentric ellipses in the sample space (horizontal plane). These are centered on µ.
The formula for the multivariate pdf can be found in the Wikipedia article titled "Multivariate normal distribution", but we will not be needing it. The important point is that it describes a set of measurements, each of which is Gaussian.
Gaussian variables are an important exception to the observation stated in the previous part, that uncorrelated variables may not be independent. If two variables X and Y are jointly Gaussian, then being uncorrelated implies independence:

    X ⊥ Y  ⟺  ρXY = 0        (2.182)
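A quick numerical illustration (a sketch; the mean and covariance matrix are assumed values, not from the text): draw samples from a bivariate normal and check that the sample correlation matches what the covariance matrix predicts.

import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])   # symmetric, positive-definite
samples = rng.multivariate_normal(mu, Sigma, size=100_000)
# Off-diagonal entry should be near 0.8 / sqrt(1.0 * 2.0) ~= 0.566
print(np.corrcoef(samples.T))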

Chapter 3

Optimization theory

In this course we will learn several data-based techniques for building models of systems.
Many of the techniques will follow a common paradigm: first we propose a parameterized
family of models, then we choose the member of that family that best fits the data, according
to some criterion. The search for the best-fitting model will be cast as an optimization
problem, and we must therefore establish some of the basic concepts of optimization theory.

Optimization problems arise whenever we are faced with the task of choosing a best option from a set of possible options. This is an extremely broad formulation, and indeed optimization theory is useful in many different settings. Within engineering it can be applied to problems as wide-ranging as these:

• What shape should we give a part such that its cost is minimized while meeting a specification?

• What voltage should we apply to each of the motors of a drone in order to stabilize its
flight?

• How should the weights of a neural network be set so that it reliably identifies cars in
images?

3.1 Problem formulation

The specification of an optimization problem has three parts.

1. The decision vector is an array of decision variables. Each of these variables can be real- or discrete-valued; however, we will only encounter real-valued variables in this course. Let D be the number of decision variables.

2. The search set or feasible set Ω ⊆ R^D. This set should not be confused with the sample space from the previous chapter. It is the set of permissible values for the decision variables. The feasible set is typically specified by applying a series of equality and inequality constraints to R^D.

3. The objective function J : Ω → R. This function assigns to each feasible decision vector x ∈ Ω a measure of its quality. We will assume here that more desirable values of x have smaller values of J(x).

Our goal will be to find the best possible decision vector, which we will denote with x*. That is, we seek an x* ∈ Ω that satisfies:

    J(x*) ≤ J(x)    ∀ x ∈ Ω        (3.1)

x* is called the global solution to the problem. Note that the global solution may not be unique – there may be multiple x ∈ Ω, all with the same value of J(x), that is the smallest in Ω. Note also that there may be no solution – in the same way as there is no smallest number, there may be no smallest value of J(x) amongst x ∈ Ω.
We formulate an optimization problem using the following notation:

    minimize_{x ∈ R^D}  J(x)
    subject to: x ∈ Ω              (3.2)

These are the elements in the notation:

• “minimize”: This is the main directive to find the smallest value of J(x). If the goal
is to maximize rather than minimize a function, we can use the “maximize” directive,
or we can equivalently minimize the negative of the function.

• "x ∈ R^D" indicates the domain of the decision vector. As stated above, we focus here on real numbers. However, this is where one would indicate that the variables are integer-valued, if that were the case.

• “J(x)”: The objective function.

• "subject to" indicates that what follows is a list of constraints that carves the search space Ω out from the domain R^D. This is often abbreviated as "s.t.".

Ω is specified as a set of n equality constraints fi(x) and m inequality constraints gj(x):

    minimize_{x ∈ R^D}  J(x)
    subject to: fi(x) = 0    ∀ i ∈ {1, . . . , n}        (3.3)
                gj(x) ≤ 0    ∀ j ∈ {1, . . . , m}

Using "argmin" in place of "minimize" returns the optimal decision vector x*:

    x* = argmin_{x ∈ R^D}  J(x)
    subject to: fi(x) = 0    ∀ i ∈ {1, . . . , n}        (3.4)
                gj(x) ≤ 0    ∀ j ∈ {1, . . . , m}

Example 3.1.1. Suppose you wish to build a rectangular enclosure for your pet rabbit using a fixed length ℓ of fence material, and you would like to know the width w and depth d that maximize the area of the rectangle. This can be posed as an optimization problem with
decision variable x = (w, d). The objective is to maximize the function J(w, d) = wd. The feasible values are all of the positive w and d that add up to a perimeter of ℓ. Here is the formulation as an optimization problem.

    Given ℓ,
    maximize_{(w, d) ∈ R²}  wd
    subject to: 2w + 2d = ℓ
                w ≥ 0,  d ≥ 0        (3.5)

We can convert the problem to a minimization simply by flipping the sign of the objective function. Objective functions in minimization problems are usually called "cost" functions.

    minimize_{(w, d) ∈ R²}  −wd
    subject to: 2w + 2d = ℓ
                w ≥ 0,  d ≥ 0        (3.6)

This problem has D = 2 decision variables, n = 1 equality constraint, and m = 2 inequality constraints. Figure 3.1 provides an illustration.
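The example is small enough to solve by inspection of Figure 3.1, but it also shows how such a formulation maps to code. A sketch using scipy (the starting point and fence length are our choices, not the text's):

from scipy.optimize import minimize

ell = 1.0   # total length of fence material
res = minimize(
    lambda v: -v[0]*v[1],                                              # cost: -w*d
    x0=[0.1, 0.1],
    constraints=[{'type': 'eq', 'fun': lambda v: 2*v[0] + 2*v[1] - ell}],
    bounds=[(0, None), (0, None)],
)
print(res.x)   # ~[0.25, 0.25], i.e. w* = d* = ell/4: a square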

3.1.1 Global vs. local solutions

The optimal decision vector x* is known as the global solution to the problem. A weaker sense
of solving an optimization problem is to find a local solution. This is a feasible point (a.k.a.
decision vector) that is best amongst its immediate feasible neighbors, but not necessarily
best overall. For many problems, this will be the best we can do. A vector x+ is a local

Figure 3.1: The feasible set is the red line segment, which is the restriction of the line 2w + 2d = ℓ to the positive quadrant. The solution (w*, d*) must lie on this line. The cost function J = −wd is a surface that dips into the page, and descends from the bottom left corner to the upper right corner. Its level sets are the gray curves. The lowest point along this surface on the red line is at the blue dot, i.e. at w* = d* = ℓ/4. Thus, the optimal shape is a square.

solution if,

    J(x+) ≤ J(x)    ∀ x ∈ Ω ∩ Bε(x+)        (3.7)

where Bε(x+) is a small ball centered at x+ with radius ε,

    Bε(x+) = {x : ‖x − x+‖ < ε}        (3.8)

3.2 Types of feasible points

Next we list three ways of categorizing the elements of Ω: interior vs. non-interior, differentiable vs. non-differentiable, and stationary vs. non-stationary.

Interior vs. non-interior points

As the name suggests, an interior point of a set is one that is located "inside" the set. All other points are non-interior (a.k.a. boundary) points. The formal definition of an interior point is one that can be placed at the center of a ball that is entirely contained in the set. Points that cannot be put inside a ball that is contained in the set are non-interior. We will use Ω° to denote the set of interior points (a.k.a. the interior) of Ω.

We can characterize Ω° in terms of the constraints of the problem. Ω° is the set of points for which no constraint is active. An active constraint is one in which the relation is satisfied with the "=" symbol. Equality constraints are always active, and hence a problem with equality constraints has no interior points. For problems with only inequality constraints, Ω° is the set of points that satisfy gj(x) < 0 for all j ∈ {1, . . . , m}.

Differentiable vs. non-differentiable points

We say that a point x is a differentiable point when the cost function J is continuously differentiable at x. This means that the gradient of J, denoted ∇J, exists and is continuous at x. Otherwise x is a non-differentiable point. The gradient is a generalization of the scalar derivative to functions with multiple inputs. ∇J is a vector in R^D that points in the direction of most rapid increase of J.

Stationary vs. non-stationary points

A point x ∈ Ω is a stationary point of J when ∇J(x) = 0. This means that the function does not increase or decrease in any direction. It is locally flat. Stationary points are important points to consider when solving optimization problems, as we will see in the next section.

Example 3.2.1. The feasible set in the enclosure example has no interior, because it has
an equality constraint.

Figure 3.2: Categorizing feasible points

Figure 3.2 illustrates these concepts. Plot (a) shows a non-differentiable point x1. This is not a stationary point, since the gradient is not defined at x1. In (b) there is a continuum of stationary points between x2 and x3. All are interior points, differentiable, and also global minima. Plot (c) shows two stationary points x4 and x5: x4 is a local solution but not a global solution; x5 is not a local solution. Plot (d) shows another example of a stationary point that is neither a local nor a global solution. Finally, in plot (e), x7 is a non-stationary non-interior point that is both a local and a global solution.

From these plots we can begin to see that the solutions to optimization problems are of
at least three types:

1. non-differentiable points, as in (a),

2. non-interior points, as in (e), and

3. stationary points, as in (b).

The first order condition for optimality establishes that these are in fact the only possibilities.

3.2.1 First order optimality condition

The first order optimality condition states that any local solution that is both differentiable and interior must also be stationary.

    x is a differentiable, interior, local solution  ⇒  x is stationary        (3.9)

At first glance, the statement may not seem very useful. It says that, if we know that a point is a local solution, as well as differentiable and interior, then we can assert that it is stationary. However, it is much easier to test for stationarity than for local optimality: just evaluate the gradient. A better interpretation requires that we first realize that all points are either 1) differentiable and interior, or 2) non-differentiable, or 3) non-interior (some may be both 2 and 3). The first order condition then tells us that any solution that falls into category 1) must be a stationary point. The condition therefore suggests that we seek a solution amongst three sets of points: the non-differentiable points, the non-interior points, and the stationary points. It is most impactful when J is continuously differentiable everywhere and the problem has no constraints. In this case the condition reduces to,

    x is a local solution  ⇒  x is stationary        (3.10)

That is, all solutions are to be found amongst the stationary points. Add to this the fact
that all global solutions are local solutions.

    x is a global solution  ⇒  x is a local solution  ⇒  x is stationary        (3.11)

In this context, stationarity is a necessary but not sufficient condition for local and hence
global optimality. Non-sufficiency is demonstrated by points x5 and x6 of Figure 3.2, which
are both stationary but not local solutions. Despite not being sufficient, the condition

Figure 3.3: Example 3.2.2.

suggests a procedure for solving differentiable, unconstrained optimization problems. First, find all of the stationary points (by solving ∇J(x) = 0). Then, assuming there are a finite number of such solutions, evaluate J for each of them, and choose the minimizer. The following example demonstrates this procedure in 1D.

Example 3.2.2. Find the minimum of the function J(x) = x²(x − 3)(x − 2).
Solution. A plot of J(x) is shown in Figure 3.3. We begin by computing the derivative of J(x).

    ∇J(x) = dJ/dx = d/dx [x²(x − 3)(x − 2)] = 4x³ − 15x² + 12x        (3.12)

∇J(x) is continuous everywhere. Since the problem is also unconstrained, we are assured by the first order necessary condition that any local solution must be stationary. The roots of ∇J(x) can be found with the standard formula for quadratic equations, or using Python, and they are {0, 1.16, 2.60} (green dots in the figure). Finally we evaluate J on each of the stationary points and choose the one with the least value: x* = 2.60 (red star in the figure).
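The same procedure in numpy (a sketch; any polynomial root finder would do):

import numpy as np

J = lambda x: x**2 * (x - 3) * (x - 2)

# Stationary points: roots of dJ/dx = 4x^3 - 15x^2 + 12x
roots = np.roots([4, -15, 12, 0])
x_star = roots[np.argmin(J(roots))]
print(np.sort(roots))      # [0.    1.157 2.593]
print(x_star, J(x_star))   # 2.593..., the global minimum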

When presented with an optimization problem, there are some important questions to con-
sider:

1. What is its size? That is, how many decision variables and constraints does it have?
Both the dimension and the number of constraints strongly influence the amount of
computation and time needed to solve a problem.

2. Are the decision variables real or integer-valued? Integer-valued problems are harder
to solve than real-valued problems, since many numerical methods rely on the gradient.
Fortunately, the problems that we will encounter in this course all involve real-valued
decision variables and differentiable objective functions.

3. Is the problem convex? The first order conditions become “supercharged” if we can
establish that the problem is convex.

3.3 Convex optimization problems

Convex optimization problems are ones with a special structure that makes them relatively
easy to solve. All other problems are non-convex. Non-convex problems are usually difficult
to solve in the global sense, although they can sometimes be solved in the local sense using
the first order condition.

A convex optimization problem is a minimization problem in which both the feasible set Ω and the cost function J are convex. The definitions of a convex set and a convex function are given next.

• A set is convex if for any two of its elements a and b, the line segment ab is entirely
contained in the set.

• A function is convex if its epigraph is a convex set. The epigraph of a function is the set of points that lie above the graph of the function. The epigraph of f(x) is epi(f) = {(x, y) | y ≥ f(x)}.

Figure 3.4: Convex sets and functions

These concepts are illustrated in Figure 3.4. On the left we see convex and non-convex
sets. For the convex set, no line segment that begins and ends within the set, leaves the
set. The non-convex set has a “dimple”, which violates convexity. Convex functions are
bowl-shaped. Their epigraph (the region above the function) is a convex set. A convex
optimization problem is therefore one with a dimple-less feasible set and a bowl-shaped cost
function.

3.3.1 Properties of convex optimization problems

There are two important facts about convex problems that make them relatively easy to
solve.

1. For convex problems, every local solution is a global solution. The practical implication
of this is that we can use local solvers (e.g. gradient descent) to find global solutions.

2. For convex problems with continuously di↵erentiable cost, stationary points are local
solutions. That is, situations such as that of points x5 and x6 in Figure 3.2 do not
arise in convex problems.

Together, these add left-facing arrows to Eq. 3.11, which now becomes:

    x is a global solution  ⟺  x is a local solution  ⟺  x is stationary

Hence, for unconstrained smooth convex problems, the set of stationary points is the same as the set of global solutions, and to solve such problems we need only solve the system of equations ∇J(x) = 0. We may, for simple problems, be able to find a solution analytically. In practice we use numerical methods such as Newton's method or gradient descent. Newton's method can be faster, but it requires knowledge of the Hessian of J (i.e. its second derivative). The gradient descent method requires knowledge only of ∇J, and it is the method that we will use in this course (Section 3.4).

Next we list a few important examples of convex sets and convex functions.

3.3.2 Examples of convex sets

Euclidean norm ball

The Euclidean norm of a vector x = (x1, . . . , xd) ∈ R^d, also known as the 2-norm, is a generalization of the standard 3D notion of distance to a d-dimensional vector space. We denote it with ‖x‖2.

    ‖x‖2 = ( Σ_{i=1}^{d} xi² )^{1/2}        (3.13)

Notice that in three dimensions, this is the straight-line distance from the origin to x: ‖x‖2 = √(x1² + x2² + x3²). A Euclidean norm ball (a.k.a. the 2-norm ball) is a generalization of a 3D sphere. It is defined as the set {x ∈ R^d : ‖x‖2 ≤ r}, where r is the radius of the ball. The Euclidean norm ball is convex.

p-norm balls

The Euclidean norm can itself be generalized by replacing the 2's in Eq. 3.13 with p's, where p is a positive integer. With p ≥ 1, the p-norm is defined as,

    ‖x‖p = ( Σ_{i=1}^{d} |xi|^p )^{1/p}        (3.14)

The p-norm ball of radius r is analogous to the Euclidean norm ball: it is the set of points whose p-norm is at most r: {x ∈ R^d : ‖x‖p ≤ r}. The p-norm ball is convex. Of course, nothing prevents us from using p < 1; however, the resulting function fails to be a norm, and its corresponding ball fails to be convex.

Figure 3.5: p-norm balls with unit radius (Image from Wikipedia)

Figure 3.5 shows some examples of p-norm balls in R². Note three important cases; a numerical check follows the list.

• 1-norm. With p = 1 the ball appears diamond-shaped. This is the so-called Manhattan, or taxicab, norm, because it measures distance only along vertical and horizontal displacements (like a taxicab driving through Manhattan).

• 2-norm. The Euclidean norm ball is a circle.

• ∞-norm. As p goes to infinity, the p-norm ball approaches a square, and Eq. 3.14 returns the largest absolute value among the components of x.

    ‖x‖∞ = max(|x1|, |x2|, . . . , |xd|)        (3.15)
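The three cases are easy to verify numerically; numpy's linalg.norm implements the p-norm for p ≥ 1 (a quick check, not from the text):

import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 1))       # 7.0  (Manhattan distance)
print(np.linalg.norm(x, 2))       # 5.0  (Euclidean distance)
print(np.linalg.norm(x, np.inf))  # 4.0  (largest absolute component)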

Affine equalities

An affine equality is a formula of the form,

    α1 x1 + α2 x2 + · · · + αd xd = β        (3.16)

Here the αi's and β are real numbers, and the xi's are the decision variables. The set of points that satisfy this formula is called a hyperplane, and it is the generalization of a 3D plane to d dimensions. We can arrange n such affine equality constraints into matrix form:

    Ax = b        (3.17)

where A is an n × d matrix whose coefficients are the α's, and b is an n × 1 column vector with the β's. The set of points that satisfy Eq. 3.17, or equivalently, the intersection of n hyperplanes, is convex.

Convex inequalities

A convex inequality is a formula of the form,

    g(x) ≤ 0        (3.18)

where g(x) is a convex function. The set of points x that satisfy a convex inequality is
convex. Because the intersection of any number of convex sets is also convex, we find that
the set of points satisfying any number of convex inequalities is a convex set.

This leads us to an important fact about convex optimization problems.

A constraint set consisting of affine equalities and convex inequalities defines a convex feasible set Ω.

In fact, all convex feasible sets are made up of affine equality constraints and convex
inequality constraints.

3.3.3 Convex functions

Here are some examples of convex functions. In each case, J is a function from R^D to R.

• Affine functions. J(x) = a · x + b, with a ∈ R^D, b ∈ R (a · x is the dot product of a and x).

• p-norms. J(x) = ‖x‖p with p ≥ 1.

• Function composition: J(x) = g(h(x)) is convex whenever h is affine and g is convex. Here h is a function that takes x ∈ R^D and returns a real number, and g takes that real number and returns another real number. In other words, convexity is preserved by composition with an affine function on the inside.

These examples will show up as cost functions later in the course.

Next we present numerical algorithms for solving optimization problems. Most of the problems that we will encounter are differentiable and have no constraints. The first-order optimality condition tells us that it will suffice to find a stationary point. This is exactly what the gradient descent method is designed to do.

3.4 Gradient descent

Gradient descent is a numerical technique for minimizing differentiable, real-valued functions.


The method can be understood by imagining the function as a hilly terrain, as depicted in
Figure 3.6. The algorithm advances like a person walking along the terrain in search of its
lowest point. The person can only look down at the ground – they cannot raise their sight

Figure 3.6: The gradient descent method

to look around. At each moment, they observe the slope of the ground under their feet, and
move in the downward direction. They continue in this way until they reach an area that is
flat, i.e. a local minimum.

Gradient descent advances in the direction of the negative gradient. In our two-dimensional
example, the gradient is the blue arrow in the horizontal plane, pointing in the direction of
steepest ascent. Its negative (the green arrow) indicates the direction of steepest descent.

The gradient descent algorithm begins with the arbitrary selection of a starting point x0 ∈ Ω. From there, it proceeds by taking steps to x1, x2, . . . until no more downward
progress can be made. If the problem is unconstrained, then we can conclude that the
algorithm has reached a stationary point. If furthermore the problem is convex, then we
have found a global minimum.

The update rule for gradient descent is,

    xk+1 = xk − γ ∇J(xk)        (3.19)

Here xk is the value after k steps and γ is the step size parameter. The step size can be kept fixed, or it can be varied with each step, either in a predetermined manner or in a way that depends on the current value of the gradient. There are strategies for varying γ that guarantee convergence to a local minimum. There are also bad choices of γ that prevent convergence: gradient descent does not converge to a local minimum unless the step size parameter is properly chosen.
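A minimal implementation sketch of Eq. 3.19 (the fixed step size and gradient-norm stopping rule are our choices, not prescriptions from the text):

import numpy as np

def gradient_descent(grad_J, x0, gamma=0.01, tol=1e-8, max_iter=100_000):
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        g = grad_J(x)
        if np.linalg.norm(g) < tol:   # (approximately) stationary: stop
            break
        x = x - gamma * g             # Eq. 3.19
    return x

# Example: the function of Example 3.2.2, with dJ/dx = 4x^3 - 15x^2 + 12x
grad = lambda x: 4*x**3 - 15*x**2 + 12*x
print(gradient_descent(grad, x0=3.0))   # ~2.593, the global minimum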

Although we will not study them here, it is worth noting that there are extensions of
gradient descent for problems with constraints. Equality constraints are typically treated by
appending them to the cost function using Lagrange multipliers. For inequality constraints,
projected gradient methods prevent the solution from leaving ⌦ by projecting the gradient
onto the boundary. Alternatively, the Frank-Wolfe algorithm finds feasible directions by
solving intermediate convex programs.

Next we will introduce a variant of gradient descent that is useful for optimization prob-
lems that are typical of data-based modeling techniques.

3.4.1 Stochastic gradient descent (SGD)

The techniques of supervised learning described in later chapters can all be described as different formulations of the following optimization problem.

    θ* = argmin_{θ ∈ R^d}  E[L(Y; θ)]        (3.20)

This is a type of optimization problem that we had not encountered before – one involving a random variable Y. The goal here is to choose the values of the d decision variables (θ) such that the expected value of the function L(Y; θ) is minimized. Since Y is a random variable, L(Y; θ) is random as well, and so it makes sense to consider its expected value. The function L is known as the loss function. It is not important at this point to understand the purpose

of the loss function, nor to understand the justification for this optimization problem. We
will get to that in Chapter 5. For now we are only interested in the mechanics of solving the
problem using a technique called stochastic gradient descent.

Figure 3.7: Stochastic cost function

Figure 3.7 provides an illustration in 2D. The search space is the set of feasible parameters θ = (θ1, θ2) – i.e. the horizontal plane. For each value of θ, the variations in Y produce variations in L(Y; θ). Hence, we can visualize L(Y; θ) as a "fuzzy" function and E[L(Y; θ)] as a smooth function that approximates it at every point. Our goal is to find the lowest point on this surface.

To do this, we use the given dataset D = {y1, . . . , yN} to produce an estimate of the objective function.

    E[L(Y; θ)] ≈ (1/N) Σ_{i=1}^{N} L(yi; θ)        (3.21)

The multiplicative factor 1/N does not influence the result, and can therefore be dropped.
Applying standard gradient descent (Eq. 3.19) to this problem produces the following update
rule.

    θk+1 = θk − γ ∇θ ( Σ_{i=1}^{N} L(yi; θ) )        (3.22)
         = θk − γ Σ_{i=1}^{N} ∇θ L(yi; θ)            (3.23)

The notation ∇θ indicates that the gradient is taken with respect to θ. Eq. 3.23 is the direct application of gradient descent to the stochastic optimization problem. This works well with small to medium-sized datasets. However, for large datasets (large N), the computation becomes inefficient, since each step involves N evaluations of ∇θ L – one for each data point yi. For N in the tens or hundreds of thousands, this is an excessive amount of computation for a single parameter update.

The idea behind SGD is simple: instead of basing the estimate of the gradient on the full dataset D, use a reduced dataset B, which we call a batch. Then the update becomes,

    θk+1 = θk − γ Σ_{yi ∈ B} ∇θ L(yi; θ)        (3.24)

The batch size is a parameter of the algorithm. However, since the sample mean is unbiased for all N, choosing a smaller batch will not affect the bias of the estimate, although it will increase its variance. The hope is that this increased variance is compensated by the more frequent steps taken by the algorithm, which allows it to advance quickly in the early stages.

Another parameter of SGD is the method by which we sample B from D. Certainly we


do not want any points in D to be ignored. Hence there are two reasonable approaches:
sampling with replacement and without replacement. To sample with replacement means to
choose each element of B randomly from the full D. Sampling without replacement means
that we partition D into an integer number K of batches (K = N/|B|), and we use one batch
per step. A single K-step pass through D is called an epoch. Typically, SGD will run for
many epochs.
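A minimal SGD sketch using the without-replacement scheme just described (one shuffle and partition of D per epoch, one update per batch; the function name and defaults are ours, not the text's):

import numpy as np

def sgd(grad_L, data, theta0, gamma=0.01, batch_size=32, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    N = len(data)
    for _ in range(epochs):                        # one epoch = one pass through data
        idx = rng.permutation(N)                   # shuffle, then partition into batches
        for start in range(0, N, batch_size):
            batch = data[idx[start:start + batch_size]]
            g = sum(grad_L(y, theta) for y in batch)   # Eq. 3.24
            theta = theta - gamma * g
    return theta

# Example: L(y; theta) = (y - theta)^2, whose minimizer is E[Y]
data = np.random.default_rng(1).normal(5.0, 2.0, size=1000)
grad = lambda y, theta: 2*(theta - y)
print(sgd(grad, data, theta0=0.0))   # ~5.0, up to SGD noise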

The stopping criterion for SGD is a bit tricky. The noisy nature of the algorithm means that it will bounce around without ever settling down, and hence we cannot simply look at ‖θk+1 − θk‖ to decide when to stop. Rather, as SGD advances, we track the performance of the resulting model and stop when the performance begins to deteriorate, due to a phenomenon known as overfitting. We will delve deeper into these topics in lab and later in the course.

Chapter 4

Statistical inference

Chapter 5

Supervised learning

