
Matthew Schwartz

Statistical Mechanics, Spring 2021

Lecture 1: Probability

1 Basic probability
We are going to be dealing with systems with enormous degrees of freedom, typically governed by
Avogadro's number $N_A = 6.02 \times 10^{23}$. This is the number of hydrogen atoms in a gram, or more
intuitively, the number of molecules of water in a tablespoon. Even a tiny cell, with a diameter of
only 100 microns ($10^{-4}$ m), contains a trillion molecules. In most areas of physics, we work with
small numbers (the fine structure constant $\alpha = \frac{1}{137}$ for example), and calculate things as a Taylor
series in the coupling, $f(\alpha) = \sum c_n \alpha^n$, often keeping only the leading term $f(\alpha) \approx c_1 \alpha$. In statistical
mechanics, we work with a large number $N$ and calculate things as a Taylor expansion in $\frac{1}{N}$, often
keeping only the leading term ($N = \infty$). The key to doing this is not to ask what each particle is
doing, which would be both impossible and impractical, but rather to ask what the probability is
that a particle is doing something. It is imperative therefore to begin statistical mechanics with
statistics.
In general, we will be interested in probabilities of states of a system, which we write as $P_a$ or
$P(a)$. The parameter $a$ represents the microstate, e.g. the positions $\{\vec{q}_i\}$ and momenta $\{\vec{p}_i\}$ of
all the particles in a gas, or the square of the wavefunction $|\psi(\vec{q})|^2$ in quantum mechanics. We
will sometimes think of $a$ as a discrete index (e.g. if we flip a coin, it can land heads up with
$P_H = \frac{1}{2}$ or tails up with $P_T = \frac{1}{2}$) and sometimes continuous. In the continuous case, we call $P(x)$
the probability density, so that $\int_{x_1}^{x_2} P(x)\, dx$ is the probability of finding $x$ between $x_1$ and $x_2$.
Probability densities only become probabilities when integrated.
We will get to know a number of different probability distributions:

Gaussian: $P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - x_0)^2}{2\sigma^2} \right)$   (1)

Poisson: $P_m(t) = \frac{(\lambda t)^m}{m!}\, e^{-\lambda t}$   (2)

Binomial: $B_N(m) = \frac{N!}{m!\,(N-m)!}\, a^m b^{N-m}$   (3)

Lorentzian: $P(x) = \frac{\Gamma}{2\pi}\, \frac{1}{(x - x_0)^2 + \frac{\Gamma^2}{4}}$   (4)

Flat: $P(x) = \text{constant}$   (5)
Probability distributions are always normalized so that they integrate/sum to 1:

$\int dx\, P(x) = 1; \qquad \sum_a P_a = 1$   (6)

Given a probability distribution, we can calculate the expected value of any observable by
integrating/summing against the probability. For example, the expected value of x (the mean) is
$\bar{x} \equiv \langle x \rangle = \int dx\, x\, P(x)$   (7)

or the mean-square is

$\langle x^2 \rangle = \int dx\, x^2\, P(x)$   (8)

The variance of a distribution is the difference between the mean of the square and the square of
the mean:

$\mathrm{Var} \equiv \langle x^2 \rangle - \langle x \rangle^2$   (9)


The square root of the variance is called the standard deviation.


$\sigma \equiv \sqrt{\langle x^2 \rangle - \langle x \rangle^2}$   (10)
While the mean has the intuitive interpretation as the expected outcome, variance is more subtle.
Indeed, developing intuition for variance is a key to mastering statistics. The key point is that the
expected value is worthless if you don't know how likely that value is.
For example, a Gaussian has two parameters, $x_0$ and $\sigma_0$. The first parameter is the mean:

$\langle x \rangle = \int_{-\infty}^{\infty} dx\, x\, \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left( -\frac{(x - x_0)^2}{2\sigma_0^2} \right) = x_0$   (11)

The mean of $x^2$ is

$\langle x^2 \rangle = \sigma_0^2 + x_0^2$   (12)

So that the standard deviation is $\sigma = \sqrt{\langle x^2 \rangle - \langle x \rangle^2} = \sigma_0$. This is why we usually just write $\sigma$ instead
of $\sigma_0$ for the parameter of the Gaussian.
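If you want to check Eqs. (11) and (12) yourself, here is a minimal Mathematica sketch (s0 and x0 stand for $\sigma_0$ and $x_0$):

Pg[x_] := 1/(Sqrt[2 Pi] s0) Exp[-(x - x0)^2/(2 s0^2)];   (* the Gaussian of Eq. (1) *)
Integrate[x Pg[x], {x, -Infinity, Infinity}, Assumptions -> s0 > 0]     (* x0, Eq. (11) *)
Integrate[x^2 Pg[x], {x, -Infinity, Infinity}, Assumptions -> s0 > 0]   (* s0^2 + x0^2, Eq. (12) *)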
The standard deviation has an interpretation as the width of a distribution: how far you can
go from the mean before the probability has decreased substantially. For example, in a Gaussian,
the probability of finding $x$ between $x_0 - \sigma$ and $x_0 + \sigma$ is

$\int_{x_0 - \sigma}^{x_0 + \sigma} dx\, P(x) = 0.68$   (13)

So, for a Gaussian, there is a 68% chance that the value of $x$ falls within 1 standard deviation of the mean.
We will often be interested in situations where the mean is zero. Then the standard deviation
is equivalent to the root-mean-square
$x_{\rm RMS} = \sqrt{\langle x^2 \rangle}$   (14)

For example, in a gas the velocities point in random directions, so $\langle \vec{v}\, \rangle = 0$. Thus the characteristic
speed of a gas is characterized not by the mean but by the RMS velocity $v_{\rm RMS} = \sqrt{\langle \vec{v}^{\,2} \rangle}$.
Another important concept is how probability distributions behave when they are combined.
For example, say $P_A(x)$ and $P_B(y)$ are the probabilities of winning $x$ dollars when betting on horse
A and $y$ dollars when betting on horse B. The probability of getting $z$ total dollars is then

$P_{AB}(z) = \int_{-\infty}^{\infty} dx\, P_A(z - x)\, P_B(x)$   (15)

This is the definition of the mathematical operation of convolution between two functions. We
say $P_{AB}$ is the convolution of $P_A$ and $P_B$ and write it as

$P_{AB} = P_A \ast P_B$   (16)
Convolutions are extremely important in statistical mechanics, since we often measure only the
sum of a great many independent processes. For example, the pressure on the wall of a container
is due to the sum of the forces of all the little molecules hitting it, each with its own probability.
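As a quick illustration, here is a hedged Mathematica sketch: the two payoff distributions below are invented for the example, but histogramming the sum of independent draws approximates the convolution of Eq. (15) for any choice.

zA = RandomVariate[NormalDistribution[10, 3], 10^5];       (* hypothetical payoffs from horse A *)
zB = RandomVariate[ExponentialDistribution[1/5], 10^5];    (* hypothetical payoffs from horse B *)
Histogram[zA + zB, 100, "PDF"]                             (* approximates (P_A * P_B)(z) *)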

1.1 Examples
Consider the system of a gas molecule bouncing around in a 1D box of size L centered on x = 0. If
there are no external forces and no position-dependent interactions, the molecule is equally likely
to be anywhere in the box. So
$P(x) = \frac{1}{L}$   (17)
The mean value of the position of the molecule is
$\langle x \rangle = \frac{1}{L} \int_{-L/2}^{L/2} dx\, x = 0$   (18)

Similarly, the mean value of $x^2$ is

$\langle x^2 \rangle = \frac{1}{L} \int_{-L/2}^{L/2} dx\, x^2 = \frac{L^2}{12}$   (19)

So that the standard deviation is

$\sigma = \sqrt{\langle x^2 \rangle - \langle x \rangle^2} = \frac{1}{\sqrt{12}} L \approx 0.29 L$   (20)

Note that the probability of finding $x$ within $\langle x \rangle \pm \sigma$ is $\frac{2\sigma}{L} = 58\%$. It is not 68% because the
probability distribution is not Gaussian. This illustrates that the interpretation of $\sigma$ as a 68%
confidence interval is not always accurate.
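These numbers are easy to reproduce; a minimal Mathematica check of Eqs. (18)-(20) and the 58% figure:

mean = Integrate[x/L, {x, -L/2, L/2}]                 (* 0, Eq. (18) *)
meansq = Integrate[x^2/L, {x, -L/2, L/2}]             (* L^2/12, Eq. (19) *)
sigma = Simplify[Sqrt[meansq - mean^2], L > 0]        (* L/(2 Sqrt[3]), about 0.29 L *)
Simplify[Integrate[1/L, {x, -sigma, sigma}], L > 0]   (* 1/Sqrt[3], about 0.58 *)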
Suppose instead that there is some electric field so that the particles in the box are more likely
to be on one side than the other. We might find some crazy function $P(x) = \frac{0.74}{L} \ln(1 + e^{2x/L})$ for
these probabilities. Then, by numerical integration we find

$\langle x \rangle = 0.59 L; \qquad \langle x^2 \rangle = 0.42 L^2; \qquad \sigma = 0.28 L$   (21)

Also, $\int_{\langle x \rangle - \sigma}^{\langle x \rangle + \sigma} P(x)\, dx = 0.6$, so 60% of the probability is within $\langle x \rangle \pm \sigma$. This is just a contrived example. You should be
able to compute $\langle x \rangle$ and $\sigma$ with any function $P(x)$, at least numerically, and you will generally
find that not exactly 68% is within $\langle x \rangle \pm \sigma$, but often you get something close.

2 Law of large numbers


An extremely important result from probability is that even if $P(x)$ is very complicated, when you
average over many measurements, the result dramatically simplifies. More precisely, the law of
large numbers states that

    The average of the results from a set of independent trials
    varies less and less the more trials are performed

More mathematically, we can state it this way:

• If $P(x)$ has standard deviation $\sigma$, then the probability $P_N(x)$ of finding that the average
  over $N$ draws from $P(x)$ is $x$ will have standard deviation $\frac{\sigma}{\sqrt{N}}$.

Thus as $N \to \infty$, the standard deviation of the average, $\frac{\sigma}{\sqrt{N}}$, goes to 0.
To derive the law of large numbers, let's consider the probability distribution for the center of
mass of molecules in a box. Say there are $N$ molecules in the box and the probability function for
finding each is $P(x)$. Some examples for $P(x)$ are in Section 1.1. We assume that the probabilities
for each molecule are independent: having one at $x$ does not tell us anything about where the
others might be. In this case, what is the mean value of the center of mass of the system? We'll
write $\langle x \rangle_N$, $\langle x^2 \rangle_N$ and $\sigma_N$ for quantities involving the $N$-body system and drop the subscript for
the $N = 1$ case: $\langle x \rangle_1 = \langle x \rangle$ and $\sigma_1 = \sigma$.
For $N = 2$, the center of mass is $x = \frac{x_1 + x_2}{2}$, so the mean value of the center of mass is

$\langle x \rangle_2 = \int_{-L/2}^{L/2} dx_1 \int_{-L/2}^{L/2} dx_2\, P(x_1) P(x_2)\, \frac{x_1 + x_2}{2} = \int_{-L/2}^{L/2} dx_1\, P(x_1)\, \frac{x_1}{2} + \int_{-L/2}^{L/2} dx_2\, P(x_2)\, \frac{x_2}{2} = \langle x \rangle$   (22)

So the mean value for 2 molecules is the same as for 1 molecule. The expectation of $x^2$ with 2
molecules is

$\langle x^2 \rangle_2 = \int_{-L/2}^{L/2} dx_1 \int_{-L/2}^{L/2} dx_2\, P(x_1) P(x_2) \left( \frac{x_1 + x_2}{2} \right)^2$   (23)

$= \frac{1}{4} \int_{-L/2}^{L/2} dx_1\, P(x_1)\, x_1^2 + \frac{1}{2} \int_{-L/2}^{L/2} dx_1\, P(x_1)\, x_1 \int_{-L/2}^{L/2} dx_2\, P(x_2)\, x_2 + \frac{1}{4} \int_{-L/2}^{L/2} dx_2\, P(x_2)\, x_2^2$   (24)

$= \frac{1}{2} \langle x^2 \rangle + \frac{1}{2} \langle x \rangle^2$   (25)

So the standard deviation of the center-of-mass for 2 particles is:

$\sigma_2 = \sqrt{\langle x^2 \rangle_2 - (\langle x \rangle_2)^2} = \sqrt{\frac{1}{2}\langle x^2 \rangle + \frac{1}{2}\langle x \rangle^2 - \langle x \rangle^2} = \frac{1}{\sqrt{2}} \sqrt{\langle x^2 \rangle - \langle x \rangle^2} = \frac{\sigma}{\sqrt{2}}$   (26)

That is, the standard deviation has shrunk by a factor of $\sqrt{2}$ from the one-particle case for any
$P(x)$.
Now say there are $N$ particles. The mean value of the center of mass is

$\langle x \rangle_N = \int_{-L/2}^{L/2} dx_1 \cdots dx_N\, P(x_1) \cdots P(x_N)\, \frac{x_1 + \cdots + x_N}{N} = N \left[ \frac{1}{N} \int_{-L/2}^{L/2} dx_1\, x_1 P(x_1) \right] = \langle x \rangle$   (27)

independent of $N$. The expectation value of $x^2$ is

$\langle x^2 \rangle_N = \int_{-L/2}^{L/2} dx_1 \cdots dx_N\, P(x_1) \cdots P(x_N) \left( \frac{x_1 + \cdots + x_N}{N} \right)^2$   (28)

When we expand $(x_1 + \cdots + x_N)^2$ there are $N$ terms that give $\langle x^2 \rangle$ and the remaining $(N^2 - N)$
terms are the same as $\langle x_1 x_2 \rangle = \langle x \rangle^2$. So,

$\langle x^2 \rangle_N = \int_{-L/2}^{L/2} dx_1 \cdots dx_N\, P(x_1) \cdots P(x_N)\, \frac{1}{N^2} \left[ N x_1^2 + (N^2 - N)\, x_1 x_2 \right]$   (29)

$= \frac{1}{N} \langle x^2 \rangle + \left( 1 - \frac{1}{N} \right) \langle x \rangle^2$   (30)
Therefore

$\sigma_N = \sqrt{\langle x^2 \rangle_N - \langle x \rangle^2} = \frac{1}{\sqrt{N}} \sqrt{\langle x^2 \rangle - \langle x \rangle^2} = \frac{\sigma}{\sqrt{N}}$   (31)

The appearance of $\sqrt{N}$ is called the law of large numbers. Note that Eq. (31), describing how
the standard deviation scales as we average over many molecules, holds for any function $P(x)$.
Different $P(x)$ will give different values of $\sigma$, but the relation between $\sigma_N$ with $N$ molecules and
$\sigma$ with one molecule is universal.
For the gas in the box with a flat $P(x) = \frac{1}{L}$, as in Section 1.1, the expected value of the center
of mass is $\langle x \rangle_N = 0$, just like for any individual gas molecule, and the standard deviation is
$\sigma_N = \frac{\sigma}{\sqrt{N}} \approx 10^{-11} \frac{L}{\sqrt{12}}$. Thus, even though we don't know very well where any of the molecules are,
we know the center of mass to extraordinary precision.
The law of large numbers is the reason that statistical mechanics is possible: we can compute
macroscopic properties of systems (like the center of mass, or pressure, or all kinds of other things)
with great confidence even if we don't know exactly what is going on at the microscopic level.
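Here is a minimal Monte Carlo sketch of Eq. (31) in Mathematica; the exponential distribution below is just an arbitrary stand-in for $P(x)$:

dist = ExponentialDistribution[2];     (* any P(x) will do *)
ndraws = 100;
averages = Table[Mean[RandomVariate[dist, ndraws]], {10^4}];
{StandardDeviation[averages], N[StandardDeviation[dist]/Sqrt[ndraws]]}   (* roughly agree, Eq. (31) *)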

3 Central Limit Theorem


We saw that when we average over a large number $N$ of draws from a probability distribution
$P(x)$, the mean stays fixed and the standard deviation shrinks, $\sigma \to \frac{\sigma}{\sqrt{N}}$. What can we say about
the shape of the probability distribution $P_N(x)$? It turns out we can say a lot. In fact, in the limit
$N \to \infty$ we know $P_N(x)$ exactly: it is a Gaussian!
More precisely, the central limit theorem states that

    When any probability distribution is sampled $N$ times,
    the average of the samples approaches a Gaussian distribution as $N \to \infty$,
    with width scaling like $\sigma \sim \frac{1}{\sqrt{N}}$

There are a lot of ways to prove it. I find the moment approach the most accessible, as discussed
next. Another proof using convolutions and Fourier transforms is in Appendix C.

3.1 CLT proof using moments


One way to prove the central limit theorem is by computing moments. If you specify the complete
set of moments of a function, you know its shape completely. These moments are

mean: $\bar{x} = \langle x \rangle$   (32)

variance: $\sigma^2 = \langle (x - \bar{x})^2 \rangle = \langle x^2 \rangle - \bar{x}^2$   (33)

skewness: $S = \frac{\langle (x - \bar{x})^3 \rangle}{\sigma^3} = \frac{1}{\sigma^3} \left[ \langle x^3 \rangle - 3\bar{x}\langle x^2 \rangle + 2\bar{x}^3 \right]$   (34)

kurtosis: $K = \frac{\langle (x - \bar{x})^4 \rangle}{\sigma^4}$   (35)

$n$th moment: $M_n = \frac{\langle (x - \bar{x})^n \rangle}{\sigma^n}$   (36)
Skewness measures how asymmetric a distribution is around its mean. Kurtosis measures the 4th
moment; more intuitively, higher kurtosis means a probability distribution has a longer tail, i.e. more
outliers from the mean. The higher moments do not have simple interpretations.
Notice that all the higher-order moments are normalized by dividing by powers of $\sigma$ so that
they are dimensionless. To understand this normalization, imagine plotting $P_N(x)$, but shift it to
center around $x = 0$ and rescale the $x$ axis by $\sigma$ so that the width is always 1. Then the curve will
not get any narrower as $N \to \infty$ because its width is fixed to be 1, but its shape may change. The
shape is determined by the numbers $M_n$ with $n > 2$. See Fig. 2 below for an example.
For the Gaussian probability distribution in Eq. (1) the moments are easy to calculate in
Mathematica:

$\bar{x} = x_0; \quad \sigma = \sigma; \quad S = 0; \quad K = 3; \quad M_5 = 0; \quad M_6 = 15; \quad M_7 = 0; \quad M_8 = 105; \;\cdots$ (Gaussian)   (37)

Note that the skewness is zero for a Gaussian because it is symmetric. For a Gaussian, in fact, all
the odd moments ($M_n$ with $n$ odd) vanish. The even moments, normalized to powers of $\sigma$, are
dimensionless numbers given by the formula

$M_n = \begin{cases} 0, & n \text{ odd} \\ \dfrac{2^{-n/2}\, n!}{(n/2)!}, & n \text{ even} \end{cases}$   (38)

These $M_n$ completely determine the shape of a Gaussian. If a function has all of these moments,
it is a Gaussian.
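Indeed, the check takes one line in Mathematica using the built-in CentralMoment (s stands for $\sigma$):

g = NormalDistribution[x0, s];
Table[CentralMoment[g, n]/s^n, {n, 3, 8}]   (* {0, 3, 0, 15, 0, 105}, matching Eqs. (37)-(38) *)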
Now let's compute the moments of the center of mass of our $N$ molecules-in-a-box with probability
$P(x)$. We'll do this for a general $P(x)$, but shift the domain so that $\langle x \rangle = \bar{x} = 0$ in order to
simplify the formulas in Eqs. (33)-(36). For example, the 3rd moment of $P_N(x)$ is

$\langle x^3 \rangle_N = \int_{-L/2}^{L/2} dx_1 \cdots dx_N\, P(x_1) \cdots P(x_N) \left( \frac{x_1 + \cdots + x_N}{N} \right)^3$   (39)

Since $\langle x \rangle = 0$ the only terms in this expression which don't vanish are the ones of the form $x_j^3$. So

$\langle x^3 \rangle_N = \frac{1}{N^2} \langle x^3 \rangle$   (40)

We conclude that the skewness $S_N$ with $N$ molecules is related to the skewness $S_1$ for 1 molecule by

$S_N = \frac{\langle (x - \bar{x})^3 \rangle_N}{\sigma_N^3} = \frac{\langle (x - \bar{x})^3 \rangle / N^2}{(\sigma/\sqrt{N})^3} = \frac{S_1}{\sqrt{N}}$   (41)

In particular, the skewness goes to zero as $N \to \infty$. That is, the distribution becomes more and
more symmetric about the mean as $N \to \infty$.
Now let's look at the 4th moment, the kurtosis. Following the same method we need

$\langle x^4 \rangle_N = \int_{-L/2}^{L/2} dx_1 \cdots dx_N\, P(x_1) \cdots P(x_N) \left( \frac{x_1 + \cdots + x_N}{N} \right)^4$   (42)

In this case, since $\langle x \rangle = 0$, the terms that don't vanish are $x_j^4$ or $x_i^2 x_j^2$ with $i \neq j$. Thinking about the
combinatorics a little you can convince yourself that there are $N$ terms of the form $x_i^4$ and $3N(N-1)$
terms of the form $x_i^2 x_j^2$.$^1$ So,

$\langle x^4 \rangle_N = \frac{1}{N^3} \langle x^4 \rangle + \frac{3(N-1)}{N^3} \langle x^2 \rangle \langle x^2 \rangle$   (43)

Then, calling $K_1 = \frac{1}{\sigma^4}\langle x^4 \rangle$ the kurtosis for $N = 1$, we have

$K_N = \frac{\langle (x - \bar{x})^4 \rangle_N}{\sigma_N^4} = \frac{1}{\sigma^4/N^2} \left[ \frac{1}{N^3} \langle x^4 \rangle + \frac{3(N-1)}{N^3} \langle x^2 \rangle \langle x^2 \rangle \right] = \frac{K_1}{N} + 3\left( 1 - \frac{1}{N} \right)$   (44)

This is interesting: it says that as $N \to \infty$ the kurtosis $K_N \to 3$, independent of the kurtosis of the
one-particle probability distribution! So the skewness goes to zero and the kurtosis goes to 3.
For the 6th moment, the term which dominates at large $N$ is the non-vanishing one with the
largest combinatoric factor: $\langle x^2 \rangle^3$. There are $\binom{N}{3}\binom{6}{2}\binom{4}{2} = \frac{1}{6}N(N-1)(N-2) \cdot 15 \cdot 6 \approx 15\, N^3$ of
these. So $(M_6)_N \to 15$. Similarly, $(M_8)_N \to 105$. In other words, for any $P(x)$ we find that as $N \to \infty$

$S_N \to 0; \quad K_N \to 3; \quad (M_5)_N \to 0; \quad (M_6)_N \to 15; \quad (M_7)_N \to 0; \quad (M_8)_N \to 105; \;\cdots$   (45)

What we are seeing is that at large $N$ all of the higher moments go to those of a Gaussian! If you
work out the details, the general formula is

$(M_n)_N \to \begin{cases} 0, & n \text{ odd} \\ \dfrac{2^{-n/2}\, n!}{(n/2)!}, & n \text{ even} \end{cases}$   (46)

in exact agreement with the moments of a Gaussian, Eq. (38). Thus we always get a Gaussian and the
central limit theorem is proven. Another proof using convolutions is in Appendix C.
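A quick numerical illustration of Eqs. (41), (44) and (45), using an exponential distribution (a deliberately skewed choice) as the single-draw $P(x)$:

avg[n_] := Table[Mean[RandomVariate[ExponentialDistribution[1], n]], {2 10^4}];
{Skewness[avg[1]], Kurtosis[avg[1]]}        (* roughly {2, 9}: far from Gaussian *)
{Skewness[avg[100]], Kurtosis[avg[100]]}    (* close to {0, 3}: the Gaussian values *)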

3.2 Combining flat distributions


Because the central limit theorem is so important, let's try to understand why it is true more
physically. Again, say we have some probability distribution $P(x)$ for molecules in a box, with
$-\frac{L}{2} < x < \frac{L}{2}$. We want to pick $N$ molecules and compute their mean position (center of mass
position) $x = \frac{1}{N}\sum_j x_j$. What is the probability distribution $P_N(x)$ that the mean value is $x$?
To be concrete, let's take the flat distribution $P(x) = \frac{1}{L}$. For $N = 1$, we pick only one molecule with
position $x_1$. Then $x = x_1$ and so $P_1(x) = \frac{1}{L}$: any value for the center-of-mass position is equally likely.
Now say $N = 2$, so we pick two molecules with positions $x_1$ and $x_2$. What is the probability that
they will have mean $x$? For a given $x$ we need $\frac{x_1 + x_2}{2} = x$. For example, if $x = 0$, then for any $x_1$ there
is an $x_2$ that works, namely $x_2 = -x_1$. However, if the mean is all the way on the edge, $x = \frac{L}{2}$, then
not all $x_1$ work; in fact, we need both $x_1$ and $x_2$ to be exactly $\frac{L}{2}$. Thus there are fewer possibilities
when $x$ is close to the boundaries of the box than if $x$ is central. One way to see this is graphically

1. There are $\binom{N}{1} = N$ of the $x_j^4$ terms. There are $\binom{N}{2} = \frac{N!}{2!\,(N-2)!} = \frac{N(N-1)}{2}$ possible pairs $i \neq j$ and there
are $\binom{4}{2} = 6$ ways of picking which two of the 4 terms in the expansion are $i$. So the total number of these terms is
$3N(N-1)$.

Figure 1. The regions in the $x_1$/$x_2$ plane with mean value $x$ are diagonal lines for $L = 2$. The length of
the line is the probability $P_2(x)$. For $x = 0$, the line is longest and probability greatest. For $x = 1$, the line
reduces to a point and the probability to zero.

To be quantitative, the easiest way to calculate the probability is with the Dirac $\delta$-function
$\delta(x)$ (see Appendix A for a refresher on $\delta(x)$). Using the $\delta$-function, we can write the probability
for getting a mean value $x = \frac{x_1 + x_2}{2}$ as

$P_2(x) = \int_{-L/2}^{L/2} dx_1\, P(x_1) \int_{-L/2}^{L/2} dx_2\, P(x_2)\, \delta\left( \frac{x_1 + x_2}{2} - x \right)$   (47)

This is another way of writing a convolution, as in Eq. (15): $P_2 = P \ast P$.


As a check, we can verify that this probability distribution is normalized correctly:

$\int_{-L/2}^{L/2} dx\, P_2(x) = \int_{-L/2}^{L/2} dx \int_{-L/2}^{L/2} dx_1\, P(x_1) \int_{-L/2}^{L/2} dx_2\, P(x_2)\, \delta\left( \frac{x_1 + x_2}{2} - x \right)$

$\qquad\qquad = \int_{-L/2}^{L/2} dx_1\, P(x_1) \int_{-L/2}^{L/2} dx_2\, P(x_2) = 1$   (48)

where we have used the $\delta$-function to integrate over $x$ to get to the second line.
To evaluate $P_2(x)$ we first pull a factor of 2 out of the $\delta$-function using Eq. (82), giving

$P_2(x) = 2 \int_{-L/2}^{L/2} dx_1\, P(x_1) \int_{-L/2}^{L/2} dx_2\, P(x_2)\, \delta(x_1 + x_2 - 2x)$   (49)

Now, the $\delta$-function can only fire if its argument hits zero in the integration region. Since $\frac{x_1 + x_2}{2} = x$,
we can solve for $x_1 = 2x - x_2$. If $x < 0$ then the most $x_1$ can be is $2x - \left( -\frac{L}{2} \right) = \frac{L}{2} + 2x$. In other
words, we have

$P_2(x < 0) = 2 \int_{-L/2}^{L/2 + 2x} dx_1\, P(x_1)\, P(2x - x_1)$   (50)
Taking the flat distribution $P(x) = \frac{1}{L}$, this evaluates to $P_2(x < 0) = \frac{2L + 4x}{L^2}$. Similarly, for $x > 0$
the limit is $x_1 > 2x - \frac{L}{2}$ and for a flat distribution $P_2(x > 0) = \frac{2L - 4x}{L^2}$. Thus we have

$P_2(x) = \begin{cases} \dfrac{2L + 4x}{L^2}, & x < 0 \\ \dfrac{2L - 4x}{L^2}, & x > 0 \end{cases}$   (51)
You can also check this by evaluating Eq. (47) with Mathematica (here $L = 2$, so $P(x_1) = P(x_2) = \frac{1}{2}$,
and the overall factor $2 \cdot \frac{1}{2} \cdot \frac{1}{2}$ comes from rescaling the $\delta$-function as in Eq. (49)):

P = 2 (1/2) (1/2) Integrate[DiracDelta[x1 + x2 - 2 x], {x1, -1, 1}, {x2, -1, 1}];
Plot[P, {x, -1, 1}]

For $N = 3$ we compute

$P_3(x) = \int_{-L/2}^{L/2} dx_1\, P(x_1) \int_{-L/2}^{L/2} dx_2\, P(x_2) \int_{-L/2}^{L/2} dx_3\, P(x_3)\, \delta\left( \frac{x_1 + x_2 + x_3}{3} - x \right)$   (52)
and so on. These successive approximations look like

Figure 2. The average position of N = 1, 2, 3, 4 particles, each of which separately has a flat probability
distribution.

We see that already at N = 4 the flat probability distribution is becoming a Gaussian. Note
also that the widths of the distributions are getting narrower.
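You can reproduce the trend in Figure 2 with a few lines of Mathematica; this is only a sampling sketch, with the box taken to run from $-\frac{1}{2}$ to $\frac{1}{2}$:

avg[n_] := Table[Mean[RandomVariate[UniformDistribution[{-1/2, 1/2}], n]], {10^5}];
GraphicsRow[Histogram[avg[#], 50, "PDF"] & /@ {1, 2, 3, 4}]   (* flat, then triangular, then increasingly Gaussian *)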
The central limit theorem says that the distribution of the mean of $N$ draws from a probability
distribution approaches a Gaussian of width $\frac{\sigma}{\sqrt{N}}$ as $N \to \infty$, independent of the original
probability distribution. That is,

$P_N(x) \to \sqrt{\frac{N}{2\pi\sigma^2}}\, \exp\left( -N\frac{(x - \bar{x})^2}{2\sigma^2} \right)$   (53)
Sometimes we sum the values of the draws from a distribution instead of averaging them. In
this case, the mean grows as $\bar{x} \to N\bar{x}$ and the standard deviation grows like $\sigma \to \sqrt{N}\,\sigma$. Thus an
equivalent phrasing of the central limit theorem is

• Central Limit Theorem: A function with mean $\bar{x}$ and standard deviation $\sigma$ convolved
  with itself $N$ times approaches a Gaussian with mean $N\bar{x}$ and standard deviation $\sqrt{N}\,\sigma$ as
  $N \to \infty$.

Summing the values is what happens when you convolve a function with itself. So for summing
the values, the central limit theorem has the form

$P_N^{\rm sum}(x) = \underbrace{P \ast P \ast \cdots \ast P}_{N} \to \frac{1}{\sqrt{2\pi N \sigma^2}} \exp\left( -\frac{(x - N\bar{x})^2}{2N\sigma^2} \right)$   (54)

A proof of the CLT using convolutions is in Appendix C.


We put the "sum" superscript to remind ourselves that we sum the values from each draw from
$P(x)$ rather than average their values. The relation is simply

$P_N^{\rm sum}(x) = \frac{1}{N} P_N\left( \frac{x}{N} \right)$   (55)

The $\frac{1}{N}$ comes from the fact that the probability distributions are differential, so we should technically
write $P_N^{\rm sum}(x)\, dx = P_N\left( \frac{x}{N} \right) d\frac{x}{N}$. Note when we average $\bar{x} \to \bar{x}$ and $\sigma \to \frac{\sigma}{\sqrt{N}}$, and when we sum
$\bar{x} \to N\bar{x}$ and $\sigma \to \sqrt{N}\,\sigma$, so either way

$\frac{\sigma}{\bar{x}} \to \frac{1}{\sqrt{N}} \frac{\sigma}{\bar{x}}$   (56)

Thus a foolproof way to think of the scaling is that the dimensionless ratio $\frac{\sigma}{\bar{x}}$ should decrease as
$\frac{1}{\sqrt{N}}$.

3.3 Why we take logarithms in statistical mechanics


In statistical mechanics, we will make great use of the central limit theorem. Generally we
have systems composed of enormously large numbers of particles, $N \sim$ Avogadro's number $\sim 10^{24}$.
The things we measure are macroscopic: the pressure a gas puts on a wall is the average pressure.
Microscopically, the gas has a bunch of little molecules hitting and bouncing off the wall, and the
force these molecules impart is constantly varying. We don't care about these tiny fluctuations,
just the average. So any time we try to measure something, like the pressure in a gas, or the
concentration of a chemical, we will necessarily be averaging over an enormous number of fluctuations.
Because of the central limit theorem, the distribution of any macroscopic quantity will be close to
a Gaussian around its mean. The central limit theorem itself doesn't tell us what the mean is, or
how various macroscopic quantities are related; we need physics for that. But it tells us that we
don't need to worry about the precise details of the microscopic description.
Normally when a function $f(x)$ is rapidly falling away from $x \approx \bar{x}$, we Taylor expand around $x = \bar{x}$
and keep the first few terms. We can do this for $P_N(x)$ too. However, the Taylor expansion of a
Gaussian has an infinite number of terms:

$e^{-\frac{x^2}{2\sigma^2}} = \sum_{m=0}^{\infty} \frac{1}{m!}\left( -\frac{x^2}{2\sigma^2} \right)^m = 1 - \frac{x^2}{2\sigma^2} + \frac{1}{2}\left( \frac{x^2}{2\sigma^2} \right)^2 - \frac{1}{6}\left( \frac{x^2}{2\sigma^2} \right)^3 + \cdots$   (57)

You need all the terms to reconstruct the original Gaussian. However, if we take the logarithm
first, then Taylor expand, we find

$\ln e^{-\frac{x^2}{2\sigma^2}} = -\frac{x^2}{2\sigma^2}$   (58)

with only one term. So it will be extremely convenient to start taking the logarithms of our
probabilities. By the central limit theorem, when we average the values,

$\ln P_N(x) \to -N\frac{(x - \bar{x})^2}{2\sigma^2} + \ln\sqrt{\frac{N}{2\pi\sigma^2}}$   (59)

As $N \to \infty$ there are no higher-order terms.
In other words, a Gaussian is an unusual function. It is flat near the peak, but then quickly
drops off and has a long tail. Since the function is smooth near the peak, it's hard to know what's
going on at the tail from expanding near the peak. In particular, you have to work very hard to
get information about points with $x \gtrsim \sigma$ from information at the peak. Taking the logarithm puts
the peak and the tail on the same footing. Of course, we can't get something for nothing: taking
logarithms alone won't solve any problems. But taking logarithms often makes it easier to solve
problems. We will see many examples of this as the course progresses.
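A small Mathematica illustration of the contrast between Eq. (57) and Eq. (58) (s stands for $\sigma$):

Series[Exp[-x^2/(2 s^2)], {x, 0, 8}]   (* Eq. (57): new terms appear at every order *)
(* taking the log first, as in Eq. (58), leaves just -x^2/(2 s^2), with no higher-order terms *)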

4 Poisson distribution
In many physical situations, there is a large number $N$ of possible events, each occurring with very
small probability in a given time interval. For example, if you put a glass out in the rain, there
are lots of possible drops of water that could fall into the glass, but each has a small probability.
Or you have lots of friends on Instagram, each of whom has a small probability of posting something
interesting. Or we have a gas of molecules and each one has a small chance of being in some tiny
volume. Probabilities in situations like this, where each event is uncorrelated with the previous
event, are described by the Poisson distribution.
Let's take a concrete example: radioactive decay. A block of $^{235}$U has $N \approx 10^{24}$ atoms, each of
which can decay with a tiny probability

$dP = \lambda\, dt$   (60)

$\lambda$ is called the decay rate. It has units of $\frac{1}{\text{time}}$. For a single atom of $^{235}$U, this decay rate is
$\lambda = 3 \times 10^{-17}\,{\rm s}^{-1}$. In a mole of uranium ($10^{24}$ atoms), $10^7$ uranium atoms decay, on average, each
second. What is the chance of seeing $m$ decays in a time $t$?
Let's start with $m = 0$ and the time $t$ very small (compared to $\frac{1}{\lambda}$), $t = \delta t$. If the rate to decay
is $dP = \lambda\, dt$ then the probability of not decaying in time $t = \delta t$ is

$P_{\rm no\,decay}(\delta t) = 1 - \lambda\, \delta t$   (61)

For the system to survive to a time $2\,\delta t$ with no decays, it would have to not decay in $\delta t$ and
then not decay again in the next $\delta t$. Since the probability of two uncorrelated occurrences (or
not-occurrences in this case) is the product of the probabilities, $P(a \,\&\, b) = P(a)P(b)$, we then have

$P_{\rm no\,decay}(2\,\delta t) = (1 - \lambda\, \delta t)^2$   (62)

Now we can get all the way to time $t$ by sewing together small times $\delta t = \frac{t}{N}$ and taking $N \to \infty$.
We thus have

$P_{\rm no\,decay}(t) = \lim_{N \to \infty} \left( 1 - \lambda\frac{t}{N} \right)^N = e^{-\lambda t}$   (63)
So that's the m = 0 case: no particles decay.
Using this formula, how long will it take for the probability of some decay to be $\frac{1}{2}$? That's the
same as the probability of no decay being $1 - \frac{1}{2} = \frac{1}{2}$. So we just solve

$\frac{1}{2} = e^{-\lambda t_{1/2}} \;\Rightarrow\; t_{1/2} = \frac{1}{\lambda}\ln 2 = \frac{0.7}{\lambda}$   (64)

We often say $\frac{1}{\lambda}$ is the lifetime and $t_{1/2}$ is the half-life. The two numbers are related by a factor
of $\ln 2$: $t_{1/2} = \frac{1}{\lambda}\ln 2$.
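Plugging in the quoted single-atom rate for $^{235}$U gives a feel for the numbers (a rough conversion of $3.15 \times 10^7$ seconds per year is assumed):

lambda = 3.*10^-17;        (* s^-1, from the text *)
thalf = Log[2]/lambda      (* about 2.3*10^16 s *)
thalf/(3.15*10^7)          (* about 7*10^8 years *)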
Now try $m = 1$. We need the probability that there is exactly one decay in exactly one of the
time intervals. There are $N$ intervals we can pick. So,

$P_{\rm 1\,decay}(t) = \lim_{N \to \infty} N \underbrace{\left( 1 - \lambda\frac{t}{N} \right)^{N-1}}_{N-1\ \text{no decays}} \underbrace{\left( \lambda\frac{t}{N} \right)}_{\text{one decay}} = \lim_{N \to \infty} \left( -t\,\partial_t \right)\left( 1 - \lambda\frac{t}{N} \right)^N$   (65)

In the third term, we have simply rewritten the expression in a smart way with a derivative so we
can reduce it to a previously solved problem, a powerful physicist trick. Now we switch the order
of the limit and the $\partial_t$ and use Eq. (63) to get

$P_{\rm 1\,decay}(t) = -t\,\partial_t P_{\rm no\,decay}(t) = \lambda t\, e^{-\lambda t}$   (66)

For two decays there are $\binom{N}{2} = \frac{N!}{(N-2)!\,2!} = \frac{1}{2}N(N-1)$ ways and we have

$P_{\rm 2\,decays}(t) = \lim_{N \to \infty} \underbrace{\frac{N(N-1)}{2}}_{\text{pick 2 to decay}} \underbrace{\left( 1 - \lambda\frac{t}{N} \right)^{N-2}}_{N-2\ \text{no decays}} \underbrace{\left( \lambda\frac{t}{N} \right)^2}_{\text{two decays}} = \frac{1}{2} t^2\,\partial_t^2\, P_{\rm no\,decay}(t) = \frac{(\lambda t)^2}{2}\, e^{-\lambda t}$   (67)

For general $m$ the result is

$P_m(t) = \frac{(\lambda t)^m}{m!}\, e^{-\lambda t}$   (68)

This is called the Poisson distribution. It gives the probability for exactly $m$ events in time $t$
when each event has a probability per unit time of $\lambda$ and the events are uncorrelated.
In any time $t$ there must have been some number of decays between 0 and $\infty$. Indeed,

$\sum_m P_m(t) = \sum_m \frac{(\lambda t)^m}{m!}\, e^{-\lambda t} = 1$   (69)

So that's consistent (as is the $t$-independence of this sum).


The way we derived the Poisson distribution was for a fixed $m$, as a function of $t$. But it can be
more useful to think of it as a function of $m$ at a fixed value of $t$: $P(m; t) = P_m(t)$. Keep in mind
though that for fixed $t$, $P(m; t)$ as a function of $m$ is a discrete probability distribution (meaning
$m$ is an integer). In contrast, for fixed $m$, $P(m; t)$ is a continuous function of $t$. Moreover, while
it is a normalized probability in $m$, it is simply a function (not a probability distribution) of $t$.
There is no sense in which $\int dt\, P_m(t) = 1$; this doesn't even have the right units.
For a given fixed $t$, how many particles do we expect to have decayed? In other words, what is
the expected value $\langle m \rangle$ in a time $t$? We compute the mean value of $m$ by summing the value of
$m$ times the probability of getting $m$:

$\langle m \rangle = \sum_m m\, P_m(t) = \sum_m m\, \frac{(\lambda t)^m}{m!}\, e^{-\lambda t} = \lambda t$   (70)

The last step is a little tricky; see if you can figure out how to do the sum yourself. (You can
always run Mathematica if you get stuck on steps like this.) The result implies that the expected
number of decays in a time $t$ is $\lambda t$. It makes sense that if you double the time, twice as many
particles decay. How long will it take for half the particles to decay?
The standard deviation of the Poisson distribution is

$\sigma = \sqrt{\langle m^2 \rangle - \langle m \rangle^2} = \sqrt{\lambda t}$   (71)

Again, you can check this yourself as an exercise.
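Both checks are one-liners in Mathematica (writing mu for $\lambda t$):

Sum[mu^m/m! Exp[-mu], {m, 0, Infinity}]       (* 1: the normalization, Eq. (69) *)
Sum[m mu^m/m! Exp[-mu], {m, 0, Infinity}]     (* mu: the mean, Eq. (70) *)
Sum[m^2 mu^m/m! Exp[-mu], {m, 0, Infinity}]   (* mu (1 + mu), so the variance is mu, Eq. (71) *)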
So the Poisson distribution as a function of $m$ at fixed $t$ has mean $\lambda t$ and width $\sqrt{\lambda t}$. Thus the
width compared to the mean is

$\frac{\sigma}{\langle m \rangle} = \frac{1}{\sqrt{\lambda t}}$   (72)

This goes to 0 as $t \to \infty$. In other words, the Poisson distribution gets narrower and narrower as $t$
gets larger. What does this mean physically? It means if we wait one lifetime ($t = \frac{1}{\lambda}$) we should
expect $1 \pm 1$ particle to decay. If we wait 2 lifetimes, we expect $2 \pm \sqrt{2}$ to decay ($t = \frac{2}{\lambda}$, $\langle m \rangle = 2$
and $\sigma = \frac{2}{\sqrt{2}} = \sqrt{2}$). If we wait 100 lifetimes, we expect $100 \pm 10$ to decay. So the longer we wait,
not only are there more decays, but we know more precisely how many decays there will be. This
is, of course, a consequence of the central limit theorem.
So what do you expect the distribution to look like as $t \to \infty$ or $m \to \infty$? Let's first look
numerically. We can plot $P_m(t)$ as a function of $m$, which is a discrete index, or as a function of $t$,
which is continuous:

Figure 3. The Poisson distribution as a function of the discrete index $m$ for various times (left), and as a
function of time for various values of $m$ (right).

We clearly see the Gaussian shape emerging at large t (left) and at large m (right).
Now let's try to see how the Gaussian form arises analytically. First of all, we want the high-statistics
limit, which means large $t$ in units of $\frac{1}{\lambda}$, which also means large $m$. When you see a factor
of $m!$ and want to expand at large $m$, you should immediately think Stirling's approximation:

$x! \approx e^{-x} x^x \times (\cdots)$   (73)

or equivalently

$\ln x! \approx x\ln x - x + \cdots$   (74)

For a simple derivation, see Appendix B. We will use this expansion a lot.
The log of the Poisson distribution is

$\ln P_m(t) = \ln\left[ \frac{(\lambda t)^m}{m!}\, e^{-\lambda t} \right] = m\ln(\lambda t) - \lambda t - \ln m!$   (75)

Then we use Stirling's approximation for $m!$:

$\ln P_m(t) \;\xrightarrow{m \gg 1}\; m\ln(\lambda t) - \lambda t - m\ln m + m + \cdots = m\ln\frac{\lambda t}{m} + (m - \lambda t) + \cdots$   (76)

This is still a mess. But we expect $P_m(t)$ to be peaked around its mean $\langle m \rangle = \lambda t$. So let's Taylor
expand $\ln P_m(t)$ around $m = \lambda t$. The leading term, from setting $m = \lambda t$, makes Eq. (76) vanish.
The next term is

$\left. \frac{\partial}{\partial m} \ln P_m(t) \right|_{m = \lambda t} = \lim_{m \to \lambda t} \left[ \ln(\lambda t) - \ln m \right] = 0$   (77)

which also vanishes. We have to go one more order in the Taylor expansion to get a nonzero answer:

$\left. \frac{\partial^2}{\partial m^2} \ln P_m(t) \right|_{m = \lambda t} = \lim_{m \to \lambda t} \left( -\frac{1}{m} \right) = -\frac{1}{\lambda t}$   (78)

Thus,

$\ln P_m(t) = -\frac{1}{2\lambda t}(m - \lambda t)^2 + \cdots$   (79)

and therefore

$P_m(t) \;\xrightarrow{m \gg 1}\; \frac{1}{\sqrt{2\pi\lambda t}}\, e^{-\frac{(m - \lambda t)^2}{2\lambda t}}$   (80)

This is a Gaussian with mean $\langle m \rangle = \lambda t$ and width $\sigma = \sqrt{\lambda t}$, exactly as expected by the central limit
theorem.
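To see Eq. (80) at work, here is a hedged Mathematica comparison of the exact Poisson distribution with its Gaussian limit, for the arbitrary choice $\lambda t = 100$:

mu = 100.;
DiscretePlot[{mu^m/m! Exp[-mu], 1/Sqrt[2 Pi mu] Exp[-(m - mu)^2/(2 mu)]}, {m, 60, 140}]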
You might not be terribly impressed with this derivation as a check of the central limit theorem.
After all, we expanded $\ln P_m$ to second order around $m = \langle m \rangle$. Doing that for any function $P_m$
is guaranteed to give a Gaussian. But that's really the whole point of the central limit theorem:
any function does give a Gaussian. So in the end you should be impressed after all.

5 Summary
In this lecture, we introduced the basic concepts from probability that will be useful for statistical
mechanics. The key concepts are
• Normalized probability distributions $P(x)$ with $\int dx\, P(x) = 1$

• Mean: $\bar{x} = \langle x \rangle = \int dx\, x\, P(x)$

• Variance: ${\rm var} = \int dx\, (x - \bar{x})^2 P(x)$

• Standard deviation or width: $\sigma = \sqrt{\rm var}$

• The Gaussian distribution $P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \bar{x})^2}{2\sigma^2} \right)$ has mean $\bar{x}$ and width $\sigma$.

• If you draw $x$ from a Gaussian, it is 68% likely to be between $\bar{x} - \sigma$ and $\bar{x} + \sigma$.

• The convolution of two distributions is defined as $(P_A \ast P_B)(z) = \int_{-\infty}^{\infty} dx\, P_A(z - x) P_B(x)$.
  It describes the probability of getting $z$ as the sum of a number drawn from $P_A$ and another
  number drawn from $P_B$.

• Given a probability distribution $P(x)$ with mean $\bar{x}$ and width $\sigma$, you can construct a new
  probability distribution $P_N(x)$ by averaging over $N$ draws from $P(x)$. The central limit
  theorem (CLT) says that as $N \to \infty$ this new distribution will approach a Gaussian with
  the same mean as $P(x)$ ($\bar{x}_N = \bar{x}$) and a smaller standard deviation $\sigma_N \approx \frac{\sigma}{\sqrt{N}}$. All other
  properties of $P(x)$ are lost after this averaging at large $N$.

• The CLT also implies that if we sum (rather than average) the values from the draws, the mean
  grows like $\bar{x}_N \approx N\bar{x}$ and the standard deviation like $\sigma_N \approx \sqrt{N}\,\sigma$.

• Because of the CLT, Gaussians are very common. Their exponential decay encourages us to
  study logarithms of distributions, which turns fast-varying exponentials into slow-varying
  polynomials: $\ln e^{-\frac{x^2}{2\sigma^2}} = -\frac{x^2}{2\sigma^2}$.

• When we have a rate $dP = \lambda\, dt$ for an event happening that is independent of time, then
  the probability of having $m$ events after a time $t$ is described by the Poisson distribution
  $P_m(t) = \frac{(\lambda t)^m}{m!} e^{-\lambda t}$.

• Stirling's approximation is that $N! \approx \sqrt{2\pi N}\, N^N e^{-N}$ at large $N$. This works very well,
  even at $N = 1$.

Appendix A Dirac δ-function


The Dirac $\delta$-function is very useful in physics, from quantum mechanics to statistical mechanics.
The $\delta$-function is not really a function but rather a distribution. $\delta(x)$ is zero everywhere except at
$x = 0$. When you integrate a function against $\delta(x)$ you pick up the value of that function at 0. That is,

$\int dx\, \delta(x)\, f(x) = f(0)$   (81)

This is the defining property of $\delta(x)$. The integration region has to include $x = 0$ but is otherwise
arbitrary, since $\delta(x) = 0$ if $x \neq 0$.
Another useful property of $\delta$-functions is that if we rescale the argument of $\delta(x)$ by a number
$a$, then the $\delta$-function rescales by $\frac{1}{a}$. That is,

$\delta(a x) = \frac{1}{a}\,\delta(x)$   (82)

To check this, we can change variables from $x \to \frac{x}{a}$ in the integral:

$\int dx\, \delta(a x)\, f(x) = \int d\left( \frac{x}{a} \right) \delta(x)\, f\left( \frac{x}{a} \right) = \frac{1}{a} f(0) = \int dx\, \frac{1}{a}\,\delta(x)\, f(x)$   (83)

It's sometimes helpful to think of the $\delta$-function as the limit of a regular function. There are
lots of functions whose limits are $\delta$-functions. For example, Gaussians:

$\delta(x) = \lim_{\sigma \to 0} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}$   (84)

As a check, note that the integral over the Gaussian is 1 regardless of $\sigma$, so the $\delta$-function also
integrates to 1. As $\sigma \to 0$, the width of the Gaussian goes to zero, so it has zero support away from
the mean; that is, it vanishes except at $x = 0$, just like the $\delta$-function.
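Mathematica knows these rules, so Eq. (82) can be checked directly (f is a generic, unspecified test function):

Integrate[DiracDelta[a x] f[x], {x, -Infinity, Infinity}, Assumptions -> a > 0]   (* f[0]/a *)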

Appendix B Stirling's approximation


There are many ways to derive Stirling's approximation. Here's a relatively easy one. We start by
taking the logarithm:

$\ln N! = \ln N + \ln(N-1) + \ln(N-2) + \cdots + \ln 1 = \sum_{j=1}^{N} \ln j$   (85)

For large $N$ we then write the sum as an integral:

$\ln N! = \sum_{j=1}^{N} \ln j \approx \int_1^N dj\, \ln j = N\ln N - N + 1 \approx N\ln N - N$   (86)

That's the answer.


One can include more terms in the expansion by using the Euler-Maclaurin formula for the
difference between a sum and an integral. For example, the next term gives

$N! \approx \sqrt{2\pi N}\, N^N e^{-N}$   (87)

An alternative derivation is to use an integral representation of the factorial as a $\Gamma$ function:
$n! = \Gamma(n+1) = \int_0^\infty x^n e^{-x}\, dx$. For example, Mathematica can simply series expand this around
$n = \infty$ to reproduce Eq. (87). Try it!
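For example, a sketch of the suggested check (the exact form of the output may vary between versions):

Series[Gamma[n + 1], {n, Infinity, 1}]   (* the leading behavior matches Sqrt[2 Pi n] n^n E^-n of Eq. (87) *)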

The next-order correction to this is down by $\frac{1}{12N}$, which gets small fast. You can check that
Stirling's approximation is off by less than 8% already at $N = 1$ and by less than 3% by $N = 3$. For
Avogadro's number $N = 6 \times 10^{23}$ it is off by one part in $10^{25}$.

Appendix C Central limit theorem from convolutions


Here's another slick proof of the central limit theorem. We start with the definition

$P_N(x) = \int dx_1 \cdots dx_N\, P(x_1) \cdots P(x_N)\, \delta\left( \frac{x_1 + \cdots + x_N}{N} - x \right)$   (88)

Now we write the $\delta$-function in Fourier space as

$\delta(x_1 + \cdots + x_N - x) = \int \frac{dk}{2\pi}\, e^{i k (x_1 + \cdots + x_N - x)}$   (89)

So that

$P_N(x) = \int \frac{dk}{2\pi} \int dx_1 \cdots dx_N\, P(x_1) \cdots P(x_N)\, e^{i k \left( \frac{x_1 + \cdots + x_N}{N} - x \right)}$   (90)

Defining the Fourier transform of $P$ as

$\tilde{P}(k) = \int dx\, e^{i k x}\, P(x)$   (91)

we then have

$P_N(x) = \int \frac{dk}{2\pi}\, \tilde{P}\left( \frac{k}{N} \right)^{N} e^{-i k x}$   (92)

Eq. (92) is just the statement that Fourier transforms turn convolutions into products. Then,

$\tilde{P}\left( \frac{k}{N} \right) = \int dx\, e^{i\frac{k x}{N}}\, P(x) = \int dx \left[ 1 + i\frac{k x}{N} + \frac{1}{2}\left( i\frac{k x}{N} \right)^2 + \frac{1}{3!}\left( i\frac{k x}{N} \right)^3 + \cdots \right] P(x)$   (93)

$= 1 + i\frac{k}{N}\bar{x} - \frac{k^2}{2}\frac{\langle x^2 \rangle}{N^2} - i\frac{k^3 \langle x^3 \rangle}{6 N^3} + \cdots$   (94)

Now if we didn't do anything else, then as $N \to \infty$ we see immediately that $\tilde{P}\left( \frac{k}{N} \right) \to 1$ and so
$P_N(x) \to \delta(x)$. This is because the whole distribution is shrinking down to be around $x = 0$. We
don't care about this shrinking, but rather what is happening to the shape of the distribution. So,
as with the moment proof in Section 3.1, we first normalize $P_N$ by shifting so that $\bar{x} = 0$. We also
need to rescale the $x$-axis by a factor of $\sqrt{N}$ so that the width stays finite. Thus we can write

$\tilde{P}\left( \frac{k}{N} \right) = 1 - \frac{k^2}{2N}\left( \frac{\sigma^2}{N} \right) - i\frac{k^3}{6 N^{3/2}}\left( \frac{\langle x^3 \rangle}{N^{3/2}} \right) + \cdots$   (95)

where the terms in parentheses are going to be finite as $N \to \infty$ after rescaling $x$ by $\sqrt{N}$. Then

$\tilde{P}\left( \frac{k}{N} \right)^{N} = \left[ 1 - \frac{k^2 \sigma^2}{2N^2} + O\left( \frac{1}{N^{3/2}} \right) \right]^N = e^{-\frac{k^2 \sigma^2}{2N}}$   (96)

where $e^x = \lim_{N \to \infty}\left( 1 + \frac{x}{N} \right)^N$ was used and the $O(\cdots)$ terms give contributions that vanish as $N \to \infty$.
We can then compute the inverse Fourier transform to get

$P_N(x) = \int_{-\infty}^{\infty} \frac{dk}{2\pi}\, e^{-\frac{k^2 \sigma^2}{2N}}\, e^{-i k x} = \sqrt{\frac{N}{2\pi\sigma^2}}\, e^{-\frac{N x^2}{2\sigma^2}}$   (97)

which is the desired result, the central limit theorem.
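As a concrete check of this chain of steps, here is a minimal Mathematica sketch using the flat box distribution of Section 1.1, for which $\bar{x} = 0$ and $\sigma^2 = \frac{L^2}{12}$:

Pt[k_] = Integrate[Exp[I k x]/L, {x, -L/2, L/2}];   (* P~(k) of Eq. (91); this equals 2 Sin[k L/2]/(k L) *)
Series[Pt[k/n], {k, 0, 3}]   (* 1 - k^2 L^2/(24 n^2) + O(k^4), i.e. Eq. (94) with <x^2> = L^2/12 *)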
