Part IA Probability
Based on lectures by R. Weber
Notes taken by Dexter Chua
Lent 2015
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
Basic concepts
Classical probability, equally likely outcomes. Combinatorial analysis, permutations
and combinations. Stirling’s formula (asymptotics for log n! proved). [3]
Axiomatic approach
Axioms (countable case). Probability spaces. Inclusion-exclusion formula. Continuity
and subadditivity of probability measures. Independence. Binomial, Poisson and geo-
metric distributions. Relation between Poisson and binomial distributions. Conditional
probability, Bayes’s formula. Examples, including Simpson’s paradox. [5]
Discrete random variables
Expectation. Functions of a random variable, indicator function, variance, standard
deviation. Covariance, independence of random variables. Generating functions: sums
of independent random variables, random sum formula, moments.
Conditional expectation. Random walks: gambler’s ruin, recurrence relations. Dif-
ference equations and their solution. Mean time to absorption. Branching processes:
generating functions and extinction probability. Combinatorial applications of generat-
ing functions. [7]
Continuous random variables
Distributions and density functions. Expectations; expectation of a function of a
random variable. Uniform, normal and exponential random variables. Memoryless
property of exponential distribution. Joint distributions: transformation of random
variables (including Jacobians), examples. Simulation: generating continuous random
variables, independent normal random variables. Geometrical probability: Bertrand’s
paradox, Buffon’s needle. Correlation coefficient, bivariate normal random variables. [6]
Inequalities and limits
Markov’s inequality, Chebyshev’s inequality. Weak law of large numbers. Convexity:
Jensen’s inequality for general random variables, AM/GM inequality.
Moment generating functions and statement (no proof) of continuity theorem. State-
ment of central limit theorem and sketch of proof. Examples, including sampling. [3]
Contents
0 Introduction
1 Classical probability
1.1 Classical probability
1.2 Counting
1.3 Stirling’s formula
2 Axioms of probability
2.1 Axioms and definitions
2.2 Inequalities and formulae
2.3 Independence
2.4 Important discrete distributions
2.5 Conditional probability
3 Discrete random variables
3.1 Discrete random variables
3.2 Inequalities
3.3 Weak law of large numbers
3.4 Multiple random variables
3.5 Probability generating functions
4 Interesting problems
4.1 Branching processes
4.2 Random walk and gambler’s ruin
5 Continuous random variables
5.1 Continuous random variables
5.2 Stochastic ordering and inspection paradox
5.3 Jointly distributed random variables
5.4 Geometric probability
5.5 The normal distribution
5.6 Transformation of random variables
5.7 Moment generating functions
6 More distributions
6.1 Cauchy distribution
6.2 Gamma distribution
6.3 Beta distribution*
6.4 More on the normal distribution
6.5 Multivariate normal
7 Central limit theorem
0 Introduction
In everyday life, we often encounter the term probability, and it is used in many
different ways. For example, we can hear people say:
(i) The probability that a fair coin will land heads is 1/2.
(ii) The probability that a selection of 6 numbers wins the National Lottery Lotto jackpot is 1 in $\binom{49}{6} = 13983816$, or about $7.15112 \times 10^{-8}$.
(iii) The probability that a drawing pin will land 'point up' is 0.62.
(iv) The probability that a large earthquake will occur on the San Andreas Fault in the next 30 years is about 21%.
The first two cases are things derived from logic. For example, we know that
the coin either lands heads or tails. By definition, a fair coin is equally likely to
land heads or tail. So the probability of either must be 1/2.
The third is something probably derived from experiments. Perhaps we did
1000 experiments and 620 of the pins pointed up. The fourth and fifth examples
belong to yet another category, concerning our beliefs and predictions.
We call the first kind “classical probability”, the second kind “frequentist
probability” and the last “subjective probability”. In this course, we only
consider classical probability.
1 Classical probability
We start with a rather informal introduction to probability. Afterwards, in
Chapter 2, we will have a formal axiomatic definition of probability and formally
study their properties.
1.1 Classical probability
Definition (Classical probability). Classical probability applies in a situation
where there is a finite number of equally likely outcomes.
A classical example is the problem of points.
Example. A and B play a game in which they keep throwing coins. If a head lands, then A gets a point. Otherwise, B gets a point. The first person to get 10 points wins a prize.

Now suppose A has got 8 points and B has got 7, but the game has to end because an earthquake struck. How should they divide the prize? We answer this by finding the probability of A winning. Someone must have won by the end of 19 rounds, i.e. after 4 more rounds. If A wins at least 2 of them, then A wins. Otherwise, B wins.

The number of ways this can happen is $\binom{4}{2} + \binom{4}{3} + \binom{4}{4} = 11$, while there are 16 possible outcomes in total. So A should get 11/16 of the prize.
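As a quick computational check (an illustration added here, not part of the original notes), we can enumerate the $2^4$ outcomes of the four remaining tosses directly:

```python
from itertools import product

# A wins the prize iff at least 2 of the 4 remaining tosses are heads.
outcomes = list(product("HT", repeat=4))
a_wins = sum(1 for o in outcomes if o.count("H") >= 2)
print(a_wins, len(outcomes))   # 11 16
```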
In general, consider an experiment that has a random outcome.
Definition (Sample space). The set of all possible outcomes is the sample space, $\Omega$. We can list the outcomes as $\omega_1, \omega_2, \ldots \in \Omega$. Each $\omega \in \Omega$ is an outcome.

Definition (Event). A subset of $\Omega$ is called an event.
Example. When rolling a die, the sample space is $\{1, 2, 3, 4, 5, 6\}$, and each item is an outcome. “Getting an odd number” and “getting 3” are two possible events.
In probability, we will be dealing with sets a lot, so it would be helpful to
come up with some notation.
Definition (Set notations). Given any two events $A, B \subseteq \Omega$,
– The complement of $A$ is $A^C = A' = \bar{A} = \Omega \setminus A$.
– “$A$ or $B$” is the set $A \cup B$.
– “$A$ and $B$” is the set $A \cap B$.
– $A$ and $B$ are mutually exclusive or disjoint if $A \cap B = \emptyset$.
– If $A \subseteq B$, then $A$ occurring implies $B$ occurring.
Definition (Probability). Suppose $\Omega = \{\omega_1, \omega_2, \ldots, \omega_N\}$. Let $A \subseteq \Omega$ be an event. Then the probability of $A$ is
$$P(A) = \frac{\text{Number of outcomes in } A}{\text{Number of outcomes in } \Omega} = \frac{|A|}{N}.$$
Here we are assuming that each outcome is equally likely to happen, which is the case in (fair) dice rolls and coin flips.
Example. Suppose $r$ digits are drawn at random from a table of random digits from 0 to 9. What is the probability that
(i) No digit exceeds $k$;
(ii) The largest digit drawn is $k$?

The sample space is $\Omega = \{(a_1, a_2, \ldots, a_r) : 0 \leq a_i \leq 9\}$. Then $|\Omega| = 10^r$.

Let $A_k = [\text{no digit exceeds } k] = \{(a_1, \ldots, a_r) : 0 \leq a_i \leq k\}$. Then $|A_k| = (k+1)^r$. So
$$P(A_k) = \frac{(k+1)^r}{10^r}.$$
Now let $B_k = [\text{largest digit drawn is } k]$. We can find this by finding all outcomes in which no digit exceeds $k$, and subtracting the number of outcomes in which no digit exceeds $k - 1$. So $|B_k| = |A_k| - |A_{k-1}|$ and
$$P(B_k) = \frac{(k+1)^r - k^r}{10^r}.$$
1.2 Counting
To find probabilities, we often need to count things. For example, in the example above, we had to count the number of elements in $B_k$.
Example. A menu has 6 starters, 7 mains and 6 desserts. How many possible meal combinations are there? Clearly $6 \times 7 \times 6 = 252$.
Here we are using the fundamental rule of counting:
Theorem (Fundamental rule of counting). Suppose we have to make $r$ multiple choices in sequence. There are $m_1$ possibilities for the first choice, $m_2$ possibilities for the second, etc. Then the total number of choices is $m_1 \times m_2 \times \cdots \times m_r$.
Example. How many ways can $1, 2, \ldots, n$ be ordered? The first choice has $n$ possibilities, the second has $n - 1$ possibilities, etc. So there are $n \times (n-1) \times \cdots \times 1 = n!$ ways.
Sampling with or without replacement
Suppose we have to pick $n$ items from a total of $x$ items. We can model this as follows: let $N = \{1, 2, \ldots, n\}$ be the list and let $X = \{1, 2, \ldots, x\}$ be the items. Then each way of picking the items is a function $f : N \to X$ with $f(i) =$ item at the $i$th position.
Definition (Sampling with replacement). When we sample with replacement, after choosing an item, it is put back and can be chosen again. Then any sampling function $f$ satisfies sampling with replacement.
Definition (Sampling without replacement). When we sample without replace-
ment, after choosing an item, we kill it with fire and cannot choose it again.
Then f must be an injective function, and clearly we must have x n.
We can also have sampling with replacement, but we require each item to be
chosen at least once. In this case, f must be surjective.
Example. Suppose $N = \{a, b, c\}$ and $X = \{p, q, r, s\}$. How many injective functions are there $N \to X$?

When we choose $f(a)$, we have 4 options. When we choose $f(b)$, we have 3 left. When we choose $f(c)$, we have 2 choices left. So there are 24 possible choices.
Example. I have $n$ keys in my pocket. We select one at random once and try to unlock. What is the probability that I succeed at the $r$th trial?

Suppose we do it with replacement. We have to fail the first $r - 1$ trials and succeed in the $r$th. So the probability is
$$\frac{(n-1)(n-1)\cdots(n-1)(1)}{n^r} = \frac{(n-1)^{r-1}}{n^r}.$$
Now suppose we are smarter and try without replacement. Then the probability is
$$\frac{(n-1)(n-2)\cdots(n-r+1)(1)}{n(n-1)\cdots(n-r+1)} = \frac{1}{n}.$$
Example (Birthday problem). How many people are needed in a room for the probability that two people have the same birthday to be at least a half? Suppose $f(r)$ is the probability that, in a room of $r$ people, there is a birthday match.

We solve this by finding the probability of no match, $1 - f(r)$. The total number of possibilities of birthday combinations is $365^r$. For nobody to have the same birthday, the first person can have any birthday, the second has 364 choices left, etc. So
$$P(\text{no match}) = \frac{365 \cdot 364 \cdot 363 \cdots (366 - r)}{365 \cdot 365 \cdot 365 \cdots 365}.$$
If we calculate this with a computer, we find that $f(22) = 0.475695$ and $f(23) = 0.507297$.

While this might sound odd since 23 is small, this is because we are thinking about the wrong thing. The probability of a match is related more to the number of pairs of people than to the number of people. With 23 people, we have $23 \times 22/2 = 253$ pairs, which is quite large compared to 365.
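The quoted values of $f(22)$ and $f(23)$ can be reproduced with a few lines of Python (a sketch added for illustration, not part of the notes):

```python
def no_match(r):
    """Probability that r people all have distinct birthdays (365-day year)."""
    p = 1.0
    for i in range(r):
        p *= (365 - i) / 365
    return p

for r in (22, 23):
    print(r, 1 - no_match(r))   # ~0.4757 and ~0.5073
```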
Sampling with or without regard to ordering
There are cases where we don't care about, say, list positions. For example, if we pick two representatives from a class, the order of picking them doesn't matter. In terms of the function $f : N \to X$, after mapping to $f(1), f(2), \ldots, f(n)$, we can
– Leave the list alone.
– Sort the list ascending, i.e. we might get $(2, 5, 4)$ and $(4, 2, 5)$. If we don't care about list positions, these are just equivalent to $(2, 4, 5)$.
– Re-number each item by the number of the draw on which it was first seen. For example, we can rename $(2, 5, 2)$ and $(5, 4, 5)$ both as $(1, 2, 1)$. This happens if the labelling of items doesn't matter.
– Both of the above. So we can rename $(2, 5, 2)$ and $(8, 5, 5)$ both as $(1, 1, 2)$.
Total number of cases
Combining these four possibilities with whether we have replacement, no replacement, or “everything has to be chosen at least once”, we have 12 possible cases of counting. The most important ones are:
– Replacement + with ordering: the number of ways is $x^n$.
– Without replacement + with ordering: the number of ways is $x_{(n)} = x(x-1)\cdots(x-n+1)$.
– Without replacement + without ordering: we only care which items get selected. The number of ways is $\binom{x}{n} = C^x_n = x_{(n)}/n!$.
– Replacement + without ordering: we only care how many times each item got chosen. This is equivalent to partitioning $n$ into $n_1 + n_2 + \cdots + n_k$. Say $n = 6$ and $k = 3$. We can write a particular partition as
$$** \mid * \mid {*}{*}{*}$$
So we have $n + k - 1$ symbols and $k - 1$ of them are bars. So the number of ways is $\binom{n+k-1}{k-1}$.
Multinomial coefficient
Suppose that we have to pick $n$ items, and each item can either be an apple or an orange. The number of ways of picking such that $k$ apples are chosen is, by definition, $\binom{n}{k}$.

In general, suppose we have to fill successive positions in a list of length $n$, with replacement, from a set of $k$ items. The number of ways of doing so such that item $i$ is picked $n_i$ times is defined to be the multinomial coefficient $\binom{n}{n_1, n_2, \ldots, n_k}$.
Definition (Multinomial coefficient). A multinomial coefficient is
$$\binom{n}{n_1, n_2, \ldots, n_k} = \binom{n}{n_1}\binom{n - n_1}{n_2} \cdots \binom{n - n_1 - \cdots - n_{k-1}}{n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}.$$
It is the number of ways to distribute $n$ items into $k$ positions, in which the $i$th position has $n_i$ items.
Example. We know that
$$(x + y)^n = x^n + \binom{n}{1} x^{n-1} y + \cdots + y^n.$$
If we have a trinomial, then
$$(x + y + z)^n = \sum_{n_1, n_2, n_3} \binom{n}{n_1, n_2, n_3} x^{n_1} y^{n_2} z^{n_3}.$$
Example. How many ways can we deal 52 cards to 4 players, each with a hand of 13? The total number of ways is
$$\binom{52}{13, 13, 13, 13} = \frac{52!}{(13!)^4} = 53644737765488792839237440000 \approx 5.36 \times 10^{28}.$$
While computers are still capable of calculating that, what if we tried more cards? Suppose each person has $n$ cards. Then the number of ways is
$$\frac{(4n)!}{(n!)^4},$$
which is huge. We can use Stirling's formula to approximate it:
1.3 Stirling’s formula
Before we state and prove Stirling’s formula, we prove a weaker (but examinable)
version:
Proposition. $\log n! \sim n \log n$.
Proof. Note that
$$\log n! = \sum_{k=1}^n \log k.$$
Now we claim that
$$\int_1^n \log x\, dx \leq \sum_{k=1}^n \log k \leq \int_1^{n+1} \log x\, dx.$$
This is true by considering the diagram: [figure: $y = \ln x$ and $y = \ln(x - 1)$ sandwiching the step function]

We can evaluate the integrals to obtain
$$n \log n - n + 1 \leq \log n! \leq (n + 1)\log(n + 1) - n.$$
Divide both sides by $n \log n$ and let $n \to \infty$. Both sides tend to 1. So
$$\frac{\log n!}{n \log n} \to 1.$$
Now we prove Stirling’s Formula:
Theorem (Stirling's formula). As $n \to \infty$,
$$\log\frac{n!\, e^n}{n^{n + 1/2}} = \log\sqrt{2\pi} + O\left(\frac{1}{n}\right).$$
Corollary.
$$n! \sim \sqrt{2\pi}\, n^{n + 1/2} e^{-n}.$$
Proof. (non-examinable) Define
$$d_n = \log\frac{n!\, e^n}{n^{n+1/2}} = \log n! - (n + 1/2)\log n + n.$$
Then
$$d_n - d_{n+1} = (n + 1/2)\log\frac{n+1}{n} - 1.$$
Write $t = 1/(2n + 1)$. Then
$$d_n - d_{n+1} = \frac{1}{2t}\log\frac{1 + t}{1 - t} - 1.$$
We can simplify by noting that
$$\log(1 + t) - t = -\tfrac{1}{2}t^2 + \tfrac{1}{3}t^3 - \tfrac{1}{4}t^4 + \cdots$$
$$\log(1 - t) + t = -\tfrac{1}{2}t^2 - \tfrac{1}{3}t^3 - \tfrac{1}{4}t^4 - \cdots$$
Then if we subtract the equations and divide by $2t$, we obtain
$$d_n - d_{n+1} = \tfrac{1}{3}t^2 + \tfrac{1}{5}t^4 + \tfrac{1}{7}t^6 + \cdots < \tfrac{1}{3}t^2 + \tfrac{1}{3}t^4 + \tfrac{1}{3}t^6 + \cdots = \frac{1}{3}\frac{t^2}{1 - t^2} = \frac{1}{3}\frac{1}{(2n+1)^2 - 1} = \frac{1}{12}\left(\frac{1}{n} - \frac{1}{n+1}\right).$$
By summing these bounds, we know that
$$d_1 - d_n < \frac{1}{12}\left(1 - \frac{1}{n}\right).$$
Then we know that $d_n$ is bounded below by $d_1 -$ something, and is decreasing since $d_n - d_{n+1}$ is positive. So it converges to a limit $A$. We know $A$ is a lower bound for $d_n$ since $(d_n)$ is decreasing.

Suppose $m > n$. Then $d_n - d_m < \left(\frac{1}{n} - \frac{1}{m}\right)\frac{1}{12}$. So taking the limit as $m \to \infty$, we obtain an upper bound for $d_n$: $d_n < A + 1/(12n)$. Hence we know that
$$A < d_n < A + \frac{1}{12n}.$$
However, all these results are useless if we don't know what $A$ is. To find $A$, we have a small detour to prove a formula:

Take $I_n = \int_0^{\pi/2} \sin^n\theta\, d\theta$. This is decreasing for increasing $n$ as $\sin^n\theta$ gets smaller. We also know that
$$I_n = \int_0^{\pi/2} \sin^n\theta\, d\theta = \left[-\cos\theta\sin^{n-1}\theta\right]_0^{\pi/2} + \int_0^{\pi/2}(n-1)\cos^2\theta\sin^{n-2}\theta\, d\theta = 0 + \int_0^{\pi/2}(n-1)(1 - \sin^2\theta)\sin^{n-2}\theta\, d\theta = (n-1)(I_{n-2} - I_n).$$
So
$$I_n = \frac{n-1}{n}I_{n-2}.$$
We can directly evaluate the integral to obtain $I_0 = \pi/2$, $I_1 = 1$. Then
$$I_{2n} = \frac{1}{2}\cdot\frac{3}{4}\cdots\frac{2n-1}{2n}\cdot\frac{\pi}{2} = \frac{(2n)!}{(2^n n!)^2}\frac{\pi}{2}, \qquad I_{2n+1} = \frac{2}{3}\cdot\frac{4}{5}\cdots\frac{2n}{2n+1} = \frac{(2^n n!)^2}{(2n+1)!}.$$
So using the fact that $I_n$ is decreasing, we know that
$$1 \leq \frac{I_{2n}}{I_{2n+1}} \leq \frac{I_{2n-1}}{I_{2n+1}} = 1 + \frac{1}{2n} \to 1.$$
Using the approximation $n! \sim n^{n+1/2}e^{-n+A}$, where $A$ is the limit we want to find, we can approximate
$$\frac{I_{2n}}{I_{2n+1}} = \pi(2n+1)\frac{((2n)!)^2}{2^{4n+1}(n!)^4} \sim \pi(2n+1)\frac{1}{n e^{2A}} \to \frac{2\pi}{e^{2A}}.$$
Since the last expression tends to 1, we know that $A = \log\sqrt{2\pi}$. Hooray for magic!
This approximation can be improved:
Proposition (non-examinable). We use the $1/12n$ term from the proof above to get a better approximation:
$$\sqrt{2\pi}\, n^{n+1/2} e^{-n + \frac{1}{12n+1}} \leq n! \leq \sqrt{2\pi}\, n^{n+1/2} e^{-n + \frac{1}{12n}}.$$
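A small numerical comparison of $n!$ against the plain and refined approximations (an illustrative Python sketch, not from the notes):

```python
import math

# Ratio of n! to sqrt(2*pi) * n^(n+1/2) * e^(-n), with and without the e^(1/(12n)) factor.
for n in (5, 10, 20):
    exact = math.factorial(n)
    stirling = math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)
    refined = stirling * math.exp(1 / (12 * n))
    print(n, exact / stirling, exact / refined)
```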
Example. Suppose we toss a coin $2n$ times. What is the probability of an equal number of heads and tails? The probability is
$$\binom{2n}{n}2^{-2n} = \frac{(2n)!}{(n!)^2}2^{-2n} \sim \frac{1}{\sqrt{\pi n}}.$$
Example. Suppose we draw 26 cards from 52. What is the probability of getting 13 reds and 13 blacks? The probability is
$$\frac{\binom{26}{13}\binom{26}{13}}{\binom{52}{26}} = 0.2181.$$
2 Axioms of probability
2.1 Axioms and definitions
So far, we have semi-formally defined some probabilistic notions. However, what
we had above was rather restrictive. We were only allowed to have a finite
number of possible outcomes, and all outcomes occur with the same probability.
However, most things in the real world do not fit these descriptions. For example, we cannot use this to model a coin that gives heads with probability $\pi^{-1}$.
In general, “probability” can be defined as follows:
Definition (Probability space). A probability space is a triple $(\Omega, \mathcal{F}, P)$. $\Omega$ is a set called the sample space, $\mathcal{F}$ is a collection of subsets of $\Omega$, and $P : \mathcal{F} \to [0, 1]$ is the probability measure.

$\mathcal{F}$ has to satisfy the following axioms:
(i) $\emptyset, \Omega \in \mathcal{F}$.
(ii) $A \in \mathcal{F} \Rightarrow A^C \in \mathcal{F}$.
(iii) $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^\infty A_i \in \mathcal{F}$.

And $P$ has to satisfy the following Kolmogorov axioms:
(i) $0 \leq P(A) \leq 1$ for all $A \in \mathcal{F}$.
(ii) $P(\Omega) = 1$.
(iii) For any countable collection of events $A_1, A_2, \ldots$ which are disjoint, i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$, we have
$$P\left(\bigcup_i A_i\right) = \sum_i P(A_i).$$

Items in $\Omega$ are known as the outcomes, items in $\mathcal{F}$ are known as the events, and $P(A)$ is the probability of the event $A$.

If $\Omega$ is finite (or countable), we usually take $\mathcal{F}$ to be all the subsets of $\Omega$, i.e. the power set of $\Omega$. However, if $\Omega$ is, say, $\mathbb{R}$, we have to be a bit more careful and only include nice subsets, or else we cannot have a well-defined $P$.
Often it is not helpful to specify the full function
P
. Instead, in discrete cases,
we just specify the probabilities of each outcome, and use the third axiom to
obtain the full P.
Definition (Probability distribution). Let $\Omega = \{\omega_1, \omega_2, \ldots\}$. Choose numbers $p_1, p_2, \ldots$ such that $\sum_{i=1}^\infty p_i = 1$. Let $p(\omega_i) = p_i$. Then define
$$P(A) = \sum_{\omega_i \in A} p(\omega_i).$$
This $P(A)$ satisfies the above axioms, and $p_1, p_2, \ldots$ is the probability distribution.
Using the axioms, we can quickly prove a few rather obvious results.
Theorem.
(i) $P(\emptyset) = 0$.
(ii) $P(A^C) = 1 - P(A)$.
(iii) $A \subseteq B \Rightarrow P(A) \leq P(B)$.
(iv) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Proof.
(i) $\Omega$ and $\emptyset$ are disjoint. So $P(\Omega) + P(\emptyset) = P(\Omega \cup \emptyset) = P(\Omega)$. So $P(\emptyset) = 0$.
(ii) $P(A) + P(A^C) = P(\Omega) = 1$ since $A$ and $A^C$ are disjoint.
(iii) Write $B = A \cup (B \cap A^C)$. Then $P(B) = P(A) + P(B \cap A^C) \geq P(A)$.
(iv) $P(A \cup B) = P(A) + P(B \cap A^C)$. We also know that $P(B) = P(A \cap B) + P(B \cap A^C)$. Then the result follows.
From above, we know that $P(A \cup B) \leq P(A) + P(B)$. So we say that $P$ is a subadditive function. Also, $P(A \cap B) + P(A \cup B) \leq P(A) + P(B)$ (in fact both sides are equal!). We say $P$ is submodular.
The next theorem is better expressed in terms of limits.
Definition (Limit of events). A sequence of events $A_1, A_2, \ldots$ is increasing if $A_1 \subseteq A_2 \subseteq \cdots$. Then we define the limit as
$$\lim_{n\to\infty} A_n = \bigcup_{n=1}^\infty A_n.$$
Similarly, if they are decreasing, i.e. $A_1 \supseteq A_2 \supseteq \cdots$, then
$$\lim_{n\to\infty} A_n = \bigcap_{n=1}^\infty A_n.$$
Theorem. If $A_1, A_2, \ldots$ is increasing or decreasing, then
$$\lim_{n\to\infty} P(A_n) = P\left(\lim_{n\to\infty} A_n\right).$$
Proof. Take $B_1 = A_1$, $B_2 = A_2 \setminus A_1$. In general,
$$B_n = A_n \setminus \bigcup_{i=1}^{n-1} A_i.$$
Then
$$\bigcup_{i=1}^n B_i = \bigcup_{i=1}^n A_i, \qquad \bigcup_{i=1}^\infty B_i = \bigcup_{i=1}^\infty A_i.$$
Then
$$P(\lim A_n) = P\left(\bigcup_{i=1}^\infty A_i\right) = P\left(\bigcup_{i=1}^\infty B_i\right) = \sum_{i=1}^\infty P(B_i) \quad \text{(Axiom III)}$$
$$= \lim_{n\to\infty}\sum_{i=1}^n P(B_i) = \lim_{n\to\infty} P\left(\bigcup_{i=1}^n A_i\right) = \lim_{n\to\infty} P(A_n),$$
and the decreasing case is proven similarly (or we can simply apply the above to $A_i^C$).
2.2 Inequalities and formulae
Theorem (Boole's inequality). For any $A_1, A_2, \ldots$,
$$P\left(\bigcup_{i=1}^\infty A_i\right) \leq \sum_{i=1}^\infty P(A_i).$$
This is also known as the “union bound”.

Proof. Our third axiom states a similar formula that only holds for disjoint sets. So we need a (not so) clever trick to make them disjoint. We define
$$B_1 = A_1, \quad B_2 = A_2 \setminus A_1, \quad B_i = A_i \setminus \bigcup_{k=1}^{i-1} A_k.$$
So we know that
$$\bigcup B_i = \bigcup A_i.$$
But the $B_i$ are disjoint. So our Axiom (iii) gives
$$P\left(\bigcup_i A_i\right) = P\left(\bigcup_i B_i\right) = \sum_i P(B_i) \leq \sum_i P(A_i),$$
where the last inequality follows from (iii) of the theorem above, since $B_i \subseteq A_i$.
Example. Suppose we have a countably infinite number of biased coins. Let $A_k = [k\text{th toss is a head}]$ and $P(A_k) = p_k$. Suppose $\sum_{k=1}^\infty p_k < \infty$. What is the probability that there are infinitely many heads?

The event “there is at least one more head after the $i$th coin toss” is $\bigcup_{k=i}^\infty A_k$. There are infinitely many heads if and only if there are heads after arbitrarily many tosses, i.e. no matter how large $i$ is, there is still at least one more head after the $i$th toss.

So the probability required is
$$P\left(\bigcap_{i=1}^\infty\bigcup_{k=i}^\infty A_k\right) = \lim_{i\to\infty} P\left(\bigcup_{k=i}^\infty A_k\right) \leq \lim_{i\to\infty}\sum_{k=i}^\infty p_k = 0.$$
Therefore $P(\text{infinite number of heads}) = 0$.
Example (Erdős 1947). Is it possible to colour a complete $n$-graph (i.e. a graph of $n$ vertices with edges between every pair of vertices) red and black such that there is no $k$-vertex complete subgraph with monochrome edges?

Erdős said this is possible if
$$\binom{n}{k}2^{1 - \binom{k}{2}} < 1.$$
We colour edges randomly, and let $A_i = [i\text{th subgraph has monochrome edges}]$. Then the probability that at least one subgraph has monochrome edges is
$$P\left(\bigcup A_i\right) \leq \sum P(A_i) = \binom{n}{k}\cdot 2\cdot 2^{-\binom{k}{2}}.$$
The last expression is obtained since there are $\binom{n}{k}$ ways to choose a subgraph; a monochrome subgraph can be either red or black, hence the factor of 2; and the probability of getting all red (or all black) is $2^{-\binom{k}{2}}$.

If this probability is less than 1, then there must be a way to colour the edges in which no monochrome subgraph can be found, or else the probability would be 1. So if $\binom{n}{k}2^{1 - \binom{k}{2}} < 1$, the colouring is possible.
Theorem (Inclusion-exclusion formula).
$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i_1 < i_2} P(A_{i_1}\cap A_{i_2}) + \sum_{i_1 < i_2 < i_3} P(A_{i_1}\cap A_{i_2}\cap A_{i_3}) - \cdots + (-1)^{n-1}P(A_1\cap\cdots\cap A_n).$$
Proof. Perform induction on $n$. The case $n = 2$ is proven above. Then
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2 \cup \cdots \cup A_n) - P\left(\bigcup_{i=2}^n (A_1 \cap A_i)\right).$$
Then we can apply the induction hypothesis for $n - 1$ sets, and expand the mess. The details are very similar to those in IA Numbers and Sets.
Example. Let $1, 2, \ldots, n$ be randomly permuted to $\pi(1), \pi(2), \ldots, \pi(n)$. If $i \neq \pi(i)$ for all $i$, we say we have a derangement.

Let $A_i = [i = \pi(i)]$. Then
$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_k P(A_k) - \sum_{k_1 < k_2} P(A_{k_1}\cap A_{k_2}) + \cdots$$
$$= n\cdot\frac{1}{n} - \binom{n}{2}\frac{1}{n}\cdot\frac{1}{n-1} + \binom{n}{3}\frac{1}{n}\cdot\frac{1}{n-1}\cdot\frac{1}{n-2} - \cdots$$
$$= 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n-1}\frac{1}{n!} \to 1 - e^{-1}.$$
So the probability of a derangement is $1 - P\left(\bigcup A_k\right) \to e^{-1} \approx 0.368$.
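As a sanity check on the limit $e^{-1}$, here is a brute-force enumeration in Python (illustrative, not part of the notes):

```python
from itertools import permutations
from math import exp, factorial

# Exact derangement probability for small n, compared with the limit e^{-1}.
for n in (4, 6, 8):
    derangements = sum(
        1 for p in permutations(range(n)) if all(p[i] != i for i in range(n))
    )
    print(n, derangements / factorial(n), exp(-1))
```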
Recall that, from inclusion-exclusion,
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(BC) - P(AC) + P(ABC),$$
where $P(AB)$ is a shorthand for $P(A \cap B)$. If we only take the first three terms, then we get Boole's inequality
$$P(A \cup B \cup C) \leq P(A) + P(B) + P(C).$$
In general
Theorem (Bonferroni's inequalities). For any events $A_1, A_2, \ldots, A_n$ and $1 \leq r \leq n$, if $r$ is odd, then
$$P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i_1}P(A_{i_1}) - \sum_{i_1<i_2}P(A_{i_1}A_{i_2}) + \sum_{i_1<i_2<i_3}P(A_{i_1}A_{i_2}A_{i_3}) - \cdots + \sum_{i_1<i_2<\cdots<i_r}P(A_{i_1}A_{i_2}A_{i_3}\cdots A_{i_r}).$$
If $r$ is even, then
$$P\left(\bigcup_{i=1}^n A_i\right) \geq \sum_{i_1}P(A_{i_1}) - \sum_{i_1<i_2}P(A_{i_1}A_{i_2}) + \sum_{i_1<i_2<i_3}P(A_{i_1}A_{i_2}A_{i_3}) - \cdots - \sum_{i_1<i_2<\cdots<i_r}P(A_{i_1}A_{i_2}A_{i_3}\cdots A_{i_r}).$$
Proof. Easy induction on $n$.
Example. Let $\Omega = \{1, 2, \ldots, m\}$ and $1 \leq j, k \leq m$. Write $A_k = \{1, 2, \ldots, k\}$. Then
$$A_k \cap A_j = \{1, 2, \ldots, \min(j, k)\} = A_{\min(j,k)}$$
and
$$A_k \cup A_j = \{1, 2, \ldots, \max(j, k)\} = A_{\max(j,k)}.$$
We also have $P(A_k) = k/m$.

Now let $1 \leq x_1, \ldots, x_n \leq m$ be some numbers. Then Bonferroni's inequality says
$$P\left(\bigcup A_{x_i}\right) \geq \sum P(A_{x_i}) - \sum_{i<j}P(A_{x_i}\cap A_{x_j}).$$
So
$$\max\{x_1, x_2, \ldots, x_n\} \geq \sum x_i - \sum_{i_1<i_2}\min\{x_{i_1}, x_{i_2}\}.$$
2.3 Independence
Definition (Independent events). Two events A and B are independent if
$$P(A \cap B) = P(A)P(B).$$
Otherwise, they are said to be dependent.
Two events are independent if they are not related to each other. For example,
if you roll two dice separately, the outcomes will be independent.
Proposition. If $A$ and $B$ are independent, then $A$ and $B^C$ are independent.

Proof.
$$P(A \cap B^C) = P(A) - P(A \cap B) = P(A) - P(A)P(B) = P(A)(1 - P(B)) = P(A)P(B^C).$$
This definition applies to two events. What does it mean to say that three
or more events are independent?
Example. Roll two fair dice. Let $A_1$ and $A_2$ be the events that the first and second die are odd respectively. Let $A_3 = [\text{sum is odd}]$. The event probabilities are as follows:

Event                      Probability
$A_1$                      1/2
$A_2$                      1/2
$A_3$                      1/2
$A_1 \cap A_2$             1/4
$A_1 \cap A_3$             1/4
$A_2 \cap A_3$             1/4
$A_1 \cap A_2 \cap A_3$    0

We see that $A_1$ and $A_2$ are independent, $A_1$ and $A_3$ are independent, and $A_2$ and $A_3$ are independent. However, the collection of all three is not independent, since if $A_1$ and $A_2$ occur, then $A_3$ cannot possibly occur.
From the example above, we see that just because a set of events is pairwise
independent does not mean they are independent all together. We define:
Definition (Independence of multiple events). Events $A_1, A_2, \ldots$ are said to be mutually independent if
$$P(A_{i_1}\cap A_{i_2}\cap\cdots\cap A_{i_r}) = P(A_{i_1})P(A_{i_2})\cdots P(A_{i_r})$$
for any distinct $i_1, i_2, \ldots, i_r$ and $r \geq 2$.
Example. Let $A_{ij}$ be the event that dice $i$ and $j$ roll the same. We roll 4 dice. Then
$$P(A_{12}\cap A_{13}) = \frac{1}{6}\cdot\frac{1}{6} = \frac{1}{36} = P(A_{12})P(A_{13}).$$
But
$$P(A_{12}\cap A_{13}\cap A_{23}) = \frac{1}{36} \neq P(A_{12})P(A_{13})P(A_{23}).$$
So they are not mutually independent.
We can also apply this concept to experiments. Suppose we model two independent experiments with $\Omega_1 = \{\alpha_1, \alpha_2, \ldots\}$ and $\Omega_2 = \{\beta_1, \beta_2, \ldots\}$ with probabilities $P(\alpha_i) = p_i$ and $P(\beta_i) = q_i$. Further suppose that these two experiments are independent, i.e.
$$P((\alpha_i, \beta_j)) = p_i q_j$$
for all $i, j$. Then we can have a new sample space $\Omega = \Omega_1 \times \Omega_2$.

Now suppose $A \subseteq \Omega_1$ and $B \subseteq \Omega_2$ are results (i.e. events) of the two experiments. We can view them as subspaces of $\Omega$ by rewriting them as $A \times \Omega_2$ and $\Omega_1 \times B$. Then the probability
$$P(A \cap B) = \sum_{\alpha_i\in A,\,\beta_j\in B} p_i q_j = \sum_{\alpha_i\in A}p_i\sum_{\beta_j\in B}q_j = P(A)P(B).$$
So we say the two experiments are “independent” even though the term usually refers to different events in the same experiment. We can generalize this to $n$ independent experiments, or even countably infinitely many experiments.
2.4 Important discrete distributions
We’re now going to quickly go through a few important discrete probability
distributions. By discrete we mean the sample space is countable. The sample
space is $\Omega = \{\omega_1, \omega_2, \ldots\}$ and $p_i = P(\{\omega_i\})$.
Definition (Bernoulli distribution). Suppose we toss a coin. $\Omega = \{H, T\}$ and $p \in [0, 1]$. The Bernoulli distribution, denoted $B(1, p)$, has
$$P(H) = p; \quad P(T) = 1 - p.$$
Definition (Binomial distribution). Suppose we toss a coin $n$ times, each time with probability $p$ of getting heads. Then
$$P(HHTT\cdots T) = pp(1-p)\cdots(1-p).$$
So
$$P(\text{two heads}) = \binom{n}{2}p^2(1-p)^{n-2}.$$
In general,
$$P(k\text{ heads}) = \binom{n}{k}p^k(1-p)^{n-k}.$$
We call this the binomial distribution and write it as $B(n, p)$.
Definition (Geometric distribution). Suppose we toss a coin with probability $p$ of getting heads. The probability of having a head after $k$ consecutive tails is
$$p_k = (1-p)^k p.$$
This is the geometric distribution. We say it is memoryless because how many tails we've got in the past does not give us any information about how long we'll have to wait until we get a head.
Definition (Hypergeometric distribution). Suppose we have an urn with $n_1$ red balls and $n_2$ black balls. We choose $n$ balls. The probability that there are $k$ red balls is
$$P(k\text{ red}) = \frac{\binom{n_1}{k}\binom{n_2}{n-k}}{\binom{n_1+n_2}{n}}.$$
Definition (Poisson distribution). The Poisson distribution, denoted $P(\lambda)$, is
$$p_k = \frac{\lambda^k}{k!}e^{-\lambda} \quad \text{for } k \in \mathbb{N}.$$
What is this weird distribution? It is a distribution used to model rare events. Suppose that an event happens at a rate of $\lambda$. We can think of this as there being a lot of trials, say $n$ of them, and each has a probability $\lambda/n$ of succeeding. As we take the limit $n \to \infty$, we obtain the Poisson distribution.
Theorem (Poisson approximation to binomial). Suppose $n \to \infty$ and $p \to 0$ such that $np = \lambda$. Then
$$q_k = \binom{n}{k}p^k(1-p)^{n-k} \to \frac{\lambda^k}{k!}e^{-\lambda}.$$
Proof.
$$q_k = \binom{n}{k}p^k(1-p)^{n-k} = \frac{1}{k!}\cdot\frac{n(n-1)\cdots(n-k+1)}{n^k}(np)^k\left(1 - \frac{np}{n}\right)^{n-k} \to \frac{1}{k!}\lambda^k e^{-\lambda},$$
since $(1 - a/n)^n \to e^{-a}$.
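A short numerical illustration of the approximation (an added Python sketch with arbitrarily chosen parameters, not from the notes):

```python
from math import comb, exp, factorial

n, p = 1000, 0.003          # lambda = np = 3
lam = n * p
for k in range(6):
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)
    poisson = lam**k / factorial(k) * exp(-lam)
    print(k, round(binom, 5), round(poisson, 5))   # the two columns nearly agree
```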
2.5 Conditional probability
Definition (Conditional probability). Suppose $B$ is an event with $P(B) > 0$. For any event $A \subseteq \Omega$, the conditional probability of $A$ given $B$ is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
We interpret this as the probability of $A$ happening given that $B$ has happened.

Note that if $A$ and $B$ are independent, then
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).$$
Example. In a game of poker, let $A_i = [\text{player } i \text{ gets a royal flush}]$. Then
$$P(A_1) = 1.539 \times 10^{-6}$$
and
$$P(A_2 \mid A_1) = 1.969 \times 10^{-6}.$$
It is significantly bigger, albeit still incredibly tiny. So we say “good hands attract”.
If $P(A \mid B) > P(A)$, then we say that $B$ attracts $A$. Since
$$\frac{P(A \cap B)}{P(B)} > P(A) \iff \frac{P(A \cap B)}{P(A)} > P(B),$$
$A$ attracts $B$ if and only if $B$ attracts $A$. We can also say $A$ repels $B$ if $A$ attracts $B^C$.
Theorem.
(i) $P(A \cap B) = P(A \mid B)P(B)$.
(ii) $P(A \cap B \cap C) = P(A \mid B \cap C)P(B \mid C)P(C)$.
(iii) $P(A \mid B \cap C) = \dfrac{P(A \cap B \mid C)}{P(B \mid C)}$.
(iv) The function $P(\,\cdot \mid B)$ restricted to subsets of $B$ is a probability function (or measure).
Proof. Proofs of (i), (ii) and (iii) are trivial. So we only prove (iv). To prove this, we have to check the axioms.
(i) Let $A \subseteq B$. Then $P(A \mid B) = \frac{P(A \cap B)}{P(B)} \leq 1$.
(ii) $P(B \mid B) = \frac{P(B)}{P(B)} = 1$.
(iii) Let $A_i$ be disjoint events that are subsets of $B$. Then
$$P\left(\bigcup_i A_i \,\middle|\, B\right) = \frac{P(\bigcup_i A_i \cap B)}{P(B)} = \frac{P(\bigcup_i A_i)}{P(B)} = \sum_i\frac{P(A_i)}{P(B)} = \sum_i\frac{P(A_i \cap B)}{P(B)} = \sum_i P(A_i \mid B).$$
Definition (Partition). A partition of the sample space is a collection of disjoint events $\{B_i\}_{i=0}^\infty$ such that $\bigcup_i B_i = \Omega$.
For example, “odd” and “even” partition the sample space into two events.
The following result should be clear:
Proposition. If $B_i$ is a partition of the sample space, and $A$ is any event, then
$$P(A) = \sum_{i=1}^\infty P(A \cap B_i) = \sum_{i=1}^\infty P(A \mid B_i)P(B_i).$$
Example. A fair coin is tossed repeatedly. The gambler gets +1 for heads, and −1 for tails. Continue until he is broke or achieves $a. Let
$$p_x = P(\text{goes broke} \mid \text{starts with } \$x),$$
and let $B_1$ be the event that he gets a head on the first toss. Then
$$p_x = P(B_1)p_{x+1} + P(B_1^C)p_{x-1}, \qquad p_x = \frac{1}{2}p_{x+1} + \frac{1}{2}p_{x-1}.$$
We have two boundary conditions $p_0 = 1$, $p_a = 0$. Then solving the recurrence relation, we have
$$p_x = 1 - \frac{x}{a}.$$
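A quick Monte Carlo check of the ruin probability (an illustrative Python sketch with arbitrary $x$ and $a$, not from the notes):

```python
import random

def ruin_probability(x, a, trials=100_000):
    """Estimate P(go broke | start with x), target a, fair coin."""
    broke = 0
    for _ in range(trials):
        wealth = x
        while 0 < wealth < a:
            wealth += random.choice((+1, -1))
        broke += (wealth == 0)
    return broke / trials

x, a = 3, 10
print(ruin_probability(x, a), 1 - x / a)   # both approximately 0.7
```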
Theorem (Bayes' formula). Suppose $B_i$ is a partition of the sample space, and $A$ and the $B_i$ all have non-zero probability. Then for any $B_i$,
$$P(B_i \mid A) = \frac{P(A \mid B_i)P(B_i)}{\sum_j P(A \mid B_j)P(B_j)}.$$
Note that the denominator is simply $P(A)$ written in a fancy way.
Example (Screen test). Suppose we have a screening test that tests whether a patient has a particular disease. We denote positive and negative results as $+$ and $-$ respectively, and $D$ denotes the person having the disease. Suppose that the test is not absolutely accurate, and
$$P(+ \mid D) = 0.98, \quad P(+ \mid D^C) = 0.01, \quad P(D) = 0.001.$$
So what is the probability that a person has the disease given that he received a positive result?
$$P(D \mid +) = \frac{P(+ \mid D)P(D)}{P(+ \mid D)P(D) + P(+ \mid D^C)P(D^C)} = \frac{0.98\cdot 0.001}{0.98\cdot 0.001 + 0.01\cdot 0.999} = 0.09.$$
So this test is pretty useless. Even if you get a positive result, since the disease is so rare, it is more likely that you don't have the disease and got a false positive.
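The same calculation in a few lines of Python (added for illustration, not part of the lectures):

```python
# Posterior probability of disease given a positive test, by Bayes' formula.
p_pos_given_d = 0.98
p_pos_given_not_d = 0.01
p_d = 0.001

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
print(p_pos_given_d * p_d / p_pos)   # ~0.089
```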
Example. Consider the two following cases:
(i) I have 2 children, one of whom is a boy.
(ii) I have 2 children, one of whom is a son born on a Tuesday.

What is the probability that both of them are boys?
(i) $P(BB \mid BB \cup BG) = \dfrac{1/4}{1/4 + 2/4} = \dfrac{1}{3}$.
(ii) Let $B^*$ denote a boy born on a Tuesday, and $B$ a boy not born on a Tuesday. Then
$$P(B^*B^* \cup B^*B \mid B^*B^* \cup B^*B \cup B^*G) = \frac{\frac{1}{14}\cdot\frac{1}{14} + 2\cdot\frac{1}{14}\cdot\frac{6}{14}}{\frac{1}{14}\cdot\frac{1}{14} + 2\cdot\frac{1}{14}\cdot\frac{6}{14} + 2\cdot\frac{1}{14}\cdot\frac{1}{2}} = \frac{13}{27}.$$
How can we understand this? It is much easier to have a boy born on a Tuesday
if you have two boys than one boy. So if we have the information that a boy
is born on a Tuesday, it is now less likely that there is just one boy. In other
words, it is more likely that there are two boys.
3 Discrete random variables
With what we’ve got so far, we are able to answer questions like “what is the
probability of getting a heads?” or “what is the probability of getting 10 heads
in a row?”. However, we cannot answer questions like “what do we expect to
get on average?”. What does it even mean to take the average of a “heads” and
a “tail”?
To make some sense of this notion, we have to assign, to each outcome, a
number. For example, we can let “heads” correspond to 1 and “tails” correspond to 0.
Then on average, we can expect to get 0.5. This is the idea of a random variable.
3.1 Discrete random variables
Definition (Random variable). A random variable $X$ taking values in a set $\Omega_X$ is a function $X : \Omega \to \Omega_X$. $\Omega_X$ is usually a set of numbers, e.g. $\mathbb{R}$ or $\mathbb{N}$.

Intuitively, a random variable assigns a “number” (or a thing in $\Omega_X$) to each outcome (e.g. assign 6 to the outcome “dice roll gives 6”).
Definition (Discrete random variables). A random variable is discrete if $\Omega_X$ is finite or countably infinite.

Notation. Let $T \subseteq \Omega_X$. Define
$$P(X \in T) = P(\{\omega : X(\omega) \in T\}),$$
i.e. the probability that the outcome is in $T$.

Here, instead of talking about the probability of getting a particular outcome or event, we are concerned with the probability of a random variable taking a particular value. If $\Omega$ is itself countable, then we can write this as
$$P(X \in T) = \sum_{\omega\in\Omega : X(\omega)\in T} p_\omega.$$
Example. Let $X$ be the value shown by rolling a fair die. Then $\Omega_X = \{1, 2, 3, 4, 5, 6\}$. We know that
$$P(X = i) = \frac{1}{6}.$$
We call this the discrete uniform distribution.
Definition (Discrete uniform distribution). A discrete uniform distribution
is a discrete distribution with finitely many possible outcomes, in which each
outcome is equally likely.
Example. Suppose we roll two dice, and let the values obtained be $X$ and $Y$. Then the sum can be represented by $X + Y$, with $\Omega_{X+Y} = \{2, 3, \ldots, 12\}$. This shows that we can add random variables to get a new random variable.
Notation. We write
$$P_X(x) = P(X = x).$$
We can also write $X \sim B(n, p)$ to mean
$$P(X = r) = \binom{n}{r}p^r(1-p)^{n-r},$$
and similarly for the other distributions we have come up with before.
Definition (Expectation). The expectation (or mean) of a real-valued random variable $X$ is
$$E[X] = \sum_{\omega} p_\omega X(\omega),$$
provided this is absolutely convergent. Otherwise, we say the expectation doesn't exist. Alternatively,
$$E[X] = \sum_{\omega} p_\omega X(\omega) = \sum_{x\in\Omega_X}\sum_{\omega : X(\omega)=x} p_\omega X(\omega) = \sum_{x\in\Omega_X} x\sum_{\omega : X(\omega)=x} p_\omega = \sum_{x\in\Omega_X} x P(X = x).$$
We are sometimes lazy and just write $EX$.
This is the “average” value of
X
we expect to get. Note that this definition
only holds in the case where the sample space is countable. If is continuous
(e.g. the whole of R), then we have to define the expectation as an integral.
Example. Let $X$ be the sum of the outcomes of two dice. Then
$$E[X] = 2\cdot\frac{1}{36} + 3\cdot\frac{2}{36} + \cdots + 12\cdot\frac{1}{36} = 7.$$
Note that
E
[
X
] can be non-existent if the sum is not absolutely convergent.
However, it is possible for the expected value to be infinite:
Example (St. Petersburg paradox). Suppose we play a game in which we keep tossing a coin until you get a tail. If you get a tail on the $i$th round, then I pay you $\$2^i$. The expected value is
$$E[X] = \frac{1}{2}\cdot 2 + \frac{1}{4}\cdot 4 + \frac{1}{8}\cdot 8 + \cdots = \infty.$$
This means that on average, you can expect to get an infinite amount of money! In real life, though, people would hardly be willing to pay $20 to play this game. There are many ways to resolve this paradox, such as taking into account the fact that the host of the game has only finitely much money and thus your real expected gain is much smaller.
Example. We calculate the expected values of different distributions:
(i) Poisson $P(\lambda)$. Let $X \sim P(\lambda)$. Then
$$P_X(r) = \frac{\lambda^r e^{-\lambda}}{r!}.$$
So
$$E[X] = \sum_{r=0}^\infty rP(X = r) = \sum_{r=0}^\infty \frac{r\lambda^r e^{-\lambda}}{r!} = \sum_{r=1}^\infty \frac{\lambda\cdot\lambda^{r-1}e^{-\lambda}}{(r-1)!} = \lambda\sum_{r=0}^\infty \frac{\lambda^r e^{-\lambda}}{r!} = \lambda.$$
(ii) Binomial. Let $X \sim B(n, p)$. Then
$$E[X] = \sum_{r=0}^n rP(X = r) = \sum_{r=0}^n r\binom{n}{r}p^r(1-p)^{n-r} = \sum_{r=0}^n r\frac{n!}{r!(n-r)!}p^r(1-p)^{n-r}$$
$$= np\sum_{r=1}^n \frac{(n-1)!}{(r-1)![(n-1)-(r-1)]!}p^{r-1}(1-p)^{(n-1)-(r-1)} = np\sum_{r=0}^{n-1}\binom{n-1}{r}p^r(1-p)^{n-1-r} = np.$$
Given a random variable $X$, we can create new random variables such as $X + 3$ or $X^2$. Formally, let $f : \mathbb{R} \to \mathbb{R}$ and let $X$ be a real-valued random variable. Then $f(X)$ is a new random variable that maps $\omega \mapsto f(X(\omega))$.

Example. If $a, b, c$ are constants, then $a + bX$ and $(X - c)^2$ are random variables, defined as
$$(a + bX)(\omega) = a + bX(\omega), \qquad (X - c)^2(\omega) = (X(\omega) - c)^2.$$
Theorem.
(i) If $X \geq 0$, then $E[X] \geq 0$.
(ii) If $X \geq 0$ and $E[X] = 0$, then $P(X = 0) = 1$.
(iii) If $a$ and $b$ are constants, then $E[a + bX] = a + bE[X]$.
(iv) If $X$ and $Y$ are random variables, then $E[X + Y] = E[X] + E[Y]$. This is true even if $X$ and $Y$ are not independent.
(v) $E[X]$ is the constant that minimizes $E[(X - c)^2]$ over $c$.
Proof.
(i) $X \geq 0$ means that $X(\omega) \geq 0$ for all $\omega$. Then
$$E[X] = \sum_\omega p_\omega X(\omega) \geq 0.$$
(ii) If there exists $\omega$ such that $X(\omega) > 0$ and $p_\omega > 0$, then $E[X] > 0$. So $X(\omega) = 0$ for all $\omega$ with $p_\omega > 0$, i.e. $P(X = 0) = 1$.
(iii)
$$E[a + bX] = \sum_\omega (a + bX(\omega))p_\omega = a + b\sum_\omega p_\omega X(\omega) = a + bE[X].$$
(iv)
$$E[X + Y] = \sum_\omega p_\omega[X(\omega) + Y(\omega)] = \sum_\omega p_\omega X(\omega) + \sum_\omega p_\omega Y(\omega) = E[X] + E[Y].$$
(v)
$$E[(X - c)^2] = E[(X - E[X] + E[X] - c)^2] = E[(X - E[X])^2 + 2(E[X] - c)(X - E[X]) + (E[X] - c)^2] = E[(X - E[X])^2] + 0 + (E[X] - c)^2.$$
This is clearly minimized when $c = E[X]$. Note that we obtained the zero in the middle because $E[X - E[X]] = E[X] - E[X] = 0$.
An easy generalization of (iv) above is

Theorem. For any random variables $X_1, X_2, \ldots, X_n$ for which the following expectations exist,
$$E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i].$$
Proof.
$$\sum_\omega p(\omega)[X_1(\omega) + \cdots + X_n(\omega)] = \sum_\omega p(\omega)X_1(\omega) + \cdots + \sum_\omega p(\omega)X_n(\omega).$$
Definition (Variance and standard deviation). The variance of a random variable $X$ is defined as
$$\mathrm{var}(X) = E[(X - E[X])^2].$$
The standard deviation is the square root of the variance, $\sqrt{\mathrm{var}(X)}$.
This is a measure of how “dispersed” the random variable
X
is. If we have a
low variance, then the value of X is very likely to be close to E[X].
Theorem.
(i) $\mathrm{var}\,X \geq 0$. If $\mathrm{var}\,X = 0$, then $P(X = E[X]) = 1$.
(ii) $\mathrm{var}(a + bX) = b^2\,\mathrm{var}(X)$. This can be proved by expanding the definition and using the linearity of the expected value.
(iii) $\mathrm{var}(X) = E[X^2] - E[X]^2$, also proven by expanding the definition.
Example (Binomial distribution). Let $X \sim B(n, p)$ be a binomial random variable. Then $E[X] = np$. We also have
$$E[X(X-1)] = \sum_{r=0}^n r(r-1)\frac{n!}{r!(n-r)!}p^r(1-p)^{n-r} = n(n-1)p^2\sum_{r=2}^n\binom{n-2}{r-2}p^{r-2}(1-p)^{(n-2)-(r-2)} = n(n-1)p^2.$$
The sum is 1 since it is the sum of all probabilities of a binomial $B(n-2, p)$. So $E[X^2] = n(n-1)p^2 + E[X] = n(n-1)p^2 + np$. So
$$\mathrm{var}(X) = E[X^2] - (E[X])^2 = np(1-p) = npq.$$
Example (Poisson distribution). If $X \sim P(\lambda)$, then $E[X] = \lambda$ and $\mathrm{var}(X) = \lambda$, since $P(\lambda)$ is the limit of $B(n, p)$ with $n \to \infty$, $p \to 0$, $np \to \lambda$.
Example (Geometric distribution). Suppose $P(X = r) = q^r p$ for $r = 0, 1, 2, \ldots$. Then
$$E[X] = \sum_{r=0}^\infty rpq^r = pq\sum_{r=0}^\infty rq^{r-1} = pq\sum_{r=0}^\infty\frac{d}{dq}q^r = pq\frac{d}{dq}\sum_{r=0}^\infty q^r = pq\frac{d}{dq}\frac{1}{1-q} = \frac{pq}{(1-q)^2} = \frac{q}{p}.$$
Then
$$E[X(X-1)] = \sum_{r=0}^\infty r(r-1)pq^r = pq^2\sum_{r=0}^\infty r(r-1)q^{r-2} = pq^2\frac{d^2}{dq^2}\frac{1}{1-q} = \frac{2pq^2}{(1-q)^3}.$$
So the variance is
$$\mathrm{var}(X) = \frac{2pq^2}{(1-q)^3} + \frac{q}{p} - \frac{q^2}{p^2} = \frac{q}{p^2}.$$
Definition (Indicator function). The indicator function or indicator variable $I[A]$ (or $I_A$) of an event $A$ is
$$I[A](\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A \end{cases}$$
This indicator random variable is not interesting by itself. However, it is a rather useful tool to prove results.

It has the following properties:
Proposition.
– $E[I[A]] = \sum_\omega p(\omega)I[A](\omega) = P(A)$.
– $I[A^C] = 1 - I[A]$.
– $I[A \cap B] = I[A]I[B]$.
– $I[A \cup B] = I[A] + I[B] - I[A]I[B]$.
– $I[A]^2 = I[A]$.

These are easy to prove from the definition. In particular, the last property comes from the fact that $I[A]$ is either 0 or 1, and $0^2 = 0$, $1^2 = 1$.
Example. Let $2n$ people ($n$ husbands and $n$ wives, with $n > 2$) sit alternately man-woman around a table at random. Let $N$ be the number of couples sitting next to each other.

Let $A_i = [i\text{th couple sits together}]$. Then
$$N = \sum_{i=1}^n I[A_i].$$
Then
$$E[N] = E\left[\sum I[A_i]\right] = \sum_{i=1}^n E[I[A_i]] = nE[I[A_1]] = nP(A_1) = n\cdot\frac{2}{n} = 2.$$
We also have
$$E[N^2] = E\left[\left(\sum I[A_i]\right)^2\right] = E\left[\sum_i I[A_i]^2 + 2\sum_{i<j}I[A_i]I[A_j]\right] = nE[I[A_1]] + n(n-1)E[I[A_1]I[A_2]].$$
We have $E[I[A_1]I[A_2]] = P(A_1 \cap A_2) = \frac{2}{n}\left(\frac{1}{n-1}\cdot\frac{1}{n-1} + \frac{n-2}{n-1}\cdot\frac{2}{n-1}\right)$. Plugging in, we ultimately obtain
$$\mathrm{var}(N) = \frac{2(n-2)}{n-1}.$$
In fact, as $n \to \infty$, $N \to P(2)$.
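The value $E[N] = 2$ is easy to check by simulation (an added Python sketch; the seating model — men and women assigned independently at random to alternating seats around a circle — is an assumption matching the setup above):

```python
import random

def simulate_couples(n, trials=20_000):
    """Average number of couples adjacent when n couples sit alternately at random."""
    total = 0
    for _ in range(trials):
        men = list(range(n))
        women = list(range(n))
        random.shuffle(men)
        random.shuffle(women)
        # Woman in woman-seat i sits between the men in man-seats i and i+1 (mod n).
        couples = sum(
            1 for i in range(n)
            if women[i] == men[i] or women[i] == men[(i + 1) % n]
        )
        total += couples
    return total / trials

print(simulate_couples(10))   # close to E[N] = 2
```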
We can use these to prove the inclusion-exclusion formula:
Theorem (Inclusion-exclusion formula).
$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i_1<i_2}P(A_{i_1}\cap A_{i_2}) + \sum_{i_1<i_2<i_3}P(A_{i_1}\cap A_{i_2}\cap A_{i_3}) - \cdots + (-1)^{n-1}P(A_1\cap\cdots\cap A_n).$$
Proof. Let $I_j$ be the indicator function for $A_j$. Write
$$S_r = \sum_{i_1<i_2<\cdots<i_r} I_{i_1}I_{i_2}\cdots I_{i_r},$$
and
$$s_r = E[S_r] = \sum_{i_1<\cdots<i_r} P(A_{i_1}\cap\cdots\cap A_{i_r}).$$
Then
$$1 - \prod_{j=1}^n(1 - I_j) = S_1 - S_2 + S_3 - \cdots + (-1)^{n-1}S_n.$$
So
$$P\left(\bigcup_{j=1}^n A_j\right) = E\left[1 - \prod_{j=1}^n(1 - I_j)\right] = s_1 - s_2 + s_3 - \cdots + (-1)^{n-1}s_n.$$
We can extend the idea of independence to random variables. Two random
variables are independent if the value of the first does not affect the value of the
second.
Definition (Independent random variables). Let $X_1, X_2, \ldots, X_n$ be discrete random variables. They are independent iff for any $x_1, x_2, \ldots, x_n$,
$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1)\cdots P(X_n = x_n).$$
Theorem. If $X_1, \ldots, X_n$ are independent random variables, and $f_1, \ldots, f_n$ are functions $\mathbb{R} \to \mathbb{R}$, then $f_1(X_1), \ldots, f_n(X_n)$ are independent random variables.

Proof. Note that given a particular $y_i$, there can be many different $x_i$ for which $f_i(x_i) = y_i$. When finding $P(f_i(X_i) = y_i)$, we need to sum over all $x_i$ such that $f_i(x_i) = y_i$. Then
$$P(f_1(X_1) = y_1, \ldots, f_n(X_n) = y_n) = \sum_{x_1:f_1(x_1)=y_1}\cdots\sum_{x_n:f_n(x_n)=y_n} P(X_1 = x_1, \ldots, X_n = x_n)$$
$$= \sum_{x_1:f_1(x_1)=y_1}\cdots\sum_{x_n:f_n(x_n)=y_n}\prod_{i=1}^n P(X_i = x_i) = \prod_{i=1}^n\sum_{x_i:f_i(x_i)=y_i}P(X_i = x_i) = \prod_{i=1}^n P(f_i(X_i) = y_i).$$
Note that the switch from the second to third line is valid since they both expand to the same mess.
Theorem. If $X_1, \ldots, X_n$ are independent random variables and all the following expectations exist, then
$$E\left[\prod X_i\right] = \prod E[X_i].$$
Proof. Write $R_i$ for the range of $X_i$. Then
$$E\left[\prod_{i=1}^n X_i\right] = \sum_{x_1\in R_1}\cdots\sum_{x_n\in R_n} x_1x_2\cdots x_n \times P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^n\sum_{x_i\in R_i} x_iP(X_i = x_i) = \prod_{i=1}^n E[X_i].$$
Corollary. Let $X_1, \ldots, X_n$ be independent random variables, and $f_1, f_2, \ldots, f_n$ functions $\mathbb{R} \to \mathbb{R}$. Then
$$E\left[\prod f_i(X_i)\right] = \prod E[f_i(X_i)].$$
Theorem. If $X_1, X_2, \ldots, X_n$ are independent random variables, then
$$\mathrm{var}\left(\sum X_i\right) = \sum\mathrm{var}(X_i).$$
Proof.
$$\mathrm{var}\left(\sum X_i\right) = E\left[\left(\sum X_i\right)^2\right] - \left(E\left[\sum X_i\right]\right)^2 = E\left[\sum X_i^2 + \sum_{i\neq j}X_iX_j\right] - \left(\sum E[X_i]\right)^2$$
$$= \sum E[X_i^2] + \sum_{i\neq j}E[X_i]E[X_j] - \sum(E[X_i])^2 - \sum_{i\neq j}E[X_i]E[X_j] = \sum\left(E[X_i^2] - (E[X_i])^2\right).$$
Corollary. Let $X_1, X_2, \ldots, X_n$ be independent identically distributed random variables (iid rvs). Then
$$\mathrm{var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n}\mathrm{var}(X_1).$$
Proof.
$$\mathrm{var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2}\mathrm{var}\left(\sum X_i\right) = \frac{1}{n^2}\sum\mathrm{var}(X_i) = \frac{1}{n^2}\cdot n\,\mathrm{var}(X_1) = \frac{1}{n}\mathrm{var}(X_1).$$
This result is important in statistics. This means that if we want to reduce
the variance of our experimental results, then we can repeat the experiment
many times (corresponding to a large
n
), and then the sample average will have
a small variance.
Example. Let $X_i$ be iid $B(1, p)$, i.e. $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$. Then $Y = X_1 + X_2 + \cdots + X_n \sim B(n, p)$.

Since $\mathrm{var}(X_i) = E[X_i^2] - (E[X_i])^2 = p - p^2 = p(1 - p)$, we have $\mathrm{var}(Y) = np(1 - p)$.
Example. Suppose we have two rods of unknown lengths $a, b$. We can measure the lengths, but the measurements are not accurate. Let $A$ and $B$ be the measured values. Suppose
$$E[A] = a,\ \mathrm{var}(A) = \sigma^2, \qquad E[B] = b,\ \mathrm{var}(B) = \sigma^2.$$
We can measure more accurately by measuring $X = A + B$ and $Y = A - B$. Then we estimate $a$ and $b$ by
$$\hat{a} = \frac{X + Y}{2}, \qquad \hat{b} = \frac{X - Y}{2}.$$
Then $E[\hat{a}] = a$ and $E[\hat{b}] = b$, i.e. they are unbiased. Also
$$\mathrm{var}(\hat{a}) = \frac{1}{4}\mathrm{var}(X + Y) = \frac{1}{4}\cdot 2\sigma^2 = \frac{1}{2}\sigma^2,$$
and similarly for $\hat{b}$. So we can measure more accurately by measuring the sticks together instead of separately.
3.2 Inequalities
Here we prove a lot of different inequalities which may be useful for certain
calculations. In particular, Chebyshev’s inequality will allow us to prove the
weak law of large numbers.
Definition (Convex function). A function $f : (a, b) \to \mathbb{R}$ is convex if for all $x_1, x_2 \in (a, b)$ and $\lambda_1, \lambda_2 \geq 0$ such that $\lambda_1 + \lambda_2 = 1$,
$$\lambda_1 f(x_1) + \lambda_2 f(x_2) \geq f(\lambda_1 x_1 + \lambda_2 x_2).$$
It is strictly convex if the inequality above is strict (except when $x_1 = x_2$ or $\lambda_1$ or $\lambda_2 = 0$).

[figure: a convex curve, with the chord value $\lambda_1 f(x_1) + \lambda_2 f(x_2)$ lying above $f(\lambda_1 x_1 + \lambda_2 x_2)$]

A function is concave if $-f$ is convex.
A useful criterion for convexity is
Proposition. If $f$ is differentiable and $f''(x) \geq 0$ for all $x \in (a, b)$, then it is convex. It is strictly convex if $f''(x) > 0$.
Theorem (Jensen's inequality). If $f : (a, b) \to \mathbb{R}$ is convex, then
$$\sum_{i=1}^n p_i f(x_i) \geq f\left(\sum_{i=1}^n p_i x_i\right)$$
for all $p_1, p_2, \ldots, p_n$ such that $p_i \geq 0$ and $\sum p_i = 1$, and $x_i \in (a, b)$.

This says that $E[f(X)] \geq f(E[X])$ (where $P(X = x_i) = p_i$).

If $f$ is strictly convex, then equality holds only if all the $x_i$ are equal, i.e. $X$ takes only one possible value.
Proof. Induct on $n$. It is true for $n = 2$ by the definition of convexity. Then
$$f(p_1x_1 + \cdots + p_nx_n) = f\left(p_1x_1 + (p_2 + \cdots + p_n)\frac{p_2x_2 + \cdots + p_nx_n}{p_2 + \cdots + p_n}\right)$$
$$\leq p_1f(x_1) + (p_2 + \cdots + p_n)f\left(\frac{p_2x_2 + \cdots + p_nx_n}{p_2 + \cdots + p_n}\right)$$
$$\leq p_1f(x_1) + (p_2 + \cdots + p_n)\left[\frac{p_2}{(\,)}f(x_2) + \cdots + \frac{p_n}{(\,)}f(x_n)\right] = p_1f(x_1) + \cdots + p_nf(x_n),$$
where $(\,)$ is $p_2 + \cdots + p_n$.

The strictly convex case is proved with $\leq$ replaced by $<$ by the definition of strict convexity.
Corollary (AM-GM inequality). Given positive reals $x_1, \ldots, x_n$,
$$\left(\prod x_i\right)^{1/n} \leq \frac{1}{n}\sum x_i.$$
Proof. Take $f(x) = -\log x$. This is convex since its second derivative is $x^{-2} > 0$.

Take $P(X = x_i) = 1/n$. Then
$$E[f(X)] = -\frac{1}{n}\sum\log x_i = -\log\mathrm{GM}$$
and
$$f(E[X]) = -\log\left(\frac{1}{n}\sum x_i\right) = -\log\mathrm{AM}.$$
Since $f(E[X]) \leq E[f(X)]$, AM $\geq$ GM. Since $-\log x$ is strictly convex, AM $=$ GM only if all the $x_i$ are equal.
Theorem (Cauchy-Schwarz inequality). For any two random variables $X, Y$,
$$(E[XY])^2 \leq E[X^2]E[Y^2].$$
Proof. If $Y = 0$, then both sides are 0. Otherwise, $E[Y^2] > 0$. Let
$$w = X - Y\cdot\frac{E[XY]}{E[Y^2]}.$$
Then
$$E[w^2] = E\left[X^2 - 2XY\frac{E[XY]}{E[Y^2]} + Y^2\frac{(E[XY])^2}{(E[Y^2])^2}\right] = E[X^2] - 2\frac{(E[XY])^2}{E[Y^2]} + \frac{(E[XY])^2}{E[Y^2]} = E[X^2] - \frac{(E[XY])^2}{E[Y^2]}.$$
Since $E[w^2] \geq 0$, the Cauchy-Schwarz inequality follows.
Theorem (Markov inequality). If $X$ is a random variable with $E|X| < \infty$ and $\varepsilon > 0$, then
$$P(|X| \geq \varepsilon) \leq \frac{E|X|}{\varepsilon}.$$
Proof. We make use of the indicator function. We have
$$I[|X| \geq \varepsilon] \leq \frac{|X|}{\varepsilon}.$$
This is proved by exhaustion: if $|X| \geq \varepsilon$, then LHS $= 1$ and RHS $\geq 1$; if $|X| < \varepsilon$, then LHS $= 0$ and RHS is non-negative.

Take the expected value to obtain
$$P(|X| \geq \varepsilon) \leq \frac{E|X|}{\varepsilon}.$$
Similarly, we have

Theorem (Chebyshev inequality). If $X$ is a random variable with $E[X^2] < \infty$ and $\varepsilon > 0$, then
$$P(|X| \geq \varepsilon) \leq \frac{E[X^2]}{\varepsilon^2}.$$
Proof. Again, we have
$$I[|X| \geq \varepsilon] \leq \frac{X^2}{\varepsilon^2}.$$
Then take the expected value and the result follows.

Note that these are really powerful results, since they do not make any assumptions about the distribution of $X$. On the other hand, if we know something about the distribution, we can often get a better bound.

An important corollary is that if $\mu = E[X]$, then
$$P(|X - \mu| \geq \varepsilon) \leq \frac{E[(X - \mu)^2]}{\varepsilon^2} = \frac{\mathrm{var}\,X}{\varepsilon^2}.$$
3.3 Weak law of large numbers
Theorem (Weak law of large numbers). Let $X_1, X_2, \ldots$ be iid random variables, with mean $\mu$ and variance $\sigma^2$. Let $S_n = \sum_{i=1}^n X_i$. Then for all $\varepsilon > 0$,
$$P\left(\left|\frac{S_n}{n} - \mu\right| \geq \varepsilon\right) \to 0$$
as $n \to \infty$. We say that $\frac{S_n}{n}$ tends to $\mu$ (in probability), written $\frac{S_n}{n} \to_p \mu$.

Proof. By Chebyshev,
$$P\left(\left|\frac{S_n}{n} - \mu\right| \geq \varepsilon\right) \leq \frac{E\left[\left(\frac{S_n}{n} - \mu\right)^2\right]}{\varepsilon^2} = \frac{1}{n^2}\frac{E[(S_n - n\mu)^2]}{\varepsilon^2} = \frac{1}{n^2\varepsilon^2}\mathrm{var}(S_n) = \frac{n}{n^2\varepsilon^2}\mathrm{var}(X_1) = \frac{\sigma^2}{n\varepsilon^2} \to 0.$$
Note that we cannot relax the “independent” condition. For example, if $X_1 = X_2 = X_3 = \cdots = 1$ or $0$, each with probability $1/2$, then $S_n/n \not\to 1/2$, since it is always either 1 or 0.
Example. Suppose we toss a coin with probability $p$ of heads. Then
$$\frac{S_n}{n} = \frac{\text{number of heads}}{\text{number of tosses}}.$$
Since $E[X_i] = p$, the weak law of large numbers tells us that
$$\frac{S_n}{n} \to_p p.$$
This means that as we toss more and more coins, the proportion of heads will tend towards $p$.
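A tiny simulation of this convergence (an illustrative Python sketch with an arbitrarily chosen $p$, not from the notes):

```python
import random

p = 0.3
for n in (10, 100, 1_000, 10_000):
    heads = sum(random.random() < p for _ in range(n))
    print(n, heads / n)   # the proportions approach p = 0.3 as n grows
```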
Since we called the above the weak law, we also have the strong law, which
is a stronger statement.
Theorem (Strong law of large numbers).
$$P\left(\frac{S_n}{n} \to \mu \text{ as } n \to \infty\right) = 1.$$
We say
$$\frac{S_n}{n} \to_{\mathrm{as}} \mu,$$
where “as” means “almost surely”.

It can be shown that the weak law follows from the strong law, but not the other way round. The proof is left for Part II because it is too hard.
3.4 Multiple random variables
If we have two random variables, we can study the relationship between them.
Definition (Covariance). Given two random variables $X, Y$, the covariance is
$$\mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])].$$
Proposition.
(i) cov(X, c) = 0 for constant c.
(ii) cov(X + c, Y ) = cov(X, Y ).
(iii) cov(X, Y ) = cov(Y, X).
(iv) cov(X, Y ) = E[XY ] E[X]E[Y ].
(v) cov(X, X) = var(X).
(vi) var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y ).
(vii) If X, Y are independent, cov(X, Y ) = 0.
These are all trivial to prove and the proofs are omitted.

It is important to note that $\mathrm{cov}(X, Y) = 0$ does not imply that $X$ and $Y$ are independent.
Example.
– Let $(X, Y) = (2, 0)$, $(-1, -1)$ or $(-1, 1)$ with equal probabilities of $1/3$. These are not independent, since $Y = 0 \Rightarrow X = 2$. However, $\mathrm{cov}(X, Y) = E[XY] - E[X]E[Y] = 0 - 0\cdot 0 = 0$.
– If we randomly pick a point on the unit circle and let its coordinates be $(X, Y)$, then $E[X] = E[Y] = E[XY] = 0$ by symmetry. So $\mathrm{cov}(X, Y) = 0$, but $X$ and $Y$ are clearly not independent (they have to satisfy $x^2 + y^2 = 1$).
The covariance is not that useful in measuring how well two variables correlate.
For one, the covariance can (potentially) have dimensions, which means that the
numerical value of the covariance can depend on what units we are using. Also,
the magnitude of the covariance depends largely on the variances of $X$ and $Y$ themselves. To solve these problems, we define
Definition (Correlation coefficient). The correlation coefficient of $X$ and $Y$ is
$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}.$$
Proposition. $|\mathrm{corr}(X, Y)| \leq 1$.
Proof. Apply Cauchy-Schwarz to $X - E[X]$ and $Y - E[Y]$.
Again, zero correlation does not necessarily imply independence.
Alternatively, apart from finding a fixed covariance or correlation number, we can see how the distribution of $X$ depends on $Y$. Given two random variables $X, Y$, the function $P(X = x, Y = y)$ is known as the joint distribution. From this joint distribution, we can retrieve the probabilities $P(X = x)$ and $P(Y = y)$. We can also consider different conditional expectations.
Definition (Conditional distribution). Let $X$ and $Y$ be random variables (in general not independent) with joint distribution $P(X = x, Y = y)$. Then the marginal distribution (or simply distribution) of $X$ is
$$P(X = x) = \sum_y P(X = x, Y = y).$$
The conditional distribution of $X$ given $Y$ is
$$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}.$$
The conditional expectation of $X$ given $Y$ is
$$E[X \mid Y = y] = \sum_x xP(X = x \mid Y = y).$$
We can view $E[X \mid Y]$ as a random variable in $Y$: given a value of $Y$, we return the expectation of $X$.
Example. Consider a dice roll. Let $Y = 1$ denote an even roll and $Y = 0$ denote an odd roll. Let $X$ be the value of the roll. Then $E[X \mid Y] = 3 + Y$, i.e. 4 if even, 3 if odd.
Example. Let $X_1, \ldots, X_n$ be iid $B(1, p)$. Let $Y = X_1 + \cdots + X_n$. Then
$$P(X_1 = 1 \mid Y = r) = \frac{P(X_1 = 1, \sum_{i=2}^n X_i = r - 1)}{P(Y = r)} = \frac{p\binom{n-1}{r-1}p^{r-1}(1-p)^{(n-1)-(r-1)}}{\binom{n}{r}p^r(1-p)^{n-r}} = \frac{r}{n}.$$
So
$$E[X_1 \mid Y = r] = 1\cdot\frac{r}{n} + 0\cdot\left(1 - \frac{r}{n}\right) = \frac{r}{n}, \quad\text{i.e. } E[X_1 \mid Y] = \frac{Y}{n}.$$
Note that this is a random variable!
Theorem. If $X$ and $Y$ are independent, then
$$E[X \mid Y] = E[X].$$
Proof.
$$E[X \mid Y = y] = \sum_x xP(X = x \mid Y = y) = \sum_x xP(X = x) = E[X].$$
We know that the expected value of a dice roll given it is even is 4, and the
expected value given it is odd is 3. Since it is equally likely to be even or odd,
the expected value of the dice roll is 3.5. This is formally captured by
Theorem (Tower property of conditional expectation).
$$E_Y[E_X[X \mid Y]] = E_X[X],$$
where the subscripts indicate what variable the expectation is taken over.

Proof.
$$E_Y[E_X[X \mid Y]] = \sum_y P(Y = y)E[X \mid Y = y] = \sum_y P(Y = y)\sum_x xP(X = x \mid Y = y) = \sum_x\sum_y xP(X = x, Y = y)$$
$$= \sum_x x\sum_y P(X = x, Y = y) = \sum_x xP(X = x) = E[X].$$
This is also called the law of total expectation. We can also state it as: suppose $A_1, A_2, \ldots, A_n$ is a partition of $\Omega$. Then
$$E[X] = \sum_{i : P(A_i) > 0} E[X \mid A_i]P(A_i).$$
3.5 Probability generating functions
Consider a random variable $X$ taking values $0, 1, 2, \ldots$. Let $p_r = P(X = r)$.

Definition (Probability generating function (pgf)). The probability generating function (pgf) of $X$ is
$$p(z) = E[z^X] = \sum_{r=0}^\infty P(X = r)z^r = p_0 + p_1z + p_2z^2 + \cdots = \sum_{r=0}^\infty p_rz^r.$$
This is a power series (or polynomial), and converges if $|z| \leq 1$, since
$$|p(z)| \leq \sum_r p_r|z^r| \leq \sum_r p_r = 1.$$
We sometimes write it as $p_X(z)$ to indicate which random variable it refers to.
This definition might seem a bit out of the blue. However, it turns out to be
a rather useful algebraic tool that can concisely summarize information about
the probability distribution.
Example. Consider a fair die. Then $p_r = 1/6$ for $r = 1, \ldots, 6$. So
$$p(z) = E[z^X] = \frac{1}{6}(z + z^2 + \cdots + z^6) = \frac{1}{6}z\frac{1 - z^6}{1 - z}.$$
Theorem. The distribution of $X$ is uniquely determined by its probability generating function.

Proof. By definition, $p_0 = p(0)$, $p_1 = p'(0)$, etc. (where $p'$ is the derivative of $p$). In general,
$$\left.\frac{d^i}{dz^i}p(z)\right|_{z=0} = i!\,p_i.$$
So we can recover $(p_0, p_1, \ldots)$ from $p(z)$.
Theorem (Abel's lemma).
$$E[X] = \lim_{z\to 1}p'(z).$$
If $p'(z)$ is continuous at 1, then simply $E[X] = p'(1)$.

Note that this theorem is trivial if $p'(1)$ exists, as long as we know that we can differentiate power series term by term. What is important here is that even if $p'(1)$ doesn't exist, we can still take the limit and obtain the expected value, e.g. when $E[X] = \infty$.
Proof. For $z < 1$, we have
$$p'(z) = \sum_{r=1}^\infty rp_rz^{r-1} \leq \sum_{r=1}^\infty rp_r = E[X].$$
So we must have
$$\lim_{z\to 1}p'(z) \leq E[X].$$
On the other hand, for any $\varepsilon$, if we pick $N$ large, then
$$\sum_{r=1}^N rp_r \geq E[X] - \varepsilon.$$
So
$$E[X] - \varepsilon \leq \sum_{r=1}^N rp_r = \lim_{z\to 1}\sum_{r=1}^N rp_rz^{r-1} \leq \lim_{z\to 1}\sum_{r=1}^\infty rp_rz^{r-1} = \lim_{z\to 1}p'(z).$$
So $E[X] \leq \lim_{z\to 1}p'(z)$. So the result follows.
Theorem.
$$E[X(X-1)] = \lim_{z\to 1}p''(z).$$
Proof. Same as above.
Example. Consider the Poisson distribution. Then
$$p_r = P(X = r) = \frac{1}{r!}\lambda^re^{-\lambda}.$$
Then
$$p(z) = E[z^X] = \sum_{r=0}^\infty z^r\frac{1}{r!}\lambda^re^{-\lambda} = e^{\lambda z}e^{-\lambda} = e^{\lambda(z-1)}.$$
We can have a sanity check: $p(1) = 1$, which makes sense, since $p(1)$ is the sum of the probabilities.

We have
$$E[X] = \left.\frac{d}{dz}e^{\lambda(z-1)}\right|_{z=1} = \lambda,$$
and
$$E[X(X-1)] = \left.\frac{d^2}{dz^2}e^{\lambda(z-1)}\right|_{z=1} = \lambda^2.$$
So
$$\mathrm{var}(X) = E[X^2] - E[X]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$
Theorem. Suppose $X_1, X_2, \ldots, X_n$ are independent random variables with pgfs $p_1, p_2, \ldots, p_n$. Then the pgf of $X_1 + X_2 + \cdots + X_n$ is $p_1(z)p_2(z)\cdots p_n(z)$.

Proof.
$$E[z^{X_1+\cdots+X_n}] = E[z^{X_1}\cdots z^{X_n}] = E[z^{X_1}]\cdots E[z^{X_n}] = p_1(z)\cdots p_n(z).$$
Example. Let $X \sim B(n, p)$. Then
$$p(z) = \sum_{r=0}^n P(X = r)z^r = \sum\binom{n}{r}p^r(1-p)^{n-r}z^r = (pz + (1-p))^n = (pz + q)^n.$$
So $p(z)$ is the product of $n$ copies of $pz + q$. But $pz + q$ is the pgf of $Y \sim B(1, p)$. This shows that $X = Y_1 + Y_2 + \cdots + Y_n$ (which we already knew), i.e. a binomial distribution is the sum of Bernoulli trials.
Example. If $X$ and $Y$ are independent Poisson random variables with parameters $\lambda, \mu$ respectively, then
$$E[t^{X+Y}] = E[t^X]E[t^Y] = e^{\lambda(t-1)}e^{\mu(t-1)} = e^{(\lambda+\mu)(t-1)}.$$
So $X + Y \sim P(\lambda + \mu)$.

We can also do this directly:
$$P(X + Y = r) = \sum_{i=0}^r P(X = i, Y = r - i) = \sum_{i=0}^r P(X = i)P(Y = r - i),$$
but this is much more complicated.
We can use pgf-like functions to obtain some combinatorial results.
Example. Suppose we want to tile a $2 \times n$ bathroom with $2 \times 1$ tiles. One way to do it is: [figure: an example tiling]

We can do it recursively: suppose there are $f_n$ ways to tile a $2 \times n$ grid. Then if we start tiling, the first tile is either vertical, in which case we have $f_{n-1}$ ways to tile the remaining grid; or the first tiles are horizontal, in which case we have $f_{n-2}$ ways to tile the remaining grid. So
$$f_n = f_{n-1} + f_{n-2},$$
which is simply the Fibonacci recurrence, with $f_0 = f_1 = 1$.

Let
$$F(z) = \sum_{n=0}^\infty f_nz^n.$$
Then from our recurrence relation, we obtain
$$f_nz^n = f_{n-1}z^n + f_{n-2}z^n.$$
So
$$\sum_{n=2}^\infty f_nz^n = \sum_{n=2}^\infty f_{n-1}z^n + \sum_{n=2}^\infty f_{n-2}z^n.$$
Since $f_0 = f_1 = 1$, we have
$$F(z) - f_0 - zf_1 = z(F(z) - f_0) + z^2F(z).$$
Thus $F(z) = (1 - z - z^2)^{-1}$. If we write
$$\alpha_1 = \frac{1}{2}(1 + \sqrt{5}), \qquad \alpha_2 = \frac{1}{2}(1 - \sqrt{5}),$$
then we have
$$F(z) = (1 - z - z^2)^{-1} = \frac{1}{(1 - \alpha_1z)(1 - \alpha_2z)} = \frac{1}{\alpha_1 - \alpha_2}\left(\frac{\alpha_1}{1 - \alpha_1z} - \frac{\alpha_2}{1 - \alpha_2z}\right) = \frac{1}{\alpha_1 - \alpha_2}\left(\alpha_1\sum_{n=0}^\infty\alpha_1^nz^n - \alpha_2\sum_{n=0}^\infty\alpha_2^nz^n\right).$$
So
$$f_n = \frac{\alpha_1^{n+1} - \alpha_2^{n+1}}{\alpha_1 - \alpha_2}.$$
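We can check the closed form against the recurrence directly (an added Python sketch, not part of the notes):

```python
from math import sqrt

def tilings(n):
    """Number of ways to tile a 2 x n grid with 2 x 1 tiles (f_0 = f_1 = 1)."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

a1, a2 = (1 + sqrt(5)) / 2, (1 - sqrt(5)) / 2
for n in range(8):
    closed_form = (a1 ** (n + 1) - a2 ** (n + 1)) / (a1 - a2)
    print(n, tilings(n), round(closed_form))   # the two columns agree
```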
Example. A Dyck word is a string of brackets that match, such as () or ((())()). There is only one Dyck word of length 2, namely (). There are 2 of length 4, (()) and ()(). Similarly, there are 5 Dyck words of length 6.

Let $C_n$ be the number of Dyck words of length $2n$. We can split each Dyck word into $(w_1)w_2$, where $w_1$ and $w_2$ are Dyck words. Since the lengths of $w_1$ and $w_2$ must sum to $2(n-1)$,
$$C_{n+1} = \sum_{i=0}^n C_iC_{n-i}. \quad (*)$$
We again use pgf-like functions: let
$$c(x) = \sum_{n=0}^\infty C_nx^n.$$
From $(*)$, we can show that
$$c(x) = 1 + xc(x)^2.$$
We can solve this to show that
$$c(x) = \frac{1 - \sqrt{1 - 4x}}{2x} = \sum_{n=0}^\infty\binom{2n}{n}\frac{x^n}{n+1},$$
noting that $C_0 = 1$. Then
$$C_n = \frac{1}{n+1}\binom{2n}{n}.$$
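The recurrence $(*)$ and the closed form can be compared numerically (an illustrative Python sketch added here, not from the notes):

```python
from math import comb

def catalan_recursive(n_max):
    """Compute C_0, ..., C_{n_max} from the recurrence C_{n+1} = sum C_i C_{n-i}."""
    c = [1]
    for n in range(n_max):
        c.append(sum(c[i] * c[n - i] for i in range(n + 1)))
    return c

print(catalan_recursive(6))                              # [1, 1, 2, 5, 14, 42, 132]
print([comb(2 * n, n) // (n + 1) for n in range(7)])     # same values
```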
Sums with a random number of terms
A useful application of generating functions is the sum with a random number
of random terms. For example, an insurance company may receive a random
number of claims, each demanding a random amount of money. Then we have
a sum of a random number of terms. This can be answered using probability
generating functions.
Example. Let $X_1, X_2, \ldots$ be iid with pgf $p(z) = E[z^X]$. Let $N$ be a random variable independent of the $X_i$ with pgf $h(z)$. What is the pgf of $S = X_1 + \cdots + X_N$?
$$E[z^S] = E[z^{X_1+\cdots+X_N}] = E_N[\underbrace{E_{X_i}[z^{X_1+\cdots+X_N}\mid N]}_{\text{assuming fixed }N}]$$
$$= \sum_{n=0}^\infty P(N = n)E[z^{X_1+X_2+\cdots+X_n}] = \sum_{n=0}^\infty P(N = n)E[z^{X_1}]E[z^{X_2}]\cdots E[z^{X_n}]$$
$$= \sum_{n=0}^\infty P(N = n)(E[z^{X_1}])^n = \sum_{n=0}^\infty P(N = n)p(z)^n = h(p(z)),$$
since $h(x) = \sum_{n=0}^\infty P(N = n)x^n$.
So
E[S] =
d
dz
h(p(z))
z=1
= h
(p(1))p
(1)
= E[N]E[X
1
]
To calculate the variance, use the fact that
E[S(S 1)] =
d
2
dz
2
h(p(z))
z=1
.
Then we can find that
var(S) = E[N] var(X
1
) + E[X
2
1
] var(N).
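A quick simulation makes the formulas $E[S] = E[N]E[X_1]$ and $\operatorname{var}(S) = E[N]\operatorname{var}(X_1) + (E[X_1])^2\operatorname{var}(N)$ plausible. The sketch below is my own illustration with arbitrarily chosen parameters ($N$ Poisson with mean 3, the $X_i$ Poisson with mean 2), not something from the lectures.

```python
import math
import random

random.seed(0)

def poisson(lam):
    # Knuth's method: count uniform factors until the running product drops below e^{-lam}
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

lam_N, lam_X = 3.0, 2.0   # assumed example parameters
samples = []
for _ in range(100_000):
    n = poisson(lam_N)
    samples.append(sum(poisson(lam_X) for _ in range(n)))

mean_S = sum(samples) / len(samples)
var_S = sum((s - mean_S) ** 2 for s in samples) / len(samples)
# Theory: E[S] = 3 * 2 = 6,  var(S) = 3 * 2 + 2^2 * 3 = 18
print(mean_S, var_S)
```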
4 Interesting problems

Here we are going to study two rather important and interesting probabilistic processes: branching processes and random walks. Solutions to these will typically involve the use of probability generating functions.

4.1 Branching processes

Branching processes are used to model population growth by reproduction. At the beginning, there is only one individual. At each iteration, each individual produces a random number of offspring. In the next iteration, each offspring independently reproduces at random according to the same distribution. We will ask questions such as the expected number of individuals in a particular generation and the probability of going extinct.
Consider $X_0, X_1, \ldots$, where $X_n$ is the number of individuals in the $n$th generation. We assume the following:
(i) $X_0 = 1$.
(ii) Each individual lives for unit time and produces $k$ offspring with probability $p_k$.
(iii) All offspring behave independently, so that
$$X_{n+1} = Y_1^n + Y_2^n + \cdots + Y_{X_n}^n,$$
where the $Y_i^n$ are iid random variables with the same distribution as $X_1$ (the superscript denotes the generation).
It is useful to consider the pgf of a branching process. Let $F(z)$ be the pgf of $Y_i^n$. Then
$$F(z) = E[z^{Y_i^n}] = E[z^{X_1}] = \sum_{k=0}^{\infty} p_k z^k.$$
Define
$$F_n(z) = E[z^{X_n}].$$
The main theorem of branching processes here is

Theorem.
$$F_{n+1}(z) = F_n(F(z)) = F(F(F(\cdots F(z)\cdots))) = F(F_n(z)).$$
Proof.
$$\begin{aligned}
F_{n+1}(z) &= E[z^{X_{n+1}}] \\
&= E[E[z^{X_{n+1}} \mid X_n]] \\
&= \sum_{k=0}^{\infty} P(X_n = k)\, E[z^{X_{n+1}} \mid X_n = k] \\
&= \sum_{k=0}^{\infty} P(X_n = k)\, E[z^{Y_1^n + \cdots + Y_k^n} \mid X_n = k] \\
&= \sum_{k=0}^{\infty} P(X_n = k)\, E[z^{Y_1^n}]\cdots E[z^{Y_k^n}] \\
&= \sum_{k=0}^{\infty} P(X_n = k)\,(E[z^{X_1}])^k \\
&= \sum_{k=0}^{\infty} P(X_n = k)\, F(z)^k \\
&= F_n(F(z)).
\end{aligned}$$
Theorem. Suppose
$$E[X_1] = \sum_k k p_k = \mu \quad \text{and} \quad \operatorname{var}(X_1) = E[(X_1 - \mu)^2] = \sum_k (k - \mu)^2 p_k = \sigma^2 < \infty.$$
Then
$$E[X_n] = \mu^n, \qquad \operatorname{var}(X_n) = \sigma^2\mu^{n-1}(1 + \mu + \mu^2 + \cdots + \mu^{n-1}).$$
Proof.
$$E[X_n] = E[E[X_n \mid X_{n-1}]] = E[\mu X_{n-1}] = \mu E[X_{n-1}].$$
Then by induction, $E[X_n] = \mu^n$ (since $X_0 = 1$).
To calculate the variance, note that
$$\operatorname{var}(X_n) = E[X_n^2] - (E[X_n])^2,$$
and hence
$$E[X_n^2] = \operatorname{var}(X_n) + (E[X_n])^2.$$
We then calculate
$$\begin{aligned}
E[X_n^2] &= E[E[X_n^2 \mid X_{n-1}]] \\
&= E[\operatorname{var}(X_n \mid X_{n-1}) + (E[X_n \mid X_{n-1}])^2] \\
&= E[X_{n-1}\operatorname{var}(X_1) + (\mu X_{n-1})^2] \\
&= E[X_{n-1}\sigma^2 + (\mu X_{n-1})^2] \\
&= \sigma^2\mu^{n-1} + \mu^2 E[X_{n-1}^2].
\end{aligned}$$
So
$$\begin{aligned}
\operatorname{var}(X_n) &= E[X_n^2] - (E[X_n])^2 \\
&= \mu^2 E[X_{n-1}^2] + \sigma^2\mu^{n-1} - \mu^2(E[X_{n-1}])^2 \\
&= \mu^2\big(E[X_{n-1}^2] - (E[X_{n-1}])^2\big) + \sigma^2\mu^{n-1} \\
&= \mu^2\operatorname{var}(X_{n-1}) + \sigma^2\mu^{n-1} \\
&= \mu^4\operatorname{var}(X_{n-2}) + \sigma^2(\mu^{n-1} + \mu^n) \\
&= \cdots \\
&= \mu^{2(n-1)}\operatorname{var}(X_1) + \sigma^2(\mu^{n-1} + \mu^n + \cdots + \mu^{2n-3}) \\
&= \sigma^2\mu^{n-1}(1 + \mu + \cdots + \mu^{n-1}).
\end{aligned}$$
Of course, we could also obtain this using the probability generating function.
Extinction probability

Let $A_n$ be the event $X_n = 0$, i.e. extinction has occurred by the $n$th generation. Let $q$ be the probability that extinction eventually occurs. Let
$$A = \bigcup_{n=1}^{\infty} A_n = [\text{extinction eventually occurs}].$$
Since $A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots$, we know that
$$q = P(A) = \lim_{n \to \infty} P(A_n) = \lim_{n \to \infty} P(X_n = 0).$$
But
$$P(X_n = 0) = F_n(0),$$
since $F_n(z) = \sum_k P(X_n = k)z^k$. So, by continuity of $F$,
$$F(q) = F\Big(\lim_{n \to \infty} F_n(0)\Big) = \lim_{n \to \infty} F(F_n(0)) = \lim_{n \to \infty} F_{n+1}(0) = q.$$
So
$$F(q) = q.$$
Alternatively, using the law of total probability,
$$q = \sum_k P(X_1 = k)\,P(\text{extinction} \mid X_1 = k) = \sum_k p_k q^k = F(q),$$
where the second equality comes from the fact that for the whole population to go extinct, the line of descent of each of the $k$ first-generation offspring must independently go extinct.
This means that to find the probability of extinction, we need to find a fixed point of $F$. However, if there are many fixed points, which should we pick?

Theorem. The probability of extinction $q$ is the smallest non-negative root of the equation $q = F(q)$. Write $\mu = E[X_1]$. Then if $\mu \leq 1$, then $q = 1$; if $\mu > 1$, then $q < 1$.

Proof. To show that it is the smallest root, let $\alpha$ be the smallest non-negative root. Then note that $0 \leq \alpha$, so $F(0) \leq F(\alpha) = \alpha$, since $F$ is increasing (proof: write the function out!). Hence $F(F(0)) \leq F(\alpha) = \alpha$. Continuing inductively, $F_n(0) \leq \alpha$ for all $n$. So
$$q = \lim_{n \to \infty} F_n(0) \leq \alpha.$$
Since $q$ is itself a root, $q = \alpha$.
To show that $q = 1$ when $\mu \leq 1$, we show that $q = 1$ is the only root in $[0, 1]$. We know that $F'(z), F''(z) \geq 0$ for $z \in (0, 1)$ (proof: write it out again!). So $F$ is increasing and convex. Since $F'(1) = \mu \leq 1$, the graph of $F$ must approach the point $(1, 1)$ from above the line $F(z) = z$, staying above the diagonal on $(0, 1)$. So $z = 1$ is the only root in $[0, 1]$.
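To illustrate the theorem, here is a rough simulation sketch (my own, not from the lectures): for an offspring distribution with $p_0 = 0.25$, $p_1 = 0.25$, $p_2 = 0.5$ (so $\mu = 1.25 > 1$) we estimate the extinction probability by simulation and compare it with the smallest root of $F(q) = q$, found by iterating $q \mapsto F(q)$ from 0, exactly as in the proof.

```python
import random
random.seed(1)

probs = [0.25, 0.25, 0.5]   # p_0, p_1, p_2; mu = 1.25 > 1, smallest root is 0.5

def offspring():
    u, acc = random.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1

def goes_extinct(max_generations=200, cap=1_000):
    x = 1
    for _ in range(max_generations):
        if x == 0:
            return True
        if x > cap:        # essentially certain to survive once the population is large
            return False
        x = sum(offspring() for _ in range(x))
    return x == 0

trials = 2_000
q_sim = sum(goes_extinct() for _ in range(trials)) / trials

# Smallest non-negative root of F(q) = q, via q = lim F_n(0)
F = lambda q: sum(p * q ** k for k, p in enumerate(probs))
q = 0.0
for _ in range(1000):
    q = F(q)
print(q_sim, q)   # both close to 0.5
```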
4.2 Random walk and gambler's ruin

Here we'll study random walks, using gambler's ruin as an example.

Definition (Random walk). Let $X_1, \ldots, X_n$ be iid random variables such that $X_n = +1$ with probability $p$, and $-1$ with probability $1 - p$. Let $S_n = S_0 + X_1 + \cdots + X_n$. Then $(S_0, S_1, \ldots, S_n)$ is a 1-dimensional random walk.
If $p = q = \frac{1}{2}$, we say it is a symmetric random walk.
Example. A gambler starts with $\$z$, with $z < a$, and plays a game in which he wins $\$1$ or loses $\$1$ at each turn with probabilities $p$ and $q$ respectively. What are
$$p_z = P(\text{random walk hits } a \text{ before } 0 \mid \text{starts at } z)$$
and
$$q_z = P(\text{random walk hits } 0 \text{ before } a \mid \text{starts at } z)?$$
He either wins his first game, with probability $p$, or loses it, with probability $q$. So
$$p_z = q p_{z-1} + p p_{z+1}$$
for $0 < z < a$, with $p_0 = 0$, $p_a = 1$.
Try $p_z = t^z$. Then
$$pt^2 - t + q = (pt - q)(t - 1) = 0,$$
noting that $p = 1 - q$. If $p \neq q$, then
$$p_z = A\cdot 1^z + B\left(\frac{q}{p}\right)^z.$$
Since $p_0 = 0$, we get $A = -B$. Since $p_a = 1$, we obtain
$$p_z = \frac{1 - (q/p)^z}{1 - (q/p)^a}.$$
If $p = q$, then $p_z = A + Bz = z/a$.
Similarly (or by performing the substitutions $p \mapsto q$, $q \mapsto p$ and $z \mapsto a - z$),
$$q_z = \frac{(q/p)^a - (q/p)^z}{(q/p)^a - 1}$$
if $p \neq q$, and
$$q_z = \frac{a - z}{a}$$
if $p = q$. Since $p_z + q_z = 1$, we know that the game will eventually end.
What if $a \to \infty$? What is the probability of going bankrupt?
$$P(\text{path hits 0 ever}) = P\left(\bigcup_{a=z+1}^{\infty}[\text{path hits 0 before } a]\right) = \lim_{a \to \infty} P(\text{path hits 0 before } a) = \lim_{a \to \infty} q_z = \begin{cases} (q/p)^z & p > q \\ 1 & p \leq q.\end{cases}$$
So if the odds are against you (i.e. the probability of losing is greater than the probability of winning), then no matter how small the difference is, you are bound to go bankrupt eventually.
Duration of the game

Let $D_z$ be the expected time until the random walk hits 0 or $a$, starting from $z$. We first show that this is bounded. We know that there is one simple way to hit 0 or $a$: get $+1$ or $-1$ for $a$ turns in a row. In each block of $a$ consecutive steps this happens with probability at least $p^a + q^a$, so the number of blocks needed is at most geometric, giving
$$D_z \leq \frac{a}{p^a + q^a}.$$
This is a very crude bound, but it is sufficient to show that $D_z$ is finite, and we can meaningfully apply formulas to this finite quantity.
We have
$$\begin{aligned}
D_z &= E[\text{duration}] \\
&= E[E[\text{duration} \mid X_1]] \\
&= p\,E[\text{duration} \mid X_1 = 1] + q\,E[\text{duration} \mid X_1 = -1] \\
&= p(1 + D_{z+1}) + q(1 + D_{z-1}).
\end{aligned}$$
So
$$D_z = 1 + pD_{z+1} + qD_{z-1},$$
subject to $D_0 = D_a = 0$.
We first find a particular solution by trying $D_z = Cz$. Then
$$Cz = 1 + pC(z+1) + qC(z-1),$$
so
$$C = \frac{1}{q - p}$$
for $p \neq q$. Then we find the complementary solution. Trying $D_z = t^z$ gives
$$pt^2 - t + q = 0,$$
which has roots $1$ and $q/p$. So the general solution is
$$D_z = A + B\left(\frac{q}{p}\right)^z + \frac{z}{q - p}.$$
Putting in the boundary conditions,
$$D_z = \frac{z}{q - p} - \frac{a}{q - p}\cdot\frac{1 - (q/p)^z}{1 - (q/p)^a}.$$
If $p = q$, then to find a particular solution we have to try $D_z = Cz^2$. Then we find $C = -1$, so
$$D_z = -z^2 + A + Bz,$$
and the boundary conditions give
$$D_z = z(a - z).$$
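The formulas for $p_z$ and $D_z$ are easy to check by simulation. The following sketch is my own illustration (parameters chosen arbitrarily); it estimates the probability of hitting $a$ before 0 and the expected duration, and compares them with the formulas above.

```python
import random
random.seed(2)

p, a, z0 = 0.45, 10, 5
q = 1 - p

def play(z):
    steps = 0
    while 0 < z < a:
        z += 1 if random.random() < p else -1
        steps += 1
    return z == a, steps

trials = 50_000
wins, total_steps = 0, 0
for _ in range(trials):
    won, steps = play(z0)
    wins += won
    total_steps += steps

# Formulas for the p != q case
r = q / p
p_z = (1 - r ** z0) / (1 - r ** a)
D_z = z0 / (q - p) - (a / (q - p)) * (1 - r ** z0) / (1 - r ** a)
print(wins / trials, p_z)          # both ~ probability of hitting a before 0
print(total_steps / trials, D_z)   # both ~ expected duration
```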
Using generating functions

We can also use generating functions to solve the problem. Let
$$U_{z,n} = P(\text{random walk is absorbed at 0 at time } n \mid \text{starts at } z).$$
We have the following boundary conditions: $U_{0,0} = 1$; $U_{z,0} = 0$ for $0 < z \leq a$; and $U_{0,n} = U_{a,n} = 0$ for $n > 0$.
We define a pgf-like function
$$U_z(s) = \sum_{n=0}^{\infty} U_{z,n} s^n.$$
We know that
$$U_{z,n+1} = pU_{z+1,n} + qU_{z-1,n}.$$
Multiplying by $s^{n+1}$ and summing over $n = 0, 1, \ldots$ gives
$$U_z(s) = psU_{z+1}(s) + qsU_{z-1}(s).$$
We try $U_z(s) = \lambda(s)^z$. Then
$$\lambda(s) = ps\lambda(s)^2 + qs,$$
so
$$\lambda_1(s), \lambda_2(s) = \frac{1 \pm \sqrt{1 - 4pqs^2}}{2ps}.$$
So
$$U_z(s) = A(s)\lambda_1(s)^z + B(s)\lambda_2(s)^z.$$
Since $U_0(s) = 1$ and $U_a(s) = 0$, we know that
$$A(s) + B(s) = 1 \quad \text{and} \quad A(s)\lambda_1(s)^a + B(s)\lambda_2(s)^a = 0.$$
Then we find that
$$U_z(s) = \frac{\lambda_1(s)^a\lambda_2(s)^z - \lambda_2(s)^a\lambda_1(s)^z}{\lambda_1(s)^a - \lambda_2(s)^a}.$$
Since $\lambda_1(s)\lambda_2(s) = \frac{q}{p}$, we can "simplify" this to obtain
$$U_z(s) = \left(\frac{q}{p}\right)^z\cdot\frac{\lambda_1(s)^{a-z} - \lambda_2(s)^{a-z}}{\lambda_1(s)^a - \lambda_2(s)^a}.$$
We see that $U_z(1) = q_z$. We can apply the same method to find the generating function for absorption at $a$, say $V_z(s)$. Then the generating function for the duration is $U_z + V_z$, and hence the expected duration is $D_z = U_z'(1) + V_z'(1)$.
5 Continuous random variables

5.1 Continuous random variables

So far, we have only looked at the case where the outcomes are discrete. Consider an experiment where we throw a needle randomly onto the ground and record the angle it makes with a fixed horizontal. Then our sample space is $\Omega = \{\omega \in \mathbb{R} : 0 \leq \omega < 2\pi\}$, and
$$P(\omega \in [0, \theta]) = \frac{\theta}{2\pi}, \quad 0 \leq \theta \leq 2\pi.$$
With continuous distributions, we can no longer talk about the probability of getting a particular number, since this is always zero. For example, we will almost never get an angle of exactly 0.42 radians.
Instead, we can only meaningfully talk about the probability of $X$ falling into a particular range. To capture the distribution of $X$, we want to define a function $f$ such that for each $x$ and small $\delta x$, the probability of $X$ lying in $[x, x + \delta x]$ is given by $f(x)\,\delta x + o(\delta x)$. This $f$ is known as the probability density function. Integrating, the probability of $X \in [a, b]$ is $\int_a^b f(x)\,\mathrm{d}x$. We take this as the definition of $f$.

Definition (Continuous random variable). A random variable $X : \Omega \to \mathbb{R}$ is continuous if there is a function $f : \mathbb{R} \to \mathbb{R}_{\geq 0}$ such that
$$P(a \leq X \leq b) = \int_a^b f(x)\,\mathrm{d}x.$$
We call $f$ the probability density function, which satisfies $f \geq 0$ and $\int_{-\infty}^{\infty} f(x)\,\mathrm{d}x = 1$.

Note that $P(X = a) = 0$, since it is $\int_a^a f(x)\,\mathrm{d}x$. Then we also have
$$P\left(\bigcup_{a \in \mathbb{Q}}[X = a]\right) = 0,$$
since it is a countable union of probability-0 events (and axiom 3 states that the probability of a countable disjoint union is the sum of the probabilities, i.e. 0).
Definition (Cumulative distribution function). The cumulative distribution function (or simply distribution function) of a random variable $X$ (discrete, continuous, or neither) is
$$F(x) = P(X \leq x).$$
We can see that $F(x)$ is non-decreasing and $F(x) \to 1$ as $x \to \infty$.
In the case of continuous random variables, we have
$$F(x) = \int_{-\infty}^{x} f(z)\,\mathrm{d}z.$$
Then $F$ is continuous and differentiable. In general, $F'(x) = f(x)$ whenever $F$ is differentiable.
The name of continuous random variables comes from the fact that $F(x)$ is (absolutely) continuous.
The cdf of a continuous random variable is a continuous curve; the cdf of a discrete random variable is a step function; and the cdf of a random variable that is neither discrete nor continuous mixes jumps with stretches of continuous increase.
Note that we always have
$$P(a < X \leq b) = F(b) - F(a),$$
which equals $\int_a^b f(x)\,\mathrm{d}x$ in the case of continuous random variables.
Definition (Uniform distribution). The uniform distribution on $[a, b]$ has pdf
$$f(x) = \frac{1}{b - a}$$
for $a \leq x \leq b$. Then
$$F(x) = \int_a^x f(z)\,\mathrm{d}z = \frac{x - a}{b - a}$$
for $a \leq x \leq b$. If $X$ follows a uniform distribution on $[a, b]$, we write $X \sim U[a, b]$.

Definition (Exponential random variable). The exponential random variable with parameter $\lambda$ has pdf
$$f(x) = \lambda e^{-\lambda x}$$
and
$$F(x) = 1 - e^{-\lambda x}$$
for $x \geq 0$. We write $X \sim \mathcal{E}(\lambda)$.
Then
$$P(a \leq X \leq b) = \int_a^b f(z)\,\mathrm{d}z = e^{-\lambda a} - e^{-\lambda b}.$$
Proposition. The exponential random variable is memoryless, i.e.
$$P(X \geq x + z \mid X \geq x) = P(X \geq z).$$
This means that, say, if $X$ measures the lifetime of a light bulb, knowing it has already lasted for 3 hours gives no information about how much longer it will last.
Recall that the geometric random variable is the discrete memoryless random variable.

Proof.
$$P(X \geq x + z \mid X \geq x) = \frac{P(X \geq x + z)}{P(X \geq x)} = \frac{\int_{x+z}^{\infty} f(u)\,\mathrm{d}u}{\int_x^{\infty} f(u)\,\mathrm{d}u} = \frac{e^{-\lambda(x+z)}}{e^{-\lambda x}} = e^{-\lambda z} = P(X \geq z).$$

Similarly, given that you have lived for $x$ days, what is the probability of dying within the next $\delta x$ days?
$$P(X \leq x + \delta x \mid X \geq x) = \frac{P(x \leq X \leq x + \delta x)}{P(X \geq x)} \approx \frac{\lambda e^{-\lambda x}\,\delta x}{e^{-\lambda x}} = \lambda\,\delta x.$$
So it is independent of how old you currently are, assuming your survival follows an exponential distribution.
In general, we can define the hazard rate to be
$$h(x) = \frac{f(x)}{1 - F(x)}.$$
Then
$$P(x \leq X \leq x + \delta x \mid X \geq x) = \frac{P(x \leq X \leq x + \delta x)}{P(X \geq x)} \approx \frac{\delta x\, f(x)}{1 - F(x)} = \delta x\cdot h(x).$$
In the case of the exponential distribution, $h(x)$ is constant and equal to $\lambda$.
Similar to discrete variables, we can define the expected value and the variance. However, we will (obviously) have to replace the sum with an integral.

Definition (Expectation). The expectation (or mean) of a continuous random variable is
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x,$$
provided that not both $\int_0^{\infty} x f(x)\,\mathrm{d}x$ and $\int_{-\infty}^{0} x f(x)\,\mathrm{d}x$ are infinite.

Theorem. If $X$ is a continuous random variable, then
$$E[X] = \int_0^{\infty} P(X \geq x)\,\mathrm{d}x - \int_0^{\infty} P(X \leq -x)\,\mathrm{d}x.$$
This result is more intuitive in the discrete case:
$$\sum_{x=0}^{\infty} x\,P(X = x) = \sum_{x=0}^{\infty}\sum_{y=x+1}^{\infty} P(X = y) = \sum_{x=0}^{\infty} P(X > x),$$
where the first equality holds because on both sides we have $x$ copies of $P(X = x)$ in the sum.

Proof.
$$\begin{aligned}
\int_0^{\infty} P(X \geq x)\,\mathrm{d}x &= \int_0^{\infty}\int_x^{\infty} f(y)\,\mathrm{d}y\,\mathrm{d}x \\
&= \int_0^{\infty}\int_0^{\infty} I[y \geq x]\,f(y)\,\mathrm{d}y\,\mathrm{d}x \\
&= \int_0^{\infty}\left(\int_0^{\infty} I[x \leq y]\,\mathrm{d}x\right)f(y)\,\mathrm{d}y \\
&= \int_0^{\infty} y f(y)\,\mathrm{d}y.
\end{aligned}$$
We can similarly show that $\int_0^{\infty} P(X \leq -x)\,\mathrm{d}x = -\int_{-\infty}^{0} y f(y)\,\mathrm{d}y$.

Example. Suppose $X \sim \mathcal{E}(\lambda)$. Then
$$P(X \geq x) = \int_x^{\infty} \lambda e^{-\lambda t}\,\mathrm{d}t = e^{-\lambda x}.$$
So
$$E[X] = \int_0^{\infty} e^{-\lambda x}\,\mathrm{d}x = \frac{1}{\lambda}.$$
Definition (Variance). The variance of a continuous random variable is
$$\operatorname{var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.$$
So we have
$$\operatorname{var}(X) = \int_{-\infty}^{\infty} x^2 f(x)\,\mathrm{d}x - \left(\int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x\right)^2.$$

Example. Let $X \sim U[a, b]$. Then
$$E[X] = \int_a^b x\,\frac{1}{b - a}\,\mathrm{d}x = \frac{a + b}{2}.$$
So
$$\operatorname{var}(X) = \int_a^b x^2\,\frac{1}{b - a}\,\mathrm{d}x - (E[X])^2 = \frac{1}{12}(b - a)^2.$$
Apart from the mean, or expected value, we can also consider other notions of "average value".

Definition (Mode and median). Given a pdf $f(x)$, we call $\hat{x}$ a mode if
$$f(\hat{x}) \geq f(x)$$
for all $x$. Note that a distribution can have many modes. For example, in the uniform distribution, all $x$ are modes.
We call $\hat{x}$ a median if
$$\int_{-\infty}^{\hat{x}} f(x)\,\mathrm{d}x = \frac{1}{2} = \int_{\hat{x}}^{\infty} f(x)\,\mathrm{d}x.$$
For a discrete random variable, the median is an $\hat{x}$ such that
$$P(X \geq \hat{x}) \geq \frac{1}{2}, \qquad P(X \leq \hat{x}) \geq \frac{1}{2}.$$
Here we have non-strict inequalities, since if the random variable, say, always takes the value 0, then both probabilities will be 1.

Suppose that we have an experiment whose outcome follows the distribution of $X$. We can perform the experiment many times and obtain many results $X_1, \ldots, X_n$. The average of these results is known as the sample mean.

Definition (Sample mean). If $X_1, \ldots, X_n$ is a random sample from some distribution, then the sample mean is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
5.2 Stochastic ordering and inspection paradox

We want to define a (partial) order on random variables. For example, we might want to say that $X + 2 > X$, where $X$ is a random variable. A simple definition might be expectation ordering, where $X \leq Y$ if $E[X] \leq E[Y]$. This, however, is not satisfactory, since $X \leq Y$ and $Y \leq X$ would not imply $X = Y$. Instead, we can use the stochastic order.

Definition (Stochastic order). The stochastic order is defined by: $X \leq_{\mathrm{st}} Y$ if $P(X > t) \leq P(Y > t)$ for all $t$.

This is clearly transitive. Stochastic ordering implies expectation ordering, since
$$X \leq_{\mathrm{st}} Y \Rightarrow \int_0^{\infty} P(X > x)\,\mathrm{d}x \leq \int_0^{\infty} P(Y > x)\,\mathrm{d}x \Rightarrow E[X] \leq E[Y].$$
Alternatively, we can also order random variables by hazard rate.
Example (Inspection paradox). Suppose that $n$ families have children attending a school. Family $i$ has $X_i$ children at the school, where $X_1, \ldots, X_n$ are iid random variables with $P(X_i = k) = p_k$. Suppose that the average family size is $\mu$.
Now pick a child at random. What is the probability distribution of her family size? Let $J$ be the index of the family from which she comes (which is a random variable). Then
$$P(X_J = k \mid J = j) = \frac{P(J = j, X_j = k)}{P(J = j)}.$$
The denominator is $1/n$. The numerator is more complex: we need the $j$th family to have $k$ members, which happens with probability $p_k$, and we need to have picked a member of the $j$th family, which happens with probability $E\!\left[\frac{k}{k + \sum_{i \neq j} X_i}\right]$. So
$$P(X_J = k \mid J = j) = E\left[\frac{nkp_k}{k + \sum_{i \neq j} X_i}\right].$$
Note that this is independent of $j$. So
$$P(X_J = k) = E\left[\frac{nkp_k}{k + \sum_{i \neq j} X_i}\right].$$
Also, $P(X_1 = k) = p_k$. So
$$\frac{P(X_J = k)}{P(X_1 = k)} = E\left[\frac{nk}{k + \sum_{i \neq j} X_i}\right].$$
This is increasing in $k$, and greater than 1 for $k > \mu$. So the average family size of the child we picked is greater than the average family size. It can in fact be shown that $X_J$ is stochastically greater than $X_1$.
This means that if we pick children at random, the sample mean of their family sizes will be greater than the true mean family size. This is because the larger a family is, the more likely it is that we pick a child from that family.
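The inspection paradox is easy to see in a simulation. The sketch below is my own illustration with an arbitrary family-size distribution; it compares the mean family size with the mean family size of a randomly chosen child.

```python
import random
random.seed(3)

# Family sizes take values 1, 2, 3, 4 with these probabilities (arbitrary example), mean mu = 2.0
sizes = [1, 2, 3, 4]
probs = [0.4, 0.3, 0.2, 0.1]

n_families, reps = 1000, 500
mean_by_family, mean_by_child = [], []
for _ in range(reps):
    families = random.choices(sizes, probs, k=n_families)
    mean_by_family.append(sum(families) / n_families)
    # picking a child uniformly at random picks family j with probability proportional to X_j
    children = [x for x in families for _ in range(x)]
    mean_by_child.append(sum(random.choices(children, k=1000)) / 1000)

print(sum(mean_by_family) / reps)   # about 2.0
print(sum(mean_by_child) / reps)    # noticeably larger (about 2.5 for this distribution)
```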
5.3 Jointly distributed random variables

Definition (Joint distribution). Two random variables $X, Y$ have joint distribution $F : \mathbb{R}^2 \to [0, 1]$ defined by
$$F(x, y) = P(X \leq x, Y \leq y).$$
The marginal distribution of $X$ is
$$F_X(x) = P(X \leq x) = P(X \leq x, Y < \infty) = F(x, \infty) = \lim_{y \to \infty} F(x, y).$$

Definition (Jointly distributed random variables). We say $X_1, \ldots, X_n$ are jointly distributed continuous random variables with joint pdf $f$ if for any set $A \subseteq \mathbb{R}^n$,
$$P((X_1, \ldots, X_n) \in A) = \int_{(x_1, \ldots, x_n) \in A} f(x_1, \ldots, x_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n,$$
where
$$f(x_1, \ldots, x_n) \geq 0 \quad \text{and} \quad \int_{\mathbb{R}^n} f(x_1, \ldots, x_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n = 1.$$
Example. In the case where $n = 2$,
$$F(x, y) = P(X \leq x, Y \leq y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(u, v)\,\mathrm{d}v\,\mathrm{d}u.$$
If $F$ is differentiable, then
$$f(x, y) = \frac{\partial^2}{\partial x\,\partial y}F(x, y).$$

Theorem. If $X$ and $Y$ are jointly continuous random variables, then they are individually continuous random variables.

Proof. We prove this by showing that $X$ has a density function. We know that
$$P(X \in A) = P(X \in A,\ Y \in (-\infty, +\infty)) = \int_{x \in A}\int_{-\infty}^{\infty} f(x, y)\,\mathrm{d}y\,\mathrm{d}x = \int_{x \in A} f_X(x)\,\mathrm{d}x,$$
so
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,\mathrm{d}y$$
is the (marginal) pdf of $X$.
Definition (Independent continuous random variables). Continuous random variables $X_1, \ldots, X_n$ are independent if
$$P(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = P(X_1 \in A_1)P(X_2 \in A_2)\cdots P(X_n \in A_n)$$
for all $A_i \subseteq \Omega_{X_i}$.
If we let $F_{X_i}$ and $f_{X_i}$ be the cdf and pdf of $X_i$, then
$$F(x_1, \ldots, x_n) = F_{X_1}(x_1)\cdots F_{X_n}(x_n)$$
and
$$f(x_1, \ldots, x_n) = f_{X_1}(x_1)\cdots f_{X_n}(x_n)$$
are each individually equivalent to the definition above.
To show that two (or more) random variables are independent, we only have to factorize the joint pdf into factors that each involve only one variable.
Example. If $(X_1, X_2)$ takes a random value from $[0, 1] \times [0, 1]$, then $f(x_1, x_2) = 1$. Then we can see that $f(x_1, x_2) = 1\cdot 1 = f(x_1)\cdot f(x_2)$. So $X_1$ and $X_2$ are independent.
On the other hand, if $(Y_1, Y_2)$ takes a random value from $[0, 1] \times [0, 1]$ with the restriction that $Y_1 \leq Y_2$, then they are not independent, since $f(y_1, y_2) = 2I[y_1 \leq y_2]$, which cannot be split into a product of a factor in $y_1$ alone and a factor in $y_2$ alone.
Proposition. For independent continuous random variables $X_i$,
(i) $E[\prod X_i] = \prod E[X_i]$,
(ii) $\operatorname{var}(\sum X_i) = \sum \operatorname{var}(X_i)$.
5.4 Geometric probability

Often, when doing probability problems that involve geometry, we can visualize the outcomes with the aid of a picture.

Example. Two points $X$ and $Y$ are chosen independently and uniformly at random on a line segment of length $L$. What is the probability that $|X - Y| \leq \ell$? By "at random", we mean
$$f(x, y) = \frac{1}{L^2},$$
since each of $X$ and $Y$ has pdf $1/L$.
We can visualize this on a graph whose two axes are the values of $X$ and $Y$, with the permitted region $A = \{|x - y| \leq \ell\}$ forming a diagonal band across the $L \times L$ square. The part of the square outside $A$ consists of two triangles of total area $(L - \ell)^2$, so the area of $A$ is $L^2 - (L - \ell)^2 = 2L\ell - \ell^2$. So the desired probability is
$$\int_A f(x, y)\,\mathrm{d}x\,\mathrm{d}y = \frac{2L\ell - \ell^2}{L^2}.$$
Example (Bertrand's paradox). Suppose we draw a random chord in a circle. What is the probability that the length of the chord is greater than the length of the side of an inscribed equilateral triangle?
There are three ways of "drawing a random chord":
(i) We randomly pick two end points on the circumference independently. Now draw the inscribed triangle with a vertex at one end point. For the chord to be longer than a side of the triangle, the other end point must lie between the two other vertices of the triangle. This happens with probability 1/3.
(ii) wlog the chord is horizontal, on the lower half of the circle. Its mid-point is uniformly distributed along the vertical radius, and the chord is long if and only if the mid-point lies within half a radius of the centre. So the probability of getting a long chord is 1/2.
(iii) The mid-point of the chord is distributed uniformly across the disc. Then we get a long chord if and only if the mid-point lies inside the concentric circle of half the radius, which occurs with probability 1/4.
We get different answers for different notions of "random"! This is why when we say "randomly", we should be explicit about what we mean.
Example (Buffon's needle). A needle of length $\ell$ is tossed at random onto a floor marked with parallel lines a distance $L$ apart, where $\ell \leq L$. Let $A$ be the event that the needle intersects a line. What is $P(A)$?
Let $X$ be the perpendicular distance from one end of the needle to the line above it, and $\theta$ the angle the needle makes with the lines. Suppose that $X \sim U[0, L]$ and $\theta \sim U[0, \pi]$, independently, so that
$$f(x, \theta) = \frac{1}{L\pi}.$$
The needle touches a line iff $X \leq \ell\sin\theta$. Hence
$$P(A) = \int_{\theta=0}^{\pi}\underbrace{\frac{\ell\sin\theta}{L}}_{P(X \leq \ell\sin\theta)}\frac{1}{\pi}\,\mathrm{d}\theta = \frac{2\ell}{\pi L}.$$
Since the answer involves $\pi$, we can estimate $\pi$ by conducting repeated experiments! Suppose we have $N$ hits out of $n$ tosses. Then an estimator for $p$ is $\hat{p} = \frac{N}{n}$, and hence
$$\hat{\pi} = \frac{2\ell}{\hat{p}L}.$$
We will later find out that this is a really inefficient way of estimating $\pi$ (as you could have guessed).
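Here is a minimal Monte Carlo sketch of this estimator (my own illustration, taking $\ell = L = 1$); it simply mirrors the calculation above.

```python
import math
import random
random.seed(4)

def estimate_pi(n, ell=1.0, L=1.0):
    hits = 0
    for _ in range(n):
        x = random.uniform(0, L)          # perpendicular distance from one end to the next line
        theta = random.uniform(0, math.pi)
        if x <= ell * math.sin(theta):    # needle crosses the line
            hits += 1
    p_hat = hits / n
    return 2 * ell / (p_hat * L)

print(estimate_pi(1_000_000))   # roughly 3.14, but converging slowly
```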
5.5 The normal distribution

Definition (Normal distribution). The normal distribution with parameters $\mu, \sigma^2$, written $N(\mu, \sigma^2)$, has pdf
$$f(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
for $-\infty < x < \infty$. Its graph is the familiar bell-shaped curve centred at $\mu$.
The standard normal is when $\mu = 0$, $\sigma^2 = 1$, i.e. $X \sim N(0, 1)$. We usually write $\phi(x)$ for the pdf and $\Phi(x)$ for the cdf of the standard normal.
This is a rather important probability distribution. This is partly due to the central limit theorem, which says that if we have a large number of iid random variables, then the distribution of their average is approximately normal. Many distributions in physics and other sciences are also approximately or exactly normal.
We first have to show that this makes sense, i.e.

Proposition.
$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,\mathrm{d}x = 1.$$

Proof. Substitute $z = \frac{x - \mu}{\sigma}$. Then
$$I = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}\,\mathrm{d}z.$$
Then
$$I^2 = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,\mathrm{d}x\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}} e^{-y^2/2}\,\mathrm{d}y = \int_0^{\infty}\int_0^{2\pi}\frac{1}{2\pi} e^{-r^2/2}\, r\,\mathrm{d}\theta\,\mathrm{d}r = 1,$$
converting to polar coordinates. Since $I > 0$, we have $I = 1$.
We also have

Proposition. $E[X] = \mu$.

Proof.
$$E[X] = \frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty} x e^{-(x-\mu)^2/2\sigma^2}\,\mathrm{d}x = \frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}(x - \mu)e^{-(x-\mu)^2/2\sigma^2}\,\mathrm{d}x + \frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}\mu e^{-(x-\mu)^2/2\sigma^2}\,\mathrm{d}x.$$
The first integrand is antisymmetric about $\mu$, so the first term vanishes. The second is just $\mu$ times the integral we did above. So we get $\mu$.

Also, by symmetry, the mode and median of a normal distribution are both $\mu$.
Proposition. $\operatorname{var}(X) = \sigma^2$.

Proof. We have $\operatorname{var}(X) = E[X^2] - (E[X])^2$. Substitute $Z = \frac{X - \mu}{\sigma}$. Then $E[Z] = 0$ and $\operatorname{var}(X) = \sigma^2\operatorname{var}(Z) = \sigma^2 E[Z^2]$. Now
$$\operatorname{var}(Z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,\mathrm{d}z = \left[-\frac{1}{\sqrt{2\pi}} z e^{-z^2/2}\right]_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,\mathrm{d}z = 0 + 1 = 1.$$
So $\operatorname{var}(X) = \sigma^2$.
Example. UK adult male heights are normally distributed with mean 70" and standard deviation 3". In the Netherlands, these figures are 71" and 3".
What is $P(Y > X)$, where $X$ and $Y$ are the heights of randomly chosen UK and Netherlands males, respectively?
We have $X \sim N(70, 3^2)$ and $Y \sim N(71, 3^2)$. Then (as we will show in later lectures) $Y - X \sim N(1, 18)$, so
$$P(Y > X) = P(Y - X > 0) = P\left(\frac{(Y - X) - 1}{\sqrt{18}} > \frac{-1}{\sqrt{18}}\right) = 1 - \Phi(-1/\sqrt{18}) = \Phi(1/\sqrt{18}) \approx 0.5931,$$
since $\frac{(Y - X) - 1}{\sqrt{18}} \sim N(0, 1)$.
Now suppose that in both countries, the Olympic male basketball teams are selected from the portion of males whose height is at least 4" above the mean (which corresponds to the 9.1% tallest males of the country). What is the probability that a randomly chosen Netherlands player is taller than a randomly chosen UK player?
For the second part, writing $\varphi_X, \varphi_Y$ for the pdfs of $X$ and $Y$, we have
$$P(Y > X \mid X \geq 74, Y \geq 75) = \frac{\int_{74}^{75}\varphi_X(x)\,\mathrm{d}x\int_{75}^{\infty}\varphi_Y(y)\,\mathrm{d}y + \int_{75}^{\infty}\int_{y=x}^{\infty}\varphi_Y(y)\varphi_X(x)\,\mathrm{d}y\,\mathrm{d}x}{\int_{74}^{\infty}\varphi_X(x)\,\mathrm{d}x\int_{75}^{\infty}\varphi_Y(y)\,\mathrm{d}y},$$
which is approximately 0.7558. So even though the Netherlands men are only slightly taller, if we consider only the tallest bunch, the Netherlands players will be much taller on average.
5.6 Transformation of random variables

We will now look at what happens when we apply a function to random variables. We first look at the simple case where there is just one variable, and then move on to the general case where we have multiple variables and can mix them together.

Single random variable

Theorem. If $X$ is a continuous random variable with pdf $f(x)$, and $h(x)$ is a continuous, strictly increasing function with $h^{-1}(x)$ differentiable, then $Y = h(X)$ is a random variable with pdf
$$f_Y(y) = f_X(h^{-1}(y))\frac{\mathrm{d}}{\mathrm{d}y}h^{-1}(y).$$

Proof.
$$F_Y(y) = P(Y \leq y) = P(h(X) \leq y) = P(X \leq h^{-1}(y)) = F(h^{-1}(y)).$$
Take the derivative with respect to $y$ to obtain
$$f_Y(y) = F_Y'(y) = f(h^{-1}(y))\frac{\mathrm{d}}{\mathrm{d}y}h^{-1}(y).$$
It is often easier to redo the proof than to remember the result.

Example. Let $X \sim U[0, 1]$ and $Y = -\log X$. Then
$$P(Y \leq y) = P(-\log X \leq y) = P(X \geq e^{-y}) = 1 - e^{-y}.$$
But this is the cumulative distribution function of $\mathcal{E}(1)$. So $Y$ is exponentially distributed with $\lambda = 1$.
In general, we get the following result:

Theorem. Let $U \sim U[0, 1]$. For any strictly increasing distribution function $F$, the random variable $X = F^{-1}(U)$ has distribution function $F$.

Proof.
$$P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x).$$

The condition "strictly increasing" is needed for the inverse to exist. If it is not satisfied, we can define
$$F^{-1}(u) = \inf\{x : F(x) \geq u\}, \quad 0 < u < 1,$$
and the same result holds.
This can also be done for discrete random variables with $P(X = x_i) = p_i$, by letting $X = x_j$ if
$$\sum_{i=0}^{j-1} p_i \leq U < \sum_{i=0}^{j} p_i,$$
for $U \sim U[0, 1]$.
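As a concrete illustration of the theorem (a sketch of mine, not from the notes): applying $F^{-1}(u) = -\log(1 - u)/\lambda$ to uniforms produces exponential samples, and the discrete version just walks through the cumulative probabilities.

```python
import math
import random
random.seed(5)

def exponential(lam):
    # F(x) = 1 - e^{-lam x}, so F^{-1}(u) = -log(1 - u) / lam
    return -math.log(1 - random.random()) / lam

def discrete(values, probs):
    # X = x_j if  sum_{i<j} p_i <= U < sum_{i<=j} p_i
    u, acc = random.random(), 0.0
    for x, p in zip(values, probs):
        acc += p
        if u < acc:
            return x
    return values[-1]

xs = [exponential(2.0) for _ in range(100_000)]
print(sum(xs) / len(xs))                        # about 1/lambda = 0.5
print(discrete(["a", "b", "c"], [0.2, 0.5, 0.3]))
```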
Multiple random variables

Suppose $X_1, X_2, \ldots, X_n$ are random variables with joint pdf $f$. Let
$$\begin{aligned}
Y_1 &= r_1(X_1, \ldots, X_n) \\
Y_2 &= r_2(X_1, \ldots, X_n) \\
&\;\;\vdots \\
Y_n &= r_n(X_1, \ldots, X_n).
\end{aligned}$$
For example, we might have $Y_1 = \frac{X_1}{X_1 + X_2}$ and $Y_2 = X_1 + X_2$.
Let $R \subseteq \mathbb{R}^n$ be such that $P((X_1, \ldots, X_n) \in R) = 1$, i.e. $R$ is the set of all values $(X_i)$ can take.
Suppose $S$ is the image of $R$ under the above transformation, and the map $R \to S$ is bijective. Then there exists an inverse function
$$\begin{aligned}
X_1 &= s_1(Y_1, \ldots, Y_n) \\
X_2 &= s_2(Y_1, \ldots, Y_n) \\
&\;\;\vdots \\
X_n &= s_n(Y_1, \ldots, Y_n).
\end{aligned}$$
For example, if $X_1, X_2$ are the coordinates of a random point in Cartesian coordinates, $Y_1, Y_2$ might be the coordinates in polar coordinates.
Definition (Jacobian determinant). Suppose $\frac{\partial s_i}{\partial y_j}$ exists and is continuous at every point $(y_1, \ldots, y_n) \in S$. Then the Jacobian determinant is
$$J = \frac{\partial(s_1, \ldots, s_n)}{\partial(y_1, \ldots, y_n)} = \det\begin{pmatrix}\frac{\partial s_1}{\partial y_1} & \cdots & \frac{\partial s_1}{\partial y_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial s_n}{\partial y_1} & \cdots & \frac{\partial s_n}{\partial y_n}\end{pmatrix}.$$
Take $A \subseteq R$ and $B = r(A)$. Then using results from IA Vector Calculus, we get
$$P((X_1, \ldots, X_n) \in A) = \int_A f(x_1, \ldots, x_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n = \int_B f\big(s_1(y_1, \ldots, y_n), \ldots, s_n(y_1, \ldots, y_n)\big)|J|\,\mathrm{d}y_1\cdots\mathrm{d}y_n = P((Y_1, \ldots, Y_n) \in B).$$
So

Proposition. $(Y_1, \ldots, Y_n)$ has density
$$g(y_1, \ldots, y_n) = f\big(s_1(y_1, \ldots, y_n), \ldots, s_n(y_1, \ldots, y_n)\big)|J|$$
if $(y_1, \ldots, y_n) \in S$, and 0 otherwise.
Example. Suppose $(X, Y)$ has density
$$f(x, y) = \begin{cases} 4xy & 0 \leq x \leq 1,\ 0 \leq y \leq 1 \\ 0 & \text{otherwise.}\end{cases}$$
We see that $X$ and $Y$ are independent, each with density $2x$ on $[0, 1]$.
Define $U = X/Y$ and $V = XY$. Then $X = \sqrt{UV}$ and $Y = \sqrt{V/U}$. The Jacobian is
$$\det\begin{pmatrix}\partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v\end{pmatrix} = \det\begin{pmatrix}\frac{1}{2}\sqrt{v/u} & \frac{1}{2}\sqrt{u/v} \\ -\frac{1}{2}\sqrt{v/u^3} & \frac{1}{2}\sqrt{1/uv}\end{pmatrix} = \frac{1}{2u}.$$
Alternatively, we can find this by computing
$$\det\begin{pmatrix}\partial u/\partial x & \partial u/\partial y \\ \partial v/\partial x & \partial v/\partial y\end{pmatrix} = 2u$$
and taking the reciprocal. So
$$g(u, v) = 4\sqrt{uv}\sqrt{\frac{v}{u}}\cdot\frac{1}{2u} = \frac{2v}{u}$$
if $(u, v)$ is in the image $S$, and 0 otherwise, i.e.
$$g(u, v) = \frac{2v}{u}\,I[(u, v) \in S].$$
Since this cannot be separated into a function of $u$ times a function of $v$ (the region $S$ is not a product region), $U$ and $V$ are not independent.
In the linear case, life is easy. Suppose
$$\mathbf{Y} = \begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix} = A\begin{pmatrix}X_1\\ \vdots\\ X_n\end{pmatrix} = A\mathbf{X}.$$
Then $\mathbf{X} = A^{-1}\mathbf{Y}$, and $\frac{\partial x_i}{\partial y_j} = (A^{-1})_{ij}$. So $|J| = |\det(A^{-1})| = |\det A|^{-1}$, and
$$g(y_1, \ldots, y_n) = \frac{1}{|\det A|}f(A^{-1}\mathbf{y}).$$
Example. Suppose $X_1, X_2$ have joint pdf $f(x_1, x_2)$, and we want to find the pdf of $Y = X_1 + X_2$. We let $Z = X_2$. Then $X_1 = Y - Z$ and $X_2 = Z$, so
$$\begin{pmatrix}Y\\ Z\end{pmatrix} = \begin{pmatrix}1 & 1\\ 0 & 1\end{pmatrix}\begin{pmatrix}X_1\\ X_2\end{pmatrix} = A\mathbf{X}.$$
Then $|J| = 1/|\det A| = 1$, so
$$g(y, z) = f(y - z, z).$$
So
$$g_Y(y) = \int_{-\infty}^{\infty} f(y - z, z)\,\mathrm{d}z = \int_{-\infty}^{\infty} f(z, y - z)\,\mathrm{d}z.$$
If $X_1$ and $X_2$ are independent, so that $f(x_1, x_2) = f_1(x_1)f_2(x_2)$, then
$$g(y) = \int_{-\infty}^{\infty} f_1(z)f_2(y - z)\,\mathrm{d}z.$$
Non-injective transformations

We previously discussed transformation of random variables by injective maps. What if the mapping is not injective? There is no simple formula in that case, and we have to work out each case individually.

Example. Suppose $X$ has pdf $f$. What is the pdf of $Y = |X|$?
We use the definition directly. We have
$$P(|X| \in (a, b)) = \int_a^b f(x)\,\mathrm{d}x + \int_{-b}^{-a} f(x)\,\mathrm{d}x = \int_a^b (f(x) + f(-x))\,\mathrm{d}x.$$
So
$$f_Y(x) = f(x) + f(-x),$$
which makes sense, since getting $|X| = x$ is equivalent to getting $X = x$ or $X = -x$.
Example. Suppose $X_1 \sim \mathcal{E}(\lambda)$ and $X_2 \sim \mathcal{E}(\mu)$ are independent random variables. Let $Y = \min(X_1, X_2)$. Then
$$P(Y \geq t) = P(X_1 \geq t, X_2 \geq t) = P(X_1 \geq t)P(X_2 \geq t) = e^{-\lambda t}e^{-\mu t} = e^{-(\lambda + \mu)t}.$$
So $Y \sim \mathcal{E}(\lambda + \mu)$.
Given random variables, not only can we ask for the minimum of the variables, but also for, say, the second smallest one. In general, we define the order statistics as follows:

Definition (Order statistics). Suppose that $X_1, \ldots, X_n$ are some random variables, and $Y_1, \ldots, Y_n$ is $X_1, \ldots, X_n$ arranged in increasing order, i.e. $Y_1 \leq Y_2 \leq \cdots \leq Y_n$. This is the order statistics. We sometimes write $Y_i = X_{(i)}$.

Assume the $X_i$ are iid with cdf $F$ and pdf $f$. Then the cdf of $Y_n$ is
$$P(Y_n \leq y) = P(X_1 \leq y, \ldots, X_n \leq y) = P(X_1 \leq y)\cdots P(X_n \leq y) = F(y)^n.$$
So the pdf of $Y_n$ is
$$\frac{\mathrm{d}}{\mathrm{d}y}F(y)^n = nf(y)F(y)^{n-1}.$$
Also,
$$P(Y_1 \geq y) = P(X_1 \geq y, \ldots, X_n \geq y) = (1 - F(y))^n.$$
What about the joint distribution of $Y_1$ and $Y_n$?
$$\begin{aligned}
G(y_1, y_n) &= P(Y_1 \leq y_1, Y_n \leq y_n) \\
&= P(Y_n \leq y_n) - P(Y_1 \geq y_1, Y_n \leq y_n) \\
&= F(y_n)^n - (F(y_n) - F(y_1))^n.
\end{aligned}$$
Then the joint pdf is
$$g(y_1, y_n) = \frac{\partial^2}{\partial y_1\,\partial y_n}G(y_1, y_n) = n(n-1)(F(y_n) - F(y_1))^{n-2}f(y_1)f(y_n).$$
We can think about this result in terms of the multinomial distribution. By definition, the probability that $Y_1 \in [y_1, y_1 + \delta)$ and $Y_n \in (y_n - \delta, y_n]$ is approximately $g(y_1, y_n)\delta^2$.
Suppose that $\delta$ is sufficiently small that the other $n - 2$ $X_i$'s are very unlikely to fall into $[y_1, y_1 + \delta)$ and $(y_n - \delta, y_n]$. Then to find the probability required, we can treat the sample space as three bins. We want exactly one $X_i$ to fall into each of the first and last bins, and $n - 2$ $X_i$'s to fall into the middle one. There are $\binom{n}{1, n-2, 1} = n(n-1)$ ways of doing so.
The probability of each point falling into the middle bin is $F(y_n) - F(y_1)$, and the probabilities of falling into the first and last bins are $f(y_1)\delta$ and $f(y_n)\delta$ respectively. So the probability of $Y_1 \in [y_1, y_1 + \delta)$ and $Y_n \in (y_n - \delta, y_n]$ is
$$n(n-1)(F(y_n) - F(y_1))^{n-2}f(y_1)f(y_n)\delta^2,$$
and the result follows.
We can also find the joint distribution of all the order statistics, say $g$, since it is just given by
$$g(y_1, \ldots, y_n) = n!\,f(y_1)\cdots f(y_n)$$
if $y_1 \leq y_2 \leq \cdots \leq y_n$, and 0 otherwise. We have this formula because there are $n!$ orderings of $x_1, \ldots, x_n$ that produce a given order statistics $y_1, \ldots, y_n$, and the pdf of each is $f(y_1)\cdots f(y_n)$.
In the case of iid exponential variables, we find a nice distribution for the order statistics.
Example. Let $X_1, \ldots, X_n$ be iid $\mathcal{E}(\lambda)$, and $Y_1, \ldots, Y_n$ be the order statistics. Let
$$Z_1 = Y_1, \quad Z_2 = Y_2 - Y_1, \quad \ldots, \quad Z_n = Y_n - Y_{n-1}.$$
These are the distances between the occurrences. We can write this as $\mathbf{Z} = A\mathbf{Y}$, with
$$A = \begin{pmatrix}1 & 0 & 0 & \cdots & 0\\ -1 & 1 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & -1 & 1\end{pmatrix}.$$
Then $\det(A) = 1$ and hence $|J| = 1$. Suppose that the pdf of $Z_1, \ldots, Z_n$ is, say, $h$. Then
$$\begin{aligned}
h(z_1, \ldots, z_n) &= g(y_1, \ldots, y_n)\cdot 1 \\
&= n!\,f(y_1)\cdots f(y_n) \\
&= n!\,\lambda^n e^{-\lambda(y_1 + \cdots + y_n)} \\
&= n!\,\lambda^n e^{-\lambda(nz_1 + (n-1)z_2 + \cdots + z_n)} \\
&= \prod_{i=1}^{n}(\lambda i)e^{-(\lambda i)z_{n+1-i}}.
\end{aligned}$$
Since $h$ is expressed as a product of $n$ density functions, we have
$$Z_i \sim \mathcal{E}\big((n + 1 - i)\lambda\big),$$
with all the $Z_i$ independent.
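A short simulation sketch of this last result (my own, not from the notes): sort $n$ iid exponentials, take the spacings, and check that $Z_i$ has mean $1/((n+1-i)\lambda)$.

```python
import math
import random
random.seed(6)

n, lam, reps = 5, 1.0, 100_000
sums = [0.0] * n
for _ in range(reps):
    y = sorted(-math.log(1 - random.random()) / lam for _ in range(n))
    z = [y[0]] + [y[i] - y[i - 1] for i in range(1, n)]   # spacings Z_1, ..., Z_n
    for i in range(n):
        sums[i] += z[i]

for i in range(n):
    # sample mean of Z_{i+1} should be close to 1 / ((n - i) * lam)
    print(sums[i] / reps, 1 / ((n - i) * lam))
```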
5.7 Moment generating functions

If $X$ is a continuous random variable, then the analogue of the probability generating function is the moment generating function:

Definition (Moment generating function). The moment generating function of a random variable $X$ is
$$m(\theta) = E[e^{\theta X}].$$
For those $\theta$ for which $m(\theta)$ is finite, we have
$$m(\theta) = \int_{-\infty}^{\infty} e^{\theta x}f(x)\,\mathrm{d}x.$$
We can prove results similar to those we had for probability generating functions.
We will assume the following without proof:

Theorem. The mgf determines the distribution of $X$, provided $m(\theta)$ is finite for all $\theta$ in some interval containing the origin.

Definition (Moment). The $r$th moment of $X$ is $E[X^r]$.

Theorem. The $r$th moment of $X$ is the coefficient of $\frac{\theta^r}{r!}$ in the power series expansion of $m(\theta)$, and is given by
$$E[X^r] = \left.\frac{\mathrm{d}^r}{\mathrm{d}\theta^r}m(\theta)\right|_{\theta=0} = m^{(r)}(0).$$

Proof. We have
$$e^{\theta X} = 1 + \theta X + \frac{\theta^2}{2!}X^2 + \cdots,$$
so
$$m(\theta) = E[e^{\theta X}] = 1 + \theta E[X] + \frac{\theta^2}{2!}E[X^2] + \cdots.$$
Example. Let $X \sim \mathcal{E}(\lambda)$. Then its mgf is
$$E[e^{\theta X}] = \int_0^{\infty} e^{\theta x}\lambda e^{-\lambda x}\,\mathrm{d}x = \lambda\int_0^{\infty} e^{-(\lambda - \theta)x}\,\mathrm{d}x = \frac{\lambda}{\lambda - \theta}$$
for $\theta < \lambda$. So
$$E[X] = m'(0) = \left.\frac{\lambda}{(\lambda - \theta)^2}\right|_{\theta=0} = \frac{1}{\lambda}.$$
Also,
$$E[X^2] = m''(0) = \left.\frac{2\lambda}{(\lambda - \theta)^3}\right|_{\theta=0} = \frac{2}{\lambda^2}.$$
So
$$\operatorname{var}(X) = E[X^2] - E[X]^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$
Theorem. If $X$ and $Y$ are independent random variables with moment generating functions $m_X(\theta), m_Y(\theta)$, then $X + Y$ has mgf $m_{X+Y}(\theta) = m_X(\theta)m_Y(\theta)$.

Proof.
$$E[e^{\theta(X+Y)}] = E[e^{\theta X}e^{\theta Y}] = E[e^{\theta X}]E[e^{\theta Y}] = m_X(\theta)m_Y(\theta).$$
6 More distributions

6.1 Cauchy distribution

Definition (Cauchy distribution). The Cauchy distribution has pdf
$$f(x) = \frac{1}{\pi(1 + x^2)}$$
for $-\infty < x < \infty$.

We check that this is a genuine distribution:
$$\int_{-\infty}^{\infty}\frac{1}{\pi(1 + x^2)}\,\mathrm{d}x = \int_{-\pi/2}^{\pi/2}\frac{1}{\pi}\,\mathrm{d}\theta = 1,$$
with the substitution $x = \tan\theta$. The distribution is a bell-shaped curve.

Proposition. The mean of the Cauchy distribution is undefined, while $E[X^2] = \infty$.

Proof.
$$E[X] = \int_0^{\infty}\frac{x}{\pi(1 + x^2)}\,\mathrm{d}x + \int_{-\infty}^{0}\frac{x}{\pi(1 + x^2)}\,\mathrm{d}x = \infty - \infty,$$
which is undefined, but $E[X^2] = \infty + \infty = \infty$.
Suppose $X, Y$ are independent Cauchy random variables, and let $Z = X + Y$. Then
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)f_Y(z - x)\,\mathrm{d}x = \int_{-\infty}^{\infty}\frac{1}{\pi^2}\frac{1}{(1 + x^2)(1 + (z - x)^2)}\,\mathrm{d}x = \frac{1/2}{\pi(1 + (z/2)^2)}$$
for all $-\infty < z < \infty$ (the integral can be evaluated using a tedious partial fraction expansion).
So $\frac{1}{2}Z$ has a Cauchy distribution; in other words, the arithmetic mean of two independent Cauchy random variables is again a Cauchy random variable.
By induction, we can show that $\frac{1}{n}(X_1 + \cdots + X_n)$ follows a Cauchy distribution. This becomes a "counter-example" to things like the weak law of large numbers and the central limit theorem. Of course, this is because those theorems require the random variable to have a mean, which the Cauchy distribution lacks.

Example.
(i) If $\Theta \sim U[-\frac{\pi}{2}, \frac{\pi}{2}]$, then $X = \tan\Theta$ has a Cauchy distribution. For example, if we fire a bullet at a wall 1 metre away at a random angle $\theta$, the vertical displacement $X = \tan\theta$ follows a Cauchy distribution.
(ii) If $X, Y \sim N(0, 1)$ are independent, then $X/Y$ has a Cauchy distribution.
6.2 Gamma distribution

Suppose $X_1, \ldots, X_n$ are iid $\mathcal{E}(\lambda)$, and let $S_n = X_1 + \cdots + X_n$. Then the mgf of $S_n$ is
$$E\big[e^{\theta(X_1 + \cdots + X_n)}\big] = E\big[e^{\theta X_1}\big]\cdots E\big[e^{\theta X_n}\big] = \big(E[e^{\theta X}]\big)^n = \left(\frac{\lambda}{\lambda - \theta}\right)^n.$$
We call this distribution the gamma distribution. We claim that it has the following pdf:

Definition (Gamma distribution). The gamma distribution $\Gamma(n, \lambda)$ has pdf
$$f(x) = \frac{\lambda^n x^{n-1}e^{-\lambda x}}{(n-1)!}$$
for $x \geq 0$.
We can show that this is a distribution by showing that it integrates to 1.
We now show that this is indeed the distribution of the sum of $n$ iid $\mathcal{E}(\lambda)$, by computing its mgf:
$$E[e^{\theta X}] = \int_0^{\infty} e^{\theta x}\frac{\lambda^n x^{n-1}e^{-\lambda x}}{(n-1)!}\,\mathrm{d}x = \left(\frac{\lambda}{\lambda - \theta}\right)^n\int_0^{\infty}\frac{(\lambda - \theta)^n x^{n-1}e^{-(\lambda - \theta)x}}{(n-1)!}\,\mathrm{d}x.$$
The integral is just the integral of the $\Gamma(n, \lambda - \theta)$ density, which gives one. So we have
$$E[e^{\theta X}] = \left(\frac{\lambda}{\lambda - \theta}\right)^n.$$
6.3 Beta distribution*

Suppose $X_1, \ldots, X_n$ are iid $U[0, 1]$, and let $Y_1 \leq Y_2 \leq \cdots \leq Y_n$ be the order statistics. Then the pdf of $Y_i$ is
$$f(y) = \frac{n!}{(i-1)!(n-i)!}y^{i-1}(1 - y)^{n-i}.$$
Note that the leading term is the multinomial coefficient $\binom{n}{i-1,\,1,\,n-i}$. The formula is obtained using the same technique as for finding the pdf of the order statistics.
This is the beta distribution: $Y_i \sim \beta(i, n - i + 1)$. In general:

Definition (Beta distribution). The beta distribution $\beta(a, b)$ has pdf
$$f(x; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1 - x)^{b-1}$$
for $0 \leq x \leq 1$.
This has mean $a/(a + b)$. Its moment generating function is
$$m(\theta) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{a + r}{a + b + r}\right)\frac{\theta^k}{k!},$$
which is horrendous!
6.4 More on the normal distribution

Proposition. The moment generating function of $N(\mu, \sigma^2)$ is
$$E[e^{\theta X}] = \exp\left(\theta\mu + \frac{1}{2}\theta^2\sigma^2\right).$$

Proof.
$$E[e^{\theta X}] = \int_{-\infty}^{\infty} e^{\theta x}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,\mathrm{d}x.$$
Substitute $z = \frac{x - \mu}{\sigma}$. Then
$$E[e^{\theta X}] = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{\theta(\mu + \sigma z)}e^{-\frac{1}{2}z^2}\,\mathrm{d}z = e^{\theta\mu + \frac{1}{2}\theta^2\sigma^2}\int_{-\infty}^{\infty}\underbrace{\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(z - \theta\sigma)^2}}_{\text{pdf of }N(\sigma\theta,\,1)}\,\mathrm{d}z = e^{\theta\mu + \frac{1}{2}\theta^2\sigma^2}.$$
Theorem. Suppose $X, Y$ are independent random variables with $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$. Then
(i) $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$,
(ii) $aX \sim N(a\mu_1, a^2\sigma_1^2)$.

Proof.
(i)
$$E[e^{\theta(X+Y)}] = E[e^{\theta X}]\cdot E[e^{\theta Y}] = e^{\mu_1\theta + \frac{1}{2}\sigma_1^2\theta^2}\cdot e^{\mu_2\theta + \frac{1}{2}\sigma_2^2\theta^2} = e^{(\mu_1 + \mu_2)\theta + \frac{1}{2}(\sigma_1^2 + \sigma_2^2)\theta^2},$$
which is the mgf of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
(ii)
$$E[e^{\theta(aX)}] = E[e^{(\theta a)X}] = e^{\mu_1(a\theta) + \frac{1}{2}\sigma_1^2(a\theta)^2} = e^{(a\mu_1)\theta + \frac{1}{2}(a^2\sigma_1^2)\theta^2},$$
which is the mgf of $N(a\mu_1, a^2\sigma_1^2)$.
Finally, suppose $X \sim N(0, 1)$. Write $\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ for its pdf. It would be very difficult to find a closed form for its cumulative distribution function, but we can find an upper bound for the tail:
$$P(X \geq x) = \int_x^{\infty}\phi(t)\,\mathrm{d}t \leq \int_x^{\infty}\left(1 + \frac{1}{t^2}\right)\phi(t)\,\mathrm{d}t = \frac{1}{x}\phi(x).$$
To see that the last step works, simply check that $\frac{\mathrm{d}}{\mathrm{d}t}\left(-\frac{1}{t}\phi(t)\right) = \left(1 + \frac{1}{t^2}\right)\phi(t)$. So
$$P(X \geq x) \leq \frac{1}{x}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2},$$
and hence, roughly, $\log P(X \geq x) \sim -\frac{1}{2}x^2$ for large $x$.
6.5 Multivariate normal

Let $X_1, \ldots, X_n$ be iid $N(0, 1)$. Then their joint density is
$$g(x_1, \ldots, x_n) = \prod_{i=1}^{n}\phi(x_i) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x_i^2} = \frac{1}{(2\pi)^{n/2}}e^{-\frac{1}{2}\sum_{i=1}^{n}x_i^2} = \frac{1}{(2\pi)^{n/2}}e^{-\frac{1}{2}\mathbf{x}^T\mathbf{x}},$$
where $\mathbf{x} = (x_1, \ldots, x_n)^T$.
This result works when $X_1, \ldots, X_n$ are iid $N(0, 1)$. Suppose now that we are interested in
$$\mathbf{Z} = \boldsymbol{\mu} + A\mathbf{X},$$
where $A$ is an invertible $n \times n$ matrix. We can think of this as $n$ measurements $\mathbf{Z}$ that are affected by underlying standard-normal factors $\mathbf{X}$. Then
$$\mathbf{X} = A^{-1}(\mathbf{Z} - \boldsymbol{\mu})$$
and
$$|J| = |\det(A^{-1})| = \frac{1}{|\det A|}.$$
So
$$f(z_1, \ldots, z_n) = \frac{1}{(2\pi)^{n/2}}\frac{1}{|\det A|}\exp\left(-\frac{1}{2}(A^{-1}(\mathbf{z} - \boldsymbol{\mu}))^T(A^{-1}(\mathbf{z} - \boldsymbol{\mu}))\right) = \frac{1}{(2\pi)^{n/2}\sqrt{\det\Sigma}}\exp\left(-\frac{1}{2}(\mathbf{z} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{z} - \boldsymbol{\mu})\right),$$
where $\Sigma = AA^T$ and $\Sigma^{-1} = (A^{-1})^T A^{-1}$, noting that $|\det A| = \sqrt{\det\Sigma}$. We say
$$\mathbf{Z} = \begin{pmatrix}Z_1\\ \vdots\\ Z_n\end{pmatrix} \sim MVN(\boldsymbol{\mu}, \Sigma) \text{ or } N(\boldsymbol{\mu}, \Sigma).$$
This is the multivariate normal.
What is this matrix $\Sigma$? Recall that $\operatorname{cov}(Z_i, Z_j) = E[(Z_i - \mu_i)(Z_j - \mu_j)]$. It turns out this covariance is the $(i, j)$th entry of $\Sigma$, since
$$E[(\mathbf{Z} - \boldsymbol{\mu})(\mathbf{Z} - \boldsymbol{\mu})^T] = E[A\mathbf{X}(A\mathbf{X})^T] = E[A\mathbf{X}\mathbf{X}^TA^T] = A\,E[\mathbf{X}\mathbf{X}^T]A^T = AIA^T = AA^T = \Sigma.$$
So we also call $\Sigma$ the covariance matrix.
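The construction $\mathbf{Z} = \boldsymbol{\mu} + A\mathbf{X}$ is also how multivariate normal samples are typically generated in practice: choose any $A$ with $AA^T = \Sigma$, e.g. the Cholesky factor. Below is a small numpy sketch of this idea (mine; the particular $\Sigma$ is an arbitrary example).

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # an arbitrary covariance matrix

A = np.linalg.cholesky(Sigma)           # A @ A.T == Sigma
X = rng.standard_normal((100_000, 2))   # rows are iid N(0, 1) factors
Z = mu + X @ A.T                        # rows are samples of N(mu, Sigma)

print(Z.mean(axis=0))                   # close to mu
print(np.cov(Z, rowvar=False))          # close to Sigma
```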
In the special case where $n = 1$, this is just the usual normal distribution, with $\Sigma = \sigma^2$.
Now suppose $Z_1, \ldots, Z_n$ have pairwise covariance 0. Then $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, and
$$f(z_1, \ldots, z_n) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma_i}e^{-\frac{1}{2\sigma_i^2}(z_i - \mu_i)^2}.$$
So $Z_1, \ldots, Z_n$ are independent, with $Z_i \sim N(\mu_i, \sigma_i^2)$.
Here we proved that if the covariances are 0, then the variables are independent. However, this is only true when the $Z_i$ are jointly (multivariate) normal. It is generally not true for arbitrary distributions.
For random variables that are vectors, we also need to modify our definition of the moment generating function. We define it to be
$$m(\boldsymbol{\theta}) = E[e^{\boldsymbol{\theta}^T\mathbf{X}}] = E[e^{\theta_1X_1 + \cdots + \theta_nX_n}].$$
Bivariate normal

This is the special case of the multivariate normal with $n = 2$. Since there aren't too many terms, we can actually write them out.
The bivariate normal has
$$\Sigma = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}.$$
Then
$$\operatorname{corr}(X_1, X_2) = \frac{\operatorname{cov}(X_1, X_2)}{\sqrt{\operatorname{var}(X_1)\operatorname{var}(X_2)}} = \frac{\rho\sigma_1\sigma_2}{\sigma_1\sigma_2} = \rho,$$
and
$$\Sigma^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix}\sigma_1^{-2} & -\rho\sigma_1^{-1}\sigma_2^{-1}\\ -\rho\sigma_1^{-1}\sigma_2^{-1} & \sigma_2^{-2}\end{pmatrix}.$$
The joint pdf of the bivariate normal with zero mean is
$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\exp\left(-\frac{1}{2(1 - \rho^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2\rho x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)\right).$$
If the mean is non-zero, replace $x_i$ with $x_i - \mu_i$.
The joint mgf of the bivariate normal is
$$m(\theta_1, \theta_2) = \exp\left(\theta_1\mu_1 + \theta_2\mu_2 + \frac{1}{2}\big(\theta_1^2\sigma_1^2 + 2\theta_1\theta_2\rho\sigma_1\sigma_2 + \theta_2^2\sigma_2^2\big)\right).$$
Nice and elegant.
7 Central limit theorem

Suppose $X_1, \ldots, X_n$ are iid random variables with mean $\mu$ and variance $\sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Then we have previously shown that
$$\operatorname{var}\!\left(\frac{S_n}{\sqrt{n}}\right) = \sigma^2.$$

Theorem (Central limit theorem). Let $X_1, X_2, \ldots$ be iid random variables with $E[X_i] = \mu$, $\operatorname{var}(X_i) = \sigma^2 < \infty$. Define
$$S_n = X_1 + \cdots + X_n.$$
Then for all finite intervals $(a, b)$,
$$\lim_{n \to \infty} P\left(a \leq \frac{S_n - n\mu}{\sigma\sqrt{n}} \leq b\right) = \int_a^b\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}t^2}\,\mathrm{d}t.$$
Note that the integrand is the pdf of a standard normal. We say
$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{D} N(0, 1).$$
To show this, we will use the continuity theorem without proof:

Theorem (Continuity theorem). If the random variables $X_1, X_2, \ldots$ have mgfs $m_1(\theta), m_2(\theta), \ldots$ and $m_n(\theta) \to m(\theta)$ as $n \to \infty$ for all $\theta$, then $X_n \xrightarrow{D}$ the random variable with mgf $m(\theta)$.
We now provide a sketch proof of the central limit theorem:

Proof. wlog, assume $\mu = 0$ and $\sigma^2 = 1$ (otherwise replace $X_i$ with $\frac{X_i - \mu}{\sigma}$). Then
$$m_{X_i}(\theta) = E[e^{\theta X_i}] = 1 + \theta E[X_i] + \frac{\theta^2}{2!}E[X_i^2] + \cdots = 1 + \frac{1}{2}\theta^2 + \frac{1}{3!}\theta^3E[X_i^3] + \cdots.$$
Now consider $S_n/\sqrt{n}$. Then
$$\begin{aligned}
E[e^{\theta S_n/\sqrt{n}}] &= E[e^{\theta(X_1 + \cdots + X_n)/\sqrt{n}}] \\
&= E[e^{\theta X_1/\sqrt{n}}]\cdots E[e^{\theta X_n/\sqrt{n}}] \\
&= \big(E[e^{\theta X_1/\sqrt{n}}]\big)^n \\
&= \left(1 + \frac{1}{2}\theta^2\frac{1}{n} + \frac{1}{3!}\theta^3E[X^3]\frac{1}{n^{3/2}} + \cdots\right)^n \\
&\to e^{\frac{1}{2}\theta^2}
\end{aligned}$$
as $n \to \infty$, since $(1 + a/n)^n \to e^a$. And this is the mgf of the standard normal. So the result follows from the continuity theorem.

Note that this is not a very formal proof, since we had to require $E[X^3]$ to be finite. Also, sometimes the moment generating function is not defined. But this argument works for many "nice" distributions we will meet. The proper proof uses the characteristic function
$$\chi_X(\theta) = E[e^{i\theta X}].$$
].
An important application is to use the normal distribution to approximate a
large binomial.
Let
X
i
B
(1
, p
). Then
S
n
B
(
n, p
). So
E
[
S
n
] =
np
and
var
(
S
n
) =
p
(1
p
).
So
S
n
np
p
np(1 p)
D
N(0, 1).
Example. Suppose two planes fly a route. Each of $n$ passengers chooses a plane at random, so the number of people choosing plane 1 is $S \sim B(n, \frac{1}{2})$. Suppose each plane has $s$ seats. What is
$$F(s) = P(S > s),$$
i.e. the probability that plane 1 is over-booked? We have
$$F(s) = P(S > s) = P\left(\frac{S - n/2}{\sqrt{n\cdot\frac{1}{2}\cdot\frac{1}{2}}} > \frac{s - n/2}{\frac{1}{2}\sqrt{n}}\right).$$
Since
$$\frac{S - n/2}{\frac{1}{2}\sqrt{n}} \approx N(0, 1),$$
we have
$$F(s) \approx 1 - \Phi\left(\frac{s - n/2}{\frac{1}{2}\sqrt{n}}\right).$$
For example, if $n = 1000$ and $s = 537$, then
$$\frac{s - n/2}{\frac{1}{2}\sqrt{n}} \approx 2.34, \quad \Phi(2.34) \approx 0.99,$$
so $F(s) \approx 0.01$. So with only 74 seats as a buffer between the two planes, the probability of overbooking is just 1/100.
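We can check how good the normal approximation is here by computing the exact binomial tail. A quick sketch of mine, using only the standard library:

```python
import math

n, s = 1000, 537

# Exact tail: P(S > s) for S ~ B(n, 1/2)
exact = sum(math.comb(n, k) for k in range(s + 1, n + 1)) / 2 ** n

# Normal approximation: 1 - Phi((s - n/2) / (sqrt(n)/2)), using Phi(z) = (1 + erf(z/sqrt(2)))/2
z = (s - n / 2) / (math.sqrt(n) / 2)
approx = 0.5 * (1 - math.erf(z / math.sqrt(2)))

print(exact, approx)   # both close to 0.01
```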
Example. An unknown proportion $p$ of the electorate will vote Labour. It is desired to find $p$ with an error not exceeding 0.005. How large should the sample be?
We estimate $p$ by
$$\hat{p} = \frac{S_n}{n},$$
where $X_i \sim B(1, p)$. Then
$$P(|\hat{p} - p| \leq 0.005) = P(|S_n - np| \leq 0.005n) = P\Bigg(\underbrace{\frac{|S_n - np|}{\sqrt{np(1 - p)}}}_{\approx N(0,1)} \leq \frac{0.005n}{\sqrt{np(1 - p)}}\Bigg).$$
We want $|\hat{p} - p| \leq 0.005$ with probability at least 0.95. So we want
$$\frac{0.005n}{\sqrt{np(1 - p)}} \geq \Phi^{-1}(0.975) = 1.96$$
(we use 0.975 instead of 0.95 since we are doing a two-tailed test). Since the maximum possible value of $p(1 - p)$ is $1/4$, it suffices to have
$$n \geq 38416.$$
In practice, we don't have that many samples. Instead, we settle for
$$P(|\hat{p} - p| \leq 0.03) \geq 0.95.$$
This just requires $n \geq 1068$.
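The same arithmetic in code, as a tiny sketch of mine: solve $0.005\sqrt{n}/\sqrt{p(1-p)} \geq 1.96$ with the worst case $p(1-p) = 1/4$.

```python
def sample_size(margin, z=1.96, p=0.5):
    # Need margin * sqrt(n) / sqrt(p (1 - p)) >= z, i.e. n >= z^2 p (1 - p) / margin^2
    return z ** 2 * p * (1 - p) / margin ** 2

print(sample_size(0.005))   # about 38416
print(sample_size(0.03))    # about 1067, so take n = 1068
```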
Example (Estimating $\pi$ with Buffon's needle). Recall that if we randomly toss a needle of length $\ell$ onto a floor marked with parallel lines a distance $L$ apart, the probability that the needle hits a line is $p = \frac{2\ell}{\pi L}$.
Suppose we toss the needle $n$ times, and it hits a line $N$ times. Then
$$N \approx N(np, np(1 - p))$$
by the central limit theorem. Write $\hat{p}$ for the observed proportion $N/n$. Then
$$\hat{\pi} = \frac{2\ell}{\hat{p}L} = \frac{\pi\cdot 2\ell/(\pi L)}{\hat{p}} = \frac{\pi p}{p + (\hat{p} - p)} = \pi\left(1 - \frac{\hat{p} - p}{p} + \cdots\right).$$
Hence
$$\hat{\pi} - \pi \approx -\pi\frac{\hat{p} - p}{p}.$$
We know
$$\hat{p} \approx N\left(p, \frac{p(1 - p)}{n}\right),$$
so we find
$$\hat{\pi} - \pi \approx N\left(0, \frac{\pi^2p(1 - p)}{np^2}\right) = N\left(0, \frac{\pi^2(1 - p)}{np}\right).$$
We want a small variance, which means making $p$ as large as possible. Since $p = 2\ell/(\pi L)$, this is maximized by taking $\ell = L$. In this case,
$$p = \frac{2}{\pi},$$
and
$$\hat{\pi} - \pi \approx N\left(0, \frac{(\pi - 2)\pi^2}{2n}\right).$$
If we want to estimate $\pi$ to 3 decimal places, then we need
$$P(|\hat{\pi} - \pi| \leq 0.001) \geq 0.95.$$
This is true if
$$0.001\sqrt{\frac{2n}{(\pi - 2)\pi^2}} \geq \Phi^{-1}(0.975) = 1.96,$$
i.e. $n \geq 2.16 \times 10^7$. So we can obtain $\pi$ to 3 decimal places just by throwing a stick 20 million times! Isn't that exciting?