Part IA Probability
Based on lectures by R. Weber
Notes taken by Dexter Chua
Lent 2015
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
Basic concepts
Classical probability, equally likely outcomes. Combinatorial analysis, permutations
and combinations. Stirling’s formula (asymptotics for log n! proved). [3]
Axiomatic approach
Axioms (countable case). Probability spaces. Inclusion-exclusion formula. Continuity
and subadditivity of probability measures. Independence. Binomial, Poisson and geo-
metric distributions. Relation between Poisson and binomial distributions. Conditional
probability, Bayes’s formula. Examples, including Simpson’s paradox. [5]
Discrete random variables
Expectation. Functions of a random variable, indicator function, variance, standard
deviation. Covariance, independence of random variables. Generating functions: sums
of independent random variables, random sum formula, moments.
Conditional expectation. Random walks: gambler’s ruin, recurrence relations. Dif-
ference equations and their solution. Mean time to absorption. Branching processes:
generating functions and extinction probability. Combinatorial applications of generat-
ing functions. [7]
Continuous random variables
Distributions and density functions. Expectations; expectation of a function of a
random variable. Uniform, normal and exponential random variables. Memoryless
property of exponential distribution. Joint distributions: transformation of random
variables (including Jacobians), examples. Simulation: generating continuous random
variables, independent normal random variables. Geometrical probability: Bertrand’s
paradox, Buffon’s needle. Correlation coefficient, bivariate normal random variables. [6]
Inequalities and limits
Markov’s inequality, Chebyshev’s inequality. Weak law of large numbers. Convexity:
Jensen’s inequality for general random variables, AM/GM inequality.
Moment generating functions and statement (no proof) of continuity theorem. State-
ment of central limit theorem and sketch of proof. Examples, including sampling. [3]
Contents
0 Introduction
1 Classical probability
1.1 Classical probability
1.2 Counting
1.3 Stirling’s formula
2 Axioms of probability
2.1 Axioms and definitions
2.2 Inequalities and formulae
2.3 Independence
2.4 Important discrete distributions
2.5 Conditional probability
3 Discrete random variables
3.1 Discrete random variables
3.2 Inequalities
3.3 Weak law of large numbers
3.4 Multiple random variables
3.5 Probability generating functions
4 Interesting problems
4.1 Branching processes
4.2 Random walk and gambler’s ruin
5 Continuous random variables
5.1 Continuous random variables
5.2 Stochastic ordering and inspection paradox
5.3 Jointly distributed random variables
5.4 Geometric probability
5.5 The normal distribution
5.6 Transformation of random variables
5.7 Moment generating functions
6 More distributions
6.1 Cauchy distribution
6.2 Gamma distribution
6.3 Beta distribution*
6.4 More on the normal distribution
6.5 Multivariate normal
7 Central limit theorem
0 Introduction
In everyday life, we often encounter the term probability, and it is used in many
different ways. For example, we can hear people say:
(i) The probability that a fair coin will land heads is 1/2.
(ii) The probability that a selection of 6 numbers wins the National Lottery Lotto jackpot is 1 in $\binom{49}{6} = 13983816$, or about $7.15112 \times 10^{-8}$.
(iii) The probability that a drawing pin will land 'point up' is 0.62.
(iv) The probability that a large earthquake will occur on the San Andreas Fault in the next 30 years is about 21%.
The first two cases are things derived from logic. For example, we know that
the coin either lands heads or tails. By definition, a fair coin is equally likely to
land heads or tail. So the probability of either must be 1/2.
The third is something probably derived from experiments. Perhaps we did
1000 experiments and 620 of the pins pointed up. The fourth and fifth examples
belong to yet another category, concerning our beliefs and predictions.
We call the first kind “classical probability”, the second kind “frequentist
probability” and the last “subjective probability”. In this course, we only
consider classical probability.
1 Classical probability
We start with a rather informal introduction to probability. Afterwards, in
Chapter 2, we will have a formal axiomatic definition of probability and formally
study their properties.
1.1 Classical probability
Definition (Classical probability). Classical probability applies in a situation
where there is a finite number of equally likely outcomes.
A classical example is the problem of points.
Example. A and B play a game in which they keep throwing coins. If a head lands, then A gets a point. Otherwise, B gets a point. The first person to get 10 points wins a prize.

Now suppose A has got 8 points and B has got 7, but the game has to end because an earthquake struck. How should they divide the prize? We answer this by finding the probability of A winning. Someone must have won by the end of 19 rounds, i.e. after 4 more rounds. If A wins at least 2 of them, then A wins. Otherwise, B wins.

The number of ways this can happen is $\binom{4}{2} + \binom{4}{3} + \binom{4}{4} = 11$, while there are 16 possible outcomes in total. So A should get 11/16 of the prize.
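As a quick computational check (an illustration added here, not part of the original notes), we can enumerate the $2^4$ outcomes of the four remaining tosses directly:

```python
from itertools import product

# A wins the prize iff at least 2 of the 4 remaining tosses are heads.
outcomes = list(product("HT", repeat=4))
a_wins = sum(1 for o in outcomes if o.count("H") >= 2)
print(a_wins, len(outcomes))   # 11 16
```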
In general, consider an experiment that has a random outcome.
Definition (Sample space). The set of all possible outcomes is the sample space, $\Omega$. We can list the outcomes as $\omega_1, \omega_2, \ldots \in \Omega$. Each $\omega \in \Omega$ is an outcome.

Definition (Event). A subset of $\Omega$ is called an event.
Example. When rolling a die, the sample space is $\{1, 2, 3, 4, 5, 6\}$, and each item is an outcome. “Getting an odd number” and “getting 3” are two possible events.
In probability, we will be dealing with sets a lot, so it would be helpful to
come up with some notation.
Definition (Set notations). Given any two events $A, B \subseteq \Omega$,
– The complement of $A$ is $A^C = A' = \bar{A} = \Omega \setminus A$.
– “$A$ or $B$” is the set $A \cup B$.
– “$A$ and $B$” is the set $A \cap B$.
– $A$ and $B$ are mutually exclusive or disjoint if $A \cap B = \emptyset$.
– If $A \subseteq B$, then $A$ occurring implies $B$ occurring.
Definition (Probability). Suppose $\Omega = \{\omega_1, \omega_2, \ldots, \omega_N\}$. Let $A \subseteq \Omega$ be an event. Then the probability of $A$ is
$$P(A) = \frac{\text{Number of outcomes in } A}{\text{Number of outcomes in } \Omega} = \frac{|A|}{N}.$$
Here we are assuming that each outcome is equally likely to happen, which is the case in (fair) dice rolls and coin flips.
Example. Suppose $r$ digits are drawn at random from a table of random digits from 0 to 9. What is the probability that
(i) No digit exceeds $k$;
(ii) The largest digit drawn is $k$?

The sample space is $\Omega = \{(a_1, a_2, \ldots, a_r) : 0 \leq a_i \leq 9\}$. Then $|\Omega| = 10^r$.

Let $A_k = [\text{no digit exceeds } k] = \{(a_1, \ldots, a_r) : 0 \leq a_i \leq k\}$. Then $|A_k| = (k+1)^r$. So
$$P(A_k) = \frac{(k+1)^r}{10^r}.$$
Now let $B_k = [\text{largest digit drawn is } k]$. We can find this by finding all outcomes in which no digit exceeds $k$, and subtracting the number of outcomes in which no digit exceeds $k - 1$. So $|B_k| = |A_k| - |A_{k-1}|$ and
$$P(B_k) = \frac{(k+1)^r - k^r}{10^r}.$$
1.2 Counting
To find probabilities, we often need to count things. For example, in the example above, we had to count the number of elements in $B_k$.
Example. A menu has 6 starters, 7 mains and 6 desserts. How many possible meal combinations are there? Clearly $6 \times 7 \times 6 = 252$.
Here we are using the fundamental rule of counting:
Theorem (Fundamental rule of counting). Suppose we have to make $r$ multiple choices in sequence. There are $m_1$ possibilities for the first choice, $m_2$ possibilities for the second, etc. Then the total number of choices is $m_1 \times m_2 \times \cdots \times m_r$.
Example. How many ways can $1, 2, \ldots, n$ be ordered? The first choice has $n$ possibilities, the second has $n - 1$ possibilities, etc. So there are $n \times (n-1) \times \cdots \times 1 = n!$ ways.
Sampling with or without replacement
Suppose we have to pick $n$ items from a total of $x$ items. We can model this as follows: let $N = \{1, 2, \ldots, n\}$ be the list and let $X = \{1, 2, \ldots, x\}$ be the items. Then each way of picking the items is a function $f : N \to X$ with $f(i) =$ item at the $i$th position.
Definition (Sampling with replacement). When we sample with replacement, after choosing an item, it is put back and can be chosen again. Then any sampling function $f$ satisfies sampling with replacement.
Definition (Sampling without replacement). When we sample without replace-
ment, after choosing an item, we kill it with fire and cannot choose it again.
Then f must be an injective function, and clearly we must have x n.
We can also have sampling with replacement, but we require each item to be
chosen at least once. In this case, f must be surjective.
Example. Suppose $N = \{a, b, c\}$ and $X = \{p, q, r, s\}$. How many injective functions are there $N \to X$?

When we choose $f(a)$, we have 4 options. When we choose $f(b)$, we have 3 left. When we choose $f(c)$, we have 2 choices left. So there are 24 possible choices.
Example. I have $n$ keys in my pocket. We select one at random once and try to unlock. What is the probability that I succeed at the $r$th trial?

Suppose we do it with replacement. We have to fail the first $r - 1$ trials and succeed in the $r$th. So the probability is
$$\frac{(n-1)(n-1)\cdots(n-1)(1)}{n^r} = \frac{(n-1)^{r-1}}{n^r}.$$
Now suppose we are smarter and try without replacement. Then the probability is
$$\frac{(n-1)(n-2)\cdots(n-r+1)(1)}{n(n-1)\cdots(n-r+1)} = \frac{1}{n}.$$
Example (Birthday problem). How many people are needed in a room for the probability that two people have the same birthday to be at least a half? Suppose $f(r)$ is the probability that, in a room of $r$ people, there is a birthday match.

We solve this by finding the probability of no match, $1 - f(r)$. The total number of possibilities of birthday combinations is $365^r$. For nobody to have the same birthday, the first person can have any birthday, the second has 364 choices left, etc. So
$$P(\text{no match}) = \frac{365 \cdot 364 \cdot 363 \cdots (366 - r)}{365 \cdot 365 \cdot 365 \cdots 365}.$$
If we calculate this with a computer, we find that $f(22) = 0.475695$ and $f(23) = 0.507297$.

While this might sound odd since 23 is small, this is because we are thinking about the wrong thing. The probability of a match is related more to the number of pairs of people than to the number of people. With 23 people, we have $23 \times 22/2 = 253$ pairs, which is quite large compared to 365.
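The quoted values of $f(22)$ and $f(23)$ can be reproduced with a few lines of Python (a sketch added for illustration, not part of the notes):

```python
def no_match(r):
    """Probability that r people all have distinct birthdays (365-day year)."""
    p = 1.0
    for i in range(r):
        p *= (365 - i) / 365
    return p

for r in (22, 23):
    print(r, 1 - no_match(r))   # ~0.4757 and ~0.5073
```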
Sampling with or without regard to ordering
There are cases where we don't care about, say, list positions. For example, if we pick two representatives from a class, the order of picking them doesn't matter. In terms of the function $f : N \to X$, after mapping to $f(1), f(2), \ldots, f(n)$, we can
– Leave the list alone.
– Sort the list ascending, i.e. we might get $(2, 5, 4)$ and $(4, 2, 5)$. If we don't care about list positions, these are just equivalent to $(2, 4, 5)$.
– Re-number each item by the number of the draw on which it was first seen. For example, we can rename $(2, 5, 2)$ and $(5, 4, 5)$ both as $(1, 2, 1)$. This happens if the labelling of items doesn't matter.
– Both of the above. So we can rename $(2, 5, 2)$ and $(8, 5, 5)$ both as $(1, 1, 2)$.
Total number of cases
Combining these four possibilities with whether we have replacement, no replacement, or “everything has to be chosen at least once”, we have 12 possible cases of counting. The most important ones are:
– Replacement + with ordering: the number of ways is $x^n$.
– Without replacement + with ordering: the number of ways is $x_{(n)} = x(x-1)\cdots(x-n+1)$.
– Without replacement + without ordering: we only care which items get selected. The number of ways is $\binom{x}{n} = C^x_n = x_{(n)}/n!$.
– Replacement + without ordering: we only care how many times each item got chosen. This is equivalent to partitioning $n$ into $n_1 + n_2 + \cdots + n_k$. Say $n = 6$ and $k = 3$. We can write a particular partition as
$$** \mid * \mid {*}{*}{*}$$
So we have $n + k - 1$ symbols and $k - 1$ of them are bars. So the number of ways is $\binom{n+k-1}{k-1}$.
Multinomial coefficient
Suppose that we have to pick $n$ items, and each item can either be an apple or an orange. The number of ways of picking such that $k$ apples are chosen is, by definition, $\binom{n}{k}$.

In general, suppose we have to fill successive positions in a list of length $n$, with replacement, from a set of $k$ items. The number of ways of doing so such that item $i$ is picked $n_i$ times is defined to be the multinomial coefficient $\binom{n}{n_1, n_2, \ldots, n_k}$.
Definition (Multinomial coefficient). A multinomial coefficient is
$$\binom{n}{n_1, n_2, \ldots, n_k} = \binom{n}{n_1}\binom{n - n_1}{n_2} \cdots \binom{n - n_1 - \cdots - n_{k-1}}{n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}.$$
It is the number of ways to distribute $n$ items into $k$ positions, in which the $i$th position has $n_i$ items.
Example. We know that
$$(x + y)^n = x^n + \binom{n}{1} x^{n-1} y + \cdots + y^n.$$
If we have a trinomial, then
$$(x + y + z)^n = \sum_{n_1, n_2, n_3} \binom{n}{n_1, n_2, n_3} x^{n_1} y^{n_2} z^{n_3}.$$
Example. How many ways can we deal 52 cards to 4 players, each with a hand of 13? The total number of ways is
$$\binom{52}{13, 13, 13, 13} = \frac{52!}{(13!)^4} = 53644737765488792839237440000 \approx 5.36 \times 10^{28}.$$
While computers are still capable of calculating that, what if we tried more cards? Suppose each person has $n$ cards. Then the number of ways is
$$\frac{(4n)!}{(n!)^4},$$
which is huge. We can use Stirling's formula to approximate it:
1.3 Stirling’s formula
Before we state and prove Stirling’s formula, we prove a weaker (but examinable)
version:
Proposition. $\log n! \sim n \log n$.
Proof. Note that
$$\log n! = \sum_{k=1}^n \log k.$$
Now we claim that
$$\int_1^n \log x\, dx \leq \sum_{k=1}^n \log k \leq \int_1^{n+1} \log x\, dx.$$
This is true by considering the diagram: [figure: $y = \ln x$ and $y = \ln(x - 1)$ sandwiching the step function]

We can evaluate the integrals to obtain
$$n \log n - n + 1 \leq \log n! \leq (n + 1)\log(n + 1) - n.$$
Divide both sides by $n \log n$ and let $n \to \infty$. Both sides tend to 1. So
$$\frac{\log n!}{n \log n} \to 1.$$
Now we prove Stirling’s Formula:
Theorem (Stirling's formula). As $n \to \infty$,
$$\log\frac{n!\, e^n}{n^{n + 1/2}} = \log\sqrt{2\pi} + O\left(\frac{1}{n}\right).$$
Corollary.
$$n! \sim \sqrt{2\pi}\, n^{n + 1/2} e^{-n}.$$
Proof. (non-examinable) Define
$$d_n = \log\frac{n!\, e^n}{n^{n+1/2}} = \log n! - (n + 1/2)\log n + n.$$
Then
$$d_n - d_{n+1} = (n + 1/2)\log\frac{n+1}{n} - 1.$$
Write $t = 1/(2n + 1)$. Then
$$d_n - d_{n+1} = \frac{1}{2t}\log\frac{1 + t}{1 - t} - 1.$$
We can simplify by noting that
$$\log(1 + t) - t = -\tfrac{1}{2}t^2 + \tfrac{1}{3}t^3 - \tfrac{1}{4}t^4 + \cdots$$
$$\log(1 - t) + t = -\tfrac{1}{2}t^2 - \tfrac{1}{3}t^3 - \tfrac{1}{4}t^4 - \cdots$$
Then if we subtract the equations and divide by $2t$, we obtain
$$d_n - d_{n+1} = \tfrac{1}{3}t^2 + \tfrac{1}{5}t^4 + \tfrac{1}{7}t^6 + \cdots < \tfrac{1}{3}t^2 + \tfrac{1}{3}t^4 + \tfrac{1}{3}t^6 + \cdots = \frac{1}{3}\frac{t^2}{1 - t^2} = \frac{1}{3}\frac{1}{(2n+1)^2 - 1} = \frac{1}{12}\left(\frac{1}{n} - \frac{1}{n+1}\right).$$
By summing these bounds, we know that
$$d_1 - d_n < \frac{1}{12}\left(1 - \frac{1}{n}\right).$$
Then we know that $d_n$ is bounded below by $d_1 -$ something, and is decreasing since $d_n - d_{n+1}$ is positive. So it converges to a limit $A$. We know $A$ is a lower bound for $d_n$ since $(d_n)$ is decreasing.

Suppose $m > n$. Then $d_n - d_m < \left(\frac{1}{n} - \frac{1}{m}\right)\frac{1}{12}$. So taking the limit as $m \to \infty$, we obtain an upper bound for $d_n$: $d_n < A + 1/(12n)$. Hence we know that
$$A < d_n < A + \frac{1}{12n}.$$
However, all these results are useless if we don't know what $A$ is. To find $A$, we have a small detour to prove a formula:

Take $I_n = \int_0^{\pi/2} \sin^n\theta\, d\theta$. This is decreasing for increasing $n$ as $\sin^n\theta$ gets smaller. We also know that
$$I_n = \int_0^{\pi/2} \sin^n\theta\, d\theta = \left[-\cos\theta\sin^{n-1}\theta\right]_0^{\pi/2} + \int_0^{\pi/2}(n-1)\cos^2\theta\sin^{n-2}\theta\, d\theta = 0 + \int_0^{\pi/2}(n-1)(1 - \sin^2\theta)\sin^{n-2}\theta\, d\theta = (n-1)(I_{n-2} - I_n).$$
So
$$I_n = \frac{n-1}{n}I_{n-2}.$$
We can directly evaluate the integral to obtain $I_0 = \pi/2$, $I_1 = 1$. Then
$$I_{2n} = \frac{1}{2}\cdot\frac{3}{4}\cdots\frac{2n-1}{2n}\cdot\frac{\pi}{2} = \frac{(2n)!}{(2^n n!)^2}\frac{\pi}{2}, \qquad I_{2n+1} = \frac{2}{3}\cdot\frac{4}{5}\cdots\frac{2n}{2n+1} = \frac{(2^n n!)^2}{(2n+1)!}.$$
So using the fact that $I_n$ is decreasing, we know that
$$1 \leq \frac{I_{2n}}{I_{2n+1}} \leq \frac{I_{2n-1}}{I_{2n+1}} = 1 + \frac{1}{2n} \to 1.$$
Using the approximation $n! \sim n^{n+1/2}e^{-n+A}$, where $A$ is the limit we want to find, we can approximate
$$\frac{I_{2n}}{I_{2n+1}} = \pi(2n+1)\frac{((2n)!)^2}{2^{4n+1}(n!)^4} \sim \pi(2n+1)\frac{1}{n e^{2A}} \to \frac{2\pi}{e^{2A}}.$$
Since the last expression tends to 1, we know that $A = \log\sqrt{2\pi}$. Hooray for magic!
This approximation can be improved:
Proposition (non-examinable). We use the $1/12n$ term from the proof above to get a better approximation:
$$\sqrt{2\pi}\, n^{n+1/2} e^{-n + \frac{1}{12n+1}} \leq n! \leq \sqrt{2\pi}\, n^{n+1/2} e^{-n + \frac{1}{12n}}.$$
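A small numerical comparison of $n!$ against the plain and refined approximations (an illustrative Python sketch, not from the notes):

```python
import math

# Ratio of n! to sqrt(2*pi) * n^(n+1/2) * e^(-n), with and without the e^(1/(12n)) factor.
for n in (5, 10, 20):
    exact = math.factorial(n)
    stirling = math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)
    refined = stirling * math.exp(1 / (12 * n))
    print(n, exact / stirling, exact / refined)
```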
Example. Suppose we toss a coin $2n$ times. What is the probability of an equal number of heads and tails? The probability is
$$\binom{2n}{n}2^{-2n} = \frac{(2n)!}{(n!)^2}2^{-2n} \sim \frac{1}{\sqrt{\pi n}}.$$
Example. Suppose we draw 26 cards from 52. What is the probability of getting 13 reds and 13 blacks? The probability is
$$\frac{\binom{26}{13}\binom{26}{13}}{\binom{52}{26}} = 0.2181.$$
2 Axioms of probability
2.1 Axioms and definitions
So far, we have semi-formally defined some probabilistic notions. However, what
we had above was rather restrictive. We were only allowed to have a finite
number of possible outcomes, and all outcomes occur with the same probability.
However, most things in the real world do not fit these descriptions. For example, we cannot use this to model a coin that gives heads with probability $\pi^{-1}$.
In general, “probability” can be defined as follows:
Definition (Probability space). A probability space is a triple $(\Omega, \mathcal{F}, P)$. $\Omega$ is a set called the sample space, $\mathcal{F}$ is a collection of subsets of $\Omega$, and $P : \mathcal{F} \to [0, 1]$ is the probability measure.

$\mathcal{F}$ has to satisfy the following axioms:
(i) $\emptyset, \Omega \in \mathcal{F}$.
(ii) $A \in \mathcal{F} \Rightarrow A^C \in \mathcal{F}$.
(iii) $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^\infty A_i \in \mathcal{F}$.

And $P$ has to satisfy the following Kolmogorov axioms:
(i) $0 \leq P(A) \leq 1$ for all $A \in \mathcal{F}$.
(ii) $P(\Omega) = 1$.
(iii) For any countable collection of events $A_1, A_2, \ldots$ which are disjoint, i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$, we have
$$P\left(\bigcup_i A_i\right) = \sum_i P(A_i).$$

Items in $\Omega$ are known as the outcomes, items in $\mathcal{F}$ are known as the events, and $P(A)$ is the probability of the event $A$.

If $\Omega$ is finite (or countable), we usually take $\mathcal{F}$ to be all the subsets of $\Omega$, i.e. the power set of $\Omega$. However, if $\Omega$ is, say, $\mathbb{R}$, we have to be a bit more careful and only include nice subsets, or else we cannot have a well-defined $P$.
Often it is not helpful to specify the full function
P
. Instead, in discrete cases,
we just specify the probabilities of each outcome, and use the third axiom to
obtain the full P.
Definition (Probability distribution). Let $\Omega = \{\omega_1, \omega_2, \ldots\}$. Choose numbers $p_1, p_2, \ldots$ such that $\sum_{i=1}^\infty p_i = 1$. Let $p(\omega_i) = p_i$. Then define
$$P(A) = \sum_{\omega_i \in A} p(\omega_i).$$
This $P(A)$ satisfies the above axioms, and $p_1, p_2, \ldots$ is the probability distribution.
Using the axioms, we can quickly prove a few rather obvious results.
Theorem.
(i) $P(\emptyset) = 0$.
(ii) $P(A^C) = 1 - P(A)$.
(iii) $A \subseteq B \Rightarrow P(A) \leq P(B)$.
(iv) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Proof.
(i) $\Omega$ and $\emptyset$ are disjoint. So $P(\Omega) + P(\emptyset) = P(\Omega \cup \emptyset) = P(\Omega)$. So $P(\emptyset) = 0$.
(ii) $P(A) + P(A^C) = P(\Omega) = 1$ since $A$ and $A^C$ are disjoint.
(iii) Write $B = A \cup (B \cap A^C)$. Then $P(B) = P(A) + P(B \cap A^C) \geq P(A)$.
(iv) $P(A \cup B) = P(A) + P(B \cap A^C)$. We also know that $P(B) = P(A \cap B) + P(B \cap A^C)$. Then the result follows.
From above, we know that $P(A \cup B) \leq P(A) + P(B)$. So we say that $P$ is a subadditive function. Also, $P(A \cap B) + P(A \cup B) \leq P(A) + P(B)$ (in fact both sides are equal!). We say $P$ is submodular.
The next theorem is better expressed in terms of limits.
Definition (Limit of events). A sequence of events $A_1, A_2, \ldots$ is increasing if $A_1 \subseteq A_2 \subseteq \cdots$. Then we define the limit as
$$\lim_{n\to\infty} A_n = \bigcup_{n=1}^\infty A_n.$$
Similarly, if they are decreasing, i.e. $A_1 \supseteq A_2 \supseteq \cdots$, then
$$\lim_{n\to\infty} A_n = \bigcap_{n=1}^\infty A_n.$$
Theorem. If $A_1, A_2, \ldots$ is increasing or decreasing, then
$$\lim_{n\to\infty} P(A_n) = P\left(\lim_{n\to\infty} A_n\right).$$
Proof. Take $B_1 = A_1$, $B_2 = A_2 \setminus A_1$. In general,
$$B_n = A_n \setminus \bigcup_{i=1}^{n-1} A_i.$$
Then
$$\bigcup_{i=1}^n B_i = \bigcup_{i=1}^n A_i, \qquad \bigcup_{i=1}^\infty B_i = \bigcup_{i=1}^\infty A_i.$$
Then
$$P(\lim A_n) = P\left(\bigcup_{i=1}^\infty A_i\right) = P\left(\bigcup_{i=1}^\infty B_i\right) = \sum_{i=1}^\infty P(B_i) \quad \text{(Axiom III)}$$
$$= \lim_{n\to\infty}\sum_{i=1}^n P(B_i) = \lim_{n\to\infty} P\left(\bigcup_{i=1}^n A_i\right) = \lim_{n\to\infty} P(A_n),$$
and the decreasing case is proven similarly (or we can simply apply the above to $A_i^C$).
2.2 Inequalities and formulae
Theorem (Boole's inequality). For any $A_1, A_2, \ldots$,
$$P\left(\bigcup_{i=1}^\infty A_i\right) \leq \sum_{i=1}^\infty P(A_i).$$
This is also known as the “union bound”.

Proof. Our third axiom states a similar formula that only holds for disjoint sets. So we need a (not so) clever trick to make them disjoint. We define
$$B_1 = A_1, \quad B_2 = A_2 \setminus A_1, \quad B_i = A_i \setminus \bigcup_{k=1}^{i-1} A_k.$$
So we know that
$$\bigcup B_i = \bigcup A_i.$$
But the $B_i$ are disjoint. So our Axiom (iii) gives
$$P\left(\bigcup_i A_i\right) = P\left(\bigcup_i B_i\right) = \sum_i P(B_i) \leq \sum_i P(A_i),$$
where the last inequality follows from (iii) of the theorem above, since $B_i \subseteq A_i$.
Example. Suppose we have a countably infinite number of biased coins. Let $A_k = [k\text{th toss is a head}]$ and $P(A_k) = p_k$. Suppose $\sum_{k=1}^\infty p_k < \infty$. What is the probability that there are infinitely many heads?

The event “there is at least one more head after the $i$th coin toss” is $\bigcup_{k=i}^\infty A_k$. There are infinitely many heads if and only if there are heads after arbitrarily many tosses, i.e. no matter how large $i$ is, there is still at least one more head after the $i$th toss.

So the probability required is
$$P\left(\bigcap_{i=1}^\infty\bigcup_{k=i}^\infty A_k\right) = \lim_{i\to\infty} P\left(\bigcup_{k=i}^\infty A_k\right) \leq \lim_{i\to\infty}\sum_{k=i}^\infty p_k = 0.$$
Therefore $P(\text{infinite number of heads}) = 0$.
Example (Erdős 1947). Is it possible to colour a complete $n$-graph (i.e. a graph of $n$ vertices with edges between every pair of vertices) red and black such that there is no $k$-vertex complete subgraph with monochrome edges?

Erdős said this is possible if
$$\binom{n}{k}2^{1 - \binom{k}{2}} < 1.$$
We colour edges randomly, and let $A_i = [i\text{th subgraph has monochrome edges}]$. Then the probability that at least one subgraph has monochrome edges is
$$P\left(\bigcup A_i\right) \leq \sum P(A_i) = \binom{n}{k}\cdot 2\cdot 2^{-\binom{k}{2}}.$$
The last expression is obtained since there are $\binom{n}{k}$ ways to choose a subgraph; a monochrome subgraph can be either red or black, hence the factor of 2; and the probability of getting all red (or all black) is $2^{-\binom{k}{2}}$.

If this probability is less than 1, then there must be a way to colour the edges in which no monochrome subgraph can be found, or else the probability would be 1. So if $\binom{n}{k}2^{1 - \binom{k}{2}} < 1$, the colouring is possible.
Theorem (Inclusion-exclusion formula).
$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i_1 < i_2} P(A_{i_1}\cap A_{i_2}) + \sum_{i_1 < i_2 < i_3} P(A_{i_1}\cap A_{i_2}\cap A_{i_3}) - \cdots + (-1)^{n-1}P(A_1\cap\cdots\cap A_n).$$
Proof. Perform induction on $n$. The case $n = 2$ is proven above. Then
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2 \cup \cdots \cup A_n) - P\left(\bigcup_{i=2}^n (A_1 \cap A_i)\right).$$
Then we can apply the induction hypothesis for $n - 1$ sets, and expand the mess. The details are very similar to those in IA Numbers and Sets.
Example. Let $1, 2, \ldots, n$ be randomly permuted to $\pi(1), \pi(2), \ldots, \pi(n)$. If $i \neq \pi(i)$ for all $i$, we say we have a derangement.

Let $A_i = [i = \pi(i)]$. Then
$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_k P(A_k) - \sum_{k_1 < k_2} P(A_{k_1}\cap A_{k_2}) + \cdots$$
$$= n\cdot\frac{1}{n} - \binom{n}{2}\frac{1}{n}\cdot\frac{1}{n-1} + \binom{n}{3}\frac{1}{n}\cdot\frac{1}{n-1}\cdot\frac{1}{n-2} - \cdots$$
$$= 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n-1}\frac{1}{n!} \to 1 - e^{-1}.$$
So the probability of a derangement is $1 - P\left(\bigcup A_k\right) \to e^{-1} \approx 0.368$.
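As a sanity check on the limit $e^{-1}$, here is a brute-force enumeration in Python (illustrative, not part of the notes):

```python
from itertools import permutations
from math import exp, factorial

# Exact derangement probability for small n, compared with the limit e^{-1}.
for n in (4, 6, 8):
    derangements = sum(
        1 for p in permutations(range(n)) if all(p[i] != i for i in range(n))
    )
    print(n, derangements / factorial(n), exp(-1))
```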
Recall that, from inclusion-exclusion,
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(BC) - P(AC) + P(ABC),$$
where $P(AB)$ is a shorthand for $P(A \cap B)$. If we only take the first three terms, then we get Boole's inequality
$$P(A \cup B \cup C) \leq P(A) + P(B) + P(C).$$
In general
Theorem (Bonferroni's inequalities). For any events $A_1, A_2, \ldots, A_n$ and $1 \leq r \leq n$, if $r$ is odd, then
$$P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i_1}P(A_{i_1}) - \sum_{i_1<i_2}P(A_{i_1}A_{i_2}) + \sum_{i_1<i_2<i_3}P(A_{i_1}A_{i_2}A_{i_3}) - \cdots + \sum_{i_1<i_2<\cdots<i_r}P(A_{i_1}A_{i_2}A_{i_3}\cdots A_{i_r}).$$
If $r$ is even, then
$$P\left(\bigcup_{i=1}^n A_i\right) \geq \sum_{i_1}P(A_{i_1}) - \sum_{i_1<i_2}P(A_{i_1}A_{i_2}) + \sum_{i_1<i_2<i_3}P(A_{i_1}A_{i_2}A_{i_3}) - \cdots - \sum_{i_1<i_2<\cdots<i_r}P(A_{i_1}A_{i_2}A_{i_3}\cdots A_{i_r}).$$
Proof. Easy induction on $n$.
Example. Let $\Omega = \{1, 2, \ldots, m\}$ and $1 \leq j, k \leq m$. Write $A_k = \{1, 2, \ldots, k\}$. Then
$$A_k \cap A_j = \{1, 2, \ldots, \min(j, k)\} = A_{\min(j,k)}$$
and
$$A_k \cup A_j = \{1, 2, \ldots, \max(j, k)\} = A_{\max(j,k)}.$$
We also have $P(A_k) = k/m$.

Now let $1 \leq x_1, \ldots, x_n \leq m$ be some numbers. Then Bonferroni's inequality says
$$P\left(\bigcup A_{x_i}\right) \geq \sum P(A_{x_i}) - \sum_{i<j}P(A_{x_i}\cap A_{x_j}).$$
So
$$\max\{x_1, x_2, \ldots, x_n\} \geq \sum x_i - \sum_{i_1<i_2}\min\{x_{i_1}, x_{i_2}\}.$$
2.3 Independence
Definition (Independent events). Two events A and B are independent if
$$P(A \cap B) = P(A)P(B).$$
Otherwise, they are said to be dependent.
Two events are independent if they are not related to each other. For example,
if you roll two dice separately, the outcomes will be independent.
Proposition. If $A$ and $B$ are independent, then $A$ and $B^C$ are independent.

Proof.
$$P(A \cap B^C) = P(A) - P(A \cap B) = P(A) - P(A)P(B) = P(A)(1 - P(B)) = P(A)P(B^C).$$
This definition applies to two events. What does it mean to say that three
or more events are independent?
Example. Roll two fair dice. Let $A_1$ and $A_2$ be the events that the first and second die are odd respectively. Let $A_3 = [\text{sum is odd}]$. The event probabilities are as follows:

Event                      Probability
$A_1$                      1/2
$A_2$                      1/2
$A_3$                      1/2
$A_1 \cap A_2$             1/4
$A_1 \cap A_3$             1/4
$A_2 \cap A_3$             1/4
$A_1 \cap A_2 \cap A_3$    0

We see that $A_1$ and $A_2$ are independent, $A_1$ and $A_3$ are independent, and $A_2$ and $A_3$ are independent. However, the collection of all three is not independent, since if $A_1$ and $A_2$ occur, then $A_3$ cannot possibly occur.
From the example above, we see that just because a set of events is pairwise
independent does not mean they are independent all together. We define:
Definition (Independence of multiple events). Events $A_1, A_2, \ldots$ are said to be mutually independent if
$$P(A_{i_1}\cap A_{i_2}\cap\cdots\cap A_{i_r}) = P(A_{i_1})P(A_{i_2})\cdots P(A_{i_r})$$
for any distinct $i_1, i_2, \ldots, i_r$ and $r \geq 2$.
Example. Let $A_{ij}$ be the event that dice $i$ and $j$ roll the same. We roll 4 dice. Then
$$P(A_{12}\cap A_{13}) = \frac{1}{6}\cdot\frac{1}{6} = \frac{1}{36} = P(A_{12})P(A_{13}).$$
But
$$P(A_{12}\cap A_{13}\cap A_{23}) = \frac{1}{36} \neq P(A_{12})P(A_{13})P(A_{23}).$$
So they are not mutually independent.
We can also apply this concept to experiments. Suppose we model two independent experiments with $\Omega_1 = \{\alpha_1, \alpha_2, \ldots\}$ and $\Omega_2 = \{\beta_1, \beta_2, \ldots\}$ with probabilities $P(\alpha_i) = p_i$ and $P(\beta_i) = q_i$. Further suppose that these two experiments are independent, i.e.
$$P((\alpha_i, \beta_j)) = p_i q_j$$
for all $i, j$. Then we can have a new sample space $\Omega = \Omega_1 \times \Omega_2$.

Now suppose $A \subseteq \Omega_1$ and $B \subseteq \Omega_2$ are results (i.e. events) of the two experiments. We can view them as subspaces of $\Omega$ by rewriting them as $A \times \Omega_2$ and $\Omega_1 \times B$. Then the probability
$$P(A \cap B) = \sum_{\alpha_i\in A,\,\beta_j\in B} p_i q_j = \sum_{\alpha_i\in A}p_i\sum_{\beta_j\in B}q_j = P(A)P(B).$$
So we say the two experiments are “independent” even though the term usually refers to different events in the same experiment. We can generalize this to $n$ independent experiments, or even countably infinitely many experiments.
2.4 Important discrete distributions
We’re now going to quickly go through a few important discrete probability
distributions. By discrete we mean the sample space is countable. The sample
space is $\Omega = \{\omega_1, \omega_2, \ldots\}$ and $p_i = P(\{\omega_i\})$.
Definition (Bernoulli distribution). Suppose we toss a coin. $\Omega = \{H, T\}$ and $p \in [0, 1]$. The Bernoulli distribution, denoted $B(1, p)$, has
$$P(H) = p; \quad P(T) = 1 - p.$$
Definition (Binomial distribution). Suppose we toss a coin $n$ times, each time with probability $p$ of getting heads. Then
$$P(HHTT\cdots T) = pp(1-p)\cdots(1-p).$$
So
$$P(\text{two heads}) = \binom{n}{2}p^2(1-p)^{n-2}.$$
In general,
$$P(k\text{ heads}) = \binom{n}{k}p^k(1-p)^{n-k}.$$
We call this the binomial distribution and write it as $B(n, p)$.
Definition (Geometric distribution). Suppose we toss a coin with probability $p$ of getting heads. The probability of having a head after $k$ consecutive tails is
$$p_k = (1-p)^k p.$$
This is the geometric distribution. We say it is memoryless because how many tails we've got in the past does not give us any information about how long we'll have to wait until we get a head.
Definition (Hypergeometric distribution). Suppose we have an urn with $n_1$ red balls and $n_2$ black balls. We choose $n$ balls. The probability that there are $k$ red balls is
$$P(k\text{ red}) = \frac{\binom{n_1}{k}\binom{n_2}{n-k}}{\binom{n_1+n_2}{n}}.$$
Definition (Poisson distribution). The Poisson distribution, denoted $P(\lambda)$, is
$$p_k = \frac{\lambda^k}{k!}e^{-\lambda} \quad \text{for } k \in \mathbb{N}.$$
What is this weird distribution? It is a distribution used to model rare events. Suppose that an event happens at a rate of $\lambda$. We can think of this as there being a lot of trials, say $n$ of them, and each has a probability $\lambda/n$ of succeeding. As we take the limit $n \to \infty$, we obtain the Poisson distribution.
Theorem (Poisson approximation to binomial). Suppose $n \to \infty$ and $p \to 0$ such that $np = \lambda$. Then
$$q_k = \binom{n}{k}p^k(1-p)^{n-k} \to \frac{\lambda^k}{k!}e^{-\lambda}.$$
Proof.
$$q_k = \binom{n}{k}p^k(1-p)^{n-k} = \frac{1}{k!}\cdot\frac{n(n-1)\cdots(n-k+1)}{n^k}(np)^k\left(1 - \frac{np}{n}\right)^{n-k} \to \frac{1}{k!}\lambda^k e^{-\lambda},$$
since $(1 - a/n)^n \to e^{-a}$.
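A short numerical illustration of the approximation (an added Python sketch with arbitrarily chosen parameters, not from the notes):

```python
from math import comb, exp, factorial

n, p = 1000, 0.003          # lambda = np = 3
lam = n * p
for k in range(6):
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)
    poisson = lam**k / factorial(k) * exp(-lam)
    print(k, round(binom, 5), round(poisson, 5))   # the two columns nearly agree
```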
2.5 Conditional probability
Definition (Conditional probability). Suppose $B$ is an event with $P(B) > 0$. For any event $A \subseteq \Omega$, the conditional probability of $A$ given $B$ is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
We interpret this as the probability of $A$ happening given that $B$ has happened.

Note that if $A$ and $B$ are independent, then
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).$$
Example. In a game of poker, let $A_i = [\text{player } i \text{ gets a royal flush}]$. Then
$$P(A_1) = 1.539 \times 10^{-6}$$
and
$$P(A_2 \mid A_1) = 1.969 \times 10^{-6}.$$
It is significantly bigger, albeit still incredibly tiny. So we say “good hands attract”.
If $P(A \mid B) > P(A)$, then we say that $B$ attracts $A$. Since
$$\frac{P(A \cap B)}{P(B)} > P(A) \iff \frac{P(A \cap B)}{P(A)} > P(B),$$
$A$ attracts $B$ if and only if $B$ attracts $A$. We can also say $A$ repels $B$ if $A$ attracts $B^C$.
Theorem.
(i) $P(A \cap B) = P(A \mid B)P(B)$.
(ii) $P(A \cap B \cap C) = P(A \mid B \cap C)P(B \mid C)P(C)$.
(iii) $P(A \mid B \cap C) = \dfrac{P(A \cap B \mid C)}{P(B \mid C)}$.
(iv) The function $P(\,\cdot \mid B)$ restricted to subsets of $B$ is a probability function (or measure).
Proof. Proofs of (i), (ii) and (iii) are trivial. So we only prove (iv). To prove this, we have to check the axioms.
(i) Let $A \subseteq B$. Then $P(A \mid B) = \frac{P(A \cap B)}{P(B)} \leq 1$.
(ii) $P(B \mid B) = \frac{P(B)}{P(B)} = 1$.
(iii) Let $A_i$ be disjoint events that are subsets of $B$. Then
$$P\left(\bigcup_i A_i \,\middle|\, B\right) = \frac{P(\bigcup_i A_i \cap B)}{P(B)} = \frac{P(\bigcup_i A_i)}{P(B)} = \sum_i\frac{P(A_i)}{P(B)} = \sum_i\frac{P(A_i \cap B)}{P(B)} = \sum_i P(A_i \mid B).$$
Definition (Partition). A partition of the sample space is a collection of disjoint events $\{B_i\}_{i=0}^\infty$ such that $\bigcup_i B_i = \Omega$.
For example, “odd” and “even” partition the sample space into two events.
The following result should be clear:
Proposition. If $B_i$ is a partition of the sample space, and $A$ is any event, then
$$P(A) = \sum_{i=1}^\infty P(A \cap B_i) = \sum_{i=1}^\infty P(A \mid B_i)P(B_i).$$
Example. A fair coin is tossed repeatedly. The gambler gets +1 for heads, and −1 for tails. Continue until he is broke or achieves $a. Let
$$p_x = P(\text{goes broke} \mid \text{starts with } \$x),$$
and let $B_1$ be the event that he gets a head on the first toss. Then
$$p_x = P(B_1)p_{x+1} + P(B_1^C)p_{x-1}, \qquad p_x = \frac{1}{2}p_{x+1} + \frac{1}{2}p_{x-1}.$$
We have two boundary conditions $p_0 = 1$, $p_a = 0$. Then solving the recurrence relation, we have
$$p_x = 1 - \frac{x}{a}.$$
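A quick Monte Carlo check of the ruin probability (an illustrative Python sketch with arbitrary $x$ and $a$, not from the notes):

```python
import random

def ruin_probability(x, a, trials=100_000):
    """Estimate P(go broke | start with x), target a, fair coin."""
    broke = 0
    for _ in range(trials):
        wealth = x
        while 0 < wealth < a:
            wealth += random.choice((+1, -1))
        broke += (wealth == 0)
    return broke / trials

x, a = 3, 10
print(ruin_probability(x, a), 1 - x / a)   # both approximately 0.7
```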
Theorem (Bayes' formula). Suppose $B_i$ is a partition of the sample space, and $A$ and the $B_i$ all have non-zero probability. Then for any $B_i$,
$$P(B_i \mid A) = \frac{P(A \mid B_i)P(B_i)}{\sum_j P(A \mid B_j)P(B_j)}.$$
Note that the denominator is simply $P(A)$ written in a fancy way.
Example (Screen test). Suppose we have a screening test that tests whether a patient has a particular disease. We denote positive and negative results as $+$ and $-$ respectively, and $D$ denotes the person having the disease. Suppose that the test is not absolutely accurate, and
$$P(+ \mid D) = 0.98, \quad P(+ \mid D^C) = 0.01, \quad P(D) = 0.001.$$
So what is the probability that a person has the disease given that he received a positive result?
$$P(D \mid +) = \frac{P(+ \mid D)P(D)}{P(+ \mid D)P(D) + P(+ \mid D^C)P(D^C)} = \frac{0.98\cdot 0.001}{0.98\cdot 0.001 + 0.01\cdot 0.999} = 0.09.$$
So this test is pretty useless. Even if you get a positive result, since the disease is so rare, it is more likely that you don't have the disease and got a false positive.
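The same calculation in a few lines of Python (added for illustration, not part of the lectures):

```python
# Posterior probability of disease given a positive test, by Bayes' formula.
p_pos_given_d = 0.98
p_pos_given_not_d = 0.01
p_d = 0.001

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
print(p_pos_given_d * p_d / p_pos)   # ~0.089
```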
Example. Consider the two following cases:
(i) I have 2 children, one of whom is a boy.
(ii) I have 2 children, one of whom is a son born on a Tuesday.

What is the probability that both of them are boys?
(i) $P(BB \mid BB \cup BG) = \dfrac{1/4}{1/4 + 2/4} = \dfrac{1}{3}$.
(ii) Let $B^*$ denote a boy born on a Tuesday, and $B$ a boy not born on a Tuesday. Then
$$P(B^*B^* \cup B^*B \mid B^*B^* \cup B^*B \cup B^*G) = \frac{\frac{1}{14}\cdot\frac{1}{14} + 2\cdot\frac{1}{14}\cdot\frac{6}{14}}{\frac{1}{14}\cdot\frac{1}{14} + 2\cdot\frac{1}{14}\cdot\frac{6}{14} + 2\cdot\frac{1}{14}\cdot\frac{1}{2}} = \frac{13}{27}.$$
How can we understand this? It is much easier to have a boy born on a Tuesday
if you have two boys than one boy. So if we have the information that a boy
is born on a Tuesday, it is now less likely that there is just one boy. In other
words, it is more likely that there are two boys.
3 Discrete random variables
With what we’ve got so far, we are able to answer questions like “what is the
probability of getting a heads?” or “what is the probability of getting 10 heads
in a row?”. However, we cannot answer questions like “what do we expect to
get on average?”. What does it even mean to take the average of a “heads” and
a “tail”?
To make some sense of this notion, we have to assign, to each outcome, a
number. For example, we can let “heads” correspond to 1 and “tails” correspond to 0.
Then on average, we can expect to get 0.5. This is the idea of a random variable.
3.1 Discrete random variables
Definition (Random variable). A random variable $X$ taking values in a set $\Omega_X$ is a function $X : \Omega \to \Omega_X$. $\Omega_X$ is usually a set of numbers, e.g. $\mathbb{R}$ or $\mathbb{N}$.

Intuitively, a random variable assigns a “number” (or a thing in $\Omega_X$) to each outcome (e.g. assign 6 to the outcome “dice roll gives 6”).
Definition (Discrete random variables). A random variable is discrete if $\Omega_X$ is finite or countably infinite.

Notation. Let $T \subseteq \Omega_X$. Define
$$P(X \in T) = P(\{\omega : X(\omega) \in T\}),$$
i.e. the probability that the outcome is in $T$.

Here, instead of talking about the probability of getting a particular outcome or event, we are concerned with the probability of a random variable taking a particular value. If $\Omega$ is itself countable, then we can write this as
$$P(X \in T) = \sum_{\omega\in\Omega : X(\omega)\in T} p_\omega.$$
Example. Let $X$ be the value shown by rolling a fair die. Then $\Omega_X = \{1, 2, 3, 4, 5, 6\}$. We know that
$$P(X = i) = \frac{1}{6}.$$
We call this the discrete uniform distribution.
Definition (Discrete uniform distribution). A discrete uniform distribution
is a discrete distribution with finitely many possible outcomes, in which each
outcome is equally likely.
Example. Suppose we roll two dice, and let the values obtained be $X$ and $Y$. Then the sum can be represented by $X + Y$, with $\Omega_{X+Y} = \{2, 3, \ldots, 12\}$. This shows that we can add random variables to get a new random variable.
Notation. We write
$$P_X(x) = P(X = x).$$
We can also write $X \sim B(n, p)$ to mean
$$P(X = r) = \binom{n}{r}p^r(1-p)^{n-r},$$
and similarly for the other distributions we have come up with before.
Definition (Expectation). The expectation (or mean) of a real-valued random variable $X$ is
$$E[X] = \sum_{\omega} p_\omega X(\omega),$$
provided this is absolutely convergent. Otherwise, we say the expectation doesn't exist. Alternatively,
$$E[X] = \sum_{\omega} p_\omega X(\omega) = \sum_{x\in\Omega_X}\sum_{\omega : X(\omega)=x} p_\omega X(\omega) = \sum_{x\in\Omega_X} x\sum_{\omega : X(\omega)=x} p_\omega = \sum_{x\in\Omega_X} x P(X = x).$$
We are sometimes lazy and just write $EX$.
This is the “average” value of
X
we expect to get. Note that this definition
only holds in the case where the sample space is countable. If is continuous
(e.g. the whole of R), then we have to define the expectation as an integral.
Example. Let $X$ be the sum of the outcomes of two dice. Then
$$E[X] = 2\cdot\frac{1}{36} + 3\cdot\frac{2}{36} + \cdots + 12\cdot\frac{1}{36} = 7.$$
Note that
E
[
X
] can be non-existent if the sum is not absolutely convergent.
However, it is possible for the expected value to be infinite:
Example (St. Petersburg paradox). Suppose we play a game in which we keep tossing a coin until you get a tail. If you get a tail on the $i$th round, then I pay you $\$2^i$. The expected value is
$$E[X] = \frac{1}{2}\cdot 2 + \frac{1}{4}\cdot 4 + \frac{1}{8}\cdot 8 + \cdots = \infty.$$
This means that on average, you can expect to get an infinite amount of money! In real life, though, people would hardly be willing to pay $20 to play this game. There are many ways to resolve this paradox, such as taking into account the fact that the host of the game has only finitely much money and thus your real expected gain is much smaller.
Example. We calculate the expected values of different distributions:
(i) Poisson $P(\lambda)$. Let $X \sim P(\lambda)$. Then
$$P_X(r) = \frac{\lambda^r e^{-\lambda}}{r!}.$$
So
$$E[X] = \sum_{r=0}^\infty rP(X = r) = \sum_{r=0}^\infty \frac{r\lambda^r e^{-\lambda}}{r!} = \sum_{r=1}^\infty \frac{\lambda\cdot\lambda^{r-1}e^{-\lambda}}{(r-1)!} = \lambda\sum_{r=0}^\infty \frac{\lambda^r e^{-\lambda}}{r!} = \lambda.$$
(ii) Binomial. Let $X \sim B(n, p)$. Then
$$E[X] = \sum_{r=0}^n rP(X = r) = \sum_{r=0}^n r\binom{n}{r}p^r(1-p)^{n-r} = \sum_{r=0}^n r\frac{n!}{r!(n-r)!}p^r(1-p)^{n-r}$$
$$= np\sum_{r=1}^n \frac{(n-1)!}{(r-1)![(n-1)-(r-1)]!}p^{r-1}(1-p)^{(n-1)-(r-1)} = np\sum_{r=0}^{n-1}\binom{n-1}{r}p^r(1-p)^{n-1-r} = np.$$
Given a random variable $X$, we can create new random variables such as $X + 3$ or $X^2$. Formally, let $f : \mathbb{R} \to \mathbb{R}$ and let $X$ be a real-valued random variable. Then $f(X)$ is a new random variable that maps $\omega \mapsto f(X(\omega))$.

Example. If $a, b, c$ are constants, then $a + bX$ and $(X - c)^2$ are random variables, defined as
$$(a + bX)(\omega) = a + bX(\omega), \qquad (X - c)^2(\omega) = (X(\omega) - c)^2.$$
Theorem.
(i) If $X \geq 0$, then $E[X] \geq 0$.
(ii) If $X \geq 0$ and $E[X] = 0$, then $P(X = 0) = 1$.
(iii) If $a$ and $b$ are constants, then $E[a + bX] = a + bE[X]$.
(iv) If $X$ and $Y$ are random variables, then $E[X + Y] = E[X] + E[Y]$. This is true even if $X$ and $Y$ are not independent.
(v) $E[X]$ is the constant that minimizes $E[(X - c)^2]$ over $c$.
Proof.
(i) $X \geq 0$ means that $X(\omega) \geq 0$ for all $\omega$. Then
$$E[X] = \sum_\omega p_\omega X(\omega) \geq 0.$$
(ii) If there exists $\omega$ such that $X(\omega) > 0$ and $p_\omega > 0$, then $E[X] > 0$. So $X(\omega) = 0$ for all $\omega$ with $p_\omega > 0$, i.e. $P(X = 0) = 1$.
(iii)
$$E[a + bX] = \sum_\omega (a + bX(\omega))p_\omega = a + b\sum_\omega p_\omega X(\omega) = a + bE[X].$$
(iv)
$$E[X + Y] = \sum_\omega p_\omega[X(\omega) + Y(\omega)] = \sum_\omega p_\omega X(\omega) + \sum_\omega p_\omega Y(\omega) = E[X] + E[Y].$$
(v)
$$E[(X - c)^2] = E[(X - E[X] + E[X] - c)^2] = E[(X - E[X])^2 + 2(E[X] - c)(X - E[X]) + (E[X] - c)^2] = E[(X - E[X])^2] + 0 + (E[X] - c)^2.$$
This is clearly minimized when $c = E[X]$. Note that we obtained the zero in the middle because $E[X - E[X]] = E[X] - E[X] = 0$.
An easy generalization of (iv) above is

Theorem. For any random variables $X_1, X_2, \ldots, X_n$ for which the following expectations exist,
$$E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i].$$
Proof.
$$\sum_\omega p(\omega)[X_1(\omega) + \cdots + X_n(\omega)] = \sum_\omega p(\omega)X_1(\omega) + \cdots + \sum_\omega p(\omega)X_n(\omega).$$
Definition (Variance and standard deviation). The variance of a random variable $X$ is defined as
$$\mathrm{var}(X) = E[(X - E[X])^2].$$
The standard deviation is the square root of the variance, $\sqrt{\mathrm{var}(X)}$.
This is a measure of how “dispersed” the random variable
X
is. If we have a
low variance, then the value of X is very likely to be close to E[X].
Theorem.
(i) $\mathrm{var}\,X \geq 0$. If $\mathrm{var}\,X = 0$, then $P(X = E[X]) = 1$.
(ii) $\mathrm{var}(a + bX) = b^2\,\mathrm{var}(X)$. This can be proved by expanding the definition and using the linearity of the expected value.
(iii) $\mathrm{var}(X) = E[X^2] - E[X]^2$, also proven by expanding the definition.
Example (Binomial distribution). Let $X \sim B(n, p)$ be a binomial random variable. Then $E[X] = np$. We also have
$$E[X(X-1)] = \sum_{r=0}^n r(r-1)\frac{n!}{r!(n-r)!}p^r(1-p)^{n-r} = n(n-1)p^2\sum_{r=2}^n\binom{n-2}{r-2}p^{r-2}(1-p)^{(n-2)-(r-2)} = n(n-1)p^2.$$
The sum is 1 since it is the sum of all probabilities of a binomial $B(n-2, p)$. So $E[X^2] = n(n-1)p^2 + E[X] = n(n-1)p^2 + np$. So
$$\mathrm{var}(X) = E[X^2] - (E[X])^2 = np(1-p) = npq.$$
Example (Poisson distribution). If $X \sim P(\lambda)$, then $E[X] = \lambda$ and $\mathrm{var}(X) = \lambda$, since $P(\lambda)$ is the limit of $B(n, p)$ with $n \to \infty$, $p \to 0$, $np \to \lambda$.
Example (Geometric distribution). Suppose $P(X = r) = q^r p$ for $r = 0, 1, 2, \ldots$. Then
$$E[X] = \sum_{r=0}^\infty rpq^r = pq\sum_{r=0}^\infty rq^{r-1} = pq\sum_{r=0}^\infty\frac{d}{dq}q^r = pq\frac{d}{dq}\sum_{r=0}^\infty q^r = pq\frac{d}{dq}\frac{1}{1-q} = \frac{pq}{(1-q)^2} = \frac{q}{p}.$$
Then
$$E[X(X-1)] = \sum_{r=0}^\infty r(r-1)pq^r = pq^2\sum_{r=0}^\infty r(r-1)q^{r-2} = pq^2\frac{d^2}{dq^2}\frac{1}{1-q} = \frac{2pq^2}{(1-q)^3}.$$
So the variance is
$$\mathrm{var}(X) = \frac{2pq^2}{(1-q)^3} + \frac{q}{p} - \frac{q^2}{p^2} = \frac{q}{p^2}.$$
Definition (Indicator function). The indicator function or indicator variable $I[A]$ (or $I_A$) of an event $A$ is
$$I[A](\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A \end{cases}$$
This indicator random variable is not interesting by itself. However, it is a rather useful tool to prove results.

It has the following properties:
Proposition.
– $E[I[A]] = \sum_\omega p(\omega)I[A](\omega) = P(A)$.
– $I[A^C] = 1 - I[A]$.
– $I[A \cap B] = I[A]I[B]$.
– $I[A \cup B] = I[A] + I[B] - I[A]I[B]$.
– $I[A]^2 = I[A]$.

These are easy to prove from the definition. In particular, the last property comes from the fact that $I[A]$ is either 0 or 1, and $0^2 = 0$, $1^2 = 1$.
Example. Let $2n$ people ($n$ husbands and $n$ wives, with $n > 2$) sit alternately man-woman around a table at random. Let $N$ be the number of couples sitting next to each other.

Let $A_i = [i\text{th couple sits together}]$. Then
$$N = \sum_{i=1}^n I[A_i].$$
Then
$$E[N] = E\left[\sum I[A_i]\right] = \sum_{i=1}^n E[I[A_i]] = nE[I[A_1]] = nP(A_1) = n\cdot\frac{2}{n} = 2.$$
We also have
$$E[N^2] = E\left[\left(\sum I[A_i]\right)^2\right] = E\left[\sum_i I[A_i]^2 + 2\sum_{i<j}I[A_i]I[A_j]\right] = nE[I[A_1]] + n(n-1)E[I[A_1]I[A_2]].$$
We have $E[I[A_1]I[A_2]] = P(A_1 \cap A_2) = \frac{2}{n}\left(\frac{1}{n-1}\cdot\frac{1}{n-1} + \frac{n-2}{n-1}\cdot\frac{2}{n-1}\right)$. Plugging in, we ultimately obtain
$$\mathrm{var}(N) = \frac{2(n-2)}{n-1}.$$
In fact, as $n \to \infty$, $N \to P(2)$.
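The value $E[N] = 2$ is easy to check by simulation (an added Python sketch; the seating model — men and women assigned independently at random to alternating seats around a circle — is an assumption matching the setup above):

```python
import random

def simulate_couples(n, trials=20_000):
    """Average number of couples adjacent when n couples sit alternately at random."""
    total = 0
    for _ in range(trials):
        men = list(range(n))
        women = list(range(n))
        random.shuffle(men)
        random.shuffle(women)
        # Woman in woman-seat i sits between the men in man-seats i and i+1 (mod n).
        couples = sum(
            1 for i in range(n)
            if women[i] == men[i] or women[i] == men[(i + 1) % n]
        )
        total += couples
    return total / trials

print(simulate_couples(10))   # close to E[N] = 2
```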
We can use these to prove the inclusion-exclusion formula:
Theorem (Inclusion-exclusion formula).
$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i_1<i_2}P(A_{i_1}\cap A_{i_2}) + \sum_{i_1<i_2<i_3}P(A_{i_1}\cap A_{i_2}\cap A_{i_3}) - \cdots + (-1)^{n-1}P(A_1\cap\cdots\cap A_n).$$
Proof. Let $I_j$ be the indicator function for $A_j$. Write
$$S_r = \sum_{i_1<i_2<\cdots<i_r} I_{i_1}I_{i_2}\cdots I_{i_r},$$
and
$$s_r = E[S_r] = \sum_{i_1<\cdots<i_r} P(A_{i_1}\cap\cdots\cap A_{i_r}).$$
Then
$$1 - \prod_{j=1}^n(1 - I_j) = S_1 - S_2 + S_3 - \cdots + (-1)^{n-1}S_n.$$
So
$$P\left(\bigcup_{j=1}^n A_j\right) = E\left[1 - \prod_{j=1}^n(1 - I_j)\right] = s_1 - s_2 + s_3 - \cdots + (-1)^{n-1}s_n.$$
We can extend the idea of independence to random variables. Two random
variables are independent if the value of the first does not affect the value of the
second.
Definition (Independent random variables). Let $X_1, X_2, \ldots, X_n$ be discrete random variables. They are independent iff for any $x_1, x_2, \ldots, x_n$,
$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1)\cdots P(X_n = x_n).$$
Theorem. If $X_1, \ldots, X_n$ are independent random variables, and $f_1, \ldots, f_n$ are functions $\mathbb{R} \to \mathbb{R}$, then $f_1(X_1), \ldots, f_n(X_n)$ are independent random variables.

Proof. Note that given a particular $y_i$, there can be many different $x_i$ for which $f_i(x_i) = y_i$. When finding $P(f_i(X_i) = y_i)$, we need to sum over all $x_i$ such that $f_i(x_i) = y_i$. Then
$$P(f_1(X_1) = y_1, \ldots, f_n(X_n) = y_n) = \sum_{x_1:f_1(x_1)=y_1}\cdots\sum_{x_n:f_n(x_n)=y_n} P(X_1 = x_1, \ldots, X_n = x_n)$$
$$= \sum_{x_1:f_1(x_1)=y_1}\cdots\sum_{x_n:f_n(x_n)=y_n}\prod_{i=1}^n P(X_i = x_i) = \prod_{i=1}^n\sum_{x_i:f_i(x_i)=y_i}P(X_i = x_i) = \prod_{i=1}^n P(f_i(X_i) = y_i).$$
Note that the switch from the second to third line is valid since they both expand to the same mess.
Theorem. If $X_1, \ldots, X_n$ are independent random variables and all the following expectations exist, then
$$E\left[\prod X_i\right] = \prod E[X_i].$$
Proof. Write $R_i$ for the range of $X_i$. Then
$$E\left[\prod_{i=1}^n X_i\right] = \sum_{x_1\in R_1}\cdots\sum_{x_n\in R_n} x_1x_2\cdots x_n \times P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^n\sum_{x_i\in R_i} x_iP(X_i = x_i) = \prod_{i=1}^n E[X_i].$$
Corollary. Let $X_1, \ldots, X_n$ be independent random variables, and $f_1, f_2, \ldots, f_n$ functions $\mathbb{R} \to \mathbb{R}$. Then
$$E\left[\prod f_i(X_i)\right] = \prod E[f_i(X_i)].$$
Theorem. If $X_1, X_2, \ldots, X_n$ are independent random variables, then
$$\mathrm{var}\left(\sum X_i\right) = \sum\mathrm{var}(X_i).$$
Proof.
$$\mathrm{var}\left(\sum X_i\right) = E\left[\left(\sum X_i\right)^2\right] - \left(E\left[\sum X_i\right]\right)^2 = E\left[\sum X_i^2 + \sum_{i\neq j}X_iX_j\right] - \left(\sum E[X_i]\right)^2$$
$$= \sum E[X_i^2] + \sum_{i\neq j}E[X_i]E[X_j] - \sum(E[X_i])^2 - \sum_{i\neq j}E[X_i]E[X_j] = \sum\left(E[X_i^2] - (E[X_i])^2\right).$$
Corollary. Let $X_1, X_2, \ldots, X_n$ be independent identically distributed random variables (iid rvs). Then
$$\mathrm{var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n}\mathrm{var}(X_1).$$
Proof.
$$\mathrm{var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2}\mathrm{var}\left(\sum X_i\right) = \frac{1}{n^2}\sum\mathrm{var}(X_i) = \frac{1}{n^2}\cdot n\,\mathrm{var}(X_1) = \frac{1}{n}\mathrm{var}(X_1).$$
This result is important in statistics. This means that if we want to reduce
the variance of our experimental results, then we can repeat the experiment
many times (corresponding to a large
n
), and then the sample average will have
a small variance.
Example. Let $X_i$ be iid $B(1, p)$, i.e. $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$. Then $Y = X_1 + X_2 + \cdots + X_n \sim B(n, p)$.

Since $\mathrm{var}(X_i) = E[X_i^2] - (E[X_i])^2 = p - p^2 = p(1 - p)$, we have $\mathrm{var}(Y) = np(1 - p)$.
Example. Suppose we have two rods of unknown lengths $a, b$. We can measure the lengths, but the measurements are not accurate. Let $A$ and $B$ be the measured values. Suppose
$$E[A] = a,\ \mathrm{var}(A) = \sigma^2, \qquad E[B] = b,\ \mathrm{var}(B) = \sigma^2.$$
We can measure more accurately by measuring $X = A + B$ and $Y = A - B$. Then we estimate $a$ and $b$ by
$$\hat{a} = \frac{X + Y}{2}, \qquad \hat{b} = \frac{X - Y}{2}.$$
Then $E[\hat{a}] = a$ and $E[\hat{b}] = b$, i.e. they are unbiased. Also
$$\mathrm{var}(\hat{a}) = \frac{1}{4}\mathrm{var}(X + Y) = \frac{1}{4}\cdot 2\sigma^2 = \frac{1}{2}\sigma^2,$$
and similarly for $\hat{b}$. So we can measure more accurately by measuring the sticks together instead of separately.
3.2 Inequalities
Here we prove a lot of different inequalities which may be useful for certain
calculations. In particular, Chebyshev’s inequality will allow us to prove the
weak law of large numbers.
Definition (Convex function). A function $f : (a, b) \to \mathbb{R}$ is convex if for all $x_1, x_2 \in (a, b)$ and $\lambda_1, \lambda_2 \geq 0$ such that $\lambda_1 + \lambda_2 = 1$,
$$\lambda_1 f(x_1) + \lambda_2 f(x_2) \geq f(\lambda_1 x_1 + \lambda_2 x_2).$$
It is strictly convex if the inequality above is strict (except when $x_1 = x_2$ or $\lambda_1$ or $\lambda_2 = 0$).

[figure: a convex curve, with the chord value $\lambda_1 f(x_1) + \lambda_2 f(x_2)$ lying above $f(\lambda_1 x_1 + \lambda_2 x_2)$]

A function is concave if $-f$ is convex.
A useful criterion for convexity is
Proposition. If $f$ is differentiable and $f''(x) \geq 0$ for all $x \in (a, b)$, then it is convex. It is strictly convex if $f''(x) > 0$.
Theorem (Jensen's inequality). If $f : (a, b) \to \mathbb{R}$ is convex, then
$$\sum_{i=1}^n p_i f(x_i) \geq f\left(\sum_{i=1}^n p_i x_i\right)$$
for all $p_1, p_2, \ldots, p_n$ such that $p_i \geq 0$ and $\sum p_i = 1$, and $x_i \in (a, b)$.

This says that $E[f(X)] \geq f(E[X])$ (where $P(X = x_i) = p_i$).

If $f$ is strictly convex, then equality holds only if all the $x_i$ are equal, i.e. $X$ takes only one possible value.
Proof. Induct on $n$. It is true for $n = 2$ by the definition of convexity. Then
$$f(p_1x_1 + \cdots + p_nx_n) = f\left(p_1x_1 + (p_2 + \cdots + p_n)\frac{p_2x_2 + \cdots + p_nx_n}{p_2 + \cdots + p_n}\right)$$
$$\leq p_1f(x_1) + (p_2 + \cdots + p_n)f\left(\frac{p_2x_2 + \cdots + p_nx_n}{p_2 + \cdots + p_n}\right)$$
$$\leq p_1f(x_1) + (p_2 + \cdots + p_n)\left[\frac{p_2}{(\,)}f(x_2) + \cdots + \frac{p_n}{(\,)}f(x_n)\right] = p_1f(x_1) + \cdots + p_nf(x_n),$$
where $(\,)$ is $p_2 + \cdots + p_n$.

The strictly convex case is proved with $\leq$ replaced by $<$ by the definition of strict convexity.
Corollary (AM-GM inequality). Given positive reals $x_1, \ldots, x_n$,
$$\left(\prod x_i\right)^{1/n} \leq \frac{1}{n}\sum x_i.$$
Proof. Take $f(x) = -\log x$. This is convex since its second derivative is $x^{-2} > 0$.

Take $P(X = x_i) = 1/n$. Then
$$E[f(X)] = -\frac{1}{n}\sum\log x_i = -\log\mathrm{GM}$$
and
$$f(E[X]) = -\log\left(\frac{1}{n}\sum x_i\right) = -\log\mathrm{AM}.$$
Since $f(E[X]) \leq E[f(X)]$, AM $\geq$ GM. Since $-\log x$ is strictly convex, AM $=$ GM only if all the $x_i$ are equal.
Theorem (Cauchy-Schwarz inequality). For any two random variables $X, Y$,
$$(E[XY])^2 \leq E[X^2]E[Y^2].$$
Proof. If $Y = 0$, then both sides are 0. Otherwise, $E[Y^2] > 0$. Let
$$w = X - Y\cdot\frac{E[XY]}{E[Y^2]}.$$
Then
$$E[w^2] = E\left[X^2 - 2XY\frac{E[XY]}{E[Y^2]} + Y^2\frac{(E[XY])^2}{(E[Y^2])^2}\right] = E[X^2] - 2\frac{(E[XY])^2}{E[Y^2]} + \frac{(E[XY])^2}{E[Y^2]} = E[X^2] - \frac{(E[XY])^2}{E[Y^2]}.$$
Since $E[w^2] \geq 0$, the Cauchy-Schwarz inequality follows.
Theorem (Markov inequality). If $X$ is a random variable with $E|X| < \infty$ and $\varepsilon > 0$, then
$$P(|X| \geq \varepsilon) \leq \frac{E|X|}{\varepsilon}.$$
Proof. We make use of the indicator function. We have
$$I[|X| \geq \varepsilon] \leq \frac{|X|}{\varepsilon}.$$
This is proved by exhaustion: if $|X| \geq \varepsilon$, then LHS $= 1$ and RHS $\geq 1$; if $|X| < \varepsilon$, then LHS $= 0$ and RHS is non-negative.

Take the expected value to obtain
$$P(|X| \geq \varepsilon) \leq \frac{E|X|}{\varepsilon}.$$
Similarly, we have

Theorem (Chebyshev inequality). If $X$ is a random variable with $E[X^2] < \infty$ and $\varepsilon > 0$, then
$$P(|X| \geq \varepsilon) \leq \frac{E[X^2]}{\varepsilon^2}.$$
Proof. Again, we have
$$I[|X| \geq \varepsilon] \leq \frac{X^2}{\varepsilon^2}.$$
Then take the expected value and the result follows.

Note that these are really powerful results, since they do not make any assumptions about the distribution of $X$. On the other hand, if we know something about the distribution, we can often get a better bound.

An important corollary is that if $\mu = E[X]$, then
$$P(|X - \mu| \geq \varepsilon) \leq \frac{E[(X - \mu)^2]}{\varepsilon^2} = \frac{\mathrm{var}\,X}{\varepsilon^2}.$$
3.3 Weak law of large numbers
Theorem (Weak law of large numbers). Let $X_1, X_2, \ldots$ be iid random variables, with mean $\mu$ and variance $\sigma^2$. Let $S_n = \sum_{i=1}^n X_i$. Then for all $\varepsilon > 0$,
$$P\left(\left|\frac{S_n}{n} - \mu\right| \geq \varepsilon\right) \to 0$$
as $n \to \infty$. We say that $\frac{S_n}{n}$ tends to $\mu$ (in probability), written $\frac{S_n}{n} \to_p \mu$.

Proof. By Chebyshev,
$$P\left(\left|\frac{S_n}{n} - \mu\right| \geq \varepsilon\right) \leq \frac{E\left[\left(\frac{S_n}{n} - \mu\right)^2\right]}{\varepsilon^2} = \frac{1}{n^2}\frac{E[(S_n - n\mu)^2]}{\varepsilon^2} = \frac{1}{n^2\varepsilon^2}\mathrm{var}(S_n) = \frac{n}{n^2\varepsilon^2}\mathrm{var}(X_1) = \frac{\sigma^2}{n\varepsilon^2} \to 0.$$
Note that we cannot relax the “independent” condition. For example, if $X_1 = X_2 = X_3 = \cdots = 1$ or $0$, each with probability $1/2$, then $S_n/n \not\to 1/2$, since it is always either 1 or 0.
Example. Suppose we toss a coin with probability $p$ of heads. Then
$$\frac{S_n}{n} = \frac{\text{number of heads}}{\text{number of tosses}}.$$
Since $E[X_i] = p$, the weak law of large numbers tells us that
$$\frac{S_n}{n} \to_p p.$$
This means that as we toss more and more coins, the proportion of heads will tend towards $p$.
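A tiny simulation of this convergence (an illustrative Python sketch with an arbitrarily chosen $p$, not from the notes):

```python
import random

p = 0.3
for n in (10, 100, 1_000, 10_000):
    heads = sum(random.random() < p for _ in range(n))
    print(n, heads / n)   # the proportions approach p = 0.3 as n grows
```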
Since we called the above the weak law, we also have the strong law, which
is a stronger statement.
Theorem (Strong law of large numbers).
$$P\left(\frac{S_n}{n} \to \mu \text{ as } n \to \infty\right) = 1.$$
We say
$$\frac{S_n}{n} \to_{\mathrm{as}} \mu,$$
where “as” means “almost surely”.

It can be shown that the weak law follows from the strong law, but not the other way round. The proof is left for Part II because it is too hard.
3.4 Multiple random variables
If we have two random variables, we can study the relationship between them.
Definition (Covariance). Given two random variables $X, Y$, the covariance is
$$\mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])].$$
Proposition.
(i) cov(X, c) = 0 for constant c.
(ii) cov(X + c, Y ) = cov(X, Y ).
(iii) cov(X, Y ) = cov(Y, X).
(iv) cov(X, Y ) = E[XY ] E[X]E[Y ].
(v) cov(X, X) = var(X).
(vi) var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y ).
(vii) If X, Y are independent, cov(X, Y ) = 0.
These are all trivial to prove and the proofs are omitted.

It is important to note that $\mathrm{cov}(X, Y) = 0$ does not imply that $X$ and $Y$ are independent.
Example.
– Let $(X, Y) = (2, 0)$, $(-1, -1)$ or $(-1, 1)$ with equal probabilities of $1/3$. These are not independent, since $Y = 0 \Rightarrow X = 2$. However, $\mathrm{cov}(X, Y) = E[XY] - E[X]E[Y] = 0 - 0\cdot 0 = 0$.
– If we randomly pick a point on the unit circle and let its coordinates be $(X, Y)$, then $E[X] = E[Y] = E[XY] = 0$ by symmetry. So $\mathrm{cov}(X, Y) = 0$, but $X$ and $Y$ are clearly not independent (they have to satisfy $x^2 + y^2 = 1$).
The covariance is not that useful in measuring how well two variables correlate.
For one, the covariance can (potentially) have dimensions, which means that the
numerical value of the covariance can depend on what units we are using. Also,
the magnitude of the covariance depends largely on the variances of $X$ and $Y$ themselves. To solve these problems, we define
Definition (Correlation coefficient). The correlation coefficient of $X$ and $Y$ is
$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}.$$
Proposition. $|\mathrm{corr}(X, Y)| \leq 1$.
Proof. Apply Cauchy-Schwarz to $X - E[X]$ and $Y - E[Y]$.
Again, zero correlation does not necessarily imply independence.
Alternatively, apart from finding a fixed covariance or correlation number, we can see how the distribution of $X$ depends on $Y$. Given two random variables $X, Y$, the function $P(X = x, Y = y)$ is known as the joint distribution. From this joint distribution, we can retrieve the probabilities $P(X = x)$ and $P(Y = y)$. We can also consider different conditional expectations.
Definition (Conditional distribution). Let $X$ and $Y$ be random variables (in general not independent) with joint distribution $P(X = x, Y = y)$. Then the marginal distribution (or simply distribution) of $X$ is
$$P(X = x) = \sum_y P(X = x, Y = y).$$
The conditional distribution of $X$ given $Y$ is
$$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}.$$
The conditional expectation of $X$ given $Y$ is
$$E[X \mid Y = y] = \sum_x xP(X = x \mid Y = y).$$
We can view $E[X \mid Y]$ as a random variable in $Y$: given a value of $Y$, we return the expectation of $X$.
Example. Consider a dice roll. Let $Y = 1$ denote an even roll and $Y = 0$ denote an odd roll. Let $X$ be the value of the roll. Then $E[X \mid Y] = 3 + Y$, i.e. 4 if even, 3 if odd.
Example. Let $X_1, \ldots, X_n$ be iid $B(1, p)$. Let $Y = X_1 + \cdots + X_n$. Then
$$P(X_1 = 1 \mid Y = r) = \frac{P(X_1 = 1, \sum_{i=2}^n X_i = r - 1)}{P(Y = r)} = \frac{p\binom{n-1}{r-1}p^{r-1}(1-p)^{(n-1)-(r-1)}}{\binom{n}{r}p^r(1-p)^{n-r}} = \frac{r}{n}.$$
So
$$E[X_1 \mid Y = r] = 1\cdot\frac{r}{n} + 0\cdot\left(1 - \frac{r}{n}\right) = \frac{r}{n}, \quad\text{i.e. } E[X_1 \mid Y] = \frac{Y}{n}.$$
Note that this is a random variable!
Theorem. If $X$ and $Y$ are independent, then
$$E[X \mid Y] = E[X].$$
Proof.
$$E[X \mid Y = y] = \sum_x xP(X = x \mid Y = y) = \sum_x xP(X = x) = E[X].$$
We know that the expected value of a dice roll given it is even is 4, and the
expected value given it is odd is 3. Since it is equally likely to be even or odd,
the expected value of the dice roll is 3.5. This is formally captured by
Theorem (Tower property of conditional expectation).
$$E_Y[E_X[X \mid Y]] = E_X[X],$$
where the subscripts indicate what variable the expectation is taken over.

Proof.
$$E_Y[E_X[X \mid Y]] = \sum_y P(Y = y)E[X \mid Y = y] = \sum_y P(Y = y)\sum_x xP(X = x \mid Y = y) = \sum_x\sum_y xP(X = x, Y = y)$$
$$= \sum_x x\sum_y P(X = x, Y = y) = \sum_x xP(X = x) = E[X].$$
This is also called the law of total expectation. We can also state it as: suppose $A_1, A_2, \ldots, A_n$ is a partition of $\Omega$. Then
$$E[X] = \sum_{i : P(A_i) > 0} E[X \mid A_i]P(A_i).$$
3.5 Probability generating functions
Consider a random variable $X$ taking values $0, 1, 2, \ldots$. Let $p_r = P(X = r)$.

Definition (Probability generating function (pgf)). The probability generating function (pgf) of $X$ is
$$p(z) = E[z^X] = \sum_{r=0}^\infty P(X = r)z^r = p_0 + p_1z + p_2z^2 + \cdots = \sum_{r=0}^\infty p_rz^r.$$
This is a power series (or polynomial), and converges if $|z| \leq 1$, since
$$|p(z)| \leq \sum_r p_r|z^r| \leq \sum_r p_r = 1.$$
We sometimes write it as $p_X(z)$ to indicate which random variable it refers to.
This definition might seem a bit out of the blue. However, it turns out to be
a rather useful algebraic tool that can concisely summarize information about
the probability distribution.
Example. Consider a fair die. Then $p_r = 1/6$ for $r = 1, \ldots, 6$. So
$$p(z) = E[z^X] = \frac{1}{6}(z + z^2 + \cdots + z^6) = \frac{1}{6}z\frac{1 - z^6}{1 - z}.$$
Theorem. The distribution of $X$ is uniquely determined by its probability generating function.

Proof. By definition, $p_0 = p(0)$, $p_1 = p'(0)$, etc. (where $p'$ is the derivative of $p$). In general,
$$\left.\frac{d^i}{dz^i}p(z)\right|_{z=0} = i!\,p_i.$$
So we can recover $(p_0, p_1, \ldots)$ from $p(z)$.
Theorem (Abel's lemma).
$$E[X] = \lim_{z\to 1}p'(z).$$
If $p'(z)$ is continuous at 1, then simply $E[X] = p'(1)$.

Note that this theorem is trivial if $p'(1)$ exists, as long as we know that we can differentiate power series term by term. What is important here is that even if $p'(1)$ doesn't exist, we can still take the limit and obtain the expected value, e.g. when $E[X] = \infty$.
Proof. For $z < 1$, we have
$$p'(z) = \sum_{r=1}^\infty rp_rz^{r-1} \leq \sum_{r=1}^\infty rp_r = E[X].$$
So we must have
$$\lim_{z\to 1}p'(z) \leq E[X].$$
On the other hand, for any $\varepsilon$, if we pick $N$ large, then
$$\sum_{r=1}^N rp_r \geq E[X] - \varepsilon.$$
So
$$E[X] - \varepsilon \leq \sum_{r=1}^N rp_r = \lim_{z\to 1}\sum_{r=1}^N rp_rz^{r-1} \leq \lim_{z\to 1}\sum_{r=1}^\infty rp_rz^{r-1} = \lim_{z\to 1}p'(z).$$
So $E[X] \leq \lim_{z\to 1}p'(z)$. So the result follows.
Theorem.
$$E[X(X-1)] = \lim_{z\to 1}p''(z).$$
Proof. Same as above.
Example. Consider the Poisson distribution. Then
$$p_r = P(X = r) = \frac{1}{r!}\lambda^re^{-\lambda}.$$
Then
$$p(z) = E[z^X] = \sum_{r=0}^\infty z^r\frac{1}{r!}\lambda^re^{-\lambda} = e^{\lambda z}e^{-\lambda} = e^{\lambda(z-1)}.$$
We can have a sanity check: $p(1) = 1$, which makes sense, since $p(1)$ is the sum of the probabilities.

We have
$$E[X] = \left.\frac{d}{dz}e^{\lambda(z-1)}\right|_{z=1} = \lambda,$$
and
$$E[X(X-1)] = \left.\frac{d^2}{dz^2}e^{\lambda(z-1)}\right|_{z=1} = \lambda^2.$$
So
$$\mathrm{var}(X) = E[X^2] - E[X]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$
Theorem. Suppose $X_1, X_2, \ldots, X_n$ are independent random variables with pgfs $p_1, p_2, \ldots, p_n$. Then the pgf of $X_1 + X_2 + \cdots + X_n$ is $p_1(z)p_2(z)\cdots p_n(z)$.

Proof.
$$E[z^{X_1+\cdots+X_n}] = E[z^{X_1}\cdots z^{X_n}] = E[z^{X_1}]\cdots E[z^{X_n}] = p_1(z)\cdots p_n(z).$$
Example. Let $X \sim B(n, p)$. Then
$$p(z) = \sum_{r=0}^n P(X = r)z^r = \sum\binom{n}{r}p^r(1-p)^{n-r}z^r = (pz + (1-p))^n = (pz + q)^n.$$
So $p(z)$ is the product of $n$ copies of $pz + q$. But $pz + q$ is the pgf of $Y \sim B(1, p)$. This shows that $X = Y_1 + Y_2 + \cdots + Y_n$ (which we already knew), i.e. a binomial distribution is the sum of Bernoulli trials.
Example. If $X$ and $Y$ are independent Poisson random variables with parameters $\lambda, \mu$ respectively, then
$$E[t^{X+Y}] = E[t^X]E[t^Y] = e^{\lambda(t-1)}e^{\mu(t-1)} = e^{(\lambda+\mu)(t-1)}.$$
So $X + Y \sim P(\lambda + \mu)$.

We can also do this directly:
$$P(X + Y = r) = \sum_{i=0}^r P(X = i, Y = r - i) = \sum_{i=0}^r P(X = i)P(Y = r - i),$$
but this is much more complicated.
We can use pgf-like functions to obtain some combinatorial results.
Example. Suppose we want to tile a $2 \times n$ bathroom with $2 \times 1$ tiles. One way to do it is: [figure: an example tiling]

We can do it recursively: suppose there are $f_n$ ways to tile a $2 \times n$ grid. Then if we start tiling, the first tile is either vertical, in which case we have $f_{n-1}$ ways to tile the remaining grid; or the first tiles are horizontal, in which case we have $f_{n-2}$ ways to tile the remaining grid. So
$$f_n = f_{n-1} + f_{n-2},$$
which is simply the Fibonacci recurrence, with $f_0 = f_1 = 1$.

Let
$$F(z) = \sum_{n=0}^\infty f_nz^n.$$
Then from our recurrence relation, we obtain
$$f_nz^n = f_{n-1}z^n + f_{n-2}z^n.$$
So
$$\sum_{n=2}^\infty f_nz^n = \sum_{n=2}^\infty f_{n-1}z^n + \sum_{n=2}^\infty f_{n-2}z^n.$$
Since $f_0 = f_1 = 1$, we have
$$F(z) - f_0 - zf_1 = z(F(z) - f_0) + z^2F(z).$$
Thus $F(z) = (1 - z - z^2)^{-1}$. If we write
$$\alpha_1 = \frac{1}{2}(1 + \sqrt{5}), \qquad \alpha_2 = \frac{1}{2}(1 - \sqrt{5}),$$
then we have
$$F(z) = (1 - z - z^2)^{-1} = \frac{1}{(1 - \alpha_1z)(1 - \alpha_2z)} = \frac{1}{\alpha_1 - \alpha_2}\left(\frac{\alpha_1}{1 - \alpha_1z} - \frac{\alpha_2}{1 - \alpha_2z}\right) = \frac{1}{\alpha_1 - \alpha_2}\left(\alpha_1\sum_{n=0}^\infty\alpha_1^nz^n - \alpha_2\sum_{n=0}^\infty\alpha_2^nz^n\right).$$
So
$$f_n = \frac{\alpha_1^{n+1} - \alpha_2^{n+1}}{\alpha_1 - \alpha_2}.$$
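We can check the closed form against the recurrence directly (an added Python sketch, not part of the notes):

```python
from math import sqrt

def tilings(n):
    """Number of ways to tile a 2 x n grid with 2 x 1 tiles (f_0 = f_1 = 1)."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

a1, a2 = (1 + sqrt(5)) / 2, (1 - sqrt(5)) / 2
for n in range(8):
    closed_form = (a1 ** (n + 1) - a2 ** (n + 1)) / (a1 - a2)
    print(n, tilings(n), round(closed_form))   # the two columns agree
```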
Example. A Dyck word is a string of brackets that match, such as () or ((())()). There is only one Dyck word of length 2, namely (). There are 2 of length 4, (()) and ()(). Similarly, there are 5 Dyck words of length 6.

Let $C_n$ be the number of Dyck words of length $2n$. We can split each Dyck word into $(w_1)w_2$, where $w_1$ and $w_2$ are Dyck words. Since the lengths of $w_1$ and $w_2$ must sum to $2(n-1)$,
$$C_{n+1} = \sum_{i=0}^n C_iC_{n-i}. \quad (*)$$
We again use pgf-like functions: let
$$c(x) = \sum_{n=0}^\infty C_nx^n.$$
From $(*)$, we can show that
$$c(x) = 1 + xc(x)^2.$$
We can solve this to show that
$$c(x) = \frac{1 - \sqrt{1 - 4x}}{2x} = \sum_{n=0}^\infty\binom{2n}{n}\frac{x^n}{n+1},$$
noting that $C_0 = 1$. Then
$$C_n = \frac{1}{n+1}\binom{2n}{n}.$$
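The recurrence $(*)$ and the closed form can be compared numerically (an illustrative Python sketch added here, not from the notes):

```python
from math import comb

def catalan_recursive(n_max):
    """Compute C_0, ..., C_{n_max} from the recurrence C_{n+1} = sum C_i C_{n-i}."""
    c = [1]
    for n in range(n_max):
        c.append(sum(c[i] * c[n - i] for i in range(n + 1)))
    return c

print(catalan_recursive(6))                              # [1, 1, 2, 5, 14, 42, 132]
print([comb(2 * n, n) // (n + 1) for n in range(7)])     # same values
```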
Sums with a random number of terms
A useful application of generating functions is the sum with a random number
of random terms. For example, an insurance company may receive a random
number of claims, each demanding a random amount of money. Then we have
a sum of a random number of terms. This can be answered using probability
generating functions.
Example. Let $X_1, X_2, \ldots$ be iid with pgf $p(z) = E[z^X]$. Let $N$ be a random variable independent of the $X_i$ with pgf $h(z)$. What is the pgf of $S = X_1 + \cdots + X_N$?
$$E[z^S] = E[z^{X_1+\cdots+X_N}] = E_N[\underbrace{E_{X_i}[z^{X_1+\cdots+X_N}\mid N]}_{\text{assuming fixed }N}]$$
$$= \sum_{n=0}^\infty P(N = n)E[z^{X_1+X_2+\cdots+X_n}] = \sum_{n=0}^\infty P(N = n)E[z^{X_1}]E[z^{X_2}]\cdots E[z^{X_n}]$$
$$= \sum_{n=0}^\infty P(N = n)(E[z^{X_1}])^n = \sum_{n=0}^\infty P(N = n)p(z)^n = h(p(z)),$$
since $h(x) = \sum_{n=0}^\infty P(N = n)x^n$.
So
E[S] =
d
dz
h(p(z))
z=1
= h
(p(1))p
(1)
= E[N]E[X
1
]
To calculate the variance, use the fact that
E[S(S 1)] =
d
2
dz
2
h(p(z))
z=1
.
Then we can find that
var(S) = E[N] var(X
1
) + E[X
2
1
] var(N).
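A quick simulation makes the formulas $E[S] = E[N]E[X_1]$ and $\operatorname{var}(S) = E[N]\operatorname{var}(X_1) + (E[X_1])^2\operatorname{var}(N)$ plausible. The sketch below is my own illustration with arbitrarily chosen parameters ($N$ Poisson with mean 3, the $X_i$ Poisson with mean 2), not something from the lectures.

```python
import math
import random

random.seed(0)

def poisson(lam):
    # Knuth's method: count uniform factors until the running product drops below e^{-lam}
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

lam_N, lam_X = 3.0, 2.0   # assumed example parameters
samples = []
for _ in range(100_000):
    n = poisson(lam_N)
    samples.append(sum(poisson(lam_X) for _ in range(n)))

mean_S = sum(samples) / len(samples)
var_S = sum((s - mean_S) ** 2 for s in samples) / len(samples)
# Theory: E[S] = 3 * 2 = 6,  var(S) = 3 * 2 + 2^2 * 3 = 18
print(mean_S, var_S)
```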
4 Interesting problems

Here we are going to study two rather important and interesting probabilistic processes: branching processes and random walks. Solutions to these will typically involve the use of probability generating functions.

4.1 Branching processes

Branching processes are used to model population growth by reproduction. At the beginning, there is only one individual. At each iteration, each individual produces a random number of offspring. In the next iteration, each offspring independently reproduces at random according to the same distribution. We will ask questions such as the expected number of individuals in a particular generation and the probability of going extinct.
Consider $X_0, X_1, \ldots$, where $X_n$ is the number of individuals in the $n$th generation. We assume the following:
(i) $X_0 = 1$.
(ii) Each individual lives for unit time and produces $k$ offspring with probability $p_k$.
(iii) All offspring behave independently, so that
$$X_{n+1} = Y_1^n + Y_2^n + \cdots + Y_{X_n}^n,$$
where the $Y_i^n$ are iid random variables with the same distribution as $X_1$ (the superscript denotes the generation).
It is useful to consider the pgf of a branching process. Let $F(z)$ be the pgf of $Y_i^n$. Then
$$F(z) = E[z^{Y_i^n}] = E[z^{X_1}] = \sum_{k=0}^{\infty} p_k z^k.$$
Define
$$F_n(z) = E[z^{X_n}].$$
The main theorem of branching processes here is

Theorem.
$$F_{n+1}(z) = F_n(F(z)) = F(F(F(\cdots F(z)\cdots))) = F(F_n(z)).$$
Proof.
$$\begin{aligned}
F_{n+1}(z) &= E[z^{X_{n+1}}] \\
&= E[E[z^{X_{n+1}} \mid X_n]] \\
&= \sum_{k=0}^{\infty} P(X_n = k)\, E[z^{X_{n+1}} \mid X_n = k] \\
&= \sum_{k=0}^{\infty} P(X_n = k)\, E[z^{Y_1^n + \cdots + Y_k^n} \mid X_n = k] \\
&= \sum_{k=0}^{\infty} P(X_n = k)\, E[z^{Y_1^n}]\cdots E[z^{Y_k^n}] \\
&= \sum_{k=0}^{\infty} P(X_n = k)\,(E[z^{X_1}])^k \\
&= \sum_{k=0}^{\infty} P(X_n = k)\, F(z)^k \\
&= F_n(F(z)).
\end{aligned}$$
Theorem. Suppose
$$E[X_1] = \sum_k k p_k = \mu \quad \text{and} \quad \operatorname{var}(X_1) = E[(X_1 - \mu)^2] = \sum_k (k - \mu)^2 p_k = \sigma^2 < \infty.$$
Then
$$E[X_n] = \mu^n, \qquad \operatorname{var}(X_n) = \sigma^2\mu^{n-1}(1 + \mu + \mu^2 + \cdots + \mu^{n-1}).$$
Proof.
$$E[X_n] = E[E[X_n \mid X_{n-1}]] = E[\mu X_{n-1}] = \mu E[X_{n-1}].$$
Then by induction, $E[X_n] = \mu^n$ (since $X_0 = 1$).
To calculate the variance, note that
$$\operatorname{var}(X_n) = E[X_n^2] - (E[X_n])^2,$$
and hence
$$E[X_n^2] = \operatorname{var}(X_n) + (E[X_n])^2.$$
We then calculate
$$\begin{aligned}
E[X_n^2] &= E[E[X_n^2 \mid X_{n-1}]] \\
&= E[\operatorname{var}(X_n \mid X_{n-1}) + (E[X_n \mid X_{n-1}])^2] \\
&= E[X_{n-1}\operatorname{var}(X_1) + (\mu X_{n-1})^2] \\
&= E[X_{n-1}\sigma^2 + (\mu X_{n-1})^2] \\
&= \sigma^2\mu^{n-1} + \mu^2 E[X_{n-1}^2].
\end{aligned}$$
So
$$\begin{aligned}
\operatorname{var}(X_n) &= E[X_n^2] - (E[X_n])^2 \\
&= \mu^2 E[X_{n-1}^2] + \sigma^2\mu^{n-1} - \mu^2(E[X_{n-1}])^2 \\
&= \mu^2\big(E[X_{n-1}^2] - (E[X_{n-1}])^2\big) + \sigma^2\mu^{n-1} \\
&= \mu^2\operatorname{var}(X_{n-1}) + \sigma^2\mu^{n-1} \\
&= \mu^4\operatorname{var}(X_{n-2}) + \sigma^2(\mu^{n-1} + \mu^n) \\
&= \cdots \\
&= \mu^{2(n-1)}\operatorname{var}(X_1) + \sigma^2(\mu^{n-1} + \mu^n + \cdots + \mu^{2n-3}) \\
&= \sigma^2\mu^{n-1}(1 + \mu + \cdots + \mu^{n-1}).
\end{aligned}$$
Of course, we could also obtain this using the probability generating function.
Extinction probability

Let $A_n$ be the event $X_n = 0$, i.e. extinction has occurred by the $n$th generation. Let $q$ be the probability that extinction eventually occurs. Let
$$A = \bigcup_{n=1}^{\infty} A_n = [\text{extinction eventually occurs}].$$
Since $A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots$, we know that
$$q = P(A) = \lim_{n \to \infty} P(A_n) = \lim_{n \to \infty} P(X_n = 0).$$
But
$$P(X_n = 0) = F_n(0),$$
since $F_n(z) = \sum_k P(X_n = k)z^k$. So, by continuity of $F$,
$$F(q) = F\Big(\lim_{n \to \infty} F_n(0)\Big) = \lim_{n \to \infty} F(F_n(0)) = \lim_{n \to \infty} F_{n+1}(0) = q.$$
So
$$F(q) = q.$$
Alternatively, using the law of total probability,
$$q = \sum_k P(X_1 = k)\,P(\text{extinction} \mid X_1 = k) = \sum_k p_k q^k = F(q),$$
where the second equality comes from the fact that for the whole population to go extinct, the line of descent of each of the $k$ first-generation offspring must independently go extinct.
This means that to find the probability of extinction, we need to find a fixed point of $F$. However, if there are many fixed points, which should we pick?

Theorem. The probability of extinction $q$ is the smallest non-negative root of the equation $q = F(q)$. Write $\mu = E[X_1]$. Then if $\mu \leq 1$, then $q = 1$; if $\mu > 1$, then $q < 1$.

Proof. To show that it is the smallest root, let $\alpha$ be the smallest non-negative root. Then note that $0 \leq \alpha$, so $F(0) \leq F(\alpha) = \alpha$, since $F$ is increasing (proof: write the function out!). Hence $F(F(0)) \leq F(\alpha) = \alpha$. Continuing inductively, $F_n(0) \leq \alpha$ for all $n$. So
$$q = \lim_{n \to \infty} F_n(0) \leq \alpha.$$
Since $q$ is itself a root, $q = \alpha$.
To show that $q = 1$ when $\mu \leq 1$, we show that $q = 1$ is the only root in $[0, 1]$. We know that $F'(z), F''(z) \geq 0$ for $z \in (0, 1)$ (proof: write it out again!). So $F$ is increasing and convex. Since $F'(1) = \mu \leq 1$, the graph of $F$ must approach the point $(1, 1)$ from above the line $F(z) = z$, staying above the diagonal on $(0, 1)$. So $z = 1$ is the only root in $[0, 1]$.
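To illustrate the theorem, here is a rough simulation sketch (my own, not from the lectures): for an offspring distribution with $p_0 = 0.25$, $p_1 = 0.25$, $p_2 = 0.5$ (so $\mu = 1.25 > 1$) we estimate the extinction probability by simulation and compare it with the smallest root of $F(q) = q$, found by iterating $q \mapsto F(q)$ from 0, exactly as in the proof.

```python
import random
random.seed(1)

probs = [0.25, 0.25, 0.5]   # p_0, p_1, p_2; mu = 1.25 > 1, smallest root is 0.5

def offspring():
    u, acc = random.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1

def goes_extinct(max_generations=200, cap=1_000):
    x = 1
    for _ in range(max_generations):
        if x == 0:
            return True
        if x > cap:        # essentially certain to survive once the population is large
            return False
        x = sum(offspring() for _ in range(x))
    return x == 0

trials = 2_000
q_sim = sum(goes_extinct() for _ in range(trials)) / trials

# Smallest non-negative root of F(q) = q, via q = lim F_n(0)
F = lambda q: sum(p * q ** k for k, p in enumerate(probs))
q = 0.0
for _ in range(1000):
    q = F(q)
print(q_sim, q)   # both close to 0.5
```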
4.2 Random walk and gambler's ruin

Here we'll study random walks, using gambler's ruin as an example.

Definition (Random walk). Let $X_1, \ldots, X_n$ be iid random variables such that $X_n = +1$ with probability $p$, and $-1$ with probability $1 - p$. Let $S_n = S_0 + X_1 + \cdots + X_n$. Then $(S_0, S_1, \ldots, S_n)$ is a 1-dimensional random walk.
If $p = q = \frac{1}{2}$, we say it is a symmetric random walk.
Example. A gambler starts with $\$z$, with $z < a$, and plays a game in which he wins $\$1$ or loses $\$1$ at each turn with probabilities $p$ and $q$ respectively. What are
$$p_z = P(\text{random walk hits } a \text{ before } 0 \mid \text{starts at } z)$$
and
$$q_z = P(\text{random walk hits } 0 \text{ before } a \mid \text{starts at } z)?$$
He either wins his first game, with probability $p$, or loses it, with probability $q$. So
$$p_z = q p_{z-1} + p p_{z+1}$$
for $0 < z < a$, with $p_0 = 0$, $p_a = 1$.
Try $p_z = t^z$. Then
$$pt^2 - t + q = (pt - q)(t - 1) = 0,$$
noting that $p = 1 - q$. If $p \neq q$, then
$$p_z = A\cdot 1^z + B\left(\frac{q}{p}\right)^z.$$
Since $p_0 = 0$, we get $A = -B$. Since $p_a = 1$, we obtain
$$p_z = \frac{1 - (q/p)^z}{1 - (q/p)^a}.$$
If $p = q$, then $p_z = A + Bz = z/a$.
Similarly (or by performing the substitutions $p \mapsto q$, $q \mapsto p$ and $z \mapsto a - z$),
$$q_z = \frac{(q/p)^a - (q/p)^z}{(q/p)^a - 1}$$
if $p \neq q$, and
$$q_z = \frac{a - z}{a}$$
if $p = q$. Since $p_z + q_z = 1$, we know that the game will eventually end.
What if $a \to \infty$? What is the probability of going bankrupt?
$$P(\text{path hits 0 ever}) = P\left(\bigcup_{a=z+1}^{\infty}[\text{path hits 0 before } a]\right) = \lim_{a \to \infty} P(\text{path hits 0 before } a) = \lim_{a \to \infty} q_z = \begin{cases} (q/p)^z & p > q \\ 1 & p \leq q.\end{cases}$$
So if the odds are against you (i.e. the probability of losing is greater than the probability of winning), then no matter how small the difference is, you are bound to go bankrupt eventually.
Duration of the game

Let $D_z$ be the expected time until the random walk hits 0 or $a$, starting from $z$. We first show that this is bounded. We know that there is one simple way to hit 0 or $a$: get $+1$ or $-1$ for $a$ turns in a row. In each block of $a$ consecutive steps this happens with probability at least $p^a + q^a$, so the number of blocks needed is at most geometric, giving
$$D_z \leq \frac{a}{p^a + q^a}.$$
This is a very crude bound, but it is sufficient to show that $D_z$ is finite, and we can meaningfully apply formulas to this finite quantity.
We have
$$\begin{aligned}
D_z &= E[\text{duration}] \\
&= E[E[\text{duration} \mid X_1]] \\
&= p\,E[\text{duration} \mid X_1 = 1] + q\,E[\text{duration} \mid X_1 = -1] \\
&= p(1 + D_{z+1}) + q(1 + D_{z-1}).
\end{aligned}$$
So
$$D_z = 1 + pD_{z+1} + qD_{z-1},$$
subject to $D_0 = D_a = 0$.
We first find a particular solution by trying $D_z = Cz$. Then
$$Cz = 1 + pC(z+1) + qC(z-1),$$
so
$$C = \frac{1}{q - p}$$
for $p \neq q$. Then we find the complementary solution. Trying $D_z = t^z$ gives
$$pt^2 - t + q = 0,$$
which has roots $1$ and $q/p$. So the general solution is
$$D_z = A + B\left(\frac{q}{p}\right)^z + \frac{z}{q - p}.$$
Putting in the boundary conditions,
$$D_z = \frac{z}{q - p} - \frac{a}{q - p}\cdot\frac{1 - (q/p)^z}{1 - (q/p)^a}.$$
If $p = q$, then to find a particular solution we have to try $D_z = Cz^2$. Then we find $C = -1$, so
$$D_z = -z^2 + A + Bz,$$
and the boundary conditions give
$$D_z = z(a - z).$$
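The formulas for $p_z$ and $D_z$ are easy to check by simulation. The following sketch is my own illustration (parameters chosen arbitrarily); it estimates the probability of hitting $a$ before 0 and the expected duration, and compares them with the formulas above.

```python
import random
random.seed(2)

p, a, z0 = 0.45, 10, 5
q = 1 - p

def play(z):
    steps = 0
    while 0 < z < a:
        z += 1 if random.random() < p else -1
        steps += 1
    return z == a, steps

trials = 50_000
wins, total_steps = 0, 0
for _ in range(trials):
    won, steps = play(z0)
    wins += won
    total_steps += steps

# Formulas for the p != q case
r = q / p
p_z = (1 - r ** z0) / (1 - r ** a)
D_z = z0 / (q - p) - (a / (q - p)) * (1 - r ** z0) / (1 - r ** a)
print(wins / trials, p_z)          # both ~ probability of hitting a before 0
print(total_steps / trials, D_z)   # both ~ expected duration
```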
Using generating functions

We can also use generating functions to solve the problem. Let
$$U_{z,n} = P(\text{random walk is absorbed at 0 at time } n \mid \text{starts at } z).$$
We have the following boundary conditions: $U_{0,0} = 1$; $U_{z,0} = 0$ for $0 < z \leq a$; and $U_{0,n} = U_{a,n} = 0$ for $n > 0$.
We define a pgf-like function
$$U_z(s) = \sum_{n=0}^{\infty} U_{z,n} s^n.$$
We know that
$$U_{z,n+1} = pU_{z+1,n} + qU_{z-1,n}.$$
Multiplying by $s^{n+1}$ and summing over $n = 0, 1, \ldots$ gives
$$U_z(s) = psU_{z+1}(s) + qsU_{z-1}(s).$$
We try $U_z(s) = \lambda(s)^z$. Then
$$\lambda(s) = ps\lambda(s)^2 + qs,$$
so
$$\lambda_1(s), \lambda_2(s) = \frac{1 \pm \sqrt{1 - 4pqs^2}}{2ps}.$$
So
$$U_z(s) = A(s)\lambda_1(s)^z + B(s)\lambda_2(s)^z.$$
Since $U_0(s) = 1$ and $U_a(s) = 0$, we know that
$$A(s) + B(s) = 1 \quad \text{and} \quad A(s)\lambda_1(s)^a + B(s)\lambda_2(s)^a = 0.$$
Then we find that
$$U_z(s) = \frac{\lambda_1(s)^a\lambda_2(s)^z - \lambda_2(s)^a\lambda_1(s)^z}{\lambda_1(s)^a - \lambda_2(s)^a}.$$
Since $\lambda_1(s)\lambda_2(s) = \frac{q}{p}$, we can "simplify" this to obtain
$$U_z(s) = \left(\frac{q}{p}\right)^z\cdot\frac{\lambda_1(s)^{a-z} - \lambda_2(s)^{a-z}}{\lambda_1(s)^a - \lambda_2(s)^a}.$$
We see that $U_z(1) = q_z$. We can apply the same method to find the generating function for absorption at $a$, say $V_z(s)$. Then the generating function for the duration is $U_z + V_z$, and hence the expected duration is $D_z = U_z'(1) + V_z'(1)$.
5 Continuous random variables

5.1 Continuous random variables

So far, we have only looked at the case where the outcomes are discrete. Consider an experiment where we throw a needle randomly onto the ground and record the angle it makes with a fixed horizontal. Then our sample space is $\Omega = \{\omega \in \mathbb{R} : 0 \leq \omega < 2\pi\}$, and
$$P(\omega \in [0, \theta]) = \frac{\theta}{2\pi}, \quad 0 \leq \theta \leq 2\pi.$$
With continuous distributions, we can no longer talk about the probability of getting a particular number, since this is always zero. For example, we will almost never get an angle of exactly 0.42 radians.
Instead, we can only meaningfully talk about the probability of $X$ falling into a particular range. To capture the distribution of $X$, we want to define a function $f$ such that for each $x$ and small $\delta x$, the probability of $X$ lying in $[x, x + \delta x]$ is given by $f(x)\,\delta x + o(\delta x)$. This $f$ is known as the probability density function. Integrating, the probability of $X \in [a, b]$ is $\int_a^b f(x)\,\mathrm{d}x$. We take this as the definition of $f$.

Definition (Continuous random variable). A random variable $X : \Omega \to \mathbb{R}$ is continuous if there is a function $f : \mathbb{R} \to \mathbb{R}_{\geq 0}$ such that
$$P(a \leq X \leq b) = \int_a^b f(x)\,\mathrm{d}x.$$
We call $f$ the probability density function, which satisfies $f \geq 0$ and $\int_{-\infty}^{\infty} f(x)\,\mathrm{d}x = 1$.

Note that $P(X = a) = 0$, since it is $\int_a^a f(x)\,\mathrm{d}x$. Then we also have
$$P\left(\bigcup_{a \in \mathbb{Q}}[X = a]\right) = 0,$$
since it is a countable union of probability-0 events (and axiom 3 states that the probability of a countable disjoint union is the sum of the probabilities, i.e. 0).
Definition (Cumulative distribution function). The cumulative distribution function (or simply distribution function) of a random variable $X$ (discrete, continuous, or neither) is
$$F(x) = P(X \leq x).$$
We can see that $F(x)$ is non-decreasing and $F(x) \to 1$ as $x \to \infty$.
In the case of continuous random variables, we have
$$F(x) = \int_{-\infty}^{x} f(z)\,\mathrm{d}z.$$
Then $F$ is continuous and differentiable. In general, $F'(x) = f(x)$ whenever $F$ is differentiable.
The name of continuous random variables comes from the fact that $F(x)$ is (absolutely) continuous.
The cdf of a continuous random variable is a continuous curve; the cdf of a discrete random variable is a step function; and the cdf of a random variable that is neither discrete nor continuous mixes jumps with stretches of continuous increase.
Note that we always have
$$P(a < X \leq b) = F(b) - F(a),$$
which equals $\int_a^b f(x)\,\mathrm{d}x$ in the case of continuous random variables.
Definition (Uniform distribution). The uniform distribution on $[a, b]$ has pdf
$$f(x) = \frac{1}{b - a}$$
for $a \leq x \leq b$. Then
$$F(x) = \int_a^x f(z)\,\mathrm{d}z = \frac{x - a}{b - a}$$
for $a \leq x \leq b$. If $X$ follows a uniform distribution on $[a, b]$, we write $X \sim U[a, b]$.

Definition (Exponential random variable). The exponential random variable with parameter $\lambda$ has pdf
$$f(x) = \lambda e^{-\lambda x}$$
and
$$F(x) = 1 - e^{-\lambda x}$$
for $x \geq 0$. We write $X \sim \mathcal{E}(\lambda)$.
Then
$$P(a \leq X \leq b) = \int_a^b f(z)\,\mathrm{d}z = e^{-\lambda a} - e^{-\lambda b}.$$
Proposition. The exponential random variable is memoryless, i.e.
$$P(X \geq x + z \mid X \geq x) = P(X \geq z).$$
This means that, say, if $X$ measures the lifetime of a light bulb, knowing it has already lasted for 3 hours gives no information about how much longer it will last.
Recall that the geometric random variable is the discrete memoryless random variable.

Proof.
$$P(X \geq x + z \mid X \geq x) = \frac{P(X \geq x + z)}{P(X \geq x)} = \frac{\int_{x+z}^{\infty} f(u)\,\mathrm{d}u}{\int_x^{\infty} f(u)\,\mathrm{d}u} = \frac{e^{-\lambda(x+z)}}{e^{-\lambda x}} = e^{-\lambda z} = P(X \geq z).$$

Similarly, given that you have lived for $x$ days, what is the probability of dying within the next $\delta x$ days?
$$P(X \leq x + \delta x \mid X \geq x) = \frac{P(x \leq X \leq x + \delta x)}{P(X \geq x)} \approx \frac{\lambda e^{-\lambda x}\,\delta x}{e^{-\lambda x}} = \lambda\,\delta x.$$
So it is independent of how old you currently are, assuming your survival follows an exponential distribution.
In general, we can define the hazard rate to be
$$h(x) = \frac{f(x)}{1 - F(x)}.$$
Then
$$P(x \leq X \leq x + \delta x \mid X \geq x) = \frac{P(x \leq X \leq x + \delta x)}{P(X \geq x)} \approx \frac{\delta x\, f(x)}{1 - F(x)} = \delta x\cdot h(x).$$
In the case of the exponential distribution, $h(x)$ is constant and equal to $\lambda$.
Similar to discrete variables, we can define the expected value and the variance. However, we will (obviously) have to replace the sum with an integral.

Definition (Expectation). The expectation (or mean) of a continuous random variable is
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x,$$
provided that not both $\int_0^{\infty} x f(x)\,\mathrm{d}x$ and $\int_{-\infty}^{0} x f(x)\,\mathrm{d}x$ are infinite.

Theorem. If $X$ is a continuous random variable, then
$$E[X] = \int_0^{\infty} P(X \geq x)\,\mathrm{d}x - \int_0^{\infty} P(X \leq -x)\,\mathrm{d}x.$$
This result is more intuitive in the discrete case:
$$\sum_{x=0}^{\infty} x\,P(X = x) = \sum_{x=0}^{\infty}\sum_{y=x+1}^{\infty} P(X = y) = \sum_{x=0}^{\infty} P(X > x),$$
where the first equality holds because on both sides we have $x$ copies of $P(X = x)$ in the sum.

Proof.
$$\begin{aligned}
\int_0^{\infty} P(X \geq x)\,\mathrm{d}x &= \int_0^{\infty}\int_x^{\infty} f(y)\,\mathrm{d}y\,\mathrm{d}x \\
&= \int_0^{\infty}\int_0^{\infty} I[y \geq x]\,f(y)\,\mathrm{d}y\,\mathrm{d}x \\
&= \int_0^{\infty}\left(\int_0^{\infty} I[x \leq y]\,\mathrm{d}x\right)f(y)\,\mathrm{d}y \\
&= \int_0^{\infty} y f(y)\,\mathrm{d}y.
\end{aligned}$$
We can similarly show that $\int_0^{\infty} P(X \leq -x)\,\mathrm{d}x = -\int_{-\infty}^{0} y f(y)\,\mathrm{d}y$.

Example. Suppose $X \sim \mathcal{E}(\lambda)$. Then
$$P(X \geq x) = \int_x^{\infty} \lambda e^{-\lambda t}\,\mathrm{d}t = e^{-\lambda x}.$$
So
$$E[X] = \int_0^{\infty} e^{-\lambda x}\,\mathrm{d}x = \frac{1}{\lambda}.$$
Definition (Variance). The variance of a continuous random variable is
$$\operatorname{var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.$$
So we have
$$\operatorname{var}(X) = \int_{-\infty}^{\infty} x^2 f(x)\,\mathrm{d}x - \left(\int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x\right)^2.$$

Example. Let $X \sim U[a, b]$. Then
$$E[X] = \int_a^b x\,\frac{1}{b - a}\,\mathrm{d}x = \frac{a + b}{2}.$$
So
$$\operatorname{var}(X) = \int_a^b x^2\,\frac{1}{b - a}\,\mathrm{d}x - (E[X])^2 = \frac{1}{12}(b - a)^2.$$
Apart from the mean, or expected value, we can also consider other notions of "average value".

Definition (Mode and median). Given a pdf $f(x)$, we call $\hat{x}$ a mode if
$$f(\hat{x}) \geq f(x)$$
for all $x$. Note that a distribution can have many modes. For example, in the uniform distribution, all $x$ are modes.
We call $\hat{x}$ a median if
$$\int_{-\infty}^{\hat{x}} f(x)\,\mathrm{d}x = \frac{1}{2} = \int_{\hat{x}}^{\infty} f(x)\,\mathrm{d}x.$$
For a discrete random variable, the median is an $\hat{x}$ such that
$$P(X \geq \hat{x}) \geq \frac{1}{2}, \qquad P(X \leq \hat{x}) \geq \frac{1}{2}.$$
Here we have non-strict inequalities, since if the random variable, say, always takes the value 0, then both probabilities will be 1.

Suppose that we have an experiment whose outcome follows the distribution of $X$. We can perform the experiment many times and obtain many results $X_1, \ldots, X_n$. The average of these results is known as the sample mean.

Definition (Sample mean). If $X_1, \ldots, X_n$ is a random sample from some distribution, then the sample mean is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
5.2 Stochastic ordering and inspection paradox

We want to define a (partial) order on random variables. For example, we might want to say that $X + 2 > X$, where $X$ is a random variable. A simple definition might be expectation ordering, where $X \leq Y$ if $E[X] \leq E[Y]$. This, however, is not satisfactory, since $X \leq Y$ and $Y \leq X$ would not imply $X = Y$. Instead, we can use the stochastic order.

Definition (Stochastic order). The stochastic order is defined by: $X \leq_{\mathrm{st}} Y$ if $P(X > t) \leq P(Y > t)$ for all $t$.

This is clearly transitive. Stochastic ordering implies expectation ordering, since
$$X \leq_{\mathrm{st}} Y \Rightarrow \int_0^{\infty} P(X > x)\,\mathrm{d}x \leq \int_0^{\infty} P(Y > x)\,\mathrm{d}x \Rightarrow E[X] \leq E[Y].$$
Alternatively, we can also order random variables by hazard rate.
Example (Inspection paradox). Suppose that $n$ families have children attending a school. Family $i$ has $X_i$ children at the school, where $X_1, \ldots, X_n$ are iid random variables with $P(X_i = k) = p_k$. Suppose that the average family size is $\mu$.
Now pick a child at random. What is the probability distribution of her family size? Let $J$ be the index of the family from which she comes (which is a random variable). Then
$$P(X_J = k \mid J = j) = \frac{P(J = j, X_j = k)}{P(J = j)}.$$
The denominator is $1/n$. The numerator is more complex: we need the $j$th family to have $k$ members, which happens with probability $p_k$, and we need to have picked a member of the $j$th family, which happens with probability $E\!\left[\frac{k}{k + \sum_{i \neq j} X_i}\right]$. So
$$P(X_J = k \mid J = j) = E\left[\frac{nkp_k}{k + \sum_{i \neq j} X_i}\right].$$
Note that this is independent of $j$. So
$$P(X_J = k) = E\left[\frac{nkp_k}{k + \sum_{i \neq j} X_i}\right].$$
Also, $P(X_1 = k) = p_k$. So
$$\frac{P(X_J = k)}{P(X_1 = k)} = E\left[\frac{nk}{k + \sum_{i \neq j} X_i}\right].$$
This is increasing in $k$, and greater than 1 for $k > \mu$. So the average family size of the child we picked is greater than the average family size. It can in fact be shown that $X_J$ is stochastically greater than $X_1$.
This means that if we pick children at random, the sample mean of their family sizes will be greater than the true mean family size. This is because the larger a family is, the more likely it is that we pick a child from that family.
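The inspection paradox is easy to see in a simulation. The sketch below is my own illustration with an arbitrary family-size distribution; it compares the mean family size with the mean family size of a randomly chosen child.

```python
import random
random.seed(3)

# Family sizes take values 1, 2, 3, 4 with these probabilities (arbitrary example), mean mu = 2.0
sizes = [1, 2, 3, 4]
probs = [0.4, 0.3, 0.2, 0.1]

n_families, reps = 1000, 500
mean_by_family, mean_by_child = [], []
for _ in range(reps):
    families = random.choices(sizes, probs, k=n_families)
    mean_by_family.append(sum(families) / n_families)
    # picking a child uniformly at random picks family j with probability proportional to X_j
    children = [x for x in families for _ in range(x)]
    mean_by_child.append(sum(random.choices(children, k=1000)) / 1000)

print(sum(mean_by_family) / reps)   # about 2.0
print(sum(mean_by_child) / reps)    # noticeably larger (about 2.5 for this distribution)
```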
5.3 Jointly distributed random variables

Definition (Joint distribution). Two random variables $X, Y$ have joint distribution $F : \mathbb{R}^2 \to [0, 1]$ defined by
$$F(x, y) = P(X \leq x, Y \leq y).$$
The marginal distribution of $X$ is
$$F_X(x) = P(X \leq x) = P(X \leq x, Y < \infty) = F(x, \infty) = \lim_{y \to \infty} F(x, y).$$

Definition (Jointly distributed random variables). We say $X_1, \ldots, X_n$ are jointly distributed continuous random variables with joint pdf $f$ if for any set $A \subseteq \mathbb{R}^n$,
$$P((X_1, \ldots, X_n) \in A) = \int_{(x_1, \ldots, x_n) \in A} f(x_1, \ldots, x_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n,$$
where
$$f(x_1, \ldots, x_n) \geq 0 \quad \text{and} \quad \int_{\mathbb{R}^n} f(x_1, \ldots, x_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n = 1.$$
Example. In the case where $n = 2$,
$$F(x, y) = P(X \leq x, Y \leq y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(u, v)\,\mathrm{d}v\,\mathrm{d}u.$$
If $F$ is differentiable, then
$$f(x, y) = \frac{\partial^2}{\partial x\,\partial y}F(x, y).$$

Theorem. If $X$ and $Y$ are jointly continuous random variables, then they are individually continuous random variables.

Proof. We prove this by showing that $X$ has a density function. We know that
$$P(X \in A) = P(X \in A,\ Y \in (-\infty, +\infty)) = \int_{x \in A}\int_{-\infty}^{\infty} f(x, y)\,\mathrm{d}y\,\mathrm{d}x = \int_{x \in A} f_X(x)\,\mathrm{d}x,$$
so
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,\mathrm{d}y$$
is the (marginal) pdf of $X$.
Definition (Independent continuous random variables). Continuous random variables $X_1, \ldots, X_n$ are independent if
$$P(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = P(X_1 \in A_1)P(X_2 \in A_2)\cdots P(X_n \in A_n)$$
for all $A_i \subseteq \Omega_{X_i}$.
If we let $F_{X_i}$ and $f_{X_i}$ be the cdf and pdf of $X_i$, then
$$F(x_1, \ldots, x_n) = F_{X_1}(x_1)\cdots F_{X_n}(x_n)$$
and
$$f(x_1, \ldots, x_n) = f_{X_1}(x_1)\cdots f_{X_n}(x_n)$$
are each individually equivalent to the definition above.
To show that two (or more) random variables are independent, we only have to factorize the joint pdf into factors that each involve only one variable.
Example. If $(X_1, X_2)$ takes a random value from $[0, 1] \times [0, 1]$, then $f(x_1, x_2) = 1$. Then we can see that $f(x_1, x_2) = 1\cdot 1 = f(x_1)\cdot f(x_2)$. So $X_1$ and $X_2$ are independent.
On the other hand, if $(Y_1, Y_2)$ takes a random value from $[0, 1] \times [0, 1]$ with the restriction that $Y_1 \leq Y_2$, then they are not independent, since $f(y_1, y_2) = 2I[y_1 \leq y_2]$, which cannot be split into a product of a factor in $y_1$ alone and a factor in $y_2$ alone.
Proposition. For independent continuous random variables $X_i$,
(i) $E[\prod X_i] = \prod E[X_i]$,
(ii) $\operatorname{var}(\sum X_i) = \sum \operatorname{var}(X_i)$.
5.4 Geometric probability

Often, when doing probability problems that involve geometry, we can visualize the outcomes with the aid of a picture.

Example. Two points $X$ and $Y$ are chosen independently and uniformly at random on a line segment of length $L$. What is the probability that $|X - Y| \leq \ell$? By "at random", we mean
$$f(x, y) = \frac{1}{L^2},$$
since each of $X$ and $Y$ has pdf $1/L$.
We can visualize this on a graph whose two axes are the values of $X$ and $Y$, with the permitted region $A = \{|x - y| \leq \ell\}$ forming a diagonal band across the $L \times L$ square. The part of the square outside $A$ consists of two triangles of total area $(L - \ell)^2$, so the area of $A$ is $L^2 - (L - \ell)^2 = 2L\ell - \ell^2$. So the desired probability is
$$\int_A f(x, y)\,\mathrm{d}x\,\mathrm{d}y = \frac{2L\ell - \ell^2}{L^2}.$$
Example (Bertrand's paradox). Suppose we draw a random chord in a circle. What is the probability that the length of the chord is greater than the length of the side of an inscribed equilateral triangle?
There are three ways of "drawing a random chord":
(i) We randomly pick two end points on the circumference independently. Now draw the inscribed triangle with a vertex at one end point. For the chord to be longer than a side of the triangle, the other end point must lie between the two other vertices of the triangle. This happens with probability 1/3.
(ii) wlog the chord is horizontal, on the lower half of the circle. Its mid-point is uniformly distributed along the vertical radius, and the chord is long if and only if the mid-point lies within half a radius of the centre. So the probability of getting a long chord is 1/2.
(iii) The mid-point of the chord is distributed uniformly across the disc. Then we get a long chord if and only if the mid-point lies inside the concentric circle of half the radius, which occurs with probability 1/4.
We get different answers for different notions of "random"! This is why when we say "randomly", we should be explicit about what we mean.
Example (Buffon's needle). A needle of length $\ell$ is tossed at random onto a floor marked with parallel lines a distance $L$ apart, where $\ell \leq L$. Let $A$ be the event that the needle intersects a line. What is $P(A)$?
Let $X$ be the perpendicular distance from one end of the needle to the line above it, and $\theta$ the angle the needle makes with the lines. Suppose that $X \sim U[0, L]$ and $\theta \sim U[0, \pi]$, independently, so that
$$f(x, \theta) = \frac{1}{L\pi}.$$
The needle touches a line iff $X \leq \ell\sin\theta$. Hence
$$P(A) = \int_{\theta=0}^{\pi}\underbrace{\frac{\ell\sin\theta}{L}}_{P(X \leq \ell\sin\theta)}\frac{1}{\pi}\,\mathrm{d}\theta = \frac{2\ell}{\pi L}.$$
Since the answer involves $\pi$, we can estimate $\pi$ by conducting repeated experiments! Suppose we have $N$ hits out of $n$ tosses. Then an estimator for $p$ is $\hat{p} = \frac{N}{n}$, and hence
$$\hat{\pi} = \frac{2\ell}{\hat{p}L}.$$
We will later find out that this is a really inefficient way of estimating $\pi$ (as you could have guessed).
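Here is a minimal Monte Carlo sketch of this estimator (my own illustration, taking $\ell = L = 1$); it simply mirrors the calculation above.

```python
import math
import random
random.seed(4)

def estimate_pi(n, ell=1.0, L=1.0):
    hits = 0
    for _ in range(n):
        x = random.uniform(0, L)          # perpendicular distance from one end to the next line
        theta = random.uniform(0, math.pi)
        if x <= ell * math.sin(theta):    # needle crosses the line
            hits += 1
    p_hat = hits / n
    return 2 * ell / (p_hat * L)

print(estimate_pi(1_000_000))   # roughly 3.14, but converging slowly
```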
5.5 The normal distribution

Definition (Normal distribution). The normal distribution with parameters $\mu, \sigma^2$, written $N(\mu, \sigma^2)$, has pdf
$$f(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
for $-\infty < x < \infty$. Its graph is the familiar bell-shaped curve centred at $\mu$.
The standard normal is when $\mu = 0$, $\sigma^2 = 1$, i.e. $X \sim N(0, 1)$. We usually write $\phi(x)$ for the pdf and $\Phi(x)$ for the cdf of the standard normal.
This is a rather important probability distribution. This is partly due to the central limit theorem, which says that if we have a large number of iid random variables, then the distribution of their average is approximately normal. Many distributions in physics and other sciences are also approximately or exactly normal.
We first have to show that this makes sense, i.e.

Proposition.
$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,\mathrm{d}x = 1.$$

Proof. Substitute $z = \frac{x - \mu}{\sigma}$. Then
$$I = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}\,\mathrm{d}z.$$
Then
$$I^2 = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,\mathrm{d}x\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}} e^{-y^2/2}\,\mathrm{d}y = \int_0^{\infty}\int_0^{2\pi}\frac{1}{2\pi} e^{-r^2/2}\, r\,\mathrm{d}\theta\,\mathrm{d}r = 1,$$
converting to polar coordinates. Since $I > 0$, we have $I = 1$.
We also have

Proposition. $E[X] = \mu$.

Proof.
$$E[X] = \frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty} x e^{-(x-\mu)^2/2\sigma^2}\,\mathrm{d}x = \frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}(x - \mu)e^{-(x-\mu)^2/2\sigma^2}\,\mathrm{d}x + \frac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}\mu e^{-(x-\mu)^2/2\sigma^2}\,\mathrm{d}x.$$
The first integrand is antisymmetric about $\mu$, so the first term vanishes. The second is just $\mu$ times the integral we did above. So we get $\mu$.

Also, by symmetry, the mode and median of a normal distribution are both $\mu$.
Proposition. $\operatorname{var}(X) = \sigma^2$.

Proof. We have $\operatorname{var}(X) = E[X^2] - (E[X])^2$. Substitute $Z = \frac{X - \mu}{\sigma}$. Then $E[Z] = 0$ and $\operatorname{var}(X) = \sigma^2\operatorname{var}(Z) = \sigma^2 E[Z^2]$. Now
$$\operatorname{var}(Z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,\mathrm{d}z = \left[-\frac{1}{\sqrt{2\pi}} z e^{-z^2/2}\right]_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,\mathrm{d}z = 0 + 1 = 1.$$
So $\operatorname{var}(X) = \sigma^2$.
Example. UK adult male heights are normally distributed with mean 70" and standard deviation 3". In the Netherlands, these figures are 71" and 3".
What is $P(Y > X)$, where $X$ and $Y$ are the heights of randomly chosen UK and Netherlands males, respectively?
We have $X \sim N(70, 3^2)$ and $Y \sim N(71, 3^2)$. Then (as we will show in later lectures) $Y - X \sim N(1, 18)$, so
$$P(Y > X) = P(Y - X > 0) = P\left(\frac{(Y - X) - 1}{\sqrt{18}} > \frac{-1}{\sqrt{18}}\right) = 1 - \Phi(-1/\sqrt{18}) = \Phi(1/\sqrt{18}) \approx 0.5931,$$
since $\frac{(Y - X) - 1}{\sqrt{18}} \sim N(0, 1)$.
Now suppose that in both countries, the Olympic male basketball teams are selected from the portion of males whose height is at least 4" above the mean (which corresponds to the 9.1% tallest males of the country). What is the probability that a randomly chosen Netherlands player is taller than a randomly chosen UK player?
For the second part, writing $\varphi_X, \varphi_Y$ for the pdfs of $X$ and $Y$, we have
$$P(Y > X \mid X \geq 74, Y \geq 75) = \frac{\int_{74}^{75}\varphi_X(x)\,\mathrm{d}x\int_{75}^{\infty}\varphi_Y(y)\,\mathrm{d}y + \int_{75}^{\infty}\int_{y=x}^{\infty}\varphi_Y(y)\varphi_X(x)\,\mathrm{d}y\,\mathrm{d}x}{\int_{74}^{\infty}\varphi_X(x)\,\mathrm{d}x\int_{75}^{\infty}\varphi_Y(y)\,\mathrm{d}y},$$
which is approximately 0.7558. So even though the Netherlands men are only slightly taller, if we consider only the tallest bunch, the Netherlands players will be much taller on average.
5.6 Transformation of random variables

We will now look at what happens when we apply a function to random variables. We first look at the simple case where there is just one variable, and then move on to the general case where we have multiple variables and can mix them together.

Single random variable

Theorem. If $X$ is a continuous random variable with pdf $f(x)$, and $h(x)$ is a continuous, strictly increasing function with $h^{-1}(x)$ differentiable, then $Y = h(X)$ is a random variable with pdf
$$f_Y(y) = f_X(h^{-1}(y))\frac{\mathrm{d}}{\mathrm{d}y}h^{-1}(y).$$

Proof.
$$F_Y(y) = P(Y \leq y) = P(h(X) \leq y) = P(X \leq h^{-1}(y)) = F(h^{-1}(y)).$$
Take the derivative with respect to $y$ to obtain
$$f_Y(y) = F_Y'(y) = f(h^{-1}(y))\frac{\mathrm{d}}{\mathrm{d}y}h^{-1}(y).$$
It is often easier to redo the proof than to remember the result.

Example. Let $X \sim U[0, 1]$ and $Y = -\log X$. Then
$$P(Y \leq y) = P(-\log X \leq y) = P(X \geq e^{-y}) = 1 - e^{-y}.$$
But this is the cumulative distribution function of $\mathcal{E}(1)$. So $Y$ is exponentially distributed with $\lambda = 1$.
In general, we get the following result:

Theorem. Let $U \sim U[0, 1]$. For any strictly increasing distribution function $F$, the random variable $X = F^{-1}(U)$ has distribution function $F$.

Proof.
$$P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x).$$

The condition "strictly increasing" is needed for the inverse to exist. If it is not satisfied, we can define
$$F^{-1}(u) = \inf\{x : F(x) \geq u\}, \quad 0 < u < 1,$$
and the same result holds.
This can also be done for discrete random variables with $P(X = x_i) = p_i$, by letting $X = x_j$ if
$$\sum_{i=0}^{j-1} p_i \leq U < \sum_{i=0}^{j} p_i,$$
for $U \sim U[0, 1]$.
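As a concrete illustration of the theorem (a sketch of mine, not from the notes): applying $F^{-1}(u) = -\log(1 - u)/\lambda$ to uniforms produces exponential samples, and the discrete version just walks through the cumulative probabilities.

```python
import math
import random
random.seed(5)

def exponential(lam):
    # F(x) = 1 - e^{-lam x}, so F^{-1}(u) = -log(1 - u) / lam
    return -math.log(1 - random.random()) / lam

def discrete(values, probs):
    # X = x_j if  sum_{i<j} p_i <= U < sum_{i<=j} p_i
    u, acc = random.random(), 0.0
    for x, p in zip(values, probs):
        acc += p
        if u < acc:
            return x
    return values[-1]

xs = [exponential(2.0) for _ in range(100_000)]
print(sum(xs) / len(xs))                        # about 1/lambda = 0.5
print(discrete(["a", "b", "c"], [0.2, 0.5, 0.3]))
```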
Multiple random variables

Suppose $X_1, X_2, \ldots, X_n$ are random variables with joint pdf $f$. Let
$$\begin{aligned}
Y_1 &= r_1(X_1, \ldots, X_n) \\
Y_2 &= r_2(X_1, \ldots, X_n) \\
&\;\;\vdots \\
Y_n &= r_n(X_1, \ldots, X_n).
\end{aligned}$$
For example, we might have $Y_1 = \frac{X_1}{X_1 + X_2}$ and $Y_2 = X_1 + X_2$.
Let $R \subseteq \mathbb{R}^n$ be such that $P((X_1, \ldots, X_n) \in R) = 1$, i.e. $R$ is the set of all values $(X_i)$ can take.
Suppose $S$ is the image of $R$ under the above transformation, and the map $R \to S$ is bijective. Then there exists an inverse function
$$\begin{aligned}
X_1 &= s_1(Y_1, \ldots, Y_n) \\
X_2 &= s_2(Y_1, \ldots, Y_n) \\
&\;\;\vdots \\
X_n &= s_n(Y_1, \ldots, Y_n).
\end{aligned}$$
For example, if $X_1, X_2$ are the coordinates of a random point in Cartesian coordinates, $Y_1, Y_2$ might be the coordinates in polar coordinates.
Definition (Jacobian determinant). Suppose $\frac{\partial s_i}{\partial y_j}$ exists and is continuous at every point $(y_1, \ldots, y_n) \in S$. Then the Jacobian determinant is
$$J = \frac{\partial(s_1, \ldots, s_n)}{\partial(y_1, \ldots, y_n)} = \det\begin{pmatrix}\frac{\partial s_1}{\partial y_1} & \cdots & \frac{\partial s_1}{\partial y_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial s_n}{\partial y_1} & \cdots & \frac{\partial s_n}{\partial y_n}\end{pmatrix}.$$
Take $A \subseteq R$ and $B = r(A)$. Then using results from IA Vector Calculus, we get
$$P((X_1, \ldots, X_n) \in A) = \int_A f(x_1, \ldots, x_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n = \int_B f\big(s_1(y_1, \ldots, y_n), \ldots, s_n(y_1, \ldots, y_n)\big)|J|\,\mathrm{d}y_1\cdots\mathrm{d}y_n = P((Y_1, \ldots, Y_n) \in B).$$
So

Proposition. $(Y_1, \ldots, Y_n)$ has density
$$g(y_1, \ldots, y_n) = f\big(s_1(y_1, \ldots, y_n), \ldots, s_n(y_1, \ldots, y_n)\big)|J|$$
if $(y_1, \ldots, y_n) \in S$, and 0 otherwise.
Example. Suppose $(X, Y)$ has density
$$f(x, y) = \begin{cases} 4xy & 0 \leq x \leq 1,\ 0 \leq y \leq 1 \\ 0 & \text{otherwise.}\end{cases}$$
We see that $X$ and $Y$ are independent, each with density $2x$ on $[0, 1]$.
Define $U = X/Y$ and $V = XY$. Then $X = \sqrt{UV}$ and $Y = \sqrt{V/U}$. The Jacobian is
$$\det\begin{pmatrix}\partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v\end{pmatrix} = \det\begin{pmatrix}\frac{1}{2}\sqrt{v/u} & \frac{1}{2}\sqrt{u/v} \\ -\frac{1}{2}\sqrt{v/u^3} & \frac{1}{2}\sqrt{1/uv}\end{pmatrix} = \frac{1}{2u}.$$
Alternatively, we can find this by computing
$$\det\begin{pmatrix}\partial u/\partial x & \partial u/\partial y \\ \partial v/\partial x & \partial v/\partial y\end{pmatrix} = 2u$$
and taking the reciprocal. So
$$g(u, v) = 4\sqrt{uv}\sqrt{\frac{v}{u}}\cdot\frac{1}{2u} = \frac{2v}{u}$$
if $(u, v)$ is in the image $S$, and 0 otherwise, i.e.
$$g(u, v) = \frac{2v}{u}\,I[(u, v) \in S].$$
Since this cannot be separated into a function of $u$ times a function of $v$ (the region $S$ is not a product region), $U$ and $V$ are not independent.
In the linear case, life is easy. Suppose
$$\mathbf{Y} = \begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix} = A\begin{pmatrix}X_1\\ \vdots\\ X_n\end{pmatrix} = A\mathbf{X}.$$
Then $\mathbf{X} = A^{-1}\mathbf{Y}$, and $\frac{\partial x_i}{\partial y_j} = (A^{-1})_{ij}$. So $|J| = |\det(A^{-1})| = |\det A|^{-1}$, and
$$g(y_1, \ldots, y_n) = \frac{1}{|\det A|}f(A^{-1}\mathbf{y}).$$
Example. Suppose $X_1, X_2$ have joint pdf $f(x_1, x_2)$, and we want to find the pdf of $Y = X_1 + X_2$. We let $Z = X_2$. Then $X_1 = Y - Z$ and $X_2 = Z$, so
$$\begin{pmatrix}Y\\ Z\end{pmatrix} = \begin{pmatrix}1 & 1\\ 0 & 1\end{pmatrix}\begin{pmatrix}X_1\\ X_2\end{pmatrix} = A\mathbf{X}.$$
Then $|J| = 1/|\det A| = 1$, so
$$g(y, z) = f(y - z, z).$$
So
$$g_Y(y) = \int_{-\infty}^{\infty} f(y - z, z)\,\mathrm{d}z = \int_{-\infty}^{\infty} f(z, y - z)\,\mathrm{d}z.$$
If $X_1$ and $X_2$ are independent, so that $f(x_1, x_2) = f_1(x_1)f_2(x_2)$, then
$$g(y) = \int_{-\infty}^{\infty} f_1(z)f_2(y - z)\,\mathrm{d}z.$$
Non-injective transformations

We previously discussed transformation of random variables by injective maps. What if the mapping is not injective? There is no simple formula in that case, and we have to work out each case individually.

Example. Suppose $X$ has pdf $f$. What is the pdf of $Y = |X|$?
We use the definition directly. We have
$$P(|X| \in (a, b)) = \int_a^b f(x)\,\mathrm{d}x + \int_{-b}^{-a} f(x)\,\mathrm{d}x = \int_a^b (f(x) + f(-x))\,\mathrm{d}x.$$
So
$$f_Y(x) = f(x) + f(-x),$$
which makes sense, since getting $|X| = x$ is equivalent to getting $X = x$ or $X = -x$.
Example. Suppose $X_1 \sim \mathcal{E}(\lambda)$ and $X_2 \sim \mathcal{E}(\mu)$ are independent random variables. Let $Y = \min(X_1, X_2)$. Then
$$P(Y \geq t) = P(X_1 \geq t, X_2 \geq t) = P(X_1 \geq t)P(X_2 \geq t) = e^{-\lambda t}e^{-\mu t} = e^{-(\lambda + \mu)t}.$$
So $Y \sim \mathcal{E}(\lambda + \mu)$.
Given random variables, not only can we ask for the minimum of the variables, but also for, say, the second smallest one. In general, we define the order statistics as follows:

Definition (Order statistics). Suppose that $X_1, \ldots, X_n$ are some random variables, and $Y_1, \ldots, Y_n$ is $X_1, \ldots, X_n$ arranged in increasing order, i.e. $Y_1 \leq Y_2 \leq \cdots \leq Y_n$. This is the order statistics. We sometimes write $Y_i = X_{(i)}$.

Assume the $X_i$ are iid with cdf $F$ and pdf $f$. Then the cdf of $Y_n$ is
$$P(Y_n \leq y) = P(X_1 \leq y, \ldots, X_n \leq y) = P(X_1 \leq y)\cdots P(X_n \leq y) = F(y)^n.$$
So the pdf of $Y_n$ is
$$\frac{\mathrm{d}}{\mathrm{d}y}F(y)^n = nf(y)F(y)^{n-1}.$$
Also,
$$P(Y_1 \geq y) = P(X_1 \geq y, \ldots, X_n \geq y) = (1 - F(y))^n.$$
What about the joint distribution of $Y_1$ and $Y_n$?
$$\begin{aligned}
G(y_1, y_n) &= P(Y_1 \leq y_1, Y_n \leq y_n) \\
&= P(Y_n \leq y_n) - P(Y_1 \geq y_1, Y_n \leq y_n) \\
&= F(y_n)^n - (F(y_n) - F(y_1))^n.
\end{aligned}$$
Then the joint pdf is
$$g(y_1, y_n) = \frac{\partial^2}{\partial y_1\,\partial y_n}G(y_1, y_n) = n(n-1)(F(y_n) - F(y_1))^{n-2}f(y_1)f(y_n).$$
We can think about this result in terms of the multinomial distribution. By definition, the probability that $Y_1 \in [y_1, y_1 + \delta)$ and $Y_n \in (y_n - \delta, y_n]$ is approximately $g(y_1, y_n)\delta^2$.
Suppose that $\delta$ is sufficiently small that the other $n - 2$ $X_i$'s are very unlikely to fall into $[y_1, y_1 + \delta)$ and $(y_n - \delta, y_n]$. Then to find the probability required, we can treat the sample space as three bins. We want exactly one $X_i$ to fall into each of the first and last bins, and $n - 2$ $X_i$'s to fall into the middle one. There are $\binom{n}{1, n-2, 1} = n(n-1)$ ways of doing so.
The probability of each point falling into the middle bin is $F(y_n) - F(y_1)$, and the probabilities of falling into the first and last bins are $f(y_1)\delta$ and $f(y_n)\delta$ respectively. So the probability of $Y_1 \in [y_1, y_1 + \delta)$ and $Y_n \in (y_n - \delta, y_n]$ is
$$n(n-1)(F(y_n) - F(y_1))^{n-2}f(y_1)f(y_n)\delta^2,$$
and the result follows.
We can also find the joint distribution of all the order statistics, say $g$, since it is just given by
$$g(y_1, \ldots, y_n) = n!\,f(y_1)\cdots f(y_n)$$
if $y_1 \leq y_2 \leq \cdots \leq y_n$, and 0 otherwise. We have this formula because there are $n!$ orderings of $x_1, \ldots, x_n$ that produce a given order statistics $y_1, \ldots, y_n$, and the pdf of each is $f(y_1)\cdots f(y_n)$.
In the case of iid exponential variables, we find a nice distribution for the order statistics.
Example. Let $X_1, \ldots, X_n$ be iid $\mathcal{E}(\lambda)$, and $Y_1, \ldots, Y_n$ be the order statistics. Let
$$Z_1 = Y_1, \quad Z_2 = Y_2 - Y_1, \quad \ldots, \quad Z_n = Y_n - Y_{n-1}.$$
These are the distances between the occurrences. We can write this as $\mathbf{Z} = A\mathbf{Y}$, with
$$A = \begin{pmatrix}1 & 0 & 0 & \cdots & 0\\ -1 & 1 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & -1 & 1\end{pmatrix}.$$
Then $\det(A) = 1$ and hence $|J| = 1$. Suppose that the pdf of $Z_1, \ldots, Z_n$ is, say, $h$. Then
$$\begin{aligned}
h(z_1, \ldots, z_n) &= g(y_1, \ldots, y_n)\cdot 1 \\
&= n!\,f(y_1)\cdots f(y_n) \\
&= n!\,\lambda^n e^{-\lambda(y_1 + \cdots + y_n)} \\
&= n!\,\lambda^n e^{-\lambda(nz_1 + (n-1)z_2 + \cdots + z_n)} \\
&= \prod_{i=1}^{n}(\lambda i)e^{-(\lambda i)z_{n+1-i}}.
\end{aligned}$$
Since $h$ is expressed as a product of $n$ density functions, we have
$$Z_i \sim \mathcal{E}\big((n + 1 - i)\lambda\big),$$
with all the $Z_i$ independent.
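A short simulation sketch of this last result (my own, not from the notes): sort $n$ iid exponentials, take the spacings, and check that $Z_i$ has mean $1/((n+1-i)\lambda)$.

```python
import math
import random
random.seed(6)

n, lam, reps = 5, 1.0, 100_000
sums = [0.0] * n
for _ in range(reps):
    y = sorted(-math.log(1 - random.random()) / lam for _ in range(n))
    z = [y[0]] + [y[i] - y[i - 1] for i in range(1, n)]   # spacings Z_1, ..., Z_n
    for i in range(n):
        sums[i] += z[i]

for i in range(n):
    # sample mean of Z_{i+1} should be close to 1 / ((n - i) * lam)
    print(sums[i] / reps, 1 / ((n - i) * lam))
```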
5.7 Moment generating functions

If $X$ is a continuous random variable, then the analogue of the probability generating function is the moment generating function:

Definition (Moment generating function). The moment generating function of a random variable $X$ is
$$m(\theta) = E[e^{\theta X}].$$
For those $\theta$ for which $m(\theta)$ is finite, we have
$$m(\theta) = \int_{-\infty}^{\infty} e^{\theta x}f(x)\,\mathrm{d}x.$$
We can prove results similar to those we had for probability generating functions.
We will assume the following without proof:

Theorem. The mgf determines the distribution of $X$, provided $m(\theta)$ is finite for all $\theta$ in some interval containing the origin.

Definition (Moment). The $r$th moment of $X$ is $E[X^r]$.

Theorem. The $r$th moment of $X$ is the coefficient of $\frac{\theta^r}{r!}$ in the power series expansion of $m(\theta)$, and is given by
$$E[X^r] = \left.\frac{\mathrm{d}^r}{\mathrm{d}\theta^r}m(\theta)\right|_{\theta=0} = m^{(r)}(0).$$

Proof. We have
$$e^{\theta X} = 1 + \theta X + \frac{\theta^2}{2!}X^2 + \cdots,$$
so
$$m(\theta) = E[e^{\theta X}] = 1 + \theta E[X] + \frac{\theta^2}{2!}E[X^2] + \cdots.$$
Example. Let $X \sim \mathcal{E}(\lambda)$. Then its mgf is
$$E[e^{\theta X}] = \int_0^{\infty} e^{\theta x}\lambda e^{-\lambda x}\,\mathrm{d}x = \lambda\int_0^{\infty} e^{-(\lambda - \theta)x}\,\mathrm{d}x = \frac{\lambda}{\lambda - \theta}$$
for $\theta < \lambda$. So
$$E[X] = m'(0) = \left.\frac{\lambda}{(\lambda - \theta)^2}\right|_{\theta=0} = \frac{1}{\lambda}.$$
Also,
$$E[X^2] = m''(0) = \left.\frac{2\lambda}{(\lambda - \theta)^3}\right|_{\theta=0} = \frac{2}{\lambda^2}.$$
So
$$\operatorname{var}(X) = E[X^2] - E[X]^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$
Theorem. If $X$ and $Y$ are independent random variables with moment generating functions $m_X(\theta), m_Y(\theta)$, then $X + Y$ has mgf $m_{X+Y}(\theta) = m_X(\theta)m_Y(\theta)$.

Proof.
$$E[e^{\theta(X+Y)}] = E[e^{\theta X}e^{\theta Y}] = E[e^{\theta X}]E[e^{\theta Y}] = m_X(\theta)m_Y(\theta).$$
6 More distributions

6.1 Cauchy distribution

Definition (Cauchy distribution). The Cauchy distribution has pdf
$$f(x) = \frac{1}{\pi(1 + x^2)}$$
for $-\infty < x < \infty$.

We check that this is a genuine distribution:
$$\int_{-\infty}^{\infty}\frac{1}{\pi(1 + x^2)}\,\mathrm{d}x = \int_{-\pi/2}^{\pi/2}\frac{1}{\pi}\,\mathrm{d}\theta = 1,$$
with the substitution $x = \tan\theta$. The distribution is a bell-shaped curve.

Proposition. The mean of the Cauchy distribution is undefined, while $E[X^2] = \infty$.

Proof.
$$E[X] = \int_0^{\infty}\frac{x}{\pi(1 + x^2)}\,\mathrm{d}x + \int_{-\infty}^{0}\frac{x}{\pi(1 + x^2)}\,\mathrm{d}x = \infty - \infty,$$
which is undefined, but $E[X^2] = \infty + \infty = \infty$.
Suppose $X, Y$ are independent Cauchy random variables, and let $Z = X + Y$. Then
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)f_Y(z - x)\,\mathrm{d}x = \int_{-\infty}^{\infty}\frac{1}{\pi^2}\frac{1}{(1 + x^2)(1 + (z - x)^2)}\,\mathrm{d}x = \frac{1/2}{\pi(1 + (z/2)^2)}$$
for all $-\infty < z < \infty$ (the integral can be evaluated using a tedious partial fraction expansion).
So $\frac{1}{2}Z$ has a Cauchy distribution; in other words, the arithmetic mean of two independent Cauchy random variables is again a Cauchy random variable.
By induction, we can show that $\frac{1}{n}(X_1 + \cdots + X_n)$ follows a Cauchy distribution. This becomes a "counter-example" to things like the weak law of large numbers and the central limit theorem. Of course, this is because those theorems require the random variable to have a mean, which the Cauchy distribution lacks.

Example.
(i) If $\Theta \sim U[-\frac{\pi}{2}, \frac{\pi}{2}]$, then $X = \tan\Theta$ has a Cauchy distribution. For example, if we fire a bullet at a wall 1 metre away at a random angle $\theta$, the vertical displacement $X = \tan\theta$ follows a Cauchy distribution.
(ii) If $X, Y \sim N(0, 1)$ are independent, then $X/Y$ has a Cauchy distribution.
6.2 Gamma distribution

Suppose $X_1, \ldots, X_n$ are iid $\mathcal{E}(\lambda)$, and let $S_n = X_1 + \cdots + X_n$. Then the mgf of $S_n$ is
$$E\big[e^{\theta(X_1 + \cdots + X_n)}\big] = E\big[e^{\theta X_1}\big]\cdots E\big[e^{\theta X_n}\big] = \big(E[e^{\theta X}]\big)^n = \left(\frac{\lambda}{\lambda - \theta}\right)^n.$$
We call this distribution the gamma distribution. We claim that it has the following pdf:

Definition (Gamma distribution). The gamma distribution $\Gamma(n, \lambda)$ has pdf
$$f(x) = \frac{\lambda^n x^{n-1}e^{-\lambda x}}{(n-1)!}$$
for $x \geq 0$.
We can show that this is a distribution by showing that it integrates to 1.
We now show that this is indeed the distribution of the sum of $n$ iid $\mathcal{E}(\lambda)$, by computing its mgf:
$$E[e^{\theta X}] = \int_0^{\infty} e^{\theta x}\frac{\lambda^n x^{n-1}e^{-\lambda x}}{(n-1)!}\,\mathrm{d}x = \left(\frac{\lambda}{\lambda - \theta}\right)^n\int_0^{\infty}\frac{(\lambda - \theta)^n x^{n-1}e^{-(\lambda - \theta)x}}{(n-1)!}\,\mathrm{d}x.$$
The integral is just the integral of the $\Gamma(n, \lambda - \theta)$ density, which gives one. So we have
$$E[e^{\theta X}] = \left(\frac{\lambda}{\lambda - \theta}\right)^n.$$
6.3 Beta distribution*

Suppose $X_1, \ldots, X_n$ are iid $U[0, 1]$, and let $Y_1 \leq Y_2 \leq \cdots \leq Y_n$ be the order statistics. Then the pdf of $Y_i$ is
$$f(y) = \frac{n!}{(i-1)!(n-i)!}y^{i-1}(1 - y)^{n-i}.$$
Note that the leading term is the multinomial coefficient $\binom{n}{i-1,\,1,\,n-i}$. The formula is obtained using the same technique as for finding the pdf of the order statistics.
This is the beta distribution: $Y_i \sim \beta(i, n - i + 1)$. In general:

Definition (Beta distribution). The beta distribution $\beta(a, b)$ has pdf
$$f(x; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1 - x)^{b-1}$$
for $0 \leq x \leq 1$.
This has mean $a/(a + b)$. Its moment generating function is
$$m(\theta) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{a + r}{a + b + r}\right)\frac{\theta^k}{k!},$$
which is horrendous!
6.4 More on the normal distribution

Proposition. The moment generating function of $N(\mu, \sigma^2)$ is
$$E[e^{\theta X}] = \exp\left(\theta\mu + \frac{1}{2}\theta^2\sigma^2\right).$$

Proof.
$$E[e^{\theta X}] = \int_{-\infty}^{\infty} e^{\theta x}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,\mathrm{d}x.$$
Substitute $z = \frac{x - \mu}{\sigma}$. Then
$$E[e^{\theta X}] = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{\theta(\mu + \sigma z)}e^{-\frac{1}{2}z^2}\,\mathrm{d}z = e^{\theta\mu + \frac{1}{2}\theta^2\sigma^2}\int_{-\infty}^{\infty}\underbrace{\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(z - \theta\sigma)^2}}_{\text{pdf of }N(\sigma\theta,\,1)}\,\mathrm{d}z = e^{\theta\mu + \frac{1}{2}\theta^2\sigma^2}.$$
Theorem. Suppose $X, Y$ are independent random variables with $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$. Then
(i) $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$,
(ii) $aX \sim N(a\mu_1, a^2\sigma_1^2)$.

Proof.
(i)
$$E[e^{\theta(X+Y)}] = E[e^{\theta X}]\cdot E[e^{\theta Y}] = e^{\mu_1\theta + \frac{1}{2}\sigma_1^2\theta^2}\cdot e^{\mu_2\theta + \frac{1}{2}\sigma_2^2\theta^2} = e^{(\mu_1 + \mu_2)\theta + \frac{1}{2}(\sigma_1^2 + \sigma_2^2)\theta^2},$$
which is the mgf of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
(ii)
$$E[e^{\theta(aX)}] = E[e^{(\theta a)X}] = e^{\mu_1(a\theta) + \frac{1}{2}\sigma_1^2(a\theta)^2} = e^{(a\mu_1)\theta + \frac{1}{2}(a^2\sigma_1^2)\theta^2},$$
which is the mgf of $N(a\mu_1, a^2\sigma_1^2)$.
Finally, suppose $X \sim N(0, 1)$. Write $\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ for its pdf. It would be very difficult to find a closed form for its cumulative distribution function, but we can find an upper bound for the tail:
$$P(X \geq x) = \int_x^{\infty}\phi(t)\,\mathrm{d}t \leq \int_x^{\infty}\left(1 + \frac{1}{t^2}\right)\phi(t)\,\mathrm{d}t = \frac{1}{x}\phi(x).$$
To see that the last step works, simply check that $\frac{\mathrm{d}}{\mathrm{d}t}\left(-\frac{1}{t}\phi(t)\right) = \left(1 + \frac{1}{t^2}\right)\phi(t)$. So
$$P(X \geq x) \leq \frac{1}{x}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2},$$
and hence, roughly, $\log P(X \geq x) \sim -\frac{1}{2}x^2$ for large $x$.
6.5 Multivariate normal

Let $X_1, \ldots, X_n$ be iid $N(0, 1)$. Then their joint density is
$$g(x_1, \ldots, x_n) = \prod_{i=1}^{n}\phi(x_i) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x_i^2} = \frac{1}{(2\pi)^{n/2}}e^{-\frac{1}{2}\sum_{i=1}^{n}x_i^2} = \frac{1}{(2\pi)^{n/2}}e^{-\frac{1}{2}\mathbf{x}^T\mathbf{x}},$$
where $\mathbf{x} = (x_1, \ldots, x_n)^T$.
This result works when $X_1, \ldots, X_n$ are iid $N(0, 1)$. Suppose now that we are interested in
$$\mathbf{Z} = \boldsymbol{\mu} + A\mathbf{X},$$
where $A$ is an invertible $n \times n$ matrix. We can think of this as $n$ measurements $\mathbf{Z}$ that are affected by underlying standard-normal factors $\mathbf{X}$. Then
$$\mathbf{X} = A^{-1}(\mathbf{Z} - \boldsymbol{\mu})$$
and
$$|J| = |\det(A^{-1})| = \frac{1}{|\det A|}.$$
So
$$f(z_1, \ldots, z_n) = \frac{1}{(2\pi)^{n/2}}\frac{1}{|\det A|}\exp\left(-\frac{1}{2}(A^{-1}(\mathbf{z} - \boldsymbol{\mu}))^T(A^{-1}(\mathbf{z} - \boldsymbol{\mu}))\right) = \frac{1}{(2\pi)^{n/2}\sqrt{\det\Sigma}}\exp\left(-\frac{1}{2}(\mathbf{z} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{z} - \boldsymbol{\mu})\right),$$
where $\Sigma = AA^T$ and $\Sigma^{-1} = (A^{-1})^T A^{-1}$, noting that $|\det A| = \sqrt{\det\Sigma}$. We say
$$\mathbf{Z} = \begin{pmatrix}Z_1\\ \vdots\\ Z_n\end{pmatrix} \sim MVN(\boldsymbol{\mu}, \Sigma) \text{ or } N(\boldsymbol{\mu}, \Sigma).$$
This is the multivariate normal.
What is this matrix $\Sigma$? Recall that $\operatorname{cov}(Z_i, Z_j) = E[(Z_i - \mu_i)(Z_j - \mu_j)]$. It turns out this covariance is the $(i, j)$th entry of $\Sigma$, since
$$E[(\mathbf{Z} - \boldsymbol{\mu})(\mathbf{Z} - \boldsymbol{\mu})^T] = E[A\mathbf{X}(A\mathbf{X})^T] = E[A\mathbf{X}\mathbf{X}^TA^T] = A\,E[\mathbf{X}\mathbf{X}^T]A^T = AIA^T = AA^T = \Sigma.$$
So we also call $\Sigma$ the covariance matrix.
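The construction $\mathbf{Z} = \boldsymbol{\mu} + A\mathbf{X}$ is also how multivariate normal samples are typically generated in practice: choose any $A$ with $AA^T = \Sigma$, e.g. the Cholesky factor. Below is a small numpy sketch of this idea (mine; the particular $\Sigma$ is an arbitrary example).

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # an arbitrary covariance matrix

A = np.linalg.cholesky(Sigma)           # A @ A.T == Sigma
X = rng.standard_normal((100_000, 2))   # rows are iid N(0, 1) factors
Z = mu + X @ A.T                        # rows are samples of N(mu, Sigma)

print(Z.mean(axis=0))                   # close to mu
print(np.cov(Z, rowvar=False))          # close to Sigma
```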
In the special case where $n = 1$, this is just the usual normal distribution, with $\Sigma = \sigma^2$.
Now suppose $Z_1, \ldots, Z_n$ have pairwise covariance 0. Then $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, and
$$f(z_1, \ldots, z_n) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma_i}e^{-\frac{1}{2\sigma_i^2}(z_i - \mu_i)^2}.$$
So $Z_1, \ldots, Z_n$ are independent, with $Z_i \sim N(\mu_i, \sigma_i^2)$.
Here we proved that if the covariances are 0, then the variables are independent. However, this is only true when the $Z_i$ are jointly (multivariate) normal. It is generally not true for arbitrary distributions.
For random variables that are vectors, we also need to modify our definition of the moment generating function. We define it to be
$$m(\boldsymbol{\theta}) = E[e^{\boldsymbol{\theta}^T\mathbf{X}}] = E[e^{\theta_1X_1 + \cdots + \theta_nX_n}].$$
Bivariate normal

This is the special case of the multivariate normal with $n = 2$. Since there aren't too many terms, we can actually write them out.
The bivariate normal has
$$\Sigma = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}.$$
Then
$$\operatorname{corr}(X_1, X_2) = \frac{\operatorname{cov}(X_1, X_2)}{\sqrt{\operatorname{var}(X_1)\operatorname{var}(X_2)}} = \frac{\rho\sigma_1\sigma_2}{\sigma_1\sigma_2} = \rho,$$
and
$$\Sigma^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix}\sigma_1^{-2} & -\rho\sigma_1^{-1}\sigma_2^{-1}\\ -\rho\sigma_1^{-1}\sigma_2^{-1} & \sigma_2^{-2}\end{pmatrix}.$$
The joint pdf of the bivariate normal with zero mean is
$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\exp\left(-\frac{1}{2(1 - \rho^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2\rho x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)\right).$$
If the mean is non-zero, replace $x_i$ with $x_i - \mu_i$.
The joint mgf of the bivariate normal is
$$m(\theta_1, \theta_2) = \exp\left(\theta_1\mu_1 + \theta_2\mu_2 + \frac{1}{2}\big(\theta_1^2\sigma_1^2 + 2\theta_1\theta_2\rho\sigma_1\sigma_2 + \theta_2^2\sigma_2^2\big)\right).$$
Nice and elegant.
7 Central limit theorem

Suppose $X_1, \ldots, X_n$ are iid random variables with mean $\mu$ and variance $\sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Then we have previously shown that
$$\operatorname{var}\!\left(\frac{S_n}{\sqrt{n}}\right) = \sigma^2.$$

Theorem (Central limit theorem). Let $X_1, X_2, \ldots$ be iid random variables with $E[X_i] = \mu$, $\operatorname{var}(X_i) = \sigma^2 < \infty$. Define
$$S_n = X_1 + \cdots + X_n.$$
Then for all finite intervals $(a, b)$,
$$\lim_{n \to \infty} P\left(a \leq \frac{S_n - n\mu}{\sigma\sqrt{n}} \leq b\right) = \int_a^b\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}t^2}\,\mathrm{d}t.$$
Note that the integrand is the pdf of a standard normal. We say
$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{D} N(0, 1).$$
To show this, we will use the continuity theorem without proof:

Theorem (Continuity theorem). If the random variables $X_1, X_2, \ldots$ have mgfs $m_1(\theta), m_2(\theta), \ldots$ and $m_n(\theta) \to m(\theta)$ as $n \to \infty$ for all $\theta$, then $X_n \xrightarrow{D}$ the random variable with mgf $m(\theta)$.
We now provide a sketch proof of the central limit theorem:

Proof. wlog, assume $\mu = 0$ and $\sigma^2 = 1$ (otherwise replace $X_i$ with $\frac{X_i - \mu}{\sigma}$). Then
$$m_{X_i}(\theta) = E[e^{\theta X_i}] = 1 + \theta E[X_i] + \frac{\theta^2}{2!}E[X_i^2] + \cdots = 1 + \frac{1}{2}\theta^2 + \frac{1}{3!}\theta^3E[X_i^3] + \cdots.$$
Now consider $S_n/\sqrt{n}$. Then
$$\begin{aligned}
E[e^{\theta S_n/\sqrt{n}}] &= E[e^{\theta(X_1 + \cdots + X_n)/\sqrt{n}}] \\
&= E[e^{\theta X_1/\sqrt{n}}]\cdots E[e^{\theta X_n/\sqrt{n}}] \\
&= \big(E[e^{\theta X_1/\sqrt{n}}]\big)^n \\
&= \left(1 + \frac{1}{2}\theta^2\frac{1}{n} + \frac{1}{3!}\theta^3E[X^3]\frac{1}{n^{3/2}} + \cdots\right)^n \\
&\to e^{\frac{1}{2}\theta^2}
\end{aligned}$$
as $n \to \infty$, since $(1 + a/n)^n \to e^a$. And this is the mgf of the standard normal. So the result follows from the continuity theorem.

Note that this is not a very formal proof, since we had to require $E[X^3]$ to be finite. Also, sometimes the moment generating function is not defined. But this argument works for many "nice" distributions we will meet. The proper proof uses the characteristic function
$$\chi_X(\theta) = E[e^{i\theta X}].$$
].
An important application is to use the normal distribution to approximate a
large binomial.
Let
X
i
B
(1
, p
). Then
S
n
B
(
n, p
). So
E
[
S
n
] =
np
and
var
(
S
n
) =
p
(1
p
).
So
S
n
np
p
np(1 p)
D
N(0, 1).
Example. Suppose two planes fly a route. Each of $n$ passengers chooses a plane at random, so the number of people choosing plane 1 is $S \sim B(n, \frac{1}{2})$. Suppose each plane has $s$ seats. What is
$$F(s) = P(S > s),$$
i.e. the probability that plane 1 is over-booked? We have
$$F(s) = P(S > s) = P\left(\frac{S - n/2}{\sqrt{n\cdot\frac{1}{2}\cdot\frac{1}{2}}} > \frac{s - n/2}{\frac{1}{2}\sqrt{n}}\right).$$
Since
$$\frac{S - n/2}{\frac{1}{2}\sqrt{n}} \approx N(0, 1),$$
we have
$$F(s) \approx 1 - \Phi\left(\frac{s - n/2}{\frac{1}{2}\sqrt{n}}\right).$$
For example, if $n = 1000$ and $s = 537$, then
$$\frac{s - n/2}{\frac{1}{2}\sqrt{n}} \approx 2.34, \quad \Phi(2.34) \approx 0.99,$$
so $F(s) \approx 0.01$. So with only 74 seats as a buffer between the two planes, the probability of overbooking is just 1/100.
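We can check how good the normal approximation is here by computing the exact binomial tail. A quick sketch of mine, using only the standard library:

```python
import math

n, s = 1000, 537

# Exact tail: P(S > s) for S ~ B(n, 1/2)
exact = sum(math.comb(n, k) for k in range(s + 1, n + 1)) / 2 ** n

# Normal approximation: 1 - Phi((s - n/2) / (sqrt(n)/2)), using Phi(z) = (1 + erf(z/sqrt(2)))/2
z = (s - n / 2) / (math.sqrt(n) / 2)
approx = 0.5 * (1 - math.erf(z / math.sqrt(2)))

print(exact, approx)   # both close to 0.01
```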
Example. An unknown proportion $p$ of the electorate will vote Labour. It is desired to find $p$ with an error not exceeding 0.005. How large should the sample be?
We estimate $p$ by
$$\hat{p} = \frac{S_n}{n},$$
where $X_i \sim B(1, p)$. Then
$$P(|\hat{p} - p| \leq 0.005) = P(|S_n - np| \leq 0.005n) = P\Bigg(\underbrace{\frac{|S_n - np|}{\sqrt{np(1 - p)}}}_{\approx N(0,1)} \leq \frac{0.005n}{\sqrt{np(1 - p)}}\Bigg).$$
We want $|\hat{p} - p| \leq 0.005$ with probability at least 0.95. So we want
$$\frac{0.005n}{\sqrt{np(1 - p)}} \geq \Phi^{-1}(0.975) = 1.96$$
(we use 0.975 instead of 0.95 since we are doing a two-tailed test). Since the maximum possible value of $p(1 - p)$ is $1/4$, it suffices to have
$$n \geq 38416.$$
In practice, we don't have that many samples. Instead, we settle for
$$P(|\hat{p} - p| \leq 0.03) \geq 0.95.$$
This just requires $n \geq 1068$.
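The same arithmetic in code, as a tiny sketch of mine: solve $0.005\sqrt{n}/\sqrt{p(1-p)} \geq 1.96$ with the worst case $p(1-p) = 1/4$.

```python
def sample_size(margin, z=1.96, p=0.5):
    # Need margin * sqrt(n) / sqrt(p (1 - p)) >= z, i.e. n >= z^2 p (1 - p) / margin^2
    return z ** 2 * p * (1 - p) / margin ** 2

print(sample_size(0.005))   # about 38416
print(sample_size(0.03))    # about 1067, so take n = 1068
```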
Example (Estimating $\pi$ with Buffon's needle). Recall that if we randomly toss a needle of length $\ell$ onto a floor marked with parallel lines a distance $L$ apart, the probability that the needle hits a line is $p = \frac{2\ell}{\pi L}$.
Suppose we toss the needle $n$ times, and it hits a line $N$ times. Then
$$N \approx N(np, np(1 - p))$$
by the central limit theorem. Write $\hat{p}$ for the observed proportion $N/n$. Then
$$\hat{\pi} = \frac{2\ell}{\hat{p}L} = \frac{\pi\cdot 2\ell/(\pi L)}{\hat{p}} = \frac{\pi p}{p + (\hat{p} - p)} = \pi\left(1 - \frac{\hat{p} - p}{p} + \cdots\right).$$
Hence
$$\hat{\pi} - \pi \approx -\pi\frac{\hat{p} - p}{p}.$$
We know
$$\hat{p} \approx N\left(p, \frac{p(1 - p)}{n}\right),$$
so we find
$$\hat{\pi} - \pi \approx N\left(0, \frac{\pi^2p(1 - p)}{np^2}\right) = N\left(0, \frac{\pi^2(1 - p)}{np}\right).$$
We want a small variance, which means making $p$ as large as possible. Since $p = 2\ell/(\pi L)$, this is maximized by taking $\ell = L$. In this case,
$$p = \frac{2}{\pi},$$
and
$$\hat{\pi} - \pi \approx N\left(0, \frac{(\pi - 2)\pi^2}{2n}\right).$$
If we want to estimate $\pi$ to 3 decimal places, then we need
$$P(|\hat{\pi} - \pi| \leq 0.001) \geq 0.95.$$
This is true if
$$0.001\sqrt{\frac{2n}{(\pi - 2)\pi^2}} \geq \Phi^{-1}(0.975) = 1.96,$$
i.e. $n \geq 2.16 \times 10^7$. So we can obtain $\pi$ to 3 decimal places just by throwing a stick 20 million times! Isn't that exciting?