II Probability and Measure (Full)

Part II — Probability and Measure

Based on lectures by J. Miller

Notes taken by Dexter Chua

Michaelmas 2016

These notes are not endorsed by the lecturers, and I have modified them (often

significantly) after lectures. They are nowhere near accurate representations of what

was actually lectured, and in particular, all errors are almost surely mine.

Analysis II is essential

Measure spaces, σ-algebras, π-systems and uniqueness of extension, statement *and proof* of Carathéodory's extension theorem. Construction of Lebesgue measure on $\mathbb{R}$. The Borel σ-algebra of $\mathbb{R}$. Existence of non-measurable subsets of $\mathbb{R}$. Lebesgue-Stieltjes measures and probability distribution functions. Independence of events, independence of σ-algebras. The Borel–Cantelli lemmas. Kolmogorov's zero-one law. [6]

Measurable functions, random variables, independence of random variables. Construc-

tion of the integral, expectation. Convergence in measure and convergence almost

everywhere. Fatou’s lemma, monotone and dominated convergence, differentiation

under the integral sign. Discussion of product measure and statement of Fubini’s

theorem. [6]

Chebyshev's inequality, tail estimates. Jensen's inequality. Completeness of $L^p$ for $1 \le p \le \infty$. The Hölder and Minkowski inequalities, uniform integrability. [4]

$L^2$ as a Hilbert space. Orthogonal projection, relation with elementary conditional

probability. Variance and covariance. Gaussian random variables, the multivariate

normal distribution. [2]

The strong law of large numbers, proof for independent random variables with bounded

fourth moments. Measure preserving transformations, Bernoulli shifts. Statements

*and proofs* of maximal ergodic theorem and Birkhoff’s almost everywhere ergodic

theorem, proof of the strong law. [4]

The Fourier transform of a finite measure, characteristic functions, uniqueness and

inversion. Weak convergence, statement of Lévy's convergence theorem for characteristic

functions. The central limit theorem. [2]

Contents

0 Introduction

1 Measures

1.1 Measures

1.2 Probability measures

2 Measurable functions and random variables

2.1 Measurable functions

2.2 Constructing new measures

2.3 Random variables

2.4 Convergence of measurable functions

2.5 Tail events

3 Integration

3.1 Definition and basic properties

3.2 Integrals and limits

3.3 New measures from old

3.4 Integration and differentiation

3.5 Product measures and Fubini’s theorem

4 Inequalities and $L^p$ spaces

4.1 Four inequalities

4.2 $L^p$ spaces

4.3 Orthogonal projection in $L^2$

4.4 Convergence in $L^1(\mathbb{P})$ and uniform integrability

5 Fourier transform

5.1 The Fourier transform

5.2 Convolutions

5.3 Fourier inversion formula

5.4 Fourier transform in $L^2$

5.5 Properties of characteristic functions

5.6 Gaussian random variables

6 Ergodic theory

6.1 Ergodic theorems

7 Big theorems

7.1 The strong law of large numbers

7.2 Central limit theorem

0 Introduction

In measure theory, the main idea is that we want to assign “sizes” to different

sets. For example, we might think $[0, 2] \subseteq \mathbb{R}$ has size 2, while perhaps $\mathbb{Q} \subseteq \mathbb{R}$ has size 0. This is known as a measure. One of the main applications of a measure

is that we can use it to come up with a new definition of an integral. The idea

is very simple, but it is going to be very powerful mathematically.

Recall that if $f : [0, 1] \to \mathbb{R}$ is continuous, then the Riemann integral of $f$ is defined as follows:

(i) Take a partition $0 = t_0 < t_1 < \cdots < t_n = 1$ of $[0, 1]$.

(ii) Consider the Riemann sum
\[ \sum_{j=1}^n f(t_j)(t_j - t_{j-1}). \]

(iii) The Riemann integral is
\[ \int f = \text{limit of Riemann sums as the mesh size of the partition} \to 0. \]


The idea of measure theory is to use a different approximation scheme. Instead

of partitioning the domain, we partition the range of the function. We fix some

numbers $r_0 < r_1 < \cdots < r_n$. We then approximate the integral of $f$ by
\[ \sum_{j=1}^n r_j \cdot \big(\text{“size of } f^{-1}([r_{j-1}, r_j])\text{”}\big). \]

We then define the integral as the limit of approximations of this type as the

mesh size of the partition → 0.

We can make an analogy with bankers — If a Riemann banker is given a stack

of money, they would just add the values of the money in order. A measure-

theoretic banker will sort the bank notes according to the type, and then find

the total value by multiplying the number of each type by the value, and adding

up.
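To make the two approximation schemes concrete, here is a minimal Python sketch, assuming the test function $f(x) = x^2$ on $[0, 1]$ (this choice, and the function names, are ours and not part of the lectures). Since this $f$ is increasing, the preimage of $[r_{j-1}, r_j]$ is the interval $[\sqrt{r_{j-1}}, \sqrt{r_j}]$, so its "size" is just its length.

```python
import math

def riemann_sum(f, n):
    """Partition the domain [0, 1] into n equal pieces, using right endpoints."""
    return sum(f(j / n) * (1 / n) for j in range(1, n + 1))

def range_partition_sum(n):
    """Partition the range [0, 1] of f(x) = x**2 into levels r_j = j / n.

    For this increasing f, the preimage of [r_{j-1}, r_j] is the interval
    [sqrt(r_{j-1}), sqrt(r_j)], whose "size" is its length.
    """
    total = 0.0
    for j in range(1, n + 1):
        lo, hi = (j - 1) / n, j / n
        size_of_preimage = math.sqrt(hi) - math.sqrt(lo)
        total += hi * size_of_preimage
    return total

f = lambda x: x ** 2
approx_domain = riemann_sum(f, 1000)      # the "Riemann banker"
approx_range = range_partition_sum(1000)  # the "measure-theoretic banker"
```

Both values should be close to the true integral $1/3$. For a general $f$ the hard part is assigning a size to $f^{-1}([r_{j-1}, r_j])$, which is exactly what a measure provides.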

Why would we want to do so? It turns out this leads to a much more

general theory of integration on much more general spaces. Instead of integrating

functions $[a, b] \to \mathbb{R}$ only, we can replace the domain with any measure space. Even in the context of $\mathbb{R}$, this theory of integration is much more powerful

than the Riemann sum, and can integrate a much wider class of functions. While

you probably don’t care about those pathological functions anyway, being able

to integrate more things means that we can state more general theorems about

integration without having to put in funny conditions.

That was all about measures. What about probability? It turns out the

concepts we develop for measures correspond exactly to many familiar notions

from probability if we restrict it to the particular case where the total measure

of the space is 1. Thus, when studying measure theory, we are also secretly

studying probability!

1 Measures

In the course, we will write $f_n \nearrow f$ for “$f_n$ converges to $f$ monotonically increasingly”, and $f_n \searrow f$ similarly. Unless otherwise specified, convergence is taken to be pointwise.

1.1 Measures

The starting point of all these is to come up with a function that determines

the “size” of a given set, known as a measure. It turns out we cannot sensibly

define a size for all subsets of $[0, 1]$. Thus, we need to restrict our attention to a

collection of “nice” subsets. Specifying which subsets are “nice” would involve

specifying a σ-algebra.

This section is mostly technical.

Definition (σ-algebra). Let $E$ be a set. A σ-algebra $\mathcal{E}$ on $E$ is a collection of subsets of $E$ such that

(i) $\emptyset \in \mathcal{E}$.

(ii) $A \in \mathcal{E}$ implies that $A^C = E \setminus A \in \mathcal{E}$.

(iii) For any sequence $(A_n)$ in $\mathcal{E}$, we have that $\bigcup_n A_n \in \mathcal{E}$.

The pair $(E, \mathcal{E})$ is called a measurable space.

Note that the axioms imply that σ-algebras are also closed under countable intersections, as we have $A \cap B = (A^C \cup B^C)^C$.

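Definition (Measure). A measure on a measurable space $(E, \mathcal{E})$ is a function $\mu : \mathcal{E} \to [0, \infty]$ such that

(i) $\mu(\emptyset) = 0$.

(ii) Countable additivity: for any disjoint sequence $(A_n)$ in $\mathcal{E}$, we have
\[ \mu\Big(\bigcup_n A_n\Big) = \sum_{n=1}^\infty \mu(A_n). \]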

Example. Let $E$ be any countable set, and $\mathcal{E} = P(E)$ be the set of all subsets of $E$. A mass function is any function $m : E \to [0, \infty]$. We can then define a measure by setting
\[ \mu(A) = \sum_{x \in A} m(x). \]
In particular, if we put $m(x) = 1$ for all $x \in E$, then we obtain the counting measure.
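As an illustration, the construction of a measure from a mass function can be sketched in Python for a finite set. The four-point set and its masses below are hypothetical, chosen only for the example, and exact rationals are used so that additivity can be checked with equality.

```python
from fractions import Fraction

def measure_from_mass(mass):
    """Return mu with mu(A) = sum of mass[x] over x in A, for finite A."""
    def mu(subset):
        return sum((mass[x] for x in subset), Fraction(0))
    return mu

# Hypothetical mass function on E = {1, 2, 3, 4} with total mass 1.
mass = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
mu = measure_from_mass(mass)

# Counting measure: every point has mass 1.
counting = measure_from_mass({x: Fraction(1) for x in mass})

# Additivity on two disjoint sets.
A, B = {1, 2}, {4}
additive = mu(A | B) == mu(A) + mu(B)
```

Here `mu(set())` is 0 because the empty sum is 0, and `counting(A)` simply returns the number of elements of `A`.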

Countable spaces are nice, because we can always take $\mathcal{E} = P(E)$, and the

measure can be defined on all possible subsets. However, for “bigger” spaces, we

have to be more careful. The set of all subsets is often “too large”. We will see

a concrete and also important example of this later.

In general, σ-algebras are often described on large spaces in terms of a smaller set, known as the generating sets.

Definition (Generator of σ-algebra). Let $E$ be a set, and $\mathcal{A} \subseteq P(E)$ be a collection of subsets of $E$. We define
\[ \sigma(\mathcal{A}) = \{A \subseteq E : A \in \mathcal{E} \text{ for all $\sigma$-algebras } \mathcal{E} \text{ that contain } \mathcal{A}\}. \]
In other words, $\sigma(\mathcal{A})$ is the smallest σ-algebra that contains $\mathcal{A}$. This is known as the σ-algebra generated by $\mathcal{A}$.
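On a finite set, the generated σ-algebra can be computed directly by closing the generators under complements and unions. The following Python sketch does this; the function name and the example generators are ours, and the approach relies on $E$ being finite, so that countable unions reduce to finite ones.

```python
from itertools import combinations

def generated_sigma_algebra(E, generators):
    """Close the generators under complement and pairwise union.

    On a finite set E, countable unions reduce to finite unions, so this
    closure is the smallest sigma-algebra containing the generators.
    """
    E = frozenset(E)
    sets = {frozenset(), E} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for A in current:
            if E - A not in sets:
                sets.add(E - A)
                changed = True
        for A, B in combinations(current, 2):
            if A | B not in sets:
                sets.add(A | B)
                changed = True
    return sets

E = {1, 2, 3, 4}
sigma = generated_sigma_algebra(E, [{1}, {1, 2}])
all_subsets = generated_sigma_algebra(E, [{x} for x in E])
```

With generators $\{1\}$ and $\{1, 2\}$, the result cannot separate 3 from 4, so it is strictly smaller than $P(E)$. Generating by all singletons recovers every subset.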

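Example. Take $E = \mathbb{Z}$, and $\mathcal{A} = \{\{x\} : x \in \mathbb{Z}\}$. Then $\sigma(\mathcal{A})$ is just $P(E)$, since every subset of $E$ can be written as a countable union of singletons.

Example. Take $E = \mathbb{Z}$, and let $\mathcal{A} = \{\{x, x+1, x+2, x+3, \cdots\} : x \in E\}$. Then again $\sigma(\mathcal{A})$ is the set of all subsets of $E$, since each singleton is a difference of two generators: $\{x\} = \{x, x+1, \cdots\} \setminus \{x+1, x+2, \cdots\}$.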

The following is the most important σ-algebra in the course:

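Definition (Borel σ-algebra). Let $E = \mathbb{R}$, and $\mathcal{A} = \{U \subseteq \mathbb{R} : U \text{ is open}\}$. Then $\sigma(\mathcal{A})$ is known as the Borel σ-algebra, which is not the set of all subsets of $\mathbb{R}$.

We can equivalently define this by $\tilde{\mathcal{A}} = \{(a, b) : a < b,\ a, b \in \mathbb{Q}\}$. Then $\sigma(\tilde{\mathcal{A}})$ is also the Borel σ-algebra.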

Often, we would like to prove results that allow us to deduce properties about the σ-algebra just by checking it on a generating set. However, usually, we cannot just check it on an arbitrary generating set. Instead, the generating set has to satisfy some nice closure properties. We are now going to introduce several definitions that you need not aim to remember (except when exams are near).

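Definition (π-system). Let $\mathcal{A}$ be a collection of subsets of $E$. Then $\mathcal{A}$ is called a π-system if

(i) $\emptyset \in \mathcal{A}$.

(ii) If $A, B \in \mathcal{A}$, then $A \cap B \in \mathcal{A}$.

Definition (d-system). Let $\mathcal{A}$ be a collection of subsets of $E$. Then $\mathcal{A}$ is called a d-system if

(i) $E \in \mathcal{A}$.

(ii) If $A, B \in \mathcal{A}$ and $A \subseteq B$, then $B \setminus A \in \mathcal{A}$.

(iii) For all increasing sequences $(A_n)$ in $\mathcal{A}$, we have that $\bigcup_n A_n \in \mathcal{A}$.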

The point of d-systems and π-systems is that they separate the axioms of a σ-algebra into two parts. More precisely, we have

Proposition. A collection $\mathcal{A}$ is a σ-algebra if and only if it is both a π-system and a d-system.

This follows rather straightforwardly from the definitions.
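For a small finite set the proposition can also be checked by brute force. The sketch below, written for the three-point set $E = \{0, 1, 2\}$ (an arbitrary choice of ours), enumerates all 256 collections of subsets and compares the three definitions. It is only a sanity check on a finite example, not a proof.

```python
from itertools import combinations

E = frozenset({0, 1, 2})
subsets = [frozenset(c) for r in range(len(E) + 1)
           for c in combinations(sorted(E), r)]

def is_pi_system(coll):
    return frozenset() in coll and all(A & B in coll for A in coll for B in coll)

def is_d_system(coll):
    # On a finite set an increasing sequence is eventually constant, so its
    # union is its largest member and axiom (iii) holds automatically.
    return E in coll and all(B - A in coll
                             for A in coll for B in coll if A <= B)

def is_sigma_algebra(coll):
    return (frozenset() in coll
            and all(E - A in coll for A in coll)
            and all(A | B in coll for A in coll for B in coll))

# Every collection of subsets of E (2**8 = 256 of them).
collections = [frozenset(c) for r in range(len(subsets) + 1)
               for c in combinations(subsets, r)]

agree = all(is_sigma_algebra(c) == (is_pi_system(c) and is_d_system(c))
            for c in collections)
number_of_sigma_algebras = sum(is_sigma_algebra(c) for c in collections)
```

On a finite set, σ-algebras correspond to partitions of the set into atoms, so a three-point set carries exactly five of them.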

The following definitions are also useful:

Definition (Ring). A collection of subsets $\mathcal{A}$ is a ring on $E$ if $\emptyset \in \mathcal{A}$ and for all $A, B \in \mathcal{A}$, we have $B \setminus A \in \mathcal{A}$ and $A \cup B \in \mathcal{A}$.

Definition (Algebra). A collection of subsets $\mathcal{A}$ is an algebra on $E$ if $\emptyset \in \mathcal{A}$ and for all $A, B \in \mathcal{A}$, we have $A^C \in \mathcal{A}$ and $A \cup B \in \mathcal{A}$.

So an algebra is like a σ-algebra, but it is only required to be closed under finite unions, rather than countable unions.

While the names π-system and d-system are rather arbitrary, we can make some sense of the names “ring” and “algebra”. Indeed, a ring forms a ring (without unity) in the algebraic sense with symmetric difference as “addition” and intersection as “multiplication”. Then the empty set acts as the additive identity, and $E$, if present, acts as the multiplicative identity. Similarly, an algebra is a boolean subalgebra of the boolean algebra $P(E)$.

A very important lemma about these things is Dynkin’s lemma:

Lemma (Dynkin's π-system lemma). Let $\mathcal{A}$ be a π-system. Then any d-system which contains $\mathcal{A}$ contains $\sigma(\mathcal{A})$.

This will be very useful in the future. If we want to show that all elements of $\sigma(\mathcal{A})$ satisfy a particular property for some generating π-system $\mathcal{A}$, we just have to show that the elements of $\mathcal{A}$ satisfy that property, and that the collection of things that satisfy the property form a d-system.

While this use case might seem rather contrived, it is surprisingly common

when we have to prove things.

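Proof. Let $\mathcal{D}$ be the intersection of all d-systems containing $\mathcal{A}$, i.e. the smallest d-system containing $\mathcal{A}$. We show that $\mathcal{D}$ contains $\sigma(\mathcal{A})$. To do so, we will show that $\mathcal{D}$ is a π-system, hence a σ-algebra.

There are two steps to the proof, both of which are straightforward verifications:

(i) We first show that if $B \in \mathcal{D}$ and $A \in \mathcal{A}$, then $B \cap A \in \mathcal{D}$.

(ii) We then show that if $A, B \in \mathcal{D}$, then $A \cap B \in \mathcal{D}$.

Then the result immediately follows from the second part.

We let
\[ \mathcal{D}' = \{B \in \mathcal{D} : B \cap A \in \mathcal{D} \text{ for all } A \in \mathcal{A}\}. \]
We note that $\mathcal{D}' \supseteq \mathcal{A}$ because $\mathcal{A}$ is a π-system, and is hence closed under intersections. We check that $\mathcal{D}'$ is a d-system. It is clear that $E \in \mathcal{D}'$. If we have $B_1, B_2 \in \mathcal{D}'$, where $B_1 \subseteq B_2$, then for any $A \in \mathcal{A}$, we have
\[ (B_2 \setminus B_1) \cap A = (B_2 \cap A) \setminus (B_1 \cap A). \]
By definition of $\mathcal{D}'$, we know $B_2 \cap A$ and $B_1 \cap A$ are elements of $\mathcal{D}$. Since $\mathcal{D}$ is a d-system, we know this difference is in $\mathcal{D}$. So $B_2 \setminus B_1 \in \mathcal{D}'$.

Finally, suppose that $(B_n)$ is an increasing sequence in $\mathcal{D}'$, with $B = \bigcup_n B_n$. Then for every $A \in \mathcal{A}$, we have that
\[ \Big(\bigcup_n B_n\Big) \cap A = \bigcup_n (B_n \cap A) = B \cap A \in \mathcal{D}. \]
Therefore $B \in \mathcal{D}'$.

Therefore $\mathcal{D}'$ is a d-system contained in $\mathcal{D}$, which also contains $\mathcal{A}$. By our choice of $\mathcal{D}$, we know $\mathcal{D}' = \mathcal{D}$.

We now let
\[ \mathcal{D}'' = \{B \in \mathcal{D} : B \cap A \in \mathcal{D} \text{ for all } A \in \mathcal{D}\}. \]
Since $\mathcal{D}' = \mathcal{D}$, we again have $\mathcal{A} \subseteq \mathcal{D}''$, and the same argument as above implies that $\mathcal{D}''$ is a d-system which is between $\mathcal{A}$ and $\mathcal{D}$. But the only way that can happen is if $\mathcal{D}'' = \mathcal{D}$, and this implies that $\mathcal{D}$ is a π-system.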

After defining all sorts of things that are “weaker versions” of σ-algebras, we now define a bunch of measure-like objects that satisfy fewer properties. Again, no one really remembers these definitions:

Definition (Set function). Let $\mathcal{A}$ be a collection of subsets of $E$ with $\emptyset \in \mathcal{A}$. A set function is a function $\mu : \mathcal{A} \to [0, \infty]$ such that $\mu(\emptyset) = 0$.

Definition (Increasing set function). A set function is increasing if it has the property that for all $A, B \in \mathcal{A}$ with $A \subseteq B$, we have $\mu(A) \le \mu(B)$.

Definition (Additive set function). A set function is additive if whenever $A, B \in \mathcal{A}$ and $A \cup B \in \mathcal{A}$, $A \cap B = \emptyset$, then $\mu(A \cup B) = \mu(A) + \mu(B)$.

Definition (Countably additive set function). A set function is countably additive if whenever $(A_n)$ is a sequence of disjoint sets in $\mathcal{A}$ with $\bigcup_n A_n \in \mathcal{A}$, then
\[ \mu\Big(\bigcup_n A_n\Big) = \sum_n \mu(A_n). \]

Under these definitions, a measure is just a countably additive set function defined on a σ-algebra.

Definition (Countably subadditive set function). A set function is countably subadditive if whenever $(A_n)$ is a sequence of sets in $\mathcal{A}$ with $\bigcup_n A_n \in \mathcal{A}$, then
\[ \mu\Big(\bigcup_n A_n\Big) \le \sum_n \mu(A_n). \]

The big theorem that allows us to construct measures is the Carathéodory extension theorem. In particular, this will help us construct the Lebesgue measure on $\mathbb{R}$.

Theorem (Carathéodory extension theorem). Let $\mathcal{A}$ be a ring on $E$, and $\mu$ a countably additive set function on $\mathcal{A}$. Then $\mu$ extends to a measure on the σ-algebra generated by $\mathcal{A}$.

Proof. (non-examinable) We start by defining what we want our measure to be. For $B \subseteq E$, we set
\[ \mu^*(B) = \inf\Big\{\sum_n \mu(A_n) : (A_n) \text{ in } \mathcal{A} \text{ and } B \subseteq \bigcup_n A_n\Big\}. \]
If it happens that there is no such sequence, we set this to be $\infty$. This is known as the outer measure. It is clear that $\mu^*(\emptyset) = 0$, and that $\mu^*$ is increasing.

We say a set $A \subseteq E$ is $\mu^*$-measurable if
\[ \mu^*(B) = \mu^*(B \cap A) + \mu^*(B \cap A^C) \]
for all $B \subseteq E$. We let
\[ \mathcal{M} = \{\mu^*\text{-measurable sets}\}. \]
We will show the following:

(i) $\mathcal{M}$ is a σ-algebra containing $\mathcal{A}$.

(ii) $\mu^*$ is a measure on $\mathcal{M}$ with $\mu^*|_{\mathcal{A}} = \mu$.

Note that it is not true in general that $\mathcal{M} = \sigma(\mathcal{A})$. However, we will always have $\mathcal{M} \supseteq \sigma(\mathcal{A})$.

We are going to break this up into five nice bite-size chunks.

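Claim. $\mu^*$ is countably subadditive.

Suppose $B \subseteq \bigcup_n B_n$. We need to show that $\mu^*(B) \le \sum_n \mu^*(B_n)$. We can wlog assume that $\mu^*(B_n)$ is finite for all $n$, or else the inequality is trivial. Let $\varepsilon > 0$. Then by definition of the outer measure, for each $n$, we can find a sequence $(B_{n,m})_{m=1}^\infty$ in $\mathcal{A}$ with the property that
\[ B_n \subseteq \bigcup_m B_{n,m} \]
and
\[ \mu^*(B_n) + \frac{\varepsilon}{2^n} \ge \sum_m \mu(B_{n,m}). \]
Then we have
\[ B \subseteq \bigcup_n B_n \subseteq \bigcup_{n,m} B_{n,m}. \]
Thus, by definition, we have
\[ \mu^*(B) \le \sum_{n,m} \mu(B_{n,m}) \le \sum_n \Big(\mu^*(B_n) + \frac{\varepsilon}{2^n}\Big) = \varepsilon + \sum_n \mu^*(B_n). \]
Since $\varepsilon$ was arbitrary, we are done.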

Claim. $\mu^*$ agrees with $\mu$ on $\mathcal{A}$.

In the first example sheet, we will show that if $\mathcal{A}$ is a ring and $\mu$ is a countably additive set function on $\mathcal{A}$, then $\mu$ is in fact countably subadditive and increasing.

Assuming this, suppose that $A, (A_n)$ are in $\mathcal{A}$ and $A \subseteq \bigcup_n A_n$. Then by subadditivity, we have
\[ \mu(A) \le \sum_n \mu(A \cap A_n) \le \sum_n \mu(A_n), \]
using that $\mu$ is countably subadditive and increasing. Note that we have to do this in two steps, rather than just applying countable subadditivity, since we did not assume that $\bigcup_n A_n \in \mathcal{A}$. Taking the infimum over all sequences, we have
\[ \mu(A) \le \mu^*(A). \]
Also, we see by definition that $\mu(A) \ge \mu^*(A)$, since $A$ covers $A$. So we get that $\mu(A) = \mu^*(A)$ for all $A \in \mathcal{A}$.

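Claim. $\mathcal{M}$ contains $\mathcal{A}$.

Suppose that $A \in \mathcal{A}$ and $B \subseteq E$. We need to show that
\[ \mu^*(B) = \mu^*(B \cap A) + \mu^*(B \cap A^C). \]
Since $\mu^*$ is countably subadditive, we immediately have $\mu^*(B) \le \mu^*(B \cap A) + \mu^*(B \cap A^C)$. For the other inequality, we first observe that it is trivial if $\mu^*(B)$ is infinite. If it is finite, then by definition, given $\varepsilon > 0$, we can find some $(B_n)$ in $\mathcal{A}$ such that $B \subseteq \bigcup_n B_n$ and
\[ \mu^*(B) + \varepsilon \ge \sum_n \mu(B_n). \]
Then we have
\[ B \cap A \subseteq \bigcup_n (B_n \cap A), \qquad B \cap A^C \subseteq \bigcup_n (B_n \cap A^C). \]
We notice that $B_n \cap A^C = B_n \setminus A \in \mathcal{A}$. Thus, by definition of $\mu^*$, we have
\begin{align*}
\mu^*(B \cap A) + \mu^*(B \cap A^C) &\le \sum_n \mu(B_n \cap A) + \sum_n \mu(B_n \cap A^C) \\
&= \sum_n \big(\mu(B_n \cap A) + \mu(B_n \cap A^C)\big) \\
&= \sum_n \mu(B_n) \\
&\le \mu^*(B) + \varepsilon.
\end{align*}
Since $\varepsilon$ was arbitrary, the result follows.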

Claim. $\mathcal{M}$ is an algebra.

We first show that $E \in \mathcal{M}$. This is true since we obviously have
\[ \mu^*(B) = \mu^*(B \cap E) + \mu^*(B \cap E^C) \]
for all $B \subseteq E$.

Next, note that if $A \in \mathcal{M}$, then by definition we have, for all $B$,
\[ \mu^*(B) = \mu^*(B \cap A) + \mu^*(B \cap A^C). \]
Now note that this definition is symmetric in $A$ and $A^C$. So we also have $A^C \in \mathcal{M}$.

Finally, we have to show that $\mathcal{M}$ is closed under intersection (which is equivalent to being closed under union when we have complements). Suppose $A_1, A_2 \in \mathcal{M}$ and $B \subseteq E$. Then we have
\begin{align*}
\mu^*(B) &= \mu^*(B \cap A_1) + \mu^*(B \cap A_1^C) \\
&= \mu^*(B \cap A_1 \cap A_2) + \mu^*(B \cap A_1 \cap A_2^C) + \mu^*(B \cap A_1^C) \\
&= \mu^*(B \cap (A_1 \cap A_2)) + \mu^*(B \cap (A_1 \cap A_2)^C \cap A_1) + \mu^*(B \cap (A_1 \cap A_2)^C \cap A_1^C) \\
&= \mu^*(B \cap (A_1 \cap A_2)) + \mu^*(B \cap (A_1 \cap A_2)^C).
\end{align*}
So we have $A_1 \cap A_2 \in \mathcal{M}$. So $\mathcal{M}$ is an algebra.

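Claim. $\mathcal{M}$ is a σ-algebra, and $\mu^*$ is a measure on $\mathcal{M}$.

To show that $\mathcal{M}$ is a σ-algebra, we need to show that it is closed under countable unions. Since $\mathcal{M}$ is an algebra, any countable union of sets in $\mathcal{M}$ can be rewritten as a countable disjoint union of sets in $\mathcal{M}$, so it suffices to consider disjoint sequences. We let $(A_n)$ be a disjoint collection of sets in $\mathcal{M}$, then we want to show that $A = \bigcup_n A_n \in \mathcal{M}$ and $\mu^*(A) = \sum_n \mu^*(A_n)$.

Suppose that $B \subseteq E$. Then we have
\[ \mu^*(B) = \mu^*(B \cap A_1) + \mu^*(B \cap A_1^C). \]
Using the fact that $A_2 \in \mathcal{M}$ and $A_1 \cap A_2 = \emptyset$, we have
\begin{align*}
\mu^*(B) &= \mu^*(B \cap A_1) + \mu^*(B \cap A_2) + \mu^*(B \cap A_1^C \cap A_2^C) \\
&= \cdots \\
&= \sum_{i=1}^n \mu^*(B \cap A_i) + \mu^*(B \cap A_1^C \cap \cdots \cap A_n^C) \\
&\ge \sum_{i=1}^n \mu^*(B \cap A_i) + \mu^*(B \cap A^C).
\end{align*}
Taking the limit as $n \to \infty$, we have
\[ \mu^*(B) \ge \sum_{i=1}^\infty \mu^*(B \cap A_i) + \mu^*(B \cap A^C). \]
By the countable subadditivity of $\mu^*$, we have
\[ \mu^*(B \cap A) \le \sum_{i=1}^\infty \mu^*(B \cap A_i). \]
Thus we obtain
\[ \mu^*(B) \ge \mu^*(B \cap A) + \mu^*(B \cap A^C). \]
By countable subadditivity, we also have inequality in the other direction. So equality holds. So $A \in \mathcal{M}$. So $\mathcal{M}$ is a σ-algebra.

To see that $\mu^*$ is a measure on $\mathcal{M}$, note that the above implies that
\[ \mu^*(B) = \sum_{i=1}^\infty \mu^*(B \cap A_i) + \mu^*(B \cap A^C). \]
Taking $B = A$, this gives
\[ \mu^*(A) = \sum_{i=1}^\infty \mu^*(A \cap A_i) + \mu^*(A \cap A^C) = \sum_{i=1}^\infty \mu^*(A_i). \]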

Note that when $\mathcal{A}$ itself is actually a σ-algebra, the outer measure can be simply written as
\[ \mu^*(B) = \inf\{\mu(A) : A \in \mathcal{A},\ B \subseteq A\}. \]
Carathéodory gives us the existence of some measure extending the set function on $\mathcal{A}$. Could there be many? In general, there could. However, in the special case where the measure is finite, we do get uniqueness.

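Theorem. Suppose that $\mu_1, \mu_2$ are measures on $(E, \mathcal{E})$ with $\mu_1(E) = \mu_2(E) < \infty$. If $\mathcal{A}$ is a π-system with $\sigma(\mathcal{A}) = \mathcal{E}$, and $\mu_1$ agrees with $\mu_2$ on $\mathcal{A}$, then $\mu_1 = \mu_2$.

Proof. Let
\[ \mathcal{D} = \{A \in \mathcal{E} : \mu_1(A) = \mu_2(A)\}. \]
We know that $\mathcal{D} \supseteq \mathcal{A}$. By Dynkin's lemma, it suffices to show that $\mathcal{D}$ is a d-system. The things to check are:

(i) $E \in \mathcal{D}$: this follows by assumption.

(ii) If $A, B \in \mathcal{D}$ with $A \subseteq B$, then $B \setminus A \in \mathcal{D}$. Indeed, we have the equations
\begin{align*}
\mu_1(B) &= \mu_1(A) + \mu_1(B \setminus A) < \infty, \\
\mu_2(B) &= \mu_2(A) + \mu_2(B \setminus A) < \infty.
\end{align*}
Since $\mu_1(B) = \mu_2(B)$ and $\mu_1(A) = \mu_2(A)$, we must have $\mu_1(B \setminus A) = \mu_2(B \setminus A)$.

(iii) Let $(A_n)$ be an increasing sequence in $\mathcal{D}$ with $\bigcup_n A_n = A$. Then
\[ \mu_1(A) = \lim_{n \to \infty} \mu_1(A_n) = \lim_{n \to \infty} \mu_2(A_n) = \mu_2(A). \]
So $A \in \mathcal{D}$.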

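The assumption that $\mu_1(E) = \mu_2(E) < \infty$ is necessary. The theorem does not necessarily hold without it. We can see this from a simple counterexample:

Example. Let $E = \mathbb{Z}$, and let $\mathcal{E} = P(E)$. We let
\[ \mathcal{A} = \{\{x, x+1, x+2, \cdots\} : x \in E\} \cup \{\emptyset\}. \]
This is a π-system with $\sigma(\mathcal{A}) = \mathcal{E}$. We let $\mu_1(A)$ be the number of elements in $A$, and $\mu_2 = 2\mu_1$. Then obviously $\mu_1 \neq \mu_2$, but $\mu_1(A) = \infty = \mu_2(A)$ for every non-empty $A \in \mathcal{A}$, and both measures vanish on $\emptyset$.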

Definition (Borel σ-algebra). Let $E$ be a topological space. We define the Borel σ-algebra as
\[ \mathcal{B}(E) = \sigma(\{U \subseteq E : U \text{ is open}\}). \]
We write $\mathcal{B}$ for $\mathcal{B}(\mathbb{R})$.

Definition (Borel measure and Radon measure). A measure $\mu$ on $(E, \mathcal{B}(E))$ is called a Borel measure. If $\mu(K) < \infty$ for all $K \subseteq E$ compact, then $\mu$ is a Radon measure.

The most important example of a Borel measure we will consider is the

Lebesgue measure.

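Theorem. There exists a unique Borel measure $\mu$ on $\mathbb{R}$ with $\mu([a, b]) = b - a$.

Proof. We first show uniqueness. Suppose $\tilde{\mu}$ is another measure on $\mathcal{B}$ satisfying the above property. We want to apply the previous uniqueness theorem, but our measure is not finite. So we need to carefully get around that problem.

For each $n \in \mathbb{Z}$, we set
\begin{align*}
\mu_n(A) &= \mu(A \cap (n, n+1]), \\
\tilde{\mu}_n(A) &= \tilde{\mu}(A \cap (n, n+1]).
\end{align*}
Then $\mu_n$ and $\tilde{\mu}_n$ are finite measures on $\mathcal{B}$ which agree on the π-system of intervals of the form $(a, b]$ with $a, b \in \mathbb{R}$, $a < b$ (together with $\emptyset$). Therefore we have $\mu_n = \tilde{\mu}_n$ for all $n \in \mathbb{Z}$. Now we have
\[ \mu(A) = \sum_{n \in \mathbb{Z}} \mu(A \cap (n, n+1]) = \sum_{n \in \mathbb{Z}} \mu_n(A) = \sum_{n \in \mathbb{Z}} \tilde{\mu}_n(A) = \tilde{\mu}(A) \]
for all Borel sets $A$.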

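To show existence, we want to use the Carathéodory extension theorem. We let $\mathcal{A}$ be the collection of finite, disjoint unions of the form
\[ A = (a_1, b_1] \cup (a_2, b_2] \cup \cdots \cup (a_n, b_n]. \]
Then $\mathcal{A}$ is a ring of subsets of $\mathbb{R}$, and $\sigma(\mathcal{A}) = \mathcal{B}$ (details are to be checked on the first example sheet).

We set
\[ \mu(A) = \sum_{i=1}^n (b_i - a_i). \]
We note that $\mu$ is well-defined, since if
\[ A = (a_1, b_1] \cup \cdots \cup (a_n, b_n] = (\tilde{a}_1, \tilde{b}_1] \cup \cdots \cup (\tilde{a}_m, \tilde{b}_m], \]
then
\[ \sum_{i=1}^n (b_i - a_i) = \sum_{i=1}^m (\tilde{b}_i - \tilde{a}_i). \]
Also, $\mu$ is additive: if $A, B \in \mathcal{A}$, $A \cap B = \emptyset$ and $A \cup B \in \mathcal{A}$, we obviously have $\mu(A \cup B) = \mu(A) + \mu(B)$.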

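Finally, we have to show that $\mu$ is in fact countably additive. Let $(A_n)$ be a disjoint sequence in $\mathcal{A}$, and let $A = \bigcup_{n=1}^\infty A_n \in \mathcal{A}$. Then we need to show that
\[ \mu(A) = \sum_{n=1}^\infty \mu(A_n). \]
Since $\mu$ is additive, we have
\begin{align*}
\mu(A) &= \mu(A_1) + \mu(A \setminus A_1) \\
&= \mu(A_1) + \mu(A_2) + \mu(A \setminus (A_1 \cup A_2)) \\
&= \sum_{i=1}^n \mu(A_i) + \mu\Big(A \setminus \bigcup_{i=1}^n A_i\Big).
\end{align*}
To finish the proof, we show that
\[ \mu\Big(A \setminus \bigcup_{i=1}^n A_i\Big) \to 0 \text{ as } n \to \infty. \]
We are going to reduce this to the finite intersection property of compact sets in $\mathbb{R}$: if $(K_n)$ is a sequence of compact sets in $\mathbb{R}$ with the property that $\bigcap_{m=1}^n K_m \neq \emptyset$ for all $n$, then $\bigcap_{m=1}^\infty K_m \neq \emptyset$.

We first introduce some new notation. We let
\[ B_n = A \setminus \bigcup_{m=1}^n A_m. \]
We now suppose, for contradiction, that $\mu(B_n) \not\to 0$ as $n \to \infty$. Since the $B_n$'s are decreasing, there must exist $\varepsilon > 0$ such that $\mu(B_n) \ge 2\varepsilon$ for every $n$.

For each $n$, we take $C_n \in \mathcal{A}$ with the property that $\bar{C}_n \subseteq B_n$ and $\mu(B_n \setminus C_n) \le \frac{\varepsilon}{2^n}$. This is possible since each $B_n$ is just a finite union of intervals. Thus we have
\begin{align*}
\mu(B_n) - \mu\Big(\bigcap_{m=1}^n C_m\Big) &= \mu\Big(B_n \setminus \bigcap_{m=1}^n C_m\Big) \\
&\le \mu\Big(\bigcup_{m=1}^n (B_m \setminus C_m)\Big) \\
&\le \sum_{m=1}^n \mu(B_m \setminus C_m) \\
&\le \sum_{m=1}^n \frac{\varepsilon}{2^m} \\
&\le \varepsilon.
\end{align*}
On the other hand, we also know that $\mu(B_n) \ge 2\varepsilon$. So
\[ \mu\Big(\bigcap_{m=1}^n C_m\Big) \ge \varepsilon \]
for all $n$. We now let $K_n = \bigcap_{m=1}^n \bar{C}_m$. Then $K_n$ is compact and contains $\bigcap_{m=1}^n C_m$, which has $\mu$-measure at least $\varepsilon$, so in particular $K_n \neq \emptyset$ for all $n$.

Thus, the finite intersection property says
\[ \emptyset \neq \bigcap_{n=1}^\infty K_n \subseteq \bigcap_{n=1}^\infty B_n = \emptyset. \]
This is a contradiction. So we have $\mu(B_n) \to 0$ as $n \to \infty$. So done.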
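The set function $\mu$ on the ring of finite unions of half-open intervals can be sketched in Python as follows. The representation of a set as a list of pairs `(a, b)` standing for $(a, b]$, and the function names, are our own choices for illustration. Merging the intervals first means the result depends only on the set, not on how it was written as a union.

```python
from fractions import Fraction

def normalise(intervals):
    """Merge half-open intervals (a, b] into a sorted, disjoint list."""
    merged = []
    for a, b in sorted((a, b) for a, b in intervals if a < b):
        if merged and a <= merged[-1][1]:
            # (a, b] overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged

def mu(intervals):
    """Total length of a finite union of half-open intervals (a, b]."""
    return sum((b - a for a, b in normalise(intervals)), Fraction(0))

# Two different decompositions of the same set (0, 3].
first = [(0, 1), (1, 3)]
second = [(0, 2), (2, 3)]

# Two disjoint members of the ring.
A, B = [(0, 1)], [(2, 5)]
```

Both decompositions of $(0, 3]$ have measure 3, and `mu(A + B)` equals `mu(A) + mu(B)` for the disjoint sets above, matching well-definedness and additivity in the proof.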

Definition (Lebesgue measure). The Lebesgue measure is the unique Borel measure $\mu$ on $\mathbb{R}$ with $\mu([a, b]) = b - a$.

Note that the Lebesgue measure is not a finite measure, since $\mu(\mathbb{R}) = \infty$. However, it is a σ-finite measure.

Definition (σ-finite measure). Let $(E, \mathcal{E})$ be a measurable space, and $\mu$ a measure. We say $\mu$ is σ-finite if there exists a sequence $(E_n)$ in $\mathcal{E}$ such that $\bigcup_n E_n = E$ and $\mu(E_n) < \infty$ for all $n$.

This is the next best thing we can hope after finiteness, and often proofs

that involve finiteness carry over to σ-finite measures.

Proposition. The Lebesgue measure is translation invariant, i.e.

µ(A + x) = µ(A)

for all A ∈ B and x ∈ R, where

A + x = {y + x, y ∈ A}.

Proof. We use the uniqueness of the Lebesgue measure. We let
\[ \mu_x(A) = \mu(A + x) \]
for $A \in \mathcal{B}$. Then this is a measure on $\mathcal{B}$ satisfying $\mu_x([a, b]) = b - a$. So the uniqueness of the Lebesgue measure shows that $\mu_x = \mu$.

It turns out that translation invariance actually characterizes the Lebesgue

measure.

Proposition. Let $\tilde{\mu}$ be a Borel measure on $\mathbb{R}$ that is translation invariant and $\tilde{\mu}([0, 1]) = 1$. Then $\tilde{\mu}$ is the Lebesgue measure.

Proof. We show that any such measure must satisfy
\[ \tilde{\mu}([a, b]) = b - a. \]
By additivity and translation invariance, we can show that $\tilde{\mu}([p, q]) = q - p$ for all rational $p < q$. By considering $\tilde{\mu}([p, p + 1/n])$ for all $n$ and using the increasing property, we know $\tilde{\mu}(\{p\}) = 0$. So $\tilde{\mu}([p, q)) = \tilde{\mu}((p, q]) = \tilde{\mu}((p, q)) = q - p$ for all rational $p < q$.

Finally, by countable additivity, we can extend this to all real intervals. Then the result follows from the uniqueness of the Lebesgue measure.

In the proof of the Carathéodory extension theorem, we constructed a measure $\mu^*$ on the σ-algebra $\mathcal{M}$ of $\mu^*$-measurable sets which contains $\mathcal{A}$. This contains $\mathcal{B} = \sigma(\mathcal{A})$, but could in fact be bigger than it. We call $\mathcal{M}$ the Lebesgue σ-algebra.

Indeed, it can be given by
\[ \mathcal{M} = \{A \cup N : A \in \mathcal{B},\ N \subseteq B \in \mathcal{B} \text{ with } \mu(B) = 0\}. \]
If $A \cup N \in \mathcal{M}$, then $\mu(A \cup N) = \mu(A)$. The proof is left for the example sheet.

It is also true that $\mathcal{M}$ is strictly larger than $\mathcal{B}$, so there exists $A \in \mathcal{M}$ with $A \notin \mathcal{B}$. Construction of such a set was on last year's exam (2016).

On the other hand, it is also true that not all sets are Lebesgue measurable.

This is a rather funny construction.

Example. For $x, y \in [0, 1)$, we say $x \sim y$ if $x - y$ is rational. This defines an equivalence relation on $[0, 1)$. By the axiom of choice, we pick a representative of each equivalence class, and put them into a set $S \subseteq [0, 1)$. We will show that $S$ is not Lebesgue measurable.

Suppose that $S$ were Lebesgue measurable. We are going to get a contradiction to the countable additivity of the Lebesgue measure. For each rational $r \in [0, 1) \cap \mathbb{Q}$, we define
\[ S_r = \{s + r \bmod 1 : s \in S\}. \]
By translation invariance, we know $S_r$ is also Lebesgue measurable, and $\mu(S_r) = \mu(S)$.

Also, by construction of $S$, we know $(S_r)_{r \in [0, 1) \cap \mathbb{Q}}$ is disjoint, and $\bigcup_{r \in [0, 1) \cap \mathbb{Q}} S_r = [0, 1)$. Now by countable additivity, we have
\[ 1 = \mu([0, 1)) = \mu\Big(\bigcup_{r \in [0, 1) \cap \mathbb{Q}} S_r\Big) = \sum_{r \in [0, 1) \cap \mathbb{Q}} \mu(S_r) = \sum_{r \in [0, 1) \cap \mathbb{Q}} \mu(S), \]
which is clearly not possible. Indeed, if $\mu(S) = 0$, then this says $1 = 0$; if $\mu(S) > 0$, then this says $1 = \infty$. Both are absurd.

1.2 Probability measures

Since the course is called “probability and measure”, we’d better start talking

about probability! It turns out the notions we care about in probability theory

are very naturally just special cases of the concepts we have previously considered.

Definition (Probability measure and probability space). Let $(E, \mathcal{E}, \mu)$ be a measure space with the property that $\mu(E) = 1$. Then we often call $\mu$ a probability measure, and $(E, \mathcal{E}, \mu)$ a probability space.

Probability spaces are usually written as $(\Omega, \mathcal{F}, \mathbb{P})$ instead.

Definition (Sample space). In a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, we often call $\Omega$ the sample space.

Definition (Events). In a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, we often call the elements of $\mathcal{F}$ the events.

Definition (Probability). In a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, if $A \in \mathcal{F}$, we often call $\mathbb{P}[A]$ the probability of the event $A$.

These are exactly the same things as measures, but with different names!

However, thinking of them as probabilities could make us ask different questions

about these measure spaces. For example, in probability, one is often interested

in independence.

Definition (Independence of events). A sequence of events $(A_n)$ is said to be independent if
\[ \mathbb{P}\Big[\bigcap_{n \in J} A_n\Big] = \prod_{n \in J} \mathbb{P}[A_n] \]
for all finite subsets $J \subseteq \mathbb{N}$.
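The requirement that the product formula holds for every finite $J$, and not just for pairs, matters. The following Python sketch checks the definition by brute force on a hypothetical sample space of two fair dice; the events and helper names are ours.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical sample space: two fair dice, all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P[event] under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def independent(events):
    """Check the product formula for every subfamily J with at least 2 events."""
    for r in range(2, len(events) + 1):
        for J in combinations(events, r):
            joint = prob(lambda w: all(A(w) for A in J))
            expected = Fraction(1)
            for A in J:
                expected *= prob(A)
            if joint != expected:
                return False
    return True

first_is_six = lambda w: w[0] == 6
second_is_even = lambda w: w[1] % 2 == 0
sum_is_seven = lambda w: w[0] + w[1] == 7

pairwise = all(independent(list(pair)) for pair in
               combinations([first_is_six, second_is_even, sum_is_seven], 2))
jointly = independent([first_is_six, second_is_even, sum_is_seven])
```

In this example each pair of events is independent, but the three together are not: if the first die shows 6 and the sum is 7, the second die shows 1, which is odd.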

However, it turns out that talking about independence of events is usually

too restrictive. Instead, we want to talk about the independence of σ-algebras:

Definition (Independence of σ-algebras). A sequence of σ-algebras $(\mathcal{A}_n)$ with $\mathcal{A}_n \subseteq \mathcal{F}$ for all $n$ is said to be independent if the following is true: if $(A_n)$ is a sequence where $A_n \in \mathcal{A}_n$ for all $n$, then $(A_n)$ is independent.

Proposition. Events $(A_n)$ are independent iff the σ-algebras $\sigma(A_n)$ are independent.

While proving this directly would be rather tedious (but not too hard), it is

an immediate consequence of the following theorem:

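Theorem. Suppose $\mathcal{A}_1$ and $\mathcal{A}_2$ are π-systems in $\mathcal{F}$. If
\[ \mathbb{P}[A_1 \cap A_2] = \mathbb{P}[A_1]\mathbb{P}[A_2] \]
for all $A_1 \in \mathcal{A}_1$ and $A_2 \in \mathcal{A}_2$, then $\sigma(\mathcal{A}_1)$ and $\sigma(\mathcal{A}_2)$ are independent.

Proof. This will follow from two applications of the fact that a finite measure is determined by its values on a π-system which generates the entire σ-algebra.

We first fix $A_1 \in \mathcal{A}_1$. We define the measures
\[ \mu(A) = \mathbb{P}[A \cap A_1] \]
and
\[ \nu(A) = \mathbb{P}[A]\mathbb{P}[A_1] \]
for all $A \in \mathcal{F}$. By assumption, we know $\mu$ and $\nu$ agree on $\mathcal{A}_2$, and we have that $\mu(\Omega) = \mathbb{P}[A_1] = \nu(\Omega) \le 1 < \infty$. So $\mu$ and $\nu$ agree on $\sigma(\mathcal{A}_2)$. So we have
\[ \mathbb{P}[A_1 \cap A_2] = \mu(A_2) = \nu(A_2) = \mathbb{P}[A_1]\mathbb{P}[A_2] \]
for all $A_2 \in \sigma(\mathcal{A}_2)$.

So we have now shown that if $\mathcal{A}_1$ and $\mathcal{A}_2$ are independent, then $\mathcal{A}_1$ and $\sigma(\mathcal{A}_2)$ are independent. By symmetry, the same argument, now applied to the π-systems $\sigma(\mathcal{A}_2)$ and $\mathcal{A}_1$, shows that $\sigma(\mathcal{A}_1)$ and $\sigma(\mathcal{A}_2)$ are independent.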

Say we are rolling a die. Instead of asking what the probability of getting a 6 is, we might be interested instead in the probability of getting a 6 infinitely often. Intuitively, the answer is “it happens with probability 1”, because in each roll, we have a probability of $\frac{1}{6}$ of getting a 6, and the rolls are all independent.

We would like to make this precise and actually prove it. It turns out that

the notions of “occurs infinitely often” and also “occurs eventually” correspond

to more analytic notions of lim sup and lim inf.

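Definition (limsup and liminf). Let $(A_n)$ be a sequence of events. We define
\begin{align*}
\limsup A_n &= \bigcap_n \bigcup_{m \ge n} A_m, \\
\liminf A_n &= \bigcup_n \bigcap_{m \ge n} A_m.
\end{align*}
To parse these definitions more easily, we can read $\cap$ as “for all”, and $\cup$ as “there exists”. For example, we can write
\begin{align*}
\limsup A_n &= \forall n, \exists m \ge n \text{ such that } A_m \text{ occurs} \\
&= \{x : \forall n, \exists m \ge n, x \in A_m\} \\
&= \{A_m \text{ occurs infinitely often}\} \\
&= \{A_m \text{ i.o.}\}.
\end{align*}
Similarly, we have
\begin{align*}
\liminf A_n &= \exists n, \forall m \ge n \text{ such that } A_m \text{ occurs} \\
&= \{x : \exists n, \forall m \ge n, x \in A_m\} \\
&= \{A_m \text{ occurs eventually}\} \\
&= \{A_m \text{ e.v.}\}.
\end{align*}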
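For a sequence of finite sets that is eventually periodic, the two definitions can be evaluated by truncating the intersections and unions. The Python sketch below does this; the `horizon` parameter and the example sequence are assumptions of ours, and the truncation is only exact when the non-periodic start of the sequence and its period are both shorter than `horizon`.

```python
def lim_sup(sets, horizon):
    """Intersect, over n < horizon, the union of A_m for n <= m < 2 * horizon."""
    result = None
    for n in range(horizon):
        tail_union = set().union(*(sets(m) for m in range(n, 2 * horizon)))
        result = tail_union if result is None else result & tail_union
    return result

def lim_inf(sets, horizon):
    """Union, over n < horizon, of the intersection of A_m for n <= m < 2 * horizon."""
    result = set()
    for n in range(horizon):
        tail = [sets(m) for m in range(n, 2 * horizon)]
        result |= set.intersection(*tail)
    return result

def example_sequence(m):
    """A_m alternates between {0, 1} and {1, 2}; the point 99 appears only in A_0."""
    base = {0, 1} if m % 2 == 0 else {1, 2}
    return base | ({99} if m == 0 else set())

infinitely_often = lim_sup(example_sequence, 10)
eventually = lim_inf(example_sequence, 10)
```

Here 0, 1 and 2 each occur infinitely often, only 1 occurs eventually, and 99 does neither since it appears just once.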

We are now going to prove two “obvious” results, known as the Borel–Cantelli

lemmas. These give us necessary conditions for an event to happen infinitely

often, and in the case where the events are independent, the condition is also

sufficient.

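Lemma (Borel–Cantelli lemma). If
\[ \sum_n \mathbb{P}[A_n] < \infty, \]
then
\[ \mathbb{P}[A_n \text{ i.o.}] = 0. \]

Proof. For each $k$, we have
\begin{align*}
\mathbb{P}[A_n \text{ i.o.}] &= \mathbb{P}\Big[\bigcap_n \bigcup_{m \ge n} A_m\Big] \\
&\le \mathbb{P}\Big[\bigcup_{m \ge k} A_m\Big] \\
&\le \sum_{m=k}^\infty \mathbb{P}[A_m] \\
&\to 0
\end{align*}
as $k \to \infty$. So we have $\mathbb{P}[A_n \text{ i.o.}] = 0$.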

Note that we did not need to use the fact that we are working with a

probability measure. So in fact this holds for any measure space.

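Lemma (Borel–Cantelli lemma II). Let $(A_n)$ be independent events. If
\[ \sum_n \mathbb{P}[A_n] = \infty, \]
then
\[ \mathbb{P}[A_n \text{ i.o.}] = 1. \]

Note that independence is crucial. If we flip a fair coin, and we set all the $A_n$ to be equal to “getting a heads”, then $\sum_n \mathbb{P}[A_n] = \infty$, but we certainly do not have $\mathbb{P}[A_n \text{ i.o.}] = 1$. Instead it is just $\frac{1}{2}$.

Proof. By example sheet, if $(A_n)$ is independent, then so is $(A_n^C)$. Then we have
\begin{align*}
\mathbb{P}\Big[\bigcap_{m=n}^N A_m^C\Big] &= \prod_{m=n}^N \mathbb{P}[A_m^C] \\
&= \prod_{m=n}^N (1 - \mathbb{P}[A_m]) \\
&\le \prod_{m=n}^N \exp(-\mathbb{P}[A_m]) \\
&= \exp\Big(-\sum_{m=n}^N \mathbb{P}[A_m]\Big) \\
&\to 0
\end{align*}
as $N \to \infty$, as we assumed that $\sum_n \mathbb{P}[A_n] = \infty$. So we have
\[ \mathbb{P}\Big[\bigcap_{m=n}^\infty A_m^C\Big] = 0. \]
By countable subadditivity, we have
\[ \mathbb{P}\Big[\bigcup_n \bigcap_{m=n}^\infty A_m^C\Big] = 0. \]
This in turn implies that
\[ \mathbb{P}\Big[\bigcap_n \bigcup_{m=n}^\infty A_m\Big] = 1 - \mathbb{P}\Big[\bigcup_n \bigcap_{m=n}^\infty A_m^C\Big] = 1. \]
So we are done.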
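The key estimate in the proof can be checked numerically in a concrete case. The sketch below assumes independent events with $\mathbb{P}[A_m] = 1/m$ for $m \ge 2$ (a choice of ours, for which $\sum_m \mathbb{P}[A_m] = \infty$), computes the product $\prod_{m=n}^N (1 - \mathbb{P}[A_m])$ exactly, and compares it with the exponential bound.

```python
from fractions import Fraction
import math

def prob_none_occur(n, N):
    """P[no A_m occurs for n <= m <= N], for independent A_m with P[A_m] = 1/m."""
    result = Fraction(1)
    for m in range(n, N + 1):
        result *= 1 - Fraction(1, m)
    return result

def exponential_bound(n, N):
    """exp(-sum of P[A_m] for n <= m <= N), the bound used in the proof."""
    return math.exp(-sum(1 / m for m in range(n, N + 1)))

exact = prob_none_occur(10, 1000)
bound = exponential_bound(10, 1000)
```

For this choice the product telescopes to $(n - 1)/N$, which tends to 0 as $N \to \infty$ for each fixed $n$, as the proof requires.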

2 Measurable functions and random variables

We’ve had enough of measurable sets. As in most of mathematics, not only

should we talk about objects, but also maps between objects. Here we want to

talk about maps between measure spaces, known as measurable functions. In

the case of a probability space, a measurable function is a random variable!

In this chapter, we are going to start by defining a measurable function and

investigate some of its basic properties. In particular, we are going to prove the

monotone class theorem, which is the analogue of Dynkin’s lemma for measurable

functions. Afterwards, we turn to the probabilistic aspects, and see how we can

make sense of the independence of random variables. Finally, we are going to

consider different notions of “convergence” of functions.

2.1 Measurable functions

The definition of a measurable function is somewhat like the definition of a

continuous function, except that we replace “open” with “in the σ-algebra”.

Definition (Measurable functions). Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces. A map $f : E \to G$ is measurable if for every $A \in \mathcal{G}$, we have
\[ f^{-1}(A) = \{x \in E : f(x) \in A\} \in \mathcal{E}. \]
If $(G, \mathcal{G}) = (\mathbb{R}, \mathcal{B})$, then we will just say that $f$ is measurable on $E$.

If $(G, \mathcal{G}) = ([0, \infty], \mathcal{B})$, then we will just say that $f$ is non-negative measurable.

If $E$ is a topological space and $\mathcal{E} = \mathcal{B}(E)$, then we call $f$ a Borel function.

How do we actually check in practice that a function is measurable? It turns out we are lucky. We can simply check that $f^{-1}(A) \in \mathcal{E}$ for $A$ in any generating set $\mathcal{Q}$ of $\mathcal{G}$.

Lemma. Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces, and $\mathcal{G} = \sigma(\mathcal{Q})$ for some $\mathcal{Q}$. If $f^{-1}(A) \in \mathcal{E}$ for all $A \in \mathcal{Q}$, then $f$ is measurable.

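Proof. We claim that

\[ \mathcal{A} = \{A \subseteq G : f^{-1}(A) \in \mathcal{E}\} \]

is a σ-algebra on $G$. Then the result follows immediately by definition of $\sigma(\mathcal{Q})$.

Indeed, this follows from the fact that $f^{-1}$ preserves everything. More precisely, we have

\[ f^{-1}\left(\bigcup_n A_n\right) = \bigcup_n f^{-1}(A_n), \quad f^{-1}(A^C) = (f^{-1}(A))^C, \quad f^{-1}(\emptyset) = \emptyset. \]

So if, say, all $A_n \in \mathcal{A}$, then so is $\bigcup_n A_n$.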

Example. In the particular case where we have a function $f : E \to \mathbb{R}$, we know that $\mathcal{B} = \mathcal{B}(\mathbb{R})$ is generated by $(-\infty, y]$ for $y \in \mathbb{R}$. So we just have to check that

\[ \{x \in E : f(x) \leq y\} = f^{-1}((-\infty, y]) \in \mathcal{E}. \]

Example. Let $E, F$ be topological spaces, and $f : E \to F$ be continuous. We will see that $f$ is a measurable function (under the Borel σ-algebras). Indeed, by definition, whenever $U \subseteq F$ is open, we have $f^{-1}(U)$ open as well. So $f^{-1}(U) \in \mathcal{B}(E)$ for all $U \subseteq F$ open. But since $\mathcal{B}(F)$ is the σ-algebra generated by the open sets, this implies that $f$ is measurable.

This is one very important example. We can do another very important

example.

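Example. Suppose that $A \subseteq E$. The indicator function of $A$ is $1_A : E \to \{0, 1\}$ given by

\[ 1_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases} \]

Suppose we give $\{0, 1\}$ the non-trivial σ-algebra (i.e. its power set). Then $1_A$ is a measurable function iff $A \in \mathcal{E}$.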

Example. The identity function is always measurable.

Example. Compositions of measurable functions are measurable. More precisely, if $(E, \mathcal{E})$, $(F, \mathcal{F})$ and $(G, \mathcal{G})$ are measurable spaces, and the functions $f : E \to F$ and $g : F \to G$ are measurable, then the composition $g \circ f : E \to G$ is measurable.

Indeed, if $A \in \mathcal{G}$, then $g^{-1}(A) \in \mathcal{F}$, so $f^{-1}(g^{-1}(A)) \in \mathcal{E}$. But $f^{-1}(g^{-1}(A)) = (g \circ f)^{-1}(A)$. So done.

Definition (σ-algebra generated by functions). Now suppose we have a set $E$, and a family of real-valued functions $\{f_i : i \in I\}$ on $E$. We then define

\[ \sigma(f_i : i \in I) = \sigma(f_i^{-1}(A) : A \in \mathcal{B}, i \in I). \]

This is the smallest σ-algebra on $E$ which makes all the $f_i$'s measurable.

This is analogous to the notion of initial topologies for topological spaces.

If we want to construct more measurable functions, the following definition

will be rather useful:

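Definition (Product measurable space). Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces. We define the product measurable space as $E \times G$ whose σ-algebra is generated by the projections

\[ \pi_1 : E \times G \to E, \quad \pi_2 : E \times G \to G. \]

More explicitly, the σ-algebra is given by

\[ \mathcal{E} \otimes \mathcal{G} = \sigma(\{A \times B : A \in \mathcal{E}, B \in \mathcal{G}\}). \]

More generally, if $(E_i, \mathcal{E}_i)$ is a collection of measurable spaces, the product measurable space has underlying set $\prod_i E_i$, and the σ-algebra generated by the projection maps $\pi_i : \prod_j E_j \to E_i$.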

This satisfies the following property:

Proposition. Let $f_i : E \to F_i$ be functions. Then $\{f_i\}$ are all measurable iff $(f_i) : E \to \prod_i F_i$ is measurable, where the function $(f_i)$ is defined by setting the $i$th component of $(f_i)(x)$ to be $f_i(x)$.

Proof. If the map $(f_i)$ is measurable, then by composition with the projections $\pi_i$, we know that each $f_i$ is measurable.

Conversely, if all $f_i$ are measurable, then since the σ-algebra of $\prod_i F_i$ is generated by sets of the form $\pi_j^{-1}(A)$ with $A \in \mathcal{F}_j$, and the pullback of such sets along $(f_i)$ is exactly $f_j^{-1}(A)$, we know the function $(f_i)$ is measurable.

Using this, we can prove that a whole lot more functions are measurable.

Proposition. Let $(E, \mathcal{E})$ be a measurable space. Let $(f_n : n \in \mathbb{N})$ be a sequence of non-negative measurable functions on $E$. Then the following are measurable:

\[ f_1 + f_2, \quad f_1 f_2, \quad \max\{f_1, f_2\}, \quad \min\{f_1, f_2\}, \quad \inf_n f_n, \quad \sup_n f_n, \quad \liminf_n f_n, \quad \limsup_n f_n. \]

The same is true with “non-negative” replaced with “real”, provided the new functions are real (i.e. not infinity).

Proof. This is an (easy) exercise on the example sheet. For example, the sum $f_1 + f_2$ can be written as the following composition:

\[ E \xrightarrow{(f_1, f_2)} [0, \infty]^2 \xrightarrow{+} [0, \infty]. \]

We know the second map is continuous, hence measurable. The first function is also measurable since the $f_i$ are. So the composition is also measurable.

The product follows similarly, but for the infimum and supremum, we need to check explicitly that the corresponding maps $[0, \infty]^{\mathbb{N}} \to [0, \infty]$ are measurable.

Notation. We will write

f ∧ g = min{f, g}, f ∨ g = max{f, g}.

We are now going to prove the monotone class theorem, which is a “Dynkin’s

lemma” for measurable functions. As in the case of Dynkin’s lemma, it will

sound rather awkward but will prove itself to be very useful.

Theorem (Monotone class theorem). Let $(E, \mathcal{E})$ be a measurable space, and $\mathcal{A} \subseteq \mathcal{E}$ be a π-system with $\sigma(\mathcal{A}) = \mathcal{E}$. Let $\mathcal{V}$ be a vector space of functions such that

(i) The constant function $1 = 1_E$ is in $\mathcal{V}$.

(ii) The indicator functions $1_A \in \mathcal{V}$ for all $A \in \mathcal{A}$.

(iii) $\mathcal{V}$ is closed under bounded, monotone limits.

More explicitly, if $(f_n)$ is a bounded non-negative sequence in $\mathcal{V}$, $f_n \nearrow f$ (pointwise) and $f$ is also bounded, then $f \in \mathcal{V}$.

Then $\mathcal{V}$ contains all bounded measurable functions.

Note that the conditions for $\mathcal{V}$ are much like the conditions for a d-system, where taking a bounded, monotone limit is something like taking increasing unions.

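Proof. We first deduce that $1_A \in \mathcal{V}$ for all $A \in \mathcal{E}$. Let

\[ \mathcal{D} = \{A \in \mathcal{E} : 1_A \in \mathcal{V}\}. \]

We want to show that $\mathcal{D} = \mathcal{E}$. To do this, we have to show that $\mathcal{D}$ is a d-system.

(i) Since $1_E \in \mathcal{V}$, we know $E \in \mathcal{D}$.

(ii) If $1_A \in \mathcal{V}$, then $1 - 1_A = 1_{E \setminus A} \in \mathcal{V}$. So $E \setminus A \in \mathcal{D}$.

(iii) If $(A_n)$ is an increasing sequence in $\mathcal{D}$, then $1_{A_n} \to 1_{\bigcup_n A_n}$ monotonically increasingly. So $1_{\bigcup_n A_n} \in \mathcal{V}$, i.e. $\bigcup_n A_n$ is in $\mathcal{D}$.

So, by Dynkin's lemma, we know $\mathcal{D} = \mathcal{E}$. So $\mathcal{V}$ contains indicators of all measurable sets. We will now try to obtain any measurable function by approximating.

Suppose that $f$ is bounded and non-negative measurable. We want to show that $f \in \mathcal{V}$. To do this, we approximate it by letting

\[ f_n = 2^{-n} \lfloor 2^n f \rfloor = \sum_{k=0}^\infty k 2^{-n} 1_{\{k 2^{-n} \leq f < (k+1) 2^{-n}\}}. \]

Note that since $f$ is bounded, this is a finite sum. So it is a finite linear combination of indicators of elements in $\mathcal{E}$. So $f_n \in \mathcal{V}$, and $0 \leq f_n \to f$ monotonically. So $f \in \mathcal{V}$.

More generally, if $f$ is bounded and measurable, then we can write

\[ f = (f \vee 0) + (f \wedge 0) \equiv f^+ - f^-. \]

Then $f^+$ and $f^-$ are bounded and non-negative measurable. So $f \in \mathcal{V}$.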
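The dyadic approximation $f_n = 2^{-n} \lfloor 2^n f \rfloor$ used in the proof can be tried out numerically. The following is a minimal sketch, not part of the lectures: the function `f` and the sample point are arbitrary choices made purely for illustration.

```python
import math


def dyadic_approx(value: float, n: int) -> float:
    """Return 2^{-n} * floor(2^n * value), the n-th dyadic approximation from below."""
    return math.floor(value * 2**n) / 2**n


def f(x: float) -> float:
    """A hypothetical bounded, non-negative function, chosen only for illustration."""
    return x * x


if __name__ == "__main__":
    x = 0.7
    for n in range(1, 8):
        # The printed values increase with n and stay within 2^{-n} of f(x).
        print(n, dyadic_approx(f(x), n))
```

At each fixed point the approximations are non-decreasing in $n$ and differ from the true value by less than $2^{-n}$, which is the pointwise monotone convergence the proof relies on.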

Unfortunately, we will not have a chance to use this result until the next

chapter where we discuss integration. There we will use this a lot.

2.2 Constructing new measures

We are going to look at two ways to construct new measures on spaces based on

some measurable function we have.

Definition (Image measure). Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces. Suppose $\mu$ is a measure on $\mathcal{E}$ and $f : E \to G$ is a measurable function. We define the image measure $\nu = \mu \circ f^{-1}$ on $\mathcal{G}$ by

\[ \nu(A) = \mu(f^{-1}(A)). \]

It is a routine check that this is indeed a measure.

If we have a strictly increasing continuous function, then we know it is

invertible (if we restrict the codomain appropriately), and the inverse is also

strictly increasing. It is also clear that these conditions are necessary for an

inverse to exist. However, if we relax the conditions a bit, we can get some sort

of “pseudoinverse” (some categorists may call them “left adjoints” (and will tell

you that it is a trivial consequence of the adjoint functor theorem)).

Recall that a function $g$ is right continuous if $x_n \searrow x$ implies $g(x_n) \to g(x)$, and similarly $f$ is left continuous if $x_n \nearrow x$ implies $f(x_n) \to f(x)$.

Lemma. Let g : R → R be non-constant, non-decreasing and right continuous.

We set

g(±∞) = lim

x→±∞

g(x).

We set I = (g(−∞), g(∞)). Since g is non-constant, this is non-empty.

Then there is a non-decreasing, left continuous function $f : I \to \mathbb{R}$ such that for all $x \in I$ and $y \in \mathbb{R}$, we have

x ≤ g(y) ⇔ f (x) ≤ y.

Thus, taking the negation of this, we have

x > g(y) ⇔ f (x) > y.

Explicitly, for x ∈ I, we define

f(x) = inf{y ∈ R : x ≤ g(y)}.

Proof. We just have to verify that it works. For $x \in I$, consider

\[ J_x = \{y \in \mathbb{R} : x \leq g(y)\}. \]

Since $g$ is non-decreasing, if $y \in J_x$ and $y' \geq y$, then $y' \in J_x$. Since $g$ is right-continuous, if $y_n \in J_x$ is such that $y_n \searrow y$, then $y \in J_x$. So we have

\[ J_x = [f(x), \infty). \]

Thus, for $y \in \mathbb{R}$, we have

\[ x \leq g(y) \Leftrightarrow f(x) \leq y. \]

So we just have to prove the remaining properties of $f$. Now for $x \leq x'$, we have $J_{x'} \subseteq J_x$. So $f(x) \leq f(x')$. So $f$ is non-decreasing.

Similarly, if $x_n \nearrow x$, then we have $J_x = \bigcap_n J_{x_n}$. So $f(x_n) \to f(x)$. So this is left continuous.
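The defining formula $f(x) = \inf\{y : x \leq g(y)\}$ can be approximated by bisection, since the set $J_x$ is an interval of the form $[f(x), \infty)$. The sketch below is only an illustration: the helper name `generalized_inverse`, the bracketing interval and the tolerance are assumptions, and the step function `math.floor` is used as a sample non-decreasing, right continuous $g$.

```python
import math
from typing import Callable


def generalized_inverse(g: Callable[[float], float], x: float,
                        lo: float, hi: float, tol: float = 1e-9) -> float:
    """Approximate f(x) = inf{y : x <= g(y)} by bisection.

    Assumes g is non-decreasing and right continuous, with g(lo) < x <= g(hi).
    """
    if not (g(lo) < x <= g(hi)):
        raise ValueError("need g(lo) < x <= g(hi)")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if x <= g(mid):
            hi = mid  # mid lies in J_x = {y : x <= g(y)}
        else:
            lo = mid  # mid lies strictly below J_x
    return hi  # hi always stays inside J_x, so it approximates inf J_x from above


if __name__ == "__main__":
    # For g = floor we have f = ceil, so the value printed is close to 3.
    print(generalized_inverse(math.floor, 2.5, -10.0, 10.0))
```

For $g = \lfloor \cdot \rfloor$ the pseudoinverse is $f = \lceil \cdot \rceil$, and the equivalence $x \leq \lfloor y \rfloor \Leftrightarrow \lceil x \rceil \leq y$ is exactly the characteristic property in the lemma.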

Example. (The two figures for this example, the graph of a function $g$ and the graph of the corresponding $f$, are not reproduced here.)

This allows us to construct new measures on R with ease.

Theorem. Let $g : \mathbb{R} \to \mathbb{R}$ be non-constant, non-decreasing and right continuous. Then there exists a unique Radon measure $\mathrm{d}g$ on $\mathcal{B}$ such that

\[ \mathrm{d}g((a, b]) = g(b) - g(a). \]

Moreover, we obtain all non-zero Radon measures on $\mathbb{R}$ in this way.

We have already seen an instance of this when $g$ was the identity function.

Given the lemma, this is very easy.

Proof. Take $I$ and $f$ as in the previous lemma, and let $\mu$ be the restriction of the Lebesgue measure to Borel subsets of $I$. Now $f$ is measurable since it is left continuous. We define $\mathrm{d}g = \mu \circ f^{-1}$. Then we have

dg((a, b]) = µ({x ∈ I : a < f(x) ≤ b})

= µ({x ∈ I : g(a) < x ≤ g(b)})

= µ((g(a), g(b)]) = g(b) − g(a).

So dg is a Radon measure with the required property.

There are no other such measures by the argument used for uniqueness of

the Lebesgue measure.

To show we get all non-zero Radon measures this way, suppose we have a non-zero Radon measure $\nu$ on $\mathbb{R}$. We want to produce a $g$ such that $\nu = \mathrm{d}g$. We set

\[ g(y) = \begin{cases} -\nu((y, 0]) & y \leq 0 \\ \nu((0, y]) & y > 0 \end{cases} \]

Then $\nu((a, b]) = g(b) - g(a)$. We see that $\nu$ is non-zero, so $g$ is non-constant. It is also easy to see it is non-decreasing and right continuous. So $\nu = \mathrm{d}g$ by the uniqueness part of the theorem.

2.3 Random variables

We are now going to look at these ideas in the context of probability. It turns

out they are concepts we already know and love!

Definition (Random variable). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and $(E, \mathcal{E})$ a measurable space. Then an $E$-valued random variable is a measurable function $X : \Omega \to E$.

By default, we will assume the random variables are real.

Usually, when we have a random variable $X$, we might ask questions like “what is the probability that $X \in A$?”. In other words, we are asking for the “size” of the set of things that get sent to $A$. This is just the image measure!

Definition (Distribution/law). Given a random variable $X : \Omega \to E$, the distribution or law of $X$ is the image measure $\mu_X := \mathbb{P} \circ X^{-1}$. We usually write

\[ \mathbb{P}(X \in A) = \mu_X(A) = \mathbb{P}(X^{-1}(A)). \]

If $E = \mathbb{R}$, then $\mu_X$ is determined by its values on the π-system of intervals $(-\infty, y]$. We set

\[ F_X(x) = \mu_X((-\infty, x]) = \mathbb{P}(X \leq x). \]

This is known as the distribution function of $X$.

Proposition. We have

\[ F_X(x) \to \begin{cases} 0 & x \to -\infty \\ 1 & x \to +\infty \end{cases} \]

Also, $F_X(x)$ is non-decreasing and right-continuous.

We call any function $F$ with these properties a distribution function.

Definition (Distribution function). A distribution function is a non-decreasing, right continuous function $F : \mathbb{R} \to [0, 1]$ satisfying

\[ F(x) \to \begin{cases} 0 & x \to -\infty \\ 1 & x \to +\infty \end{cases} \]

We now want to show that every distribution function is indeed the distribution function of some random variable.

Proposition. Let $F$ be any distribution function. Then there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a random variable $X$ such that $F_X = F$.

Proof. Take $(\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}(0, 1), \text{Lebesgue})$. We take $X : \Omega \to \mathbb{R}$ to be

\[ X(\omega) = \inf\{x : \omega \leq F(x)\}. \]

Then we have

\[ X(\omega) \leq x \iff \omega \leq F(x). \]

So we have

\[ F_X(x) = \mathbb{P}[X \leq x] = \mathbb{P}[(0, F(x)]] = F(x). \]

Therefore $F_X = F$.

This construction is actually very useful in practice. If we are writing

a computer program and want to sample a random variable, we will use this

procedure. The computer usually comes with a uniform (pseudo)-random number

generator. Then using this procedure allows us to produce random variables of

any distribution from a uniform sample.
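As a concrete instance of this sampling procedure, here is a minimal sketch for the exponential distribution with $F(x) = 1 - e^{-x}$ for $x \geq 0$, where $X(\omega) = \inf\{x : \omega \leq F(x)\}$ has the closed form $-\log(1 - \omega)$. The choice of distribution, the function names and the seed are assumptions made for illustration only.

```python
import math
import random


def exponential_cdf(x: float) -> float:
    """Distribution function F(x) = 1 - exp(-x) for x >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0


def exponential_quantile(omega: float) -> float:
    """X(omega) = inf{x : omega <= F(x)} for the F above, valid for omega in [0, 1)."""
    return -math.log(1.0 - omega)


def sample(n: int, seed: int = 0) -> list[float]:
    """Draw n samples by pushing uniform (pseudo-)random numbers through X."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - omega is never zero.
    return [exponential_quantile(rng.random()) for _ in range(n)]


if __name__ == "__main__":
    xs = sample(100_000)
    # Empirical fraction of samples <= 1, to be compared with F(1).
    print(sum(x <= 1.0 for x in xs) / len(xs), exponential_cdf(1.0))
```

For distribution functions without a closed-form inverse, the infimum would have to be approximated numerically instead.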

The next thing we want to consider is the notion of independence of random variables. Recall that for random variables $X, Y$, we used to say that they are independent if for any $A, B$, we have

\[ \mathbb{P}[X \in A, Y \in B] = \mathbb{P}[X \in A]\mathbb{P}[Y \in B]. \]

But this is exactly the statement that the σ-algebras generated by $X$ and $Y$ are independent!

Definition (Independence of random variables). A family $(X_n)$ of random variables is said to be independent if the family of σ-algebras $(\sigma(X_n))$ is independent.

Proposition. Two real-valued random variables $X, Y$ are independent iff

\[ \mathbb{P}[X \leq x, Y \leq y] = \mathbb{P}[X \leq x]\mathbb{P}[Y \leq y] \]

for all $x, y \in \mathbb{R}$. More generally, if $(X_n)$ is a sequence of real-valued random variables, then they are independent iff

\[ \mathbb{P}[X_1 \leq x_1, \cdots, X_n \leq x_n] = \prod_{j=1}^n \mathbb{P}[X_j \leq x_j] \]

for all $n$ and $x_j$.

Proof. The $\Rightarrow$ direction is obvious. For the other direction, we simply note that $\{(-\infty, x] : x \in \mathbb{R}\}$ is a generating π-system for the Borel σ-algebra of $\mathbb{R}$.

In probability, we often say things like “let $X_1, X_2, \cdots$ be iid random variables”. However, how can we guarantee that iid random variables do indeed exist? We start with the less ambitious goal of finding iid Bernoulli(1/2) random variables:

Proposition. Let

\[ (\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}(0, 1), \text{Lebesgue}) \]

be our probability space. Then there exists a sequence $(R_n)$ of independent Bernoulli(1/2) random variables.

Proof. Suppose we have $\omega \in \Omega = (0, 1)$. Then we write $\omega$ as a binary expansion

\[ \omega = \sum_{n=1}^\infty \omega_n 2^{-n}, \]

where $\omega_n \in \{0, 1\}$. We make the binary expansion unique by disallowing expansions that end in an infinite sequence of zeroes.

We define $R_n(\omega) = \omega_n$. We will show that $R_n$ is measurable. Indeed, we can write

\[ R_1(\omega) = \omega_1 = 1_{(1/2, 1]}(\omega), \]

where $1_{(1/2, 1]}$ is the indicator function. Since indicator functions of measurable sets are measurable, we know $R_1$ is measurable. Similarly, we have

\[ R_2(\omega) = 1_{(1/4, 1/2]}(\omega) + 1_{(3/4, 1]}(\omega). \]

So this is also a measurable function. More generally, we can do this for any $R_n(\omega)$: we have

\[ R_n(\omega) = \sum_{j=1}^{2^{n-1}} 1_{(2^{-n}(2j-1), 2^{-n}(2j)]}(\omega). \]

So each $R_n$ is a random variable, as each can be expressed as a sum of indicators of measurable sets.

Now let's calculate

\[ \mathbb{P}[R_n = 1] = \sum_{j=1}^{2^{n-1}} 2^{-n}((2j) - (2j - 1)) = \sum_{j=1}^{2^{n-1}} 2^{-n} = \frac{1}{2}. \]

Then we have

\[ \mathbb{P}[R_n = 0] = 1 - \mathbb{P}[R_n = 1] = \frac{1}{2} \]

as well. So $R_n \sim$ Bernoulli(1/2).

We can straightforwardly check that $(R_n)$ is an independent sequence, since for $n \neq m$, we have

\[ \mathbb{P}[R_n = 0 \text{ and } R_m = 0] = \frac{1}{4} = \mathbb{P}[R_n = 0]\mathbb{P}[R_m = 0]. \]
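The probabilities in this proof can be checked exactly for small indices, since each $R_n$ is constant on the dyadic intervals $((i-1)2^{-N}, i 2^{-N}]$ with $N \geq n$. The sketch below is an illustration under assumptions: the helper names are hypothetical, and it uses the observation that $\omega \in (2^{-n}(2j-1), 2^{-n}(2j)]$ for some $j$ exactly when $\lceil 2^n \omega \rceil$ is even.

```python
from fractions import Fraction
from math import ceil


def R(n: int, omega: Fraction) -> int:
    """n-th binary digit of omega, for the expansion that does not end in all zeroes."""
    return 1 if ceil(omega * 2**n) % 2 == 0 else 0


def joint_probability(indices: list[int], values: list[int]) -> Fraction:
    """Exact Lebesgue measure of {omega : R_n(omega) = v for each listed (n, v)}."""
    N = max(indices)
    cell = Fraction(1, 2**N)
    total = Fraction(0)
    for i in range(1, 2**N + 1):
        omega = i * cell  # right endpoint, which lies in the cell ((i-1)/2^N, i/2^N]
        if all(R(n, omega) == v for n, v in zip(indices, values)):
            total += cell
    return total


if __name__ == "__main__":
    print(joint_probability([3], [1]))           # P[R_3 = 1]
    print(joint_probability([1, 3], [0, 0]))     # P[R_1 = 0 and R_3 = 0]
    print(joint_probability([1, 2, 3], [0, 1, 0]))
```

The same enumeration also gives joint probabilities of three or more digits, which is what full (rather than pairwise) independence requires.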

We will now use the $(R_n)$ to construct an independent sequence with any prescribed distributions.

Proposition. Let

\[ (\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}(0, 1), \text{Lebesgue}). \]

Given any sequence $(F_n)$ of distribution functions, there is a sequence $(X_n)$ of independent random variables with $F_{X_n} = F_n$ for all $n$.

Proof. Let $m : \mathbb{N}^2 \to \mathbb{N}$ be any bijection, and relabel

\[ Y_{k,n} = R_{m(k,n)}, \]

where the $R_j$ are as in the previous proposition. We let

\[ Y_n = \sum_{k=1}^\infty 2^{-k} Y_{k,n}. \]

Then we know that $(Y_n)$ is an independent sequence of random variables, and each is uniform on $(0, 1)$. As before, we define

\[ G_n(y) = \inf\{x : y \leq F_n(x)\}. \]

We set $X_n = G_n(Y_n)$. Then $(X_n)$ is a sequence of independent random variables with $F_{X_n} = F_n$.
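To make the relabelling step concrete, the following sketch uses one particular bijection, $m(k, n) = 2^{n-1}(2k - 1)$ (every positive integer is uniquely a power of two times an odd number), and computes a finite truncation of $Y_n$. The proof allows any bijection; this choice, the helper names and the truncation level `K` are assumptions made for illustration.

```python
from fractions import Fraction
from math import ceil


def R(j: int, omega: Fraction) -> int:
    """j-th binary digit of omega, for the expansion that does not end in all zeroes."""
    return 1 if ceil(omega * 2**j) % 2 == 0 else 0


def m(k: int, n: int) -> int:
    """One possible bijection N^2 -> N (indices start at 1): 2^(n-1) * (2k - 1)."""
    return 2 ** (n - 1) * (2 * k - 1)


def Y_truncated(n: int, omega: Fraction, K: int) -> Fraction:
    """First K terms of Y_n = sum_k 2^{-k} R_{m(k, n)}(omega)."""
    return sum((Fraction(R(m(k, n), omega), 2**k) for k in range(1, K + 1)),
               Fraction(0))


if __name__ == "__main__":
    omega = Fraction(1, 3)  # binary digits 0, 1, 0, 1, ...
    print(Y_truncated(1, omega, 10), Y_truncated(2, omega, 10))
```

With this $m$, the variable $Y_1$ reads the odd-indexed digits of $\omega$, $Y_2$ reads the digits whose index is twice an odd number, and so on, so different $Y_n$ never share a digit.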

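We end the section with a random fact: let $(\Omega, \mathcal{F}, \mathbb{P})$ and $R_j$ be as above. Then $\frac{1}{n} \sum_{j=1}^n R_j$ is the average of $n$ independent Bernoulli(1/2) random variables. The weak law of large numbers says for any $\varepsilon > 0$, we have

\[ \mathbb{P}\left[\left|\frac{1}{n} \sum_{j=1}^n R_j - \frac{1}{2}\right| \geq \varepsilon\right] \to 0 \text{ as } n \to \infty. \]

The strong law of large numbers, which we will prove later, says that

\[ \mathbb{P}\left[\left\{\omega : \frac{1}{n} \sum_{j=1}^n R_j(\omega) \to \frac{1}{2}\right\}\right] = 1. \]

So “almost every number” in $(0, 1)$ has an equal proportion of 0's and 1's in its binary expansion. This is known as the normal number theorem.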

2.4 Convergence of measurable functions

The next thing to look at is the convergence of measurable functions. In measure

theory, wonderful things happen when we talk about convergence. In analysis,

most of the time we had to require uniform convergence, or even stronger

notions, if we want limits to behave well. However, in measure theory, the kinds

of convergence we talk about are somewhat pointwise in nature. In fact, it

will be weaker than pointwise convergence. Yet, we are still going to get good

properties out of them.

Definition (Convergence almost everywhere). Suppose that $(E, \mathcal{E}, \mu)$ is a measure space. Suppose that $(f_n), f$ are measurable functions. We say $f_n \to f$ almost everywhere (a.e.) if

\[ \mu(\{x \in E : f_n(x) \not\to f(x)\}) = 0. \]

If $(E, \mathcal{E}, \mu)$ is a probability space, this is called almost sure convergence.

To see this makes sense, i.e. the set in there is actually measurable, note that

\[ \{x \in E : f_n(x) \not\to f(x)\} = \{x \in E : \limsup |f_n(x) - f(x)| > 0\}. \]

We have previously seen that $\limsup |f_n - f|$ is non-negative measurable. So the set $\{x \in E : \limsup |f_n(x) - f(x)| > 0\}$ is measurable.

Another useful notion of convergence is convergence in measure.

Definition (Convergence in measure). Suppose that $(E, \mathcal{E}, \mu)$ is a measure space. Suppose that $(f_n), f$ are measurable functions. We say $f_n \to f$ in measure if for each $\varepsilon > 0$, we have

\[ \mu(\{x \in E : |f_n(x) - f(x)| \geq \varepsilon\}) \to 0 \text{ as } n \to \infty. \]

If $(E, \mathcal{E}, \mu)$ is a probability space, then this is called convergence in probability.

In the case of a probability space, this says

\[ \mathbb{P}(|X_n - X| \geq \varepsilon) \to 0 \text{ as } n \to \infty \]

for all $\varepsilon > 0$, which is how we stated the weak law of large numbers in the past.

After we define integration, we can consider the norms of a function $f$ by

\[ \|f\|_p = \left(\int |f(x)|^p \, \mathrm{d}x\right)^{1/p}. \]

Then in particular, if $\|f_n - f\|_p \to 0$, then $f_n \to f$ in measure, and this provides an easy way to see that functions converge in measure.

In general, neither of these notions imply each other. However, the following

theorem provides us with a convenient dictionary to translate between the two

notions.

Theorem.

(i) If $\mu(E) < \infty$, then $f_n \to f$ a.e. implies $f_n \to f$ in measure.

(ii) For any $E$, if $f_n \to f$ in measure, then there exists a subsequence $(f_{n_k})$ such that $f_{n_k} \to f$ a.e.

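Proof.

(i) First suppose $\mu(E) < \infty$, and fix $\varepsilon > 0$. Consider

\[ \mu(\{x \in E : |f_n(x) - f(x)| \leq \varepsilon\}). \]

We use the result from the first example sheet that for any sequence of events $(A_n)$, we have

\[ \liminf \mu(A_n) \geq \mu(\liminf A_n). \]

Applying to the above sequence says

\[ \liminf \mu(\{x : |f_n(x) - f(x)| \leq \varepsilon\}) \geq \mu(\{x : |f_n(x) - f(x)| \leq \varepsilon \text{ eventually}\}) \geq \mu(\{x \in E : |f_n(x) - f(x)| \to 0\}) = \mu(E). \]

As $\mu(E) < \infty$, we have $\mu(\{x \in E : |f_n(x) - f(x)| > \varepsilon\}) \to 0$ as $n \to \infty$.

(ii) Suppose that $f_n \to f$ in measure. We pick a subsequence $(n_k)$ such that

\[ \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k}\right\}\right) \leq 2^{-k}. \]

Then we have

\[ \sum_{k=1}^\infty \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k}\right\}\right) \leq \sum_{k=1}^\infty 2^{-k} = 1 < \infty. \]

By the first Borel–Cantelli lemma, we know

\[ \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k} \text{ i.o.}\right\}\right) = 0. \]

So $f_{n_k} \to f$ a.e.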

It is important that we assume that µ(E) < ∞ for the first part.

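Example. Consider $(E, \mathcal{E}, \mu) = (\mathbb{R}, \mathcal{B}, \text{Lebesgue})$. Take $f_n(x) = 1_{[n, \infty)}(x)$. Then $f_n(x) \to 0$ for all $x$, and in particular almost everywhere. However, we have

\[ \mu\left(\left\{x \in \mathbb{R} : |f_n(x)| > \frac{1}{2}\right\}\right) = \mu([n, \infty)) = \infty \]

for all $n$.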
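In the other direction, part (ii) of the theorem only promises a.e. convergence along a subsequence. A standard illustration, which is not taken from the lectures and is offered here only as a sketch, is the sequence of indicators of dyadic blocks sweeping repeatedly across $[0, 1]$: writing $n = 2^k + j$ with $0 \leq j < 2^k$, let $f_n$ be the indicator of $[j 2^{-k}, (j+1) 2^{-k}]$. The function names below are hypothetical.

```python
from fractions import Fraction


def block(n: int) -> tuple[Fraction, Fraction]:
    """Interval [j/2^k, (j+1)/2^k] on which f_n = 1, where n = 2^k + j, 0 <= j < 2^k."""
    k = n.bit_length() - 1
    j = n - 2**k
    return Fraction(j, 2**k), Fraction(j + 1, 2**k)


def f(n: int, x: Fraction) -> int:
    """Value of the n-th indicator function at x."""
    a, b = block(n)
    return 1 if a <= x <= b else 0


def measure_where_large(n: int) -> Fraction:
    """Lebesgue measure of {x in [0, 1] : |f_n(x)| > 1/2}, i.e. the block length."""
    a, b = block(n)
    return b - a


if __name__ == "__main__":
    x = Fraction(1, 3)
    print([measure_where_large(2**k) for k in range(1, 6)])  # shrinks to 0
    print([f(n, x) for n in range(1, 33)])                   # keeps returning to 1
    print([f(2**k, x) for k in range(1, 8)])                 # subsequence n_k = 2^k
```

Here $f_n \to 0$ in measure since the block lengths shrink, yet $f_n(x) = 1$ for some $n$ in every dyadic level, so $f_n(x)$ does not converge for any $x \in [0, 1]$. Along the subsequence $n_k = 2^k$ the blocks are $[0, 2^{-k}]$, and $f_{n_k}(x) \to 0$ for every $x > 0$.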

There is one last type of convergence we are interested in. For now we will only formulate it in the probability setting, but there is an analogous notion in measure theory known as weak convergence, which we will discuss much later on in the course.

Definition (Convergence in distribution). Let $(X_n), X$ be random variables with distribution functions $F_{X_n}$ and $F_X$. Then we say $X_n \to X$ in distribution if $F_{X_n}(x) \to F_X(x)$ for all $x \in \mathbb{R}$ at which $F_X$ is continuous.

Note that here we do not need that $(X_n)$ and $X$ live on the same probability space, since we only talk about the distribution functions.

But why do we have the condition with continuity points? The idea is that if the resulting distribution has a “jump” at $x$, it doesn't matter which side of the jump $F_X(x)$ is at. Here is a simple example that tells us why this is very important:

Example. Let $X_n$ be uniform on $[0, 1/n]$. Intuitively, this should converge to the random variable that is always zero.

We can compute

\[ F_{X_n}(x) = \begin{cases} 0 & x \leq 0 \\ nx & 0 < x < 1/n \\ 1 & x \geq 1/n \end{cases} \]

We can also compute the distribution function of the zero random variable as

\[ F_0(x) = \begin{cases} 0 & x < 0 \\ 1 & x \geq 0 \end{cases} \]

But $F_{X_n}(0) = 0$ for all $n$, while $F_0(0) = 1$.
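The two distribution functions in this example are easy to tabulate. The sketch below, with hypothetical function names, evaluates them at a few points to show agreement in the limit away from the jump at $0$ and disagreement at $0$ itself.

```python
def cdf_uniform(n: int, x: float) -> float:
    """Distribution function of a uniform random variable on [0, 1/n]."""
    if x <= 0:
        return 0.0
    if x < 1 / n:
        return n * x
    return 1.0


def cdf_zero(x: float) -> float:
    """Distribution function of the random variable that is always zero."""
    return 1.0 if x >= 0 else 0.0


if __name__ == "__main__":
    for x in (-0.5, 0.0, 0.01, 0.3):
        # For large n the two values agree at every x except x = 0.
        print(x, [cdf_uniform(n, x) for n in (1, 10, 1000, 10**6)], cdf_zero(x))
```

Since $0$ is the only discontinuity of the limiting distribution function, the definition ignores the mismatch there, and $X_n \to 0$ in distribution as intuition suggests.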

One might now think of cheating by cooking up some random variable such that $F$ is discontinuous at so many points that random, unrelated things converge to $F$. However, this cannot be done, because $F$ is a non-decreasing function, and thus can only have countably many points of discontinuity.

The big theorem we are going to prove about convergence in distribution is

that actually it is very boring and doesn’t give us anything new.

Theorem (Skorokhod representation theorem of weak convergence).

(i) If $(X_n), X$ are defined on the same probability space, and $X_n \to X$ in probability, then $X_n \to X$ in distribution.

(ii) If $X_n \to X$ in distribution, then there exist random variables $(\tilde{X}_n)$ and $\tilde{X}$ defined on a common probability space with $F_{\tilde{X}_n} = F_{X_n}$ and $F_{\tilde{X}} = F_X$ such that $\tilde{X}_n \to \tilde{X}$ a.s.

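Proof. Let $S = \{x \in \mathbb{R} : F_X \text{ is continuous at } x\}$.

(i) Assume that $X_n \to X$ in probability. Fix $x \in S$. We need to show that $F_{X_n}(x) \to F_X(x)$ as $n \to \infty$.

We fix $\varepsilon > 0$. Since $x \in S$, this implies that there is some $\delta > 0$ such that

\[ F_X(x - \delta) \geq F_X(x) - \frac{\varepsilon}{2}, \quad F_X(x + \delta) \leq F_X(x) + \frac{\varepsilon}{2}. \]

We fix $N$ large such that $n \geq N$ implies $\mathbb{P}[|X_n - X| \geq \delta] \leq \frac{\varepsilon}{2}$. Then

\[ F_{X_n}(x) = \mathbb{P}[X_n \leq x] = \mathbb{P}[(X_n - X) + X \leq x]. \]

We now notice that $\{(X_n - X) + X \leq x\} \subseteq \{X \leq x + \delta\} \cup \{|X_n - X| > \delta\}$. So we have

\[ F_{X_n}(x) \leq \mathbb{P}[X \leq x + \delta] + \mathbb{P}[|X_n - X| > \delta] \leq F_X(x + \delta) + \frac{\varepsilon}{2} \leq F_X(x) + \varepsilon. \]

We similarly have

\[ F_{X_n}(x) = \mathbb{P}[X_n \leq x] \geq \mathbb{P}[X \leq x - \delta] - \mathbb{P}[|X_n - X| > \delta] \geq F_X(x - \delta) - \frac{\varepsilon}{2} \geq F_X(x) - \varepsilon. \]

Combining, we have that $n \geq N$ implies $|F_{X_n}(x) - F_X(x)| \leq \varepsilon$. Since $\varepsilon$ was arbitrary, we are done.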

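(ii) Suppose $X_n \to X$ in distribution. We again let

\[ (\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}((0, 1)), \text{Lebesgue}). \]

We let

\[ \tilde{X}_n(\omega) = \inf\{x : \omega \leq F_{X_n}(x)\}, \quad \tilde{X}(\omega) = \inf\{x : \omega \leq F_X(x)\}. \]

Recall from before that $\tilde{X}_n$ has the same distribution function as $X_n$ for all $n$, and $\tilde{X}$ has the same distribution as $X$. Moreover, we have

\[ \tilde{X}_n(\omega) \leq x \Leftrightarrow \omega \leq F_{X_n}(x), \quad x < \tilde{X}_n(\omega) \Leftrightarrow F_{X_n}(x) < \omega, \]

and similarly if we replace $X_n$ with $X$.

We are now going to show that with this particular choice, we have $\tilde{X}_n \to \tilde{X}$ a.s.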

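Note that $\tilde{X}$ is a non-decreasing function $(0, 1) \to \mathbb{R}$. Then by general analysis, $\tilde{X}$ has at most countably many discontinuities. We write

\[ \Omega_0 = \{\omega \in (0, 1) : \tilde{X} \text{ is continuous at } \omega\}. \]

Then $(0, 1) \setminus \Omega_0$ is countable, and hence has Lebesgue measure 0. So

\[ \mathbb{P}[\Omega_0] = 1. \]

We are now going to show that $\tilde{X}_n(\omega) \to \tilde{X}(\omega)$ for all $\omega \in \Omega_0$.

Note that $F_X$ is a non-decreasing function, and hence the set of points of discontinuity $\mathbb{R} \setminus S$ is also countable. So $S$ is dense in $\mathbb{R}$. Fix $\omega \in \Omega_0$ and $\varepsilon > 0$. We want to show that $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| \leq \varepsilon$ for all $n$ large enough.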

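Since $S$ is dense in $\mathbb{R}$, we can find $x^-, x^+$ in $S$ such that

\[ x^- < \tilde{X}(\omega) < x^+ \]

and $x^+ - x^- < \varepsilon$. What we want to do is to use the characteristic property of $\tilde{X}$ and $F_X$ to say that this implies

\[ F_X(x^-) < \omega < F_X(x^+). \]

Then since $F_{X_n} \to F_X$ at the points $x^-, x^+$, for sufficiently large $n$, we have

\[ F_{X_n}(x^-) < \omega < F_{X_n}(x^+). \]

Hence we have

\[ x^- < \tilde{X}_n(\omega) < x^+. \]

Then it follows that $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| < \varepsilon$.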

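However, this doesn't work, since $\tilde{X}(\omega) < x^+$ only implies $\omega \leq F_X(x^+)$, and our argument will break down. So we do a funny thing where we introduce a new variable $\omega^+$.

Since $\tilde{X}$ is continuous at $\omega$, we can find $\omega^+ \in (\omega, 1)$ such that $\tilde{X}(\omega^+) < x^+$. Then we have

\[ x^- < \tilde{X}(\omega) \leq \tilde{X}(\omega^+) < x^+. \]

Then we have

\[ F_X(x^-) < \omega < \omega^+ \leq F_X(x^+). \]

So for sufficiently large $n$, we have

\[ F_{X_n}(x^-) < \omega < F_{X_n}(x^+). \]

So we have

\[ x^- < \tilde{X}_n(\omega) \leq x^+, \]

and we are done.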

2.5 Tail events

Finally, we are going to quickly look at tail events. These are events that depend

only on the asymptotic behaviour of a sequence of random variables.

Definition (Tail σ-algebra). Let $(X_n)$ be a sequence of random variables. We let

\[ \mathcal{T}_n = \sigma(X_{n+1}, X_{n+2}, \cdots), \]

and

\[ \mathcal{T} = \bigcap_n \mathcal{T}_n. \]

Then $\mathcal{T}$ is the tail σ-algebra.

Then $\mathcal{T}$-measurable events and random variables only depend on the asymptotic behaviour of the $X_n$'s.

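Example. Let $(X_n)$ be a sequence of real-valued random variables. Then

\[ \limsup_{n \to \infty} \frac{1}{n} \sum_{j=1}^n X_j, \quad \liminf_{n \to \infty} \frac{1}{n} \sum_{j=1}^n X_j \]

are $\mathcal{T}$-measurable random variables. Finally,

\[ \left\{\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^n X_j \text{ exists}\right\} \in \mathcal{T}, \]

since this is just the set of all points where the previous two things agree.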

Theorem (Kolmogorov 0-1 law). Let $(X_n)$ be a sequence of independent (real-valued) random variables. If $A \in \mathcal{T}$, then $\mathbb{P}[A] = 0$ or $1$.

Moreover, if $X$ is a $\mathcal{T}$-measurable random variable, then there exists a constant $c$ such that

\[ \mathbb{P}[X = c] = 1. \]

Proof.

The proof is very funny the first time we see it. We are going to prove

the theorem by checking something that seems very strange. We are going to

show that if A ∈ T , then A is independent of A. It then follows that

P[A] = P[A ∩ A] = P[A]P[A],

so P[A] = 0 or 1. In fact, we are going to prove that T is independent of T .

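Let

\[ \mathcal{F}_n = \sigma(X_1, \cdots, X_n). \]

This σ-algebra is generated by the π-system of events of the form

\[ A = \{X_1 \leq x_1, \cdots, X_n \leq x_n\}. \]

Similarly, $\mathcal{T}_n = \sigma(X_{n+1}, X_{n+2}, \cdots)$ is generated by the π-system of events of the form

\[ B = \{X_{n+1} \leq x_{n+1}, \cdots, X_{n+k} \leq x_{n+k}\}, \]

where $k$ is any natural number.

Since the $X_n$ are independent, we know for any such $A$ and $B$, we have

\[ \mathbb{P}[A \cap B] = \mathbb{P}[A]\mathbb{P}[B]. \]

Since this is true for all $A$ and $B$, it follows that $\mathcal{F}_n$ is independent of $\mathcal{T}_n$.

Since $\mathcal{T} = \bigcap_k \mathcal{T}_k \subseteq \mathcal{T}_n$ for each $n$, we know $\mathcal{F}_n$ is independent of $\mathcal{T}$.

Now $\bigcup_k \mathcal{F}_k$ is a π-system, which generates the σ-algebra $\mathcal{F}_\infty = \sigma(X_1, X_2, \cdots)$. We know that if $A \in \bigcup_n \mathcal{F}_n$, then there has to exist an index $n$ such that $A \in \mathcal{F}_n$. So $A$ is independent of $\mathcal{T}$. So $\mathcal{F}_\infty$ is independent of $\mathcal{T}$.

Finally, note that $\mathcal{T} \subseteq \mathcal{F}_\infty$. So $\mathcal{T}$ is independent of $\mathcal{T}$.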

To find the constant, suppose that X is T -measurable. Then

P[X ≤ x] ∈ {0, 1}

for all x ∈ R since {X ≤ x} ∈ T .

Now take

c = inf{x ∈ R : P[X ≤ x] = 1}.

Then with this particular choice of $c$, it is easy to see that $\mathbb{P}[X = c] = 1$. This completes the proof of the theorem.

3 Integration

3.1 Definition and basic properties

We are now going to work towards defining the integral of a measurable function on a measure space $(E, \mathcal{E}, \mu)$. Different sources use different notations for the integral. The following notations are all commonly used:

\[ \mu(f) = \int_E f \, \mathrm{d}\mu = \int_E f(x) \, \mathrm{d}\mu(x) = \int_E f(x) \, \mu(\mathrm{d}x). \]

In the case where $(E, \mathcal{E}, \mu) = (\mathbb{R}, \mathcal{B}, \text{Lebesgue})$, people often just write this as

\[ \mu(f) = \int_{\mathbb{R}} f(x) \, \mathrm{d}x. \]

On the other hand, if $(E, \mathcal{E}, \mu) = (\Omega, \mathcal{F}, \mathbb{P})$ is a probability space, and $X$ is a random variable, then people write the integral as $\mathbb{E}[X]$, the expectation of $X$.

So how are we going to define the integral? There are two steps to defining

the integral. The idea is that we first define the integral on simple functions,

and then extend the definition to more general measurable functions by taking

the limit. When we do the definition for simple functions, it will be obvious that

the definition satisfies the nice properties, and we will have to check that they

are preserved when we take the limit.

Definition (Simple function). A simple function is a measurable function that can be written as a finite non-negative linear combination of indicator functions of measurable sets, i.e.

\[ f = \sum_{k=1}^n a_k 1_{A_k} \]

for some $A_k \in \mathcal{E}$ and $a_k \geq 0$.

Note that some sources do not assume that $a_k \geq 0$, but assuming this makes our life easier.

It is obvious that

Proposition.

A function is simple iff it is measurable, non-negative, and takes

on only finitely-many values.

Definition (Integral of simple function). The integral of a simple function

\[ f = \sum_{k=1}^n a_k 1_{A_k} \]

is given by

\[ \mu(f) = \sum_{k=1}^n a_k \mu(A_k). \]

Note that it can be that $\mu(A_k) = \infty$, but $a_k = 0$. When this happens, we are just going to declare that $0 \cdot \infty = 0$ (this makes sense because this means we are ignoring all $0 \cdot 1_A$ terms for any $A$). After we do this, we can check the integral is well-defined.
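The convention $0 \cdot \infty = 0$ matters as soon as one tries to compute with this definition, because ordinary floating-point arithmetic evaluates `0.0 * math.inf` to `nan`. The following minimal sketch represents a simple function by a list of pairs $(a_k, \mu(A_k))$; this representation and the name `integrate_simple` are assumptions made for illustration.

```python
import math


def integrate_simple(terms: list[tuple[float, float]]) -> float:
    """Integral of sum_k a_k 1_{A_k}, given pairs (a_k, mu(A_k)) with a_k >= 0.

    Uses the convention 0 * infinity = 0.
    """
    total = 0.0
    for a, measure in terms:
        if a < 0 or measure < 0:
            raise ValueError("coefficients and measures must be non-negative")
        if a == 0 or measure == 0:
            continue  # convention: 0 * infinity = 0 (plain float arithmetic gives nan)
        total += a * measure
    return total


if __name__ == "__main__":
    # 2 * 1_{A_1} + 0 * 1_{A_2} with mu(A_1) = 1/2 and mu(A_2) = infinity.
    print(integrate_simple([(2.0, 0.5), (0.0, math.inf)]))
```

A term with positive coefficient on a set of infinite measure still makes the integral infinite, as the definition requires.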

We are now going to extend this definition to non-negative measurable

functions by a limiting procedure. Once we’ve done this, we are going to extend

the definition to measurable functions by linearity of the integral. Then we

would have a definition of the integral, and we are going to deduce properties of

the integral using approximation.

Definition (Integral). Let $f$ be a non-negative measurable function. We set

\[ \mu(f) = \sup\{\mu(g) : g \leq f, g \text{ is simple}\}. \]

For arbitrary $f$, we write

\[ f = f^+ - f^- = (f \vee 0) + (f \wedge 0). \]

We put $|f| = f^+ + f^-$. We say $f$ is integrable if $\mu(|f|) < \infty$. In this case, set

\[ \mu(f) = \mu(f^+) - \mu(f^-). \]

If exactly one of $\mu(f^+)$, $\mu(f^-)$ is finite, then we can still make the above definition, and the result will be infinite.

In the case where we are integrating over (a subset of) the reals, we call it

the Lebesgue integral.

Proposition. Let $f : [0, 1] \to \mathbb{R}$ be Riemann integrable. Then it is also Lebesgue integrable, and the two integrals agree.

We will not prove this, but this immediately gives us results like the funda-

mental theorem of calculus, and also helps us to actually compute the integral.

However, note that this does not hold for infinite domains, as you will see in the

second example sheet.

But the Lebesgue integrable functions are better. A lot of functions are

Lebesgue integrable but not Riemann integrable.

Example. Take the standard non-Riemann integrable function

\[ f = 1_{[0, 1] \setminus \mathbb{Q}}. \]

Then $f$ is not Riemann integrable, but it is Lebesgue integrable, since

\[ \mu(f) = \mu([0, 1] \setminus \mathbb{Q}) = 1. \]

We are now going to study some basic properties of the integral. We will first

look at the properties of integrals of simple functions, and then extend them to

general integrable functions.

For f, g simple, and α, β ≥ 0, we have that

µ(αf + βg) = αµ(f) + βµ(g).

So the integral is linear.

Another important property is monotonicity — if f ≤ g, then µ(f ) ≤ µ(g).

Finally, we have $f = 0$ a.e. iff $\mu(f) = 0$. It is absolutely crucial here that we are talking about non-negative functions.

Our goal is to show that these three properties are also satisfied for arbitrary

non-negative measurable functions, and the first two hold for integrable functions.

In order to achieve this, we prove a very important tool — the monotone

convergence theorem. Later, we will also learn about the dominated convergence

theorem and Fatou’s lemma. These are the main and very important results

about exchanging limits and integration.

Theorem (Monotone convergence theorem). Suppose that $(f_n), f$ are non-negative measurable with $f_n \nearrow f$. Then $\mu(f_n) \nearrow \mu(f)$.

In the proof we will use the fact that the integral is monotonic, which we shall prove later.

Proof. We will split the proof into four steps. We will prove each of the following in turn:

(i) If $f_n$ and $f$ are indicator functions, then the theorem holds.

(ii) If $f$ is an indicator function, then the theorem holds.

(iii) If $f$ is simple, then the theorem holds.

(iv) If $f$ is non-negative measurable, then the theorem holds.

Each part follows rather straightforwardly from the previous one, and the reader is encouraged to try to prove it themselves.

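We first consider the case where $f_n = 1_{A_n}$ and $f = 1_A$. Then $f_n \nearrow f$ is true iff $A_n \nearrow A$. On the other hand, $\mu(f_n) \nearrow \mu(f)$ iff $\mu(A_n) \nearrow \mu(A)$.

For convenience, we let $A_0 = \emptyset$. We can write

\[ \mu(A) = \mu\left(\bigcup_n A_n \setminus A_{n-1}\right) = \sum_{n=1}^\infty \mu(A_n \setminus A_{n-1}) = \lim_{N \to \infty} \sum_{n=1}^N \mu(A_n \setminus A_{n-1}) = \lim_{N \to \infty} \mu(A_N). \]

So done.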

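We next consider the case where $f = 1_A$ for some $A$. Fix $\varepsilon > 0$, and set

\[ A_n = \{f_n > 1 - \varepsilon\} \in \mathcal{E}. \]

Then we know that $A_n \nearrow A$, as $f_n \nearrow f$. Moreover, by definition, we have

\[ (1 - \varepsilon) 1_{A_n} \leq f_n \leq f = 1_A. \]

As $A_n \nearrow A$, we have that

\[ (1 - \varepsilon)\mu(f) = (1 - \varepsilon) \lim_{n \to \infty} \mu(A_n) \leq \lim_{n \to \infty} \mu(f_n) \leq \mu(f) \]

since $f_n \leq f$. Since $\varepsilon$ is arbitrary, we know that

\[ \lim_{n \to \infty} \mu(f_n) = \mu(f). \]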

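Next, we consider the case where $f$ is simple. We write

\[ f = \sum_{k=1}^m a_k 1_{A_k}, \]

where $a_k > 0$ and $A_k$ are pairwise disjoint. Since $f_n \nearrow f$, we know

\[ a_k^{-1} f_n 1_{A_k} \nearrow 1_{A_k}. \]

So we have

\[ \mu(f_n) = \sum_{k=1}^m \mu(f_n 1_{A_k}) = \sum_{k=1}^m a_k \mu(a_k^{-1} f_n 1_{A_k}) \to \sum_{k=1}^m a_k \mu(A_k) = \mu(f). \]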

Suppose $f$ is non-negative measurable. Suppose $g \leq f$ is a simple function. As $f_n \nearrow f$, we know $f_n \wedge g \nearrow f \wedge g = g$. So by the previous case, we know that

\[ \mu(f_n \wedge g) \to \mu(g). \]

We also know that

\[ \mu(f_n) \geq \mu(f_n \wedge g). \]

So we have

\[ \lim_{n \to \infty} \mu(f_n) \geq \mu(g) \]

for all simple $g \leq f$. This is possible only if

\[ \lim_{n \to \infty} \mu(f_n) \geq \mu(f) \]

by definition of the integral. However, we also know that $\mu(f_n) \leq \mu(f)$ for all $n$, again by definition of the integral. So we must have equality. So we have

\[ \mu(f) = \lim_{n \to \infty} \mu(f_n). \]

Theorem. Let f, g be non-negative measurable, and α, β ≥ 0. We have that

(i) µ(αf + βg) = αµ(f) + βµ(g).

(ii) f ≤ g implies µ(f ) ≤ µ(g).

(iii) f = 0 a.e. iff µ(f) = 0.

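Proof.

(i) Let

\[ f_n = 2^{-n} \lfloor 2^n f \rfloor \wedge n, \quad g_n = 2^{-n} \lfloor 2^n g \rfloor \wedge n. \]

Then $f_n, g_n$ are simple with $f_n \nearrow f$ and $g_n \nearrow g$. Hence $\mu(f_n) \nearrow \mu(f)$ and $\mu(g_n) \nearrow \mu(g)$ and $\mu(\alpha f_n + \beta g_n) \nearrow \mu(\alpha f + \beta g)$, by the monotone convergence theorem. As $f_n, g_n$ are simple, we have that

\[ \mu(\alpha f_n + \beta g_n) = \alpha \mu(f_n) + \beta \mu(g_n). \]

Taking the limit as $n \to \infty$, we get

\[ \mu(\alpha f + \beta g) = \alpha \mu(f) + \beta \mu(g). \]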

(ii)

We shall be careful not to use the monotone convergence theorem. We

have

µ(g) = sup{µ(h) : h ≤ g simple}

≥ sup{µ(h) : h ≤ f simple}

= µ(f).

(iii) Suppose $f$ is not $0$ a.e. Let
\[
A_n = \Big\{ x : f(x) > \frac{1}{n} \Big\}.
\]
Then
\[
\{x : f(x) \neq 0\} = \bigcup_{n} A_n.
\]
Since the left hand set has positive measure, it follows that there is some $A_n$ with positive measure. For that $n$, we define
\[
h = \frac{1}{n} \mathbf{1}_{A_n}.
\]
Then $\mu(f) \geq \mu(h) > 0$. So $\mu(f) \neq 0$.

Conversely, suppose $f = 0$ a.e. We let
\[
f_n = 2^{-n} \lfloor 2^n f \rfloor \wedge n
\]
be a simple function. Then $f_n \nearrow f$ and $f_n = 0$ a.e. So
\[
\mu(f) = \lim_{n \to \infty} \mu(f_n) = 0.
\]

We now prove the analogous statement for general integrable functions.

Theorem. Let f, g be integrable, and α, β ≥ 0. We have that

(i) µ(αf + βg) = αµ(f) + βµ(g).

(ii) f ≤ g implies µ(f ) ≤ µ(g).

(iii) f = 0 a.e. implies µ(f) = 0.

Note that in the last case, the converse is no longer true, as one can easily

see from the sign function sgn : [−1, 1] → R.

Proof.

(i) We are going to prove these by applying the previous theorem.

By definition of the integral, we have $\mu(-f) = -\mu(f)$. Also, if $\alpha \geq 0$, then
\[
\mu(\alpha f) = \mu(\alpha f^+) - \mu(\alpha f^-) = \alpha \mu(f^+) - \alpha \mu(f^-) = \alpha \mu(f).
\]
Combining these two properties, it then follows that if $\alpha$ is a real number, then
\[
\mu(\alpha f) = \alpha \mu(f).
\]
To finish the proof of (i), we have to show that $\mu(f + g) = \mu(f) + \mu(g)$. We know that this is true for non-negative functions, so we need to employ a little trick to make this a statement about the non-negative version. If we let $h = f + g$, then we can write this as
\[
h^+ - h^- = (f^+ - f^-) + (g^+ - g^-).
\]
We now rearrange this as
\[
h^+ + f^- + g^- = f^+ + g^+ + h^-.
\]
Now everything is non-negative measurable. So applying $\mu$ gives
\[
\mu(h^+) + \mu(f^-) + \mu(g^-) = \mu(f^+) + \mu(g^+) + \mu(h^-).
\]
Rearranging, we obtain
\[
\mu(h^+) - \mu(h^-) = \mu(f^+) - \mu(f^-) + \mu(g^+) - \mu(g^-).
\]
This is exactly the same thing as saying
\[
\mu(f + g) = \mu(h) = \mu(f) + \mu(g).
\]

(ii) If $f \leq g$, then $g - f \geq 0$. So $\mu(g - f) \geq 0$. By (i), we know $\mu(g) - \mu(f) \geq 0$. So $\mu(g) \geq \mu(f)$.

(iii) If $f = 0$ a.e., then $f^+, f^- = 0$ a.e. So $\mu(f^+) = \mu(f^-) = 0$. So $\mu(f) = \mu(f^+) - \mu(f^-) = 0$.

As mentioned, the converse to (iii) is no longer true. However, we do have

the following partial converse:

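Proposition. If $\mathcal{A}$ is a $\pi$-system with $E \in \mathcal{A}$ and $\sigma(\mathcal{A}) = \mathcal{E}$, and $f$ is an integrable function such that
\[
\mu(f \mathbf{1}_A) = 0
\]
for all $A \in \mathcal{A}$, then $f = 0$ a.e.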

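Proof. Let
\[
\mathcal{D} = \{A \in \mathcal{E} : \mu(f \mathbf{1}_A) = 0\}.
\]
It follows immediately from the properties of the integral that $\mathcal{D}$ is a d-system. So $\mathcal{D} = \mathcal{E}$ by Dynkin’s lemma. Let
\[
A^+ = \{x \in E : f(x) > 0\}, \quad A^- = \{x \in E : f(x) < 0\}.
\]
Then $A^\pm \in \mathcal{E}$, and
\[
\mu(f \mathbf{1}_{A^+}) = \mu(f \mathbf{1}_{A^-}) = 0.
\]
So $f \mathbf{1}_{A^+}$ and $f \mathbf{1}_{A^-}$ vanish a.e. So $f$ vanishes a.e.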

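Proposition. Suppose that $(g_n)$ is a sequence of non-negative measurable functions. Then we have
\[
\mu\Big(\sum_{n=1}^\infty g_n\Big) = \sum_{n=1}^\infty \mu(g_n).
\]

Proof. We know
\[
\sum_{n=1}^N g_n \nearrow \sum_{n=1}^\infty g_n
\]
as $N \to \infty$. So by the monotone convergence theorem, we have
\[
\sum_{n=1}^N \mu(g_n) = \mu\Big(\sum_{n=1}^N g_n\Big) \nearrow \mu\Big(\sum_{n=1}^\infty g_n\Big).
\]
But we also know that
\[
\sum_{n=1}^N \mu(g_n) \nearrow \sum_{n=1}^\infty \mu(g_n)
\]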

by definition. So we are done.

So for non-negative measurable functions, we can always switch the order of

integration and summation.

Note that we can consider summation as integration. We let $E = \mathbb{N}$ and $\mathcal{E} = \{\text{all subsets of } \mathbb{N}\}$. We let $\mu$ be the counting measure, so that $\mu(A)$ is the size of $A$. Then integrability (and having a finite integral) is the same as absolute convergence. If the sum converges absolutely, then we have
\[
\int f \, d\mu = \sum_{n=1}^\infty f(n).
\]

So we can just view our proposition as proving that we can swap the order of

two integrals. The general statement is known as Fubini’s theorem.
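The non-negativity assumption cannot simply be dropped. As a minimal sketch (the array `a` below is a standard illustrative choice and does not come from the lectures), take a(n, n) = 1, a(n, n + 1) = -1 and 0 elsewhere, viewed as a function on N × N with the counting measure on each factor. Every row and column has at most two non-zero entries, so the truncated inner sums below already equal the full inner sums.

```python
def a(n, m):
    """Hypothetical signed array: +1 on the diagonal, -1 just to the right of it."""
    if m == n:
        return 1
    if m == n + 1:
        return -1
    return 0

N = 50  # number of outer terms; an inner range of N + 1 covers every non-zero entry

# Sum over m first (row sums), then over n.
rows_first = sum(sum(a(n, m) for m in range(1, N + 2)) for n in range(1, N + 1))

# Sum over n first (column sums), then over m.
cols_first = sum(sum(a(n, m) for n in range(1, N + 2)) for m in range(1, N + 1))

print(rows_first, cols_first)
```

Every row sums to 0, while the first column sums to 1 and all other columns to 0, so the two iterated sums differ for every N. This is consistent with the proposition, since a takes both signs and is not absolutely summable.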

3.2 Integrals and limits

We are now going to prove more things about exchanging limits and integrals.

These are going to be extremely useful in the future, as we want to exchange

limits and integrals a lot.

Theorem (Fatou’s lemma). Let $(f_n)$ be a sequence of non-negative measurable functions. Then
\[
\mu(\liminf_n f_n) \leq \liminf_n \mu(f_n).
\]
Note that a special case was proven in the first example sheet, where we did it for the case where $f_n$ are indicator functions.

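Proof. We start with the trivial observation that if $k \geq n$, then we always have that
\[
\inf_{m \geq n} f_m \leq f_k.
\]
By the monotonicity of the integral, we know that
\[
\mu\Big(\inf_{m \geq n} f_m\Big) \leq \mu(f_k)
\]
for all $k \geq n$. So we have
\[
\mu\Big(\inf_{m \geq n} f_m\Big) \leq \inf_{k \geq n} \mu(f_k) \leq \liminf_m \mu(f_m).
\]
It remains to show that the left hand side converges to $\mu(\liminf_m f_m)$. Indeed, we know that
\[
\inf_{m \geq n} f_m \nearrow \liminf_m f_m.
\]
Then by monotone convergence, we have
\[
\mu\Big(\inf_{m \geq n} f_m\Big) \nearrow \mu\Big(\liminf_m f_m\Big).
\]
So we have
\[
\mu\Big(\liminf_m f_m\Big) \leq \liminf_m \mu(f_m).
\]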

No one ever remembers which direction Fatou’s lemma goes, and this leads to

many incorrect proofs and results, so it is helpful to keep the following example

in mind:

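Example. We let $(E, \mathcal{E}, \mu) = (\mathbb{R}, \mathcal{B}, \text{Lebesgue})$. We let
\[
f_n = \mathbf{1}_{[n, n+1]}.
\]
Then we have
\[
\liminf_n f_n = 0.
\]
So we have
\[
\mu(f_n) = 1 \text{ for all } n.
\]
So we have
\[
\liminf_n \mu(f_n) = 1, \quad \mu(\liminf_n f_n) = 0.
\]
So we have
\[
\mu\Big(\liminf_n f_n\Big) \leq \liminf_n \mu(f_n).
\]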

The next result we want to prove is the dominated convergence theorem.

This is like the monotone convergence theorem, but we are going to remove the

increasing and non-negative measurable condition, and add in something else.

Theorem (Dominated convergence theorem). Let $(f_n)$, $f$ be measurable with $f_n(x) \to f(x)$ for all $x \in E$. Suppose that there is an integrable function $g$ such that
\[
|f_n| \leq g
\]
for all $n$. Then we have
\[
\mu(f_n) \to \mu(f)
\]
as $n \to \infty$.

Proof. Note that
\[
|f| = \lim_{n \to \infty} |f_n| \leq g.
\]
So we know that
\[
\mu(|f|) \leq \mu(g) < \infty.
\]
So we know that $f, f_n$ are integrable.

Now note also that
\[
0 \leq g + f_n, \quad 0 \leq g - f_n
\]
for all $n$. We are now going to apply Fatou’s lemma twice with these sequences. We have that
\begin{align*}
\mu(g) + \mu(f) &= \mu(g + f) \\
&= \mu\Big(\liminf_n (g + f_n)\Big) \\
&\leq \liminf_n \mu(g + f_n) \\
&= \liminf_n (\mu(g) + \mu(f_n)) \\
&= \mu(g) + \liminf_n \mu(f_n).
\end{align*}
Since $\mu(g)$ is finite, we know that
\[
\mu(f) \leq \liminf_n \mu(f_n).
\]
We now do the same thing with $g - f_n$. We have
\begin{align*}
\mu(g) - \mu(f) &= \mu(g - f) \\
&= \mu\Big(\liminf_n (g - f_n)\Big) \\
&\leq \liminf_n \mu(g - f_n) \\
&= \liminf_n (\mu(g) - \mu(f_n)) \\
&= \mu(g) - \limsup_n \mu(f_n).
\end{align*}
Again, since $\mu(g)$ is finite, we know that
\[
\mu(f) \geq \limsup_n \mu(f_n).
\]
These combine to tell us that
\[
\mu(f) \leq \liminf_n \mu(f_n) \leq \limsup_n \mu(f_n) \leq \mu(f).
\]
So they must be all equal, and thus $\mu(f_n) \to \mu(f)$.

3.3 New measures from old

We have previously considered several ways of constructing measures from old

ones, such as the image measure. We are now going to study a few more ways of

constructing new measures, and see how integrals behave when we do these.

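Definition (Restriction of measure space). Let $(E, \mathcal{E}, \mu)$ be a measure space, and let $A \in \mathcal{E}$. The restriction of the measure space to $A$ is $(A, \mathcal{E}_A, \mu_A)$, where
\[
\mathcal{E}_A = \{B \in \mathcal{E} : B \subseteq A\},
\]
and $\mu_A$ is the restriction of $\mu$ to $\mathcal{E}_A$, i.e.
\[
\mu_A(B) = \mu(B)
\]
for all $B \in \mathcal{E}_A$.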

It is easy to check the following:

Lemma. For $(E, \mathcal{E}, \mu)$ a measure space and $A \in \mathcal{E}$, the restriction to $A$ is a measure space.

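Proposition. Let $(E, \mathcal{E}, \mu)$ and $(F, \mathcal{F}, \mu')$ be measure spaces and $A \in \mathcal{E}$. Let $f : E \to F$ be a measurable function. Then $f|_A$ is $\mathcal{E}_A$-measurable.

Proof. Let $B \in \mathcal{F}$. Then
\[
(f|_A)^{-1}(B) = f^{-1}(B) \cap A \in \mathcal{E}_A.
\]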

Similarly, we have

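Proposition. If $f$ is integrable, then $f|_A$ is $\mu_A$-integrable and $\mu_A(f|_A) = \mu(f \mathbf{1}_A)$.

Note that this means we have
\[
\mu(f \mathbf{1}_A) = \int_E f \mathbf{1}_A \, d\mu = \int_A f \, d\mu_A.
\]
Usually, we are lazy and just write
\[
\mu(f \mathbf{1}_A) = \int_A f \, d\mu.
\]
In the particular case of Lebesgue integration, if $A$ is an interval with left and right end points $a, b$ (i.e. it can be open, closed, half open or half closed), then we write
\[
\int_A f \, d\mu = \int_a^b f(x) \, dx.
\]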

There is another construction we would be interested in.

Definition (Pushforward/image of measure). Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces, and $f : E \to G$ a measurable function. If $\mu$ is a measure on $(E, \mathcal{E})$, then
\[
\nu = \mu \circ f^{-1}
\]
is a measure on $(G, \mathcal{G})$, known as the pushforward or image measure.

We have already seen this before, but we can apply this to integration as

follows:

Proposition. If g is a non-negative measurable function on G, then

ν(g) = µ(g ◦ f).

Proof. Exercise using the monotone class theorem (see example sheet).

Finally, we can specify a measure by specifying a density.

Definition (Density). Let $(E, \mathcal{E}, \mu)$ be a measure space, and $f$ be a non-negative measurable function. We define
\[
\nu(A) = \mu(f \mathbf{1}_A).
\]
Then $\nu$ is a measure on $(E, \mathcal{E})$.

Proposition. The ν defined above is indeed a measure.

Proof.

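(i) $\nu(\emptyset) = \mu(f \mathbf{1}_\emptyset) = \mu(0) = 0$.

(ii) If $(A_n)$ is a disjoint sequence in $\mathcal{E}$, then
\[
\nu\Big(\bigcup_n A_n\Big) = \mu\big(f \mathbf{1}_{\bigcup_n A_n}\big) = \mu\Big(f \sum_n \mathbf{1}_{A_n}\Big) = \sum_n \mu(f \mathbf{1}_{A_n}) = \sum_n \nu(A_n).
\]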

Definition (Density). Let $X$ be a random variable. We say $X$ has a density if its law $\mu_X$ has a density with respect to the Lebesgue measure. In other words, there exists $f_X$ non-negative measurable so that
\[
\mu_X(A) = \mathbb{P}[X \in A] = \int_A f_X(x) \, dx.
\]
In this case, for any non-negative measurable function $g$, we have that
\[
\mathbb{E}[g(X)] = \int_{\mathbb{R}} g(x) f_X(x) \, dx.
\]
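The formula for E[g(X)] can be checked numerically in a concrete case. The sketch below is only illustrative: the choice of an Exponential(1) density, the function g(x) = x^2, the truncation point `UPPER` and the helper `midpoint_integral` are all assumptions made for this example, not part of the notes.

```python
import math

def midpoint_integral(func, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

def density(x):
    """Density of an Exponential(1) random variable (illustrative choice)."""
    return math.exp(-x) if x >= 0 else 0.0

def g(x):
    """Non-negative measurable test function."""
    return x * x

UPPER = 50.0  # truncation point; the tail beyond it is negligible for this density

total_mass = midpoint_integral(density, 0.0, UPPER)
second_moment = midpoint_integral(lambda x: g(x) * density(x), 0.0, UPPER)
print(total_mass, second_moment)
```

The first value approximates the total mass 1 of the law, and the second approximates E[X^2], which equals 2 for this distribution.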

3.4 Integration and differentiation

In “normal” calculus, we had three results involving both integration and dif-

ferentiation. One was the fundamental theorem of calculus, which we already

stated. The others are the change of variables formula, and differentiating under

the integral sign.

We start by proving the change of variables formula.

Proposition (Change of variables formula). Let $\varphi : [a, b] \to \mathbb{R}$ be continuously differentiable and increasing. Then for any bounded Borel function $g$, we have
\[
\int_{\varphi(a)}^{\varphi(b)} g(y) \, dy = \int_a^b g(\varphi(x)) \varphi'(x) \, dx. \tag{$*$}
\]

We will use the monotone class theorem.

Proof. We let
\[
V = \{\text{Borel functions } g \text{ such that } (*) \text{ holds}\}.
\]
We will want to use the monotone class theorem to show that this includes all bounded Borel functions.

We already know that

(i) $V$ contains $\mathbf{1}_A$ for all $A$ in the $\pi$-system of intervals of the form $[u, v] \subseteq [\varphi(a), \varphi(b)]$. This is just the fundamental theorem of calculus.

(ii) By linearity of the integral, $V$ is indeed a vector space.

(iii) Finally, let $(g_n)$ be a sequence in $V$, and $g_n \geq 0$, $g_n \nearrow g$. Then we know that
\[
\int_{\varphi(a)}^{\varphi(b)} g_n(y) \, dy = \int_a^b g_n(\varphi(x)) \varphi'(x) \, dx.
\]
By the monotone convergence theorem, these converge to
\[
\int_{\varphi(a)}^{\varphi(b)} g(y) \, dy = \int_a^b g(\varphi(x)) \varphi'(x) \, dx.
\]

Then by the monotone class theorem, $V$ contains all bounded Borel functions.

The next problem is differentiation under the integral sign. We want to know when we can say
\[
\frac{d}{dt} \int f(x, t) \, dx = \int \frac{\partial f}{\partial t}(x, t) \, dx.
\]

Theorem (Differentiation under the integral sign). Let $(E, \mathcal{E}, \mu)$ be a measure space, and $U \subseteq \mathbb{R}$ be an open set, and $f : U \times E \to \mathbb{R}$. We assume that

(i) For any $t \in U$ fixed, the map $x \mapsto f(t, x)$ is integrable;

(ii) For any $x \in E$ fixed, the map $t \mapsto f(t, x)$ is differentiable;

(iii) There exists an integrable function $g$ such that
\[
\left| \frac{\partial f}{\partial t}(t, x) \right| \leq g(x)
\]
for all $x \in E$ and $t \in U$.

Then the map
\[
x \mapsto \frac{\partial f}{\partial t}(t, x)
\]
is integrable for all $t$, and also the function
\[
F(t) = \int_E f(t, x) \, d\mu
\]
is differentiable, and
\[
F'(t) = \int_E \frac{\partial f}{\partial t}(t, x) \, d\mu.
\]

The reason why we want the derivative to be bounded is that we want to

apply the dominated convergence theorem.

Proof. Measurability of the derivative follows from the fact that it is a limit of measurable functions, and then integrability follows since it is bounded by $g$.

Suppose $(h_n)$ is a sequence of non-zero real numbers with $h_n \to 0$ (so that $t + h_n \in U$ for $n$ large enough). Then let
\[
g_n(x) = \frac{f(t + h_n, x) - f(t, x)}{h_n} - \frac{\partial f}{\partial t}(t, x).
\]
Since $f$ is differentiable in $t$, we know that $g_n(x) \to 0$ as $n \to \infty$. Moreover, by the mean value theorem, we know that
\[
|g_n(x)| \leq 2g(x).
\]
On the other hand, by definition of $F(t)$, we have
\[
\frac{F(t + h_n) - F(t)}{h_n} - \int_E \frac{\partial f}{\partial t}(t, x) \, d\mu = \int_E g_n(x) \, d\mu.
\]
By dominated convergence, we know the RHS tends to 0. So we know
\[
\lim_{n \to \infty} \frac{F(t + h_n) - F(t)}{h_n} = \int_E \frac{\partial f}{\partial t}(t, x) \, d\mu.
\]
Since $(h_n)$ was arbitrary, it follows that $F'(t)$ exists and is equal to the integral.
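As a numerical sanity check of the theorem, the sketch below takes the illustrative choice f(t, x) = sin(tx) on E = [0, 1] with Lebesgue measure, for which |∂f/∂t| = |x cos(tx)| ≤ 1 is dominated by an integrable function. The function names, the midpoint rule and the step sizes are assumptions of this example only.

```python
import math

def midpoint_integral(func, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

def F(t):
    """F(t) = integral over [0, 1] of sin(t x) dx."""
    return midpoint_integral(lambda x: math.sin(t * x), 0.0, 1.0)

def integral_of_derivative(t):
    """Integral over [0, 1] of d/dt sin(t x) = x cos(t x)."""
    return midpoint_integral(lambda x: x * math.cos(t * x), 0.0, 1.0)

t, step = 1.3, 1e-5
# Central difference quotient approximating F'(t).
difference_quotient = (F(t + step) - F(t - step)) / (2 * step)
# Exact derivative of F(t) = (1 - cos t) / t, for comparison.
closed_form = (t * math.sin(t) - 1 + math.cos(t)) / t**2
print(difference_quotient, integral_of_derivative(t), closed_form)
```

The three printed numbers agree to several decimal places, as the theorem predicts for this dominated family.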

3.5 Product measures and Fubini’s theorem

Recall the following definition of the product σ-algebra.

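Definition (Product $\sigma$-algebra). Let $(E_1, \mathcal{E}_1, \mu_1)$ and $(E_2, \mathcal{E}_2, \mu_2)$ be finite measure spaces. We let
\[
\mathcal{A} = \{A_1 \times A_2 : A_1 \in \mathcal{E}_1, A_2 \in \mathcal{E}_2\}.
\]
Then $\mathcal{A}$ is a $\pi$-system on $E_1 \times E_2$. The product $\sigma$-algebra is
\[
\mathcal{E} = \mathcal{E}_1 \otimes \mathcal{E}_2 = \sigma(\mathcal{A}).
\]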

We now want to construct a measure on the product $\sigma$-algebra. We can, of course, just apply the Caratheodory extension theorem, but we would want a more explicit description of the integral. The idea is to define, for $A \in \mathcal{E}_1 \otimes \mathcal{E}_2$,
\[
\mu(A) = \int_{E_1} \left( \int_{E_2} \mathbf{1}_A(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1).
\]

Doing this has the advantage that it would help us in a step of proving Fubini’s

theorem.

However, before we can make this definition, we need to do some preparation

to make sure the above statement actually makes sense:

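Lemma. Let $E = E_1 \times E_2$ be equipped with the product $\sigma$-algebra $\mathcal{E} = \mathcal{E}_1 \otimes \mathcal{E}_2$. Suppose $f : E \to \mathbb{R}$ is an $\mathcal{E}$-measurable function. Then

(i) For each $x_2 \in E_2$, the function $x_1 \mapsto f(x_1, x_2)$ is $\mathcal{E}_1$-measurable.

(ii) If $f$ is bounded or non-negative measurable, then
\[
f_2(x_2) = \int_{E_1} f(x_1, x_2) \, \mu_1(dx_1)
\]
is $\mathcal{E}_2$-measurable.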

Proof. The first part follows immediately from the fact that for a fixed $x_2$, the map $\iota_1 : E_1 \to E$ given by $\iota_1(x_1) = (x_1, x_2)$ is measurable, and that the composition of measurable functions is measurable.

For the second part, we use the monotone class theorem. We let $V$ be the set of all measurable functions $f$ such that
\[
x_2 \mapsto \int_{E_1} f(x_1, x_2) \, \mu_1(dx_1)
\]
is $\mathcal{E}_2$-measurable.

(i) It is clear that $\mathbf{1}_E, \mathbf{1}_A \in V$ for all $A \in \mathcal{A}$ (where $\mathcal{A}$ is as in the definition of the product $\sigma$-algebra).

(ii) $V$ is a vector space by linearity of the integral.

(iii) Suppose $(f_n)$ is a non-negative sequence in $V$ and $f_n \nearrow f$. Then
\[
\left( x_2 \mapsto \int_{E_1} f_n(x_1, x_2) \, \mu_1(dx_1) \right) \nearrow \left( x_2 \mapsto \int_{E_1} f(x_1, x_2) \, \mu_1(dx_1) \right)
\]
by the monotone convergence theorem. So $f \in V$.

So the monotone class theorem tells us $V$ contains all bounded measurable functions.

Now if $f$ is a general non-negative measurable function, then $f \wedge n$ is bounded and measurable, hence $f \wedge n \in V$. Therefore $f \in V$ by the monotone convergence theorem.

Theorem. There exists a unique measure $\mu = \mu_1 \otimes \mu_2$ on $\mathcal{E}_1 \otimes \mathcal{E}_2$ such that
\[
\mu(A_1 \times A_2) = \mu_1(A_1) \mu_2(A_2)
\]
for all $A_1 \times A_2 \in \mathcal{A}$.

Here it is crucial that the measure spaces are finite. Actually, everything still works for $\sigma$-finite measure spaces, as we can just reduce to the finite case. However, things start to go wrong if we don’t have $\sigma$-finite measure spaces.

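Proof. One might be tempted to just apply the Caratheodory extension theorem, but we have a more direct way of doing it here, by using integrals. We define
\[
\mu(A) = \int_{E_1} \left( \int_{E_2} \mathbf{1}_A(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1).
\]
Here the previous lemma is very important. It tells us that these integrals actually make sense!

We first check that this is a measure:

(i) $\mu(\emptyset) = 0$ is immediate since $\mathbf{1}_\emptyset = 0$.

(ii) Suppose $(A_n)$ is a disjoint sequence and $A = \bigcup_n A_n$. Then we have
\begin{align*}
\mu(A) &= \int_{E_1} \left( \int_{E_2} \mathbf{1}_A(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \\
&= \int_{E_1} \left( \int_{E_2} \sum_n \mathbf{1}_{A_n}(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1).
\end{align*}
We now use the fact that integration commutes with the sum of non-negative measurable functions to get
\begin{align*}
&= \int_{E_1} \left( \sum_n \int_{E_2} \mathbf{1}_{A_n}(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \\
&= \sum_n \int_{E_1} \left( \int_{E_2} \mathbf{1}_{A_n}(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \\
&= \sum_n \mu(A_n).
\end{align*}

So we have a working measure, and it clearly satisfies
\[
\mu(A_1 \times A_2) = \mu_1(A_1) \mu_2(A_2).
\]
Uniqueness follows because $\mu$ is finite, and is thus characterized by its values on the $\pi$-system $\mathcal{A}$ that generates $\mathcal{E}$.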

Exercise. Show the non-uniqueness of the product of the Lebesgue measure on $[0, 1]$ and the counting measure on $[0, 1]$.

Note that we could as well have defined the measure as
\[
\mu(A) = \int_{E_2} \left( \int_{E_1} \mathbf{1}_A(x_1, x_2) \, \mu_1(dx_1) \right) \mu_2(dx_2).
\]

The same proof would go through, so we have another measure on the space.

However, by uniqueness, we know they must be the same! Fubini’s theorem

generalizes this to arbitrary functions.

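Theorem (Fubini’s theorem).

(i) If $f$ is non-negative measurable, then
\[
\mu(f) = \int_{E_1} \left( \int_{E_2} f(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1). \tag{$*$}
\]
In particular, we have
\[
\int_{E_1} \left( \int_{E_2} f(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) = \int_{E_2} \left( \int_{E_1} f(x_1, x_2) \, \mu_1(dx_1) \right) \mu_2(dx_2).
\]
This is sometimes known as Tonelli’s theorem.

(ii) If $f$ is integrable, and
\[
A = \left\{ x_1 \in E_1 : \int_{E_2} |f(x_1, x_2)| \, \mu_2(dx_2) < \infty \right\},
\]
then
\[
\mu_1(E_1 \setminus A) = 0.
\]
If we set
\[
f_1(x_1) = \begin{cases} \int_{E_2} f(x_1, x_2) \, \mu_2(dx_2) & x_1 \in A \\ 0 & x_1 \notin A \end{cases},
\]
then $f_1$ is a $\mu_1$-integrable function and
\[
\mu_1(f_1) = \mu(f).
\]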

Proof.

(i) Let $V$ be the set of all measurable functions such that $(*)$ holds. Then $V$ is a vector space since integration is linear.

(a) By definition of $\mu$, we know $\mathbf{1}_E$ and $\mathbf{1}_A$ are in $V$ for all $A \in \mathcal{A}$.

(b) The monotone convergence theorem on both sides tells us that $V$ is closed under monotone limits of the form $f_n \nearrow f$, $f_n \geq 0$.

By the monotone class theorem, we know $V$ contains all bounded measurable functions. If $f$ is non-negative measurable, then $(f \wedge n) \in V$, and monotone convergence for $f \wedge n \nearrow f$ gives that $f \in V$.

(ii) Assume that $f$ is $\mu$-integrable. Then
\[
x_1 \mapsto \int_{E_2} |f(x_1, x_2)| \, \mu_2(dx_2)
\]
is $\mathcal{E}_1$-measurable, and, by (i), is $\mu_1$-integrable. So $A$, being the inverse image of $[0, \infty)$ under that map, lies in $\mathcal{E}_1$. Moreover, $\mu_1(E_1 \setminus A) = 0$ because integrable functions can only be infinite on sets of measure 0.

We set
\[
f_1^+(x_1) = \int_{E_2} f^+(x_1, x_2) \, \mu_2(dx_2), \quad f_1^-(x_1) = \int_{E_2} f^-(x_1, x_2) \, \mu_2(dx_2).
\]
Then we have
\[
f_1 = (f_1^+ - f_1^-) \mathbf{1}_A.
\]
So the result follows since
\[
\mu(f) = \mu(f^+) - \mu(f^-) = \mu_1(f_1^+) - \mu_1(f_1^-) = \mu_1(f_1)
\]
by (i).

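Since $\mathbb{R}$ is $\sigma$-finite, we know that we can sensibly talk about the $d$-fold product of the Lebesgue measure on $\mathbb{R}$ to obtain the Lebesgue measure on $\mathbb{R}^d$.

What $\sigma$-algebra is the Lebesgue measure on $\mathbb{R}^d$ defined on? We know the Lebesgue measure on $\mathbb{R}$ is defined on $\mathcal{B}$. So the Lebesgue measure on $\mathbb{R}^d$ is defined on
\[
\mathcal{B} \otimes \cdots \otimes \mathcal{B} = \sigma(B_1 \times \cdots \times B_d : B_i \in \mathcal{B}).
\]
By looking at the definition of the product topology, we see that this is just the Borel $\sigma$-algebra on $\mathbb{R}^d$.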

Recall that when we constructed the Lebesgue measure, the Caratheodory extension theorem yields a measure on the “Lebesgue $\sigma$-algebra” $\mathcal{M}$, which was strictly bigger than the Borel $\sigma$-algebra. It was shown in the first example sheet that $\mathcal{M}$ is complete, i.e. if we have $A \subseteq B \subseteq \mathbb{R}$ with $B \in \mathcal{M}$, $\mu(B) = 0$, then $A \in \mathcal{M}$. We can also take the Lebesgue measure on $\mathbb{R}^d$ to be defined on $\mathcal{M} \otimes \cdots \otimes \mathcal{M}$. However, it happens that $\mathcal{M} \otimes \mathcal{M}$ together with the Lebesgue measure on $\mathbb{R}^2$ is no longer complete (proof is left as an exercise for the reader).

We now turn to probability. Recall that random variables $X_1, \cdots, X_n$ are independent iff the $\sigma$-algebras $\sigma(X_1), \cdots, \sigma(X_n)$ are independent. We will show

that random variables are independent iff their laws are given by the product

measure.

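Proposition. Let $X_1, \cdots, X_n$ be random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ with values in $(E_1, \mathcal{E}_1), \cdots, (E_n, \mathcal{E}_n)$ respectively. We define
\[
E = E_1 \times \cdots \times E_n, \quad \mathcal{E} = \mathcal{E}_1 \otimes \cdots \otimes \mathcal{E}_n.
\]
Then $X = (X_1, \cdots, X_n)$ is $\mathcal{E}$-measurable and the following are equivalent:

(i) $X_1, \cdots, X_n$ are independent.

(ii) $\mu_X = \mu_{X_1} \otimes \cdots \otimes \mu_{X_n}$.

(iii) For any $f_1, \cdots, f_n$ bounded and measurable, we have
\[
\mathbb{E}\left[ \prod_{k=1}^n f_k(X_k) \right] = \prod_{k=1}^n \mathbb{E}[f_k(X_k)].
\]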

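Proof.

– (i) ⇒ (ii): Let $\nu = \mu_{X_1} \otimes \cdots \otimes \mu_{X_n}$. We want to show that $\nu = \mu_X$. To do so, we just have to check that they agree on a $\pi$-system generating the entire $\sigma$-algebra. We let
\[
\mathcal{A} = \{A_1 \times \cdots \times A_n : A_1 \in \mathcal{E}_1, \cdots, A_n \in \mathcal{E}_n\}.
\]
Then $\mathcal{A}$ is a generating $\pi$-system of $\mathcal{E}$. Moreover, if $A = A_1 \times \cdots \times A_n \in \mathcal{A}$, then we have
\begin{align*}
\mu_X(A) &= \mathbb{P}[X \in A] \\
&= \mathbb{P}[X_1 \in A_1, \cdots, X_n \in A_n].
\end{align*}
By independence, we have
\begin{align*}
&= \prod_{k=1}^n \mathbb{P}[X_k \in A_k] \\
&= \nu(A).
\end{align*}
So we know that $\mu_X = \nu = \mu_{X_1} \otimes \cdots \otimes \mu_{X_n}$ on $\mathcal{E}$.

– (ii) ⇒ (iii): By assumption, we can evaluate the expectation
\begin{align*}
\mathbb{E}\left[ \prod_{k=1}^n f_k(X_k) \right] &= \int_E \prod_{k=1}^n f_k(x_k) \, \mu_X(dx) \\
&= \prod_{k=1}^n \int_{E_k} f_k(x_k) \, \mu_{X_k}(dx_k) \\
&= \prod_{k=1}^n \mathbb{E}[f_k(X_k)].
\end{align*}
Here in the middle we have used Fubini’s theorem.

– (iii) ⇒ (i): Take $f_k = \mathbf{1}_{A_k}$ for $A_k \in \mathcal{E}_k$. Then we have
\begin{align*}
\mathbb{P}[X_1 \in A_1, \cdots, X_n \in A_n] &= \mathbb{E}\left[ \prod_{k=1}^n \mathbf{1}_{A_k}(X_k) \right] \\
&= \prod_{k=1}^n \mathbb{E}[\mathbf{1}_{A_k}(X_k)] \\
&= \prod_{k=1}^n \mathbb{P}[X_k \in A_k].
\end{align*}
So $X_1, \cdots, X_n$ are independent.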
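The equivalence can be seen concretely on a finite space. The sketch below is an illustration under assumptions of our own choosing: two die rolls with uniform marginals, once with the product law (independent rolls) and once with the law concentrated on the diagonal (the second roll copies the first). Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import product

values = range(1, 7)
marginal = {v: Fraction(1, 6) for v in values}

# Joint law of two independent rolls: the product measure.
product_law = {(x, y): marginal[x] * marginal[y] for x, y in product(values, values)}
# Joint law when the second roll copies the first: same marginals, not a product.
diagonal_law = {(x, y): (Fraction(1, 6) if x == y else Fraction(0))
                for x, y in product(values, values)}

def expect_joint(law, f, g):
    """E[f(X) g(Y)] under the given joint law."""
    return sum(p * f(x) * g(y) for (x, y), p in law.items())

def expect(f):
    """E[f(X)] under the common uniform marginal."""
    return sum(p * f(x) for x, p in marginal.items())

f = lambda x: x * x                     # bounded on the finite state space
g = lambda y: 1 if y % 2 == 0 else 0    # indicator of an even roll

print(expect_joint(product_law, f, g), expect(f) * expect(g))
print(expect_joint(diagonal_law, f, g))
```

Under the product law the expectation factorises, as in (iii); under the diagonal law it does not, even though the marginals are identical.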

4 Inequalities and $L^p$ spaces

Eventually, we will want to define the $L^p$ spaces as follows:

Definition ($L^p$ spaces). Let $(E, \mathcal{E}, \mu)$ be a measure space. For $1 \leq p < \infty$, we define $L^p = L^p(E, \mathcal{E}, \mu)$ to be the set of all measurable functions $f$ such that
\[
\|f\|_p = \left( \int |f|^p \, d\mu \right)^{1/p} < \infty.
\]
For $p = \infty$, we let $L^\infty = L^\infty(E, \mathcal{E}, \mu)$ be the space of functions with
\[
\|f\|_\infty = \inf\{\lambda \geq 0 : |f| \leq \lambda \text{ a.e.}\} < \infty.
\]

However, it is not clear that this is a norm. First of all, $\|f\|_p = 0$ does not imply that $f = 0$. It only means that $f = 0$ a.e. But this is easy to solve. We simply quotient out the vector space by functions that differ on a set of measure zero. The more serious problem is that we don’t know how to prove the triangle inequality.

To do so, we are going to prove some inequalities. Apart from enabling us to show that $\| \cdot \|_p$ is indeed a norm, they will also be very helpful in the future when we want to bound integrals.

4.1 Four inequalities

The four inequalities we are going to prove are the following:

(i) Chebyshev/Markov inequality

(ii) Jensen’s inequality

(iii) Hölder’s inequality

(iv) Minkowski’s inequality.

So let’s start proving the inequalities.

Proposition (Chebyshev’s/Markov’s inequality). Let $f$ be non-negative measurable and $\lambda > 0$. Then
\[
\mu(\{f \geq \lambda\}) \leq \frac{1}{\lambda} \mu(f).
\]

This is often used when this is a probability measure, so that we are bounding

the probability that a random variable is big.

The proof is essentially one line.

Proof. We write
\[
f \geq f \mathbf{1}_{f \geq \lambda} \geq \lambda \mathbf{1}_{f \geq \lambda}.
\]

Taking µ gives the desired answer.

This is incredibly simple, but also incredibly useful!
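To see the inequality in action, the following sketch checks it exactly on a small finite measure space. The masses, the function values and the levels are arbitrary illustrative choices, not taken from the notes.

```python
from fractions import Fraction

# A finite measure space with five points; the masses need not sum to 1.
mass = [Fraction(1, 2), Fraction(1, 4), Fraction(3), Fraction(1, 8), Fraction(2)]
# A non-negative function on those five points.
f = [Fraction(0), Fraction(5), Fraction(1, 3), Fraction(7), Fraction(2)]

def integral(values):
    """mu(values): the integral of a function against the point masses."""
    return sum(m * v for m, v in zip(mass, values))

def measure_of_superlevel(level):
    """mu({f >= level})."""
    return sum(m for m, v in zip(mass, f) if v >= level)

levels = [Fraction(1, 10), Fraction(1), Fraction(2), Fraction(5), Fraction(10)]
checks = [measure_of_superlevel(lam) <= integral(f) / lam for lam in levels]
print(integral(f), checks)
```

The bound holds at every level, and it becomes informative only once the level is large compared with the integral of f.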

The next inequality is Jensen’s inequality. To state it, we need to know what

a convex function is.

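Definition (Convex function). Let $I \subseteq \mathbb{R}$ be an interval. Then $c : I \to \mathbb{R}$ is convex if for any $t \in [0, 1]$ and $x, y \in I$, we have
\[
c(tx + (1 - t)y) \leq tc(x) + (1 - t)c(y).
\]
[Figure: graph of a convex function $c$; at the point $(1 - t)x + ty$ the chord value $(1 - t)c(x) + tc(y)$ lies above the graph.]

Note that if $c$ is twice differentiable, then this is equivalent to $c'' \geq 0$.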

Proposition (Jensen’s inequality). Let $X$ be an integrable random variable with values in $I$. If $c : I \to \mathbb{R}$ is convex, then we have
\[
\mathbb{E}[c(X)] \geq c(\mathbb{E}[X]).
\]

It is crucial that this only applies to a probability space. We need the total

mass of the measure space to be 1 for it to work. Just being finite is not enough.

Jensen’s inequality will be an easy consequence of the following lemma:

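Lemma. If $c : I \to \mathbb{R}$ is a convex function and $m$ is in the interior of $I$, then there exist real numbers $a, b$ such that
\[
c(x) \geq ax + b
\]
for all $x \in I$, with equality at $x = m$.

[Figure: graph of a convex function lying above the line $ax + b$, which touches it at $x = m$.]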

If the function is differentiable, then we can easily extract this from the

derivative. However, if it is not, then we need to be more careful.

Proof. If $c$ is smooth, then we know $c'' \geq 0$, and thus $c'$ is non-decreasing. We are going to show an analogous statement that does not mention the word “derivative”. Consider $x < m < y$ with $x, y, m \in I$. We want to show that
\[
\frac{c(m) - c(x)}{m - x} \leq \frac{c(y) - c(m)}{y - m}.
\]
To show this, we turn off our brains and do the only thing we can do. We can write
\[
m = tx + (1 - t)y
\]
for some $t$. Then convexity tells us
\[
c(m) \leq tc(x) + (1 - t)c(y).
\]
Writing $c(m) = tc(m) + (1 - t)c(m)$, this tells us
\[
t(c(m) - c(x)) \leq (1 - t)(c(y) - c(m)).
\]
To conclude, we simply have to compute the actual value of $t$ and plug it in. We have
\[
t = \frac{y - m}{y - x}, \quad 1 - t = \frac{m - x}{y - x}.
\]
So we obtain
\[
\frac{y - m}{y - x}(c(m) - c(x)) \leq \frac{m - x}{y - x}(c(y) - c(m)).
\]
Cancelling the $y - x$ and dividing by the factors gives the desired result.

Now since $x$ and $y$ are arbitrary, we know there is some $a \in \mathbb{R}$ such that
\[
\frac{c(m) - c(x)}{m - x} \leq a \leq \frac{c(y) - c(m)}{y - m}
\]
for all $x < m < y$. If we rearrange, then we obtain
\[
c(t) \geq a(t - m) + c(m)
\]
for all $t \in I$.

Proof of Jensen’s inequality.

To apply the previous result, we need to pick a right $m$. We take
\[
m = \mathbb{E}[X].
\]
To apply this, we need to know that $m$ is in the interior of $I$. So we assume that $X$ is not a.s. constant (that case is boring). By the lemma, we can find some $a, b \in \mathbb{R}$ such that
\[
c(X) \geq aX + b.
\]
We want to take the expectation of the LHS, but we have to make sure that $\mathbb{E}[c(X)]$ is a sensible thing to talk about. To make sure it makes sense, we show that $\mathbb{E}[c(X)^-] = \mathbb{E}[(-c(X)) \vee 0]$ is finite.

We simply bound
\[
[c(X)]^- = [-c(X)] \vee 0 \leq |a||X| + |b|.
\]
So we have
\[
\mathbb{E}[c(X)^-] \leq |a| \mathbb{E}|X| + |b| < \infty
\]
since $X$ is integrable. So $\mathbb{E}[c(X)]$ makes sense.

We then just take

E[c(X)] ≥ E[aX + b] = aE[X] + b = am + b = c(m) = c(E[X]).

So done.
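Both the inequality and the need for total mass 1 can be checked on a two-point space. In the sketch below, the values, the weights and the convex function c(x) = x^2 are illustrative assumptions; the second computation replaces the probability measure by twice itself, which is finite but not a probability measure.

```python
from fractions import Fraction

values = [Fraction(1), Fraction(2)]
probs = [Fraction(1, 2), Fraction(1, 2)]   # a probability measure
doubled = [2 * p for p in probs]           # finite measure of total mass 2

def c(x):
    """A convex function."""
    return x * x

def integrate(weights, func):
    """Integral of func(X) against the given point weights."""
    return sum(w * func(x) for w, x in zip(weights, values))

identity = lambda x: x

# Jensen's inequality says this gap is non-negative for a probability measure.
jensen_gap = integrate(probs, c) - c(integrate(probs, identity))
# The same expression for a measure of total mass 2 can be negative.
doubled_gap = integrate(doubled, c) - c(integrate(doubled, identity))
print(jensen_gap, doubled_gap)
```

The first gap is positive, as Jensen’s inequality guarantees, while the second is negative, matching the remark that finiteness of the measure alone is not enough.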

We are now going to use Jensen’s inequality to prove Hölder’s inequality. Before that, we take note of the following definition:

Definition (Conjugate). Let $p, q \in [1, \infty]$. We say that they are conjugate if
\[
\frac{1}{p} + \frac{1}{q} = 1,
\]
where we take $1/\infty = 0$.

Proposition (Hölder’s inequality). Let $p, q \in (1, \infty)$ be conjugate. Then for $f, g$ measurable, we have
\[
\mu(|fg|) = \|fg\|_1 \leq \|f\|_p \|g\|_q.
\]
When $p = q = 2$, then this is the Cauchy-Schwarz inequality.

We will provide two different proofs.

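Proof. We assume that $\|f\|_p > 0$ and $\|f\|_p < \infty$. Otherwise, there is nothing to prove. By scaling, we may assume that $\|f\|_p = 1$. We make up a probability measure by
\[
\mathbb{P}[A] = \int_A |f|^p \, d\mu.
\]
Since we know
\[
\|f\|_p = \left( \int |f|^p \, d\mu \right)^{1/p} = 1,
\]
we know $\mathbb{P}[\,\cdot\,]$ is a probability measure. Then we have
\begin{align*}
\mu(|fg|) &= \mu(|fg| \mathbf{1}_{\{|f| > 0\}}) \\
&= \mu\left( \frac{|g|}{|f|^{p-1}} \mathbf{1}_{\{|f| > 0\}} |f|^p \right) \\
&= \mathbb{E}\left[ \frac{|g|}{|f|^{p-1}} \mathbf{1}_{\{|f| > 0\}} \right].
\end{align*}
Now use the fact that $(\mathbb{E}|X|)^q \leq \mathbb{E}[|X|^q]$ since $x \mapsto x^q$ is convex for $q > 1$. Then we obtain
\[
\mu(|fg|) \leq \left( \mathbb{E}\left[ \frac{|g|^q}{|f|^{(p-1)q}} \mathbf{1}_{\{|f| > 0\}} \right] \right)^{1/q}.
\]
The key realization now is that $\frac{1}{q} + \frac{1}{p} = 1$ means that $q(p - 1) = p$. So this becomes
\[
\mathbb{E}\left[ \frac{|g|^q}{|f|^p} \mathbf{1}_{\{|f| > 0\}} \right]^{1/q} = \mu(|g|^q \mathbf{1}_{\{|f| > 0\}})^{1/q} \leq \mu(|g|^q)^{1/q} = \|g\|_q.
\]
Using the fact that $\|f\|_p = 1$, we obtain the desired result.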

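Alternative proof. We wlog $0 < \|f\|_p, \|g\|_q < \infty$, or else there is nothing to prove. By scaling, we wlog $\|f\|_p = \|g\|_q = 1$. Then we have to show that
\[
\int |f||g| \, d\mu \leq 1.
\]
To do so, we notice if $\frac{1}{p} + \frac{1}{q} = 1$, then the concavity of $\log$ tells us for any $a, b > 0$, we have
\[
\frac{1}{p} \log a + \frac{1}{q} \log b \leq \log\left( \frac{a}{p} + \frac{b}{q} \right).
\]
Replacing $a$ with $a^p$; $b$ with $b^q$ and then taking exponentials tells us
\[
ab \leq \frac{a^p}{p} + \frac{b^q}{q}.
\]
While we assumed $a, b > 0$ when deriving, we observe that it is also valid when some of them are zero. So we have
\[
\int |f||g| \, d\mu \leq \int \left( \frac{|f|^p}{p} + \frac{|g|^q}{q} \right) d\mu = \frac{1}{p} + \frac{1}{q} = 1.
\]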
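Both the pointwise bound ab ≤ a^p/p + b^q/q used in the alternative proof and Hölder’s inequality itself can be spot-checked numerically. The sketch below uses a four-point measure space with arbitrary masses and values; these, the exponent p = 3 and the small floating-point tolerance are assumptions of this example.

```python
p = 3.0
q = p / (p - 1)  # conjugate exponent, so that 1/p + 1/q = 1

# A finite measure space with four points and two functions on it.
mass = [0.5, 1.0, 2.0, 0.25]
f = [1.0, -2.0, 0.5, 3.0]
g = [0.0, 1.5, -4.0, 2.0]

def norm(values, exponent):
    """The L^exponent norm with respect to the point masses."""
    return sum(m * abs(v) ** exponent for m, v in zip(mass, values)) ** (1 / exponent)

lhs = sum(m * abs(u * v) for m, u, v in zip(mass, f, g))  # mu(|fg|)
rhs = norm(f, p) * norm(g, q)                             # ||f||_p ||g||_q

# Young's inequality on a small grid, with a tolerance for rounding.
young_ok = all(a * b <= a**p / p + b**q / q + 1e-12
               for a in (0.0, 0.3, 1.0, 2.5)
               for b in (0.0, 0.7, 1.0, 4.0))
print(lhs, rhs, young_ok)
```

Equality in the pointwise bound occurs when a^p = b^q, for instance at a = b = 1, which is why a tolerance is used in the floating-point comparison.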

Just like Jensen’s inequality, this is very useful when bounding integrals, and

it is also theoretically very important, because we are going to use it to prove

the Minkowski inequality. This tells us that the $L^p$ norm is actually a norm.

Before we prove the Minkowski inequality, we prove the following tiny lemma

that we will use repeatedly:

Lemma. Let a, b ≥ 0 and p ≥ 1. Then

\[
(a + b)^p \leq 2^p (a^p + b^p).
\]

This is a terrible bound, but is useful when we want to prove that things are

finite.

Proof. We wlog $a \leq b$. Then
\[
(a + b)^p \leq (2b)^p \leq 2^p b^p \leq 2^p (a^p + b^p).
\]

Theorem (Minkowski inequality). Let $p \in [1, \infty]$ and $f, g$ measurable. Then
\[
\|f + g\|_p \leq \|f\|_p + \|g\|_p.
\]

Again the proof is magic.

Proof. We do the boring cases first. If $p = 1$, then
\[
\|f + g\|_1 = \int |f + g| \leq \int (|f| + |g|) = \int |f| + \int |g| = \|f\|_1 + \|g\|_1.
\]
The proof of the case of $p = \infty$ is similar.

Now note that if $\|f + g\|_p = 0$, then the result is trivial. On the other hand, if $\|f + g\|_p = \infty$, then since we have
\[
|f + g|^p \leq (|f| + |g|)^p \leq 2^p (|f|^p + |g|^p),
\]
we know the right hand side is infinite as well. So this case is also done.

Let’s now do the interesting case. We compute
\begin{align*}
\mu(|f + g|^p) &= \mu(|f + g||f + g|^{p-1}) \\
&\leq \mu(|f||f + g|^{p-1}) + \mu(|g||f + g|^{p-1}) \\
&\leq \|f\|_p \||f + g|^{p-1}\|_q + \|g\|_p \||f + g|^{p-1}\|_q \\
&= (\|f\|_p + \|g\|_p) \||f + g|^{p-1}\|_q \\
&= (\|f\|_p + \|g\|_p) \mu(|f + g|^{(p-1)q})^{1 - 1/p} \\
&= (\|f\|_p + \|g\|_p) \mu(|f + g|^p)^{1 - 1/p}.
\end{align*}
So we know
\[
\mu(|f + g|^p) \leq (\|f\|_p + \|g\|_p) \mu(|f + g|^p)^{1 - 1/p}.
\]
Then dividing both sides by $\mu(|f + g|^p)^{1 - 1/p}$ tells us
\[
\mu(|f + g|^p)^{1/p} = \|f + g\|_p \leq \|f\|_p + \|g\|_p.
\]

Given these inequalities, we can go and prove some properties of $L^p$ spaces.

4.2 $L^p$ spaces

Recall the following definition:

Definition (Norm of vector space). Let $V$ be a vector space. A norm on $V$ is a function $\| \cdot \| : V \to \mathbb{R}_{\geq 0}$ such that

(i) $\|u + v\| \leq \|u\| + \|v\|$ for all $u, v \in V$.

(ii) $\|\alpha v\| = |\alpha| \|v\|$ for all $v \in V$ and $\alpha \in \mathbb{R}$.

(iii) $\|v\| = 0$ implies $v = 0$.

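Definition ($\mathcal{L}^p$ spaces). Let $(E, \mathcal{E}, \mu)$ be a measure space. For $1 \leq p < \infty$, we define $\mathcal{L}^p = \mathcal{L}^p(E, \mathcal{E}, \mu)$ to be the set of all measurable functions $f$ such that
\[
\|f\|_p = \left( \int |f|^p \, d\mu \right)^{1/p} < \infty.
\]
For $p = \infty$, we let $\mathcal{L}^\infty = \mathcal{L}^\infty(E, \mathcal{E}, \mu)$ be the space of functions with
\[
\|f\|_\infty = \inf\{\lambda \geq 0 : |f| \leq \lambda \text{ a.e.}\} < \infty.
\]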

By Minkowski’s inequality, we know $\mathcal{L}^p$ is a vector space, and also (i) holds. By definition, (ii) holds obviously. However, (iii) does not hold for $\| \cdot \|_p$, because $\|f\|_p = 0$ does not imply that $f = 0$. It merely implies that $f = 0$ a.e.

To fix this, we define an equivalence relation as follows: for $f, g \in \mathcal{L}^p$, we say that $f \sim g$ iff $f - g = 0$ a.e. For any $f \in \mathcal{L}^p$, we let $[f]$ denote its equivalence class under this relation. In other words,
\[
[f] = \{g \in \mathcal{L}^p : f - g = 0 \text{ a.e.}\}.
\]

Definition ($L^p$ space). We define
\[
L^p = \{[f] : f \in \mathcal{L}^p\},
\]
where
\[
[f] = \{g \in \mathcal{L}^p : f - g = 0 \text{ a.e.}\}.
\]
This is a normed vector space under the $\| \cdot \|_p$ norm.

One important property of $L^p$ is that it is complete, i.e. every Cauchy sequence converges.

Definition (Complete vector space/Banach spaces). A normed vector space $(V, \| \cdot \|)$ is complete if every Cauchy sequence converges. In other words, if $(v_n)$ is a sequence in $V$ such that $\|v_n - v_m\| \to 0$ as $n, m \to \infty$, then there is some $v \in V$ such that $\|v_n - v\| \to 0$ as $n \to \infty$. A complete normed vector space is known as a Banach space.

Theorem. Let $1 \leq p \leq \infty$. Then $L^p$ is a Banach space. In other words, if $(f_n)$ is a sequence in $L^p$, with the property that $\|f_n - f_m\|_p \to 0$ as $n, m \to \infty$, then there is some $f \in L^p$ such that $\|f_n - f\|_p \to 0$ as $n \to \infty$.

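Proof. We will only give the proof for $p < \infty$. The $p = \infty$ case is left as an exercise for the reader.

Suppose that $(f_n)$ is a sequence in $L^p$ with $\|f_n - f_m\|_p \to 0$ as $n, m \to \infty$. Take a subsequence $(f_{n_k})$ of $(f_n)$ with
\[
\|f_{n_{k+1}} - f_{n_k}\|_p \leq 2^{-k}
\]
for all $k \in \mathbb{N}$. We then find that
\[
\left\| \sum_{k=1}^M |f_{n_{k+1}} - f_{n_k}| \right\|_p \leq \sum_{k=1}^M \|f_{n_{k+1}} - f_{n_k}\|_p \leq 1.
\]
We know that
\[
\sum_{k=1}^M |f_{n_{k+1}} - f_{n_k}| \nearrow \sum_{k=1}^\infty |f_{n_{k+1}} - f_{n_k}| \text{ as } M \to \infty.
\]
So applying the monotone convergence theorem, we know that
\[
\left\| \sum_{k=1}^\infty |f_{n_{k+1}} - f_{n_k}| \right\|_p \leq \sum_{k=1}^\infty \|f_{n_{k+1}} - f_{n_k}\|_p \leq 1.
\]
In particular,
\[
\sum_{k=1}^\infty |f_{n_{k+1}} - f_{n_k}| < \infty \text{ a.e.}
\]
So $f_{n_k}(x)$ converges a.e., since the real line is complete. So we set
\[
f(x) = \begin{cases} \lim_{k \to \infty} f_{n_k}(x) & \text{if the limit exists} \\ 0 & \text{otherwise} \end{cases}
\]
By an exercise on the first example sheet, this function is indeed measurable. Then we have
\begin{align*}
\|f_n - f\|_p^p &= \mu(|f_n - f|^p) \\
&= \mu\left( \liminf_{k \to \infty} |f_n - f_{n_k}|^p \right) \\
&\leq \liminf_{k \to \infty} \mu(|f_n - f_{n_k}|^p),
\end{align*}
which tends to 0 as $n \to \infty$ since the sequence is Cauchy. So $f$ is indeed the limit.

Finally, we have to check that $f \in L^p$. We have
\begin{align*}
\mu(|f|^p) &= \mu(|f - f_n + f_n|^p) \\
&\leq \mu((|f - f_n| + |f_n|)^p) \\
&\leq \mu(2^p (|f - f_n|^p + |f_n|^p)) \\
&= 2^p (\mu(|f - f_n|^p) + \mu(|f_n|^p)).
\end{align*}
We know the first term tends to 0, and in particular is finite for $n$ large enough, and the second term is also finite. So done.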

4.3 Orthogonal projection in $L^2$

In the particular case $p = 2$, we have an extra structure on $L^2$, namely an inner product structure, given by
\[
\langle f, g \rangle = \int fg \, d\mu.
\]
This inner product induces the $L^2$ norm by
\[
\|f\|_2^2 = \langle f, f \rangle.
\]

Recall the following definition:

Definition

(Hilbert space)

A Hilbert space is a vector space with a complete

inner product.

So $L^2$ is not only a Banach space, but a Hilbert space as well.

Somehow Hilbert spaces are much nicer than Banach spaces, because you

have an inner product structure as well. One particular thing we can do is

orthogonal complements.

Definition (Orthogonal functions). Two functions $f, g \in L^2$ are orthogonal if
\[
\langle f, g \rangle = 0.
\]

Definition (Orthogonal complement). Let $V \subseteq L^2$. We then set
\[
V^\perp = \{f \in L^2 : \langle f, v \rangle = 0 \text{ for all } v \in V\}.
\]

Note that we can always make these definitions for any inner product space.

However, the completeness of the space guarantees nice properties of the orthog-

onal complement.

Before we proceed further, we need to make a definition of what it means for a subspace of $\mathcal{L}^2$ to be closed. This isn’t the usual definition, since $\mathcal{L}^2$ isn’t really a normed vector space, so we need to accommodate for that fact.

Definition (Closed subspace). Let $V \subseteq \mathcal{L}^2$. Then $V$ is closed if whenever $(f_n)$ is a sequence in $V$ with $f_n \to f$, then there exists $v \in V$ with $v \sim f$.

The main thing that makes $\mathcal{L}^2$ nice is that we can use closed subspaces to decompose functions orthogonally.

Theorem. Let $V$ be a closed subspace of $\mathcal{L}^2$. Then each $f \in \mathcal{L}^2$ has an orthogonal decomposition
\[
f = u + v,
\]
where $v \in V$ and $u \in V^\perp$. Moreover,
\[
\|f - v\|_2 \leq \|f - g\|_2
\]
for all $g \in V$ with equality iff $g \sim v$.

To prove this result, we need two simple identities, which can be easily proven

by writing out the expression.

Lemma (Pythagoras identity).
\[
\|f + g\|^2 = \|f\|^2 + \|g\|^2 + 2\langle f, g \rangle.
\]

Lemma (Parallelogram law).
\[
\|f + g\|^2 + \|f - g\|^2 = 2(\|f\|^2 + \|g\|^2).
\]

To prove the existence of orthogonal decomposition, we need to use a slight

trick involving the parallelogram law.

Proof of orthogonal decomposition. Given $f \in L^2$, we take a sequence $(g_n)$ in $V$ such that

\[ \|f - g_n\|_2 \to d(f, V) = \inf_{g \in V} \|f - g\|_2. \]

We now want to show that the infimum is attained. To do so, we show that $(g_n)$ is a Cauchy sequence, and by the completeness of $L^2$, it will have a limit.

If we apply the parallelogram law with $u = f - g_n$ and $v = f - g_m$, then we know

\[ \|u + v\|_2^2 + \|u - v\|_2^2 = 2(\|u\|_2^2 + \|v\|_2^2). \]

Using our particular choice of $u$ and $v$, we obtain

\[ \left\| 2\left(f - \frac{g_n + g_m}{2}\right) \right\|_2^2 + \|g_n - g_m\|_2^2 = 2(\|f - g_n\|_2^2 + \|f - g_m\|_2^2). \]

So we have

\[ \|g_n - g_m\|_2^2 = 2(\|f - g_n\|_2^2 + \|f - g_m\|_2^2) - 4\left\| f - \frac{g_n + g_m}{2} \right\|_2^2. \]

The first term on the right hand side tends to $4 d(f, V)^2$, and the last term is bounded below in magnitude by $4 d(f, V)^2$, since $(g_n + g_m)/2 \in V$. So as $n, m \to \infty$, we must have $\|g_n - g_m\|_2 \to 0$. By completeness of $L^2$, there exists a $g \in L^2$ such that $g_n \to g$.

Now since $V$ is assumed to be closed, we can find a $v \in V$ such that $g = v$ a.e. Then we know

\[ \|f - v\|_2 = \lim_{n \to \infty} \|f - g_n\|_2 = d(f, V). \]

So $v$ attains the infimum. To show that this gives us an orthogonal decomposition, we want to show that

\[ u = f - v \in V^\perp. \]

Suppose $h \in V$. We need to show that $\langle u, h \rangle = 0$. We need to do another funny trick. Suppose $t \in \mathbb{R}$. Then we have

\[ d(f, V)^2 \leq \|f - (v + th)\|_2^2 = \|f - v\|_2^2 + t^2 \|h\|_2^2 - 2t \langle f - v, h \rangle. \]

We think of this as a quadratic in $t$, which is minimized when

\[ t = \frac{\langle f - v, h \rangle}{\|h\|_2^2}. \]

But we know this quadratic is minimized when $t = 0$, since its value there is $\|f - v\|_2^2 = d(f, V)^2$. So $\langle f - v, h \rangle = 0$.
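The decomposition can be checked by hand in a finite setting. The following is a minimal sketch, assuming a four-point measure space with hypothetical weights `mu`, so that $L^2$ is just $\mathbb{R}^4$ with a weighted inner product. The helper names `inner` and `project` are illustrative only, and the projection is computed by Gram-Schmidt rather than by the minimizing-sequence argument of the proof.

```python
from typing import List, Sequence

Vector = List[float]


def inner(f: Sequence[float], g: Sequence[float], mu: Sequence[float]) -> float:
    """Inner product <f, g> = sum_x f(x) g(x) mu({x}) on a finite measure space."""
    return sum(a * b * m for a, b, m in zip(f, g, mu))


def project(f: Sequence[float], basis: Sequence[Sequence[float]],
            mu: Sequence[float]) -> Vector:
    """Orthogonal projection of f onto span(basis), via Gram-Schmidt."""
    ortho: List[Vector] = []
    for b in basis:
        w = list(b)
        for e in ortho:
            c = inner(w, e, mu)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        norm_sq = inner(w, w, mu)
        if norm_sq > 1e-12:  # skip vectors that are already in the span
            ortho.append([wi / norm_sq ** 0.5 for wi in w])
    v = [0.0] * len(f)
    for e in ortho:
        c = inner(f, e, mu)
        v = [vi + c * ei for vi, ei in zip(v, e)]
    return v


# Hypothetical data: weights of the measure, a function f, and a spanning set of V.
mu = [0.1, 0.2, 0.3, 0.4]
f = [1.0, -2.0, 0.5, 3.0]
basis = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]

v = project(f, basis, mu)                  # the component in V
u = [fi - vi for fi, vi in zip(f, v)]      # the component in V-perp
residuals = [inner(u, b, mu) for b in basis]
```

Here `residuals` should vanish up to rounding, which is the statement $u \in V^\perp$, and $\|f - v\|^2$ should not exceed $\|f - g\|^2$ for any other $g$ in the span.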

We are now going to look at the relationship between conditional expectation

and orthogonal projection.

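Definition (Conditional expectation). Suppose we have a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and $(G_n)$ is a collection of pairwise disjoint events with $\bigcup_n G_n = \Omega$. We let

\[ \mathcal{G} = \sigma(G_n : n \in \mathbb{N}). \]

The conditional expectation of $X$ given $\mathcal{G}$ is the random variable

\[ Y = \sum_{n=1}^\infty \mathbb{E}[X \mid G_n] 1_{G_n}, \]

where

\[ \mathbb{E}[X \mid G_n] = \frac{\mathbb{E}[X 1_{G_n}]}{\mathbb{P}[G_n]} \quad \text{for } \mathbb{P}[G_n] > 0. \]

In other words, given any $x \in \Omega$, say $x \in G_n$, then $Y(x) = \mathbb{E}[X \mid G_n]$.

If $X \in L^2(\mathbb{P})$, then $Y \in L^2(\mathbb{P})$, and it is clear that $Y$ is $\mathcal{G}$-measurable. We claim that this is in fact the projection of $X$ onto the subspace $L^2(\mathcal{G}, \mathbb{P})$ of $\mathcal{G}$-measurable $L^2$ random variables in the ambient space $L^2(\mathbb{P})$.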

Proposition. The conditional expectation of $X$ given $\mathcal{G}$ is the projection of $X$ onto the subspace $L^2(\mathcal{G}, \mathbb{P})$ of $\mathcal{G}$-measurable $L^2$ random variables in the ambient space $L^2(\mathbb{P})$.

In some sense, this tells us $Y$ is our best prediction of $X$ given only the information encoded in $\mathcal{G}$.

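Proof. Let $Y$ be the conditional expectation. It suffices to show that $\mathbb{E}[(X - W)^2]$ is minimized for $W = Y$ among $\mathcal{G}$-measurable random variables. Suppose that $W$ is a $\mathcal{G}$-measurable random variable. Since

\[ \mathcal{G} = \sigma(G_n : n \in \mathbb{N}), \]

it follows that

\[ W = \sum_{n=1}^\infty a_n 1_{G_n}, \]

where $a_n \in \mathbb{R}$. Then

\[ \mathbb{E}[(X - W)^2] = \mathbb{E}\left[\sum_{n=1}^\infty (X - a_n)^2 1_{G_n}\right] = \mathbb{E}\left[\sum_n (X^2 + a_n^2 - 2 a_n X) 1_{G_n}\right] = \mathbb{E}\left[\sum_n (X^2 + a_n^2 - 2 a_n \mathbb{E}[X \mid G_n]) 1_{G_n}\right]. \]

We now optimize the quadratic

\[ X^2 + a_n^2 - 2 a_n \mathbb{E}[X \mid G_n] \]

over $a_n$. We see that this is minimized for

\[ a_n = \mathbb{E}[X \mid G_n]. \]

Note that this does not depend on what $X$ is in the quadratic, since it is in the constant term.

Therefore we know that $\mathbb{E}[(X - W)^2]$ is minimized for $W = Y$.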
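To make the proposition concrete, here is a small sketch on a finite probability space, assuming a fair six-sided die as $X$ and a hypothetical partition into three events. The names `conditional_expectation` and `mse` are illustrative; exact rational arithmetic is used so the minimization can be checked without rounding.

```python
from fractions import Fraction
from typing import List, Sequence

# Hypothetical finite probability space: outcomes 0..5, each with probability 1/6.
p = [Fraction(1, 6)] * 6
X = [Fraction(k) for k in (1, 2, 3, 4, 5, 6)]
partition = [[0, 1], [2, 3, 4], [5]]  # pairwise disjoint events G_n covering Omega


def conditional_expectation(X: Sequence[Fraction], p: Sequence[Fraction],
                            partition: Sequence[Sequence[int]]) -> List[Fraction]:
    """Return Y with Y(x) = E[X 1_{G_n}] / P[G_n] for x in G_n."""
    Y = [Fraction(0)] * len(X)
    for G in partition:
        prob = sum(p[x] for x in G)
        mean = sum(X[x] * p[x] for x in G) / prob
        for x in G:
            Y[x] = mean
    return Y


def mse(X: Sequence[Fraction], W: Sequence[Fraction], p: Sequence[Fraction]) -> Fraction:
    """E[(X - W)^2], the squared L^2 distance between X and W."""
    return sum((x - w) ** 2 * q for x, w, q in zip(X, W, p))


Y = conditional_expectation(X, p, partition)

# Another G-measurable candidate (constant on each block), for comparison.
W = [Fraction(2), Fraction(2), Fraction(4), Fraction(4), Fraction(4), Fraction(6)]
```

On this example `Y` is constant on each block, and `mse(X, Y, p)` is smaller than `mse(X, W, p)` for the alternative block-constant `W`, in line with $Y$ being the closest $\mathcal{G}$-measurable variable to $X$.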

We can also rephrase variance and covariance in terms of the $L^2$ spaces. Suppose $X, Y \in L^2(\mathbb{P})$ with

\[ m_X = \mathbb{E}[X], \quad m_Y = \mathbb{E}[Y]. \]

Then variance and covariance just correspond to norm and inner product. In fact, we have

\[ \operatorname{var}(X) = \mathbb{E}[(X - m_X)^2] = \|X - m_X\|_2^2, \]

\[ \operatorname{cov}(X, Y) = \mathbb{E}[(X - m_X)(Y - m_Y)] = \langle X - m_X, Y - m_Y \rangle. \]

More generally, the covariance matrix of a random vector $X = (X_1, \cdots, X_n)$ is given by

\[ \operatorname{var}(X) = (\operatorname{cov}(X_i, X_j))_{ij}. \]

On the example sheet, we will see that the covariance matrix is a non-negative definite matrix.

4.4 Convergence in $L^1(\mathbb{P})$ and uniform integrability

What we are looking at here is the following question: suppose $(X_n)$, $X$ are random variables and $X_n \to X$ in probability. Under what extra assumptions is it true that $X_n$ also converges to $X$ in $L^1$, i.e. $\mathbb{E}[|X_n - X|] \to 0$ as $n \to \infty$?

This is not always true.

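Example. If we take $(\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}((0, 1)), \text{Lebesgue})$, and

\[ X_n = n 1_{(0, 1/n)}. \]

Then $X_n \to 0$ in probability, and in fact $X_n \to 0$ almost surely. However,

\[ \mathbb{E}[|X_n - 0|] = \mathbb{E}[X_n] = n \cdot \frac{1}{n} = 1, \]

which does not converge to $0$.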

We see that the problem with this sequence is that there is a lot of “stuff” concentrated near 0, and indeed the functions can get unbounded near 0. We can easily curb this problem by requiring our functions to be bounded:

Theorem (Bounded convergence theorem). Suppose $X, (X_n)$ are random variables. Assume that there exists a (non-random) constant $C > 0$ such that $|X_n| \leq C$. If $X_n \to X$ in probability, then $X_n \to X$ in $L^1$.

The proof is a rather standard manipulation.

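Proof. We first show that $|X| \leq C$ a.s. Let $\varepsilon > 0$. We then have

\[ \mathbb{P}[|X| > C + \varepsilon] \leq \mathbb{P}[|X - X_n| + |X_n| > C + \varepsilon] \leq \mathbb{P}[|X - X_n| > \varepsilon] + \mathbb{P}[|X_n| > C]. \]

We know the second term vanishes, while the first term $\to 0$ as $n \to \infty$. So we know

\[ \mathbb{P}[|X| > C + \varepsilon] = 0 \]

for all $\varepsilon$. Since $\varepsilon$ was arbitrary, we know $|X| \leq C$ a.s.

Now fix an $\varepsilon > 0$. Then

\[ \mathbb{E}[|X_n - X|] = \mathbb{E}\left[|X_n - X|\left(1_{|X_n - X| \leq \varepsilon} + 1_{|X_n - X| > \varepsilon}\right)\right] \leq \varepsilon + 2C\, \mathbb{P}[|X_n - X| > \varepsilon]. \]

Since $X_n \to X$ in probability, for $n$ sufficiently large, the second term is $\leq \varepsilon$. So $\mathbb{E}[|X_n - X|] \leq 2\varepsilon$, and we have convergence in $L^1$.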

But we can do better than that. We don’t need the functions to be actually

bounded. We just need that the functions aren’t concentrated in arbitrarily

small subsets of Ω. Thus, we make the following definition:

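Definition (Uniformly integrable). Let $\mathcal{X}$ be a family of random variables. Define

\[ I_{\mathcal{X}}(\delta) = \sup\{\mathbb{E}[|X| 1_A] : X \in \mathcal{X}, A \in \mathcal{F} \text{ with } \mathbb{P}[A] < \delta\}. \]

Then we say $\mathcal{X}$ is uniformly integrable if $\mathcal{X}$ is $L^1$-bounded (see below), and $I_{\mathcal{X}}(\delta) \to 0$ as $\delta \to 0$.

Definition ($L^p$-bounded). Let $\mathcal{X}$ be a family of random variables. Then we say $\mathcal{X}$ is $L^p$-bounded if

\[ \sup\{\|X\|_p : X \in \mathcal{X}\} < \infty. \]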

In some sense, this is “uniform continuity for integration”. It is immediate

that

Proposition.

Finite unions of uniformly integrable sets are uniformly integrable.

How can we find uniformly integrable families? The following proposition

gives us a large class of such families.

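Proposition. Let $\mathcal{X}$ be an $L^p$-bounded family for some $p > 1$. Then $\mathcal{X}$ is uniformly integrable.

Proof. We let

\[ C = \sup\{\|X\|_p : X \in \mathcal{X}\} < \infty. \]

Suppose that $X \in \mathcal{X}$ and $A \in \mathcal{F}$. We then have

\[ \mathbb{E}[|X| 1_A] \leq \mathbb{E}[|X|^p]^{1/p}\, \mathbb{P}[A]^{1/q} \leq C\, \mathbb{P}[A]^{1/q} \]

by Hölder's inequality, where $p, q$ are conjugates. This is now a uniform bound depending only on $\mathbb{P}[A]$. So done.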

This is the best we can get: $L^1$ boundedness is not enough. Indeed, our earlier example

\[ X_n = n 1_{(0, 1/n)} \]

is $L^1$ bounded but not uniformly integrable.
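The contrast between $L^1$-bounded and $L^p$-bounded families can be seen numerically. The sketch below uses the closed-form tail expectations $\mathbb{E}[X_n 1_{X_n > k}]$ for $X_n = n 1_{(0,1/n)}$ and, as an assumed comparison family not taken from the notes, $Y_n = \sqrt{n}\, 1_{(0,1/n)}$, which has $\mathbb{E}[Y_n^2] = 1$. The supremum over $n$ is truncated at a hypothetical `n_max`, so this is an illustration rather than a proof.

```python
import math
from typing import Callable


def tail_X(n: int, k: float) -> float:
    """E[X_n 1_{X_n > k}] for X_n = n 1_(0,1/n) under Lebesgue measure on (0, 1)."""
    # X_n equals n on a set of measure 1/n, so the tail is n * (1/n) = 1 once n > k.
    return 1.0 if n > k else 0.0


def tail_Y(n: int, k: float) -> float:
    """The same tail for Y_n = sqrt(n) 1_(0,1/n), an L^2-bounded family."""
    root = math.sqrt(n)
    return root / n if root > k else 0.0


def sup_tail(tail: Callable[[int, float], float], k: float, n_max: int = 20000) -> float:
    """Supremum of the tail expectation over n = 1, ..., n_max (a truncation)."""
    return max(tail(n, k) for n in range(1, n_max + 1))


levels = [1.0, 10.0, 100.0]
sup_X = [sup_tail(tail_X, k) for k in levels]  # stays at 1 for every level k
sup_Y = [sup_tail(tail_Y, k) for k in levels]  # decays roughly like 1/k
```

The tails of $(X_n)$ do not shrink as $k$ grows, while those of $(Y_n)$ do, which is the behaviour that the truncation characterisation of uniform integrability asks for.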

For many practical purposes, it is convenient to rephrase the definition of

uniform integrability as follows:

Lemma. Let $\mathcal{X}$ be a family of random variables. Then $\mathcal{X}$ is uniformly integrable if and only if

\[ \sup\{\mathbb{E}[|X| 1_{|X| > k}] : X \in \mathcal{X}\} \to 0 \]

as $k \to \infty$.

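Proof.

($\Rightarrow$) Suppose that $\mathcal{X}$ is uniformly integrable. For any $k$, and $X \in \mathcal{X}$, by Chebyshev's inequality, we have

\[ \mathbb{P}[|X| \geq k] \leq \frac{\mathbb{E}[|X|]}{k}. \]

Given $\varepsilon > 0$, we pick $\delta$ such that $\mathbb{E}[|X| 1_A] < \varepsilon$ for all $X \in \mathcal{X}$ and all $A$ with $\mathbb{P}(A) < \delta$. Then pick $k$ sufficiently large such that $k\delta > \sup\{\mathbb{E}[|X|] : X \in \mathcal{X}\}$, which is possible since $\mathcal{X}$ is $L^1$-bounded. Then $\mathbb{P}[|X| \geq k] < \delta$, and hence $\mathbb{E}[|X| 1_{|X| > k}] < \varepsilon$ for all $X \in \mathcal{X}$.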

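($\Leftarrow$) Suppose that the condition in the lemma holds. We first show that $\mathcal{X}$ is $L^1$-bounded. We have

\[ \mathbb{E}[|X|] = \mathbb{E}[|X|(1_{|X| \leq k} + 1_{|X| > k})] \leq k + \mathbb{E}[|X| 1_{|X| > k}] < \infty \]

by picking a large enough $k$.

Next note that for any measurable $A$ and $X \in \mathcal{X}$, we have

\[ \mathbb{E}[|X| 1_A] = \mathbb{E}[|X| 1_A (1_{|X| > k} + 1_{|X| \leq k})] \leq \mathbb{E}[|X| 1_{|X| > k}] + k\, \mathbb{P}[A]. \]

Thus, for any $\varepsilon > 0$, we can pick $k$ sufficiently large such that the first term is $< \varepsilon/2$ for all $X \in \mathcal{X}$ by assumption. Then when $\mathbb{P}[A] < \frac{\varepsilon}{2k}$, we have $\mathbb{E}[|X| 1_A] \leq \varepsilon$.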

As a corollary, we find that

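Corollary. Let $\mathcal{X} = \{X\}$, where $X \in L^1(\mathbb{P})$. Then $\mathcal{X}$ is uniformly integrable. Hence, a finite collection of $L^1$ functions is uniformly integrable.

Proof. Note that

\[ \mathbb{E}[|X|] = \sum_{k=0}^\infty \mathbb{E}[|X| 1_{|X| \in [k, k+1)}]. \]

Since the sum is finite, we must have

\[ \mathbb{E}[|X| 1_{|X| \geq K}] = \sum_{k=K}^\infty \mathbb{E}[|X| 1_{|X| \in [k, k+1)}] \to 0 \]

as $K \to \infty$.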

With all that preparation, we now come to the main theorem on uniform

integrability.

Theorem. Let $X, (X_n)$ be random variables. Then the following are equivalent:

(i) $X_n, X \in L^1$ for all $n$ and $X_n \to X$ in $L^1$.

(ii) $\{X_n\}$ is uniformly integrable and $X_n \to X$ in probability.

The (i) $\Rightarrow$ (ii) direction is just a standard manipulation. The idea of the (ii) $\Rightarrow$ (i) direction is that we use uniform integrability to cut off $X_n$ and $X$ at some large value $K$, which gives us a small error, then apply bounded convergence.

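Proof. We first assume that $X_n, X$ are $L^1$ and $X_n \to X$ in $L^1$. We want to show that $\{X_n\}$ is uniformly integrable and $X_n \to X$ in probability.

We first show that $X_n \to X$ in probability. This is just going to come from the Chebyshev inequality. For $\varepsilon > 0$, we have

\[ \mathbb{P}[|X - X_n| > \varepsilon] \leq \frac{\mathbb{E}[|X - X_n|]}{\varepsilon} \to 0 \]

as $n \to \infty$.

Next we show that $\{X_n\}$ is uniformly integrable. Fix $\varepsilon > 0$. Take $N$ such that $n \geq N$ implies $\mathbb{E}[|X - X_n|] \leq \frac{\varepsilon}{2}$. Since finite families of $L^1$ random variables are uniformly integrable, we can pick $\delta > 0$ such that $A \in \mathcal{F}$ and $\mathbb{P}[A] < \delta$ implies

\[ \mathbb{E}[|X| 1_A],\ \mathbb{E}[|X_n| 1_A] \leq \frac{\varepsilon}{2} \]

for $n = 1, \cdots, N$.

Now when $n > N$ and $A \in \mathcal{F}$ with $\mathbb{P}[A] < \delta$, then we have

\[ \mathbb{E}[|X_n| 1_A] \leq \mathbb{E}[|X - X_n| 1_A] + \mathbb{E}[|X| 1_A] \leq \mathbb{E}[|X - X_n|] + \frac{\varepsilon}{2} \leq \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon. \]

So $\{X_n\}$ is uniformly integrable.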

Assume that $\{X_n\}$ is uniformly integrable and $X_n \to X$ in probability.

The first step is to show that $X \in L^1$. We want to use Fatou's lemma, but to do so, we want almost sure convergence, not just convergence in probability.

Recall that we have previously shown that there is a subsequence $(X_{n_k})$ of $(X_n)$ such that $X_{n_k} \to X$ a.s. Then we have

\[ \mathbb{E}[|X|] = \mathbb{E}\left[\liminf_{k \to \infty} |X_{n_k}|\right] \leq \liminf_{k \to \infty} \mathbb{E}[|X_{n_k}|] < \infty \]

since uniformly integrable families are $L^1$ bounded. So $\mathbb{E}[|X|] < \infty$, hence $X \in L^1$.

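Next we want to show that $X_n \to X$ in $L^1$. Take $\varepsilon > 0$. Then there exists $K \in (0, \infty)$ such that

\[ \mathbb{E}\left[|X| 1_{\{|X| > K\}}\right],\ \mathbb{E}\left[|X_n| 1_{\{|X_n| > K\}}\right] \leq \frac{\varepsilon}{3} \]

for all $n$. To set things up so that we can use the bounded convergence theorem, we have to invent new random variables

\[ X_n^K = (X_n \vee -K) \wedge K, \quad X^K = (X \vee -K) \wedge K. \]

Since $X_n \to X$ in probability, it follows that $X_n^K \to X^K$ in probability.

Now bounded convergence tells us that there is some $N$ such that $n \geq N$ implies

\[ \mathbb{E}[|X_n^K - X^K|] \leq \frac{\varepsilon}{3}. \]

Combining, we have for $n \geq N$ that

\[ \mathbb{E}[|X_n - X|] \leq \mathbb{E}[|X_n^K - X^K|] + \mathbb{E}[|X| 1_{\{|X| > K\}}] + \mathbb{E}[|X_n| 1_{\{|X_n| > K\}}] \leq \varepsilon. \]

So we know that $X_n \to X$ in $L^1$.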

The main application is when $\{X_n\}$ is a type of stochastic process known as a martingale. This will be done in III Advanced Probability and III Stochastic Calculus.

5 Fourier transform

We now turn to the exciting topic of the Fourier transform. There are two main questions we want to ask: when does the Fourier transform exist, and when can we recover a function from its Fourier transform?

Of course, not only do we want to know if the Fourier transform exists. We

also want to know if it lies in some nice space, e.g. L

It turns out that when we want to prove things about Fourier transforms,

it is often helpful to “smoothen” the function by doing what is known as a

Gaussian convolution. So after defining the Fourier transform and proving some

really basic properties, we are going to investigate convolutions and Gaussians

for a bit (convolutions are also useful on their own, since they correspond to

sums of independent random variables). After that, we can go and prove the

actual important properties of the Fourier transform.

5.1 The Fourier transform

When talking about Fourier transforms, we will mostly want to talk about functions $\mathbb{R}^d \to \mathbb{C}$. So from now on, we will write $L^p$ for complex valued Borel functions on $\mathbb{R}^d$ with

\[ \|f\|_p = \left(\int_{\mathbb{R}^d} |f|^p\right)^{1/p} < \infty. \]

The integrals of complex-valued functions are defined on the real and imaginary parts separately, and satisfy the properties we would expect them to. The details are on the first example sheet.

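Definition (Fourier transform). The Fourier transform $\hat{f} : \mathbb{R}^d \to \mathbb{C}$ of $f \in L^1(\mathbb{R}^d)$ is given by

\[ \hat{f}(u) = \int_{\mathbb{R}^d} f(x) e^{i(u, x)}\, dx, \]

where $u \in \mathbb{R}^d$ and $(u, x)$ denotes the inner product, i.e.

\[ (u, x) = u_1 x_1 + \cdots + u_d x_d. \]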

Why do we care about Fourier transforms? Many computations are easier with $\hat{f}$ in place of $f$, especially computations that involve differentiation and convolutions (which are relevant to sums of independent random variables). In particular, we will use it to prove the central limit theorem.

More generally, we can define the Fourier transform of a measure:

Definition (Fourier transform of measure). The Fourier transform of a finite measure $\mu$ on $\mathbb{R}^d$ is the function $\hat{\mu} : \mathbb{R}^d \to \mathbb{C}$ given by

\[ \hat{\mu}(u) = \int_{\mathbb{R}^d} e^{i(u, x)}\, \mu(dx). \]

In the context of probability, we give these things a different name:

Definition (Characteristic function). Let $X$ be a random variable. Then the characteristic function of $X$ is the Fourier transform of its law, i.e.

\[ \varphi_X(u) = \mathbb{E}[e^{i(u, X)}] = \hat{\mu}_X(u), \]

where $\mu_X$ is the law of $X$.

We now make the following (trivial) observations:

Proposition.

\[ \|\hat{f}\|_\infty \leq \|f\|_1, \quad \|\hat{\mu}\|_\infty \leq \mu(\mathbb{R}^d). \]

Less trivially, we have the following result:

Proposition. The functions $\hat{f}, \hat{\mu}$ are continuous.

Proof. If $u_n \to u$, then

\[ f(x) e^{i(u_n, x)} \to f(x) e^{i(u, x)}. \]

Also, we know that

\[ |f(x) e^{i(u_n, x)}| = |f(x)|. \]

So we can apply the dominated convergence theorem with $|f|$ as the bound.

5.2 Convolutions

To actually do something useful about the Fourier transforms, we need to talk

about convolutions.

Definition (Convolution of random variables). Let $\mu, \nu$ be probability measures. Their convolution $\mu * \nu$ is the law of $X + Y$, where $X$ has law $\mu$ and $Y$ has law $\nu$, and $X, Y$ are independent. Explicitly, we have

\[ \mu * \nu(A) = \mathbb{P}[X + Y \in A] = \iint 1_A(x + y)\, \mu(dx)\, \nu(dy). \]

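Let's suppose that $\mu$ has a density function $f$ with respect to the Lebesgue measure. Then we have

\[ \mu * \nu(A) = \iint 1_A(x + y) f(x)\, dx\, \nu(dy) = \iint 1_A(x) f(x - y)\, dx\, \nu(dy) = \int 1_A(x) \left(\int f(x - y)\, \nu(dy)\right) dx. \]

So we know that $\mu * \nu$ has density

\[ x \mapsto \int f(x - y)\, \nu(dy). \]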

This thing has a name.

Definition (Convolution of function with measure). Let $f \in L^p$ and $\nu$ a probability measure. Then the convolution of $f$ with $\nu$ is

\[ f * \nu(x) = \int f(x - y)\, \nu(dy) \in L^p. \]

Note that we do have to treat the two cases of convolutions separately, since

a measure need not have a density, and a function need not specify a probability

measure (it may not integrate to 1).

We check that it is indeed in $L^p$. Since $\nu$ is a probability measure, Jensen's inequality says we have

\[ \|f * \nu\|_p^p = \int \left|\int f(x - y)\, \nu(dy)\right|^p dx \leq \iint |f(x - y)|^p\, \nu(dy)\, dx = \iint |f(x - y)|^p\, dx\, \nu(dy) = \|f\|_p^p < \infty. \]

In fact, from this computation, we see that

Proposition. For any $f \in L^p$ and $\nu$ a probability measure, we have

\[ \|f * \nu\|_p \leq \|f\|_p. \]

The interesting thing happens when we try to take the Fourier transform of

a convolution.

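Proposition.

\[ \widehat{f * \nu}(u) = \hat{f}(u) \hat{\nu}(u). \]

Proof. We have

\[ \widehat{f * \nu}(u) = \int \left(\int f(x - y)\, \nu(dy)\right) e^{i(u, x)}\, dx = \iint f(x - y) e^{i(u, x)}\, dx\, \nu(dy) \]

\[ = \int \left(\int f(x - y) e^{i(u, x - y)}\, d(x - y)\right) e^{i(u, y)}\, \nu(dy) = \int \left(\int f(x) e^{i(u, x)}\, dx\right) e^{i(u, y)}\, \nu(dy) \]

\[ = \int \hat{f}(u) e^{i(u, y)}\, \nu(dy) = \hat{f}(u) \int e^{i(u, y)}\, \nu(dy) = \hat{f}(u) \hat{\nu}(u). \]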

In the context of random variables, we have a similar result:

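Proposition. Let $\mu, \nu$ be probability measures, and $X, Y$ be independent variables with laws $\mu, \nu$ respectively. Then

\[ \widehat{\mu * \nu}(u) = \hat{\mu}(u) \hat{\nu}(u). \]

Proof. We have

\[ \widehat{\mu * \nu}(u) = \mathbb{E}[e^{i(u, X + Y)}] = \mathbb{E}[e^{i(u, X)}]\, \mathbb{E}[e^{i(u, Y)}] = \hat{\mu}(u) \hat{\nu}(u). \]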
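For discrete laws the identity $\widehat{\mu * \nu} = \hat{\mu}\hat{\nu}$ can be verified directly, since every integral becomes a finite sum. The sketch below assumes two hypothetical laws, a fair die and a fair coin, stored as dictionaries from value to probability; `char_fn` and `convolve` are illustrative helper names.

```python
import cmath
from typing import Dict

Law = Dict[float, float]  # maps each value to its probability


def char_fn(law: Law, u: float) -> complex:
    """Characteristic function E[exp(i u X)] of a finitely supported law."""
    return sum(p * cmath.exp(1j * u * x) for x, p in law.items())


def convolve(mu: Law, nu: Law) -> Law:
    """Law of X + Y for independent X ~ mu and Y ~ nu."""
    out: Law = {}
    for x, p in mu.items():
        for y, q in nu.items():
            out[x + y] = out.get(x + y, 0.0) + p * q
    return out


# Hypothetical example laws.
die = {float(k): 1 / 6 for k in range(1, 7)}
coin = {0.0: 0.5, 1.0: 0.5}
total = convolve(die, coin)

# Largest discrepancy between the two sides over a few frequencies u.
gap = max(abs(char_fn(total, u) - char_fn(die, u) * char_fn(coin, u))
          for u in (-3.0, -0.7, 0.0, 0.4, 1.0, 2.5))
```

The quantity `gap` should be zero up to floating-point rounding.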

5.3 Fourier inversion formula

We now want to work towards proving the Fourier inversion formula:

Theorem (Fourier inversion formula). Let $f, \hat{f} \in L^1$. Then

\[ f(x) = \frac{1}{(2\pi)^d} \int \hat{f}(u) e^{-i(u, x)}\, du \quad \text{a.e.} \]

Our strategy is as follows:

(i) Show that the Fourier inversion formula holds for a Gaussian distribution by direct computations.

(ii) Show that the formula holds for Gaussian convolutions, i.e. the convolution of an arbitrary function with a Gaussian.

(iii) Show that any function can be approximated by a Gaussian convolution.

Note that the last part makes a lot of sense. If $X$ is a random variable, then convolving with a Gaussian is just adding $\sqrt{t} Z$, and if we take $t \to 0$, we recover the original function. What we have to do is to show that this behaves sufficiently well with the Fourier transform and the Fourier inversion formula that we will actually get the result we want.

Gaussian densities

Before we start, we had better define the Gaussian distribution.

Definition (Gaussian density). The Gaussian density with variance $t$ is

\[ g_t(x) = \left(\frac{1}{2\pi t}\right)^{d/2} e^{-|x|^2 / 2t}. \]

This is equivalently the density of $\sqrt{t} Z$, where $Z = (Z_1, \cdots, Z_d)$ with $Z_i \sim N(0, 1)$ independent.

We now want to compute the Fourier transformation directly and show that

the Fourier inversion formula works for this.

We start off by working in the case $d = 1$ and $Z \sim N(0, 1)$. We want to compute the Fourier transform of the law of this guy, i.e. its characteristic function. We will use a nice trick.

Proposition. Let $Z \sim N(0, 1)$. Then

\[ \varphi_Z(u) = e^{-u^2/2}. \]

We see that this is in fact a Gaussian up to a factor of $\sqrt{2\pi}$.

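Proof. We have

\[ \varphi_Z(u) = \mathbb{E}[e^{iuZ}] = \frac{1}{\sqrt{2\pi}} \int e^{iux} e^{-x^2/2}\, dx. \]

We now notice that the $u$-derivative of the integrand is dominated by the integrable function $|x| e^{-x^2/2}$, so we can differentiate under the integral sign, and obtain

\[ \varphi_Z'(u) = \mathbb{E}[iZ e^{iuZ}] = \frac{1}{\sqrt{2\pi}} \int ix e^{iux} e^{-x^2/2}\, dx = -u \varphi_Z(u), \]

where the last equality is obtained by integrating by parts. So we know that $\varphi_Z(u)$ solves

\[ \varphi_Z'(u) = -u \varphi_Z(u). \]

This is easy to solve, since we can just integrate this. We find that

\[ \log \varphi_Z(u) = -\frac{1}{2} u^2 + C. \]

So we have

\[ \varphi_Z(u) = A e^{-u^2/2}. \]

We know that $A = 1$, since $\varphi_Z(0) = 1$. So we have

\[ \varphi_Z(u) = e^{-u^2/2}. \]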
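As a sanity check on the formula $\varphi_Z(u) = e^{-u^2/2}$, the defining integral can be approximated numerically. The following is a rough sketch using a midpoint rule on a truncated interval; the function name `phi_Z` and the truncation parameters `half_width` and `steps` are assumptions chosen for illustration.

```python
import cmath
import math


def phi_Z(u: float, half_width: float = 12.0, steps: int = 24000) -> complex:
    """Midpoint-rule approximation of (2*pi)^(-1/2) * int e^{iux} e^{-x^2/2} dx.

    The integral is truncated to [-half_width, half_width]; the Gaussian tail
    outside this interval is negligible for the default value.
    """
    h = 2 * half_width / steps
    total = 0j
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += cmath.exp(1j * u * x) * math.exp(-x * x / 2)
    return total * h / math.sqrt(2 * math.pi)


# Compare the quadrature with the closed form at a few frequencies.
errors = [abs(phi_Z(u) - math.exp(-u * u / 2)) for u in (0.0, 0.5, 1.0, 2.0)]
```

Each entry of `errors` should be tiny, and the imaginary part of `phi_Z(u)` should vanish up to rounding, reflecting the symmetry of the Gaussian.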

We now do this problem in general.

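Proposition. Let $Z = (Z_1, \cdots, Z_d)$ with $Z_j \sim N(0, 1)$ independent. Then $\sqrt{t} Z$ has density

\[ g_t(x) = \frac{1}{(2\pi t)^{d/2}} e^{-|x|^2/(2t)} \]

with

\[ \hat{g}_t(u) = e^{-|u|^2 t/2}. \]

Proof. We have

\[ \hat{g}_t(u) = \mathbb{E}[e^{i(u, \sqrt{t} Z)}] = \prod_{j=1}^d \mathbb{E}[e^{i u_j \sqrt{t} Z_j}] = \prod_{j=1}^d \varphi_Z(\sqrt{t} u_j) = \prod_{j=1}^d e^{-t u_j^2/2} = e^{-|u|^2 t/2}. \]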

Again, $g_t$ and $\hat{g}_t$ are almost the same, apart from the factor of $(2\pi t)^{-d/2}$ and the position of $t$ shifted. We can thus write this as

\[ \hat{g}_t(u) = (2\pi)^{d/2} t^{-d/2} g_{1/t}(u). \]

So this tells us that

\[ \hat{\hat{g}}_t(u) = (2\pi)^d g_t(u). \]

This is not exactly the same as saying the Fourier inversion formula works, because in the Fourier inversion formula, we integrated against $e^{-i(u, x)}$, not $e^{i(u, x)}$. However, we know that by the symmetry of the Gaussian distribution, we have

\[ g_t(x) = g_t(-x) = (2\pi)^{-d}\, \hat{\hat{g}}_t(-x) = \left(\frac{1}{2\pi}\right)^d \int \hat{g}_t(u) e^{-i(u, x)}\, du. \]

So we conclude that

Lemma. The Fourier inversion formula holds for the Gaussian density function.

Gaussian convolutions

Definition (Gaussian convolution). Let $f \in L^1$. Then a Gaussian convolution of $f$ is a function of the form $f * g_t$.

We are now going to do a little computation that shows that functions of this type also satisfy the Fourier inversion formula.

Before we start, we make some observations about the Gaussian convolution. By general theory of convolutions, we know that we have

Proposition.

\[ \|f * g_t\|_1 \leq \|f\|_1. \]

We also have a pointwise bound

\[ |f * g_t(x)| = \left|\int f(x - y) e^{-|y|^2/(2t)} \left(\frac{1}{2\pi t}\right)^{d/2} dy\right| \leq (2\pi t)^{-d/2} \int |f(x - y)|\, dy \leq (2\pi t)^{-d/2} \|f\|_1. \]

This tells us that in fact

Proposition.

\[ \|f * g_t\|_\infty \leq (2\pi t)^{-d/2} \|f\|_1. \]

So in fact the convolution is pointwise bounded. We see that the bound gets worse as $t \to 0$, and we will see that this is because as $t \to 0$, the convolution $f * g_t$ becomes a better and better approximation of $f$, and we did not assume that $f$ is bounded.

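Similarly, we can compute that

Proposition.

\[ \|\widehat{f * g_t}\|_1 = \|\hat{f} \hat{g}_t\|_1 \leq (2\pi)^{d/2} t^{-d/2} \|f\|_1, \]

and

\[ \|\widehat{f * g_t}\|_\infty \leq \|\hat{f}\|_\infty. \]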

Now given these bounds, it makes sense to write down the Fourier inversion

formula for a Gaussian convolution.

Lemma. The Fourier inversion formula holds for Gaussian convolutions.

We are going to reduce this to the fact that the Gaussian distribution itself

satisfies Fourier inversion.

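Proof. We have

\[ f * g_t(x) = \int f(x - y) g_t(y)\, dy = \int f(x - y) \left(\frac{1}{(2\pi)^d} \int \hat{g}_t(u) e^{-i(u, y)}\, du\right) dy \]

\[ = \left(\frac{1}{2\pi}\right)^d \iint f(x - y) \hat{g}_t(u) e^{-i(u, y)}\, du\, dy = \left(\frac{1}{2\pi}\right)^d \int \left(\int f(x - y) e^{i(u, x - y)}\, dy\right) \hat{g}_t(u) e^{-i(u, x)}\, du \]

\[ = \left(\frac{1}{2\pi}\right)^d \int \hat{f}(u) \hat{g}_t(u) e^{-i(u, x)}\, du = \left(\frac{1}{2\pi}\right)^d \int \widehat{f * g_t}(u) e^{-i(u, x)}\, du. \]

So done.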

The proof

Finally, we are going to extend the Fourier inversion formula to the case where $f, \hat{f} \in L^1$.

Theorem (Fourier inversion formula). Let $f \in L^1$ and

\[ f_t(x) = (2\pi)^{-d} \int \hat{f}(u) e^{-|u|^2 t/2} e^{-i(u, x)}\, du = (2\pi)^{-d} \int \widehat{f * g_t}(u) e^{-i(u, x)}\, du. \]

Then $\|f_t - f\|_1 \to 0$ as $t \to 0$, and the Fourier inversion formula holds whenever $f, \hat{f} \in L^1$.

To prove this, we first need to show that the Gaussian convolution is indeed

a good approximation of f:

Lemma. Suppose that $f \in L^p$ with $p \in [1, \infty)$. Then $\|f * g_t - f\|_p \to 0$ as $t \to 0$.

Note that this cannot hold for $p = \infty$. Indeed, if $p = \infty$, then the $L^\infty$-norm is the uniform norm. But we know that $f * g_t$ is always continuous, and the uniform limit of continuous functions is continuous. So the formula cannot hold if $f$ is not already continuous.

Proof. We fix $\varepsilon > 0$. By a question on the example sheet, we can find $h$ which is continuous and with compact support such that $\|f - h\|_p \leq \frac{\varepsilon}{3}$. So we have

\[ \|f * g_t - h * g_t\|_p = \|(f - h) * g_t\|_p \leq \|f - h\|_p \leq \frac{\varepsilon}{3}. \]

So it suffices for us to work with a continuous function $h$ with compact support. We let

\[ e(y) = \int |h(x - y) - h(x)|^p\, dx. \]

We first show that $e$ is a bounded function:

\[ e(y) \leq \int 2^p (|h(x - y)|^p + |h(x)|^p)\, dx = 2^{p+1} \|h\|_p^p. \]

Also, since $h$ is continuous and bounded with compact support, the dominated convergence theorem tells us that $e(y) \to 0$ as $y \to 0$.

Moreover, using the fact that $\int g_t(y)\, dy = 1$, we have

\[ \|h * g_t - h\|_p^p = \int \left|\int (h(x - y) - h(x)) g_t(y)\, dy\right|^p dx. \]

Since $g_t(y)\, dy$ is a probability measure, by Jensen's inequality, we can bound this by

\[ \leq \iint |h(x - y) - h(x)|^p g_t(y)\, dy\, dx = \int \left(\int |h(x - y) - h(x)|^p\, dx\right) g_t(y)\, dy = \int e(y) g_t(y)\, dy = \int e(\sqrt{t} y) g_1(y)\, dy, \]

where we used the definition of $e$ and substitution. We know that this tends to $0$ as $t \to 0$ by the bounded convergence theorem, since we know that $e$ is bounded.

Finally, we have

\[ \|f * g_t - f\|_p \leq \|f * g_t - h * g_t\|_p + \|h * g_t - h\|_p + \|h - f\|_p \leq \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \|h * g_t - h\|_p = \frac{2\varepsilon}{3} + \|h * g_t - h\|_p. \]

Since we know that $\|h * g_t - h\|_p \to 0$ as $t \to 0$, we know that for all sufficiently small $t$, the right-hand side is bounded above by $\varepsilon$. So we are done.
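The lemma can be watched in action for a discontinuous function. The sketch below takes the hypothetical example $f = 1_{[0,1]}$ in $d = 1$, for which $f * g_t(x) = \Phi(x/\sqrt{t}) - \Phi((x-1)/\sqrt{t})$ with $\Phi$ the standard normal distribution function, and estimates $\|f * g_t - f\|_1$ by a midpoint rule. The helper names and the integration window are assumptions made for illustration.

```python
import math
from typing import List


def Phi(z: float) -> float:
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def smoothed_indicator(x: float, t: float) -> float:
    """(f * g_t)(x) for f = 1_[0,1]: the probability that x - sqrt(t) Z lies in [0, 1]."""
    s = math.sqrt(t)
    return Phi(x / s) - Phi((x - 1.0) / s)


def l1_error(t: float, lo: float = -6.0, hi: float = 7.0, steps: int = 13000) -> float:
    """Midpoint-rule estimate of the L^1 distance between f * g_t and f."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        f = 1.0 if 0.0 <= x <= 1.0 else 0.0
        total += abs(smoothed_indicator(x, t) - f) * h
    return total


errors: List[float] = [l1_error(t) for t in (1.0, 0.1, 0.01, 0.001)]
```

The estimated errors decrease as $t$ shrinks, even though $f * g_t$ is continuous and $f$ is not. The same $f$ would not converge in the uniform norm, consistent with the remark that the lemma fails for $p = \infty$.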

With this lemma, we can now prove the Fourier inversion theorem.

Proof of Fourier inversion theorem. The first part is just a special case of the previous lemma. Indeed, recall that

\[ \widehat{f * g_t}(u) = \hat{f}(u) e^{-|u|^2 t/2}. \]

Since Gaussian convolutions satisfy the Fourier inversion formula, we know that

\[ f_t = f * g_t. \]

So the previous lemma says exactly that $\|f_t - f\|_1 \to 0$.

Suppose now that $\hat{f} \in L^1$ as well. Then looking at the integrand of

\[ f_t(x) = (2\pi)^{-d} \int \hat{f}(u) e^{-|u|^2 t/2} e^{-i(u, x)}\, du, \]

we know that

\[ \left|\hat{f}(u) e^{-|u|^2 t/2} e^{-i(u, x)}\right| \leq |\hat{f}|. \]

Then by the dominated convergence theorem with dominating function $|\hat{f}|$, we know that this converges to

\[ f_t(x) \to (2\pi)^{-d} \int \hat{f}(u) e^{-i(u, x)}\, du \quad \text{as } t \to 0. \]

By the first part, we know that $\|f_t - f\|_1 \to 0$ as $t \to 0$. So we can find a sequence $(t_n)$ with $t_n > 0$, $t_n \to 0$ so that $f_{t_n} \to f$ a.e. Combining these, we know that

\[ f(x) = (2\pi)^{-d} \int \hat{f}(u) e^{-i(u, x)}\, du \quad \text{a.e.} \]

So done.

5.4 Fourier transform in $L^2$

It turns out wonderful things happen when we take the Fourier transform of an $L^2$ function.

Theorem (Plancherel identity). For any function $f \in L^1 \cap L^2$, the Plancherel identity holds:

\[ \|\hat{f}\|_2 = (2\pi)^{d/2} \|f\|_2. \]

As we are going to see in a moment, this is just going to follow from the

Fourier inversion formula plus a clever trick.

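Proof. We first work with the special case where $f, \hat{f} \in L^1$, since the Fourier inversion formula holds for $f$. We then have

\[ \|f\|_2^2 = \int f(x) \overline{f(x)}\, dx = \frac{1}{(2\pi)^d} \int \left(\int \hat{f}(u) e^{-i(u, x)}\, du\right) \overline{f(x)}\, dx = \frac{1}{(2\pi)^d} \int \hat{f}(u) \left(\int \overline{f(x)} e^{-i(u, x)}\, dx\right) du \]

\[ = \frac{1}{(2\pi)^d} \int \hat{f}(u)\, \overline{\left(\int f(x) e^{i(u, x)}\, dx\right)}\, du = \frac{1}{(2\pi)^d} \int \hat{f}(u) \overline{\hat{f}(u)}\, du = \frac{1}{(2\pi)^d} \|\hat{f}\|_2^2. \]

So the Plancherel identity holds for $f$.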

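To prove it for the general case, we use this result and an approximation argument. Suppose that $f \in L^1 \cap L^2$, and let $f_t = f * g_t$. Then by our earlier lemma, we know that

\[ \|f_t\|_2 \to \|f\|_2 \quad \text{as } t \to 0. \]

Now note that

\[ \hat{f}_t(u) = \hat{f}(u) \hat{g}_t(u) = \hat{f}(u) e^{-|u|^2 t/2}. \]

The important thing is that $e^{-|u|^2 t/2} \nearrow 1$ as $t \to 0$. Therefore, we know

\[ \|\hat{f}_t\|_2^2 = \int |\hat{f}(u)|^2 e^{-|u|^2 t}\, du \to \int |\hat{f}(u)|^2\, du = \|\hat{f}\|_2^2 \]

as $t \to 0$, by monotone convergence.

Since $f_t, \hat{f}_t \in L^1$, we know that the Plancherel identity holds, i.e.

\[ \|\hat{f}_t\|_2 = (2\pi)^{d/2} \|f_t\|_2. \]

Taking the limit as $t \to 0$, the result follows.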
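The Plancherel identity can be checked numerically in the simplest case. The sketch below takes $d = 1$ and $f = g_t$ for a hypothetical value of $t$, uses the closed form $\hat{g}_t(u) = e^{-u^2 t/2}$ from the earlier proposition, and compares $\|\hat{g}_t\|_2^2$ with $2\pi \|g_t\|_2^2$ by a midpoint rule. The helper `l2_norm_sq` and the truncation window are assumptions.

```python
import math
from typing import Callable


def l2_norm_sq(fn: Callable[[float], float], half_width: float = 40.0,
               steps: int = 40000) -> float:
    """Midpoint-rule estimate of int |fn(x)|^2 dx over [-half_width, half_width]."""
    h = 2 * half_width / steps
    return sum(abs(fn(-half_width + (k + 0.5) * h)) ** 2 for k in range(steps)) * h


t = 0.7  # hypothetical variance


def g(x: float) -> float:
    """Gaussian density g_t in dimension d = 1."""
    return math.exp(-x * x / (2 * t)) / math.sqrt(2 * math.pi * t)


def g_hat(u: float) -> float:
    """Closed-form Fourier transform of g_t, as given in the notes."""
    return math.exp(-u * u * t / 2)


lhs = l2_norm_sq(g_hat)              # ||g_hat||_2^2
rhs = 2 * math.pi * l2_norm_sq(g)    # (2 pi)^d ||g||_2^2 with d = 1
```

Both sides should agree with the exact value $\sqrt{\pi/t}$ up to quadrature error.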

What is this good for? It turns out that the Fourier transform gives us a bijection from $L^2$ to itself. While it is not true that the Fourier inversion formula holds for everything in $L^2$, it holds for enough of them that we can just approximate everything else by the ones that are nice. Then the above tells us that in fact this bijection is a norm-preserving automorphism.

Theorem. There exists a unique Hilbert space automorphism $F : L^2 \to L^2$ such that

\[ F([f]) = [(2\pi)^{-d/2} \hat{f}] \]

whenever $f \in L^1 \cap L^2$.

Here $[f]$ denotes the equivalence class of $f$ in $L^2$, and we say $F : L^2 \to L^2$ is a Hilbert space automorphism if it is a linear bijection that preserves the inner product.

Note that in general, there is no guarantee that $F$ sends a function to its Fourier transform. We know that only if it is a well-behaved function (i.e. in $L^1 \cap L^2$). However, the formal property of it being a bijection from $L^2$ to itself will be convenient for many things.

Proof. We define $F_0 : L^1 \cap L^2 \to L^2$ by

\[ F_0([f]) = [(2\pi)^{-d/2} \hat{f}]. \]

By the Plancherel identity, we know $F_0$ preserves the $L^2$ norm, i.e.

\[ \|F_0([f])\|_2 = \|[f]\|_2. \]

Also, we know that $L^1 \cap L^2$ is dense in $L^2$, since even the continuous functions with compact support are dense. So we know $F_0$ extends uniquely to an isometry $F : L^2 \to L^2$.

Since it preserves distance, it is in particular injective. So it remains to show that the map is surjective. By Fourier inversion, the subspace

\[ V = \{[f] \in L^2 : f, \hat{f} \in L^1\} \]

is sent to itself by the map $F$. Also if $f \in V$, then $F^4[f] = [f]$ (note that applying it twice does not suffice, because we actually have $F^2[f](x) = [f](-x)$). So $V$ is contained in the image of $F$, and also $V$ is dense in $L^2$, again because it contains all Gaussian convolutions (we have $\hat{f}_t = \hat{f} \hat{g}_t$, and $\hat{f}$ is bounded and $\hat{g}_t$ is decaying exponentially). Since $F$ is an isometry, its image is closed, so we know that $F$ is surjective.

5.5 Properties of characteristic functions

We are now going to state a bunch of theorems about characteristic functions.

Since the proofs are not examinable (but the statements are!), we are only going

to provide a rough proof sketch.

Theorem. The characteristic function $\varphi_X$ of a distribution $\mu_X$ of a random variable $X$ determines $\mu_X$. In other words, if $X$ and $\tilde{X}$ are random variables and $\varphi_X = \varphi_{\tilde{X}}$, then $\mu_X = \mu_{\tilde{X}}$.

Proof sketch. Use the Fourier inversion to show that $\varphi_X$ determines $\mu_X(g) = \mathbb{E}[g(X)]$ for any bounded, continuous $g$.

Theorem. If $\varphi_X$ is integrable, then $\mu_X$ has a bounded, continuous density function

\[ f_X(x) = (2\pi)^{-d} \int \varphi_X(u) e^{-i(u, x)}\, du. \]

Proof sketch. Let $Z \sim N(0, 1)$ be independent of $X$. Then $X + \sqrt{t} Z$ has a bounded continuous density function which, by Fourier inversion, is

\[ f_t(x) = (2\pi)^{-d} \int \varphi_X(u) e^{-|u|^2 t/2} e^{-i(u, x)}\, du. \]

Sending $t \to 0$ and using the dominated convergence theorem with dominating function $|\varphi_X|$ gives the result.

The next theorem relates to the notion of weak convergence.

Definition (Weak convergence of measures). Let $\mu, (\mu_n)$ be Borel probability measures. We say that $\mu_n \to \mu$ weakly if and only if $\mu_n(g) \to \mu(g)$ for all bounded continuous $g$.

Similarly, we can define weak convergence of random variables.

Definition (Weak convergence of random variables). Let $X, (X_n)$ be random variables. We say $X_n \to X$ weakly iff $\mu_{X_n} \to \mu_X$ weakly, iff $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$ for all bounded continuous $g$.

This is related to the notion of convergence in distribution, which we defined a long time ago without talking about it much. It is an exercise on the example sheet that weak convergence of random variables in $\mathbb{R}$ is equivalent to convergence in distribution.

It turns out that weak convergence is very useful theoretically. One reason is that it is related to convergence of characteristic functions.

Theorem. Let $X, (X_n)$ be random variables with values in $\mathbb{R}^d$. If $\varphi_{X_n}(u) \to \varphi_X(u)$ for each $u \in \mathbb{R}^d$, then $\mu_{X_n} \to \mu_X$ weakly.

The main application of this that will appear later is that this is the fact that allows us to prove the central limit theorem.

Proof sketch. By the example sheet, it suffices to show that $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$ for all compactly supported $g \in C^\infty$. We then use Fourier inversion and convergence of characteristic functions to check that

\[ \mathbb{E}[g(X_n + \sqrt{t} Z)] \to \mathbb{E}[g(X + \sqrt{t} Z)] \]

for all $t > 0$ for $Z \sim N(0, 1)$ independent of $X, (X_n)$. Then we check that $\mathbb{E}[g(X_n + \sqrt{t} Z)]$ is close to $\mathbb{E}[g(X_n)]$ for $t > 0$ small, and similarly for $X$.

5.6 Gaussian random variables

Recall that in the proof of the Fourier inversion theorem, we used these things

called Gaussians, but didn’t really say much about them. These will be useful

later on when we want to prove the central limit theorem, because the central

limit theorem says that in the long run, things look like Gaussians. So here we

lay out some of the basic definitions and properties of Gaussians.

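Definition (Gaussian random variable). Let $X$ be a random variable on $\mathbb{R}$. This is said to be Gaussian if there exists $\mu \in \mathbb{R}$ and $\sigma \in (0, \infty)$ such that the density of $X$ is

\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \]

A constant random variable $X = \mu$ corresponds to $\sigma = 0$. We say this has mean $\mu$ and variance $\sigma^2$.

When this happens, we write $X \sim N(\mu, \sigma^2)$.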

For completeness, we record some properties of Gaussian random variables.

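Proposition. Let $X \sim N(\mu, \sigma^2)$. Then

\[ \mathbb{E}[X] = \mu, \quad \operatorname{var}(X) = \sigma^2. \]

Also, for any $a, b \in \mathbb{R}$, we have

\[ aX + b \sim N(a\mu + b, a^2\sigma^2). \]

Lastly, we have

\[ \varphi_X(u) = e^{i\mu u - u^2\sigma^2/2}. \]

Proof. All but the last of them follow from direct calculation, and can be found in IA Probability.

For the last part, if $X \sim N(\mu, \sigma^2)$, then we can write

\[ X = \sigma Z + \mu, \]

where $Z \sim N(0, 1)$. Recall that we have previously found that the characteristic function of a $N(0, 1)$ random variable is

\[ \varphi_Z(u) = e^{-|u|^2/2}. \]

So we have

\[ \varphi_X(u) = \mathbb{E}[e^{iu(\sigma Z + \mu)}] = e^{iu\mu}\, \mathbb{E}[e^{iu\sigma Z}] = e^{iu\mu} \varphi_Z(u\sigma) = e^{iu\mu - u^2\sigma^2/2}. \]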

What we are next going to do is to talk about the corresponding facts for

the Gaussian in higher dimensions. Before that, we need to come up with the

definition of a higher-dimensional Gaussian distribution. This might be different

from the one you’ve seen previously, because we want to allow some degeneracy

in our random variable, e.g. some of the dimensions can be constant.

Definition (Gaussian random variable). Let $X$ be a random variable. We say that $X$ is a Gaussian on $\mathbb{R}^n$ if $(u, X)$ is Gaussian on $\mathbb{R}$ for all $u \in \mathbb{R}^n$.

We are now going to prove a version of our previous theorem for higher dimensional Gaussians.

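Theorem. Let $X$ be Gaussian on $\mathbb{R}^n$, and let $A$ be an $m \times n$ matrix and $b \in \mathbb{R}^m$. Then

(i) $AX + b$ is Gaussian on $\mathbb{R}^m$.

(ii) $X \in L^2$ and its law $\mu_X$ is determined by $\mu = \mathbb{E}[X]$ and $V = \operatorname{var}(X)$, the covariance matrix.

(iii) We have

\[ \varphi_X(u) = e^{i(u, \mu) - (u, Vu)/2}. \]

(iv) If $V$ is invertible, then $X$ has a density of

\[ f_X(x) = (2\pi)^{-n/2} (\det V)^{-1/2} \exp\left(-\frac{1}{2}(x - \mu, V^{-1}(x - \mu))\right). \]

(v) If $X = (X_1, X_2)$ where $X_i \in \mathbb{R}^{n_i}$, then $\operatorname{cov}(X_1, X_2) = 0$ iff $X_1$ and $X_2$ are independent.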

Proof.

(i) If $u \in \mathbb{R}^m$, then we have

\[ (AX + b, u) = (AX, u) + (b, u) = (X, A^T u) + (b, u). \]

Since $(X, A^T u)$ is Gaussian and $(b, u)$ is constant, it follows that $(AX + b, u)$ is Gaussian.

(ii) We know in particular that each component of $X$ is a Gaussian random variable, which are in $L^2$. So $X \in L^2$. We will prove the second part of (ii) with (iii).

(iii) If $\mu = \mathbb{E}[X]$ and $V = \operatorname{var}(X)$, then if $u \in \mathbb{R}^n$, then we have

\[ \mathbb{E}[(u, X)] = (u, \mu), \quad \operatorname{var}((u, X)) = (u, Vu). \]

So we know

\[ (u, X) \sim N((u, \mu), (u, Vu)). \]

So it follows that

\[ \varphi_X(u) = \mathbb{E}[e^{i(u, X)}] = e^{i(u, \mu) - (u, Vu)/2}. \]

So $\mu$ and $V$ determine the characteristic function of $X$, which in turn determines the law of $X$.

(iv) We start off with a boring Gaussian vector $Y = (Y_1, \cdots, Y_n)$, where the $Y_i \sim N(0, 1)$ are independent. Then the density of $Y$ is

\[ f_Y(y) = (2\pi)^{-n/2} e^{-|y|^2/2}. \]

We are now going to construct a copy $\tilde{X}$ of $X$ from $Y$. We define

\[ \tilde{X} = V^{1/2} Y + \mu. \]

This makes sense because $V$ is always non-negative definite. Then $\tilde{X}$ is Gaussian with $\mathbb{E}[\tilde{X}] = \mu$ and $\operatorname{var}(\tilde{X}) = V$. Therefore $X$ has the same distribution as $\tilde{X}$. Since $V$ is assumed to be invertible, we can compute the density of $\tilde{X}$ using the change of variables formula.

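(v) It is clear that if $X_1$ and $X_2$ are independent, then $\operatorname{cov}(X_1, X_2) = 0$.

Conversely, let $X = (X_1, X_2)$, where $\operatorname{cov}(X_1, X_2) = 0$. Then we have

\[ V = \operatorname{var}(X) = \begin{pmatrix} V_{11} & 0 \\ 0 & V_{22} \end{pmatrix}. \]

Then for $u = (u_1, u_2)$, we have

\[ (u, Vu) = (u_1, V_{11} u_1) + (u_2, V_{22} u_2), \]

where $V_{11} = \operatorname{var}(X_1)$ and $V_{22} = \operatorname{var}(X_2)$. Then, writing $\mu = (\mu_1, \mu_2)$, we have

\[ \varphi_X(u) = e^{i(u, \mu) - (u, Vu)/2} = e^{i(u_1, \mu_1) - (u_1, V_{11} u_1)/2}\, e^{i(u_2, \mu_2) - (u_2, V_{22} u_2)/2} = \varphi_{X_1}(u_1) \varphi_{X_2}(u_2). \]

So it follows that $X_1$ and $X_2$ are independent.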
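The construction used in part (iv), building a Gaussian vector with prescribed mean and covariance from independent standard normals, can be simulated. The sketch below works in two dimensions with hypothetical values of `mu` and `V`. It swaps the symmetric square root $V^{1/2}$ of the proof for a Cholesky factor $L$ with $LL^T = V$, which gives the same law since $LY + \mu$ is Gaussian with the same mean and covariance.

```python
import math
import random
from typing import List, Sequence, Tuple


def cholesky_2x2(V: Sequence[Sequence[float]]) -> List[List[float]]:
    """Lower-triangular L with L L^T = V, for a positive definite 2x2 matrix V."""
    l11 = math.sqrt(V[0][0])
    l21 = V[0][1] / l11
    l22 = math.sqrt(V[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]


def sample_gaussian(mu: Sequence[float], V: Sequence[Sequence[float]], n: int,
                    rng: random.Random) -> List[Tuple[float, float]]:
    """Draw n samples of X = L Y + mu with Y a pair of independent N(0, 1)."""
    L = cholesky_2x2(V)
    out = []
    for _ in range(n):
        y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((mu[0] + L[0][0] * y1, mu[1] + L[1][0] * y1 + L[1][1] * y2))
    return out


def empirical_mean_cov(samples: Sequence[Tuple[float, float]]):
    """Sample mean vector and sample covariance matrix."""
    n = len(samples)
    m = [sum(s[i] for s in samples) / n for i in range(2)]
    cov = [[sum((s[i] - m[i]) * (s[j] - m[j]) for s in samples) / n
            for j in range(2)] for i in range(2)]
    return m, cov


# Hypothetical parameters and a fixed seed for reproducibility.
mu = (1.0, -2.0)
V = [[2.0, 0.6], [0.6, 1.0]]
samples = sample_gaussian(mu, V, 20000, random.Random(0))
mean_hat, cov_hat = empirical_mean_cov(samples)
```

With this many samples, `mean_hat` and `cov_hat` should land close to `mu` and `V`, within the usual Monte Carlo error.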

6 Ergodic theory

We are now going to study a new topic, ergodic theory. This is the study of the “long run behaviour” of a system under the evolution of some map $\Theta$. Due to time constraints, we will not do much with it. We are going to prove two ergodic theorems that tell us what happens in the long run, and this will be useful when we prove our strong law of large numbers at the end of the course.

The general setting is that we have a measure space $(E, \mathcal{E}, \mu)$ and a measurable map $\Theta: E \to E$ that is measure preserving, i.e. $\mu(A) = \mu(\Theta^{-1}(A))$ for all $A \in \mathcal{E}$.

Example. Take $(E, \mathcal{E}, \mu) = ([0, 1), \mathcal{B}([0, 1)), \text{Lebesgue})$. For each $a \in [0, 1)$, we can define

\[ \Theta_a(x) = x + a \bmod 1. \]

By what we’ve done earlier in the course, we know this translation map preserves

the Lebesgue measure on [0, 1).

Our goal is to try to understand the “long run averages” of the system when

we apply Θ many times. One particular quantity we are going to look at is the

following:

Let $f$ be measurable. We define

\[ S_n(f) = f + f \circ \Theta + \cdots + f \circ \Theta^{n-1}. \]

We want to know the long run behaviour of $S_n(f)/n$ as $n \to \infty$.

The ergodic theorems are going to give us the answer in a certain special

case. Finally, we will apply this in a particular case to get the strong law of

large numbers.

Definition (Invariant subset). We say $A \in \mathcal{E}$ is invariant for $\Theta$ if $A = \Theta^{-1}(A)$.

Definition (Invariant function). A measurable function $f$ is invariant if $f = f \circ \Theta$.

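Definition ($\mathcal{E}_\Theta$). We write

\[ \mathcal{E}_\Theta = \{A \in \mathcal{E} : A \text{ is invariant}\}. \]

It is easy to show that $\mathcal{E}_\Theta$ is a $\sigma$-algebra, and $f: E \to \mathbb{R}$ is invariant iff it is $\mathcal{E}_\Theta$-measurable.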

Definition (Ergodic). We say $\Theta$ is ergodic if $A \in \mathcal{E}_\Theta$ implies $\mu(A) = 0$ or $\mu(A^C) = 0$.

Example. For the translation map on $[0, 1)$, we have $\Theta_a$ is ergodic iff $a$ is irrational. The proof is left for example sheet 4.
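To get a feel for this dichotomy, here is a hedged Python sketch that computes the averages $S_n(f)/n$ along orbits of the translation map in floating point. The choice $f = 1_{[0, 1/4)}$, the starting points and the number of steps are arbitrary illustrative assumptions, and a finite simulation is of course not a proof.

```python
import math

def birkhoff_average(f, theta, x, n):
    """Return S_n(f)(x)/n = (f(x) + f(theta(x)) + ... + f(theta^{n-1}(x))) / n."""
    total = 0.0
    for _ in range(n):
        total += f(x)
        x = theta(x)
    return total / n

def rotation(a):
    """Translation map x -> x + a mod 1 on [0, 1)."""
    return lambda x: (x + a) % 1.0

def indicator(x):
    """f = indicator of [0, 1/4); its Lebesgue integral is 1/4."""
    return 1.0 if x < 0.25 else 0.0

n = 20_000
# Irrational a: the average is close to the integral of f.
irrational_avg = birkhoff_average(indicator, rotation(math.sqrt(2) - 1), 0.1, n)
# Rational a = 1/2: orbits have period 2, and the average depends on the start.
rational_avg_from_01 = birkhoff_average(indicator, rotation(0.5), 0.1, n)
rational_avg_from_03 = birkhoff_average(indicator, rotation(0.5), 0.3, n)
```

For $a = 1/2$ the orbit of $0.1$ alternates between $0.1$ and $0.6$, so the average is $1/2$, while the orbit of $0.3$ never enters $[0, 1/4)$ and the average is $0$. The limit is therefore a non-constant invariant function, which cannot happen for an ergodic map.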

Proposition. Suppose $f$ is integrable and $\Theta$ is measure-preserving. Then $f \circ \Theta$ is integrable and

\[ \int_E f \circ \Theta \, d\mu = \int_E f \, d\mu. \]

It turns out that if Θ is ergodic, then there aren’t that many invariant

functions.

Proposition. If Θ is ergodic and f is invariant, then there exists a constant c

such that f = c a.e.

The proofs of these are left as an exercise on example sheet 4.

We are now going to spend a little bit of time studying a particular example,

because this will be needed to prove the strong law of large numbers.

Example (Bernoulli shifts). Let $m$ be a probability distribution on $\mathbb{R}$. Then there exists an iid sequence $Y_1, Y_2, \cdots$ with law $m$. Recall we constructed this in a really funny way. Now we are going to build it in a more natural way.

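We let $E = \mathbb{R}^{\mathbb{N}}$ be the set of all real sequences $(x_n)$. We define the $\sigma$-algebra $\mathcal{E}$ to be the $\sigma$-algebra generated by the projections $X_n(x) = x_n$. In other words, this is the smallest $\sigma$-algebra such that all these functions are measurable. Alternatively, this is the $\sigma$-algebra generated by the $\pi$-system

\[ \mathcal{A} = \left\{ \prod_{n \in \mathbb{N}} A_n : A_n \in \mathcal{B} \text{ for all } n \text{ and } A_n = \mathbb{R} \text{ eventually} \right\}. \]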

Finally, to define the measure $\mu$, we let

\[ Y = (Y_1, Y_2, \cdots): \Omega \to E, \]

where the $Y_i$ are the iid random variables defined earlier, and $\Omega$ is the sample space of the $Y_i$. Then $Y$ is a measurable map because each of the $Y_i$'s is a random variable. We let $\mu = \mathbb{P} \circ Y^{-1}$.

By the independence of the $Y_i$'s, we have that

\[ \mu(A) = \prod_{n \in \mathbb{N}} m(A_n) \]

for any

\[ A = A_1 \times A_2 \times \cdots \times A_N \times \mathbb{R} \times \cdots \times \mathbb{R}. \]

Note that the factors $m(A_n)$ are eventually 1, so it is really a finite product.

This $(E, \mathcal{E}, \mu)$ is known as the canonical space associated with the sequence of iid random variables with law $m$.

Finally, we need to define $\Theta$. We define $\Theta: E \to E$ by

\[ \Theta(x) = \Theta(x_1, x_2, x_3, \cdots) = (x_2, x_3, x_4, \cdots). \]

This is known as the shift map.

Why do we care about this? Later, we are going to look at the function

\[ f(x) = f(x_1, x_2, \cdots) = x_1. \]

Then we have

\[ S_n(f) = f + f \circ \Theta + \cdots + f \circ \Theta^{n-1} = x_1 + \cdots + x_n. \]

So $S_n(f)/n$ will be the average of the first $n$ things. So ergodic theory will tell us about the long-run behaviour of the average.
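The identity $S_n(f) = x_1 + \cdots + x_n$ can be seen concretely in a few lines of Python. This is only a sketch: an infinite sequence is replaced by a finite tuple (an assumption that is harmless as long as $n$ does not exceed its length), and the names `shift`, `first_coordinate` and `ergodic_sum` are made up for the illustration.

```python
def shift(x):
    """Shift map Theta: (x_1, x_2, x_3, ...) -> (x_2, x_3, ...), on a finite prefix."""
    return x[1:]

def first_coordinate(x):
    """The function f(x) = x_1."""
    return x[0]

def ergodic_sum(f, theta, x, n):
    """Return S_n(f)(x) = f(x) + f(theta(x)) + ... + f(theta^{n-1}(x))."""
    total = 0
    for _ in range(n):
        total += f(x)
        x = theta(x)
    return total

x = (3, 1, 4, 1, 5, 9, 2, 6)
s5 = ergodic_sum(first_coordinate, shift, x, 5)  # equals x_1 + ... + x_5
```

Each application of `shift` moves the next coordinate into first position, so the loop simply reads off $x_1, x_2, \ldots, x_n$ in turn.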

Theorem. The shift map Θ is an ergodic, measure preserving transformation.

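Proof. It is an exercise to show that $\Theta$ is measurable and measure preserving.

To show that $\Theta$ is ergodic, recall the definition of the tail $\sigma$-algebra

\[ \mathcal{T}_n = \sigma(X_m : m \geq n + 1), \quad \mathcal{T} = \bigcap_{n} \mathcal{T}_n. \]

Suppose that $A = \prod_{n \in \mathbb{N}} A_n \in \mathcal{A}$. Then

\[ \Theta^{-n}(A) = \{X_{n+k} \in A_k \text{ for all } k\} \in \mathcal{T}_n. \]

Since $\mathcal{T}_n$ is a $\sigma$-algebra, the collection of sets $A \in \mathcal{E}$ with $\Theta^{-n}(A) \in \mathcal{T}_n$ is itself a $\sigma$-algebra. It contains $\mathcal{A}$, and $\sigma(\mathcal{A}) = \mathcal{E}$, so we know $\Theta^{-n}(A) \in \mathcal{T}_n$ for all $A \in \mathcal{E}$.

So if $A \in \mathcal{E}_\Theta$, i.e. $A = \Theta^{-1}(A)$, then $A = \Theta^{-n}(A) \in \mathcal{T}_n$ for all $n$. So $A \in \mathcal{T}$.

From the Kolmogorov 0-1 law, we know either $\mu[A] = 1$ or $\mu[A] = 0$. So done.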

6.1 Ergodic theorems

The proofs in this section are non-examinable.

Instead of proving the ergodic theorems directly, we first start by proving

the following magical lemma:

Lemma (Maximal ergodic lemma). Let $f$ be integrable, and

\[ S^* = S^*(f) = \sup_{n \geq 0} S_n(f) \geq 0, \]

where $S_0(f) = 0$ by convention. Then

\[ \int_{\{S^* > 0\}} f \, d\mu \geq 0. \]

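Proof. We let

\[ S_n^* = \max_{0 \leq m \leq n} S_m \]

and

\[ A_n = \{S_n^* > 0\}. \]

Now if $1 \leq m \leq n$, then we know

\[ S_m = f + S_{m-1} \circ \Theta \leq f + S_n^* \circ \Theta. \]

Now on $A_n$, we have

\[ S_n^* = \max_{1 \leq m \leq n} S_m, \]

since $S_0 = 0$. So we have

\[ S_n^* \leq f + S_n^* \circ \Theta. \]

On $A_n^C$, we have

\[ S_n^* = 0 \leq S_n^* \circ \Theta. \]

So we know

\begin{align*}
\int_E S_n^* \, d\mu &= \int_{A_n} S_n^* \, d\mu + \int_{A_n^C} S_n^* \, d\mu \\
&\leq \int_{A_n} f \, d\mu + \int_{A_n} S_n^* \circ \Theta \, d\mu + \int_{A_n^C} S_n^* \circ \Theta \, d\mu \\
&= \int_{A_n} f \, d\mu + \int_E S_n^* \circ \Theta \, d\mu \\
&= \int_{A_n} f \, d\mu + \int_E S_n^* \, d\mu.
\end{align*}

Since $S_n^*$ is integrable, we can cancel $\int_E S_n^* \, d\mu$ from both sides. So we know

\[ \int_{A_n} f \, d\mu \geq 0. \]

Since $A_n \nearrow \{S^* > 0\}$, taking the limit as $n \to \infty$ gives the result by dominated convergence with dominating function $|f|$.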
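The finite-$n$ inequality $\int_{A_n} f \, d\mu \geq 0$ from the proof can be checked exactly on a toy system. The sketch below assumes $E = \{0, \ldots, N-1\}$ with counting measure and $\Theta(x) = x + 1 \bmod N$, which is measure preserving; the values of `f_values` are arbitrary and the function name is hypothetical.

```python
def integral_over_A_n(f_values, n):
    """On E = {0, ..., N-1} with counting measure and Theta(x) = x + 1 mod N,
    return the integral of f over A_n = {S_n^* > 0},
    where S_n^* = max_{0 <= m <= n} S_m(f) and S_0 = 0."""
    N = len(f_values)
    integral = 0
    for x in range(N):
        partial, best = 0, 0  # S_0(x) = 0 by convention
        for m in range(n):
            # S_{m+1}(x) = S_m(x) + f(Theta^m(x))
            partial += f_values[(x + m) % N]
            best = max(best, partial)
        if best > 0:  # x lies in A_n
            integral += f_values[x]
    return integral

f_values = [2, -5, 1, 1, -3, 4, -2, -1]
integrals = [integral_over_A_n(f_values, n) for n in range(1, 20)]
```

Every entry of `integrals` is non-negative, even though `f_values` itself sums to a negative number: the lemma only integrates $f$ over the points from which some partial sum becomes positive.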

We are now going to prove the two ergodic theorems, which tell us the limiting behaviour of $S_n(f)/n$.

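Theorem (Birkhoff's ergodic theorem). Let $(E, \mathcal{E}, \mu)$ be $\sigma$-finite and $f$ be integrable. There exists an invariant function $\bar{f}$ such that

\[ \mu(|\bar{f}|) \leq \mu(|f|), \]

and

\[ \frac{S_n(f)}{n} \to \bar{f} \text{ a.e.} \]

If $\Theta$ is ergodic, then $\bar{f}$ is a constant.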

Note that the theorem only gives $\mu(|\bar{f}|) \leq \mu(|f|)$. However, in many cases, we can use some integration theorems such as dominated convergence to argue that they must in fact be equal. In particular, in the ergodic case, this will allow us to find the value of $\bar{f}$.

Theorem (von Neumann's ergodic theorem). Let $(E, \mathcal{E}, \mu)$ be a finite measure space. Let $p \in [1, \infty)$ and assume that $f \in L^p$. Then there is some function $\bar{f} \in L^p$ such that

\[ \frac{S_n(f)}{n} \to \bar{f} \text{ in } L^p. \]

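Proof of Birkhoff's ergodic theorem. We first note that

\[ \limsup_{n} \frac{S_n}{n}, \quad \liminf_{n} \frac{S_n}{n} \]

are invariant functions. Indeed, we know

\[ S_n \circ \Theta = f \circ \Theta + f \circ \Theta^2 + \cdots + f \circ \Theta^n = S_{n+1} - f. \]

So we have

\[ \limsup_{n \to \infty} \frac{S_n \circ \Theta}{n} = \limsup_{n \to \infty} \left( \frac{n+1}{n} \cdot \frac{S_{n+1}}{n+1} - \frac{f}{n} \right) = \limsup_{n \to \infty} \frac{S_n}{n}. \]

Exactly the same reasoning tells us the lim inf is also invariant.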

What we now need to show is that the set of points on which the lim sup and lim inf do not agree has measure zero. We set $a < b$. We let

\[ D = D(a, b) = \left\{ x \in E : \liminf_{n \to \infty} \frac{S_n(x)}{n} < a < b < \limsup_{n \to \infty} \frac{S_n(x)}{n} \right\}. \]

Now if $\limsup S_n(x)/n \neq \liminf S_n(x)/n$, then there are some $a, b \in \mathbb{Q}$ such that $x \in D(a, b)$. So by countable subadditivity, it suffices to show that $\mu(D(a, b)) = 0$ for all $a, b$.

We now fix $a, b$, and just write $D$. Since $\limsup S_n/n$ and $\liminf S_n/n$ are both invariant, we have that $D$ is invariant. By restricting to $D$, we can assume that $D = E$.

Suppose that $B \in \mathcal{E}$ and $\mu(B) < \infty$. We let

\[ g = f - b 1_B. \]

Then $g$ is integrable because $f$ is integrable and $\mu(B) < \infty$. Moreover, we have

\[ S_n(g) = S_n(f - b 1_B) \geq S_n(f) - nb. \]

Since we know that $\limsup S_n(f)/n > b$ by definition, we can find an $n$ such that $S_n(g) > 0$. So we know that

\[ S^*(g)(x) = \sup_n S_n(g)(x) > 0 \]

for all $x \in D$. By the maximal ergodic lemma, we know

\[ 0 \leq \int_D g \, d\mu = \int_D (f - b 1_B) \, d\mu = \int_D f \, d\mu - b\mu(B). \]

If we rearrange this, we know

\[ b\mu(B) \leq \int_D f \, d\mu \]

for all measurable sets $B \in \mathcal{E}$ with finite measure. Since our space is $\sigma$-finite, we can find $B_n \nearrow D$ such that $\mu(B_n) < \infty$ for all $n$. So taking the limit above tells us

\[ b\mu(D) \leq \int_D f \, d\mu. \quad (\dagger) \]

Now we can apply the same argument with $(-a)$ in place of $b$ and $(-f)$ in place of $f$ to get

\[ (-a)\mu(D) \leq -\int_D f \, d\mu. \quad (\ddagger) \]

Now note that since $b > a$, we know that at least one of $b > 0$ and $a < 0$ has to be true. In the first case, $(\dagger)$ tells us that $\mu(D)$ is finite, since $f$ is integrable. Then combining with $(\ddagger)$, we see that

\[ b\mu(D) \leq \int_D f \, d\mu \leq a\mu(D). \]

But $a < b$. So we must have $\mu(D) = 0$. The second case follows similarly (or follows immediately by flipping the sign of $f$).

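We are almost done. We can now define

\[ \bar{f}(x) = \begin{cases} \lim_{n \to \infty} S_n(f)(x)/n & \text{if the limit exists} \\ 0 & \text{otherwise.} \end{cases} \]

Then by the above calculations, we have

\[ \frac{S_n(f)}{n} \to \bar{f} \text{ a.e.} \]

Also, we know $\bar{f}$ is invariant, because $\lim S_n(f)/n$ is invariant, and so is the set where the limit exists.

Finally, we need to show that

\[ \mu(|\bar{f}|) \leq \mu(|f|). \]

This is since

\[ \mu(|f \circ \Theta^n|) = \mu(|f|), \]

as $\Theta^n$ preserves the measure. So we have that

\[ \mu(|S_n|) \leq n\mu(|f|) < \infty. \]

So by Fatou's lemma, we have

\[ \mu(|\bar{f}|) \leq \mu\left( \liminf_{n} \left| \frac{S_n}{n} \right| \right) \leq \liminf_{n} \mu\left( \left| \frac{S_n}{n} \right| \right) \leq \mu(|f|). \]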

The proof of the von Neumann ergodic theorem follows easily from Birkhoff’s

ergodic theorem.

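Proof of von Neumann ergodic theorem. It is an exercise on the example sheet to show that

\[ \|f \circ \Theta\|_p^p = \int |f \circ \Theta|^p \, d\mu = \int |f|^p \, d\mu = \|f\|_p^p. \]

So we have

\[ \left\| \frac{S_n}{n} \right\|_p = \frac{1}{n} \|f + f \circ \Theta + \cdots + f \circ \Theta^{n-1}\|_p \leq \|f\|_p \]

by Minkowski's inequality.

So let $\varepsilon > 0$, and take $M \in (0, \infty)$ so that if

\[ g = (f \vee (-M)) \wedge M, \]

then

\[ \|f - g\|_p < \frac{\varepsilon}{3}. \]

By Birkhoff's theorem, we know

\[ \frac{S_n(g)}{n} \to \bar{g} \text{ a.e.} \]

Also, we know

\[ \left| \frac{S_n(g)}{n} \right| \leq M \]

for all $n$. So by the bounded convergence theorem, we know

\[ \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p \to 0 \]

as $n \to \infty$. So we can find $N$ such that $n \geq N$ implies

\[ \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p < \frac{\varepsilon}{3}. \]

Then we have

\begin{align*}
\|\bar{f} - \bar{g}\|_p^p &= \int \liminf_{n} \left| \frac{S_n(f - g)}{n} \right|^p d\mu \\
&\leq \liminf_{n} \int \left| \frac{S_n(f - g)}{n} \right|^p d\mu \\
&\leq \|f - g\|_p^p.
\end{align*}

So if $n \geq N$, then we know

\[ \left\| \frac{S_n(f)}{n} - \bar{f} \right\|_p \leq \left\| \frac{S_n(f - g)}{n} \right\|_p + \left\| \frac{S_n(g)}{n} - \bar{g} \right\|_p + \|\bar{g} - \bar{f}\|_p \leq \varepsilon. \]

So done.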

7 Big theorems

We are now going to use all the tools we have previously developed to prove

two of the most important theorems about the sums of independent random

variables, namely the strong law of large numbers and the central limit theorem.

7.1 The strong law of large numbers

Before we start proving the strong law of large numbers, we first spend some time discussing the difference between the strong law and the weak law. In both cases, we have a sequence $(X_n)$ of iid random variables with $\mathbb{E}[X_i] = \mu$. We let

\[ S_n = X_1 + \cdots + X_n. \]

– The weak law of large numbers says $S_n/n \to \mu$ in probability as $n \to \infty$ provided $\mathbb{E}[X_1^2] < \infty$.

– The strong law of large numbers says $S_n/n \to \mu$ a.s. provided $\mathbb{E}|X_1| < \infty$.

So we see that the strong law is indeed stronger, because convergence almost

everywhere implies convergence in measure.

We are actually going to do two versions of the strong law with different hypotheses.

Theorem (Strong law of large numbers assuming finite fourth moments). Let $(X_n)$ be a sequence of independent random variables such that there exist $\mu \in \mathbb{R}$ and $M > 0$ such that

\[ \mathbb{E}[X_n] = \mu, \quad \mathbb{E}[X_n^4] \leq M \]

for all $n$. With $S_n = X_1 + \cdots + X_n$, we have that

\[ \frac{S_n}{n} \to \mu \text{ a.s. as } n \to \infty. \]

Note that in this version, we do not require that the $X_n$ are iid. We simply need that they are independent and have the same mean.

The proof is completely elementary.

Proof. We reduce to the case that $\mu = 0$ by setting

\[ Y_n = X_n - \mu. \]

We then have

\[ \mathbb{E}[Y_n] = 0, \quad \mathbb{E}[Y_n^4] \leq 2^4 (\mathbb{E}[\mu^4 + X_n^4]) \leq 2^4 (\mu^4 + M). \]

So it suffices to show that the theorem holds with $Y_n$ in place of $X_n$. So we can assume that $\mu = 0$.

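By independence, we know that for $i \neq j$, we have

\[ \mathbb{E}[X_i X_j^3] = \mathbb{E}[X_i] \mathbb{E}[X_j^3] = 0. \]

Similarly, for all $i, j, k, \ell$ distinct, we have

\[ \mathbb{E}[X_i X_j X_k^2] = \mathbb{E}[X_i X_j X_k X_\ell] = 0. \]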

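Hence we know that

\[ \mathbb{E}[S_n^4] = \mathbb{E}\left[ \sum_{k=1}^n X_k^4 + 6 \sum_{1 \leq i < j \leq n} X_i^2 X_j^2 \right]. \]

We know the first term is bounded by $nM$, and we also know that for $i \neq j$, we have

\[ \mathbb{E}[X_i^2 X_j^2] = \mathbb{E}[X_i^2] \mathbb{E}[X_j^2] \leq \sqrt{\mathbb{E}[X_i^4] \mathbb{E}[X_j^4]} \leq M \]

by Jensen's inequality. So we know

\[ \mathbb{E}\left[ 6 \sum_{1 \leq i < j \leq n} X_i^2 X_j^2 \right] \leq 3n(n - 1)M. \]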

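Putting everything together, we have

\[ \mathbb{E}[S_n^4] \leq nM + 3n(n - 1)M \leq 3n^2 M. \]

So we know

\[ \mathbb{E}\left[ (S_n/n)^4 \right] \leq \frac{3M}{n^2}. \]

So we know

\[ \mathbb{E}\left[ \sum_{n=1}^\infty \left( \frac{S_n}{n} \right)^4 \right] \leq \sum_{n=1}^\infty \frac{3M}{n^2} < \infty. \]

So we know that

\[ \sum_{n=1}^\infty \left( \frac{S_n}{n} \right)^4 < \infty \text{ a.s.} \]

So we know that $(S_n/n)^4 \to 0$ a.s., i.e. $S_n/n \to 0$ a.s.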
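The key moment bound $\mathbb{E}[S_n^4] \leq nM + 3n(n-1)M \leq 3n^2 M$ can be verified exactly in a simple case. The sketch below assumes the $X_i$ are independent signs taking the values $\pm 1$ with probability $1/2$ each, so $\mu = 0$ and $M = 1$; this choice of distribution is an illustrative assumption, and the enumeration over all $2^n$ outcomes is only feasible for small $n$.

```python
from itertools import product

def fourth_moment_of_sum(n):
    """Exact E[S_n^4] for S_n = X_1 + ... + X_n, with X_i = +1 or -1 equally likely,
    computed by averaging over all 2^n equally likely sign patterns."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return total / 2 ** n

# With M = 1 and E[X_i^2 X_j^2] = 1, the expansion in the proof gives
# E[S_n^4] = n + 3n(n - 1) exactly, which is at most 3n^2.
moments = {n: fourth_moment_of_sum(n) for n in range(1, 11)}
```

In this example the first inequality in the bound is an equality, since $\mathbb{E}[X_k^4] = \mathbb{E}[X_i^2 X_j^2] = 1$ for every choice of indices.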

We are now going to get rid of the assumption that we have finite fourth

moments, but we’ll need to work with iid random variables.

Theorem (Strong law of large numbers). Let $(Y_n)$ be an iid sequence of integrable random variables with mean $\nu$. With $S_n = Y_1 + \cdots + Y_n$, we have

\[ \frac{S_n}{n} \to \nu \text{ a.s.} \]

We will use the ergodic theorem to prove this. This is not the “usual” proof

of the strong law, but since we’ve done all that work on ergodic theory, we might

as well use it to get a clean proof. Most of the work left is setting up the right

setting for the proof.

Proof. Let $m$ be the law of $Y_1$ and let $Y = (Y_1, Y_2, Y_3, \cdots)$. We can view $Y$ as a function

\[ Y: \Omega \to \mathbb{R}^{\mathbb{N}} = E. \]

We let $(E, \mathcal{E}, \mu)$ be the canonical space associated with the distribution $m$, so that

\[ \mu = \mathbb{P} \circ Y^{-1}. \]

We let $f: E \to \mathbb{R}$ be given by

\[ f(x_1, x_2, \cdots) = X_1(x_1, x_2, \cdots) = x_1. \]

Then $X_1$ has law given by $m$, and in particular is integrable. Also, the shift map $\Theta: E \to E$ given by

\[ \Theta(x_1, x_2, \cdots) = (x_2, x_3, \cdots) \]

is measure-preserving and ergodic. Thus, with

\[ S_n(f) = f + f \circ \Theta + \cdots + f \circ \Theta^{n-1} = X_1 + \cdots + X_n, \]

we have that

\[ \frac{S_n(f)}{n} \to \bar{f} \text{ a.e.} \]

by Birkhoff's ergodic theorem. We also have convergence in $L^1$ by von Neumann's ergodic theorem.

Here $\bar{f}$ is $\mathcal{E}_\Theta$-measurable, and $\Theta$ is ergodic, so we know that $\bar{f} = c$ a.e. for some constant $c$. Moreover, we have

\[ c = \mu(\bar{f}) = \lim_{n \to \infty} \mu(S_n(f)/n) = \nu. \]

So done.
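A quick simulation makes the statement tangible, although it proves nothing. The sketch below assumes the $Y_i$ are exponentially distributed with mean $\nu = 2$; the distribution, the sample size and the seed are arbitrary illustrative choices, and `sample_mean` is a hypothetical helper name.

```python
import random

def sample_mean(draw, n, seed=0):
    """Return S_n / n for n iid draws, using a seeded generator for reproducibility."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += draw(rng)
    return total / n

nu = 2.0
# expovariate(lambd) has mean 1 / lambd, so these draws are integrable with mean nu.
mean_estimate = sample_mean(lambda rng: rng.expovariate(1 / nu), 200_000)
```

With this many samples, `mean_estimate` should lie close to $\nu$, in line with $S_n/n \to \nu$ a.s.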

7.2 Central limit theorem

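Theorem. Let $(X_n)$ be a sequence of iid random variables with $\mathbb{E}[X_1] = 0$ and $\mathbb{E}[X_1^2] = 1$. Then if we set

\[ S_n = X_1 + \cdots + X_n, \]

then for all $x \in \mathbb{R}$, we have

\[ \mathbb{P}\left[ \frac{S_n}{\sqrt{n}} \leq x \right] \to \int_{-\infty}^x \frac{e^{-y^2/2}}{\sqrt{2\pi}} \, dy = \mathbb{P}[N(0, 1) \leq x] \]

as $n \to \infty$.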

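Proof. Let $\varphi(u) = \mathbb{E}[e^{iuX_1}]$. Since $\mathbb{E}[X_1^2] = 1 < \infty$, we can differentiate under the expectation twice to obtain

\[ \varphi(u) = \mathbb{E}[e^{iuX_1}], \quad \varphi'(u) = \mathbb{E}[iX_1 e^{iuX_1}], \quad \varphi''(u) = \mathbb{E}[-X_1^2 e^{iuX_1}]. \]

Evaluating at 0, we have

\[ \varphi(0) = 1, \quad \varphi'(0) = 0, \quad \varphi''(0) = -1. \]

So if we Taylor expand $\varphi$ at 0, we have

\[ \varphi(u) = 1 - \frac{u^2}{2} + o(u^2). \]

We consider the characteristic function of $S_n/\sqrt{n}$:

\begin{align*}
\varphi_n(u) &= \mathbb{E}[e^{iuS_n/\sqrt{n}}] \\
&= \prod_{i=1}^n \mathbb{E}[e^{iuX_i/\sqrt{n}}] \\
&= \varphi(u/\sqrt{n})^n \\
&= \left( 1 - \frac{u^2}{2n} + o\left( \frac{u^2}{n} \right) \right)^n.
\end{align*}

We now take the logarithm to obtain

\[ \log \varphi_n(u) = n \log\left( 1 - \frac{u^2}{2n} + o\left( \frac{u^2}{n} \right) \right) = -\frac{u^2}{2} + o(1) \to -\frac{u^2}{2}. \]

So we know that

\[ \varphi_n(u) \to e^{-u^2/2}, \]

which is the characteristic function of a $N(0, 1)$ random variable.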

So we have convergence in characteristic function, hence weak convergence,

hence convergence in distribution.
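The convergence $\varphi(u/\sqrt{n})^n \to e^{-u^2/2}$ can be watched numerically in a case where $\varphi$ is explicit. The sketch below assumes $X_1 = \pm 1$ with probability $1/2$ each, so that $\mathbb{E}[X_1] = 0$, $\mathbb{E}[X_1^2] = 1$ and $\varphi(u) = \cos u$; the choice of distribution and of the evaluation point $u = 1.5$ is purely illustrative.

```python
import math

def phi_n(u, n):
    """Characteristic function of S_n / sqrt(n) when X_i = +1 or -1 equally likely.
    Here phi(u) = cos(u), so phi_n(u) = cos(u / sqrt(n)) ** n."""
    return math.cos(u / math.sqrt(n)) ** n

u = 1.5
limit = math.exp(-u ** 2 / 2)  # characteristic function of N(0, 1) at u
# Absolute error |phi_n(u) - exp(-u^2 / 2)| for increasing n.
errors = {n: abs(phi_n(u, n) - limit) for n in (10, 1_000, 1_000_000)}
```

The errors shrink as $n$ grows, which is the numerical counterpart of the $o(1)$ term in the logarithm vanishing.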