Part II — Probability and Measure
Based on lectures by J. Miller
Notes taken by Dexter Chua
Michaelmas 2016
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
Analysis II is essential
Measure spaces,
σ
-algebras,
π
-systems and uniqueness of extension, statement *and
proof* of Carath´eodory’s extension theorem. Construction of Lebesgue measure on
R
.
The Borel
σ
-algebra of
R
. Existence of non-measurable subsets of
R
. Lebesgue-Stieltjes
measures and probability distribution functions. Independence of events, independence
of σ-algebras. The Borel–Cantelli lemmas. Kolmogorov’s zero-one law. [6]
Measurable functions, random variables, independence of random variables. Construc-
tion of the integral, expectation. Convergence in measure and convergence almost
everywhere. Fatou’s lemma, monotone and dominated convergence, differentiation
under the integral sign. Discussion of product measure and statement of Fubini’s
theorem. [6]
Chebyshev’s inequality, tail estimates. Jensen’s inequality. Completeness of
L
p
for
1 ≤ p ≤ ∞. The H¨older and Minkowski inequalities, uniform integrability. [4]
L
2
as a Hilbert space. Orthogonal projection, relation with elementary conditional
probability. Variance and covariance. Gaussian random variables, the multivariate
normal distribution. [2]
The strong law of large numbers, proof for independent random variables with bounded
fourth moments. Measure preserving transformations, Bernoulli shifts. Statements
*and proofs* of maximal ergodic theorem and Birkhoff’s almost everywhere ergodic
theorem, proof of the strong law. [4]
The Fourier transform of a finite measure, characteristic functions, uniqueness and
inversion. Weak convergence, statement of L´evy’s convergence theorem for characteristic
functions. The central limit theorem. [2]
Contents
0 Introduction
1 Measures
1.1 Measures
1.2 Probability measures
2 Measurable functions and random variables
2.1 Measurable functions
2.2 Constructing new measures
2.3 Random variables
2.4 Convergence of measurable functions
2.5 Tail events
3 Integration
3.1 Definition and basic properties
3.2 Integrals and limits
3.3 New measures from old
3.4 Integration and differentiation
3.5 Product measures and Fubini’s theorem
4 Inequalities and L
p
spaces
4.1 Four inequalities
4.2 L
p
spaces
4.3 Orthogonal projection in L
2
4.4 Convergence in L
1
(P) and uniform integrability
5 Fourier transform
5.1 The Fourier transform
5.2 Convolutions
5.3 Fourier inversion formula
5.4 Fourier transform in L
2
5.5 Properties of characteristic functions
5.6 Gaussian random variables
6 Ergodic theory
6.1 Ergodic theorems
7 Big theorems
7.1 The strong law of large numbers
7.2 Central limit theorem
0 Introduction
In measure theory, the main idea is that we want to assign “sizes” to different
sets. For example, we might think [0
,
2]
⊆ R
has size 2, while perhaps
Q ⊆ R
has
size 0. This is known as a measure. One of the main applications of a measure
is that we can use it to come up with a new definition of an integral. The idea
is very simple, but it is going to be very powerful mathematically.
Recall that if
f
: [0
,
1]
→ R
is continuous, then the Riemann integral of
f
is
defined as follows:
(i) Take a partition 0 = t
0
< t
1
< ··· < t
n
= 1 of [0, 1].
(ii) Consider the Riemann sum
n
X
j=1
f(t
j
)(t
j
− t
j−1
)
(iii) The Riemann integral is
Z
f = Limit of Riemann sums as the mesh size of the partition → 0.
x
y
0
t
1
t
2
t
3
···
t
k
t
k+1
···
···
1
The idea of measure theory is to use a different approximation scheme. Instead
of partitioning the domain, we partition the range of the function. We fix some
numbers r
0
< r
1
< r
2
< ··· < r
n
.
We then approximate the integral of f by
n
X
j=1
r
j
· (“size of f
−1
([r
j−1
, r
j
])”).
We then define the integral as the limit of approximations of this type as the
mesh size of the partition → 0.
x
y
We can make an analogy with bankers — If a Riemann banker is given a stack
of money, they would just add the values of the money in order. A measure-
theoretic banker will sort the bank notes according to the type, and then find
the total value by multiplying the number of each type by the value, and adding
up.
Why would we want to do so? It turns out this leads to a much more
general theory of integration on much more general spaces. Instead of integrating
functions [
a, b
]
→ R
only, we can replace the domain with any measure space.
Even in the context of
R
, this theory of integration is much much more powerful
than the Riemann sum, and can integrate a much wider class of functions. While
you probably don’t care about those pathological functions anyway, being able
to integrate more things means that we can state more general theorems about
integration without having to put in funny conditions.
That was all about measures. What about probability? It turns out the
concepts we develop for measures correspond exactly to many familiar notions
from probability if we restrict it to the particular case where the total measure
of the space is 1. Thus, when studying measure theory, we are also secretly
studying probability!
1 Measures
In the course, we will write
f
n
% f
for “
f
n
converges to
f
monotonically
increasingly”, and
f
n
& f
similarly. Unless otherwise specified, convergence is
taken to be pointwise.
1.1 Measures
The starting point of all these is to come up with a function that determines
the “size” of a given set, known as a measure. It turns out we cannot sensibly
define a size for all subsets of [0
,
1]. Thus, we need to restrict our attention to a
collection of “nice” subsets. Specifying which subsets are “nice” would involve
specifying a σ-algebra.
This section is mostly technical.
Definition
(
σ
-algebra)
.
Let
E
be a set. A
σ
-algebra
E
on
E
is a collection of
subsets of E such that
(i) ∅ ∈ E.
(ii) A ∈ E implies that A
C
= X \ A ∈ E.
(iii) For any sequence (A
n
) in E, we have that
[
n
A
n
∈ E.
The pair (E, E) is called a measurable space.
Note that the axioms imply that
σ
-algebras are also closed under countable
intersections, as we have A ∩ B = (A
C
∪ B
C
)
C
.
Definition
(Measure)
.
A measure on a measurable space (
E, E
) is a function
µ : E → [0, ∞] such that
(i) µ(∅) = 0
(ii) Countable additivity: For any disjoint sequence (A
n
) in E, then
µ
[
n
A
n
!
=
∞
X
n=1
µ(A
n
).
Example.
Let
E
be any countable set, and
E
=
P
(
E
) be the set of all subsets
of
E
. A mass function is any function
m
:
E →
[0
, ∞
]. We can then define a
measure by setting
µ(A) =
X
x∈A
m(x).
In particular, if we put
m
(
x
) = 1 for all
x ∈ E
, then we obtain the counting
measure.
Countable spaces are nice, because we can always take
E
=
P
(
E
), and the
measure can be defined on all possible subsets. However, for “bigger” spaces, we
have to be more careful. The set of all subsets is often “too large”. We will see
a concrete and also important example of this later.
In general,
σ
-algebras are often described on large spaces in terms of a smaller
set, known as the generating sets.
Definition
(Generator of
σ
-algebra)
.
Let
E
be a set, and that
A ⊆ P
(
E
) be a
collection of subsets of E. We define
σ(A) = {A ⊆ E : A ∈ E for all σ-algebras E that contain A}.
In other words
σ
(
A
) is the smallest sigma algebra that contains
A
. This is
known as the sigma algebra generated by A.
Example.
Take
E
=
Z
, and
A
=
{{x}
:
x ∈ Z}
. Then
σ
(
A
) is just
P
(
E
), since
every subset of E can be written as a countable union of singletons.
Example.
Take
E
=
Z
, and let
A
=
{{x, x
+ 1
, x
+ 2
, x
+ 3
, ···}
:
x ∈ E}
. Then
again σ(E) is the set of all subsets of E.
The following is the most important σ-algebra in the course:
Definition
(Borel
σ
-algebra)
.
Let
E
=
R
, and
A
=
{U ⊆ R
:
U is open}
. Then
σ(A) is known as the Borel σ-algebra, which is not the set of all subsets of R.
We can equivalently define this by
˜
A
=
{
(
a, b
) :
a < b, a, b ∈ Q}
. Then
σ
(
˜
A
)
is also the Borel σ-algebra.
Often, we would like to prove results that allow us to deduce properties
about the
σ
-algebra just by checking it on a generating set. However, usually,
we cannot just check it on an arbitrary generating set. Instead, the generating
set has to satisfy some nice closure properties. We are now going to introduce a
bunch of many different definitions that you need not aim to remember (except
when exams are near).
Definition
(
π
-system)
.
Let
A
be a collection of subsets of
E
. Then
A
is called
a π-system if
(i) ∅ ∈ A
(ii) If A, B ∈ A, then A ∩ B ∈ A.
Definition
(d-system)
.
Let
A
be a collection of subsets of
E
. Then
A
is called
a d-system if
(i) E ∈ A
(ii) If A, B ∈ A and A ⊆ B, then B \ A ∈ A
(iii) For all increasing sequences (A
n
) in A, we have that
S
n
A
n
∈ A.
The point of d-systems and
π
-systems is that they separate the axioms of a
σ-algebra into two parts. More precisely, we have
Proposition.
A collection
A
is a
σ
-algebra if and only if it is both a
π
-system
and a d-system.
This follows rather straightforwardly from the definitions.
The following definitions are also useful:
Definition
(Ring)
.
A collection of subsets
A
is a ring on
E
if
∅ ∈ A
and for all
A, B ∈ A, we have B \ A ∈ A and A ∪ B ∈ A.
Definition
(Algebra)
.
A collection of subsets
A
is an algebra on
E
if
∅ ∈ A
,
and for all A, B ∈ A, we have A
C
∈ A and A ∪ B ∈ A.
So an algebra is like a
σ
-algebra, but it is just closed under finite unions only,
rather than countable unions.
While the names
π
-system and
d
-system are rather arbitrary, we can make
some sense of the names “ring” and “algebra”. Indeed, a ring forms a ring
(without unity) in the algebraic sense with symmetric difference as “addition”
and intersection as “multiplication”. Then the empty set acts as the additive
identity, and
E
, if present, acts as the multiplicative identity. Similarly, an
algebra is a boolean subalgebra under the boolean algebra P (E).
A very important lemma about these things is Dynkin’s lemma:
Lemma
(Dynkin’s
π
-system lemma)
.
Let
A
be a
π
-system. Then any d-system
which contains A contains σ(A).
This will be very useful in the future. If we want to show that all elements of
σ
(
A
) satisfy a particular property for some generating
π
-system
A
, we just have
to show that the elements of
A
satisfy that property, and that the collection of
things that satisfy the property form a d-system.
While this use case might seem rather contrived, it is surprisingly common
when we have to prove things.
Proof.
Let
D
be the intersection of all d-systems containing
A
, i.e. the smallest
d-system containing
A
. We show that
D
contains
σ
(
A
). To do so, we will show
that D is a π-system, hence a σ-algebra.
There are two steps to the proof, both of which are straightforward verifica-
tions:
(i) We first show that if B ∈ D and A ∈ A, then B ∩ A ∈ D.
(ii) We then show that if A, B ∈ D, then A ∩ B ∈ D.
Then the result immediately follows from the second part.
We let
D
0
= {B ∈ D : B ∩ A ∈ D for all A ∈ A}.
We note that
D
0
⊇ A
because
A
is a
π
-system, and is hence closed under
intersections. We check that
D
0
is a d-system. It is clear that
E ∈ D
0
. If we
have B
1
, B
2
∈ D
0
, where B
1
⊆ B
2
, then for any A ∈ A, we have
(B
2
\ B
1
) ∩ A = (B
2
∩ A) \ (B
1
∩ A).
By definition of
D
0
, we know
B
2
∩ A
and
B
1
∩ A
are elements of
D
. Since
D
is
a d-system, we know this intersection is in D. So B
2
\ B
1
∈ D
0
.
Finally, suppose that (
B
n
) is an increasing sequence in
D
0
, with
B
=
S
B
n
.
Then for every A ∈ A, we have that
[
B
n
∩ A =
[
(B
n
∩ A) = B ∩ A ∈ D.
Therefore B ∈ D
0
.
Therefore
D
0
is a d-system contained in
D
, which also contains
A
. By our
choice of D, we know D
0
= D.
We now let
D
00
= {B ∈ D : B ∩ A ∈ D for all A ∈ D}.
Since
D
0
=
D
, we again have
A ⊆ D
00
, and the same argument as above implies
that
D
00
is a d-system which is between
A
and
D
. But the only way that can
happen is if D
00
= D, and this implies that D is a π-system.
After defining all sorts of things that are “weaker versions” of
σ
-algebras, we
now defined a bunch of measure-like objects that satisfy fewer properties. Again,
no one really remembers these definitions:
Definition
(Set function)
.
Let
A
be a collection of subsets of
E
with
∅ ∈ A
. A
set function function µ : A → [0, ∞] such that µ(∅) = 0.
Definition
(Increasing set function)
.
A set function is increasing if it has the
property that for all A, B ∈ A with A ⊆ B, we have µ(A) ≤ µ(B).
Definition
(Additive set function)
.
A set function is additive if whenever
A, B ∈ A and A ∪ B ∈ A, A ∩ B = ∅, then µ(A ∪ B) = µ(A) + µ(B).
Definition
(Countably additive set function)
.
A set function is countably addi-
tive if whenever A
n
is a sequence of disjoint sets in A with ∪A
n
∈ A, then
µ
[
n
A
n
!
=
X
n
µ(A
n
).
Under these definitions, a measure is just a countable additive set function
defined on a σ-algebra.
Definition
(Countably subadditive set function)
.
A set function is countably
subadditive if whenever (A
n
) is a sequence of sets in A with
S
n
A
n
∈ A, then
µ
[
n
A
n
!
≤
X
n
µ(A
n
).
The big theorem that allows us to construct measures is the Caratheodory
extension theorem. In particular, this will help us construct the Lebesgue measure
on R.
Theorem
(Caratheodory extension theorem)
.
Let
A
be a ring on
E
, and
µ
a countably additive set function on
A
. Then
µ
extends to a measure on the
σ-algebra generated by A.
Proof.
(non-examinable) We start by defining what we want our measure to be.
For B ⊆ E, we set
µ
∗
(B) = inf
(
X
n
µ(A
n
) : (A
n
) ∈ A and B ⊆
[
A
n
)
.
If it happens that there is no such sequence, we set this to be
∞
. This measure is
known as the outer measure. It is clear that
µ
∗
(
φ
) = 0, and that
µ
∗
is increasing.
We say a set A ⊆ E is µ
∗
-measurable if
µ
∗
(B) = µ
∗
(B ∩ A) + µ
∗
(B ∩ A
C
)
for all B ⊆ E. We let
M = {µ
∗
-measurable sets}.
We will show the following:
(i) M is a σ-algebra containing A.
(ii) µ
∗
is a measure on M with µ
∗
|
A
= µ.
Note that it is not true in general that
M
=
σ
(
A
). However, we will always
have M ⊇ σ(A).
We are going to break this up into five nice bite-size chunks.
Claim. µ
∗
is countably subadditive.
Suppose
B ⊆
S
n
B
n
. We need to show that
µ
∗
(
B
)
≤
P
n
µ
∗
(
B
n
). We can
wlog assume that
µ
∗
(
B
n
) is finite for all
n
, or else the inequality is trivial. Let
ε >
0. Then by definition of the outer measure, for each
n
, we can find a
sequence (B
n,m
)
∞
m=1
in A with the property that
B
n
⊆
[
m
B
n,m
and
µ
∗
(B
n
) +
ε
2
n
≥
X
m
µ(B
n,m
).
Then we have
B ⊆
[
n
B
n
⊆
[
n,m
B
n,m
.
Thus, by definition, we have
µ
∗
(B) ≤
X
n,m
µ
∗
(B
n,m
) ≤
X
n
µ
∗
(B
n
) +
ε
2
n
= ε +
X
n
µ
∗
(B
n
).
Since ε was arbitrary, we are done.
Claim. µ
∗
agrees with µ on A.
In the first example sheet, we will show that if
A
is a ring and
µ
is a countably
additive set function on
µ
, then
µ
is in fact countably subadditive and increasing.
Assuming this, suppose that
A,
(
A
n
) are in
A
and
A ⊆
S
n
A
n
. Then by
subadditivity, we have
µ(A) ≤
X
n
µ(A ∩ A
n
) ≤
X
n
µ(A
n
),
using that
µ
is countably subadditivity and increasing. Note that we have to do
this in two steps, rather than just applying countable subadditivity, since we did
not assume that
S
n
A
n
∈ A. Taking the infimum over all sequences, we have
µ(A) ≤ µ
∗
(A).
Also, we see by definition that
µ
(
A
)
≥ µ
∗
(
A
), since
A
covers
A
. So we get that
µ(A) = µ
∗
(A) for all A ∈ A.
Claim. M contains A.
Suppose that A ∈ A and B ⊆ E. We need to show that
µ
∗
(B) = µ
∗
(B ∩ A) + µ
∗
(B ∩ A
C
).
Since
µ
∗
is countably subadditive, we immediately have
µ
∗
(
B
)
≤ µ
∗
(
B ∩ A
) +
µ
∗
(
B ∩ A
C
). For the other inequality, we first observe that it is trivial if
µ
∗
(
B
)
is infinite. If it is finite, then by definition, given
ε >
0, we can find some (
B
n
)
in A such that B ⊆
S
n
B
n
and
µ
∗
(B) + ε ≥
X
n
µ(B
n
).
Then we have
B ∩ A ⊆
[
n
(B
n
∩ A)
B ∩ A
C
⊆
[
n
(B
n
∩ A
C
)
We notice that B
n
∩ A
C
= B
n
\ A ∈ A. Thus, by definition of µ
∗
, we have
µ
∗
(B ∩ A) + µ
∗
(B ∩ A
c
) ≤
X
n
µ(B
n
∩ A) +
X
n
µ(B
n
∩ A
C
)
=
X
n
(µ(B
n
∩ A) + µ(B
n
∩ A
C
))
=
X
n
µ(B
n
)
≤ µ
∗
(B
n
) + ε.
Since ε was arbitrary, the result follows.
Claim. We show that M is an algebra.
We first show that E ∈ M. This is true since we obviously have
µ
∗
(B) = µ
∗
(B ∩ E) + µ
∗
(B ∩ E
C
)
for all B ⊆ E.
Next, note that if A ∈ M, then by definition we have, for all B,
µ
∗
(B) = µ
∗
(B ∩ A) + µ
∗
(B ∩ A
C
).
Now note that this definition is symmetric in
A
and
A
C
. So we also have
A
C
∈ M .
Finally, we have to show that
M
is closed under intersection (which is
equivalent to being closed under union when we have complements). Suppose
A
1
, A
2
∈ M and B ⊆ E. Then we have
µ
∗
(B) = µ
∗
(B ∩ A
1
) + µ
∗
(B ∩ A
C
1
)
= µ
∗
(B ∩ A
1
∩ A
2
) + µ
∗
(B ∩ A
1
∩ A
C
2
) + µ
∗
(B ∩ A
C
1
)
= µ
∗
(B ∩ (A
1
∩ A
2
)) + µ
∗
(B ∩ (A
1
∩ A
2
)
C
∩ A
1
)
+ µ
∗
(B ∩ (A
1
∩ A
2
)
C
∩ A
C
1
)
= µ
∗
(B ∩ (A
1
∩ A
2
)) + µ
∗
(B ∩ (A
1
∩ A
2
)
C
).
So we have A
1
∩ A
2
∈ M. So M is an algebra.
Claim. M is a σ-algebra, and µ
∗
is a measure on M.
To show that
M
is a
σ
-algebra, we need to show that it is closed under
countable unions. We let (
A
n
) be a disjoint collection of sets in
M
, then we
want to show that A =
S
n
A
n
∈ M and µ
∗
(A) =
P
n
µ
∗
(A
n
).
Suppose that B ⊆ E. Then we have
µ
∗
(B) = µ
∗
(B ∩ A
1
) + µ
∗
(B ∩ A
C
1
)
Using the fact that A
2
∈ M and A
1
∩ A
2
= ∅, we have
= µ
∗
(B ∩ A
1
) + µ
∗
(B ∩ A
2
) + µ
∗
(B ∩ A
C
1
∩ A
C
2
)
= ···
=
n
X
i=1
µ
∗
(B ∩ A
i
) + µ
∗
(B ∩ A
C
1
∩ ··· ∩ A
C
n
)
≥
n
X
i=1
µ
∗
(B ∩ A
i
) + µ
∗
(B ∩ A
C
).
Taking the limit as n → ∞, we have
µ
∗
(B) ≥
∞
X
i=1
µ
∗
(B ∩ A
i
) + µ
∗
(B ∩ A
C
).
By the countable-subadditivity of µ
∗
, we have
µ
∗
(B ∩ A) ≤
∞
X
i=1
µ
∗
(B ∩ A
i
).
Thus we obtain
µ
∗
(B) ≥ µ
∗
(B ∩ A) + µ
∗
(B ∩ A
C
).
By countable subadditivity, we also have inequality in the other direction. So
equality holds. So A ∈ M. So M is a σ-algebra.
To see that µ
∗
is a measure on M, note that the above implies that
µ
∗
(B) =
∞
X
i=1
(B ∩ A
i
) + µ
∗
(B ∩ A
C
).
Taking B = A, this gives
µ
∗
(A) =
∞
X
i=1
(A ∩ A
i
) + µ
∗
(A ∩ A
C
) =
∞
X
i=1
µ
∗
(A
i
).
Note that when
A
itself is actually a
σ
-algebra, the outer measure can be
simply written as
µ
∗
(B) = inf{µ(A) : A ∈ A, B ⊆ A}.
Caratheodory gives us the existence of some measure extending the set function
on
A
. Could there be many? In general, there could. However, in the special
case where the measure is finite, we do get uniqueness.
Theorem.
Suppose that
µ
1
, µ
2
are measures on (
E, E
) with
µ
1
(
E
) =
µ
2
(
E
)
<
∞
. If
A
is a
π
-system with
σ
(
A
) =
E
, and
µ
1
agrees with
µ
2
on
A
, then
µ
1
=
µ
2
.
Proof. Let
D = {A ∈ E : µ
1
(A) = µ
2
(A)}
We know that
D ⊇ A
. By Dynkin’s lemma, it suffices to show that
D
is a
d-system. The things to check are:
(i) E ∈ D — this follows by assumption.
(ii) If A, B ∈ D with A ⊆ B, then B \ A ∈ D. Indeed, we have the equations
µ
1
(B) = µ
1
(A) + µ
1
(B \ A) < ∞
µ
2
(B) = µ
2
(A) + µ
2
(B \ A) < ∞.
Since
µ
1
(
B
) =
µ
2
(
B
) and
µ
1
(
A
) =
µ
2
(
A
), we must have
µ
1
(
B \ A
) =
µ
2
(B \ A).
(iii) Let (A
n
) ∈ D be an increasing sequence with
S
A
n
= A. Then
µ
1
(A) = lim
n→∞
µ
1
(A
n
) = lim
n→∞
µ
2
(A
n
) = µ
2
(A).
So A ∈ D.
The assumption that
µ
1
(
E
) =
µ
2
(
E
)
< ∞
is necessary. The theorem does
not necessarily hold without it. We can see this from a simple counterexample:
Example. Let E = Z, and let E = P (E). We let
A = {{x, x + 1, x + 2, ···} : x ∈ E}∪ {∅}.
This is a
π
-system with
σ
(
A
) =
E
. We let
µ
1
(
A
) be the number of elements in
A
,
and
µ
2
= 2
µ
1
(
A
). Then obviously
µ
1
6
=
µ
2
, but
µ
1
(
A
) =
∞
=
µ
2
(
A
) for
A ∈ A
.
Definition
(Borel
σ
-algebra)
.
Let
E
be a topological space. We define the
Borel σ-algebra as
B(E) = σ({U ⊆ E : U is open}).
We write B for B(R).
Definition
(Borel measure and Radon measure)
.
A measure
µ
on (
E, B
(
E
)) is
called a Borel measure. If
µ
(
K
)
< ∞
for all
K ⊆ E
compact, then
µ
is a Radon
measure.
The most important example of a Borel measure we will consider is the
Lebesgue measure.
Theorem. There exists a unique Borel measure µ on R with µ([a, b]) = b − a.
Proof.
We first show uniqueness. Suppose
˜µ
is another measure on
B
satisfying
the above property. We want to apply the previous uniqueness theorem, but our
measure is not finite. So we need to carefully get around that problem.
For each n ∈ Z, we set
µ
n
(A) = µ(A ∩ (n, n + 1]))
˜µ
n
(A) = ˜µ(A ∩ (n, n + 1]))
Then
µ
n
and
˜µ
n
are finite measures on
B
which agree on the
π
-system of intervals
of the form (
a, b
] with
a, b ∈ R
,
a < b
. Therefore we have
µ
n
=
˜µ
n
for all
n ∈ Z
.
Now we have
µ(A) =
X
n∈Z
µ(A ∩ (n, n + 1]) =
X
n∈Z
µ
n
(A) =
X
n∈Z
˜µ
n
(A) = ˜µ(A)
for all Borel sets A.
To show existence, we want to use the Caratheodory extension theorem. We
let A be the collection of finite, disjoint unions of the form
A = (a
1
, b
1
] ∪ (a
2
, b
2
] ∪ ··· ∪ (a
n
, b
n
].
Then
A
is a ring of subsets of
R
, and
σ
(
A
) =
B
(details are to be checked on
the first example sheet).
We set
µ(A) =
n
X
i=1
(b
i
− a
i
).
We note that µ is well-defined, since if
A = (a
1
, b
1
] ∪ ··· ∪ (a
n
, b
n
] = (˜a
1
,
˜
b
1
] ∪ ··· ∪ (˜a
n
,
˜
b
n
],
then
n
X
i=1
(b
i
− a
i
) =
n
X
i=1
(
˜
b
i
− ˜a
i
).
Also, if
µ
is additive,
A, B ∈ A
,
A ∩ B
=
∅
and
A ∪ B ∈ A
, we obviously have
µ(A ∪ B) = µ(A) + µ(B). So µ is additive.
Finally, we have to show that
µ
is in fact countably additive. Let (
A
n
) be a
disjoint sequence in
A
, and let
A
=
S
∞
i=1
A
n
∈ A
. Then we need to show that
µ(A) =
P
∞
n=1
µ(A
n
).
Since µ is additive, we have
µ(A) = µ(A
1
) + µ(A \ A
1
)
= µ(A
1
) + µ(A
2
) + µ(A \ A
1
∪ A
2
)
=
n
X
i=1
µ(A
i
) + µ
A \
n
[
i=1
A
i
!
To finish the proof, we show that
µ
A \
n
[
i=1
A
i
!
→ 0 as n → ∞.
We are going to reduce this to the finite intersection property of compact sets in
R
:
if (
K
n
) is a sequence of compact sets in
R
with the property that
T
n
m=1
K
m
6
=
∅
for all n, then
T
∞
m=1
K
m
6= ∅.
We first introduce some new notation. We let
B
n
= A \
n
[
m=1
A
m
.
We now suppose, for contradiction, that
µ
(
B
n
)
6→
0 as
n → ∞
. Since the
B
n
’s
are decreasing, there must exist ε > 0 such that µ(B
n
) ≥ 2ε for every n.
For each
n
, we take
C
n
∈ A
with the property that
C
n
⊆ B
n
and
µ
(
B
n
\C
n
)
≤
ε
2
n
. This is possible since each
B
n
is just a finite union of intervals. Thus we
have
µ(B
n
) − µ
n
\
m=1
C
m
!
= µ
B
n
\
n
\
m=1
C
m
!
≤ µ
n
[
m=1
(B
m
\ C
m
)
!
≤
n
X
m=1
µ(B
m
\ C
m
)
≤
n
X
m=1
ε
2
m
≤ ε.
On the other hand, we also know that µ(B
n
) ≥ 2ε.
µ
n
\
m=1
C
m
!
≥ ε
for all
n
. We now let that
K
n
=
T
n
m=1
C
m
. Then
µ
(
K
n
)
≥ ε
, and in particular
K
n
6= ∅ for all n.
Thus, the finite intersection property says
∅ 6=
∞
\
n=1
K
n
⊆
∞
\
n=1
B
n
= ∅.
This is a contradiction. So we have µ(B
n
) → 0 as n → ∞. So done.
Definition
(Lebesgue measure)
.
The Lebesgue measure is the unique Borel
measure µ on R with µ([a, b]) = b − a.
Note that the Lebesgue measure is not a finite measure, since
µ
(
R
) =
∞
.
However, it is a σ-finite measure.
Definition
(
σ
-finite measure)
.
Let (
E, E
) be a measurable space, and
µ
a
measure. We say
µ
is
σ
-finite if there exists a sequence (
E
n
) in
E
such that
S
n
E
n
= E and µ(E
n
) < ∞ for all n.
This is the next best thing we can hope after finiteness, and often proofs
that involve finiteness carry over to σ-finite measures.
Proposition. The Lebesgue measure is translation invariant, i.e.
µ(A + x) = µ(A)
for all A ∈ B and x ∈ R, where
A + x = {y + x, y ∈ A}.
Proof. We use the uniqueness of the Lebesgue measure. We let
µ
x
(A) = µ(A + x)
for
A ∈ B
. Then this is a measure on
B
satisfying
µ
x
([
a, b
]) =
b − a
. So the
uniqueness of the Lebesgue measure shows that µ
x
= µ.
It turns out that translation invariance actually characterizes the Lebesgue
measure.
Proposition.
Let
˜µ
be a Borel measure on
R
that is translation invariant and
µ([0, 1]) = 1. Then ˜µ is the Lebesgue measure.
Proof. We show that any such measure must satisfy
µ([a, b]) = b − a.
By additivity and translation invariance, we can show that
µ
([
p, q
]) =
q −p
for all
rational
p < q
. By considering
µ
([
p, p
+ 1
/n
]) for all
n
and using the increasing
property, we know
µ
(
{p}
) = 0. So
µ
(([
p, q
)) =
µ
((
p, q
]) =
µ
((
p, q
)) =
q − p
for
all rational p, q.
Finally, by countable additivity, we can extend this to all real intervals. Then
the result follows from the uniqueness of the Lebesgue measure.
In the proof of the Caratheodory extension theorem, we constructed a measure
µ
∗
on the
σ
-algebra
M
of
µ
∗
-measurable sets which contains
A
. This contains
B
=
σ
(
A
), but could in fact be bigger than it. We call
M
the Lebesgue
σ
-algebra.
Indeed, it can be given by
M = {A ∪ N : A ∈ B, N ⊆ B ∈ B with µ(B) = 0}.
If A ∪ N ∈ M, then µ(A ∪ N ) = µ(A). The proof is left for the example sheet.
It is also true that
M
is strictly larger than
B
, so there exists
A ∈ M
with
A 6∈ B. Construction of such a set was on last year’s exam (2016).
On the other hand, it is also true that not all sets are Lebesgue measurable.
This is a rather funny construction.
Example.
For
x, y ∈
[0
,
1), we say
x ∼ y
if
x − y
is rational. This defines an
equivalence relation on [0
,
1). By the axiom of choice, we pick a representative
of each equivalence class, and put them into a set
S ⊆
[0
,
1). We will show that
S is not Lebesgue measurable.
Suppose that
S
were Lebesgue measurable. We are going to get a contra-
diction to the countable additivity of the Lebesgue measure. For each rational
r ∈ [0, 1) ∩ Q, we define
S
r
= {s + r mod 1 : s ∈ S}.
By translation invariance, we know
S
r
is also Lebesgue measurable, and
µ
(
S
r
) =
µ(S).
Also, by construction of
S
, we know (
S
r
)
r∈Q
is disjoint, and
S
r∈Q
S
r
= [0
,
1).
Now by countable additivity, we have
1 = µ([0, 1)) = µ
[
r∈Q
S
r
=
X
r∈Q
µ(S
r
) =
X
r∈Q
µ(S),
which is clearly not possible. Indeed, if
µ
(
S
) = 0, then this says 1 = 0; If
µ(S) > 0, then this says 1 = ∞. Both are absurd.
1.2 Probability measures
Since the course is called “probability and measure”, we’d better start talking
about probability! It turns out the notions we care about in probability theory
are very naturally just special cases of the concepts we have previously considered.
Definition
(Probability measure and probability space)
.
Let (
E, E
) be a measure
space with the property that
µ
(
E
) = 1. Then we often call
µ
a probability measure,
and (E, E, µ) a probability space.
Probability spaces are usually written as (Ω, F, P) instead.
Definition
(Sample space)
.
In a probability space (Ω
, F, P
), we often call Ω
the sample space.
Definition
(Events)
.
In a probability space (Ω
, F, P
), we often call the elements
of F the events.
Definition
(Probaiblity)
.
In a probability space (Ω
, F, P
), if
A ∈ F
, we often
call P[A] the probability of the event A.
These are exactly the same things as measures, but with different names!
However, thinking of them as probabilities could make us ask different questions
about these measure spaces. For example, in probability, one is often interested
in independence.
Definition
(Independence of events)
.
A sequence of events (
A
n
) is said to be
independent if
P
"
\
n∈J
A
n
#
=
Y
n∈J
P[A
n
]
for all finite subsets J ⊆ N.
However, it turns out that talking about independence of events is usually
too restrictive. Instead, we want to talk about the independence of σ-algebras:
Definition
(Independence of
σ
-algebras)
.
A sequence of
σ
-algebras (
A
n
) with
A
n
⊆ F
for all
n
is said to be independent if the following is true: If (
A
n
) is a
sequence where A
n
∈ A
n
for all n, them (A
n
) is independent.
Proposition.
Events (
A
n
) are independent iff the
σ
-algebras
σ
(
A
n
) are inde-
pendent.
While proving this directly would be rather tedious (but not too hard), it is
an immediate consequence of the following theorem:
Theorem. Suppose A
1
and A
2
are π-systems in F. If
P[A
1
∩ A
2
] = P[A
1
]P[A
2
]
for all A
1
∈ A
1
and A
2
∈ A
2
, then σ(A
1
) and σ(A
2
) are independent.
Proof.
This will follow from two applications of the fact that a finite measure is
determined by its values on a π-system which generates the entire σ-algebra.
We first fix A
1
∈ A
1
. We define the measures
µ(A) = P[A ∩ A
1
]
and
ν(A) = P[A]P[A
1
]
for all
A ∈ F
. By assumption, we know
µ
and
ν
agree on
A
2
, and we have that
µ(Ω) = P[A
1
] = ν(Ω) ≤ 1 < ∞. So µ and ν agree on σ(A
2
). So we have
P[A
1
∩ A
2
] = µ(A
2
) = ν(A
2
) = P[A
1
]P[A
2
]
for all A
2
∈ σ(A
2
).
So we have now shown that if
A
1
and
A
2
are independent, then
A
1
and
σ
(
A
2
) are independent. By symmetry, the same argument shows that
σ
(
A
1
)
and σ(A
2
) are independent.
Say we are rolling a dice. Instead of asking what the probability of getting
a 6, we might be interested instead in the probability of getting a 6 infinitely
often. Intuitively, the answer is “it happens with probability 1”, because in each
dice roll, we have a probability of
1
6
of getting a 6, and they are all independent.
We would like to make this precise and actually prove it. It turns out that
the notions of “occurs infinitely often” and also “occurs eventually” correspond
to more analytic notions of lim sup and lim inf.
Definition (limsup and liminf). Let (A
n
) be a sequence of events. We define
lim sup A
n
=
\
n
[
m≥n
A
m
lim inf A
n
=
[
n
\
m≥n
A
m
.
To parse these definitions more easily, we can read
∩
as “for all”, and
∪
as
“there exits”. For example, we can write
lim sup A
n
= ∀n, ∃m ≥ n such that A
m
occurs
= {x : ∀n, ∃m ≥ n, x ∈ A
m
}
= {A
m
occurs infinitely often}
= {A
m
i.o.}
Similarly, we have
lim inf A
n
= ∃n, ∀m ≥ n such that A
m
occurs
= {x : ∃n, ∀m ≥ n, x ∈ A
m
}
= {A
m
occurs eventually}
= {A
m
e.v.}
We are now going to prove two “obvious” results, known as the Borel–Cantelli
lemmas. These give us necessary conditions for an event to happen infinitely
often, and in the case where the events are independent, the condition is also
sufficient.
Lemma (Borel–Cantelli lemma). If
X
n
P[A
n
] < ∞,
then
P[A
n
i.o.] = 0.
Proof. For each k, we have
P[A
n
i.o] = P
\
n
[
m≥n
A
m
≤ P
[
m≥k
A
m
≤
∞
X
m=k
P[A
m
]
→ 0
as k → ∞. So we have P[A
n
i.o.] = 0.
Note that we did not need to use the fact that we are working with a
probability measure. So in fact this holds for any measure space.
Lemma (Borel–Cantelli lemma II). Let (A
n
) be independent events. If
X
n
P[A
n
] = ∞,
then
P[A
n
i.o.] = 1.
Note that independence is crucial. If we flip a fair coin, and we set all the
A
n
to be equal to “getting a heads”, then
P
n
P
[
A
n
] =
P
n
1
2
=
∞
, but we certainly
do not have P[A
n
i.o.] = 1. Instead it is just
1
2
.
Proof.
By example sheet, if (
A
n
) is independent, then so is (
A
C
n
). Then we have
P
"
N
\
m=n
A
C
m
#
=
N
Y
m=n
P[A
C
m
]
=
N
Y
m=n
(1 − P[A
m
])
≤
N
Y
m=n
exp(−P[A
m
])
= exp
−
N
X
m=n
P[A
m
]
!
→ 0
as N → ∞, as we assumed that
P
n
P[A
n
] = ∞. So we have
P
"
∞
\
m=n
A
C
m
#
= 0.
By countable subadditivity, we have
P
"
[
n
∞
\
m=n
A
C
m
#
= 0.
This in turn implies that
P
"
\
n
∞
[
m=n
A
m
#
= 1 − P
"
[
n
∞
\
m=n
A
C
m
#
= 1.
So we are done.
2 Measurable functions and random variables
We’ve had enough of measurable sets. As in most of mathematics, not only
should we talk about objects, but also maps between objects. Here we want to
talk about maps between measure spaces, known as measurable functions. In
the case of a probability space, a measurable function is a random variable!
In this chapter, we are going to start by defining a measurable function and
investigate some of its basic properties. In particular, we are going to prove the
monotone class theorem, which is the analogue of Dynkin’s lemma for measurable
functions. Afterwards, we turn to the probabilistic aspects, and see how we can
make sense of the independence of random variables. Finally, we are going to
consider different notions of “convergence” of functions.
2.1 Measurable functions
The definition of a measurable function is somewhat like the definition of a
continuous function, except that we replace “open” with “in the σ-algebra”.
Definition
(Measurable functions)
.
Let (
E, E
) and (
G, G
) be measure spaces.
A map f : E → G is measurable if for every A ∈ G, we have
f
−1
(A) = {x ∈ E : f(x) ∈ E} ∈ E.
If (G, G) = (R, B), then we will just say that f is measurable on E.
If (
G, G
) = ([0
, ∞
]
, B
), then we will just say that
f
is non-negative measurable.
If E is a topological space and E = B(E), then we call f a Borel function.
How do we actually check in practice that a function is measurable? It turns
out we are lucky. We can simply check that
f
−1
(
A
)
∈ E
for
A
in any generating
set Q of G.
Lemma.
Let (
E, E
) and (
G, G
) be measurable spaces, and
G
=
σ
(
Q
) for some
Q. If f
−1
(A) ∈ E for all A ∈ Q, then f is measurable.
Proof. We claim that
{A ⊆ G : f
−1
(A) ∈ E}
is a
σ
-algebra on
G
. Then the result follows immediately by definition of
σ
(
Q
).
Indeed, this follows from the fact that
f
−1
preserves everything. More
precisely, we have
f
−1
[
n
A
n
!
=
[
n
f
−1
(A
n
), f
−1
(A
C
) = (f
−1
(A))
C
, f
−1
(∅) = ∅.
So if, say, all A
n
∈ A, then so is
S
n
A
n
.
Example.
In the particular case where we have a function
f
:
E → R
, we know
that
B
=
B
(
R
) is generated by (
−∞, y
] for
y ∈ R
. So we just have to check that
{x ∈ E : f(x) ≤ y} = f
−1
((−∞, y])) ∈ E.
Example.
Let
E, F
be topological spaces, and
f
:
E → F
be continuous. We
will see that
f
is a measurable function (under the Borel
σ
-algebras). Indeed,
by definition, whenever
U ⊆ F
is open, we have
f
−1
(
U
) open as well. So
f
−1
(
U
)
∈ B
(
E
) for all
U ⊆ F
open. But since
B
(
F
) is the
σ
-algebra generated
by the open sets, this implies that f is measurable.
This is one very important example. We can do another very important
example.
Example.
Suppose that
A ⊆ E
. The indicator function of
A
is
1
A
(
x
) :
E →
{0, 1} given by
1
A
(x) =
(
1 x ∈ A
0 x 6∈ A
.
Suppose we give
{
0
,
1
}
the non-trivial measure. Then
1
A
is a measurable function
iff A ∈ E.
Example. The identity function is always measurable.
Example.
Composition of measurable functions are measurable. More precisely,
if (
E, E
), (
F, F
) and (
G, G
) are measurable spaces, and the functions
f
:
E → F
and
g
:
F → G
are measurable, then the composition
g◦f
:
E → G
is measurable.
Indeed, if
A ∈ G
, then
g
−1
(
A
)
∈ F
, so
f
−1
(
g
−1
(
A
))
∈ E
. But
f
−1
(
g
−1
(
A
)) =
(g ◦ f )
−1
(A). So done.
Definition
(
σ
-algebra generated by functions)
.
Now suppose we have a set
E
,
and a family of real-valued functions {f
i
: i ∈ I} on E. We then define
σ(f
i
: i ∈ I) = σ(f
−1
i
(A) : A ∈ B, i ∈ I).
This is the smallest
σ
-algebra on
E
which makes all the
f
i
’s measurable.
This is analogous to the notion of initial topologies for topological spaces.
If we want to construct more measurable functions, the following definition
will be rather useful:
Definition
(Product measurable space)
.
Let (
E, E
) and (
G, G
) be measure
spaces. We define the product measure space as
E × G
whose
σ
-algebra is
generated by the projections
E × G
E G
π
1
π
2
.
More explicitly, the σ-algebra is given by
E ⊗ G = σ({A × B : A ∈ E, B ∈ G}).
More generally, if (
E
i
, E
i
) is a collection of measure spaces, the product measure
space has underlying set
Q
i
E
i
, and the
σ
-algebra generated by the projection
maps π
i
:
Q
j
E
j
→ E
i
.
This satisfies the following property:
Proposition.
Let
f
i
:
E → F
i
be functions. Then
{f
i
}
are all measurable iff
(
f
i
) :
E →
Q
F
i
is measurable, where the function (
f
i
) is defined by setting the
ith component of (f
i
)(x) to be f
i
(x).
Proof.
If the map (
f
i
) is measurable, then by composition with the projections
π
i
, we know that each f
i
is measurable.
Conversely, if all
f
i
are measurable, then since the
σ
-algebra of
Q
F
i
is
generated by sets of the form
π
−1
j
(
A
) :
A ∈ F
j
, and the pullback of such sets
along (f
i
) is exactly f
−1
j
(A), we know the function (f
i
) is measurable.
Using this, we can prove that a whole lot more functions are measurable.
Proposition.
Let (
E, E
) be a measurable space. Let (
f
n
:
n ∈ N
) be a sequence
of non-negative measurable functions on
E
. Then the following are measurable:
f
1
+ f
2
, f
1
f
2
, max{f
1
, f
2
}, min{f
1
, f
2
},
inf
n
f
n
, sup
n
f
n
, lim inf
n
f
n
, lim sup
n
f
n
.
The same is true with “real” replaced with “non-negative”, provided the new
functions are real (i.e. not infinity).
Proof.
This is an (easy) exercise on the example sheet. For example, the sum
f
1
+ f
2
can be written as the following composition.
E [0, ∞]
2
[0, ∞].
(f
1
,f
2
)
+
We know the second map is continuous, hence measurable. The first function is
also measurable since the f
i
are. So the composition is also measurable.
The product follows similarly, but for the infimum and supremum, we need to
check explicitly that the corresponding maps [0
, ∞
]
N
→
[0
, ∞
] is measurable.
Notation. We will write
f ∧ g = min{f, g}, f ∨ g = max{f, g}.
We are now going to prove the monotone class theorem, which is a “Dynkin’s
lemma” for measurable functions. As in the case of Dynkin’s lemma, it will
sound rather awkward but will prove itself to be very useful.
Theorem
(Monotone class theorem)
.
Let (
E, E
) be a measurable space, and
A ⊆ E
be a
π
-system with
σ
(
A
) =
E
. Let
V
be a vector space of functions such
that
(i) The constant function 1 = 1
E
is in V.
(ii) The indicator functions 1
A
∈ V for all A ∈ A
(iii) V is closed under bounded, monotone limits.
More explicitly, if (
f
n
) is a bounded non-negative sequence in
V
,
f
n
% f
(pointwise) and f is also bounded, then f ∈ V.
Then V contains all bounded measurable functions.
Note that the conditions for
V
is pretty like the conditions for a d-system,
where taking a bounded, monotone limit is something like taking increasing
unions.
Proof. We first deduce that 1
A
∈ V for all A ∈ E.
D = {A ∈ E : 1
A
∈ V}.
We want to show that
D
=
E
. To do this, we have to show that
D
is a
d
-system.
(i) Since 1
E
∈ V, we know E ∈ D.
(ii) If 1
A
∈ V , then 1 − 1
A
= 1
E\A
∈ V. So E \ A ∈ D.
(iii)
If (
A
n
) is an increasing sequence in
D
, then
1
A
n
→ 1
S
A
n
monotonically
increasingly. So 1
S
A
n
is in D.
So, by Dynkin’s lemma, we know
D
=
E
. So
V
contains indicators of all measur-
able sets. We will now try to obtain any measurable function by approximating.
Suppose that
f
is bounded and non-negative measurable. We want to show
that f ∈ V. To do this, we approximate it by letting
f
n
= 2
−n
b2
n
fc =
∞
X
k=0
k2
−n
1
{k2
−n
≤f<(k+1)2
−n
}
.
Note that since
f
is bounded, this is a finite sum. So it is a finite linear
combination of indicators of elements in
E
. So
f
n
∈ V
, and 0
≤ f
n
→ f
monotonically. So f ∈ V.
More generally, if f is bounded and measurable, then we can write
f = (f ∨ 0) + (f ∧ 0) ≡ f
+
− f
−
.
Then f
+
and f
−
are bounded and non-negative measurable. So f ∈ V.
Unfortunately, we will not have a chance to use this result until the next
chapter where we discuss integration. There we will use this a lot.
2.2 Constructing new measures
We are going to look at two ways to construct new measures on spaces based on
some measurable function we have.
Definition
(Image measure)
.
Let (
E, E
) and (
G, G
) be measure spaces. Suppose
µ
is a measure on
E
and
f
:
E → G
is a measurable function. We define the
image measure ν = µ ◦ f
−1
on G by
ν(A) = µ(f
−1
(A)).
It is a routine check that this is indeed a measure.
If we have a strictly increasing continuous function, then we know it is
invertible (if we restrict the codomain appropriately), and the inverse is also
strictly increasing. It is also clear that these conditions are necessary for an
inverse to exist. However, if we relax the conditions a bit, we can get some sort
of “pseudoinverse” (some categorists may call them “left adjoints” (and will tell
you that it is a trivial consequence of the adjoint functor theorem)).
Recall that a function
g
is right continuous if
x
n
& x
implies
g
(
x
n
)
→ g
(
x
),
and similarly f is left continuous if x
n
% x implies f (x
n
) → f (x).
Lemma. Let g : R → R be non-constant, non-decreasing and right continuous.
We set
g(±∞) = lim
x→±∞
g(x).
We set I = (g(−∞), g(∞)). Since g is non-constant, this is non-empty.
Then there is a non-decreasing, left continuous function
f
:
I → R
such that
for all x ∈ I and y ∈ R, we have
x ≤ g(y) ⇔ f (x) ≤ y.
Thus, taking the negation of this, we have
x > g(y) ⇔ f (x) > y.
Explicitly, for x ∈ I, we define
f(x) = inf{y ∈ R : x ≤ g(y)}.
Proof. We just have to verify that it works. For x ∈ I, consider
J
x
= {y ∈ R : x ≤ g(y)}.
Since
g
is non-decreasing, if
y ∈ J
x
and
y
0
≥ y
, then
y
0
∈ J
x
. Since
g
is
right-continuous, if y
n
∈ J
x
is such that y
n
& y, then y ∈ J
x
. So we have
J
x
= [f(x), ∞).
Thus, for f ∈ R, we have
x ≤ g(y) ⇔ f (x) ≤ y.
So we just have to prove the remaining properties of
f
. Now for
x ≤ x
0
, we have
J
x
⊆ J
x
0
. So f(x) ≤ f (x
0
). So f is non-decreasing.
Similarly, if
x
n
% x
, then we have
J
x
=
T
n
J
x
n
. So
f
(
x
n
)
→ f
(
x
). So this
is left continuous.
Example. If g is given by the function
then f is given by
This allows us to construct new measures on R with ease.
Theorem.
Let
g
:
R → R
be non-constant, non-decreasing and right continuous.
Then there exists a unique Radon measure dg on B such that
dg((a, b]) = g(b) − g(a).
Moreover, we obtain all non-zero Radon measures on R in this way.
We have already seen an instance of this when we
g
was the identity function.
Given the lemma, this is very easy.
Proof.
Take
I
and
f
as in the previous lemma, and let
µ
be the restriction of
the Lebesgue measure to Borel subsets of
I
. Now
f
is measurable since it is left
continuous. We define dg = µ ◦ f
−1
. Then we have
dg((a, b]) = µ({x ∈ I : a < f(x) ≤ b})
= µ({x ∈ I : g(a) < x ≤ g(b)})
= µ((g(a), g(b)]) = g(b) − g(a).
So dg is a Radon measure with the required property.
There are no other such measures by the argument used for uniqueness of
the Lebesgue measure.
To show we get all non-zero Radon measures this way, suppose we have a
Radon measure ν on R, we want to produce a g such that ν = dg. We set
g(y) =
(
−ν((y, 0]) y ≤ 0
ν((0, y]) y > 0
.
Then
ν
((
a, b
]) =
g
(
b
)
− g
(
a
). We see that
ν
is non-zero, so
g
is non-constant.
It is also easy to see it is non-decreasing and right continuous. So
ν
= d
g
by
continuity.
2.3 Random variables
We are now going to look at these ideas in the context of probability. It turns
out they are concepts we already know and love!
Definition
(Random variable)
.
Let (Ω
, F, P
) be a probability space, and (
E, E
)
a measurable space. Then an
E
-valued random variable is a measurable function
X : Ω → E.
By default, we will assume the random variables are real.
Usually, when we have a random variable
X
, we might ask questions like
“what is the probability that
X ∈ A
?”. In other words, we are asking for the
“size” of the set of things that get sent to A. This is just the image measure!
Definition
(Distribution/law)
.
Given a random variable
X
: Ω
→ E
, the
distribution or law of X is the image measure µ
x
: P ◦ X
−1
. We usually write
P(X ∈ A) = µ
x
(A) = P(X
−1
(A)).
If
E
=
R
, then
µ
x
is determined by its values on the
π
-system of intervals
(−∞, y]. We set
F
X
(x) = µ
X
((−∞, x]) = P(X ≤ x)
This is known as the distribution function of X.
Proposition. We have
F
X
(x) →
(
0 x → −∞
1 x → +∞
.
Also, F
X
(x) is non-decreasing and right-continuous.
We call any function F with these properties a distribution function.
Definition
(Distribution function)
.
A distribution function is a non-decreasing,
right continuous function f : R → [0, 1] satisfying
F
X
(x) →
(
0 x → −∞
1 x → +∞
.
We now want to show that every distribution function is indeed a distribution.
Proposition.
Let
F
be any distribution function. Then there exists a probability
space (Ω, F, P) and a random variable X such that F
X
= F .
Proof. Take (Ω, F, P) = ((0, 1), B(0, 1), Lebesgue). We take X : Ω → R to be
X(ω) = inf{x : ω ≤ f(x)}.
Then we have
X(ω) ≤ x ⇐⇒ w ≤ F (x).
So we have
F
X
(x) = P[X ≤ x] = P[(0, F (x)]] = F (x).
Therefore F
X
= F .
This construction is actually very useful in practice. If we are writing
a computer program and want to sample a random variable, we will use this
procedure. The computer usually comes with a uniform (pseudo)-random number
generator. Then using this procedure allows us to produce random variables of
any distribution from a uniform sample.
The next thing we want to consider is the notion of independence of random
variables. Recall that for random variables
X, Y
, we used to say that they are
independent if for any A, B, we have
P[X ∈ A, Y ∈ B] = P[X ∈ A]P[Y ∈ B].
But this is exactly the statement that the
σ
-algebras generated by
X
and
Y
are
independent!
Definition
(Independence of random variables)
.
A family (
X
n
) of random vari-
ables is said to be independent if the family of
σ
-algebras (
σ
(
X
n
)) is independent.
Proposition. Two real-valued random variables X, Y are independent iff
P[X ≤ x, Y ≤ y] = P[X ≤ x]P[Y ≤ y].
More generally, if (
X
n
) is a sequence of real-valued random variables, then they
are independent iff
P[x
1
≤ x
1
, ··· , x
n
≤ x
n
] =
n
Y
j=1
P[X
j
≤ x
j
]
for all n and x
j
.
Proof.
The
⇒
direction is obvious. For the other direction, we simply note that
{(−∞, x] : x ∈ R} is a generating π-system for the Borel σ-algebra of R.
In probability, we often say things like “let
X
1
, X
2
, ···
be iid random vari-
ables”. However, how can we guarantee that iid random variables do indeed
exist? We start with the less ambitious goal of finding iid
Bernoulli
(1
/
2) random
variables:
Proposition. Let
(Ω, F, P) = ((0, 1), B(0, 1), Lebesgue).
be our probability space. Then there exists as sequence
R
n
of independent
Bernoulli(1/2) random variables.
Proof.
Suppose we have
ω ∈
Ω = (0
,
1). Then we write
ω
as a binary expansion
ω =
∞
X
n=1
ω
n
2
−n
,
where
ω
n
∈ {
0
,
1
}
. We make the binary expansion unique by disallowing infinite
sequences of zeroes.
We define
R
n
(
ω
) =
ω
n
. We will show that
R
n
is measurable. Indeed, we can
write
R
1
(ω) = ω
1
= 1
(1/2,1]
(ω),
where
1
(1/2,1]
is the indicator function. Since indicator functions of measurable
sets are measurable, we know R
1
is measurable. Similarly, we have
R
2
(ω) = 1
(1/4,1/2]
(ω) + 1
(3/4,1]
(ω).
So this is also a measurable function. More generally, we can do this for any
R
n
(ω): we have
R
n
(ω) =
2
n−1
X
j=1
1
(2
−n
(2j−1),2
−n
(2j)]
(ω).
So each
R
n
is a random variable, as each can be expressed as a sum of indicators
of measurable sets.
Now let’s calculate
P[R
n
= 1] =
2n−1
X
j=1
2
−n
((2j) − (2j − 1)) =
2n−1
X
j=1
2
−n
=
1
2
.
Then we have
P[R
n
= 0] = 1 − P[R
n
= 1] =
1
2
as well. So R
n
∼ Bernoulli(1/2).
We can straightforwardly check that (
R
n
) is an independent sequence, since
for n 6= m, we have
P[R
n
= 0 and R
m
= 0] =
1
4
= P[R
n
= 0]P[R
m
= 0].
We will now use the (
R
n
) to construct any independent sequence for any
distribution.
Proposition. Let
(Ω, F, P) = ((0, 1), B(0, 1), Lebesgue).
Given any sequence (
F
n
) of distribution functions, there is a sequence (
X
n
) of
independent random variables with F
X
n
= F
n
for all n.
Proof. Let m : N
2
→ N be any bijection, and relabel
Y
k,n
= R
m(k,n)
,
where the R
j
are as in the previous random variable. We let
Y
n
=
∞
X
k=1
2
−k
Y
k,n
.
Then we know that (
Y
n
) is an independent sequence of random variables, and
each is uniform on (0, 1). As before, we define
G
n
(y) = inf{x : y ≤ F
n
(x)}.
We set
X
n
=
G
n
(
Y
n
). Then (
X
n
) is a sequence of random variables with
F
X
n
= F
n
.
We end the section with a random fact: let (Ω
, F, P
) and
R
j
be as above.
Then
1
n
P
n
j=1
R
j
is the average of
n
independent of
Bernoulli
(1
/
2) random
variables. The weak law of large numbers says for any ε > 0, we have
P
1
n
n
X
j=1
R
j
−
1
2
≥ ε
→ 0 as n → ∞.
The strong law of large numbers, which we will prove later, says that
P
ω :
1
n
n
X
j=1
R
j
→
1
2
= 1.
So “almost every number” in (0
,
1) has an equal proportion of 0’s and 1’s in its
binary expansion. This is known as the normal number theorem.
2.4 Convergence of measurable functions
The next thing to look at is the convergence of measurable functions. In measure
theory, wonderful things happen when we talk about convergence. In analysis,
most of the time we had to require uniform convergence, or even stronger
notions, if we want limits to behave well. However, in measure theory, the kinds
of convergence we talk about are somewhat pointwise in nature. In fact, it
will be weaker than pointwise convergence. Yet, we are still going to get good
properties out of them.
Definition
(Convergence almost everywhere)
.
Suppose that (
E, E, µ
) is a mea-
sure space. Suppose that (
f
n
)
, f
are measurable functions. We say
f
n
→ f
almost everywhere (a.e.) if
µ({x ∈ E : f
n
(x) 6→ f (x)}) = 0.
If (E, E, µ) is a probability space, this is called almost sure convergence.
To see this makes sense, i.e. the set in there is actually measurable, note that
{x ∈ E : f
n
(x) 6→ f (x)} = {x ∈ E : lim sup |f
n
(x) − f(x)| > 0}.
We have previously seen that
lim sup |f
n
−f|
is non-negative measurable. So the
set {x ∈ E : lim sup |f
n
(x) − f(x)| > 0} is measurable.
Another useful notion of convergence is convergence in measure.
Definition
(Convergence in measure)
.
Suppose that (
E, E, µ
) is a measure space.
Suppose that (
f
n
)
, f
are measurable functions. We say
f
n
→ f
in measure if for
each ε > 0, we have
µ({x ∈ E : |f
n
(x) − f(x)| ≥ ε}) → 0 as n → ∞,
then we say that f
n
→ f in measure.
If (
E, E, µ
) is a probability space, then this is called convergence in probability.
In the case of a probability space, this says
P(|X
n
− X| ≥ ε) → 0 as n → ∞
for all ε, which is how we state the weak law of large numbers in the past.
After we define integration, we can consider the norms of a function f by
kfk
p
=
Z
|f(x)|
p
dx
1/p
.
Then in particular, if
kf
n
−fk
p
→
0, then
f
n
→ f
in measure, and this provides
an easy way to see that functions converge in measure.
In general, neither of these notions imply each other. However, the following
theorem provides us with a convenient dictionary to translate between the two
notions.
Theorem.
(i) If µ(E) < ∞, then f
n
→ f a.e. implies f
n
→ f in measure.
(ii)
For any
E
, if
f
n
→ f
in measure, then there exists a subsequence (
f
n
k
)
such that f
n
k
→ f a.e.
Proof.
(i) First suppose µ(E) < ∞, and fix ε > 0. Consider
µ({x ∈ E : |f
n
(x) − f(x)| ≤ ε}).
We use the result from the first example sheet that for any sequence of
events (A
n
), we have
lim inf µ(A
n
) ≥ µ(lim inf A
n
).
Applying to the above sequence says
lim inf µ({x : |f
n
(x) − f(x)| ≤ ε}) ≥ µ({x : |f
m
(x) − f(x)| ≤ ε eventually})
≥ µ({x ∈ E : |f
m
(x) − f(x)| → 0})
= µ(E).
As µ(E) < ∞, we have µ({x ∈ E : |f
n
(x) − f(x)| > ε}) → 0 as n → ∞.
(ii) Suppose that f
n
→ f in measure. We pick a subsequence (n
k
) such that
µ
x ∈ E : |f
n
k
(x) − f(x)| >
1
k
≤ 2
−k
.
Then we have
∞
X
k=1
µ
x ∈ E : f
n
k
(x) − f(x)| >
1
k
≤
∞
X
k=1
2
−k
= 1 < ∞.
By the first Borel–Cantelli lemma, we know
µ
x ∈ E : |f
n
k
(x) − f(x)| >
1
k
i.o.
= 0.
So f
n
k
→ f a.e.
It is important that we assume that µ(E) < ∞ for the first part.
Example.
Consider (
E, E, µ
) = (
R, B, Lebesgue
). Take
f
n
(
x
) =
1
[n,∞)
(
x
).
Then
f
n
(
x
)
→
0 for all
x
, and in particular almost everywhere. However, we
have
µ
x ∈ R : |f
n
(x)| >
1
2
= µ([n, ∞)) = ∞
for all n.
There is one last type of convergence we are interested in. We will only
first formulate it in the probability setting, but there is an analogous notion in
measure theory known as weak convergence, which we will discuss much later on
in the course.
Definition
(Convergence in distribution)
.
Let (
X
n
)
, X
be random variables
with distribution functions
F
X
n
and
F
X
, then we say
X
n
→ X
in distribution if
F
X
n
(x) → F
X
(x) for all x ∈ R at which F
X
is continuous.
Note that here we do not need that (
X
n
) and
X
live on the same probability
space, since we only talk about the distribution functions.
But why do we have the condition with continuity points? The idea is that
if the resulting distribution has a “jump” at
x
, it doesn’t matter which side of
the jump
F
X
(
x
) is at. Here is a simple example that tells us why this is very
important:
Example.
Let
X
n
to be uniform on [0
,
1
/n
]. Intuitively, this should converge
to the random variable that is always zero.
We can compute
F
X
n
(x) =
0 x ≤ 0
nx 0 < x < 1/n
1 x ≥ 1/n
.
We can also compute the distribution of the zero random variable as
F
0
=
(
0 x < 0
1 x ≥ 0
.
But F
X
n
(0) = 0 for all n, while F
X
(0) = 1.
One might now think of cheating by cooking up some random variable such
that
F
is discontinuous at so many points that random, unrelated things converge
to
F
. However, this cannot be done, because
F
is a non-decreasing function,
and thus can only have countably many points of discontinuities.
The big theorem we are going to prove about convergence in distribution is
that actually it is very boring and doesn’t give us anything new.
Theorem (Skorokhod representation theorem of weak convergence).
(i)
If (
X
n
)
, X
are defined on the same probability space, and
X
n
→ X
in
probability. Then X
n
→ X in distribution.
(ii)
If
X
n
→ X
in distribution, then there exists random variables (
˜
X
n
) and
˜
X
defined on a common probability space with
F
˜
X
n
=
F
X
n
and
F
˜
X
=
F
X
such that
˜
X
n
→
˜
X a.s.
Proof. Let S = {x ∈ R : F
X
is continuous}.
(i)
Assume that
X
n
→ X
in probability. Fix
x ∈ S
. We need to show that
F
X
n
(x) → F
X
(x) as n → ∞.
We fix
ε >
0. Since
x ∈ S
, this implies that there is some
δ >
0 such that
F
X
(x − δ) ≥ F
X
(x) −
ε
2
F
X
(x + δ) ≤ F
X
(x) +
ε
2
.
We fix N large such that n ≥ N implies P[|X
n
− X| ≥ δ] ≤
ε
2
. Then
F
X
n
(x) = P[X
n
≤ x]
= P[(X
n
− X) + X ≤ x]
We now notice that
{
(
X
n
−X
) +
X ≤ x} ⊆ {X ≤ x
+
δ}∪{|X
n
−X| > δ}
.
So we have
≤ P[X ≤ x + δ] + P[|X
n
− X| > δ]
≤ F
X
(x + δ) +
ε
2
≤ F
X
(x) + ε.
We similarly have
F
X
n
(x) = P[X
n
≤ x]
≥ P[X ≤ x − δ] − P[|X
n
− X| > δ]
≥ F
X
(x − δ) −
ε
2
≥ F
X
(x) − ε.
Combining, we have that
n ≥ N
implying
|F
x
n
(
x
)
− F
X
(
x
)
| ≤ ε
. Since
ε
was arbitrary, we are done.
(ii) Suppose X
n
→ X in distribution. We again let
(Ω, F, B) = ((0, 1), B((0, 1)), Lebesgue).
We let
˜
X
n
(ω) = inf{x : ω ≤ F
X
n
(x)},
˜
X(ω) = inf{x : ω ≤ F
X
(x)}.
Recall from before that
˜
X
n
has the same distribution function as
X
n
for
all n, and
˜
X has the same distribution as X. Moreover, we have
˜
X
n
(ω) ≤ x ⇔ ω ≤ F
X
n
(x)
x <
˜
X
n
(ω) ⇔ F
X
n
(x) < ω,
and similarly if we replace X
n
with X.
We are now going to show that with this particular choice, we have
˜
X
n
→
˜
X
a.s.
Note that
˜
X
is a non-decreasing function (0
,
1)
→ R
. Then by general
analysis,
˜
X has at most countably many discontinuities. We write
Ω
0
= {ω ∈ (0, 1) :
˜
X is continuous at ω
0
}.
Then (0, 1) \ Ω
0
is countable, and hence has Lebesgue measure 0. So
P[Ω
0
] = 1.
We are now going to show that
˜
X
n
(ω) →
˜
X(ω) for all ω ∈ Ω
0
.
Note that
F
X
is a non-decreasing function, and hence the points of discon-
tinuity
R \ S
is also countable. So
S
is dense in
R
. Fix
ω ∈
Ω
0
and
ε >
0.
We want to show that |
˜
X
n
(ω) −
˜
X(ω)| ≤ ε for all n large enough.
Since S is dense in R, we can find x
−
, x
+
in S such that
x
−
<
˜
X(ω) < x
+
and
x
+
−x
−
< ε
. What we want to do is to use the characteristic property
of
˜
X and F
X
to say that this implies
F
X
(x
−
) < ω < F
X
(x
+
).
Then since
F
X
n
→ F
X
at the points
x
−
, x
+
, for sufficiently large
n
, we
have
F
X
n
(x
−
) < ω < F
X
n
(x
+
).
Hence we have
x
−
<
˜
X
n
(ω) < x
+
.
Then it follows that |
˜
X
n
(ω) −
˜
X(ω)| < ε.
However, this doesn’t work, since
˜
X
(
ω
)
< x
+
only implies
ω ≤ F
X
(
x
+
),
and our argument will break down. So we do a funny thing where we
introduce a new variable ω
+
.
Since
˜
X
is continuous at
ω
, we can find
ω
+
∈
(
ω,
1) such that
˜
X
(
ω
+
)
≤ x
+
.
˜
X(ω)
ω
x
−
x
+
ω
+
ε
Then we have
x
−
<
˜
X(ω) ≤
˜
X(ω
+
) < x
+
.
Then we have
F
X
(x
−
) < ω < ω
+
≤ F
X
(x
+
).
So for sufficiently large n, we have
F
X
n
(x
−
) < ω < F
X
n
(x
+
).
So we have
x
−
<
˜
X
n
(ω) ≤ x
+
,
and we are done.
2.5 Tail events
Finally, we are going to quickly look at tail events. These are events that depend
only on the asymptotic behaviour of a sequence of random variables.
Definition
(Tail
σ
-algebra)
.
Let (
X
n
) be a sequence of random variables. We
let
T
n
= σ(X
n+1
, X
n+2
, ···),
and
T =
\
n
T
n
.
Then T is the tail σ-algebra.
Then
T
-measurable events and random variables only depend on the asymp-
totic behaviour of the X
n
’s.
Example. Let (X
n
) be a sequence of real-valued random variables. Then
lim sup
n→∞
1
n
n
X
j=1
X
j
, lim inf
n→∞
1
n
n
X
j=1
X
j
are T -measurable random variables. Finally,
lim
n→∞
1
n
n
X
j=1
X
j
exists
∈ T ,
since this is just the set of all points where the previous two things agree.
Theorem
(Kolmogorov 0-1 law)
.
Let (
X
n
) be a sequence of independent (real-
valued) random variables. If A ∈ T , then P[A] = 0 or 1.
Moreover, if
X
is a
T
-measurable random variable, then there exists a
constant c such that
P[X = c] = 1.
Proof.
The proof is very funny the first time we see it. We are going to prove
the theorem by checking something that seems very strange. We are going to
show that if A ∈ T , then A is independent of A. It then follows that
P[A] = P[A ∩ A] = P[A]P[A],
so P[A] = 0 or 1. In fact, we are going to prove that T is independent of T .
Let
F
n
= σ(X
1
, ··· , X
n
).
This σ-algebra is generated by the π-system of events of the form
A = {X
1
≤ x
1
, ··· , X
n
≤ x
n
}.
Similarly,
T
n
=
σ
(
X
n+1
, X
n+2
, ···
) is generated by the
π
-system of events of the
form
B = {X
n+1
≤ x
n+1
, ··· , X
n+k
≤ x
n+k
},
where k is any natural number.
Since the X
n
are independent, we know for any such A and B, we have
P[A ∩ B] = P[A]P[B].
Since this is true for all A and B, it follows that F
n
is independent of T
n
.
Since T =
T
k
T
k
⊆ T
n
for each n, we know F
n
is independent of T .
Now
S
k
F
k
is a
π
-system, which generates the
σ
-algebra
F
∞
=
σ
(
X
1
, X
2
, ···
).
We know that if
A ∈
S
n
F
n
, then there has to exist an index
n
such that
A ∈ F
n
.
So A is independent of T . So F
∞
is independent of T .
Finally, note that T ⊆ F
∞
. So T is independent of T .
To find the constant, suppose that X is T -measurable. Then
P[X ≤ x] ∈ {0, 1}
for all x ∈ R since {X ≤ x} ∈ T .
Now take
c = inf{x ∈ R : P[X ≤ x] = 1}.
Then with this particular choice of
c
, it is easy to see that
P
[
X
=
c
] = 1. This
completes the proof of the theorem.
3 Integration
3.1 Definition and basic properties
We are now going to work towards defining the integral of a measurable function
on a measure space (
E, E, µ
). Different sources use different notations for the
integral. The following notations are all commonly used:
µ(f) =
Z
E
f dµ =
Z
E
f(x) dµ(x) =
Z
E
f(x)µ(dx).
In the case where (E, E, µ) = (R, B, Lebesgue), people often just write this as
µ(f) =
Z
R
f(x) dx.
On the other hand, if (
E, E, µ
) = (Ω
, F, P
) is a probability space, and
X
is a
random variable, then people write the integral as E[X], the expectation of X.
So how are we going to define the integral? There are two steps to defining
the integral. The idea is that we first define the integral on simple functions,
and then extend the definition to more general measurable functions by taking
the limit. When we do the definition for simple functions, it will be obvious that
the definition satisfies the nice properties, and we will have to check that they
are preserved when we take the limit.
Definition
(Simple function)
.
A simple function is a measurable function that
can be written as a finite non-negative linear combination of indicator functions
of measurable sets, i.e.
f =
n
X
k=1
a
k
1
A
k
for some A
k
∈ E and a
k
≥ 0.
Note that some sources do not assume that
a
k
≥
0, but assuming this makes
our life easier.
It is obvious that
Proposition.
A function is simple iff it is measurable, non-negative, and takes
on only finitely-many values.
Definition (Integral of simple function). The integral of a simple function
f =
n
X
k=1
a
k
1
A
k
is given by
µ(f) =
n
X
k=1
a
k
µ(A
k
).
Note that it can be that
µ
(
A
k
) =
∞
, but
a
k
= 0. When this happens, we
are just going to declare that 0
· ∞
= 0 (this makes sense because this means
we are ignoring all 0
· 1
A
terms for any
A
). After we do this, we can check the
integral is well-defined.
We are now going to extend this definition to non-negative measurable
functions by a limiting procedure. Once we’ve done this, we are going to extend
the definition to measurable functions by linearity of the integral. Then we
would have a definition of the integral, and we are going to deduce properties of
the integral using approximation.
Definition (Integral). Let f be a non-negative measurable function. We set
µ(f) = sup{µ(g) : g ≤ f, g is simple}.
For arbitrary f , we write
f = f
+
− f
−
= (f ∨ 0) + (f ∧ 0).
We put |f | = f
+
+ f
−
. We say f is integrable if µ(|f|) < ∞. In this case, set
µ(f) = µ(f
+
) − µ(f
−
).
If only one of
µ
(
f
+
)
, µ
(
f
−
)
< ∞
, then we can still make the above definition,
and the result will be infinite.
In the case where we are integrating over (a subset of) the reals, we call it
the Lebesgue integral.
Proposition.
Let
f
: [0
,
1]
→ R
be Riemann integrable. Then it is also Lebesgue
integrable, and the two integrals agree.
We will not prove this, but this immediately gives us results like the funda-
mental theorem of calculus, and also helps us to actually compute the integral.
However, note that this does not hold for infinite domains, as you will see in the
second example sheet.
But the Lebesgue integrable functions are better. A lot of functions are
Lebesgue integrable but not Riemann integrable.
Example. Take the standard non-Riemann integrable function
f = 1
[0,1]\Q
.
Then f is not Riemann integrable, but it is Lebesgue integrable, since
µ(f) = µ([0, 1] \ Q) = 1.
We are now going to study some basic properties of the integral. We will first
look at the properties of integrals of simple functions, and then extend them to
general integrable functions.
For f, g simple, and α, β ≥ 0, we have that
µ(αf + βg) = αµ(f) + βµ(g).
So the integral is linear.
Another important property is monotonicity — if f ≤ g, then µ(f ) ≤ µ(g).
Finally, we have
f
= 0 a.e. iff
µ
(
f
) = 0. It is absolutely crucial here that we
are talking about non-negative functions.
Our goal is to show that these three properties are also satisfied for arbitrary
non-negative measurable functions, and the first two hold for integrable functions.
In order to achieve this, we prove a very important tool — the monotone
convergence theorem. Later, we will also learn about the dominated convergence
theorem and Fatou’s lemma. These are the main and very important results
about exchanging limits and integration.
Theorem
(Monotone convergence theorem)
.
Suppose that (
f
n
)
, f
are non-
negative measurable with f
n
% f . Then µ(f
n
) % µ(f ).
In the proof we will use the fact that the integral is monotonic, which we
shall prove later.
Proof.
We will split the proof into five steps. We will prove each of the following
in turn:
(i) If f
n
and f are indicator functions, then the theorem holds.
(ii) If f is an indicator function, then the theorem holds.
(iii) If f is simple, then the theorem holds.
(iv) If f is non-negative measurable, then the theorem holds.
Each part follows rather straightforwardly from the previous one, and the reader
is encouraged to try to prove it themself.
We first consider the case where
f
n
=
1
A
n
and
f
=
1
A
. Then
f
n
% f
is true
iff A
n
% A. On the other hand, µ(f
n
) % µ(f ) iff µ(A
n
) % µ(A).
For convenience, we let A
0
= ∅. We can write
µ(A) = µ
[
n
A
n
\ A
n−1
!
=
∞
X
n=1
µ(A
n
\ A
n−1
)
= lim
N→∞
N
X
n=1
µ(A
n
\ A
n−1
)
= lim
N→∞
µ(A
N
).
So done.
We next consider the case where f = 1
A
for some A. Fix ε > 0, and set
A
n
= {f
n
> 1 − ε} ∈ E.
Then we know that A
n
% A, as f
n
% f . Moreover, by definition, we have
(1 − ε)1
A
n
≤ f
n
≤ f = 1
A
.
As A
n
% A, we have that
(1 − ε)µ(f) = (1 − ε) lim
n→∞
µ(A
n
) ≤ lim
n→∞
µ(f
n
) ≤ µ(f )
since f
n
≤ f . Since ε is arbitrary, we know that
lim
n→∞
µ(f
n
) = µ(f).
Next, we consider the case where f is simple. We write
f =
m
X
k=1
a
k
1
A
k
,
where a
k
> 0 and A
k
are pairwise disjoint. Since f
n
% f , we know
a
−1
k
f
n
1
A
k
% 1
A
k
.
So we have
µ(f
n
) =
m
X
k=1
µ(f
n
1
A
k
) =
m
X
k=1
a
k
µ(a
−1
k
f
n
1
A
k
) →
m
X
k=1
a
k
µ(A
k
) = µ(f).
Suppose
f
is non-negative measurable. Suppose
g ≤ f
is a simple function.
As
f
n
% f
, we know
f
n
∧g % f ∧g
=
g
. So by the previous case, we know that
µ(f
n
∧ g) → µ(g).
We also know that
µ(f
n
) ≥ µ(f
n
∧ g).
So we have
lim
n→∞
µ(f
n
) ≥ µ(g)
for all g ≤ f. This is possible only if
lim
n→∞
µ(f
n
) ≥ µ(f )
by definition of the integral. However, we also know that
µ
(
f
n
)
≤ µ
(
f
) for all
n
,
again by definition of the integral. So we must have equality. So we have
µ(f) = lim
n→∞
µ(f
n
).
Theorem. Let f, g be non-negative measurable, and α, β ≥ 0. We have that
(i) µ(αf + βg) = αµ(f) + βµ(g).
(ii) f ≤ g implies µ(f ) ≤ µ(g).
(iii) f = 0 a.e. iff µ(f) = 0.
Proof.
(i) Let
f
n
= 2
−n
b2
n
fc ∧ n
g
n
= 2
−n
b2
n
gc∧ n.
Then
f
n
, g
n
are simple with
f
n
% f
and
g
n
% g
. Hence
µ
(
f
n
)
% µ
(
f
)
and
µ
(
g
n
)
% µ
(
g
) and
µ
(
αf
n
+
βg
n
)
% µ
(
αf
+
βg
), by the monotone
convergence theorem. As f
n
, g
n
are simple, we have that
µ(αf
n
+ βg
n
) = αµ(f
n
) + βµ(g
n
).
Taking the limit as n → ∞, we get
µ(αf + βg) = αµ(f) + βµ(g).
(ii)
We shall be careful not to use the monotone convergence theorem. We
have
µ(g) = sup{µ(h) : h ≤ g simple}
≥ sup{µ(h) : h ≤ f simple}
= µ(f).
(iii) Suppose f 6= 0 a.e. Let
A
n
=
x : f (x) >
1
n
.
Then
{x : f (x) 6= 0} =
[
n
A
n
.
Since the left hand set has non-negative measure, it follows that there is
some A
n
with non-negative measure. For that n, we define
h =
1
n
1
A
n
.
Then µ(f ) ≥ µ(h) > 0. So µ(f) 6= 0.
Conversely, suppose f = 0 a.e. We let
f
n
= 2
−n
b2
n
fc ∧ n
be a simple function. Then f
n
% f and f
n
= 0 a.e. So
µ(f) = lim
n→∞
µ(f
n
) = 0.
We now prove the analogous statement for general integrable functions.
Theorem. Let f, g be integrable, and α, β ≥ 0. We have that
(i) µ(αf + βg) = αµ(f) + βµ(g).
(ii) f ≤ g implies µ(f ) ≤ µ(g).
(iii) f = 0 a.e. implies µ(f) = 0.
Note that in the last case, the converse is no longer true, as one can easily
see from the sign function sgn : [−1, 1] → R.
Proof.
(i) We are going to prove these by applying the previous theorem.
By definition of the integral, we have
µ
(
−f
) =
−µ
(
f
). Also, if
α ≥
0, then
µ(αf) = µ(αf
+
) − µ(αf
−
) = αµ(f
+
) − αµ(f
−
) = αµ(f).
Combining these two properties, it then follows that if
α
is a real number,
then
µ(αf) = αµ(f).
To finish the proof of (i), we have to show that
µ
(
f
+
g
) =
µ
(
f
) +
µ
(
g
).
We know that this is true for non-negative functions, so we need to employ
a little trick to make this a statement about the non-negative version. If
we let h = f + g, then we can write this as
h
+
− h
−
= (f
+
− f
−
) + (g
+
− g
−
).
We now rearrange this as
h
+
f
−
+ g
−
= f
+
+ g
+
+ h
−
.
Now everything is non-negative measurable. So applying µ gives
µ(f
+
) + µ(f
−
) + µ(g
−
) = µ(f
+
) + µ(g
+
) + µ(h
−
).
Rearranging, we obtain
µ(h
+
) − µ(h
−
) = µ(f
+
) − µ(f
−
) + µ(g
+
) − µ(g
−
).
This is exactly the same thing as saying
µ(f + g) = µ(h) = µ(f) = µ(g).
(ii)
If
f ≤ g
, then
g −f ≥
0. So
µ
(
g −f
)
≥
0. By (i), we know
µ
(
g
)
−µ
(
f
)
≥
0.
So µ(g) ≥ µ(f).
(iii)
If
f
= 0 a.e., then
f
+
, f
−
= 0 a.e. So
µ
(
f
+
) =
µ
(
f
−
) = 0. So
µ
(
f
) =
µ(f
+
) − µ(f
−
) = 0.
As mentioned, the converse to (iii) is no longer true. However, we do have
the following partial converse:
Proposition.
If
A
is a
π
-system with
E ∈ A
and
σ
(
A
) =
E
, and
f
is an
integrable function that
µ(f1
A
) = 0
for all A ∈ A. Then µ(f ) = 0 a.e.
Proof. Let
D = {A ∈ E : µ(f 1
A
) = 0}.
It follows immediately from the properties of the integral that
D
is a d-system.
So D = E by Dynkin’s lemma. Let
A
+
= {x ∈ E : f(x) > 0},
A
−
= {x ∈ E : f(x) < 0}.
Then A
±
∈ E, and
µ(f1
A
+
) = µ(f1
A
−
) = 0.
So f 1
A
+
and f 1
A
−
vanish a.e. So f vanishes a.e.
Proposition.
Suppose that (
g
n
) is a sequence of non-negative measurable
functions. Then we have
µ
∞
X
n=1
g
n
!
=
∞
X
n=1
µ(g
n
).
Proof. We know
N
X
n=1
g
n
!
%
∞
X
n=1
g
n
!
as N → ∞. So by the monotone convergence theorem, we have
N
X
n=1
µ(g
n
) = µ
N
X
n=1
g
n
!
% µ
∞
X
n=1
g
n
!
.
But we also know that
N
X
n=1
µ(g
n
) %
∞
X
n=1
µ(g
n
)
by definition. So we are done.
So for non-negative measurable functions, we can always switch the order of
integration and summation.
Note that we can consider summation as integration. We let
E
=
N
and
E
=
{all subsets of N}
. We let
µ
be the counting measure, so that
µ
(
A
) is the
size of
A
. Then integrability (and having a finite integral) is the same as absolute
convergence. Then if it converges, then we have
Z
f dµ =
∞
X
n=1
f(n).
So we can just view our proposition as proving that we can swap the order of
two integrals. The general statement is known as Fubini’s theorem.
3.2 Integrals and limits
We are now going to prove more things about exchanging limits and integrals.
These are going to be extremely useful in the future, as we want to exchange
limits and integrals a lot.
Theorem
(Fatou’s lemma)
.
Let (
f
n
) be a sequence of non-negative measurable
functions. Then
µ(lim inf f
n
) ≤ lim inf µ(f
n
).
Note that a special case was proven in the first example sheet, where we did
it for the case where f
n
are indicator functions.
Proof.
We start with the trivial observation that if
k ≥ n
, then we always have
that
inf
m≥n
f
m
≤ f
k
.
By the monotonicity of the integral, we know that
µ
inf
m≥n
f
m
≤ µ(f
k
).
for all k ≥ n.
So we have
µ
inf
m≥n
f
m
≤ inf
k≥n
µ(f
k
) ≤ lim inf
m
µ(f
m
).
It remains to show that the left hand side converges to
µ
(
lim inf f
m
). Indeed,
we know that
inf
m≥n
f
m
% lim inf
m
f
m
.
Then by monotone convergence, we have
µ
inf
m≥n
f
m
% µ
lim inf
m
f
m
.
So we have
µ
lim inf
m
f
m
≤ lim inf
m
µ(f
m
).
No one ever remembers which direction Fatou’s lemma goes, and this leads to
many incorrect proofs and results, so it is helpful to keep the following example
in mind:
Example. We let (E, E, µ) = (R, B, Lebesgue). We let
f
n
= 1
[n,n+1]
.
Then we have
lim inf
n
f
n
= 0.
So we have
µ(f
n
) = 1 for all n.
So we have
lim inf µ(f
n
) = 1, µ(lim inf f
n
) = 0.
So we have
µ
lim inf
m
f
m
≤ lim inf
m
µ(f
m
).
The next result we want to prove is the dominated convergence theorem.
This is like the monotone convergence theorem, but we are going to remove the
increasing and non-negative measurable condition, and add in something else.
Theorem
(Dominated convergence theorem)
.
Let (
f
n
)
, f
be measurable with
f
n
(
x
)
→ f
(
x
) for all
x ∈ E
. Suppose that there is an integrable function
g
such
that
|f
n
| ≤ g
for all n, then we have
µ(f
n
) → µ(f )
as n → ∞.
Proof. Note that
|f| = lim
n
|f|
n
≤ g.
So we know that
µ(|f|) ≤ µ(g) < ∞.
So we know that f, f
n
are integrable.
Now note also that
0 ≤ g + f
n
, 0 ≤ g − f
n
for all
n
. We are now going to apply Fatou’s lemma twice with these series. We
have that
µ(g) + µ(f) = µ(g + f)
= µ
lim inf
n
(g + f
n
)
≤ lim inf
n
µ(g + f
n
)
= lim inf
n
(µ(g) + µ(f
n
))
= µ(g) + lim inf
n
µ(f
n
).
Since µ(g) is finite, we know that
µ(f) ≤ lim inf
n
µ(f
n
).
We now do the same thing with g − f
n
. We have
µ(g) − µ(f) = µ(g − f)
= µ
lim inf
n
(g − f
n
)
≤ lim inf
n
µ(g − f
n
)
= lim inf
n
(µ(g) − µ(f
n
))
= µ(g) − lim sup
n
µ(f
n
).
Again, since µ(g) is finite, we know that
µ(f) ≥ lim sup
n
µ(f
n
).
These combine to tell us that
µ(f) ≤ lim inf
n
µ(f
n
) ≤ lim sup
n
µ(f
n
) ≤ µ(f ).
So they must be all equal, and thus µ(f
n
) → µ(f ).
3.3 New measures from old
We have previously considered several ways of constructing measures from old
ones, such as the image measure. We are now going to study a few more ways of
constructing new measures, and see how integrals behave when we do these.
Definition
(Restriction of measure space)
.
Let (
E, E, µ
) be a measure space,
and let A ∈ E. The restriction of the measure space to A is (A, E
A
, µ
A
), where
E
A
= {B ∈ E : B ⊆ A},
and µ
A
is the restriction of µ to E
A
, i.e.
µ
A
(B) = µ(B)
for all B ∈ E
A
.
It is easy to check the following:
Lemma.
For (
E, E, µ
) a measure space and
A ∈ E
, the restriction to
A
is a
measure space.
Proposition.
Let (
E, E, µ
) and (
F, F, µ
0
) be measure spaces and
A ∈ E
. Let
f : E → F be a measurable function. Then f|
A
is E
A
-measurable.
Proof. Let B ∈ F. Then
(f|
A
)
−1
(B) = f
−1
(B) ∩ A ∈ E
A
.
Similarly, we have
Proposition.
If
f
is integrable, then
f|
A
is
µ
A
-integrable and
µ
A
(
f|
A
) =
µ(f1
A
).
Note that means we have
µ(f1
A
) =
Z
E
f1
A
dµ =
Z
A
f dµ
A
.
Usually, we are lazy and just write
µ(f1
A
) =
Z
A
f dµ.
In the particular case of Lebesgue integration, if
A
is an interval with left and
right end points
a, b
(i.e. it can be open, closed, half open or half closed), then
we write
Z
A
f dµ =
Z
b
a
f(x) dx.
There is another construction we would be interested in.
Definition
(Pushforward/image of measure)
.
Let (
E, E
) and (
G, G
) be measure
spaces, and
f
:
E → G
a measurable function. If
µ
is a measure on (
E, E
), then
ν = µ ◦ f
−1
is a measure on (G, G), known as the pushforward or image measure.
We have already seen this before, but we can apply this to integration as
follows:
Proposition. If g is a non-negative measurable function on G, then
ν(g) = µ(g ◦ f).
Proof. Exercise using the monotone class theorem (see example sheet).
Finally, we can specify a measure by specifying a density.
Definition
(Density)
.
Let (
E, E, µ
) be a measure space, and
f
be a non-negative
measurable function. We define
ν(A) = µ(f 1
A
).
Then ν is a measure on (E, E).
Proposition. The ν defined above is indeed a measure.
Proof.
(i) ν(φ) = µ(f 1
∅
) = µ(0) = 0.
(ii) If (A
n
) is a disjoint sequence in E, then
ν
[
A
n
= µ(f1
S
A
n
) = µ
f
X
1
A
n
=
X
µ (f1
A
n
) =
X
ν(f ).
Definition
(Density)
.
Let
X
be a random variable. We say
X
has a density if
its law
µ
X
has a density with respect to the Lebesgue measure. In other words,
there exists f
X
non-negative measurable so that
µ
X
(A) = P[X ∈ A] =
Z
A
f
X
(x) dx.
In this case, for any non-negative measurable function, for any non-negative
measurable g, we have that
E[g(X)] =
Z
R
g(x)f
X
(x) dx.
3.4 Integration and differentiation
In “normal” calculus, we had three results involving both integration and dif-
ferentiation. One was the fundamental theorem of calculus, which we already
stated. The others are the change of variables formula, and differentiating under
the integral sign.
We start by proving the change of variables formula.
Proposition
(Change of variables formula)
.
Let
φ
: [
a, b
]
→ R
be continuously
differentiable and increasing. Then for any bounded Borel function g, we have
Z
φ(b)
φ(a)
g(y) dy =
Z
b
a
g(φ(x))φ
0
(x) dx. (∗)
We will use the monotone class theorem.
Proof. We let
V = {Borel functions g such that (∗) holds}.
We will want to use the monotone class theorem to show that this includes all
bounded functions.
We already know that
(i) V
contains
1
A
for all
A
in the
π
-system of intervals of the form [
u, v
]
⊆
[
a, b
].
This is just the fundamental theorem of calculus.
(ii) By linearity of the integral, V is indeed a vector space.
(iii)
Finally, let (
g
n
) be a sequence in
V
, and
g
n
≥
0,
g
n
% g
. Then we know
that
Z
φ(b)
φ(a)
g
n
(y) dy =
Z
b
a
g
n
(φ(x))φ
0
(x) dx.
By the monotone convergence theorem, these converge to
Z
φ(b)
φ(a)
g(y) dy =
Z
b
a
g(φ(x))φ
0
(x) dx.
Then by the monotone class theorem,
V
contains all bounded Borel functions.
The next problem is differentiation under the integral sign. We want to know
when we can say
d
dt
Z
f(x, t) dx =
Z
∂f
∂t
(x, t) dx.
Theorem
(Differentiation under the integral sign)
.
Let (
E, E, µ
) be a space,
and U ⊆ R be an open set, and f : U × E → R. We assume that
(i) For any t ∈ U fixed, the map x 7→ f(t, x) is integrable;
(ii) For any x ∈ E fixed, the map t 7→ f(t, x) is differentiable;
(iii) There exists an integrable function g such that
∂f
∂t
(t, x)
≤ g(x)
for all x ∈ E and t ∈ U.
Then the map
x 7→
∂f
∂t
(t, x)
is integrable for all t, and also the function
F (t) =
Z
E
f(t, x)dµ
is differentiable, and
F
0
(t) =
Z
E
∂f
∂t
(t, x) dµ.
The reason why we want the derivative to be bounded is that we want to
apply the dominated convergence theorem.
Proof.
Measurability of the derivative follows from the fact that it is a limit of
measurable functions, and then integrability follows since it is bounded by g.
Suppose (h
n
) is a positive sequence with h
n
→ 0. Then let
g
n
(x) =
f(t + h
n
, x) − f(t, x)
h
n
−
∂f
∂t
(t, x).
Since
f
is differentiable, we know that
g
n
(
x
)
→
0 as
n → ∞
. Moreover, by the
mean value theorem, we know that
|g
n
(x)| ≤ 2g(x).
On the other hand, by definition of F (t), we have
F (t + h
n
) − F (t)
h
n
−
Z
E
∂f
∂t
(t, x) dµ =
Z
g
n
(x) dx.
By dominated convergence, we know the RHS tends to 0. So we know
lim
n→∞
F (t + h
n
) − F (t)
h
n
→
Z
E
∂f
∂t
(t, x) dµ.
Since
h
n
was arbitrary, it follows that
F
0
(
t
) exists and is equal to the integral.
3.5 Product measures and Fubini’s theorem
Recall the following definition of the product σ-algebra.
Definition
(Product
σ
-algebra)
.
Let (
E
1
, E
1
, µ
1
) and (
E
2
, E
2
, µ
2
) be finite mea-
sure spaces. We let
A = {A
1
× A
2
: A
2
× E
1
, A
2
× E
2
}.
Then A is a π-system on E
1
× E
2
. The product σ-algebra is
E = E
1
⊗ E
2
= σ(A).
We now want to construct a measure on the product
σ
-algebra. We can, of
course, just apply the Caratheodory extension theorem, but we would want a
more explicit description of the integral. The idea is to define, for A ∈ E
1
⊗ E
2
,
µ(A) =
Z
E
1
Z
E
2
1
A
(x
1
, x
2
) µ
2
(dx
2
)
µ
1
(dx
1
).
Doing this has the advantage that it would help us in a step of proving Fubini’s
theorem.
However, before we can make this definition, we need to do some preparation
to make sure the above statement actually makes sense:
Lemma.
Let
E
=
E
1
× E
2
be a product of
σ
-algebras. Suppose
f
:
E → R
is
E-measurable function. Then
(i) For each x
2
∈ E
2
, the function x
1
7→ f (x
1
, x
2
) is E
1
-measurable.
(ii) If f is bounded or non-negative measurable, then
f
2
(x
2
) =
Z
E
1
f(x
1
, x
2
) µ
1
(dx
1
)
is E
2
-measurable.
Proof.
The first part follows immediately from the fact that for a fixed
x
2
,
the map
ι
1
:
E
1
→ E
given by
ι
1
(
x
1
) = (
x
1
, x
2
) is measurable, and that the
composition of measurable functions is measurable.
For the second part, we use the monotone class theorem. We let
V
be
the set of all measurable functions
f
such that
x
2
7→
R
E
1
f
(
x
1
, x
2
)
µ
1
(d
x
1
) is
E
2
-measurable.
(i)
It is clear that
1
E
, 1
A
∈ V
for all
A ∈ A
(where
A
is as in the definition
of the product σ-algebra).
(ii) V is a vector space by linearity of the integral.
(iii) Suppose (f
n
) is a non-negative sequence in V and f
n
% f , then
x
2
7→
Z
E
1
f
n
(x
1
, x
2
) µ
1
(dx
1
)
%
x
2
7→
Z
E
1
f(x
1
, x
2
) µ(dx
1
)
by the monotone convergence theorem. So f ∈ V .
So the monotone class theorem tells us
V
contains all bounded measurable
functions.
Now if
f
is a general non-negative measurable function, then
f ∧n
is bounded
and measurable, hence
f ∧n ∈ V
. Therefore
f ∈ V
by the monotone convergence
theorem.
Theorem.
There exists a unique measurable function
µ
=
µ
1
⊗ µ
2
on
E
such
that
µ(A
1
× A
2
) = µ(A
1
)µ(A
2
)
for all A
1
× A
2
∈ A.
Here it is crucial that the measure space is finite. Actually, everything
still works for
σ
-finite measure spaces, as we can just reduce to the finite case.
However, things start to go wrong if we don’t have σ-finite measure spaces.
Proof.
One might be tempted to just apply the Caratheodory extension theorem,
but we have a more direct way of doing it here, by using integrals. We define
µ(A) =
Z
E
1
Z
E
2
1
A
(x
1
, x
2
) µ
2
(dx
2
)
µ
1
(dx
1
).
Here the previous lemma is very important. It tells us that these integrals
actually make sense!
We first check that this is a measure:
(i) µ(∅) = 0 is immediate since 1
∅
= 0.
(ii) Suppose (A
n
) is a disjoint sequence and A =
S
A
n
. Then we have
µ(A) =
Z
E
1
Z
E
2
1
A
(x
1
, x
2
) µ
2
(dx
2
)
µ
1
(dx
1
)
=
Z
E
1
Z
E
2
X
n
1
A
n
(x
1
, x
2
) µ
2
(dx
2
)
!
µ
1
(dx
1
)
We now use the fact that integration commutes with the sum of non-
negative measurable functions to get
=
Z
E
1
X
n
Z
E
2
1
A
(x
1
, x
2
) µ
2
(dx
2
)
!
µ
1
(dx
1
)
=
X
n
Z
E
1
Z
E
2
1
A
n
(x
1
, x
2
) µ
2
(dx
2
)
µ
1
(dx
1
)
=
X
n
µ(A
n
).
So we have a working measure, and it clearly satisfies
µ(A
1
× A
2
) = µ(A
1
)µ(A
2
).
Uniqueness follows because
µ
is finite, and is thus characterized by its values on
the π-system A that generates E.
Exercise.
Show the non-uniqueness of the product Lebesgue measure on [0
,
1]
and the counting measure on [0, 1].
Note that we could as well have defined the measure as
µ(A) =
Z
E
2
Z
E
1
1
A
(x
1
, x
2
) µ
1
(dx
1
)
µ
2
(dx
2
).
The same proof would go through, so we have another measure on the space.
However, by uniqueness, we know they must be the same! Fubini’s theorem
generalizes this to arbitrary functions.
Theorem (Fubini’s theorem).
(i) If f is non-negative measurable, then
µ(f) =
Z
E
1
Z
E
2
f(x
1
, x
2
) µ
2
(dx
2
)
µ
1
(dx
1
). (∗)
In particular, we have
Z
E
1
Z
E
2
f(x
1
, x
2
) µ
2
(dx
2
)
µ
1
(dx
1
) =
Z
E
2
Z
E
1
f(x
1
, x
2
) µ
1
(dx
1
)
µ
2
(dx
2
).
This is sometimes known as Tonelli’s theorem.
(ii) If f is integrable, and
A =
x
1
∈ E :
Z
E
2
|f(x
1
, x
2
)|µ
2
(dx
2
) < ∞
.
then
µ
1
(E
1
\ A) = 0.
If we set
f
1
(x
1
) =
(
R
E
2
f(x
1
, x
2
) µ
2
(dx
2
) x
1
∈ A
0 x
1
6∈ A
,
then f
1
is a µ
1
integrable function and
µ
1
(f
1
) = µ(f).
Proof.
(i)
Let
V
be the set of all measurable functions such that (
∗
) holds. Then
V
is a vector space since integration is linear.
(a) By definition of µ, we know 1
E
and 1
A
are in V for all A ∈ A.
(b)
The monotone convergence theorem on both sides tell us that
V
is
closed under monotone limits of the form f
n
% f , f
n
≥ 0.
By the monotone class theorem, we know
V
contains all bounded measur-
able functions. If
f
is non-negative measurable, then (
f ∧ n
)
∈ V
, and
monotone convergence for f ∧n % f gives that f ∈ V .
(ii) Assume that f is µ-integrable. Then
x
1
7→
Z
E
2
|f(x
1
, x
2
)| µ(dx
2
)
is
E
1
-measurable, and, by (i), is
µ
1
-integrable. So
A
1
, being the inverse
image of
∞
under that map, lies in
E
1
. Moreover,
µ
1
(
E
1
\A
1
) = 0 because
integrable functions can only be infinite on sets of measure 0.
We set
f
+
1
(x
1
) =
Z
E
2
f
+
(x
1
, x
2
) µ
2
(dx
2
)
f
−
1
(x
1
) =
Z
E
2
f
−
(x
1
, x
2
) µ
2
(dx
2
).
Then we have
f
1
= (f
+
1
− f
−
1
)1
A
1
.
So the result follows since
µ(f) = µ(f
+
) − µ(f
−
) = µ(f
+
1
) − µ
1
(f
−
1
) = µ
1
(f
1
).
by (i).
Since
R
is
σ
-finite, we know that we can sensibly talk about the
d
-fold product
of the Lebesgue measure on R to obtain the Lebesgue measure on R
d
.
What
σ
-algebra is the Lebesgue measure on
R
d
defined on? We know the
Lebesgue measure on
R
is defined on
B
. So the Lebesgue measure is defined on
B × ··· × B = σ(B
1
× ··· × B
d
: B
i
∈ B).
By looking at the definition of the product topology, we see that this is just the
Borel σ-algebra on R
d
!
Recall that when we constructed the Lebesgue measure, the Caratheodory
extension theorem yields a measure on the “Lebesgue
σ
-algebra”
M
, which
was strictly bigger than the Borel
σ
-algebra. It was shown in the first example
sheet that
M
is complete, i.e. if we have
A ⊆ B ⊆ R
with
B ∈ M
,
µ
(
B
) = 0,
then
A ∈ M
. We can also take the Lebesgue measure on
R
d
to be defined on
M ⊗ ··· ⊗ M
. However, it happens that
M ⊗ M
together with the Lebesgue
measure on
R
2
is no longer complete (proof is left as an exercise for the reader).
We now turn to probability. Recall that random variables
X
1
, ··· , X
n
are
independent iff the
σ
-algebras
σ
(
X
1
)
, ··· , σ
(
X
n
) are independent. We will show
that random variables are independent iff their laws are given by the product
measure.
Proposition.
Let
X
1
, ··· , X
n
be random variables on (Ω
, F, P
) with values in
(E
1
, E
1
), ··· , (E
n
, E
n
) respectively. We define
E = E
1
× ··· × E
n
, E = E
1
⊗ ··· ⊗ E
n
.
Then X = (X
1
, ··· , X
n
) is E-measurable and the following are equivalent:
(i) X
1
, ··· , X
n
are independent.
(ii) µ
X
= µ
X
1
⊗ ··· ⊗ µ
X
n
.
(iii) For any f
1
, ··· , f
n
bounded and measurable, we have
E
"
n
Y
k=1
f
k
(X
k
)
#
=
n
Y
k=1
E[f
k
(X
k
)].
Proof.
–
(i)
⇒
(ii): Let
ν
=
µ
X
1
× ··· ⊗ µ
X
n
. We want to show that
ν
=
µ
X
. To
do so, we just have to check that they agree on a
π
-system generating the
entire σ-algebra. We let
A = {A
1
× ··· × A
n
: A
1
∈ E
1
, ··· , A
k
∈ E
k
}.
Then
A
is a generating
π
-system of
E
. Moreover, if
A
=
A
1
×···×A
n
∈ A
,
then we have
µ
X
(A) = P[X ∈ A]
= P[X
1
∈ A
1
, ··· , X
n
∈ A
n
]
By independence, we have
=
n
Y
k=1
P[X
k
∈ A
k
]
= ν(A).
So we know that µ
X
= ν = µ
X
1
⊗ ··· ⊗ µ
X
n
on E.
– (ii) ⇒ (iii): By assumption, we can evaluate the expectation
E
"
n
Y
k=1
f
k
(X
k
)
#
=
Z
E
n
Y
k=1
f
k
(x
k
)µ(dx
k
)
=
n
Y
k=1
Z
E
k
f(x
k
)µ
k
(dx
k
)
=
n
Y
k=1
E[f
k
(X
k
)].
Here in the middle we have used Fubini’s theorem.
– (iii) ⇒ (i): Take f
k
= 1
A
k
for A
k
∈ E
k
. Then we have
P[X
1
∈ A
1
, ··· , X
n
∈ A
n
] = E
"
n
Y
k=1
1
A
k
(X
k
)
#
=
n
Y
k=1
E[1
A
k
(X
k
)]
=
n
Y
k=1
P[X
k
∈ A
k
]
So X
1
, ··· , X
n
are independent.
4 Inequalities and L
p
spaces
Eventually, we will want to define the L
p
spaces as follows:
Definition
(
L
p
spaces)
.
Let (
E, E, µ
) be a measurable space. For 1
≤ p < ∞
,
we define
L
p
=
L
p
(
E, E, µ
) to be the set of all measurable functions
f
such that
kfk
p
=
Z
|f|
p
dµ
1/p
< ∞.
For p = ∞, we let L
∞
= L
∞
(E, E, µ) to be the space of functions with
kfk
∞
= inf{λ ≥ 0 : |f| ≤ λ a.e.} < ∞.
However, it is not clear that this is a norm. First of all,
kfk
p
= 0 does not
imply that
f
= 0. It only means that
f
= 0 a.e. But this is easy to solve. We
simply quotient out the vector space by functions that differ on a set of measure
zero. The more serious problem is that we don’t know how to prove the triangle
inequality.
To do so, we are going to prove some inequalities. Apart from enabling us to
show that
k · k
p
is indeed a norm, they will also be very helpful in the future
when we want to bound integrals.
4.1 Four inequalities
The four inequalities we are going to prove are the following:
(i) Chebyshev/Markov inequality
(ii) Jensen’s inequality
(iii) H¨older’s inequality
(iv) Minkowski’s inequality.
So let’s start proving the inequalities.
Proposition
(Chebyshev’s/Markov’s inequality)
.
Let
f
be non-negative mea-
surable and λ > 0. Then
µ({f ≥ λ}) ≤
1
λ
µ(f).
This is often used when this is a probability measure, so that we are bounding
the probability that a random variable is big.
The proof is essentially one line.
Proof. We write
f ≥ f1
f≥λ
≥ λ1
f≥λ
.
Taking µ gives the desired answer.
This is incredibly simple, but also incredibly useful!
The next inequality is Jensen’s inequality. To state it, we need to know what
a convex function is.
Definition
(Convex function)
.
Let
I ⊆ R
be an interval. Then
c
:
I → R
is
convex if for any t ∈ [0, 1] and x, y ∈ I, we have
c(tx + (1 − t)y) ≤ tc(x) + (1 − t)c(y).
x y
(1 − t)x + ty
(1 − t)f (x) + tc(y)
Note that if c is twice differentiable, then this is equivalent to c
00
> 0.
Proposition
(Jensen’s inequality)
.
Let
X
be an integrable random variable
with values in I. If c : I → R is convex, then we have
E[c(X)] ≥ c(E[X]).
It is crucial that this only applies to a probability space. We need the total
mass of the measure space to be 1 for it to work. Just being finite is not enough.
Jensen’s inequality will be an easy consequence of the following lemma:
Lemma.
If
c
:
I → R
is a convex function and
m
is in the interior of
I
, then
there exists real numbers a, b such that
c(x) ≥ ax + b
for all x ∈ I, with equality at x = m.
φ
m
ax + b
If the function is differentiable, then we can easily extract this from the
derivative. However, if it is not, then we need to be more careful.
Proof.
If
c
is smooth, then we know
c
00
≥
0, and thus
c
0
is non-decreasing. We
are going to show an analogous statement that does not mention the word
“derivative”. Consider x < m < y with x, y, m ∈ I. We want to show that
c(m) − c(x)
m − x
≤
c(y) − c(m)
y − m
.
To show this, we turn off our brains and do the only thing we can do. We can
write
m = tx + (1 − t)y
for some t. Then convexity tells us
c(m) ≤ tc(x) + (1 − t)c(y).
Writing c(m) = tc(m) + (1 − t)c(m), this tells us
t(c(m) − c(x)) ≤ (1 − t)(c(y) − c(m)).
To conclude, we simply have to compute the actual value of
t
and plug it in. We
have
t =
y − m
y − x
, 1 − t =
m − x
y − x
.
So we obtain
y − m
y − x
(c(m) − c(x)) ≤
m − x
y − x
(c(y) − c(m)).
Cancelling the y − x and dividing by the factors gives the desired result.
Now since x and y are arbitrary, we know there is some a ∈ R such that
c(m) − c(x)
m − x
≤ a ≤
c(y) − c(m)
y − m
.
for all x < m < y. If we rearrange, then we obtain
c(t) ≥ a(t − m) + c(m)
for all t ∈ I.
Proof of Jensen’s inequality.
To apply the previous result, we need to pick a
right m. We take
m = E[X].
To apply this, we need to know that
m
is in the interior of
I
. So we assume that
X
is not a.s. constant (that case is boring). By the lemma, we can find some
a, b ∈ R such that
c(X) ≥ aX + b.
We want to take the expectation of the LHS, but we have to make sure the
E
[
c
(
X
)] is a sensible thing to talk about. To make sure it makes sense, we show
that E[c(X)
−
] = E[(−c(X)) ∨ 0] is finite.
We simply bound
[c(X)]
−
= [−c(X)] ∨ 0 ≤ |a||X| + |b|.
So we have
E[c(X)
−
] ≤ |a|E|X| + |b| < ∞
since X is integrable. So E[c(X)] makes sense.
We then just take
E[c(X)] ≥ E[aX + b] = aE[X] + b = am + b = c(m) = c(E[X]).
So done.
We are now going to use Jensen’s inequality to prove H¨older’s inequality.
Before that, we take note of the following definition:
Definition (Conjugate). Let p, q ∈ [1, ∞]. We say that they are conjugate if
1
p
+
1
q
= 1,
where we take 1/∞ = 0.
Proposition
(H¨older’s inequality)
.
Let
p, q ∈
(1
, ∞
) be conjugate. Then for
f, g measurable, we have
µ(|fg|) = kfgk
1
≤ kf k
p
kgk
q
.
When p = q = 2, then this is the Cauchy-Schwarz inequality.
We will provide two different proofs.
Proof.
We assume that
kfk
p
>
0 and
kfk
p
< ∞
. Otherwise, there is nothing to
prove. By scaling, we may assume that
kfk
p
= 1. We make up a probability
measure by
P[A] =
Z
|f|
p
1
A
dµ.
Since we know
kfk
p
=
Z
|f|
p
dµ
1/p
= 1,
we know P[ ·] is a probability measure. Then we have
µ(|fg|) = µ(|fg|1
{|f|>0}
)
= µ
|g|
|f|
p−1
1
{|f|>0}
|f|
p
= E
|g|
|f|
p−1
1
{|f|>0}
Now use the fact that (
E|X|
)
q
≤ E
[
|X|
q
] since
x 7→ x
q
is convex for
q >
1. Then
we obtain
≤
E
|g|
q
|f|
(p−1)q
1
{|f|>0}
1/q
.
The key realization now is that
1
q
+
1
p
= 1 means that
q
(
p −
1) =
p
. So this
becomes
E
|g|
q
|f|
p
1
{|f|>0}
1/q
= µ(|g|
q
)
1/q
= kgk
q
.
Using the fact that kfk
p
= 1, we obtain the desired result.
Alternative proof.
We wlog 0
< kfk
p
, kgk
q
< ∞
, or else there is nothing to
prove. By scaling, we wlog kfk
p
= kgk
q
= 1. Then we have to show that
Z
|f||g| dµ ≤ 1.
To do so, we notice if
1
p
+
1
q
= 1, then the concavity of
log
tells us for any
a, b >
0,
we have
1
p
log a +
1
q
log b ≤ log
a
p
+
b
q
.
Replacing a with a
p
; b with b
p
and then taking exponentials tells us
ab ≤
a
p
p
+
b
q
q
.
While we assumed
a, b >
0 when deriving, we observe that it is also valid when
some of them are zero. So we have
Z
|f||g| dµ ≤
Z
|f|
p
p
+
|g|
q
q
dµ =
1
p
+
1
q
= 1.
Just like Jensen’s inequality, this is very useful when bounding integrals, and
it is also theoretically very important, because we are going to use it to prove
the Minkowski inequality. This tells us that the L
p
norm is actually a norm.
Before we prove the Minkowski inequality, we prove the following tiny lemma
that we will use repeatedly:
Lemma. Let a, b ≥ 0 and p ≥ 1. Then
(a + b)
p
≤ 2
p
(a
p
+ b
p
).
This is a terrible bound, but is useful when we want to prove that things are
finite.
Proof. We wlog a ≤ b. Then
(a + b)
p
≤ (2b)
p
≤ 2
p
b
p
≤ 2
p
(a
p
+ b
p
).
Theorem (Minkowski inequality). Let p ∈ [1, ∞] and f, g measurable. Then
kf + gk
p
≤ kf k
p
+ kgk
p
.
Again the proof is magic.
Proof. We do the boring cases first. If p = 1, then
kf + gk
1
=
Z
|f + g| ≤
Z
(|f| + |g|) =
Z
|f| +
Z
|g| = kfk
1
+ kgk
1
.
The proof of the case of p = ∞ is similar.
Now note that if
kf
+
gk
p
= 0, then the result is trivial. On the other hand,
if kf + gk
p
= ∞, then since we have
|f + g|
p
≤ (|f | + |g|)
p
≤ 2
p
(|f|
p
+ |g|
p
),
we know the right hand side is infinite as well. So this case is also done.
Let’s now do the interesting case. We compute
µ(|f + g|
p
) = µ(|f + g||f + g|
p−1
)
≤ µ(|f ||f + g|
p−1
) + µ(|g||f + g|
p−1
)
≤ kf k
p
k|f + g|
p−1
k
q
+ kgk
p
k|f + g|
p−1
k
q
= (kfk
p
+ kgk
p
)k|f + g|
p−1
k
q
= (kfk
p
+ kgk
p
)µ(|f + g|
(p−1)q
)
1−1/p
= (kfk
p
+ kgk
p
)µ(|f + g|
p
)
1−1/p
.
So we know
µ(|f + g|
p
) ≤ (kf k
p
+ kgk
p
)µ(|f + g|
p
)
1−1/p
.
Then dividing both sides by (µ(|f + g|
p
)
1−1/p
tells us
µ(|f + g|
p
)
1/p
= kf + gk
p
≤ kf k
p
+ kgk
p
.
Given these inequalities, we can go and prove some properties of L
p
spaces.
4.2 L
p
spaces
Recall the following definition:
Definition
(Norm of vector space)
.
Let
V
be a vector space. A norm on
V
is
a function k · k : V → R
≥0
such that
(i) ku + vk ≤ kuk + kvk for all U, v ∈ V .
(ii) kαvk = |α|kvk for all v ∈ V and α ∈ R
(iii) kvk = 0 implies v = 0.
Definition
(
L
p
spaces)
.
Let (
E, E, µ
) be a measurable space. For 1
≤ p < ∞
,
we define
L
p
=
L
p
(
E, E, µ
) to be the set of all measurable functions
f
such that
kfk
p
=
Z
|f|
p
dµ
1/p
< ∞.
For p = ∞, we let L
∞
= L
∞
(E, E, µ) to be the space of functions with
kfk
∞
= inf{λ ≥ 0 : |f| ≤ λ a.e.} < ∞.
By Minkowski’s inequality, we know
L
p
is a vector space, and also (i) holds.
By definition, (ii) holds obviously. However, (iii) does not hold for
k · k
p
, because
kfk
p
= 0 does not imply that f = 0. It merely implies that f = 0 a.e.
To fix this, we define an equivalence relation as follows: for
f, g ∈ L
p
, we say
that
f ∼ g
iff
f −g
= 0 a.e. For any
f ∈ L
p
, we let [
f
] denote its equivalence
class under this relation. In other words,
[f] = {g ∈ L
p
: f − g = 0 a.e.}.
Definition (L
p
space). We define
L
p
= {[f] : f ∈ L
p
},
where
[f] = {g ∈ L
p
: f − g = 0 a.e.}.
This is a normed vector space under the k · k
p
norm.
One important property of
L
p
is that it is complete, i.e. every Cauchy
sequence converges.
Definition
(Complete vector space/Banach spaces)
.
A normed vector space
(
V, k · k
) is complete if every Cauchy sequence converges. In other words, if (
v
n
)
is a sequence in
V
such that
kv
n
− v
m
k →
0 as
n, m → ∞
, then there is some
v ∈ V
such that
kv
n
− vk →
0 as
n → ∞
. A complete vector space is known as
a Banach space.
Theorem.
Let 1
≤ p ≤ ∞
. Then
L
p
is a Banach space. In other words, if (
f
n
)
is a sequence in
L
p
, with the property that
kf
n
− f
m
k
p
→
0 as
n, m → ∞
, then
there is some f ∈ L
p
such that kf
n
− f k
p
→ 0 as n → ∞.
Proof.
We will only give the proof for
p < ∞
. The
p
=
∞
case is left as an
exercise for the reader.
Suppose that (
f
n
) is a sequence in
L
p
with
kf
n
− f
m
k
p
→
0 as
n, m → ∞
.
Take a subsequence (f
n
k
) of (f
n
) with
kf
n
k+1
− f
n
k
k
p
≤ 2
−k
for all k ∈ N. We then find that
M
X
k=1
|f
n
k+1
− f
n
k
|
p
≤
M
X
k=1
kf
n
k+1
− f
n
k
k
p
≤ 1.
We know that
M
X
k=1
|f
n
k+1
− f
n
k
| %
∞
X
k=1
|f
n
k+1
− f
n
k
| as M → ∞.
So applying the monotone convergence theorem, we know that
∞
X
k=1
|f
n
k+1
− f
n
k
|
p
≤
∞
X
k=1
kf
n
k+1
− f
n
k
k
p
≤ 1.
In particular,
∞
X
k=1
|f
n
k+1
− f
n
k
| < ∞ a.e.
So f
n
k
(x) converges a.e., since the real line is complete. So we set
f(x) =
(
lim
k→∞
f
n
k
(x) if the limit exists
0 otherwise
By an exercise on the first example sheet, this function is indeed measurable.
Then we have
kf
n
− f k
p
p
= µ(|f
n
− f |
p
)
= µ
lim inf
k→∞
|f
n
− f
n
k
|
p
≤ lim inf
k→∞
µ(|f
n
− f
n
k
|
p
),
which tends to 0 as
n → ∞
since the sequence is Cauchy. So
f
is indeed the
limit.
Finally, we have to check that f ∈ L
p
. We have
µ(|f|
p
) = µ(|f − f
n
+ f
n
|
p
)
≤ µ((|f − f
n
| + |f
n
|)
p
)
≤ µ(2
p
(|f − f
n
|
p
+ |f
n
|
p
))
= 2
p
(µ(|f − f
n
|
p
) + µ(|f
n
|
p
)
2
)
We know the first term tends to 0, and in particular is finite for
n
large enough,
and the second term is also finite. So done.
4.3 Orthogonal projection in L
2
In the particular case
p
= 2, we have an extra structure on
L
2
, namely an inner
product structure, given by
hf, gi =
Z
fg dµ.
This inner product induces the L
2
norm by
kfk
2
2
= hf, f i.
Recall the following definition:
Definition
(Hilbert space)
.
A Hilbert space is a vector space with a complete
inner product.
So L
2
is not only a Banach space, but a Hilbert space as well.
Somehow Hilbert spaces are much nicer than Banach spaces, because you
have an inner product structure as well. One particular thing we can do is
orthogonal complements.
Definition (Orthogonal functions). Two functions f, g ∈ L
2
are orthogonal if
hf, gi = 0,
Definition (Orthogonal complement). Let V ⊆ L
2
. We then set
V
⊥
= {f ∈ L
2
: hf, vi = 0 for all v ∈ V }.
Note that we can always make these definitions for any inner product space.
However, the completeness of the space guarantees nice properties of the orthog-
onal complement.
Before we proceed further, we need to make a definition of what it means
for a subspace of
L
2
to be closed. This isn’t the usual definition, since
L
2
isn’t
really a normed vector space, so we need to accommodate for that fact.
Definition
(Closed subspace)
.
Let
V ⊆ L
2
. Then
V
is closed if whenever (
f
n
)
is a sequence in V with f
n
→ f , then there exists v ∈ V with v ∼ f.
Thee main thing that makes
L
2
nice is that we can use closed subspaces to
decompose functions orthogonally.
Theorem.
Let
V
be a closed subspace of
L
2
. Then each
f ∈ L
2
has an
orthogonal decomposition
f = u + v,
where v ∈ V and u ∈ V
⊥
. Moreover,
kf − vk
2
≤ kf − gk
2
for all g ∈ V with equality iff g ∼ v.
To prove this result, we need two simple identities, which can be easily proven
by writing out the expression.
Lemma (Pythagoras identity).
kf + gk
2
= kfk
2
+ kgk
2
+ 2hf, gi.
Lemma (Parallelogram law).
kf + gk
2
+ kf − gk
2
= 2(kfk
2
+ kgk
2
).
To prove the existence of orthogonal decomposition, we need to use a slight
trick involving the parallelogram law.
Proof of orthogonal decomposition.
Given
f ∈ L
2
, we take a sequence (
g
n
) in
V
such that
kf − g
n
k
2
→ d(f, V ) = inf
g
kf − gk
2
.
We now want to show that the infimum is attained. To do so, we show that
g
n
is a Cauchy sequence, and by the completeness of L
2
, it will have a limit.
If we apply the parallelogram law with
u
=
f − g
n
and
v
=
f − g
m
, then we
know
ku + vk
2
2
+ ku − vk
2
2
= 2(kuk
2
2
+ kvk
2
2
).
Using our particular choice of u and v, we obtain
2
f −
g
n
+ g
m
2
2
2
+ kg
n
− g
m
k
2
2
= 2(kf − g
n
k
2
2
+ kf − g
m
k
2
2
).
So we have
kg
n
− g
m
k
2
2
= 2(kf − g
n
k
2
2
+ kf − g
m
k
2
2
) − 4
f −
g
n
+ g
m
2
2
2
.
The first two terms on the right hand side tend to
d
(
f, V
)
2
, and the last term
is bounded below in magnitude by 4
d
(
f, V
). So as
n, m → ∞
, we must have
kg
n
−g
m
k
2
→
0. By completeness of
L
2
, there exists a
g ∈ L
2
such that
g
n
→ g
.
Now since
V
is assumed to be closed, we can find a
v ∈ V
such that
g
=
v
a.e. Then we know
kf − vk
2
= lim
n→∞
kf − g
n
k
2
= d(f, V ).
So
v
attains the infimum. To show that this gives us an orthogonal decomposition,
we want to show that
u = f − v ∈ V
⊥
.
Suppose
h ∈ V
. We need to show that
hu, hi
= 0. We need to do another funny
trick. Suppose t ∈ R. Then we have
d(f, V )
2
≤ kf − (v + th)k
2
2
= kf − vk
2
+ t
2
khk
2
2
− 2thf − v, hi.
We think of this as a quadratic in t, which is minimized when
t =
hf − v, hi
khk
2
2
.
But we know this quadratic is minimized when t = 0. So hf −v, hi = 0.
We are now going to look at the relationship between conditional expectation
and orthogonal projection.
Definition
(Conditional expectation)
.
Suppose we have a probability space
(Ω
, F, P
), and (
G
n
) is a collection of pairwise disjoint events with
S
n
G
n
= Ω.
We let
G = σ(G
n
: n ∈ N).
The conditional expectation of X given G is the random variable
Y =
∞
X
n=1
E[X | G
n
]1
G
n
,
where
E[X | G
n
] =
E[X1
G
n
]
P[G
n
]
for P[G
n
] > 0.
In other words, given any x ∈ Ω, say x ∈ G
n
, then Y (x) = E[X | G
n
].
If
X ∈ L
2
(
P
), then
Y ∈ L
2
(
P
), and it is clear that
Y
is
G
-measurable. We
claim that this is in fact the projection of
X
onto the subspace
L
2
(
G, P
) of
G-measurable L
2
random variables in the ambient space L
2
(P).
Proposition.
The conditional expectation of
X
given
G
is the projection of
X
onto the subspace
L
2
(
G, P
) of
G
-measurable
L
2
random variables in the ambient
space L
2
(P).
In some sense, this tells us
Y
is our best prediction of
X
given only the
information encoded in G.
Proof.
Let
Y
be the conditional expectation. It suffices to show that
E
[(
X −W
)
2
]
is minimized for
W
=
Y
among
G
-measurable random variables. Suppose that
W is a G-measurable random variable. Since
G = σ(G
n
: n ∈ N),
it follows that
W =
∞
X
n=1
a
n
1
G
n
.
where a
n
∈ R. Then
E[(X − W )
2
] = E
∞
X
n=1
(X − a
n
)1
G
n
!
2
= E
"
X
n
(X
2
+ a
2
n
− 2a
n
X)1
G
n
#
= E
"
X
n
(X
2
+ a
2
n
− 2a
n
E[X | G
n
])1
G
n
#
We now optimize the quadratic
X
2
+ a
2
n
− 2a
n
E[X | G
n
]
over a
n
. We see that this is minimized for
a
n
= E[X | G
n
].
Note that this does not depend on what
X
is in the quadratic, since it is in the
constant term.
Therefore we know that E[X | G
n
] is minimized for W = Y .
We can also rephrase variance and covariance in terms of the L
2
spaces.
Suppose X, Y ∈ L
2
(P) with
m
X
= E[X], m
Y
= E[Y ].
Then variance and covariance just correspond to
L
2
inner product and norm.
In fact, we have
var(X) = E[(X − m
X
)
2
] = kX − m
X
k
2
2
,
cov(X, Y ) = E[(X − m
X
)(Y − m
Y
)] = hX − m
X
, Y −m
Y
i.
More generally, the covariance matrix of a random vector
X
= (
X
1
, ··· , X
n
) is
given by
var(X) = (cov(X
i
, X
j
))
ij
.
On the example sheet, we will see that the covariance matrix is a positive definite
matrix.
4.4 Convergence in L
1
(P) and uniform integrability
What we are looking at here is the following question — suppose (
X
n
)
, X
are
random variables and
X
n
→ X
in probability. Under what extra assumptions is
it true that X
n
also converges to X in L
1
, i.e. E[X
n
− X] → 0 as X → ∞?
This is not always true.
Example. If we take (Ω, F, P) = ((0, 1), B((0, 1)), Lebesgue), and
X
n
= n1
(0,1/n)
.
Then X
n
→ 0 in probability, and in fact X
n
→ 0 almost surely. However,
E[|X
n
− 0|] = E[X
n
] = n ·
1
n
= 1,
which does not converge to 1.
We see that the problem with this series is that there is a lot of “stuff”
concentrated near 0, and indeed the functions can get unbounded near 0. We
can easily curb this problem by requiring our functions to be bounded:
Theorem
(Bounded convegence theorem)
.
Suppose
X,
(
X
n
) are random vari-
ables. Assume that there exists a (non-random) constant
C >
0 such that
|X
n
| ≤ C. If X
n
→ X in probability, then X
n
→ X in L
1
.
The proof is a rather standard manipulation.
Proof. We first show that |X| ≤ C a.e. Let ε > 0. We then have
P[|X| > C + ε] ≤ P[|X − X
n
| + |X
n
| > C + ε]
≤ P[|X − X
n
| > ε] + P[|X
n
| > C]
We know the second term vanishes, while the first term
→
0 as
n → ∞
. So we
know
P[|X| > C + ε] = 0
for all ε. Since ε was arbitrary, we know |X| ≤ C a.s.
Now fix an ε > 0. Then
E[|X
n
− X| = E
|X
n
− X|(1
|X
n
−X|≤ε
+ 1
|X
n
−X|>ε
)
≤ ε + 2C P [|X
n
− X| > ε] .
Since
X
n
→ X
in probability, for
N
sufficiently large, the second term is
≤ ε
.
So E[|X
n
− X|] ≤ 2ε, and we have convergence in L
1
.
But we can do better than that. We don’t need the functions to be actually
bounded. We just need that the functions aren’t concentrated in arbitrarily
small subsets of Ω. Thus, we make the following definition:
Definition
(Uniformly integrable)
.
Let
X
be a family of random variables.
Define
I
X
(δ) = sup{E[|X|1
A
] : X ∈ X, A ∈ F with P[A] < δ}.
Then we say
X
is uniformly integrable if
X
is
L
1
-bounded (see below), and
I
X
(δ) → 0 as δ → 0.
Definition
(
L
p
-bounded)
.
Let
X
be a family of random variables. Then we say
X is L
p
-bounded if
sup{kXk
p
: X ∈ X} < ∞.
In some sense, this is “uniform continuity for integration”. It is immediate
that
Proposition.
Finite unions of uniformly integrable sets are uniformly integrable.
How can we find uniformly integrable families? The following proposition
gives us a large class of such families.
Proposition.
Let
X
be an
L
p
-bounded family for some
p >
1. Then
X
is
uniformly integrable.
Proof. We let
C = sup{kXk
p
: X ∈ X} < ∞.
Suppose that X ∈ X and A ∈ F. We then have
E[|X|1
A
] =≤ E[|X|
p
]
1/p
P[A]
1/q
≤ CP[A]
1/q
.
by H¨older’s inequality, where p, q are conjugates. This is now a uniform bound
depending only on P[A]. So done.
This is the best we can get.
L
1
boundedness is not enough. Indeed, our
earlier example
X
n
= n1
(0,1/n)
,
is L
1
bounded but not uniformly integrable. So L
1
boundedness is not enough.
For many practical purposes, it is convenient to rephrase the definition of
uniform integrability as follows:
Lemma.
Let
X
be a family of random variables. Then
X
is uniformly integrable
if and only if
sup{E[|X|1
|X|>k
] : X ∈ X} → 0
as k → ∞.
Proof.
(⇒)
Suppose that
χ
is uniformly integrable. For any
k
, and
X ∈ X
by Cheby-
shev inequality, we have
P[|X| ≥ k] ≤
E[X]
k
.
Given
ε >
0, we pick
δ
such that
P
[
|X|1
A
]
< ε
for all
A
with
µ
(
A
)
< δ
.
Then pick
k
sufficiently large such that
kδ < sup{E
[
X
] :
X ∈ X}
. Then
P[|X| ≥ k] < δ, and hence E[|X|1
|X|>k
] < ε for all X ∈ X.
(⇐)
Suppose that the condition in the lemma holds. We first show that
X
is
L
1
-bounded. We have
E[|X|] = E[|X|(1
|X|≤k
+ 1
|X|>k
)] ≤ k + E[|X|1
|X|>k
] < ∞
by picking a large enough k.
Next note that for any measurable A and X ∈ X, we have
E[|X|1
A
] = E[|X|1
A
(1
|X|>k
+ 1
|X|≤k
)] ≤ E[|X|1
|X|>k
] + kP[A].
Thus, for any
ε >
0, we can pick
k
sufficiently large such that the first
term is
<
ε
2
for all
X ∈ X
by assumption. Then when
P
[
A
]
<
ε
2k
, we have
E|X|1
A
] ≤ ε.
As a corollary, we find that
Corollary. Let X = {X}, where X ∈ L
1
(P). Then X is uniformly integrable.
Hence, a finite collection of L
1
functions is uniformly integrable.
Proof. Note that
E[|X|] =
∞
X
k=0
E[|X|1
X∈[k,k+1)
].
Since the sum is finite, we must have
E[|X|1
|X|≥K
] =
∞
X
k=K
E[|X|1
X∈[k,k+1)
] → 0.
With all that preparation, we now come to the main theorem on uniform
integrability.
Theorem.
Let
X,
(
X
n
) be random variables. Then the following are equivalent:
(i) X
n
, X ∈ L
1
for all n and X
n
→ X in L
1
.
(ii) {X
n
} is uniformly integrable and X
n
→ X in probability.
The (i)
⇒
(ii) direction is just a standard manipulation. The idea of the
(ii)
⇒
(i) direction is that we use uniformly integrability to cut off
X
n
and
X
at some large value
K
, which gives us a small error, then apply bounded
convergence.
Proof.
We first assume that
X
n
, X
are
L
1
and
X
n
→ X
in
L
1
. We want to show
that {X
n
} is uniformly integrable and X
n
→ X in probability.
We first show that
X
n
→ X
in probability. This is just going to come from
the Chebyshev inequality. For ε > 0. Then we have
P[|X − X
n
| > ε] ≤
E[|X − X
n
|]
ε
→ 0
as n → ∞.
Next we show that
{X
n
}
is uniformly integrable. Fix
ε >
0. Take
N
such
that
n ≥ N
implies
E
[
|X −X
n
|
]
≤
ε
2
. Since finite families of
L
1
random variables
are uniformly integrable, we can pick
δ >
0 such that
A ∈ F
and
P
[
A
]
< δ
implies
E[X1
A
], E[|X
n
|1
A
] ≤
ε
2
for n = 1, ··· , N.
Now when n > N and A ∈ F with P[A] ≤ δ, then we have
E[|X
n
|1
A
] ≤ E[|X − X
n
|1
A
] + E[|X|1
A
]
≤ E[|X − X
n
|] +
ε
2
≤
ε
2
+
ε
2
= ε.
So {X
n
} is uniformly integrable.
Assume that {X
n
} is uniformly integrable and X
n
→ X in probability.
The first step is to show that
X ∈ L
1
. We want to use Fatou’s lemma, but
to do so, we want almost sure convergence, not just convergence in probability.
Recall that we have previously shown that there is a subsequence (
X
n
k
) of
(X
n
) such that X
n
k
→ X a.s. Then we have
E[|X|] = E
lim inf
k→∞
|X
n
k
|
≤ lim inf
k→∞
E[|X
n
k
|] < ∞
since uniformly integrable families are
L
1
bounded. So
E
[
|X|
]
< ∞
, hence
X ∈ L
1
.
Next we want to show that
X
n
→ X
in
L
1
. Take
ε >
0. Then there exists
K ∈ (0, ∞) such that
E
|X|1
{|X|>K}
, E
|X
n
|1
{|X
n
|>K}
≤
ε
3
.
To set things up so that we can use the bounded convergence theorem, we have
to invent new random variables
X
K
n
= (X
n
∨ −K) ∧ K, X
K
= (X ∨ −K) ∧ K.
Since X
n
→ X in probability, it follows that X
K
n
→ X
K
in probability.
Now bounded convergence tells us that there is some
N
such that
n ≥ N
implies
E[|X
K
n
− X
K
|] ≤
ε
3
.
Combining, we have for n ≥ N that
E[|X
n
− X|] ≤ E[|X
K
n
− X
K
|] + E[|X|1
{|X|≥K}
] + E[|X
n
|1
{|X
n
|≥K}
] ≤ ε.
So we know that X
n
→ X in L
1
.
The main application is that when
{X
n
}
is a type of stochastic process known
as a martingale. This will be done in III Advanced Probability and III Stochastic
Calculus.
5 Fourier transform
We now turn to the exciting topic of the Fourier transform. There are two main
questions we want to ask — when does the Fourier transform exist, and when
we can recover a function from its Fourier transform.
Of course, not only do we want to know if the Fourier transform exists. We
also want to know if it lies in some nice space, e.g. L
2
.
It turns out that when we want to prove things about Fourier transforms,
it is often helpful to “smoothen” the function by doing what is known as a
Gaussian convolution. So after defining the Fourier transform and proving some
really basic properties, we are going to investigate convolutions and Gaussians
for a bit (convolutions are also useful on their own, since they correspond to
sums of independent random variables). After that, we can go and prove the
actual important properties of the Fourier transform.
5.1 The Fourier transform
When talking about Fourier transforms, we will mostly want to talk about
functions
R
d
→ C
. So from now on, we will write
L
p
for complex valued Borel
functions on R
d
with
kfk
p
=
Z
R
d
|f|
p
1/p
< ∞.
The integrals of complex-valued function are defined on the real and imaginary
parts separately, and satisfy the properties we would expect them to. The details
are on the first example sheet.
Definition
(Fourier transform)
.
The Fourier transform
ˆ
f
:
R
d
→ C
of
f ∈
L
1
(R
d
) is given by
ˆ
f(u) =
Z
R
d
f(x)e
i(u,x)
dx,
where u ∈ R
d
and (u, x) denotes the inner product, i.e.
(u, x) = u
1
x
1
+ ··· + u
d
x
d
.
Why do we care about Fourier transforms? Many computations are easier
with
ˆ
f
in place of
f
, especially computations that involve differentiation and
convolutions (which are relevant to sums of independent random variables). In
particular, we will use it to prove the central limit theorem.
More generally, we can define the Fourier transform of a measure:
Definition
(Fourier transform of measure)
.
The Fourier transform of a finite
measure µ on R
d
is the function ˆµ : R
d
→ C given by
ˆµ(u) =
Z
R
d
e
i(u,x)
µ(dx).
In the context of probability, we give these things a different name:
Definition
(Characteristic function)
.
Let
X
be a random variable. Then the
characteristic function of X is the Fourier transform of its law, i.e.
φ
X
(u) = E[e
i(u,X)
] = ˆµ
X
(u),
where µ
X
is the law of X.
We now make the following (trivial) observations:
Proposition.
k
ˆ
fk
∞
≤ kf k
1
, kˆµk
∞
≤ µ(R
d
).
Less trivially, we have the following result:
Proposition. The functions
ˆ
f, ˆµ are continuous.
Proof. If u
n
→ u, then
f(x)e
i(u
n
,x)
→ f (x)e
i(u,x)
.
Also, we know that
|f(x)e
i(u
n
,x)
| = |f (x)|.
So we can apply dominated convergence theorem with |f| as the bound.
5.2 Convolutions
To actually do something useful about the Fourier transforms, we need to talk
about convolutions.
Definition
(Convolution of random variables)
.
Let
µ, ν
be probability measures.
Their convolution
µ ∗ ν
is the law of
X
+
Y
, where
X
has law
µ
and
Y
has law
ν, and X, Y are independent. Explicitly, we have
µ ∗ ν(A) = P[X + Y ∈ A]
=
ZZ
1
A
(x + y) µ(dx) ν(dy)
Let’s suppose that
µ
has a density function
f
with respect to the Lebesgue
measure. Then we have
µ ∗ ν(A) =
ZZ
1
A
(x + y)f(x) dx ν(dy)
=
ZZ
1
A
(x)f(x − y) dx ν(dy)
=
Z
1
A
(x)
Z
f(x − y) ν(dy)
dx.
So we know that µ ∗ ν has law
Z
f(x − y) ν(dy).
This thing has a name.
Definition
(Convolution of function with measure)
.
Let
f ∈ L
p
and
ν
a
probability measure. Then the convolution of f with µ is
f ∗ ν(x) =
Z
f(x − y) ν(dy) ∈ L
p
.
Note that we do have to treat the two cases of convolutions separately, since
a measure need not have a density, and a function need not specify a probability
measure (it may not integrate to 1).
We check that it is indeed in L
p
. Since ν is a probability measure, Jensen’s
inequality says we have
kf ∗ νk
p
p
=
Z
Z
|f(x − y)|ν(dy)
p
dx
≤
ZZ
|f(x − y)|
p
ν(dy) dx
=
ZZ
|f(x − y)|
p
dx ν(dy)
= kfk
p
p
< ∞.
In fact, from this computation, we see that
Proposition. For any f ∈ L
p
and ν a probability measure, we have
kf ∗ νk
p
≤ kf k
p
.
The interesting thing happens when we try to take the Fourier transform of
a convolution.
Proposition.
[
f ∗ ν(u) =
ˆ
f(u)ˆν(u).
Proof. We have
[
f ∗ ν(u) =
Z
Z
f(x − y)ν(dy)
e
i(u,x)
dx
=
ZZ
f(x − y)e
i(u,x)
dx ν(dy)
=
Z
Z
f(x − y)e
i(u,x−y)
d(x − y)
e
i(u,y)
µ(dy)
=
Z
Z
f(x)e
i(u,x)
d(x)
e
i(u,y)
µ(dy)
=
Z
ˆ
f(u)e
i(u,x)
µ(dy)
=
ˆ
f(u)
Z
e
i(u,x)
µ(dy)
=
ˆ
f(u)ˆν(u).
In the context of random variables, we have a similar result:
Proposition.
Let
µ, ν
be probability measures, and
X, Y
be independent vari-
ables with laws µ, ν respectively. Then
[
µ ∗ ν(u) = ˆµ(u)ˆν(u).
Proof. We have
[
µ ∗ ν(u) = E[e
i(u,X+Y )
] = E[e
i(u,X)
]E[e
i(u,Y )
] = ˆµ(u)ˆν(u).
5.3 Fourier inversion formula
We now want to work towards proving the Fourier inversion formula:
Theorem (Fourier inversion formula). Let f,
ˆ
f ∈ L
1
. Then
f(x) =
1
(2π)
d
Z
ˆ
f(u)e
−i(u,x)
du a.e.
Our strategy is as follows:
(i)
Show that the Fourier inversion formula holds for a Gaussian distribution
by direct computations.
(ii)
Show that the formula holds for Gaussian convolutions, i.e. the convolution
of an arbitrary function with a Gaussian.
(iii)
We show that any function can be approximated by a Gaussian convolution.
Note that the last part makes a lot of sense. If
X
is a random variable, then
convolving with a Gaussian is just adding
X
+
√
tZ
, and if we take
t →
0, we
recover the original function. What we have to do is to show that this behaves
sufficiently well with the Fourier transform and the Fourier inversion formula
that we will actually get the result we want.
Gaussian densities
Before we start, we had better start by defining the Gaussian distribution.
Definition (Gaussian density). The Gaussian density with variance t is
g
t
(x) =
1
2πt
d/2
e
−|x|
2
/2t
.
This is equivalently the density of
√
tZ
, where
Z
= (
Z
1
, ··· , Z
d
) with
Z
i
∼
N(0, 1) independent.
We now want to compute the Fourier transformation directly and show that
the Fourier inversion formula works for this.
We start off by working in the case
d
= 1 and
Z ∼ N
(0
,
1). We want to
compute the Fourier transform of the law of this guy, i.e. its characteristic
function. We will use a nice trick.
Proposition. Let Z ∼ N(0, 1). Then
φ
Z
(a) = e
−u
2
/2
.
We see that this is in fact a Gaussian up to a factor of 2π.
Proof. We have
φ
Z
(u) = E[e
iuZ
]
=
1
√
2π
Z
e
iux
e
−x
2
/2
dx.
We now notice that the function is bounded, so we can differentiate under the
integral sign, and obtain
φ
0
Z
(u) = E[iZe
iuZ
]
=
1
√
2π
Z
ixe
iux
e
−x
2
/2
dx
= −uφ
Z
(u),
where the last equality is obtained by integrating by parts. So we know that
φ
Z
(u) solves
φ
0
Z
(u) = −uφ
Z
(u).
This is easy to solve, since we can just integrate this. We find that
log φ
Z
(u) = −
1
2
u
2
+ C.
So we have
φ
Z
(u) = Ae
−u
2
/2
.
We know that A = 1, since φ
Z
(0) = 1. So we have
φ
Z
(u) = e
−u
2
/2
.
We now do this problem in general.
Proposition.
Let
Z
= (
Z
1
, ··· , Z
d
) with
Z
j
∼ N
(0
,
1) independent. Then
√
tZ
has density
g
t
(x) =
1
(2πt)
d/2
e
−|x|
2
/(2t)
.
with
ˆg
t
(u) = e
−|u|
2
t/2
.
Proof. We have
ˆg
t
(u) = E[e
i(u,
√
tZ)
]
=
d
Y
j=1
E[e
i(u
j
,
√
tZ
j
)
]
=
d
Y
j=1
φ
Z
(
√
tu
j
)
=
d
Y
j=1
e
−tu
2
j
/2
= e
−|u|
2
t/2
.
Again,
g
t
and
ˆg
t
are almost the same, apart form the factor of (2
πt
)
−d/2
and
the position of t shifted. We can thus write this as
ˆg
t
(u) = (2π)
d/2
t
−d/2
g
1/t
(u).
So this tells us that
ˆ
ˆg
t
(u) = (2π)
d
g
t
(u).
This is not exactly the same as saying the Fourier inversion formula works,
because in the Fourier inversion formula, we integrated against
e
−i(u,x)
, not
e
i(u,x)
. However, we know that by the symmetry of the Gaussian distribution,
we have
g
t
(x) = g
t
(−x) = (2π)
−d
ˆ
ˆg
t
(−x) =
1
2π
d
Z
ˆg
t
(u)e
−i(u,x)
du.
So we conclude that
Lemma.
The Fourier inversion formula holds for the Gaussian density function.
Gaussian convolutions
Definition
(Gaussian convolution)
.
Let
f ∈ L
1
. Then a Gaussian convolution
of f is a function of the form f ∗g
t
.
We are now going to do a little computation that shows that functions of
this type also satisfy the Fourier inversion formula.
Before we start, we make some observations about the Gaussian convolution.
By general theory of convolutions, we know that we have
Proposition.
kf ∗ g
t
k
1
≤ kf k
1
.
We also have a pointwise bound
|f ∗ g
t
(x)| =
Z
f(x − y)e
−|y|
2
/(2t)
1
2πt
d/2
dy
≤ (2πt)
−d/2
Z
|f(x − y)| dx
≤ (2πt)
−d/2
kfk
1
.
This tells us that in fact
Proposition.
kf ∗ g
t
k
∞
≤ (2πt)
−d/2
kfk
1
.
So in fact the convolution is pointwise bounded. We see that the bound gets
worse as
t →
0, and we will see that this is because as
t →
0, the convolution
f ∗ g
t
becomes a better and better approximation of
f
, and we did not assume
that f is bounded.
Similarly, we can compute that
Proposition.
k
\
f ∗ g
t
k
1
= k
ˆ
f ˆg
t
k
1
≤ (2π)
d/2
t
−d/2
k
ˆ
fk
1
,
and
k
\
f ∗ g
t
k
∞
≤ k
ˆ
fk
∞
.
Now given these bounds, it makes sense to write down the Fourier inversion
formula for a Gaussian convolution.
Lemma. The Fourier inversion formula holds for Gaussian convolutions.
We are going to reduce this to the fact that the Gaussian distribution itself
satisfies Fourier inversion.
Proof. We have
f ∗ g
t
(x) =
Z
f(x − y)g
t
(y) dy
=
Z
f(x − y)
1
(2π)
d
Z
ˆg
t
(u)e
−i(u,y)
du
dy
=
1
2π
d
ZZ
f(x − y)ˆg
t
(u)e
−i(u,y)
du dy
=
1
2π
d
Z
Z
f(x − y)e
−i(u,x−y)
dy
ˆg
t
(u)e
−i(u,x)
du
=
1
2π
d
Z
ˆ
f(u)ˆg
t
(u)e
−i(u,x)
du
=
1
2π
d
Z
\
f ∗ g
t
(u)e
−i(u,x)
du
So done.
The proof
Finally, we are going to extend the Fourier inversion formula to the case where
f,
ˆ
f ∈ L
2
.
Theorem (Fourier inversion formula). Let f ∈ L
1
and
f
t
(x) = (2π)
−d
Z
ˆ
f(u)e
−|u|
2
t/2
e
−i(u,x)
du = (2π)
−d
Z
\
f ∗ g
t
(u)e
−i(u,x)
du.
Then
kf
t
−f k
1
→
0, as
t →
0, and the Fourier inversion holds whenever
f,
ˆ
f ∈ L
1
.
To prove this, we first need to show that the Gaussian convolution is indeed
a good approximation of f:
Lemma.
Suppose that
f ∈ L
p
with
p ∈
[1
, ∞
). Then
kf ∗ g
t
− fk
p
→
0 as
t → 0.
Note that this cannot hold for
p
=
∞
. Indeed, if
p
=
∞
, then the
∞
-norm
is the uniform norm. But we know that
f ∗ g
t
is always continuous, and the
uniform limit of continuous functions is continuous. So the formula cannot hold
if f is not already continuous.
Proof.
We fix
ε >
0. By a question on the example sheet, we can find
h
which
is continuous and with compact support such that kf −hk
p
≤
ε
3
. So we have
kf ∗ g
t
− h ∗ g
t
k
p
= k(f − h) ∗ g
t
k
p
≤ kf − hk
p
≤
ε
3
.
So it suffices for us to work with a continuous function
h
with compact support.
We let
e(y) =
Z
|h(x − y) − h(x)|
p
dx.
We first show that e is a bounded function:
e(y) ≤
Z
2
p
(|h(x − y)|
p
+ |h(x)|
p
) dx
= 2
p+1
khk
p
p
.
Also, since
h
is continuous and bounded, the dominated convergence theorem
tells us that e(y) → 0 as y → 0.
Moreover, using the fact that
R
g
t
(y) dy = 1, we have
kh ∗ g
t
− hk
p
p
=
Z
Z
(h(x − y) − h(x))g
t
(y) dy
p
dx
Since
g
t
(
y
) d
y
is a probability measure, by Jensen’s inequality, we can bound
this by
≤
ZZ
|h(x − y) − h(x)|
p
g
t
(y) dy dx
=
Z
Z
|h(x − y) − h(x)|
p
dx
g
t
(y) dy
=
Z
e(y)g
t
(y) dy
=
Z
e(
√
ty)g
1
(y) dy,
where we used the definition of
g
and substitution. We know that this tends to 0
as
t →
0 by the bounded convergence theorem, since we know that
e
is bounded.
Finally, we have
kf ∗ g
t
− f k
p
≤ kf ∗ g
t
− h ∗ g
t
k
p
+ kh ∗ g
t
− hk
p
+ kh − fk
p
≤
ε
3
+
ε
3
+ kh ∗ g
t
− hk
p
=
2ε
3
+ kh ∗ g
t
− hk
p
.
Since we know that
kh ∗ g
t
− hk
p
→
0 as
t →
0, we know that for all sufficiently
small t, the function is bounded above by ε. So we are done.
With this lemma, we can now prove the Fourier inversion theorem.
Proof of Fourier inversion theorem.
The first part is just a special case of the
previous lemma. Indeed, recall that
\
f ∗ g
t
(u) =
ˆ
f(u)e
−|u|
2
t/2
.
Since Gaussian convolutions satisfy Fourier inversion formula, we know that
f
t
= f ∗ g
t
.
So the previous lemma says exactly that kf
t
− f k
1
→ 0.
Suppose now that
ˆ
f ∈ L
1
as well. Then looking at the integrand of
f
t
(x) = (2π)
−d
Z
ˆ
f(u)e
−|u|
2
t/2
e
−i(u,x)
du,
we know that
ˆ
f(u)e
−|u|
2
t/2
e
−i(u,x)
≤ |
ˆ
f|.
Then by the dominated convergence theorem with dominating function
|
ˆ
f|
, we
know that this converges to
f
t
(x) → (2π)
−d
Z
ˆ
f(u)e
−i(u,x)
du as t → 0.
By the first part, we know that
kf
t
− f k
1
→
0 as
t →
0. So we can find a
sequence (
t
n
) with
t
n
>
0,
t
n
→
0 so that
f
t
n
→ f
a.e. Combining these, we
know that
f(x) =
Z
ˆ
f(u)e
−i(u,x)
du a.e.
So done.
5.4 Fourier transform in L
2
It turns out wonderful things happen when we take the Fourier transform of an
L
2
function.
Theorem
(Plancherel identity)
.
For any function
f ∈ L
1
∩ L
2
, the Plancherel
identity holds:
k
ˆ
fk
2
= (2π)
d/2
kfk
2
.
As we are going to see in a moment, this is just going to follow from the
Fourier inversion formula plus a clever trick.
Proof.
We first work with the special case where
f,
ˆ
f ∈ L
1
, since the Fourier
inversion formula holds for f. We then have
kfk
2
2
=
Z
f(x)f(x) dx
=
1
(2π)
d
Z
Z
ˆ
f(u)e
−i(u,x)
du
f(x) dx
=
1
(2π)
d
Z
ˆ
f(u)
f(x)e
−i(u,x)
dx
du
=
1
(2π)
d
Z
ˆ
f(u)
f(x)e
i(u,x)
dx
du
=
1
(2π)
d
Z
ˆ
f(u)
ˆ
f(u) du
=
1
(2π)
d
k
ˆ
f(u)k
2
2
.
So the Plancherel identity holds for f .
To prove it for the general case, we use this result and an approximation
argument. Suppose that
f ∈ L
1
∩ L
2
, and let
f
t
=
f ∗g
t
. Then by our earlier
lemma, we know that
kf
t
k
2
→ kf k
2
as t → 0.
Now note that
ˆ
f
t
(u) =
ˆ
f(u)ˆg
t
(u) =
ˆ
f(u)e
−|u|
2
t/2
.
The important thing is that e
−|u|
2
t/2
% 1 as t → 0. Therefore, we know
k
ˆ
f
t
k
2
2
=
Z
|
ˆ
f(u)|
2
e
−|u|
2
t
du →
Z
|
ˆ
f(u)|
2
du = k
ˆ
fk
2
2
as t → 0, by monotone convergence.
Since f
t
,
ˆ
f
t
∈ L
1
, we know that the Plancherel identity holds, i.e.
k
ˆ
f
t
k
2
= (2π)
d/2
kf
t
k
2
.
Taking the limit as t → 0, the result follows.
What is this good for? It turns out that the Fourier transform gives as
a bijection from
L
2
to itself. While it is not true that the Fourier inversion
formula holds for everything in
L
2
, it holds for enough of them that we can just
approximate everything else by the ones that are nice. Then the above tells us
that in fact this bijection is a norm-preserving automorphism.
Theorem.
There exists a unique Hilbert space automorphism
F
:
L
2
→ L
2
such that
F ([f]) = [(2π)
−d/2
ˆ
f]
whenever f ∈ L
1
∩ L
2
.
Here [
f
] denotes the equivalence class of
f
in
L
2
, and we say
F
:
L
2
→ L
2
is
a Hilbert space automorphism if it is a linear bijection that preserves the inner
product.
Note that in general, there is no guarantee that
F
sends a function to its
Fourier transform. We know that only if it is a well-behaved function (i.e. in
L
1
∩ L
2
). However, the formal property of it being a bijection from
L
2
to itself
will be convenient for many things.
Proof. We define F
0
: L
1
∩ L
2
→ L
2
by
F
0
([f]) = [(2π)
−d/2
ˆ
f].
By the Plancherel identity, we know F
0
preserves the L
2
norm, i.e.
kF
0
([f])k
2
= k[f]k
2
.
Also, we know that
L
1
∩ L
2
is dense in
L
2
, since even the continuous functions
with compact support are dense. So we know
F
0
extends uniquely to an isometry
F : L
2
→ L
2
.
Since it preserves distance, it is in particular injective. So it remains to show
that the map is surjective. By Fourier inversion, the subspace
V = {[f] ∈ L
2
: f,
ˆ
f ∈ L
1
}
is sent to itself by the map
F
. Also if
f ∈ V
, then
F
4
[
f
] = [
f
] (note that
applying it twice does not suffice, because we actually have
F
2
[
f
](
x
) = [
f
](
−x
)).
So
V
is contained in the image
F
, and also
V
is dense in
L
2
, again because it
contains all Gaussian convolutions (we have
ˆ
f
t
=
ˆ
f ˆg
t
, and
ˆ
f
is bounded and
ˆg
t
is decaying exponentially). So we know that F is surjective.
5.5 Properties of characteristic functions
We are now going to state a bunch of theorems about characteristic functions.
Since the proofs are not examinable (but the statements are!), we are only going
to provide a rough proof sketch.
Theorem.
The characteristic function
φ
X
of a distribution
µ
X
of a random
variable
X
determines
µ
X
. In other words, if
X
and
˜
X
are random variables
and φ
X
= φ
˜
X
, then µ
X
= µ
˜
X
Proof sketch.
Use the Fourier inversion to show that
φ
X
determines
µ
X
(
g
) =
E[g(X)] for any bounded, continuous g.
Theorem.
If
φ
X
is integrable, then
µ
X
has a bounded, continuous density
function
f
X
(x) = (2π)
−d
Z
φ
X
(u)e
−i(u,x)
du.
Proof sketch.
Let
Z ∼ N
(0
,
1) be independent of
X
. Then
X
+
√
tZ
has a
bounded continuous density function which, by Fourier inversion, is
f
t
(x) = (2π)
−d
Z
φ
X
(u)e
−|u|
2
t/2
e
−i(u,x)
du.
Sending
t →
0 and using the dominated convergence theorem with dominating
function |φ
X
|.
The next theorem relates to the notion of weak convergence.
Definition
(Weak convergence of measures)
.
Let
µ,
(
µ
n
) be Borel probability
measures. We say that
µ
n
→ µ
weakly if and only if
µ
n
(
g
)
→ µ
(
g
) for all
bounded continuous g.
Similarly, we can define weak convergence of random variables.
Definition
(Weak convergence of random variables)
.
Let
X,
(
X
n
) be random
variables. We say
X
n
→ X
weakly iff
µ
X
n
→ µ
X
weakly, iff
E
[
g
(
X
n
)]
→ E
[
g
(
X
)]
for all bounded continuous g.
This is related to the notion of convergence in distribution, which we defined
long time ago without talking about it much. It is an exercise on the example
sheet that weak convergence of random variables in
R
is equivalent to convergence
in distribution.
It turns out that weak convergence is very useful theoretically. One reason is
that they are related to convergence of characteristic functions.
Theorem.
Let
X,
(
X
n
) be random variables with values in
R
d
. If
φ
X
n
(
u
)
→
φ
X
(u) for each u ∈ R
d
, then µ
X
n
→ µ
X
weakly.
The main application of this that will appear later is that this is the fact
that allows us to prove the central limit theorem.
Proof sketch.
By the example sheet, it suffices to show that
E
[
g
(
X
n
)]
→ E
[
g
(
X
)]
for all compactly supported
g ∈ C
∞
. We then use Fourier inversion and
convergence of characteristic functions to check that
E[g(X
n
+
√
tZ)] → E[g(X +
√
tZ)]
for all
t >
0 for
Z ∼ N
(0
,
1) independent of
X,
(
X
n
). Then we check that
E[g(X
n
+
√
tZ)] is close to E[g(X
n
)] for t > 0 small, and similarly for X.
5.6 Gaussian random variables
Recall that in the proof of the Fourier inversion theorem, we used these things
called Gaussians, but didn’t really say much about them. These will be useful
later on when we want to prove the central limit theorem, because the central
limit theorem says that in the long run, things look like Gaussians. So here we
lay out some of the basic definitions and properties of Gaussians.
Definition
(Gaussian random variable)
.
Let
X
be a random variable on
R
.
This is said to be Gaussian if there exists
µ ∈ R
and
σ ∈
(0
, ∞
) such that the
density of X is
f
X
(x) =
1
√
2πσ
2
exp
−
(x − µ)
2
2σ
2
.
A constant random variable
X
=
µ
corresponds to
σ
= 0. We say this has mean
µ and variance σ
2
.
When this happens, we write X ∼ N(µ, σ
2
).
For completeness, we record some properties of Gaussian random variables.
Proposition. Let X ∼ N (µ, σ
2
). Then
E[X] = µ, var(X) = σ
2
.
Also, for any a, b ∈ R, we have
aX + b ∼ N(aµ + b, a
2
σ
2
).
Lastly, we have
φ
X
(u) = e
−iµu−u
2
σ
2
/2
.
Proof.
All but the last of them follow from direct calculation, and can be found
in IA Probability.
For the last part, if X ∼ N (µ, σ
2
), then we can write
X = σZ + µ,
where
Z ∼ N
(0
,
1). Recall that we have previously found that the characteristic
function of a N(0, 1) function is
φ
Z
(u) = e
−|u|
2
/2
.
So we have
φ
X
(u) = E[e
iu(σZ+µ)
]
= e
iuµ
E[e
iuσZ
]
= e
iuµ
φ
Z
(iuσ)
= e
iuµ−u
2
σ
2
/2
.
What we are next going to do is to talk about the corresponding facts for
the Gaussian in higher dimensions. Before that, we need to come up with the
definition of a higher-dimensional Gaussian distribution. This might be different
from the one you’ve seen previously, because we want to allow some degeneracy
in our random variable, e.g. some of the dimensions can be constant.
Definition (Gaussian random variable). Let X be a random variable. We say
that X is a Gaussian on R
n
if (u, X) is Gaussian on R for all u ∈ R
n
.
We are now going to prove a version of our previous theorem to higher
dimensional Gaussians.
Theorem.
Let
X
be Gaussian on
R
n
, and le t
A
be an
m×n
matrix and
b ∈ R
m
.
Then
(i) AX + b is Gaussian on R
m
.
(ii) X ∈ L
2
and its law
µ
X
is determined by
µ
=
E
[
X
] and
V
=
var
(
X
), the
covariance matrix.
(iii) We have
φ
X
(u) = e
i(u,µ)−(u,V u)/2
.
(iv) If V is invertible, then X has a density of
f
X
(x) = (2π)
−n/2
(det V )
−1/2
exp
−
1
2
(x − µ, V
−1
(x − µ))
.
(v)
If
X
= (
X
1
, X
2
) where
X
i
∈ R
n
i
, then
cov
(
X
1
, X
2
) = 0 iff
X
1
and
X
2
are
independent.
Proof.
(i) If u ∈ R
m
, then we have
(AX + b, u) = (AX, u) + (b, u) = (X, A
T
u) + (b, u).
Since (
X, A
T
u
) is Gaussian and (
b, u
) is constant, it follows that (
AX
+
b, u
)
is Gaussian.
(ii)
We know in particular that each component of
X
is a Gaussian random
variable, which are in
L
2
. So
X ∈ L
2
. We will prove the second part of (ii)
with (iii)
(iii) If µ = E[X] and V = var(X), then if u ∈ R
n
, then we have
E[(u, X)] = (u, µ), var((u, X)) = (u, V u).
So we know
(u, X) ∼ N((u, µ), (u, V u)).
So it follows that
φ
X
(u) = E[e
i(u,X)
] = e
i(u,µ)−(u,V u)/2
.
So
µ
and
V
determine the characteristic function of
X
, which in turn
determines the law of X.
(iv)
We start off with a boring Gaussian vector
Y
= (
Y
1
, ··· , Y
n
), where the
Y
i
∼ N (0, 1) are independent. Then the density of Y is
f
Y
(y) = (2π)
−n/2
e
−|y|
2
/2
.
We are now going to construct X from Y . We define
˜
X = V
1/2
Y + µ.
This makes sense because
V
is always non-negative definite. Then
˜
X
is
Gaussian with
E
[
˜
X
] =
µ
and
var
(
˜
X
) =
V
. Therefore
X
has the same
distribution as
˜
X
. Since
V
is assumed to be invertible, we can compute
the density of
˜
X using the change of variables formula.
(v) It is clear that if X
1
and X
2
are independent, then cov(X
1
, X
2
) = 0.
Conversely, let X = (X
1
, X
2
), where cov(X
1
, X
2
) = 0. Then we have
V = var(X) =
V
11
0
0 V
22
.
Then for u = (u
1
, u
2
), we have
(u, V u) = (u
1
V
11
u
1
) + (u
2
, V
22
u
2
),
where V
11
= var(X
1
) and V
22
var(X
2
). Then we have
φ
X
(u) = e
iµu−(u,V u)/2
= e
iµ
1
u
1
−(u
1
,V
11
u
1
)/2
e
iµ
2
u
2
−(u
2
,V
22
u
2
)/2
= φ
X
1
(u
1
)φ
X
2
(u
2
).
So it follows that X
1
and X
2
are independent.
6 Ergodic theory
We are now going to study a new topic — ergodic theory. This is the study
the “long run behaviour” of system under the evolution of some Θ. Due to time
constraints, we will not do much with it. We are going to prove two ergodic
theorems that tell us what happens in the long run, and this will be useful when
we prove our strong law of large numbers at the end of the course.
The general settings is that we have a measure space (
E, E, µ
) and a measur-
able map Θ :
E → E
that is measure preserving, i.e.
µ
(
A
) =
µ
(Θ
−1
(
A
)) for all
A ∈ E.
Example.
Take (
E, E, µ
) = ([0
,
1)
, B
([0
,
1))
, Lebesgue
). For each
a ∈
[0
,
1), we
can define
Θ
a
(x) = x + a mod 1.
By what we’ve done earlier in the course, we know this translation map preserves
the Lebesgue measure on [0, 1).
Our goal is to try to understand the “long run averages” of the system when
we apply Θ many times. One particular quantity we are going to look at is the
following:
Let f be measurable. We define
S
n
(f) = f + f ◦Θ + ··· + f ◦Θ
n−1
.
We want to know what is the long run behaviour of
S
n
(f)
n
as n → ∞.
The ergodic theorems are going to give us the answer in a certain special
case. Finally, we will apply this in a particular case to get the strong law of
large numbers.
Definition
(Invariant subset)
.
We say
A ∈ E
is invariant for Θ if
A
= Θ
−1
(
A
).
Definition
(Invariant function)
.
A measurable function
f
is invariant if
f
=
f ◦ Θ.
Definition (E
Θ
). We write
E
Θ
= {A ∈ E : A is invariant}.
It is easy to show that
E
Θ
is a
σ
-algebra, and
f
:
E → R
is invariant iff it is
E
Θ
measurable.
Definition
(Ergodic)
.
We say Θ is ergodic if
A ∈ E
Θ
implies
µ
(
A
) = 0 or
µ(A
C
) = 0.
Example.
For the translation map on [0
,
1), we have Θ
a
is ergodic iff
a
is
irrational. Proof is left on example sheet 4.
Proposition.
If
f
is integrable and Θ is measure-preserving. Then
f ◦
Θ is
integrable and
Z
f ◦ Θdµ =
Z
E
fdµ.
It turns out that if Θ is ergodic, then there aren’t that many invariant
functions.
Proposition. If Θ is ergodic and f is invariant, then there exists a constant c
such that f = c a.e.
The proofs of these are left as an exercise on example sheet 4.
We are now going to spend a little bit of time studying a particular example,
because this will be needed to prove the strong law of large numbers.
Example
(Bernoulli shifts)
.
Let
m
be a probability distribution on
R
. Then
there exists an iid sequence
Y
1
, Y
2
, ···
with law
m
. Recall we constructed this
in a really funny way. Now we are going to build it in a more natural way.
We let
E
=
R
N
be the set of all real sequences (
x
n
). We define the
σ
-algebra
E
to be the
σ
-algebra generated by the projections
X
n
(
x
) =
x
n
. In other
words, this is the smallest
σ
-algebra such that all these functions are measurable.
Alternatively, this is the σ-algebra generated by the π-system
A =
(
Y
n∈N
A
n
, A
n
∈ B for all n and A
n
= R eventually
)
.
Finally, to define the measure µ, we let
Y = (Y
1
, Y
2
, ···) : Ω → E
where
Y
i
are iid random variables defined earlier, and Ω is the sample space of
the Y
i
.
Then
Y
is a measurable map because each of the
Y
i
’s is a random variable.
We let µ = P ◦ Y
−1
.
By the independence of Y
i
’s, we have that
µ(A) =
Y
n∈N
m(A
n
)
for any
A = A
1
× A
2
× ··· × A
n
× R × ···× R.
Note that the product is eventually 1, so it is really a finite product.
This (
E, E, µ
) is known as the canonical space associated with the sequence
of iid random variables with law m.
Finally, we need to define Θ. We define Θ : E → E by
Θ(x) = Θ(x
1
, x
2
, x
3
, ···) = (x
2
, x
3
, x
4
, ···).
This is known as the shift map.
Why do we care about this? Later, we are going to look at the function
f(x) = f(x
1
, x
2
, ···) = x
1
.
Then we have
S
n
(f) = f + f ◦Θ + ··· + f ◦Θ
n−1
= x
1
+ ··· + x
n
.
So
S
n
(f)
n
will the average of the first
n
things. So ergodic theory will tell us
about the long-run behaviour of the average.
Theorem. The shift map Θ is an ergodic, measure preserving transformation.
Proof. It is an exercise to show that Θ is measurable and measure preserving.
To show that Θ is ergodic, recall the definition of the tail σ-algebra
T
n
= σ(X
m
: m ≥ n + 1), T =
\
n
T
n
.
Suppose that A ∈
Q
n∈N
A
n
∈ A. Then
Θ
−n
(A) = {X
n+k
∈ A
k
for all k} ∈ T
n
.
Since
T
n
is a
σ
-algebra, we and Θ
−n
(
A
)
∈ T
N
for all
A ∈ A
and
σ
(
A
) =
E
, we
know Θ
−n
(A) ∈ T
N
for all A ∈ E.
So if A ∈ E
Θ
, i.e. A = Θ
−1
(A), then A ∈ T
N
for all N. So A ∈ T .
From the Kolmogorov 0-1 law, we know either
µ
[
A
] = 1 or
µ
[
A
] = 0. So
done.
6.1 Ergodic theorems
The proofs in this section are non-examinable.
Instead of proving the ergodic theorems directly, we first start by proving
the following magical lemma:
Lemma (Maximal ergodic lemma). Let f be integrable, and
S
∗
= sup
n≥0
S
n
(f) ≥ 0,
where S
0
(f) = 0 by convention. Then
Z
{S
∗
>0}
f dµ ≥ 0.
Proof. We let
S
∗
n
= max
0≤m≤n
S
m
and
A
n
= {S
∗
n
> 0}.
Now if 1 ≤ m ≤ n, then we know
S
m
= f + S
m−1
◦ Θ ≤ f + S
∗
n
◦ Θ.
Now on A
n
, we have
S
∗
n
= max
1≤m≤n
S
m
,
since S
0
= 0. So we have
S
∗
n
≤ f + S
∗
n
◦ Θ.
On A
C
n
, we have
S
∗
n
= 0 ≤ S
∗
n
◦ Θ.
So we know
Z
E
S
∗
n
dµ =
Z
A
n
S
∗
n
dµ +
Z
A
C
n
S
∗
n
dµ
≤
Z
A
n
f dµ +
Z
A
n
S
∗
n
◦ Θ dµ +
Z
A
C
n
S
∗
n
◦ Θ dµ
=
Z
A
n
f dµ +
Z
E
S
∗
n
◦ Θ dµ
=
Z
A
n
f dµ +
Z
E
S
∗
n
dµ
So we know
Z
A
n
f dµ ≥ 0.
Taking the limit as
n → ∞
gives the result by dominated convergence with
dominating function f.
We are now going to prove the two ergodic theorems, which tell us the
limiting behaviour of S
n
(f).
Theorem
(Birkhoff’s ergodic theorem)
.
Let (
E, E, µ
) be
σ
-finite and
f
be
integrable. There exists an invariant function
¯
f such that
µ(|
¯
f|) ≤ µ(|f |),
and
S
n
(f)
n
→
¯
f a.e.
If Θ is ergodic, then
¯
f is a constant.
Note that the theorem only gives
µ
(
|
¯
f|
)
≤ µ
(
|f|
). However, in many cases,
we can use some integration theorems such as dominated convergence to argue
that they must in fact be equal. In particular, in the ergodic case, this will allow
us to find the value of
¯
f.
Theorem
(von Neumann’s ergodic theorem)
.
Let (
E, E, µ
) be a finite measure
space. Let
p ∈
[1
, ∞
) and assume that
f ∈ L
p
. Then there is some function
¯
f ∈ L
p
such that
S
n
(f)
n
→
¯
f in L
p
.
Proof of Birkhoff’s ergodic theorem. We first note that
lim sup
n
S
n
n
, lim sup
n
S
n
n
are invariant functions, Indeed, we know
S
n
◦ Θ = f ◦ Θ + f ◦Θ
2
+ ··· + f ◦ Θ
n
= S
n+1
− f
So we have
lim sup
n→∞
S
n
◦ Θ
n
= lim sup
n→∞
S
n
n
+
f
n
→ lim sup
n→∞
S
n
n
.
Exactly the same reasoning tells us the lim inf is also invariant.
What we now need to show is that the set of points on which
lim sup
and
lim inf do not agree have measure zero. We set a < b. We let
D = D(a, b) =
x ∈ E : lim inf
n→∞
S
n
(x)
n
< a < b < lim sup
n→∞
S
n
(x)
n
.
Now if
lim sup
S
n
(x)
n
6
=
lim inf
S
n
(x)
n
, then there is some
a, b ∈ Q
such that
x ∈ D
(
a, b
). So by countable subadditivity, it suffices to show that
µ
(
D
(
a, b
)) = 0
for all a, b.
We now fix
a, b
, and just write
D
. Since
lim sup
S
n
n
and
lim inf
S
n
n
are both
invariant, we have that
D
is invariant. By restricting to
D
, we can assume that
D = E.
Suppose that B ∈ E and µ(G) < ∞. We let
g = f −b1
B
.
Then
g
is integrable because
f
is integrable and
µ
(
B
)
< ∞
. Moreover, we have
S
n
(g) = S
n
(f − b1
B
) ≥ S
n
(f) − nb.
Since we know that
lim sup
n
S
n
(f)
n
> b
by definition, we can find an
n
such that
S
n
(g) > 0. So we know that
S
∗
(g)(x) = sup
n
S
n
(g)(x) > 0
for all x ∈ D. By the maximal ergodic lemma, we know
0 ≤
Z
D
g dµ =
Z
D
f − b1
B
dµ =
Z
D
f dµ − bµ(B).
If we rearrange this, we know
bµ(B) ≤
Z
D
f dµ.
for all measurable sets
B ∈ E
with finite measure. Since our space is
σ
-finite, we
can find B
n
% D such µ(B
n
) < ∞ for all n. So taking the limit above tells
bµ(D) ≤
Z
D
f dµ. (†)
Now we can apply the same argument with (
−a
) in place of
b
and (
−f
) in place
of f to get
(−a)µ(D) ≤ −
Z
D
f dµ. (‡)
Now note that since
b > a
, we know that at least one of
b >
0 and
a <
0 has to
be true. In the first case, (
†
) tells us that
µ
(
D
) is finite, since
f
is integrable.
Then combining with (‡), we see that
bµ(D) ≤
Z
D
f dµ ≤ aµ(D).
But
a < b
. So we must have
µ
(
D
) = 0. The second case follows similarly (or
follows immediately by flipping the sign of f).
We are almost done. We can now define
¯
f(x) =
(
lim S
n
(f)/n the limit exists
0 otherwise
Then by the above calculations, we have
S
n
(f)
n
→
¯
f a.e.
Also, we know
¯
f
is invariant, because
lim S
n
(
f
)
/n
is invariant, and so is the set
where the limit exists.
Finally, we need to show that
µ(
¯
f) ≤ µ(|f |).
This is since
µ(|f ◦ Θ
n
|) = µ(|f|)
as Θ
n
preserves the metric. So we have that
µ(|S
n
|) ≤ nµ(|f |) < ∞.
So by Fatou’s lemma, we have
µ(|
¯
f|) ≤ µ
lim inf
n
S
n
n
≤ lim inf
n
µ
S
n
n
≤ µ(|f |)
The proof of the von Neumann ergodic theorem follows easily from Birkhoff’s
ergodic theorem.
Proof of von Neumann ergodic theorem.
It is an exercise on the example sheet
to show that
kf ◦ Θk
p
p
=
Z
|f ◦ Θ|
p
dµ =
Z
|f|
p
dµ = kf k
p
p
.
So we have
S
n
n
p
=
1
n
kf + f ◦Θ + ··· + f ◦Θ
n−1
k ≤ kfk
p
by Minkowski’s inequality.
So let ε > 0, and take M ∈ (0, ∞) so that if
g = (f ∨(−M)) ∧ M,
then
kf − gk
p
<
ε
3
.
By Birkhoff’s theorem, we know
S
n
(g)
n
→ ¯g
a.e.
Also, we know
S
n
(g)
n
≤ M
for all n. So by bounded convergence theorem, we know
S
n
(g)
n
− ¯g
p
→ 0
as n → ∞. So we can find N such that n ≥ N implies
S
n
(g)
n
− ¯g
p
<
ε
3
.
Then we have
¯
f − ¯g
p
p
=
Z
lim inf
n
S
n
(f − g)
n
p
dµ
≤ lim inf
Z
S
n
(f − g)
n
p
dµ
≤ kf − gk
p
p
.
So if n ≥ N , then we know
S
n
(f)
n
−
¯
f
p
≤
S
n
(f − g)
n
p
+
S
n
(g)
n
−
¯
f
p
+
¯g −
¯
f
p
≤ ε.
So done.
7 Big theorems
We are now going to use all the tools we have previously developed to prove
two of the most important theorems about the sums of independent random
variables, namely the strong law of large numbers and the central limit theorem.
7.1 The strong law of large numbers
Before we start proving the strong law of large numbers, we first spend some
time discussing the difference between the strong law and the weak law. In both
cases, we have a sequence (
X
n
) of iid random variables with
E
[
X
i
] =
µ
. We let
S
n
= X
1
+ ··· + X
n
.
–
The weak law of large number says
S
n
/n → µ
in probability as
n → ∞
,
provided E[X
2
1
] < ∞.
– The strong law of large number says S
n
/n → µ a.s. provided E|X
1
| < ∞.
So we see that the strong law is indeed stronger, because convergence almost
everywhere implies convergence in measure.
We are actually going to do two versions of the strong law with different
hypothesis.
Theorem
(Strong law of large numbers assuming finite fourth moments)
.
Let
(
X
n
) be a sequence of independent random variables such that there exists
µ ∈ R
and M > 0 such that
E[X
n
] = µ, E[X
4
n
] ≤ M
for all n. With S
n
= X
1
+ ··· + X
n
, we have that
S
n
n
→ µ a.s. as n → ∞.
Note that in this version, we do not require that the
X
n
are iid. We simply
need that they are independent and have the same mean.
The proof is completely elementary.
Proof. We reduce to the case that µ = 0 by setting
Y
n
= X
n
− µ.
We then have
E[Y
n
] = 0, E[Y
4
n
] ≤ 2
4
(E[µ
4
+ X
4
n
]) ≤ 2
4
(µ
4
+ M ).
So it suffices to show that the theorem holds with
Y
n
in place of
X
n
. So we can
assume that µ = 0.
By independence, we know that for i 6= j, we have
E[X
i
X
3
j
] = E[X
i
]E[X
3
j
] = 0.
Similarly, for all i, j, k, ` distinct, we have
E[X
i
X
j
X
2
k
] = E[X
i
X
j
X
k
X
`
] = 0.
Hence we know that
E[S
4
n
] = E
n
X
k=1
X
4
k
+ 6
X
1≤i<j≤n
X
2
i
X
2
j
.
We know the first term is bounded by
nM
, and we also know that for
i 6
=
j
, we
have
E[X
2
i
X
2
j
] = E[X
2
i
]E[X
2
j
] ≤
q
E[X
4
i
]E[X
4
j
] ≤ M
by Jensen’s inequality. So we know
E
6
X
1≤i<j≤n
X
2
i
X
2
j
≤ 3n(n − 1)M.
Putting everything together, we have
E[S
4
n
] ≤ nM + 3n(n − 1)M ≤ 3n
2
M.
So we know
E
(S
n
/n)
4
≤
3M
n
2
.
So we know
E
"
∞
X
n=1
S
n
n
4
#
≤
∞
X
n=1
3M
n
2
< ∞.
So we know that
∞
X
n=1
S
n
n
4
< ∞ a.s.
So we know that (S
n
/n)
4
→ 0 a.s., i.e. S
n
/n → 0 a.s.
We are now going to get rid of the assumption that we have finite fourth
moments, but we’ll need to work with iid random variables.
Theorem
(Strong law of large numbers)
.
Let (
Y
n
) be an iid sequence of inte-
grable random variables with mean ν. With S
n
= Y
1
+ ··· + Y
n
, we have
S
n
n
→ ν a.s.
We will use the ergodic theorem to prove this. This is not the “usual” proof
of the strong law, but since we’ve done all that work on ergodic theory, we might
as well use it to get a clean proof. Most of the work left is setting up the right
setting for the proof.
Proof.
Let
m
be the law of
Y
1
and let
Y
= (
Y
1
, Y
2
, Y
3
, ···
). We can view
Y
as
a function
Y : Ω → R
N
= E.
We let (
E, E, µ
) be the canonical space associated with the distribution
m
so
that
µ = P ◦ Y
−1
.
We let f : E → R be given by
f(x
1
, x
2
, ···) = X
1
(x
1
, ··· , x
n
) = x
1
.
Then
X
1
has law given by
m
, and in particular is integrable. Also, the shift map
Θ : E → E given by
Θ(x
1
, x
2
, ···) = (x
2
, x
3
, ···)
is measure-preserving and ergodic. Thus, with
S
n
(f) = f + f ◦Θ + ··· + f ◦Θ
n−1
= X
1
+ ··· + X
n
,
we have that
S
n
(f)
n
→
¯
f a.e.
by Birkhoff’s ergodic theorem. We also have convergence in
L
1
by von Neumann
ergodic theorem.
Here
¯
f
is
E
Θ
-measurable, and Θ is ergodic, so we know that
¯
f
=
c
a.e. for
some constant c. Moreover, we have
c = µ(
¯
f) = lim
n→∞
µ(S
n
(f)/n) = ν.
So done.
7.2 Central limit theorem
Theorem.
Let (
X
n
) be a sequence of iid random variables with
E
[
X
i
] = 0 and
E[X
2
1
] = 1. Then if we set
S
n
= X
1
+ ··· + X
n
,
then for all x ∈ R, we have
P
S
n
√
n
≤ x
→
Z
x
−∞
e
−y
2
/2
√
2π
dy = P[N(0, 1) ≤ x]
as n → ∞.
Proof.
Let
φ
(
u
) =
E
[
e
iuX
1
]. Since
E
[
X
2
1
] = 1
< ∞
, we can differentiate under
the expectation twice to obtain
φ(u) = E[e
iuX
1
], φ
0
(u) = E[iX
1
e
iuX
1
], φ
00
(u) = E[−X
2
1
e
iuX
1
].
Evaluating at 0, we have
φ(0) = 1, φ
0
(0) = 0, φ
00
(0) = −1.
So if we Taylor expand φ at 0, we have
φ(u) = 1 −
u
2
2
+ o(u
2
).
We consider the characteristic function of S
n
/
√
n
φ
n
(u) = E[e
iuS
n
/
√
n
]
=
n
Y
i=1
E[e
iuX
j
/
√
n
]
= φ(u/
√
n)
n
=
1 −
u
2
2n
+ o
u
2
n
n
.
We now take the logarithm to obtain
log φ
n
(u) = n log
1 −
u
2
2n
+ o
u
2
n
= −
u
2
2
+ o(1)
→ −
u
2
2
So we know that
φ
n
(u) → e
−u
2
/2
,
which is the characteristic function of a N(0, 1) random variable.
So we have convergence in characteristic function, hence weak convergence,
hence convergence in distribution.