2.4 Convergence of measurable functions
The next thing to look at is the convergence of measurable functions. In measure
theory, wonderful things happen when we talk about convergence. In analysis,
most of the time we had to require uniform convergence, or even stronger
notions, if we wanted limits to behave well. However, in measure theory, the kinds
of convergence we talk about are somewhat pointwise in nature. In fact, some of them
will be weaker than pointwise convergence. Yet, we are still going to get good
properties out of them.
Definition (Convergence almost everywhere). Suppose that $(E, \mathcal{E}, \mu)$ is a measure space. Suppose that $(f_n)$, $f$ are measurable functions. We say $f_n \to f$ almost everywhere (a.e.) if
\[
  \mu(\{x \in E : f_n(x) \not\to f(x)\}) = 0.
\]
If $(E, \mathcal{E}, \mu)$ is a probability space, this is called almost sure convergence.
To see this makes sense, i.e. that the set in there is actually measurable, note that
\[
  \{x \in E : f_n(x) \not\to f(x)\} = \{x \in E : \limsup_n |f_n(x) - f(x)| > 0\}.
\]
We have previously seen that $\limsup_n |f_n - f|$ is a non-negative measurable function. So the set $\{x \in E : \limsup_n |f_n(x) - f(x)| > 0\}$ is measurable.
Another useful notion of convergence is convergence in measure.
Definition (Convergence in measure). Suppose that $(E, \mathcal{E}, \mu)$ is a measure space. Suppose that $(f_n)$, $f$ are measurable functions. We say $f_n \to f$ in measure if for each $\varepsilon > 0$, we have
\[
  \mu(\{x \in E : |f_n(x) - f(x)| \geq \varepsilon\}) \to 0 \text{ as } n \to \infty.
\]
If $(E, \mathcal{E}, \mu)$ is a probability space, then this is called convergence in probability.
In the case of a probability space, this says
\[
  \mathbb{P}(|X_n - X| \geq \varepsilon) \to 0 \text{ as } n \to \infty
\]
for all $\varepsilon > 0$, which is how we stated the weak law of large numbers in the past.
After we define integration, we can consider the norms of a function $f$ given by
\[
  \|f\|_p = \left( \int |f(x)|^p \, \mathrm{d}x \right)^{1/p}.
\]
Then, in particular, if $\|f_n - f\|_p \to 0$, then $f_n \to f$ in measure, and this provides
an easy way to see that functions converge in measure.
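As a quick illustration of convergence in probability (a sketch not from the notes; the uniform samples, the tolerance $\varepsilon = 0.05$ and the Monte Carlo sizes are assumptions of mine), one can estimate the probability appearing in the weak law of large numbers by simulation and watch it decay:

    import numpy as np

    # Sketch only: X_i i.i.d. Uniform[0, 1], so the sample mean converges in probability to 1/2.
    # We estimate P(|X_bar_n - 1/2| >= eps) empirically over many independent runs.

    rng = np.random.default_rng(0)
    eps = 0.05
    trials = 2_000

    for n in (10, 100, 1000):
        samples = rng.random((trials, n))            # 'trials' independent runs of length n
        means = samples.mean(axis=1)
        prob = np.mean(np.abs(means - 0.5) >= eps)   # empirical P(|X_bar_n - 1/2| >= eps)
        print(n, prob)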
In general, neither of these notions implies the other. However, the following
theorem provides us with a convenient dictionary to translate between the two
notions.
Theorem.
(i) If $\mu(E) < \infty$, then $f_n \to f$ a.e. implies $f_n \to f$ in measure.
(ii) For any $E$, if $f_n \to f$ in measure, then there exists a subsequence $(f_{n_k})$ such that $f_{n_k} \to f$ a.e.
Proof.
(i) First suppose $\mu(E) < \infty$, and fix $\varepsilon > 0$. Consider
\[
  \mu(\{x \in E : |f_n(x) - f(x)| \leq \varepsilon\}).
\]
We use the result from the first example sheet that for any sequence of events $(A_n)$, we have
\[
  \liminf_n \mu(A_n) \geq \mu(\liminf_n A_n).
\]
Applying this to the above sequence gives
\[
  \liminf_n \mu(\{x : |f_n(x) - f(x)| \leq \varepsilon\}) \geq \mu(\{x : |f_m(x) - f(x)| \leq \varepsilon \text{ eventually}\}) \geq \mu(\{x \in E : |f_m(x) - f(x)| \to 0\}) = \mu(E).
\]
As $\mu(E) < \infty$, taking complements gives $\mu(\{x \in E : |f_n(x) - f(x)| > \varepsilon\}) = \mu(E) - \mu(\{x \in E : |f_n(x) - f(x)| \leq \varepsilon\}) \to 0$ as $n \to \infty$.
(ii) Suppose that $f_n \to f$ in measure. We pick a subsequence $(n_k)$ such that
\[
  \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k}\right\}\right) \leq 2^{-k}.
\]
Then we have
\[
  \sum_{k=1}^{\infty} \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k}\right\}\right) \leq \sum_{k=1}^{\infty} 2^{-k} = 1 < \infty.
\]
By the first Borel–Cantelli lemma, we know
\[
  \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k} \text{ i.o.}\right\}\right) = 0.
\]
So $f_{n_k} \to f$ a.e.
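The subsequence extraction in (ii) is easy to see in action. The following Python sketch (an illustration, not from the notes) uses the classical “typewriter” sequence on $E = [0, 1)$ with Lebesgue measure: for $2^k \leq n < 2^{k+1}$, let $f_n$ be the indicator of the block $[(n - 2^k)/2^k, (n - 2^k + 1)/2^k)$. Then $\mu(\{|f_n| > \varepsilon\}) = 2^{-k} \to 0$, so $f_n \to 0$ in measure, but $f_n(x) = 1$ infinitely often for every $x$, so $f_n$ converges at no point. The subsequence $n_k = 2^k$ satisfies $\mu(\{|f_{n_k}| > 1/k\}) \leq 2^{-k}$, exactly as in the proof, and converges to $0$ everywhere except at $x = 0$, i.e. almost everywhere.

    import numpy as np

    # Sketch only (assumed example): the "typewriter" sequence of indicator functions on [0, 1).

    def f(n, x):
        k = int(np.floor(np.log2(n)))
        lo = (n - 2 ** k) / 2 ** k
        hi = (n - 2 ** k + 1) / 2 ** k
        return float(lo <= x < hi)

    x = 0.3
    print("f_n(0.3):      ", [f(n, x) for n in range(1, 33)])   # keeps returning to 1: no limit at x
    print("f_{2^k}(0.3):  ", [f(2 ** k, x) for k in range(7)])  # the subsequence is eventually 0
    print("mu(|f_n| > 0): ", [2.0 ** -int(np.floor(np.log2(n))) for n in range(1, 33)])  # -> 0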
It is important that we assume that $\mu(E) < \infty$ for the first part.
Example. Consider $(E, \mathcal{E}, \mu) = (\mathbb{R}, \mathcal{B}, \text{Lebesgue})$. Take $f_n(x) = \mathbf{1}_{[n, \infty)}(x)$.
Then $f_n(x) \to 0$ for all $x$, and in particular almost everywhere. However, we have
\[
  \mu\left(\left\{x \in \mathbb{R} : |f_n(x)| > \frac{1}{2}\right\}\right) = \mu([n, \infty)) = \infty
\]
for all $n$.
There is one last type of convergence we are interested in. We will first
formulate it only in the probability setting, but there is an analogous notion in
measure theory known as weak convergence, which we will discuss much later on
in the course.
Definition (Convergence in distribution). Let $(X_n)$, $X$ be random variables with distribution functions $F_{X_n}$ and $F_X$. Then we say $X_n \to X$ in distribution if $F_{X_n}(x) \to F_X(x)$ for all $x \in \mathbb{R}$ at which $F_X$ is continuous.
Note that here we do not need $(X_n)$ and $X$ to live on the same probability space, since we only talk about the distribution functions.
But why do we have the condition with continuity points? The idea is that if the limiting distribution has a “jump” at $x$, it doesn't matter which side of the jump $F_X(x)$ is at. Here is a simple example that tells us why this is very important:
Example. Let $X_n$ be uniform on $[0, 1/n]$. Intuitively, this should converge to the random variable that is always zero.
We can compute
\[
  F_{X_n}(x) =
  \begin{cases}
    0 & x \leq 0\\
    nx & 0 < x < 1/n\\
    1 & x \geq 1/n
  \end{cases}.
\]
We can also compute the distribution function of the zero random variable as
\[
  F_0(x) =
  \begin{cases}
    0 & x < 0\\
    1 & x \geq 0
  \end{cases}.
\]
But $F_{X_n}(0) = 0$ for all $n$, while $F_0(0) = 1$.
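Numerically (a sketch not from the notes; the grid of test points is an arbitrary choice), one can see $F_{X_n}(x) \to F_0(x)$ at every $x \neq 0$, while the value at the jump point $x = 0$ never moves:

    import numpy as np

    # Sketch only: distribution functions of X_n ~ Uniform[0, 1/n] versus the zero random variable.

    def F_Xn(n, x):
        return np.clip(n * x, 0.0, 1.0)    # 0 for x <= 0, nx on (0, 1/n), 1 for x >= 1/n

    def F_0(x):
        return 1.0 if x >= 0 else 0.0      # distribution function of the constant 0

    for x in (-0.1, 0.0, 0.01, 0.1):
        print(x, [float(F_Xn(n, x)) for n in (1, 10, 100, 1000)], F_0(x))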
One might now think of cheating by cooking up some random variable such that $F$ is discontinuous at so many points that random, unrelated things converge to $F$. However, this cannot be done, because $F$ is a non-decreasing function, and thus can only have countably many points of discontinuity.
The big theorem we are going to prove about convergence in distribution is
that actually it is very boring and doesn’t give us anything new.
Theorem (Skorokhod representation theorem of weak convergence).
(i) If $(X_n)$, $X$ are defined on the same probability space, and $X_n \to X$ in probability, then $X_n \to X$ in distribution.
(ii) If $X_n \to X$ in distribution, then there exist random variables $(\tilde{X}_n)$ and $\tilde{X}$ defined on a common probability space with $F_{\tilde{X}_n} = F_{X_n}$ and $F_{\tilde{X}} = F_X$ such that $\tilde{X}_n \to \tilde{X}$ a.s.
Proof. Let $S = \{x \in \mathbb{R} : F_X \text{ is continuous at } x\}$.
(i) Assume that $X_n \to X$ in probability. Fix $x \in S$. We need to show that $F_{X_n}(x) \to F_X(x)$ as $n \to \infty$.
We fix $\varepsilon > 0$. Since $x \in S$, this implies that there is some $\delta > 0$ such that
\[
  F_X(x - \delta) \geq F_X(x) - \frac{\varepsilon}{2}, \qquad F_X(x + \delta) \leq F_X(x) + \frac{\varepsilon}{2}.
\]
We fix $N$ large such that $n \geq N$ implies $\mathbb{P}[|X_n - X| \geq \delta] \leq \frac{\varepsilon}{2}$. Then
\[
  F_{X_n}(x) = \mathbb{P}[X_n \leq x] = \mathbb{P}[(X_n - X) + X \leq x].
\]
We now notice that
\[
  \{(X_n - X) + X \leq x\} \subseteq \{X \leq x + \delta\} \cup \{|X_n - X| > \delta\}.
\]
So we have
\[
  F_{X_n}(x) \leq \mathbb{P}[X \leq x + \delta] + \mathbb{P}[|X_n - X| > \delta] \leq F_X(x + \delta) + \frac{\varepsilon}{2} \leq F_X(x) + \varepsilon.
\]
We similarly have
\[
  F_{X_n}(x) = \mathbb{P}[X_n \leq x] \geq \mathbb{P}[X \leq x - \delta] - \mathbb{P}[|X_n - X| > \delta] \geq F_X(x - \delta) - \frac{\varepsilon}{2} \geq F_X(x) - \varepsilon.
\]
Combining, we have that $n \geq N$ implies $|F_{X_n}(x) - F_X(x)| \leq \varepsilon$. Since $\varepsilon$ was arbitrary, we are done.
(ii) Suppose $X_n \to X$ in distribution. We again let
\[
  (\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}((0, 1)), \text{Lebesgue}).
\]
We let
\[
  \tilde{X}_n(\omega) = \inf\{x : \omega \leq F_{X_n}(x)\}, \qquad \tilde{X}(\omega) = \inf\{x : \omega \leq F_X(x)\}.
\]
Recall from before that $\tilde{X}_n$ has the same distribution function as $X_n$ for all $n$, and $\tilde{X}$ has the same distribution as $X$. Moreover, we have
\[
  \tilde{X}_n(\omega) \leq x \Leftrightarrow \omega \leq F_{X_n}(x), \qquad x < \tilde{X}_n(\omega) \Leftrightarrow F_{X_n}(x) < \omega,
\]
and similarly if we replace $X_n$ with $X$.
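Before the general argument, it may help to see what this construction does in the earlier Uniform$[0, 1/n]$ example (a sketch not from the notes; the closed-form inverses below are worked out by hand from the explicit distribution functions): since $F_{X_n}(x) = \min(nx, 1)$ for $x \geq 0$, we get $\tilde{X}_n(\omega) = \omega/n$, and since $F_0$ jumps to $1$ at $x = 0$, we get $\tilde{X}(\omega) = 0$. So $\tilde{X}_n(\omega) \to \tilde{X}(\omega)$ for every $\omega \in (0, 1)$, i.e. almost surely, even though $F_{X_n}(0) \not\to F_0(0)$.

    import numpy as np

    # Sketch only: the generalized inverses from the proof, for X_n ~ Uniform[0, 1/n] and X = 0.

    def X_tilde_n(n, omega):
        return omega / n     # inf{x : omega <= F_{X_n}(x)}, computed by hand for this example

    def X_tilde(omega):
        return 0.0           # inf{x : omega <= F_0(x)} = 0, since F_0(x) = 1 for all x >= 0

    rng = np.random.default_rng(1)
    omegas = rng.random(5)   # a few points of Omega = (0, 1) under Lebesgue measure
    for n in (1, 10, 100, 1000):
        print(n, np.round([X_tilde_n(n, w) for w in omegas], 5))  # -> 0 = X_tilde(omega)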
We are now going to show that with this particular choice, we have $\tilde{X}_n \to \tilde{X}$ a.s.
Note that $\tilde{X}$ is a non-decreasing function $(0, 1) \to \mathbb{R}$. Then by general analysis, $\tilde{X}$ has at most countably many discontinuities. We write
\[
  \Omega_0 = \{\omega \in (0, 1) : \tilde{X} \text{ is continuous at } \omega\}.
\]
Then $(0, 1) \setminus \Omega_0$ is countable, and hence has Lebesgue measure $0$. So $\mathbb{P}[\Omega_0] = 1$.
We are now going to show that $\tilde{X}_n(\omega) \to \tilde{X}(\omega)$ for all $\omega \in \Omega_0$.
Note that $F_X$ is a non-decreasing function, and hence its set of points of discontinuity $\mathbb{R} \setminus S$ is also countable. So $S$ is dense in $\mathbb{R}$. Fix $\omega \in \Omega_0$ and $\varepsilon > 0$.
We want to show that $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| \leq \varepsilon$ for all $n$ large enough.
Since $S$ is dense in $\mathbb{R}$, we can find $x^-, x^+ \in S$ such that $x^- < \tilde{X}(\omega) < x^+$ and $x^+ - x^- < \varepsilon$. What we want to do is to use the characteristic property of $\tilde{X}$ and $F_X$ to say that this implies
\[
  F_X(x^-) < \omega < F_X(x^+).
\]
Then since $F_{X_n} \to F_X$ at the points $x^-, x^+$, for sufficiently large $n$, we have
\[
  F_{X_n}(x^-) < \omega < F_{X_n}(x^+).
\]
Hence we have
\[
  x^- < \tilde{X}_n(\omega) < x^+.
\]
Then it follows that $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| < \varepsilon$.
However, this doesn't quite work, since $\tilde{X}(\omega) < x^+$ only implies $\omega \leq F_X(x^+)$, and our argument will break down. So we do a funny thing where we introduce a new variable $\omega^+$.
Since $\tilde{X}$ is continuous at $\omega$, we can find $\omega^+ \in (\omega, 1)$ such that $\tilde{X}(\omega^+) < x^+$.
Then we have
\[
  x^- < \tilde{X}(\omega) \leq \tilde{X}(\omega^+) < x^+.
\]
Then we have
\[
  F_X(x^-) < \omega < \omega^+ \leq F_X(x^+).
\]
So for sufficiently large $n$, we have
\[
  F_{X_n}(x^-) < \omega < F_{X_n}(x^+).
\]
So we have
\[
  x^- < \tilde{X}_n(\omega) \leq x^+.
\]
Since $x^- < \tilde{X}(\omega) < x^+$ and $x^+ - x^- < \varepsilon$, this gives $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| < \varepsilon$, and we are done.