2.4 Convergence of measurable functions
The next thing to look at is the convergence of measurable functions. In measure
theory, wonderful things happen when we talk about convergence. In analysis,
most of the time we had to require uniform convergence, or even stronger
notions, if we wanted limits to behave well. However, in measure theory, the kinds
of convergence we talk about are somewhat pointwise in nature. In fact, some of them
will be weaker than pointwise convergence. Yet, we are still going to get good
properties out of them.
Definition (Convergence almost everywhere). Suppose that $(E, \mathcal{E}, \mu)$ is a measure space. Suppose that $(f_n)$, $f$ are measurable functions. We say $f_n \to f$ almost everywhere (a.e.) if
\[
  \mu(\{x \in E : f_n(x) \not\to f(x)\}) = 0.
\]
If $(E, \mathcal{E}, \mu)$ is a probability space, this is called almost sure convergence.
To see this makes sense, i.e. that the set in there is actually measurable, note that
\[
  \{x \in E : f_n(x) \not\to f(x)\} = \{x \in E : \limsup_n |f_n(x) - f(x)| > 0\}.
\]
We have previously seen that $\limsup_n |f_n - f|$ is a non-negative measurable function. So the set $\{x \in E : \limsup_n |f_n(x) - f(x)| > 0\}$ is measurable.
Another useful notion of convergence is convergence in measure.
Definition (Convergence in measure). Suppose that $(E, \mathcal{E}, \mu)$ is a measure space. Suppose that $(f_n)$, $f$ are measurable functions. We say $f_n \to f$ in measure if for each $\varepsilon > 0$, we have
\[
  \mu(\{x \in E : |f_n(x) - f(x)| \geq \varepsilon\}) \to 0 \text{ as } n \to \infty.
\]
If $(E, \mathcal{E}, \mu)$ is a probability space, then this is called convergence in probability.
In the case of a probability space, this says
\[
  \mathbb{P}(|X_n - X| \geq \varepsilon) \to 0 \text{ as } n \to \infty
\]
for all $\varepsilon > 0$, which is how we stated the weak law of large numbers in the past.
After we define integration, we can consider the norms of a function $f$ given by
\[
  \|f\|_p = \left( \int |f(x)|^p \, \mathrm{d}x \right)^{1/p}.
\]
Then, in particular, if $\|f_n - f\|_p \to 0$, then $f_n \to f$ in measure, and this provides
an easy way to see that functions converge in measure.
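As a quick illustration of convergence in probability (a sketch not from the notes; the uniform samples, the tolerance $\varepsilon = 0.05$ and the Monte Carlo sizes are assumptions of mine), one can estimate the probability appearing in the weak law of large numbers by simulation and watch it decay:

    import numpy as np

    # Sketch only: X_i i.i.d. Uniform[0, 1], so the sample mean converges in probability to 1/2.
    # We estimate P(|X_bar_n - 1/2| >= eps) empirically over many independent runs.

    rng = np.random.default_rng(0)
    eps = 0.05
    trials = 2_000

    for n in (10, 100, 1000):
        samples = rng.random((trials, n))            # 'trials' independent runs of length n
        means = samples.mean(axis=1)
        prob = np.mean(np.abs(means - 0.5) >= eps)   # empirical P(|X_bar_n - 1/2| >= eps)
        print(n, prob)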
In general, neither of these notions implies the other. However, the following
theorem provides us with a convenient dictionary to translate between the two
notions.
Theorem.
(i) If $\mu(E) < \infty$, then $f_n \to f$ a.e. implies $f_n \to f$ in measure.
(ii) For any $E$, if $f_n \to f$ in measure, then there exists a subsequence $(f_{n_k})$ such that $f_{n_k} \to f$ a.e.
Proof.
(i) First suppose $\mu(E) < \infty$, and fix $\varepsilon > 0$. Consider
\[
  \mu(\{x \in E : |f_n(x) - f(x)| \leq \varepsilon\}).
\]
We use the result from the first example sheet that for any sequence of events $(A_n)$, we have
\[
  \liminf_n \mu(A_n) \geq \mu(\liminf_n A_n).
\]
Applying this to the above sequence gives
\[
  \liminf_n \mu(\{x : |f_n(x) - f(x)| \leq \varepsilon\}) \geq \mu(\{x : |f_m(x) - f(x)| \leq \varepsilon \text{ eventually}\}) \geq \mu(\{x \in E : |f_m(x) - f(x)| \to 0\}) = \mu(E).
\]
As $\mu(E) < \infty$, taking complements gives $\mu(\{x \in E : |f_n(x) - f(x)| > \varepsilon\}) = \mu(E) - \mu(\{x \in E : |f_n(x) - f(x)| \leq \varepsilon\}) \to 0$ as $n \to \infty$.
(ii) Suppose that $f_n \to f$ in measure. We pick a subsequence $(n_k)$ such that
\[
  \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k}\right\}\right) \leq 2^{-k}.
\]
Then we have
\[
  \sum_{k=1}^{\infty} \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k}\right\}\right) \leq \sum_{k=1}^{\infty} 2^{-k} = 1 < \infty.
\]
By the first Borel–Cantelli lemma, we know
\[
  \mu\left(\left\{x \in E : |f_{n_k}(x) - f(x)| > \frac{1}{k} \text{ i.o.}\right\}\right) = 0.
\]
So $f_{n_k} \to f$ a.e.
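The subsequence extraction in (ii) is easy to see in action. The following Python sketch (an illustration, not from the notes) uses the classical “typewriter” sequence on $E = [0, 1)$ with Lebesgue measure: for $2^k \leq n < 2^{k+1}$, let $f_n$ be the indicator of the block $[(n - 2^k)/2^k, (n - 2^k + 1)/2^k)$. Then $\mu(\{|f_n| > \varepsilon\}) = 2^{-k} \to 0$, so $f_n \to 0$ in measure, but $f_n(x) = 1$ infinitely often for every $x$, so $f_n$ converges at no point. The subsequence $n_k = 2^k$ satisfies $\mu(\{|f_{n_k}| > 1/k\}) \leq 2^{-k}$, exactly as in the proof, and converges to $0$ everywhere except at $x = 0$, i.e. almost everywhere.

    import numpy as np

    # Sketch only (assumed example): the "typewriter" sequence of indicator functions on [0, 1).

    def f(n, x):
        k = int(np.floor(np.log2(n)))
        lo = (n - 2 ** k) / 2 ** k
        hi = (n - 2 ** k + 1) / 2 ** k
        return float(lo <= x < hi)

    x = 0.3
    print("f_n(0.3):      ", [f(n, x) for n in range(1, 33)])   # keeps returning to 1: no limit at x
    print("f_{2^k}(0.3):  ", [f(2 ** k, x) for k in range(7)])  # the subsequence is eventually 0
    print("mu(|f_n| > 0): ", [2.0 ** -int(np.floor(np.log2(n))) for n in range(1, 33)])  # -> 0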
It is important that we assume that $\mu(E) < \infty$ for the first part.
Example. Consider $(E, \mathcal{E}, \mu) = (\mathbb{R}, \mathcal{B}, \text{Lebesgue})$. Take $f_n(x) = \mathbf{1}_{[n, \infty)}(x)$.
Then $f_n(x) \to 0$ for all $x$, and in particular almost everywhere. However, we have
\[
  \mu\left(\left\{x \in \mathbb{R} : |f_n(x)| > \frac{1}{2}\right\}\right) = \mu([n, \infty)) = \infty
\]
for all $n$.
There is one last type of convergence we are interested in. We will first
formulate it only in the probability setting, but there is an analogous notion in
measure theory known as weak convergence, which we will discuss much later on
in the course.
Definition (Convergence in distribution). Let $(X_n)$, $X$ be random variables with distribution functions $F_{X_n}$ and $F_X$. Then we say $X_n \to X$ in distribution if $F_{X_n}(x) \to F_X(x)$ for all $x \in \mathbb{R}$ at which $F_X$ is continuous.
Note that here we do not need $(X_n)$ and $X$ to live on the same probability space, since we only talk about the distribution functions.
But why do we have the condition with continuity points? The idea is that if the limiting distribution has a “jump” at $x$, it doesn't matter which side of the jump $F_X(x)$ is at. Here is a simple example that tells us why this is very important:
Example. Let $X_n$ be uniform on $[0, 1/n]$. Intuitively, this should converge to the random variable that is always zero.
We can compute
\[
  F_{X_n}(x) =
  \begin{cases}
    0 & x \leq 0\\
    nx & 0 < x < 1/n\\
    1 & x \geq 1/n
  \end{cases}.
\]
We can also compute the distribution function of the zero random variable as
\[
  F_0(x) =
  \begin{cases}
    0 & x < 0\\
    1 & x \geq 0
  \end{cases}.
\]
But $F_{X_n}(0) = 0$ for all $n$, while $F_0(0) = 1$.
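Numerically (a sketch not from the notes; the grid of test points is an arbitrary choice), one can see $F_{X_n}(x) \to F_0(x)$ at every $x \neq 0$, while the value at the jump point $x = 0$ never moves:

    import numpy as np

    # Sketch only: distribution functions of X_n ~ Uniform[0, 1/n] versus the zero random variable.

    def F_Xn(n, x):
        return np.clip(n * x, 0.0, 1.0)    # 0 for x <= 0, nx on (0, 1/n), 1 for x >= 1/n

    def F_0(x):
        return 1.0 if x >= 0 else 0.0      # distribution function of the constant 0

    for x in (-0.1, 0.0, 0.01, 0.1):
        print(x, [float(F_Xn(n, x)) for n in (1, 10, 100, 1000)], F_0(x))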
One might now think of cheating by cooking up some random variable such that $F$ is discontinuous at so many points that random, unrelated things converge to $F$. However, this cannot be done, because $F$ is a non-decreasing function, and thus can only have countably many points of discontinuity.
The big theorem we are going to prove about convergence in distribution is
that actually it is very boring and doesn’t give us anything new.
Theorem (Skorokhod representation theorem of weak convergence).
(i) If $(X_n)$, $X$ are defined on the same probability space, and $X_n \to X$ in probability, then $X_n \to X$ in distribution.
(ii) If $X_n \to X$ in distribution, then there exist random variables $(\tilde{X}_n)$ and $\tilde{X}$ defined on a common probability space with $F_{\tilde{X}_n} = F_{X_n}$ and $F_{\tilde{X}} = F_X$ such that $\tilde{X}_n \to \tilde{X}$ a.s.
Proof. Let $S = \{x \in \mathbb{R} : F_X \text{ is continuous at } x\}$.
(i) Assume that $X_n \to X$ in probability. Fix $x \in S$. We need to show that $F_{X_n}(x) \to F_X(x)$ as $n \to \infty$.
We fix $\varepsilon > 0$. Since $x \in S$, this implies that there is some $\delta > 0$ such that
\[
  F_X(x - \delta) \geq F_X(x) - \frac{\varepsilon}{2}, \qquad F_X(x + \delta) \leq F_X(x) + \frac{\varepsilon}{2}.
\]
We fix $N$ large such that $n \geq N$ implies $\mathbb{P}[|X_n - X| \geq \delta] \leq \frac{\varepsilon}{2}$. Then
\[
  F_{X_n}(x) = \mathbb{P}[X_n \leq x] = \mathbb{P}[(X_n - X) + X \leq x].
\]
We now notice that
\[
  \{(X_n - X) + X \leq x\} \subseteq \{X \leq x + \delta\} \cup \{|X_n - X| > \delta\}.
\]
So we have
\[
  F_{X_n}(x) \leq \mathbb{P}[X \leq x + \delta] + \mathbb{P}[|X_n - X| > \delta] \leq F_X(x + \delta) + \frac{\varepsilon}{2} \leq F_X(x) + \varepsilon.
\]
We similarly have
\[
  F_{X_n}(x) = \mathbb{P}[X_n \leq x] \geq \mathbb{P}[X \leq x - \delta] - \mathbb{P}[|X_n - X| > \delta] \geq F_X(x - \delta) - \frac{\varepsilon}{2} \geq F_X(x) - \varepsilon.
\]
Combining, we have that $n \geq N$ implies $|F_{X_n}(x) - F_X(x)| \leq \varepsilon$. Since $\varepsilon$ was arbitrary, we are done.
(ii) Suppose $X_n \to X$ in distribution. We again let
\[
  (\Omega, \mathcal{F}, \mathbb{P}) = ((0, 1), \mathcal{B}((0, 1)), \text{Lebesgue}).
\]
We let
\[
  \tilde{X}_n(\omega) = \inf\{x : \omega \leq F_{X_n}(x)\}, \qquad \tilde{X}(\omega) = \inf\{x : \omega \leq F_X(x)\}.
\]
Recall from before that $\tilde{X}_n$ has the same distribution function as $X_n$ for all $n$, and $\tilde{X}$ has the same distribution as $X$. Moreover, we have
\[
  \tilde{X}_n(\omega) \leq x \Leftrightarrow \omega \leq F_{X_n}(x), \qquad x < \tilde{X}_n(\omega) \Leftrightarrow F_{X_n}(x) < \omega,
\]
and similarly if we replace $X_n$ with $X$.
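Before the general argument, it may help to see what this construction does in the earlier Uniform$[0, 1/n]$ example (a sketch not from the notes; the closed-form inverses below are worked out by hand from the explicit distribution functions): since $F_{X_n}(x) = \min(nx, 1)$ for $x \geq 0$, we get $\tilde{X}_n(\omega) = \omega/n$, and since $F_0$ jumps to $1$ at $x = 0$, we get $\tilde{X}(\omega) = 0$. So $\tilde{X}_n(\omega) \to \tilde{X}(\omega)$ for every $\omega \in (0, 1)$, i.e. almost surely, even though $F_{X_n}(0) \not\to F_0(0)$.

    import numpy as np

    # Sketch only: the generalized inverses from the proof, for X_n ~ Uniform[0, 1/n] and X = 0.

    def X_tilde_n(n, omega):
        return omega / n     # inf{x : omega <= F_{X_n}(x)}, computed by hand for this example

    def X_tilde(omega):
        return 0.0           # inf{x : omega <= F_0(x)} = 0, since F_0(x) = 1 for all x >= 0

    rng = np.random.default_rng(1)
    omegas = rng.random(5)   # a few points of Omega = (0, 1) under Lebesgue measure
    for n in (1, 10, 100, 1000):
        print(n, np.round([X_tilde_n(n, w) for w in omegas], 5))  # -> 0 = X_tilde(omega)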
We are now going to show that with this particular choice, we have $\tilde{X}_n \to \tilde{X}$ a.s.
Note that $\tilde{X}$ is a non-decreasing function $(0, 1) \to \mathbb{R}$. Then by general analysis, $\tilde{X}$ has at most countably many discontinuities. We write
\[
  \Omega_0 = \{\omega \in (0, 1) : \tilde{X} \text{ is continuous at } \omega\}.
\]
Then $(0, 1) \setminus \Omega_0$ is countable, and hence has Lebesgue measure $0$. So $\mathbb{P}[\Omega_0] = 1$.
We are now going to show that $\tilde{X}_n(\omega) \to \tilde{X}(\omega)$ for all $\omega \in \Omega_0$.
Note that $F_X$ is a non-decreasing function, and hence its set of points of discontinuity $\mathbb{R} \setminus S$ is also countable. So $S$ is dense in $\mathbb{R}$. Fix $\omega \in \Omega_0$ and $\varepsilon > 0$.
We want to show that $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| \leq \varepsilon$ for all $n$ large enough.
Since $S$ is dense in $\mathbb{R}$, we can find $x^-, x^+ \in S$ such that $x^- < \tilde{X}(\omega) < x^+$ and $x^+ - x^- < \varepsilon$. What we want to do is to use the characteristic property of $\tilde{X}$ and $F_X$ to say that this implies
\[
  F_X(x^-) < \omega < F_X(x^+).
\]
Then since $F_{X_n} \to F_X$ at the points $x^-, x^+$, for sufficiently large $n$, we have
\[
  F_{X_n}(x^-) < \omega < F_{X_n}(x^+).
\]
Hence we have
\[
  x^- < \tilde{X}_n(\omega) < x^+.
\]
Then it follows that $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| < \varepsilon$.
However, this doesn't quite work, since $\tilde{X}(\omega) < x^+$ only implies $\omega \leq F_X(x^+)$, and our argument will break down. So we do a funny thing where we introduce a new variable $\omega^+$.
Since $\tilde{X}$ is continuous at $\omega$, we can find $\omega^+ \in (\omega, 1)$ such that $\tilde{X}(\omega^+) < x^+$.
Then we have
\[
  x^- < \tilde{X}(\omega) \leq \tilde{X}(\omega^+) < x^+.
\]
Then we have
\[
  F_X(x^-) < \omega < \omega^+ \leq F_X(x^+).
\]
So for sufficiently large $n$, we have
\[
  F_{X_n}(x^-) < \omega < F_{X_n}(x^+).
\]
So we have
\[
  x^- < \tilde{X}_n(\omega) \leq x^+.
\]
Since $x^- < \tilde{X}(\omega) < x^+$ and $x^+ - x^- < \varepsilon$, this gives $|\tilde{X}_n(\omega) - \tilde{X}(\omega)| < \varepsilon$, and we are done.