1 Some measure theory

III Advanced Probability



1.2 Conditional expectation
In this course, conditional expectation is going to play an important role, and
it is worth spending some time developing the theory. We are going to focus
on probability theory, which, mathematically, just means we assume $\mu(E) = 1$.
Practically, it is common to change notation to $E = \Omega$, $\mathcal{E} = \mathcal{F}$, $\mu = \mathbb{P}$ and
$\int \,\mathrm{d}\mu = \mathbb{E}$. Measurable functions will be written as $X, Y, Z$, and will be called
random variables. Elements in $\mathcal{F}$ will be called events. An element $\omega \in \Omega$ will
be called a realization.
There are many ways we can think about conditional expectations. The first
one is how most of us first encountered conditional probability.
Suppose $B \in \mathcal{F}$, with $\mathbb{P}(B) > 0$. Then the conditional probability of the
event $A$ given $B$ is
\[
  \mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.
\]
This should be interpreted as the probability that $A$ happened, given that $B$
happened. Since we assume $B$ happened, we ought to restrict to the subset of the
probability space where $B$ in fact happened. To make this a probability space,
we scale the probability measure by $\mathbb{P}(B)$. Then given any event $A$, we take the
probability of $A \cap B$ under this probability measure, which is the formula given.

More generally, if $X$ is a random variable, the conditional expectation of $X$
given $B$ is just the expectation under this new probability measure,
\[
  \mathbb{E}[X \mid B] = \frac{\mathbb{E}[X \mathbf{1}_B]}{\mathbb{P}[B]}.
\]
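On a finite probability space this formula can be computed directly. The following sketch (a hypothetical die example, not from the notes) evaluates $\mathbb{E}[X \mid B]$ exactly using rational arithmetic:

```python
# Illustration of E[X | B] = E[X 1_B] / P(B) on a finite probability space:
# a fair six-sided die (a hypothetical example).
from fractions import Fraction

omega = range(1, 7)                       # sample space {1, ..., 6}
P = {w: Fraction(1, 6) for w in omega}    # uniform probability measure

B = {w for w in omega if w % 2 == 0}      # event "the roll is even"
X = lambda w: w                           # random variable: the face value

P_B = sum(P[w] for w in B)                # P(B) = 1/2
E_X_indB = sum(X(w) * P[w] for w in B)    # E[X 1_B] = (2 + 4 + 6)/6 = 2
E_X_given_B = E_X_indB / P_B              # = 4

print(E_X_given_B)  # 4
```

Using `Fraction` rather than floats keeps the computation exact, so the answer is literally $4$, not an approximation.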
We probably already know this from high school, and we are probably not quite
excited by this. One natural generalization would be to allow $B$ to vary.

Let $G_1, G_2, \ldots \in \mathcal{F}$ be disjoint events such that $\bigcup_n G_n = \Omega$. Let
\[
  \mathcal{G} = \sigma(G_1, G_2, \ldots) = \left\{ \bigcup_{n \in I} G_n : I \subseteq \mathbb{N} \right\}.
\]
Let $X \in L^1$. We then define
\[
  Y = \sum_{n=1}^\infty \mathbb{E}(X \mid G_n) \mathbf{1}_{G_n}.
\]
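To make this definition concrete, here is a hypothetical finite sketch: $\Omega = \{1, \ldots, 6\}$ with uniform measure, cut into three compartments, where $Y$ averages $X$ over the compartment containing $\omega$:

```python
# Conditional expectation Y = sum_n E(X | G_n) 1_{G_n} for a finite partition
# of Omega = {1,...,6} (a hypothetical example for illustration).
from fractions import Fraction

omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
X = lambda w: w * w                        # the random variable X(w) = w^2

partition = [{1, 2}, {3, 4}, {5, 6}]       # disjoint G_n covering Omega

def Y(w):
    # locate the cell G_n containing w, then average X over that cell
    cell = next(G for G in partition if w in G)
    return sum(X(v) * P[v] for v in cell) / sum(P[v] for v in cell)

# Y is constant on each compartment: on {1,2} it is (1 + 4)/2 = 5/2, etc.
values = [Y(w) for w in omega]

# the characterizing property E[Y 1_A] = E[X 1_A] for A in G, e.g. A = G_2
A = partition[1]
assert sum(Y(w) * P[w] for w in A) == sum(X(w) * P[w] for w in A)
```

Note that $Y$ only depends on which compartment $\omega$ falls into, exactly as the text describes.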
Let's think about what this is saying. Suppose a random outcome $\omega$ happens.
To compute $Y$, we figure out which of the $G_n$ our $\omega$ belongs to. Let's say
$\omega \in G_k$. Then $Y$ returns the expected value of $X$ given that we live in $G_k$. In
this process, we have forgotten the exact value of $\omega$. All that matters is which
$G_n$ the outcome belongs to. We can "visually" think of the $G_n$ as cutting up
the sample space into compartments.

We then average out $X$ in each of these compartments to obtain $Y$. This is what
we are going to call the conditional expectation of $X$ given $\mathcal{G}$, written $\mathbb{E}(X \mid \mathcal{G})$.
Ultimately, the characterizing property of $Y$ is the following lemma:

Lemma. The conditional expectation $Y = \mathbb{E}(X \mid \mathcal{G})$ satisfies the following
properties:
– $Y$ is $\mathcal{G}$-measurable.
– We have $Y \in L^1$, and $\mathbb{E} Y \mathbf{1}_A = \mathbb{E} X \mathbf{1}_A$ for all $A \in \mathcal{G}$.
Proof. It is clear that $Y$ is $\mathcal{G}$-measurable. To show it is $L^1$, we compute
\[
  \mathbb{E}[|Y|]
  = \mathbb{E}\left| \sum_{n=1}^\infty \mathbb{E}(X \mid G_n) \mathbf{1}_{G_n} \right|
  \leq \mathbb{E} \sum_{n=1}^\infty \mathbb{E}(|X| \mid G_n) \mathbf{1}_{G_n}
  = \sum_n \mathbb{E}\left( \mathbb{E}(|X| \mid G_n) \mathbf{1}_{G_n} \right)
  = \sum_n \mathbb{E} |X| \mathbf{1}_{G_n}
  = \mathbb{E} \sum_n |X| \mathbf{1}_{G_n}
  = \mathbb{E}|X| < \infty,
\]
where we used monotone convergence twice to swap the expectation and the
sum.

The final part is also clear, since we can explicitly enumerate the elements in
$\mathcal{G}$ and see that they all satisfy the last property.
It turns out that for any $\sigma$-subalgebra $\mathcal{G} \subseteq \mathcal{F}$, we can construct the conditional
expectation $\mathbb{E}(X \mid \mathcal{G})$, which is uniquely characterized by the above two properties.
Theorem (Existence and uniqueness of conditional expectation). Let $X \in L^1$,
and $\mathcal{G} \subseteq \mathcal{F}$. Then there exists a random variable $Y$ such that
– $Y$ is $\mathcal{G}$-measurable;
– $Y \in L^1$, and $\mathbb{E} X \mathbf{1}_A = \mathbb{E} Y \mathbf{1}_A$ for all $A \in \mathcal{G}$.
Moreover, if $Y'$ is another random variable satisfying these conditions, then
$Y' = Y$ almost surely.
We call $Y$ a (version of) the conditional expectation given $\mathcal{G}$.

We will write the conditional expectation as $\mathbb{E}(X \mid \mathcal{G})$, and if $X = \mathbf{1}_A$, we will
write $\mathbb{P}(A \mid \mathcal{G}) = \mathbb{E}(\mathbf{1}_A \mid \mathcal{G})$.

Recall also that if $Z$ is a random variable, then $\sigma(Z) = \{Z^{-1}(B) : B \in \mathcal{B}\}$.
In this case, we will write $\mathbb{E}(X \mid Z) = \mathbb{E}(X \mid \sigma(Z))$.

By, say, bounded convergence, it follows from the second condition that
$\mathbb{E} XZ = \mathbb{E} YZ$ for all bounded $\mathcal{G}$-measurable functions $Z$.
Proof. We first consider the case where $X \in L^2(\Omega, \mathcal{F}, \mu)$. Then we know from
functional analysis that for any $\sigma$-subalgebra $\mathcal{G} \subseteq \mathcal{F}$, the space $L^2(\mathcal{G})$ is a Hilbert space with
inner product
\[
  \langle X, Y \rangle = \mu(XY).
\]
In particular, $L^2(\mathcal{G})$ is a closed subspace of $L^2(\mathcal{F})$. We can then define $Y$
to be the orthogonal projection of $X$ onto $L^2(\mathcal{G})$. It is immediate that $Y$ is
$\mathcal{G}$-measurable. For the second part, we use that $X - Y$ is orthogonal to $L^2(\mathcal{G})$,
since that's what orthogonal projection is supposed to be. So
\[
  \mathbb{E}(X - Y)Z = 0
\]
for all $Z \in L^2(\mathcal{G})$. In particular, since the measure space is finite, the indicator
function of any measurable subset is $L^2$. So we are done.
We next focus on the case where $X \in m\mathcal{E}^+$. We define
\[
  X_n = X \wedge n.
\]
We want to use monotone convergence to obtain our result. To do so, we need
the following result:
Claim. If $(X, Y)$ and $(X', Y')$ satisfy the conditions of the theorem, and
$X' \geq X$ a.s., then $Y' \geq Y$ a.s.

Proof. Define the event $A = \{Y' \leq Y\} \in \mathcal{G}$. Consider the random variable
\[
  Z = (Y - Y') \mathbf{1}_A.
\]
Then $Z \geq 0$. We then have
\[
  \mathbb{E} Y' \mathbf{1}_A = \mathbb{E} X' \mathbf{1}_A \geq \mathbb{E} X \mathbf{1}_A = \mathbb{E} Y \mathbf{1}_A.
\]
So it follows that we also have $\mathbb{E}(Y - Y') \mathbf{1}_A \leq 0$. So in fact $\mathbb{E} Z = 0$. So
$Y' \geq Y$ a.s.
We can now define $Y_n = \mathbb{E}(X_n \mid \mathcal{G})$, picking them so that $\{Y_n\}$ is increasing.
We then take $Y_\infty = \lim Y_n$. Then $Y_\infty$ is certainly $\mathcal{G}$-measurable, and by monotone
convergence, if $A \in \mathcal{G}$, then
\[
  \mathbb{E} X \mathbf{1}_A = \lim \mathbb{E} X_n \mathbf{1}_A = \lim \mathbb{E} Y_n \mathbf{1}_A = \mathbb{E} Y_\infty \mathbf{1}_A.
\]
Now if $\mathbb{E} X < \infty$, then $\mathbb{E} Y_\infty = \mathbb{E} X < \infty$. So we know $Y_\infty$ is finite a.s., and we
can define $Y = Y_\infty \mathbf{1}_{Y_\infty < \infty}$.
Finally, we work with arbitrary $X \in L^1$. We can write $X = X^+ - X^-$, and
then define $Y^\pm = \mathbb{E}(X^\pm \mid \mathcal{G})$, and take $Y = Y^+ - Y^-$.

Uniqueness is then clear: if $Y$ and $Y'$ both satisfy the conditions, then applying
the claim in both directions gives $Y \leq Y'$ and $Y' \leq Y$ a.s.
Lemma. If $Y$ is $\sigma(Z)$-measurable, then there exists $h : \mathbb{R} \to \mathbb{R}$ Borel-measurable
such that $Y = h(Z)$. In particular,
\[
  \mathbb{E}(X \mid Z) = h(Z) \text{ a.s.}
\]
for some Borel-measurable $h : \mathbb{R} \to \mathbb{R}$.
We can then define $\mathbb{E}(X \mid Z = z) = h(z)$. The point of doing this is that we
want to allow for the case where in fact we have $\mathbb{P}(Z = z) = 0$, in which case
our original definition does not make sense.
Exercise. Consider $X \in L^1$, and $Z : \Omega \to \mathbb{N}$ discrete. Compute $\mathbb{E}(X \mid Z)$ and
compare our different definitions of conditional expectation.
Example. Let $(U, V) \in \mathbb{R}^2$ have density $f_{U,V}(u, v)$, so that for any $B_1, B_2 \in \mathcal{B}$,
we have
\[
  \mathbb{P}(U \in B_1, V \in B_2) = \int_{B_1} \int_{B_2} f_{U,V}(u, v) \,\mathrm{d}v\, \mathrm{d}u.
\]
We want to compute $\mathbb{E}(h(V) \mid U)$, where $h : \mathbb{R} \to \mathbb{R}$ is Borel-measurable. We
can define
\[
  f_U(u) = \int_{\mathbb{R}} f_{U,V}(u, v) \,\mathrm{d}v,
\]
and we define the conditional density of $V$ given $U$ by
\[
  f_{V \mid U}(v \mid u) = \frac{f_{U,V}(u, v)}{f_U(u)}.
\]
We define
\[
  g(u) = \int h(v) f_{V \mid U}(v \mid u) \,\mathrm{d}v.
\]
We claim that $\mathbb{E}(h(V) \mid U)$ is just $g(U)$.
To check this, we show that it satisfies the two desired conditions. It is clear
that it is $\sigma(U)$-measurable. To check the second condition, fix an $A \in \sigma(U)$.
Then $A = \{(u, v) : u \in B\}$ for some $B$. Then
\[
  \mathbb{E}(h(V) \mathbf{1}_A)
  = \iint h(v) \mathbf{1}_{u \in B} f_{U,V}(u, v) \,\mathrm{d}u\, \mathrm{d}v
  = \iint h(v) \mathbf{1}_{u \in B} f_{V \mid U}(v \mid u) f_U(u) \,\mathrm{d}u\, \mathrm{d}v
  = \int g(u) \mathbf{1}_{u \in B} f_U(u) \,\mathrm{d}u
  = \mathbb{E}(g(U) \mathbf{1}_A),
\]
as desired.
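The claim can also be sanity-checked numerically. The joint law below is an assumption chosen for convenience ($U \sim N(0,1)$ and $V \mid U = u \sim N(u, 1)$, with $h(v) = v$), for which $g(u) = u$; we then compare both sides of the characterizing property by Monte Carlo:

```python
# Monte-Carlo check of E(h(V) | U) = g(U) for an assumed joint law:
# U ~ N(0,1), V | U = u ~ N(u,1), and h(v) = v, so that
# g(u) = integral of v * f_{V|U}(v|u) dv = u  (the mean of N(u,1)).
import random

random.seed(0)
N = 200_000
pairs = []
for _ in range(N):
    u = random.gauss(0, 1)            # U ~ N(0,1)
    v = random.gauss(u, 1)            # V | U = u  ~  N(u,1)
    pairs.append((u, v))

# test the characterizing property with A = {U > 1/2}
lhs = sum(v for u, v in pairs if u > 0.5) / N    # approximates E[h(V) 1_A]
rhs = sum(u for u, v in pairs if u > 0.5) / N    # approximates E[g(U) 1_A]
assert abs(lhs - rhs) < 0.02                     # agree up to Monte-Carlo error
```

This is of course not a proof, only a check that the two sides agree up to sampling noise for one particular set $A$.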
The point of this example is that to compute conditional expectations, we
use our intuition to guess what the conditional expectation should be, and then
check that it satisfies the two uniquely characterizing properties.
Example. Suppose $(X, W)$ is Gaussian. Then for all linear functions $\varphi : \mathbb{R}^2 \to \mathbb{R}$,
the quantity $\varphi(X, W)$ is Gaussian.
One nice property of Gaussians is that lack of correlation implies independence.
We want to compute $\mathbb{E}(X \mid W)$. Note that if $Y$ is such that $\mathbb{E} X = \mathbb{E} Y$,
$X - Y$ is independent of $W$, and $Y$ is $\sigma(W)$-measurable, then $Y = \mathbb{E}(X \mid W)$, since
\[
  \mathbb{E}(X - Y) \mathbf{1}_A = 0
\]
for all $A \in \sigma(W)$.
The guess is that we want $Y$ to be a Gaussian variable. We put $Y = aW + b$.
Then $\mathbb{E} X = \mathbb{E} Y$ implies we must have
\[
  a \mathbb{E} W + b = \mathbb{E} X. \tag{$*$}
\]
The independence part requires $\mathrm{cov}(X - Y, W) = 0$. Since covariance is linear,
we have
\[
  0 = \mathrm{cov}(X - Y, W) = \mathrm{cov}(X, W) - \mathrm{cov}(aW + b, W) = \mathrm{cov}(X, W) - a\, \mathrm{cov}(W, W).
\]
Recalling that $\mathrm{cov}(W, W) = \mathrm{var}(W)$, we need
\[
  a = \frac{\mathrm{cov}(X, W)}{\mathrm{var}(W)}.
\]
This then allows us to use $(*)$ to compute $b$ as well. This is how we compute the
conditional expectation of Gaussians.
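As a hypothetical numerical check, we can sample a concrete jointly Gaussian pair (here $W \sim N(0,1)$ and $X = 2W + 3 + \text{noise}$, so the true answers are $a = 2$ and $b = 3$) and estimate $a$ and $b$ from the formulas above:

```python
# Numerical check of a = cov(X, W)/var(W) and b = E[X] - a E[W]
# on an assumed jointly Gaussian pair: W ~ N(0,1), X = 2W + 3 + N(0,1) noise,
# so the true answers are a = 2, b = 3.
import random

random.seed(1)
N = 100_000
W, X = [], []
for _ in range(N):
    w = random.gauss(0, 1)
    W.append(w)
    X.append(2 * w + 3 + random.gauss(0, 1))   # (X, W) jointly Gaussian

mW = sum(W) / N
mX = sum(X) / N
var_W = sum((w - mW) ** 2 for w in W) / N
cov_XW = sum((x - mX) * (w - mW) for x, w in zip(X, W)) / N

a = cov_XW / var_W        # estimate of cov(X, W)/var(W), close to 2
b = mX - a * mW           # solving a E[W] + b = E[X], close to 3
```

Then $Y = aW + b$ matches the mean of $X$, and $X - Y$ is uncorrelated with $W$, hence (being Gaussian) independent of it.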
We note some immediate properties of conditional expectation. As usual,
all (in)equality and convergence statements are to be taken with the quantifier
“almost surely”.
Proposition.
(i) $\mathbb{E}(X \mid \mathcal{G}) = X$ iff $X$ is $\mathcal{G}$-measurable.
(ii) $\mathbb{E}(\mathbb{E}(X \mid \mathcal{G})) = \mathbb{E} X$.
(iii) If $X \geq 0$ a.s., then $\mathbb{E}(X \mid \mathcal{G}) \geq 0$.
(iv) If $X$ and $\mathcal{G}$ are independent, then $\mathbb{E}(X \mid \mathcal{G}) = \mathbb{E}[X]$.
(v) If $\alpha, \beta \in \mathbb{R}$ and $X_1, X_2 \in L^1$, then
\[
  \mathbb{E}(\alpha X_1 + \beta X_2 \mid \mathcal{G}) = \alpha \mathbb{E}(X_1 \mid \mathcal{G}) + \beta \mathbb{E}(X_2 \mid \mathcal{G}).
\]
(vi) Suppose $X_n \nearrow X$. Then
\[
  \mathbb{E}(X_n \mid \mathcal{G}) \nearrow \mathbb{E}(X \mid \mathcal{G}).
\]
(vii) Fatou's lemma: If $X_n$ are non-negative measurable, then
\[
  \mathbb{E}\left( \liminf_{n \to \infty} X_n \;\middle|\; \mathcal{G} \right) \leq \liminf_{n \to \infty} \mathbb{E}(X_n \mid \mathcal{G}).
\]
(viii) Dominated convergence theorem: If $X_n \to X$ and there is $Y \in L^1$ such that
$Y \geq |X_n|$ for all $n$, then
\[
  \mathbb{E}(X_n \mid \mathcal{G}) \to \mathbb{E}(X \mid \mathcal{G}).
\]
(ix) Jensen's inequality: If $c : \mathbb{R} \to \mathbb{R}$ is convex, then
\[
  \mathbb{E}(c(X) \mid \mathcal{G}) \geq c(\mathbb{E}(X \mid \mathcal{G})).
\]
(x) Tower property: If $\mathcal{H} \subseteq \mathcal{G}$, then
\[
  \mathbb{E}(\mathbb{E}(X \mid \mathcal{G}) \mid \mathcal{H}) = \mathbb{E}(X \mid \mathcal{H}).
\]
(xi) For $p \geq 1$,
\[
  \|\mathbb{E}(X \mid \mathcal{G})\|_p \leq \|X\|_p.
\]
(xii) If $Z$ is bounded and $\mathcal{G}$-measurable, then
\[
  \mathbb{E}(ZX \mid \mathcal{G}) = Z \mathbb{E}(X \mid \mathcal{G}).
\]
(xiii) Let $X \in L^1$ and $\mathcal{G}, \mathcal{H} \subseteq \mathcal{F}$. Assume that $\sigma(X, \mathcal{G})$ is independent of $\mathcal{H}$.
Then
\[
  \mathbb{E}(X \mid \mathcal{G}) = \mathbb{E}(X \mid \sigma(\mathcal{G}, \mathcal{H})).
\]
Proof.
(i) Clear.
(ii) Take $A = \Omega$.
(iii) Shown in the proof.
(iv) Clear by property of expected value of independent variables.
(v)
Clear, since the RHS satisfies the unique characterizing property of the
LHS.
(vi) Clear from construction.
(vii) Same as the unconditional proof, using the previous property.
(viii) Same as the unconditional proof, using the previous property.
(ix) Same as the unconditional proof.
(x) The LHS satisfies the characterizing property of the RHS
(xi) Using the convexity of $|x|^p$, Jensen's inequality tells us
\[
  \|\mathbb{E}(X \mid \mathcal{G})\|_p^p = \mathbb{E}|\mathbb{E}(X \mid \mathcal{G})|^p \leq \mathbb{E}(\mathbb{E}(|X|^p \mid \mathcal{G})) = \mathbb{E}|X|^p = \|X\|_p^p.
\]
(xii) Suppose first that $Z = \mathbf{1}_B$ with $B \in \mathcal{G}$, and let $A \in \mathcal{G}$. Then
\[
  \mathbb{E}(Z \mathbb{E}(X \mid \mathcal{G}) \mathbf{1}_A) = \mathbb{E}(\mathbb{E}(X \mid \mathcal{G}) \cdot \mathbf{1}_{A \cap B}) = \mathbb{E}(X \mathbf{1}_{A \cap B}) = \mathbb{E}(Z X \mathbf{1}_A).
\]
So the result holds for indicators. Linearity then implies the result for $Z$ simple, then
apply our favorite convergence theorems.
(xiii) Take $B \in \mathcal{H}$ and $A \in \mathcal{G}$. Then
\[
  \mathbb{E}(\mathbb{E}(X \mid \sigma(\mathcal{G}, \mathcal{H})) \cdot \mathbf{1}_{A \cap B}) = \mathbb{E}(X \cdot \mathbf{1}_{A \cap B})
  = \mathbb{E}(X \mathbf{1}_A) \mathbb{P}(B)
  = \mathbb{E}(\mathbb{E}(X \mid \mathcal{G}) \mathbf{1}_A) \mathbb{P}(B)
  = \mathbb{E}(\mathbb{E}(X \mid \mathcal{G}) \mathbf{1}_{A \cap B}).
\]
If instead of $A \cap B$ we had any $\sigma(\mathcal{G}, \mathcal{H})$-measurable set, then we would
be done. But we are fine, since the set of subsets of the form $A \cap B$ with
$A \in \mathcal{G}$, $B \in \mathcal{H}$ is a generating $\pi$-system for $\sigma(\mathcal{G}, \mathcal{H})$.
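The tower property (x) can also be checked exactly on a small finite space, with $\mathcal{H} \subseteq \mathcal{G}$ generated by nested partitions; the space and partitions below are hypothetical illustrations:

```python
# Finite-space check of the tower property E(E(X | G) | H) = E(X | H),
# where H and G are generated by nested partitions of Omega = {0,...,7}.
from fractions import Fraction

omega = range(8)
P = {w: Fraction(1, 8) for w in omega}
X = lambda w: w

fine = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]       # generates G
coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]         # generates H, a sub-algebra of G

def cond_exp(f, partition):
    # conditional expectation w.r.t. the sigma-algebra generated by a partition
    def g(w):
        cell = next(c for c in partition if w in c)
        return sum(f(v) * P[v] for v in cell) / sum(P[v] for v in cell)
    return g

Y = cond_exp(X, fine)                         # E(X | G)
lhs = cond_exp(Y, coarse)                     # E(E(X | G) | H)
rhs = cond_exp(X, coarse)                     # E(X | H)
assert all(lhs(w) == rhs(w) for w in omega)   # tower property holds exactly
```

Averaging in two stages (first over the fine cells, then over the coarse ones) gives the same result as averaging over the coarse cells directly.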
We shall end with the following key lemma. We will later use it to show that
many of our martingales are uniformly integrable.
Lemma. If $X \in L^1$, then the family of random variables $Y_{\mathcal{G}} = \mathbb{E}(X \mid \mathcal{G})$ for all
$\sigma$-subalgebras $\mathcal{G} \subseteq \mathcal{F}$ is uniformly integrable.

In other words, for all $\varepsilon > 0$, there exists $\lambda > 0$ such that
\[
  \mathbb{E}(|Y_{\mathcal{G}}| \mathbf{1}_{|Y_{\mathcal{G}}| \geq \lambda}) < \varepsilon
\]
for all $\mathcal{G}$.
Proof. Fix $\varepsilon > 0$. Then there exists $\delta > 0$ such that $\mathbb{E}|X| \mathbf{1}_A < \varepsilon$ for any $A$
with $\mathbb{P}(A) < \delta$.

Take $Y = \mathbb{E}(X \mid \mathcal{G})$. Then by Jensen, we know
\[
  |Y| \leq \mathbb{E}(|X| \mid \mathcal{G}).
\]
In particular, we have $\mathbb{E}|Y| \leq \mathbb{E}|X|$. By Markov's inequality, we have
\[
  \mathbb{P}(|Y| \geq \lambda) \leq \frac{\mathbb{E}|Y|}{\lambda} \leq \frac{\mathbb{E}|X|}{\lambda}.
\]
So take $\lambda$ such that $\frac{\mathbb{E}|X|}{\lambda} < \delta$. Then we have
\[
  \mathbb{E}(|Y| \mathbf{1}_{|Y| \geq \lambda}) \leq \mathbb{E}(\mathbb{E}(|X| \mid \mathcal{G}) \mathbf{1}_{|Y| \geq \lambda}) = \mathbb{E}(|X| \mathbf{1}_{|Y| \geq \lambda}) < \varepsilon,
\]
using that $\mathbf{1}_{|Y| \geq \lambda}$ is a $\mathcal{G}$-measurable function.
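Here is a hypothetical finite sketch of the final inequality: for $Y = \mathbb{E}(X \mid \mathcal{G})$, the bound $\mathbb{E}(|Y| \mathbf{1}_{|Y| \geq \lambda}) \leq \mathbb{E}(|X| \mathbf{1}_{|Y| \geq \lambda})$ holds for every partition-generated $\mathcal{G}$ at once, which is what uniform integrability exploits:

```python
# Finite check that E(|Y| 1_{|Y| >= lam}) <= E(|X| 1_{|Y| >= lam}) for
# Y = E(X | G), across several partition-generated sigma-algebras G.
from fractions import Fraction

omega = range(6)
P = {w: Fraction(1, 6) for w in omega}
X = lambda w: (-2) ** w                    # integrable, signed, |X| up to 32

partitions = [
    [set(omega)],                          # trivial G: Y is the constant E[X]
    [{0, 1, 2}, {3, 4, 5}],
    [{0, 1}, {2, 3}, {4, 5}],
    [{w} for w in omega],                  # full information: Y = X
]

def cond_exp(partition):
    def Y(w):
        cell = next(c for c in partition if w in c)
        return sum(X(v) * P[v] for v in cell) / sum(P[v] for v in cell)
    return Y

lam = 5
tails = []
for part in partitions:
    Y = cond_exp(part)
    tail_Y = sum(abs(Y(w)) * P[w] for w in omega if abs(Y(w)) >= lam)
    tail_X = sum(abs(X(w)) * P[w] for w in omega if abs(Y(w)) >= lam)
    assert tail_Y <= tail_X                # the inequality from the proof
    tails.append(tail_Y)
```

Conditioning can only shrink the tail mass: the coarser the partition, the smaller $\mathbb{E}(|Y| \mathbf{1}_{|Y| \geq \lambda})$, and a single $\lambda$ controls all of them through $\mathbb{E}|X|$.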