4 Inequalities and Lᵖ spaces    II Probability and Measure

4.3 Orthogonal projection in L²

In the particular case p = 2, we have an extra structure on L², namely an inner product structure, given by

    ⟨f, g⟩ = ∫ fg dµ.

This inner product induces the L² norm by

    ‖f‖₂² = ⟨f, f⟩.

Recall the following definition:

Recall the following definition:

Definition (Hilbert space). A Hilbert space is a vector space with a complete inner product.

So L² is not only a Banach space, but a Hilbert space as well. Hilbert spaces are somehow much nicer than Banach spaces, because we have an inner product structure to work with as well. One particular thing we can do is take orthogonal complements.

Definition (Orthogonal functions). Two functions f, g ∈ L² are orthogonal if

    ⟨f, g⟩ = 0.

Definition (Orthogonal complement). Let V ⊆ L². We then set

    V⊥ = {f ∈ L² : ⟨f, v⟩ = 0 for all v ∈ V}.

Note that we can always make these definitions for any inner product space. However, the completeness of the space guarantees nice properties of the orthogonal complement.

Before we proceed further, we need to define what it means for a subspace of L² to be closed. This isn't the usual definition, since L² isn't really a normed vector space (its elements are only defined up to almost-everywhere equality), so we need to accommodate for that fact.

Definition (Closed subspace). Let V ⊆ L². Then V is closed if whenever (fₙ) is a sequence in V with fₙ → f, there exists v ∈ V with v ∼ f.

The main thing that makes L² nice is that we can use closed subspaces to decompose functions orthogonally.

Theorem. Let V be a closed subspace of L². Then each f ∈ L² has an orthogonal decomposition

    f = u + v,

where v ∈ V and u ∈ V⊥. Moreover,

    ‖f − v‖₂ ≤ ‖f − g‖₂

for all g ∈ V, with equality iff g ∼ v.

To prove this result, we need two simple identities, which can be easily proven by writing out the expressions.

Lemma (Pythagoras identity).

    ‖f + g‖₂² = ‖f‖₂² + ‖g‖₂² + 2⟨f, g⟩.

Lemma (Parallelogram law).

    ‖f + g‖₂² + ‖f − g‖₂² = 2(‖f‖₂² + ‖g‖₂²).
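Both identities can be sanity-checked numerically. The following sketch uses ℝⁿ with the standard inner product as a finite-dimensional stand-in for L²; the vectors and the dimension are arbitrary choices for illustration.

```python
# Numerical check of the two lemmas in R^n with the standard inner product,
# a finite-dimensional stand-in for L^2.
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)
g = rng.normal(size=5)

sq = lambda x: np.dot(x, x)   # ||x||_2^2
inner = np.dot(f, g)          # <f, g>

# Pythagoras identity: ||f+g||^2 = ||f||^2 + ||g||^2 + 2<f, g>
assert np.isclose(sq(f + g), sq(f) + sq(g) + 2 * inner)

# Parallelogram law: ||f+g||^2 + ||f-g||^2 = 2(||f||^2 + ||g||^2)
assert np.isclose(sq(f + g) + sq(f - g), 2 * (sq(f) + sq(g)))
print("both identities hold")
```

Of course, this is no substitute for the one-line algebraic proof: expand each norm as an inner product and collect terms.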

To prove the existence of the orthogonal decomposition, we need to use a slight trick involving the parallelogram law.

Proof of orthogonal decomposition. Given f ∈ L², we take a sequence (gₙ) in V such that

    ‖f − gₙ‖₂ → d(f, V) = inf{‖f − g‖₂ : g ∈ V}.

We now want to show that the infimum is attained. To do so, we show that (gₙ) is a Cauchy sequence, and by the completeness of L², it will have a limit. If we apply the parallelogram law with u = f − gₙ and v = f − gₘ, then we know

    ‖u + v‖₂² + ‖u − v‖₂² = 2(‖u‖₂² + ‖v‖₂²).

Using our particular choice of u and v, we obtain

    4‖f − (gₙ + gₘ)/2‖₂² + ‖gₙ − gₘ‖₂² = 2(‖f − gₙ‖₂² + ‖f − gₘ‖₂²).

So we have

    ‖gₙ − gₘ‖₂² = 2(‖f − gₙ‖₂² + ‖f − gₘ‖₂²) − 4‖f − (gₙ + gₘ)/2‖₂².

The first two terms on the right-hand side together tend to 4d(f, V)², while the last term is bounded below by 4d(f, V)², since (gₙ + gₘ)/2 ∈ V. So as n, m → ∞, we must have ‖gₙ − gₘ‖₂ → 0. By the completeness of L², there exists a g ∈ L² such that gₙ → g.

Now since V is assumed to be closed, we can find a v ∈ V such that g = v a.e. Then we know

    ‖f − v‖₂ = limₙ ‖f − gₙ‖₂ = d(f, V).

So v attains the infimum. To show that this gives us an orthogonal decomposition, we want to show that

    u = f − v ∈ V⊥.

Suppose h ∈ V. We need to show that ⟨u, h⟩ = 0. We need to do another funny trick. Suppose t ∈ ℝ. Then we have

    d(f, V)² ≤ ‖f − (v + th)‖₂² = ‖f − v‖₂² + t²‖h‖₂² − 2t⟨f − v, h⟩.

We think of this as a quadratic in t, which is minimized when

    t = ⟨f − v, h⟩/‖h‖₂².

But we know this quadratic is minimized when t = 0. So ⟨f − v, h⟩ = 0.
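The theorem can be illustrated in finite dimensions, where every subspace is closed and the projection can be computed by least squares. In the sketch below, the subspace V and the dimensions are arbitrary illustrative choices.

```python
# Sketch of the orthogonal decomposition f = u + v in a finite-dimensional
# stand-in for L^2: V is the column span of a matrix A.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 2))   # V = span of the columns of A
f = rng.normal(size=6)

# v = orthogonal projection of f onto V, computed via least squares
coeffs, *_ = np.linalg.lstsq(A, f, rcond=None)
v = A @ coeffs
u = f - v                     # the candidate element of V-perp

# u is orthogonal to V (checking the spanning set suffices)
assert np.allclose(A.T @ u, 0)

# v minimizes the distance to f among elements of V
for _ in range(100):
    g = A @ rng.normal(size=2)
    assert np.linalg.norm(f - v) <= np.linalg.norm(f - g) + 1e-12
print("decomposition f = u + v verified")
```

The least-squares solution is exactly the g ∈ V minimizing ‖f − g‖₂, which is why its residual u is orthogonal to every column of A.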

We are now going to look at the relationship between conditional expectation and orthogonal projection.

Definition (Conditional expectation). Suppose we have a probability space (Ω, F, P), a random variable X, and a collection (Gₙ) of pairwise disjoint events with ⋃ₙ Gₙ = Ω. We let

    G = σ(Gₙ : n ∈ ℕ).

The conditional expectation of X given G is the random variable

    Y = ∑ₙ E[X | Gₙ] 1_{Gₙ},

where

    E[X | Gₙ] = E[X 1_{Gₙ}]/P[Gₙ]

for P[Gₙ] > 0. In other words, given any x ∈ Ω, say x ∈ Gₙ, then Y(x) = E[X | Gₙ].
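The definition can be carried out directly on a finite sample space with uniform measure, in which case E[X 1_{Gₙ}]/P[Gₙ] is just the average of X over Gₙ. The partition size and sample space below are arbitrary illustrative choices.

```python
# Direct implementation of the definition for a finite partition,
# with Omega a finite sample space carrying the uniform measure P.
import numpy as np

rng = np.random.default_rng(2)
N, K = 10_000, 4
X = rng.normal(size=N)                # X(omega) for each sample point
labels = rng.integers(0, K, size=N)   # which G_n each omega lies in

# Y = sum_n E[X | G_n] 1_{G_n}, with E[X | G_n] = E[X 1_{G_n}] / P[G_n]
Y = np.empty(N)
for n in range(K):
    mask = labels == n
    Y[mask] = X[mask].mean()          # average of X over G_n

# Y is G-measurable: it is constant on each G_n
assert all(np.unique(Y[labels == n]).size == 1 for n in range(K))
# and E[Y] = E[X], by summing the definition over n
assert np.isclose(Y.mean(), X.mean())
```

Note the last check: taking expectations of Y term by term recovers ∑ₙ E[X 1_{Gₙ}] = E[X], a special case of the tower property.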

If X ∈ L²(P), then Y ∈ L²(P), and it is clear that Y is G-measurable. We claim that this is in fact the projection of X onto the subspace L²(G, P) of G-measurable L² random variables in the ambient space L²(P).

Proposition. The conditional expectation of X given G is the projection of X onto the subspace L²(G, P) of G-measurable L² random variables in the ambient space L²(P).

In some sense, this tells us Y is our best prediction of X given only the information encoded in G.

Proof. Let Y be the conditional expectation. It suffices to show that E[(X − W)²] is minimized for W = Y among G-measurable random variables. Suppose W is a G-measurable random variable. Since G = σ(Gₙ : n ∈ ℕ), it follows that

    W = ∑ₙ aₙ 1_{Gₙ},

where aₙ ∈ ℝ. Then, since the Gₙ are disjoint, the cross terms in the square vanish, and

    E[(X − W)²] = E[(∑ₙ (X − aₙ) 1_{Gₙ})²]
                = E[∑ₙ (X² + aₙ² − 2aₙX) 1_{Gₙ}]
                = E[∑ₙ (X² + aₙ² − 2aₙ E[X | Gₙ]) 1_{Gₙ}].

We now optimize the quadratic

    X² + aₙ² − 2aₙ E[X | Gₙ]

over aₙ. We see that this is minimized for

    aₙ = E[X | Gₙ].

Note that this does not depend on the X² term, since it sits in the constant term of the quadratic. Therefore E[(X − W)²] is minimized for W = Y.
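The minimization in the proof can be checked numerically: among all random variables of the form W = ∑ₙ aₙ 1_{Gₙ}, the choice aₙ = E[X | Gₙ] gives the smallest mean squared error. The finite uniform sample space below is an illustrative choice.

```python
# Check that the conditional expectation minimizes E[(X - W)^2] among
# G-measurable W, on a finite sample space with uniform measure.
import numpy as np

rng = np.random.default_rng(3)
N, K = 10_000, 4
X = rng.normal(size=N)
labels = rng.integers(0, K, size=N)    # partition G_0, ..., G_{K-1}

best = np.array([X[labels == n].mean() for n in range(K)])  # E[X | G_n]
Y = best[labels]                       # the conditional expectation

mse = lambda W: np.mean((X - W) ** 2)  # E[(X - W)^2]
for _ in range(200):
    a = rng.normal(size=K)             # arbitrary G-measurable W = sum a_n 1_{G_n}
    assert mse(Y) <= mse(a[labels]) + 1e-12
print("W = Y attains the minimum")
```

The loop is exactly the statement of the proposition: every other choice of coefficients (aₙ) does at least as badly.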

We can also rephrase variance and covariance in terms of the L² spaces. Suppose X, Y ∈ L²(P) with

    m_X = E[X],  m_Y = E[Y].

Then variance and covariance just correspond to the L² inner product and norm. In fact, we have

    var(X) = E[(X − m_X)²] = ‖X − m_X‖₂²,
    cov(X, Y) = E[(X − m_X)(Y − m_Y)] = ⟨X − m_X, Y − m_Y⟩.

More generally, the covariance matrix of a random vector X = (X₁, ··· , Xₙ) is given by

    var(X) = (cov(Xᵢ, Xⱼ))ᵢⱼ.

On the example sheet, we will see that the covariance matrix is a non-negative definite matrix.
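The key point is that aᵀ var(X) a = var(∑ᵢ aᵢXᵢ) ≥ 0 for every a ∈ ℝⁿ. A quick empirical check, with an arbitrary correlated random vector as illustration:

```python
# Check that a sample covariance matrix is symmetric and non-negative
# definite: a^T C a = var(a . X) >= 0 for every direction a.
import numpy as np

rng = np.random.default_rng(4)
# build a correlated random vector X in R^3 by mixing independent coordinates
samples = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))
C = np.cov(samples, rowvar=False)      # entries cov(X_i, X_j)

assert np.allclose(C, C.T)             # symmetric
assert np.all(np.linalg.eigvalsh(C) >= -1e-12)   # all eigenvalues >= 0
for _ in range(100):
    a = rng.normal(size=3)
    assert a @ C @ a >= -1e-12         # quadratic form is non-negative
```

Note that the quadratic form can vanish on a nonzero a (take X with a degenerate linear combination), which is why the matrix is only non-negative definite in general.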