3.2 Basic concentration inequalities
We now prove the lemma we left out of the theorem just now. In this
section, we are going to prove a number of useful inequalities that we will later
use to establish such bounds.
Consider the event as defined before. Then
\[
  \mathbb{P}\left(\frac{1}{n}\|X^T \varepsilon\|_\infty > \lambda\right) = \mathbb{P}\left(\bigcup_{j=1}^p \left\{\frac{1}{n}|X_j^T \varepsilon| > \lambda\right\}\right) \le \sum_{j=1}^p \mathbb{P}\left(\frac{1}{n}|X_j^T \varepsilon| > \lambda\right).
\]
Now $\frac{1}{n}X_j^T \varepsilon \sim N(0, \sigma^2/n)$. So we just have to bound tail probabilities of normal
random variables.
The simplest tail bound we have is Markov’s inequality.
Lemma (Markov's inequality). Let $W$ be a non-negative random variable. Then
\[
  \mathbb{P}(W \ge t) \le \frac{1}{t}\mathbb{E}W.
\]
Proof. We have
\[
  t\,\mathbf{1}_{W \ge t} \le W.
\]
The result then follows from taking expectations and then dividing both sides
by $t$.
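As a numerical aside (my addition, not in the original notes), here is a minimal Python sketch comparing Markov's bound with the empirical tail of an exponential random variable, which is non-negative with mean $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.exponential(scale=1.0, size=10**6)  # non-negative, EW = 1

for t in [1.0, 2.0, 5.0]:
    empirical = (W >= t).mean()  # Monte Carlo estimate of P(W >= t)
    markov = W.mean() / t        # Markov's bound EW / t
    print(f"t={t}: P(W >= t) ~ {empirical:.4f} <= {markov:.4f}")
```

The bound is loose here, which is expected: it uses only the mean of $W$.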
While this is a very simple bound, it is actually quite powerful, because it
assumes almost nothing about $W$. This allows for the following trick: given any
strictly increasing function $\varphi: \mathbb{R} \to (0, \infty)$ and any random variable $W$, we have
\[
  \mathbb{P}(W \ge t) = \mathbb{P}(\varphi(W) \ge \varphi(t)) \le \frac{\mathbb{E}\varphi(W)}{\varphi(t)}.
\]
So we get a bound on the tail probability of $W$ for any such function. Even
better, we can minimize the right-hand side over a class of functions to get an
even better bound.
In particular, applying this with $\varphi(t) = e^{\alpha t}$ gives

Corollary (Chernoff bound). For any random variable $W$, we have
\[
  \mathbb{P}(W \ge t) \le \inf_{\alpha > 0} e^{-\alpha t}\,\mathbb{E}e^{\alpha W}.
\]
Note that $\mathbb{E}e^{\alpha W}$ is just the moment generating function of $W$, which we can
often compute quite straightforwardly.
We can immediately apply this when $W$ is a normal random variable,
$W \sim N(0, \sigma^2)$. Then
\[
  \mathbb{E}e^{\alpha W} = e^{\alpha^2\sigma^2/2}.
\]
So we have
\[
  \mathbb{P}(W \ge t) \le \inf_{\alpha > 0} \exp\left(\frac{\alpha^2\sigma^2}{2} - \alpha t\right) = e^{-t^2/(2\sigma^2)},
\]
the infimum being attained at $\alpha = t/\sigma^2$.
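To see how tight this is, here is a small check (my addition) comparing the exact Gaussian tail, computed with scipy.stats.norm.sf, against the optimized Chernoff bound $e^{-t^2/(2\sigma^2)}$:

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0
for t in [0.5, 1.0, 2.0, 3.0]:
    exact = norm.sf(t, scale=sigma)            # P(W >= t) for W ~ N(0, sigma^2)
    chernoff = np.exp(-t**2 / (2 * sigma**2))  # the optimized Chernoff bound
    print(f"t={t}: exact={exact:.5f} <= bound={chernoff:.5f}")
```

The bound captures the correct $e^{-t^2/(2\sigma^2)}$ decay, losing only a polynomial factor in $t$.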
Observe that in fact this tail bound works for any random variable whose moment
generating function is bounded above by $e^{\alpha^2\sigma^2/2}$.
Definition (Sub-Gaussian random variable). A random variable $W$ is sub-Gaussian (with parameter $\sigma$) if
\[
  \mathbb{E}e^{\alpha(W - \mathbb{E}W)} \le e^{\alpha^2\sigma^2/2}
\]
for all $\alpha \in \mathbb{R}$.
Corollary. Any sub-Gaussian random variable $W$ with parameter $\sigma$ satisfies
\[
  \mathbb{P}(W - \mathbb{E}W \ge t) \le e^{-t^2/(2\sigma^2)}.
\]
In general, bounded random variables are sub-Gaussian.
Lemma (Hoeffding's lemma). If $W$ has mean zero and takes values in $[a, b]$,
then $W$ is sub-Gaussian with parameter $\frac{b - a}{2}$.
Recall that the sum of two independent Gaussians is still a Gaussian. This
continues to hold for sub-Gaussian random variables.
Proposition. Let $(W_i)_{i=1}^n$ be independent mean-zero sub-Gaussian random
variables with parameters $(\sigma_i)_{i=1}^n$, and let $\gamma \in \mathbb{R}^n$. Then $\gamma^T W$ is sub-Gaussian
with parameter
\[
  \left(\sum_{i=1}^n (\gamma_i\sigma_i)^2\right)^{1/2}.
\]
Proof. We have
\[
  \mathbb{E}\exp\left(\alpha\sum_{i=1}^n \gamma_i W_i\right) = \prod_{i=1}^n \mathbb{E}\exp(\alpha\gamma_i W_i) \le \prod_{i=1}^n \exp\left(\frac{\alpha^2}{2}\gamma_i^2\sigma_i^2\right) = \exp\left(\frac{\alpha^2}{2}\sum_{i=1}^n \sigma_i^2\gamma_i^2\right).
\]
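A quick simulation of this proposition (my addition, with an arbitrary weight vector): for Rademacher $W_i$, which are sub-Gaussian with parameter $1$ by Hoeffding's lemma, the tail of $\gamma^T W$ should obey the corollary's bound with parameter $\|\gamma\|_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000
gamma = rng.normal(size=n)                    # a fixed weight vector
sigma = np.linalg.norm(gamma)                 # (sum (gamma_i * 1)^2)^(1/2)

W = rng.choice([-1.0, 1.0], size=(reps, n))   # Rademacher, parameter 1 each
S = W @ gamma                                 # gamma^T W, one sample per row

for t in [1.0, 2.0]:
    empirical = (S >= t * sigma).mean()
    bound = np.exp(-(t * sigma)**2 / (2 * sigma**2))  # e^{-u^2/(2 sigma^2)} at u = t*sigma
    print(f"P(S >= {t} sigma) ~ {empirical:.4f} <= {bound:.4f}")
```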
We can now prove our bound for the Lasso, which in fact works for any
sub-Gaussian random variable.
Lemma. Suppose $(\varepsilon_i)_{i=1}^n$ are independent, mean-zero sub-Gaussian with common parameter $\sigma$. Let
\[
  \lambda = A\sigma\sqrt{\frac{\log p}{n}}.
\]
Let $X$ be a matrix whose columns all have norm $\sqrt{n}$. Then
\[
  \mathbb{P}\left(\frac{1}{n}\|X^T\varepsilon\|_\infty \le \lambda\right) \ge 1 - 2p^{-(A^2/2 - 1)}.
\]
In particular, this includes $\varepsilon \sim N_n(0, \sigma^2 I)$.
Proof. We have
\[
  \mathbb{P}\left(\frac{1}{n}\|X^T\varepsilon\|_\infty > \lambda\right) \le \sum_{j=1}^p \mathbb{P}\left(\frac{1}{n}|X_j^T\varepsilon| > \lambda\right).
\]
But $\pm\frac{1}{n}X_j^T\varepsilon$ are both sub-Gaussian with parameter
\[
  \sigma\left(\sum_i \left(\frac{X_{ij}}{n}\right)^2\right)^{1/2} = \frac{\sigma}{\sqrt{n}}.
\]
Then by our previous corollary, we get
\[
  \sum_{j=1}^p \mathbb{P}\left(\frac{1}{n}|X_j^T\varepsilon| > \lambda\right) \le 2p\exp\left(-\frac{\lambda^2 n}{2\sigma^2}\right).
\]
Note that we have the factor of $2$ since we need to consider the two cases
$\frac{1}{n}X_j^T\varepsilon > \lambda$ and $-\frac{1}{n}X_j^T\varepsilon > \lambda$.
Plugging in our expression of $\lambda$, we write the bound as
\[
  2p\exp\left(-\frac{1}{2}A^2\log p\right) = 2p^{1 - A^2/2}.
\]
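The following Monte Carlo sketch (my addition, with arbitrary choices of $n$, $p$ and $A$) checks the lemma's event probability for Gaussian errors and a design whose columns are rescaled to have norm $\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, A, sigma, reps = 100, 50, 2.0, 1.0, 2000

X = rng.normal(size=(n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # rescale columns to norm sqrt(n)
lam = A * sigma * np.sqrt(np.log(p) / n)

hits = 0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    hits += np.max(np.abs(X.T @ eps)) / n <= lam   # did the event occur?
print(f"P(event) ~ {hits/reps:.3f} >= {1 - 2 * p**(-(A**2/2 - 1)):.3f}")
```

With $A = 2$ the guaranteed lower bound is $1 - 2/p$; the empirical frequency should comfortably exceed it, since the union bound is conservative.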
This is all we need for our result on the Lasso. We are going to go a bit
further into this topic of concentration inequalities, because we will need them
later when we impose conditions on the design matrix. In particular, we would
like to bound the tail probabilities of products.
Definition (Bernstein's condition). We say that a random variable $W$ satisfies
Bernstein's condition with parameters $(\sigma, b)$, where $\sigma, b > 0$, if
\[
  \mathbb{E}\left[|W - \mathbb{E}W|^k\right] \le \frac{1}{2}k!\,\sigma^2 b^{k-2}
\]
for $k = 2, 3, \ldots$.
The point is that these bounds on the moments let us bound the moment
generating function of $W$.
Proposition (Bernstein's inequality). Let $W_1, W_2, \ldots, W_n$ be independent random variables with $\mathbb{E}W_i = \mu$, and suppose each $W_i$ satisfies Bernstein's condition
with parameters $(\sigma, b)$. Then
\[
  \mathbb{E}e^{\alpha(W_i - \mu)} \le \exp\left(\frac{\alpha^2\sigma^2/2}{1 - b|\alpha|}\right) \quad \text{for all } |\alpha| < \frac{1}{b},
\]
\[
  \mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n W_i - \mu \ge t\right) \le \exp\left(-\frac{nt^2}{2(\sigma^2 + bt)}\right) \quad \text{for all } t \ge 0.
\]
Note that for large $t$, the bound goes as $e^{-t}$ instead of $e^{-t^2}$.
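To illustrate this $e^{-t}$ versus $e^{-t^2}$ behaviour, a short sketch (my addition, with arbitrary parameter values) evaluates the Bernstein bound next to what a sub-Gaussian bound with the same $\sigma$ would give:

```python
import numpy as np

n, sigma, b = 100, 1.0, 1.0
for t in [0.1, 1.0, 5.0, 10.0]:
    bernstein = np.exp(-n * t**2 / (2 * (sigma**2 + b * t)))  # exponent ~ -nt/(2b) for large t
    gaussian = np.exp(-n * t**2 / (2 * sigma**2))             # exponent ~ -nt^2 always
    print(f"t={t}: Bernstein={bernstein:.3e}, sub-Gaussian={gaussian:.3e}")
```

For small $t$ the two bounds nearly agree; for large $t$ the Bernstein bound decays only exponentially in $t$.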
Proof. For the first part, we fix $i$ and write $W = W_i$. Let $|\alpha| < \frac{1}{b}$. Then
\[
  \mathbb{E}e^{\alpha(W_i - \mu)} = \sum_{k=0}^\infty \mathbb{E}\left[\frac{1}{k!}\alpha^k(W_i - \mu)^k\right] \le 1 + \frac{\sigma^2\alpha^2}{2}\sum_{k=2}^\infty |\alpha|^{k-2}b^{k-2} = 1 + \frac{\sigma^2\alpha^2}{2}\cdot\frac{1}{1 - |\alpha|b} \le \exp\left(\frac{\alpha^2\sigma^2/2}{1 - b|\alpha|}\right),
\]
where the $k = 1$ term vanishes since $W_i$ has mean $\mu$, and the last step uses $1 + x \le e^x$.
Then
\[
  \mathbb{E}\exp\left(\frac{\alpha}{n}\sum_{i=1}^n (W_i - \mu)\right) = \prod_{i=1}^n \mathbb{E}\exp\left(\frac{\alpha}{n}(W_i - \mu)\right) \le \exp\left(\frac{n(\alpha/n)^2\sigma^2/2}{1 - b(\alpha/n)}\right),
\]
assuming $\frac{\alpha}{n} < \frac{1}{b}$. So it follows that
\[
  \mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n W_i - \mu \ge t\right) \le e^{-\alpha t}\exp\left(\frac{n(\alpha/n)^2\sigma^2/2}{1 - b(\alpha/n)}\right).
\]
Setting
\[
  \frac{\alpha}{n} = \frac{t}{bt + \sigma^2} \in \left[0, \frac{1}{b}\right)
\]
gives the result.
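The final substitution hides a little algebra. A symbolic check (my addition, using sympy) confirms that $\alpha/n = t/(bt + \sigma^2)$ turns the exponent into $-nt^2/(2(\sigma^2 + bt))$:

```python
import sympy as sp

t, b, sigma, n = sp.symbols("t b sigma n", positive=True)
beta = t / (b * t + sigma**2)  # the chosen value of alpha/n

# exponent of e^{-alpha t} * exp(n beta^2 sigma^2/2 / (1 - b beta))
exponent = -n * beta * t + n * beta**2 * sigma**2 / 2 / (1 - b * beta)

# difference from the claimed exponent; prints 0
print(sp.simplify(exponent + n * t**2 / (2 * (sigma**2 + b * t))))
```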
Lemma. Let $W, Z$ be mean-zero sub-Gaussian random variables with parameters
$\sigma_W$ and $\sigma_Z$ respectively. Then $WZ$ satisfies Bernstein's condition with parameters
$(8\sigma_W\sigma_Z, 4\sigma_W\sigma_Z)$.
Proof. For any random variable $Y$ (which we will later take to be $WZ$), for
$k \ge 2$, we know
\[
  \mathbb{E}|Y - \mathbb{E}Y|^k = 2^k\,\mathbb{E}\left|\frac{1}{2}Y - \frac{1}{2}\mathbb{E}Y\right|^k \le 2^k\,\mathbb{E}\left(\frac{1}{2}|Y| + \frac{1}{2}|\mathbb{E}Y|\right)^k.
\]
Note that
\[
  \left(\frac{1}{2}|Y| + \frac{1}{2}|\mathbb{E}Y|\right)^k \le \frac{|Y|^k + |\mathbb{E}Y|^k}{2}
\]
by Jensen's inequality (applied to the convex function $x \mapsto x^k$). Applying Jensen's inequality again, we have
\[
  |\mathbb{E}Y|^k \le \mathbb{E}|Y|^k.
\]
Putting the whole thing together, we have
\[
  \mathbb{E}|Y - \mathbb{E}Y|^k \le 2^k\,\mathbb{E}|Y|^k.
\]
Now take $Y = WZ$. Then
\[
  \mathbb{E}|WZ - \mathbb{E}(WZ)|^k \le 2^k\,\mathbb{E}|WZ|^k \le 2^k\left(\mathbb{E}W^{2k}\right)^{1/2}\left(\mathbb{E}Z^{2k}\right)^{1/2},
\]
by the Cauchy–Schwarz inequality.
We know that sub-Gaussians satisfy a bound on the tail probability. We can
then use this to bound the moments of $W$ and $Z$. First note that
\[
  W^{2k} = \int_0^\infty \mathbf{1}_{x < W^{2k}}\,\mathrm{d}x.
\]
Taking expectations of both sides, we get
\[
  \mathbb{E}W^{2k} = \int_0^\infty \mathbb{P}(x < W^{2k})\,\mathrm{d}x.
\]
Since we have a tail bound on $W$ instead of $W^{2k}$, we substitute $x = t^{2k}$. Then
$\mathrm{d}x = 2kt^{2k-1}\,\mathrm{d}t$. So we get
\[
  \mathbb{E}W^{2k} = 2k\int_0^\infty t^{2k-1}\,\mathbb{P}(|W| > t)\,\mathrm{d}t \le 4k\int_0^\infty t^{2k-1}\exp\left(-\frac{t^2}{2\sigma_W^2}\right)\mathrm{d}t,
\]
where again we have a factor of $2$ to account for both signs. We perform yet
another substitution
\[
  x = \frac{t^2}{2\sigma_W^2}, \quad \mathrm{d}x = \frac{t}{\sigma_W^2}\,\mathrm{d}t.
\]
Then we get
\[
  \mathbb{E}W^{2k} \le 2^{k+1}\sigma_W^{2k}\,k\int_0^\infty x^{k-1}e^{-x}\,\mathrm{d}x = 2^{k+1}k!\,\sigma_W^{2k},
\]
since $\int_0^\infty x^{k-1}e^{-x}\,\mathrm{d}x = (k-1)!$.
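For $W \sim N(0, \sigma_W^2)$, which is sub-Gaussian with parameter $\sigma_W$, the even moments are known exactly: $\mathbb{E}W^{2k} = \sigma_W^{2k}(2k)!/(2^k k!)$. So we can sanity-check the bound $2^{k+1}k!\,\sigma_W^{2k}$ directly (my addition):

```python
from math import factorial

sigma_W = 1.3
for k in range(2, 7):
    # exact even Gaussian moment E W^{2k} = sigma^{2k} (2k)! / (2^k k!)
    exact = sigma_W**(2*k) * factorial(2*k) / (2**k * factorial(k))
    bound = 2**(k+1) * factorial(k) * sigma_W**(2*k)
    print(f"k={k}: exact={exact:.3e} <= bound={bound:.3e}")
```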
Plugging this back in, we have
\[
  \mathbb{E}|WZ - \mathbb{E}(WZ)|^k \le 2^k\cdot 2^{k+1}k!\,\sigma_W^k\sigma_Z^k = \frac{1}{2}k!\,2^{2k+2}\sigma_W^k\sigma_Z^k = \frac{1}{2}k!\,(8\sigma_W\sigma_Z)^2(4\sigma_W\sigma_Z)^{k-2}.
\]