3 The Lasso and beyond
3.2 Basic concentration inequalities
We now have to prove the lemma we left unproven in the previous theorem. In this
section, we prove a number of useful concentration inequalities that we will later
use to establish such bounds.
Consider the event $\Omega$ as defined before. Then
\[
  \mathbb{P}\left(\frac{1}{n}\|X^T\varepsilon\|_\infty > \lambda\right)
  = \mathbb{P}\left(\bigcup_{j=1}^p\left\{\frac{1}{n}|X_j^T\varepsilon| > \lambda\right\}\right)
  \leq \sum_{j=1}^p \mathbb{P}\left(\frac{1}{n}|X_j^T\varepsilon| > \lambda\right).
\]
Now $\frac{1}{n}X_j^T\varepsilon \sim N\left(0, \frac{\sigma^2}{n}\right)$. So we just have to bound tail probabilities of normal
random variables.
The simplest tail bound we have is Markov’s inequality.
Lemma (Markov's inequality). Let $W$ be a non-negative random variable. Then
\[
  \mathbb{P}(W \geq t) \leq \frac{1}{t}\mathbb{E}W.
\]
Proof. We have
\[
  t\,\mathbf{1}_{W \geq t} \leq W.
\]
The result then follows from taking expectations and dividing both sides by $t$.
While this is a very simple bound, it is actually quite powerful, because it
assumes almost nothing about $W$. This allows for the following trick: given any
strictly increasing function $\varphi : \mathbb{R} \to (0, \infty)$ and any random variable $W$, we have
\[
  \mathbb{P}(W \geq t) = \mathbb{P}(\varphi(W) \geq \varphi(t)) \leq \frac{\mathbb{E}\varphi(W)}{\varphi(t)}.
\]
So we get a bound on the tail probability of $W$ for any such function. Even
better, we can minimize the right-hand side over a class of functions to get an
even better bound.
In particular, applying this with $\varphi(t) = e^{\alpha t}$ gives

Corollary (Chernoff bound). For any random variable $W$, we have
\[
  \mathbb{P}(W \geq t) \leq \inf_{\alpha > 0} e^{-\alpha t}\,\mathbb{E}e^{\alpha W}.
\]
Note that $\mathbb{E}e^{\alpha W}$ is just the moment generating function of $W$, which we can often
compute quite straightforwardly.
We can immediately apply this when $W$ is a normal random variable, $W \sim N(0, \sigma^2)$. Then
\[
  \mathbb{E}e^{\alpha W} = e^{\alpha^2\sigma^2/2}.
\]
So we have
\[
  \mathbb{P}(W \geq t) \leq \inf_{\alpha > 0}\exp\left(\frac{\alpha^2\sigma^2}{2} - \alpha t\right) = e^{-t^2/(2\sigma^2)},
\]
the infimum being attained at $\alpha = t/\sigma^2$.
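As a quick numerical sanity check (not part of the notes), we can compare this bound against the exact Gaussian tail; a minimal sketch using NumPy/SciPy, with arbitrary choices of $\sigma$ and $t$:

```python
import numpy as np
from scipy.stats import norm

# Compare the derived bound P(W >= t) <= exp(-t^2 / (2 sigma^2)) with the
# exact tail of W ~ N(0, sigma^2).  sigma and the t values are arbitrary.
sigma = 2.0
ts = np.array([1.0, 2.0, 4.0, 8.0])

exact = norm.sf(ts, scale=sigma)               # exact P(W >= t)
chernoff = np.exp(-ts**2 / (2 * sigma**2))     # bound derived above

for t, p, b in zip(ts, exact, chernoff):
    print(f"t = {t:4.1f}: exact tail = {p:.3e}, Chernoff bound = {b:.3e}")
```

The bound always lies above the exact tail, and both decay at the rate $e^{-t^2/(2\sigma^2)}$.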
Observe that in fact this tail bound works for any random variable whose moment
generating function is bounded above by $e^{\alpha^2\sigma^2/2}$.
Definition (Sub-Gaussian random variable). A random variable $W$ is sub-Gaussian (with parameter $\sigma$) if
\[
  \mathbb{E}e^{\alpha(W - \mathbb{E}W)} \leq e^{\alpha^2\sigma^2/2}
\]
for all $\alpha \in \mathbb{R}$.
Corollary. Any mean-zero sub-Gaussian random variable $W$ with parameter $\sigma$ satisfies
\[
  \mathbb{P}(W \geq t) \leq e^{-t^2/(2\sigma^2)}.
\]
In general, bounded random variables are sub-Gaussian.
Lemma (Hoeffding's lemma). If $W$ has mean zero and takes values in $[a, b]$,
then $W$ is sub-Gaussian with parameter $\frac{b - a}{2}$.
Recall that the sum of two independent Gaussians is still a Gaussian. This
continues to hold for sub-Gaussian random variables.
Proposition. Let $(W_i)_{i=1}^n$ be independent mean-zero sub-Gaussian random
variables with parameters $(\sigma_i)_{i=1}^n$, and let $\gamma \in \mathbb{R}^n$. Then $\gamma^T W$ is sub-Gaussian
with parameter
\[
  \left(\sum (\gamma_i\sigma_i)^2\right)^{1/2}.
\]
Proof. We have
\[
  \mathbb{E}\exp\left(\alpha\sum_{i=1}^n \gamma_i W_i\right)
  = \prod_{i=1}^n \mathbb{E}\exp(\alpha\gamma_i W_i)
  \leq \prod_{i=1}^n \exp\left(\frac{\alpha^2}{2}\gamma_i^2\sigma_i^2\right)
  = \exp\left(\frac{\alpha^2}{2}\sum_{i=1}^n \sigma_i^2\gamma_i^2\right),
\]
using independence for the first equality.
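As an illustration of this proposition (an added sketch, not part of the notes), we can simulate $\gamma^T W$ for independent mean-zero Gaussians, which are sub-Gaussian with parameter equal to their standard deviation, and compare the empirical tail with the bound $e^{-t^2/(2\tau^2)}$, where $\tau^2 = \sum(\gamma_i\sigma_i)^2$. All numerical values below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent mean-zero Gaussians: W_i ~ N(0, sigma_i^2) is sub-Gaussian with
# parameter sigma_i.  The sigmas and gammas are arbitrary.
sigmas = np.array([1.0, 0.5, 2.0, 1.5, 0.8])
gamma = np.array([0.3, -1.0, 0.2, 0.5, -0.4])

# The proposition says gamma^T W is sub-Gaussian with parameter tau below.
tau = np.sqrt(np.sum((gamma * sigmas) ** 2))

W = rng.normal(scale=sigmas, size=(200_000, len(sigmas)))
S = W @ gamma

for t in [1.0, 2.0, 3.0]:
    print(f"t = {t}: empirical tail = {np.mean(S >= t):.4f}, "
          f"bound exp(-t^2/(2 tau^2)) = {np.exp(-t**2 / (2 * tau**2)):.4f}")
```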
We can now prove our bound for the Lasso, which in fact works whenever the
errors are sub-Gaussian.
Lemma. Suppose $(\varepsilon_i)_{i=1}^n$ are independent, mean-zero sub-Gaussian with common parameter $\sigma$. Let
\[
  \lambda = A\sigma\sqrt{\frac{\log p}{n}}.
\]
Let $X$ be a matrix whose columns all have norm $\sqrt{n}$. Then
\[
  \mathbb{P}\left(\frac{1}{n}\|X^T\varepsilon\|_\infty \leq \lambda\right) \geq 1 - 2p^{-(A^2/2 - 1)}.
\]
In particular, this includes $\varepsilon \sim N_n(0, \sigma^2 I)$.
Proof. We have
\[
  \mathbb{P}\left(\frac{1}{n}\|X^T\varepsilon\|_\infty > \lambda\right)
  \leq \sum_{j=1}^p \mathbb{P}\left(\frac{1}{n}|X_j^T\varepsilon| > \lambda\right).
\]
But $\pm\frac{1}{n}X_j^T\varepsilon$ are both sub-Gaussian with parameter
\[
  \sigma\left(\sum_i \left(\frac{X_{ij}}{n}\right)^2\right)^{1/2} = \frac{\sigma}{\sqrt{n}}.
\]
Then by our previous corollary, we get
\[
  \sum_{j=1}^p \mathbb{P}\left(\frac{1}{n}|X_j^T\varepsilon| > \lambda\right)
  \leq 2p\exp\left(-\frac{\lambda^2 n}{2\sigma^2}\right).
\]
Note that we have the factor of 2 since we need to consider the two cases
$\frac{1}{n}X_j^T\varepsilon > \lambda$ and $-\frac{1}{n}X_j^T\varepsilon > \lambda$.
Plugging in our expression for $\lambda$, we can write the bound as
\[
  2p\exp\left(-\frac{1}{2}A^2\log p\right) = 2p^{1 - A^2/2}.
\]
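The lemma can also be checked by simulation (an added sketch, not part of the notes): generate Gaussian errors, rescale the columns of a random design to have norm $\sqrt{n}$, and compare the empirical probability of the event with the guaranteed lower bound. The values of $n$, $p$, $\sigma$ and $A$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary values for this sketch.
n, p, sigma, A = 200, 50, 1.0, 2.5
lam = A * sigma * np.sqrt(np.log(p) / n)

# Random design with columns rescaled to have norm sqrt(n), as in the lemma.
X = rng.normal(size=(n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)

reps, hits = 2000, 0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)            # eps ~ N_n(0, sigma^2 I)
    hits += np.max(np.abs(X.T @ eps)) / n <= lam

print("empirical probability:", hits / reps)
print("guaranteed lower bound:", 1 - 2 * p ** (1 - A**2 / 2))
```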
This is all we need for our result on the Lasso. We are going to go a bit
further into this topic of concentration inequalities, because we will need them
later when we impose conditions on the design matrix. In particular, we would
like to bound the tail probabilities of products of random variables.
Definition (Bernstein's condition). We say that a random variable $W$ satisfies
Bernstein's condition with parameters $(\sigma, b)$, where $\sigma, b > 0$, if
\[
  \mathbb{E}\left[|W - \mathbb{E}W|^k\right] \leq \frac{1}{2}k!\,\sigma^2 b^{k-2}
\]
for $k = 2, 3, \ldots$.
The point is that these bounds on the moments let us bound the moment
generating function of $W$.
Proposition (Bernstein's inequality). Let $W_1, W_2, \ldots, W_n$ be independent random variables with $\mathbb{E}W_i = \mu$, and suppose each $W_i$ satisfies Bernstein's condition
with parameters $(\sigma, b)$. Then
\[
  \mathbb{E}e^{\alpha(W_i - \mu)} \leq \exp\left(\frac{\alpha^2\sigma^2/2}{1 - b|\alpha|}\right) \quad \text{for all } |\alpha| < \frac{1}{b},
\]
\[
  \mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n W_i - \mu \geq t\right) \leq \exp\left(-\frac{nt^2}{2(\sigma^2 + bt)}\right) \quad \text{for all } t \geq 0.
\]
Note that for large $t$, the bound goes as $e^{-t}$ instead of $e^{-t^2}$.
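To make this $e^{-t}$ versus $e^{-t^2}$ behaviour concrete, here is a small numerical comparison of the exponent in Bernstein's bound with the purely sub-Gaussian exponent (an added sketch with arbitrary parameter values, not from the notes).

```python
import numpy as np

# Arbitrary parameters; n = 1 so we only look at the shape of the exponent.
sigma, b, n = 1.0, 0.5, 1
ts = np.array([0.5, 1.0, 2.0, 5.0, 10.0, 50.0])

bernstein_exp = -n * ts**2 / (2 * (sigma**2 + b * ts))   # exponent in Bernstein's bound
gaussian_exp = -n * ts**2 / (2 * sigma**2)               # purely sub-Gaussian exponent

for t, be, ge in zip(ts, bernstein_exp, gaussian_exp):
    print(f"t = {t:5.1f}: Bernstein exponent = {be:9.2f}, Gaussian exponent = {ge:9.2f}")
# For small t the two exponents nearly agree; for large t the Bernstein
# exponent behaves like -n t / (2b), i.e. linear rather than quadratic in t.
```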
Proof. For the first part, we fix $i$ and write $W = W_i$. Let $|\alpha| < \frac{1}{b}$. Then
\[
  \mathbb{E}e^{\alpha(W - \mu)}
  = 1 + \sum_{k=2}^\infty \frac{\alpha^k\,\mathbb{E}[(W - \mu)^k]}{k!}
  \leq 1 + \frac{\sigma^2\alpha^2}{2}\sum_{k=2}^\infty |\alpha|^{k-2} b^{k-2}
  = 1 + \frac{\sigma^2\alpha^2}{2}\cdot\frac{1}{1 - |\alpha| b}
  \leq \exp\left(\frac{\alpha^2\sigma^2/2}{1 - b|\alpha|}\right),
\]
where the $k = 1$ term vanishes since $\mathbb{E}W = \mu$, the first inequality uses Bernstein's condition, and the last step uses $1 + x \leq e^x$.
Then
\[
  \mathbb{E}\exp\left(\frac{\alpha}{n}\sum_{i=1}^n (W_i - \mu)\right)
  = \prod_{i=1}^n \mathbb{E}\exp\left(\frac{\alpha}{n}(W_i - \mu)\right)
  \leq \exp\left(\frac{n(\alpha/n)^2\sigma^2/2}{1 - b\,\alpha/n}\right),
\]
assuming $\frac{\alpha}{n} < \frac{1}{b}$. So it follows that
\[
  \mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n W_i - \mu \geq t\right)
  \leq e^{-\alpha t}\exp\left(\frac{n(\alpha/n)^2\sigma^2/2}{1 - b\,\alpha/n}\right).
\]
Setting
\[
  \frac{\alpha}{n} = \frac{t}{bt + \sigma^2} \in \left(0, \frac{1}{b}\right)
\]
gives the result.
Lemma. Let $W, Z$ be mean-zero sub-Gaussian random variables with parameters
$\sigma_W$ and $\sigma_Z$ respectively. Then $WZ$ satisfies Bernstein's condition with parameters
$(8\sigma_W\sigma_Z, 4\sigma_W\sigma_Z)$.
Proof. For any random variable $Y$ (which we will later take to be $WZ$), for
$k > 1$, we know
\[
  \mathbb{E}|Y - \mathbb{E}Y|^k = 2^k\,\mathbb{E}\left|\frac{1}{2}Y - \frac{1}{2}\mathbb{E}Y\right|^k
  \leq 2^k\,\mathbb{E}\left(\frac{1}{2}|Y| + \frac{1}{2}|\mathbb{E}Y|\right)^k.
\]
Note that
\[
  \left(\frac{1}{2}|Y| + \frac{1}{2}|\mathbb{E}Y|\right)^k \leq \frac{|Y|^k + |\mathbb{E}Y|^k}{2}
\]
by Jensen's inequality (convexity of $x \mapsto x^k$). Applying Jensen's inequality again, we have
\[
  |\mathbb{E}Y|^k \leq \mathbb{E}|Y|^k.
\]
Putting the whole thing together, we have
\[
  \mathbb{E}|Y - \mathbb{E}Y|^k \leq 2^k\,\mathbb{E}|Y|^k.
\]
Now take $Y = WZ$. Then
\[
  \mathbb{E}|WZ - \mathbb{E}[WZ]|^k \leq 2^k\,\mathbb{E}|WZ|^k
  \leq 2^k\left(\mathbb{E}W^{2k}\right)^{1/2}\left(\mathbb{E}Z^{2k}\right)^{1/2},
\]
by the Cauchy–Schwarz inequality.
We know that sub-Gaussians satisfy a bound on the tail probability. We can
then use this to bound the moments of $W$ and $Z$. First note that
\[
  W^{2k} = \int_0^\infty \mathbf{1}_{x < W^{2k}}\,dx.
\]
Taking expectations of both sides, we get
\[
  \mathbb{E}W^{2k} = \int_0^\infty \mathbb{P}(x < W^{2k})\,dx.
\]
Since we have a tail bound on $W$ instead of $W^{2k}$, we substitute $x = t^{2k}$. Then
$dx = 2kt^{2k-1}\,dt$. So we get
\[
  \mathbb{E}W^{2k} = 2k\int_0^\infty t^{2k-1}\,\mathbb{P}(|W| > t)\,dt
  \leq 4k\int_0^\infty t^{2k-1}\exp\left(-\frac{t^2}{2\sigma_W^2}\right)dt,
\]
where again we have a factor of 2 to account for both signs. We perform yet
another substitution
\[
  x = \frac{t^2}{2\sigma_W^2}, \quad dx = \frac{t}{\sigma_W^2}\,dt.
\]
Then we get
\[
  \mathbb{E}W^{2k} \leq 4k\int_0^\infty (2\sigma_W^2 x)^{k-1}\sigma_W^2\,e^{-x}\,dx
  = 2^{k+1} k\,\sigma_W^{2k}\int_0^\infty x^{k-1}e^{-x}\,dx
  = 2^{k+1} k!\,\sigma_W^{2k}.
\]
Plugging this back in, we have
\[
  \mathbb{E}|WZ - \mathbb{E}[WZ]|^k \leq 2^k \cdot 2^{k+1} k!\,\sigma_W^k\sigma_Z^k
  = \frac{1}{2}k!\,2^{2k+2}\sigma_W^k\sigma_Z^k
  = \frac{1}{2}k!\,(8\sigma_W\sigma_Z)^2(4\sigma_W\sigma_Z)^{k-2}.
\]
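As a sanity check of this lemma (an added sketch, not from the notes), one can estimate the moments $\mathbb{E}|WZ - \mathbb{E}[WZ]|^k$ by Monte Carlo for independent Gaussians $W$ and $Z$ and compare them with the Bernstein bound $\frac{1}{2}k!(8\sigma_W\sigma_Z)^2(4\sigma_W\sigma_Z)^{k-2}$; the parameter values below are arbitrary.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(2)

# Independent W ~ N(0, sigma_w^2) and Z ~ N(0, sigma_z^2); these are mean-zero
# sub-Gaussian with parameters sigma_w and sigma_z.  Values are arbitrary.
sigma_w, sigma_z = 1.0, 1.5
W = rng.normal(scale=sigma_w, size=1_000_000)
Z = rng.normal(scale=sigma_z, size=1_000_000)
Y = W * Z

for k in range(2, 7):
    moment = np.mean(np.abs(Y - Y.mean()) ** k)      # Monte Carlo estimate of E|Y - EY|^k
    bound = 0.5 * factorial(k) * (8 * sigma_w * sigma_z) ** 2 * (4 * sigma_w * sigma_z) ** (k - 2)
    print(f"k = {k}: estimated moment = {moment:12.3f}, Bernstein bound = {bound:14.1f}")
```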