3.3 Convex analysis and optimization theory
We’ll leave these estimates aside for a bit, and give more background on convex
analysis and convex optimization. Recall the following definition:
Definition (Convex set). A set A ⊆ R^d is convex if for any x, y ∈ A and t ∈ [0, 1], we have (1 − t)x + ty ∈ A.
[Figure: a non-convex set and a convex set.]
We are actually more interested in convex functions. We shall allow our functions to take the value ∞, so let us define R̄ = R ∪ {∞}. The point is that if we want our function to be defined on [a, b], then it is convenient to extend it to be defined on all of R by setting the function to be ∞ outside of [a, b].
Definition (Convex function). A function f : R^d → R̄ is convex iff
f((1 − t)x + ty) ≤ (1 − t)f(x) + tf(y)
for all x, y ∈ R^d and t ∈ (0, 1). Moreover, we require that f(x) < ∞ for at least one x.
We say f is strictly convex if the inequality is strict for all x ≠ y and t ∈ (0, 1).
[Figure: the chord from (x, f(x)) to (y, f(y)) lies above the graph, so at (1 − t)x + ty the function value is at most (1 − t)f(x) + tf(y).]
Definition (Domain). Define the domain of a function f : R^d → R̄ to be
dom f = {x : f(x) < ∞}.
One sees that the domain of a convex function is always convex.
Proposition.
(i) Let f_1, . . . , f_m : R^d → R̄ be convex with dom f_1 ∩ · · · ∩ dom f_m ≠ ∅, and let c_1, . . . , c_m ≥ 0. Then c_1 f_1 + · · · + c_m f_m is a convex function.
(ii) If f : R^d → R is twice continuously differentiable, then
(a) f is convex iff its Hessian is positive semi-definite everywhere.
(b) f is strictly convex if its Hessian is positive definite everywhere.
Note that having a positive definite Hessian is not necessary for strict convexity, e.g. x^4 is strictly convex but has vanishing Hessian at 0.
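To make this remark concrete, here is a minimal numerical sketch (the function and the sampled points are my own choices for illustration, not part of the notes): f(x) = x^4 has second derivative 12x^2, which vanishes at 0, yet the strict convexity inequality still holds at sampled points.

```python
import numpy as np

# Illustration (not from the notes): f(x) = x^4 in one dimension.
# Its second derivative 12 x^2 vanishes at 0, so the Hessian is not
# positive definite there, yet f is still strictly convex.

f = lambda x: x ** 4
f2 = lambda x: 12 * x ** 2           # second derivative, the 1-d "Hessian"

print("f''(0) =", f2(0.0))           # 0, so not positive definite at 0

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2)        # x != y with probability 1
    t = rng.uniform(0.01, 0.99)
    lhs = f((1 - t) * x + t * y)
    rhs = (1 - t) * f(x) + t * f(y)
    print("strict inequality holds:", lhs < rhs)
```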
Now consider a constrained optimization problem
minimize f(x) subject to g(x) = 0,
where x ∈ R^d and g : R^d → R^b. The Lagrangian for this problem is
L(x, θ) = f(x) + θ^T g(x).
Why is this helpful? Suppose c* is the minimum of f subject to the constraint g(x) = 0. Then note that for any θ, we have
inf_{x∈R^d} L(x, θ) ≤ inf_{x∈R^d : g(x)=0} L(x, θ) = inf_{x∈R^d : g(x)=0} f(x) = c*.
Thus, if we can find some θ*, x* such that x* minimizes L(x, θ*) and g(x*) = 0, then x* is indeed an optimal solution: f(x*) = L(x*, θ*) = inf_{x∈R^d} L(x, θ*) ≤ c*, while feasibility of x* gives f(x*) ≥ c*.
This gives us a method to solve the optimization problem: for each fixed θ, solve the unconstrained optimization problem argmin_x L(x, θ). If we are doing this analytically, then we would have a formula for x in terms of θ. We then seek a θ such that g(x) = 0 holds.
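As a concrete sketch of this recipe (the objective f(x) = ½‖x‖^2, the constraint g(x) = a^T x − 1 and all names below are assumptions chosen only for illustration), the unconstrained minimizer of L(x, θ) is x(θ) = −θa, and requiring g(x(θ)) = 0 forces θ = −1/‖a‖^2, so x* = a/‖a‖^2. The code below just checks this numerically:

```python
import numpy as np

# Sketch: minimize f(x) = 0.5 * ||x||^2 subject to g(x) = a^T x - 1 = 0.
# Lagrangian: L(x, theta) = 0.5 * ||x||^2 + theta * (a^T x - 1).
# For fixed theta the unconstrained minimizer is x(theta) = -theta * a;
# choosing theta so that g(x(theta)) = 0 gives theta = -1 / ||a||^2.

a = np.array([3.0, 4.0])                 # arbitrary constraint vector (assumption)

theta_star = -1.0 / (a @ a)              # solves g(x(theta)) = 0
x_star = -theta_star * a                 # = a / ||a||^2

print("x* =", x_star)                                  # [0.12, 0.16]
print("constraint g(x*) =", a @ x_star - 1.0)          # ~0
print("objective f(x*) =", 0.5 * x_star @ x_star)      # = 1 / (2 ||a||^2)
```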
Subgradients
Usually, when we have a function to optimize, we take its derivative and set it
to zero. This works well if our function is actually differentiable. However, the
ℓ₁ norm is not a differentiable function, since |x| is not differentiable at 0. This is not some exotic case we may hope to avoid most of the time: when solving the Lasso, we actively want our solutions to have zeroes, so we really want to get to these non-differentiable points.
Thus, we seek some generalized notion of derivative that works on functions
that are not differentiable.
Definition (Subgradient). A vector v ∈ R^d is a subgradient of a convex function f at x if f(y) ≥ f(x) + v^T(y − x) for all y ∈ R^d.
The set of subgradients of f at x is denoted ∂f(x), and is called the subdifferential.
This is indeed a generalization of the derivative, since
Proposition. Let f be convex and differentiable at x ∈ int(dom f). Then ∂f(x) = {∇f(x)}.
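A quick numerical sanity check of this (purely illustrative; the choice f(x) = ‖x‖^2 and the sampled points are assumptions): the gradient should satisfy the defining subgradient inequality at every y.

```python
import numpy as np

# Illustration (not from the notes): for the differentiable convex function
# f(x) = ||x||^2 the gradient is 2x, and by the proposition it should satisfy
# f(y) >= f(x) + (2x)^T (y - x) for every y.

f = lambda x: x @ x
grad = lambda x: 2 * x

rng = np.random.default_rng(1)
x = rng.normal(size=3)
ok = all(f(y) >= f(x) + grad(x) @ (y - x) - 1e-12
         for y in rng.normal(size=(100, 3)))
print("subgradient inequality holds at all sampled y:", ok)
```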
The following properties are immediate from the definition.
Proposition. Suppose f and g are convex with int(dom f) ∩ int(dom g) ≠ ∅, and α > 0. Then
∂(αf)(x) = α ∂f(x) = {αv : v ∈ ∂f(x)},
∂(f + g)(x) = ∂f(x) + ∂g(x).
The condition (for convex differentiable functions) that “x is a minimum iff f′(x) = 0” now becomes
Proposition. If f is convex, then
x* ∈ argmin_{x∈R^d} f(x) ⇔ 0 ∈ ∂f(x*).
Proof. Both sides are equivalent to the requirement that f(y) ≥ f(x*) for all y.
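To see this condition at work on a Lasso-flavoured problem (a hypothetical one-dimensional example, not taken from the notes), minimize f(x) = ½(x − z)^2 + λ|x|. The condition 0 ∈ ∂f(x*) reads x* − z + λv = 0 for some v ∈ ∂|x*|, which yields the soft-thresholding solution x* = sgn(z) max(|z| − λ, 0). A small sketch checking this numerically:

```python
import numpy as np

# Sketch: 1-d lasso-type problem f(x) = 0.5 * (x - z)^2 + lam * |x|.
# The optimality condition 0 in the subdifferential of f at x* gives
# soft thresholding: x* = sign(z) * max(|z| - lam, 0).

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

z, lam = 0.8, 0.5
x_star = soft_threshold(z, lam)

# Brute-force check on a fine grid that x_star indeed minimizes f.
grid = np.linspace(-3, 3, 200001)
f_vals = 0.5 * (grid - z) ** 2 + lam * np.abs(grid)
print("soft threshold:", x_star)                      # 0.3
print("grid minimizer:", grid[np.argmin(f_vals)])     # approximately 0.3
```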
We are interested in applying this to the Lasso. So we want to compute the
subdifferential of the ℓ₁ norm. Let’s first introduce some notation.
Notation. For x ∈ R^d and A ⊆ {1, . . . , d}, we write x_A for the sub-vector of x formed by the components of x induced by A. We write x_{−j} = x_{{j}^c} = x_{{1,...,d}\{j}}. Similarly, we write x_{−jk} = x_{{j,k}^c}, etc.
We write
sgn(x_i) = −1 if x_i < 0, 1 if x_i > 0, and 0 if x_i = 0,
and sgn(x) = (sgn(x_1), . . . , sgn(x_d))^T.
First note that ‖ · ‖_1 is convex, as it is a norm.
Proposition. For x ∈ R^d and A = {j : x_j ≠ 0}, we have
∂‖x‖_1 = {v ∈ R^d : ‖v‖_∞ ≤ 1, v_A = sgn(x_A)}.
Proof. It suffices to look at the subdifferential of each coordinate-wise absolute value function, and then add them up.
For j = 1, . . . , d, we define g_j : R^d → R by sending x to |x_j|. If x_j ≠ 0, then g_j is differentiable at x, and so we know ∂g_j(x) = {sgn(x_j) e_j}, with e_j the j-th standard basis vector.
When x_j = 0, if v ∈ ∂g_j(x), then
g_j(y) ≥ g_j(x) + v^T(y − x).
So
|y_j| ≥ v^T(y − x).
We claim this holds iff v_j ∈ [−1, 1] and v_{−j} = 0. The ⇐ direction is an immediate calculation, and to show ⇒, we pick y_{−j} = v_{−j} + x_{−j} and y_j = 0. Then we have
0 ≥ v_{−j}^T v_{−j}.
So we know that v_{−j} = 0. Once we know this, the condition says |y_j| ≥ v_j y_j for all y_j. This is then true iff v_j ∈ [−1, 1]. Forming the set sum of the ∂g_j(x) gives the result.
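The characterisation above is easy to check numerically; below is a small sketch (the helper name in_subdiff_l1 and the sampled points are assumptions made for illustration). It tests ‖v‖_∞ ≤ 1 and v_A = sgn(x_A), and cross-checks the defining inequality ‖y‖_1 ≥ ‖x‖_1 + v^T(y − x) at random y.

```python
import numpy as np

# Sketch: membership test for the subdifferential of the l1 norm at x,
# following the proposition: ||v||_inf <= 1 and v_A = sgn(x_A) on A = {j : x_j != 0}.

def in_subdiff_l1(v, x, tol=1e-10):
    A = x != 0
    return np.max(np.abs(v)) <= 1 + tol and np.allclose(v[A], np.sign(x[A]))

rng = np.random.default_rng(2)
x = np.array([1.5, 0.0, -0.3])
v = np.array([1.0, 0.4, -1.0])       # candidate: sgn on A, in [-1, 1] off A

print("passes the characterisation:", in_subdiff_l1(v, x))

# Cross-check the defining inequality ||y||_1 >= ||x||_1 + v^T (y - x).
ys = rng.normal(size=(1000, 3))
lhs = np.abs(ys).sum(axis=1)
rhs = np.abs(x).sum() + ys @ v - x @ v
print("subgradient inequality holds at all sampled y:", bool(np.all(lhs >= rhs - 1e-12)))
```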