3.3 Convex analysis and optimization theory
We’ll leave these estimates aside for a bit, and give more background on convex
analysis and convex optimization. Recall the following definition:
Definition (Convex set). A set $A \subseteq \mathbb{R}^d$ is convex if for any $x, y \in A$ and $t \in [0, 1]$, we have $(1 - t)x + ty \in A$.
[Figure: a non-convex set and a convex set.]
We are actually more interested in convex functions. We shall allow our functions to take the value $\infty$, so let us define $\bar{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$. The point is that if we want our function to be defined on $[a, b]$, then it is convenient to extend it to be defined on all of $\mathbb{R}$ by setting the function to be $\infty$ outside of $[a, b]$.
Definition (Convex function). A function $f: \mathbb{R}^d \to \bar{\mathbb{R}}$ is convex iff
$$f((1 - t)x + ty) \le (1 - t)f(x) + tf(y)$$
for all $x, y \in \mathbb{R}^d$ and $t \in (0, 1)$. Moreover, we require that $f(x) < \infty$ for at least one $x$.
We say $f$ is strictly convex if the inequality is strict for all $x \neq y$ and $t \in (0, 1)$.
[Figure: graph of a convex function; the chord value $(1 - t)f(x) + tf(y)$ at the point $(1 - t)x + ty$ lies above the graph.]
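For example, the absolute value function $x \mapsto |x|$ on $\mathbb{R}$ is convex: for any $x, y \in \mathbb{R}$ and $t \in (0, 1)$, the triangle inequality gives
$$|(1 - t)x + ty| \le |(1 - t)x| + |ty| = (1 - t)|x| + t|y|,$$
which is exactly the defining inequality. It is not strictly convex, since the inequality is an equality whenever $x$ and $y$ have the same sign.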
Definition (Domain). Define the domain of a function $f: \mathbb{R}^d \to \bar{\mathbb{R}}$ to be
$$\operatorname{dom} f = \{x : f(x) < \infty\}.$$
One sees that the domain of a convex function is always convex.
Proposition.
(i) Let $f_1, \ldots, f_m: \mathbb{R}^d \to \bar{\mathbb{R}}$ be convex with $\operatorname{dom} f_1 \cap \cdots \cap \operatorname{dom} f_m \neq \emptyset$, and let $c_1, \ldots, c_m \ge 0$. Then $c_1 f_1 + \cdots + c_m f_m$ is a convex function.
(ii) If $f: \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable, then
(a) $f$ is convex iff its Hessian is positive semi-definite everywhere;
(b) $f$ is strictly convex if its Hessian is positive definite everywhere.
Note that having a positive definite Hessian is not necessary for strict convexity, e.g. $x^4$ is strictly convex but has vanishing Hessian at $0$.
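As a rough numerical sanity check of (ii)(a), one can approximate the Hessian by finite differences and verify that its eigenvalues are non-negative at many points. The sketch below does this in Python; the helper and the test function $f(x) = x_1^4 + x_2^2$ are chosen purely for illustration.

import numpy as np

def hessian_fd(f, x, h=1e-4):
    """Finite-difference approximation of the Hessian of f at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(x + h*e_i + h*e_j) - f(x + h*e_i - h*e_j)
                       - f(x - h*e_i + h*e_j) + f(x - h*e_i - h*e_j)) / (4*h*h)
    return H

# Convex test function whose Hessian is only semi-definite at the origin.
f = lambda x: x[0]**4 + x[1]**2

rng = np.random.default_rng(0)
for x in rng.uniform(-2, 2, size=(100, 2)):
    eigvals = np.linalg.eigvalsh(hessian_fd(f, x))
    assert eigvals.min() > -1e-4   # positive semi-definite up to finite-difference noise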
Now consider a constrained optimization problem
$$\text{minimize } f(x) \text{ subject to } g(x) = 0,$$
where $x \in \mathbb{R}^d$ and $g: \mathbb{R}^d \to \mathbb{R}^b$. The Lagrangian for this problem is
$$L(x, \theta) = f(x) + \theta^T g(x).$$
Why is this helpful? Suppose $c^*$ is the minimum of $f$ subject to the constraint $g(x) = 0$. Then note that for any $\theta$, we have
$$\inf_{x \in \mathbb{R}^d} L(x, \theta) \le \inf_{x \in \mathbb{R}^d : g(x) = 0} L(x, \theta) = \inf_{x \in \mathbb{R}^d : g(x) = 0} f(x) = c^*.$$
Thus, if we can find some $\theta^*, x^*$ such that $x^*$ minimizes $L(x, \theta^*)$ over $x$ and $g(x^*) = 0$, then $f(x^*) = L(x^*, \theta^*) = \inf_{x \in \mathbb{R}^d} L(x, \theta^*) \le c^*$; since $x^*$ is also feasible, it is indeed the optimal solution.
This gives us a method to solve the optimization problem: for each fixed $\theta$, solve the unconstrained optimization problem $\operatorname*{argmin}_x L(x, \theta)$. If we are doing this analytically, then we would have a formula for $x$ in terms of $\theta$. We then seek a $\theta$ such that $g(x) = 0$ holds.
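As a simple illustration of this recipe (the problem and all names below are ours, chosen for illustration), take $f(x) = \frac{1}{2}\|x\|_2^2$ and $g(x) = Ax - b$. For fixed $\theta$, minimizing $L(x, \theta) = \frac{1}{2}\|x\|_2^2 + \theta^T(Ax - b)$ over $x$ gives $x(\theta) = -A^T\theta$; choosing $\theta$ so that $g(x(\theta)) = 0$ then gives $\theta^* = -(AA^T)^{-1}b$ and $x^* = A^T(AA^T)^{-1}b$. A minimal Python sketch:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))   # g(x) = A x - b with b = 3 constraints, d = 5
b = rng.standard_normal(3)

# Step 1: for fixed theta, argmin_x L(x, theta) has the closed form x(theta) = -A^T theta.
def x_of_theta(theta):
    return -A.T @ theta

# Step 2: choose theta so that the constraint g(x(theta)) = 0 holds, i.e. solve -A A^T theta = b.
theta_star = np.linalg.solve(-A @ A.T, b)
x_star = x_of_theta(theta_star)

print("constraint residual:", np.linalg.norm(A @ x_star - b))   # ~ 0
# Sanity check: this is the minimum-norm solution of A x = b, which lstsq also returns.
print("matches lstsq solution:",
      np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0]))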
Subgradients
Usually, when we have a function to optimize, we take its derivative and set it to zero. This works well if our function is actually differentiable. However, the $\ell_1$ norm is not a differentiable function, since $|x|$ is not differentiable at $0$. This is not some exotic case we may hope to avoid most of the time: when solving the Lasso, we actively want our solutions to have zeroes, so we really want to get to these non-differentiable points.
Thus, we seek some generalized notion of derivative that works on functions that are not differentiable.
Definition (Subgradient). A vector $v \in \mathbb{R}^d$ is a subgradient of a convex function $f$ at $x$ if
$$f(y) \ge f(x) + v^T(y - x) \text{ for all } y \in \mathbb{R}^d.$$
The set of subgradients of $f$ at $x$ is denoted $\partial f(x)$, and is called the subdifferential.
This is indeed a generalization of the derivative, since
Proposition. Let $f$ be convex and differentiable at $x \in \operatorname{int}(\operatorname{dom} f)$. Then $\partial f(x) = \{\nabla f(x)\}$.
The following properties are immediate from the definition.
Proposition. Suppose $f$ and $g$ are convex with $\operatorname{int}(\operatorname{dom} f) \cap \operatorname{int}(\operatorname{dom} g) \neq \emptyset$, and $\alpha > 0$. Then
$$\partial(\alpha f)(x) = \alpha\, \partial f(x) = \{\alpha v : v \in \partial f(x)\},$$
$$\partial(f + g)(x) = \partial f(x) + \partial g(x).$$
The condition (for convex differentiable functions) that "$x$ is a minimum iff $f'(x) = 0$" now becomes
Proposition. If $f$ is convex, then
$$x^* \in \operatorname*{argmin}_{x \in \mathbb{R}^d} f(x) \iff 0 \in \partial f(x^*).$$
Proof. Both sides are equivalent to the requirement that $f(y) \ge f(x^*)$ for all $y$.
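To see this condition in action on a Lasso-flavoured toy problem (the one-dimensional example and all names in the code are ours, for illustration only), take $f(x) = \frac{1}{2}(x - z)^2 + \lambda|x|$ for fixed $z \in \mathbb{R}$ and $\lambda > 0$. Using the sum rule above together with the subdifferential of $|\cdot|$ computed below, the condition $0 \in \partial f(x^*)$ works out to $x^* = \operatorname{sgn}(z)\max(|z| - \lambda, 0)$; the sketch checks this against a brute-force grid search.

import numpy as np

def soft_threshold(z, lam):
    """Minimizer of f(x) = (x - z)^2 / 2 + lam*|x|, from the condition 0 in subgradient of f at x*."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

z, lam = 1.3, 0.7
x_star = soft_threshold(z, lam)

# Brute-force check on a fine grid.
grid = np.linspace(-3, 3, 200001)
vals = 0.5 * (grid - z) ** 2 + lam * np.abs(grid)
print("closed form:", x_star)                    # 0.6
print("grid search:", grid[np.argmin(vals)])     # ~ 0.6

# Stationarity at x* != 0: 0 = x* - z + lam * sgn(x*).
print("stationarity residual:", x_star - z + lam * np.sign(x_star))   # 0.0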
We are interested in applying this to the Lasso. So we want to compute the subdifferential of the $\ell_1$ norm. Let's first introduce some notation.
Notation. For $x \in \mathbb{R}^d$ and $A \subseteq \{1, \ldots, d\}$, we write $x_A$ for the sub-vector of $x$ formed by the components of $x$ indexed by $A$. We write $x_{-j} = x_{\{j\}^c} = x_{\{1, \ldots, d\} \setminus \{j\}}$. Similarly, we write $x_{-jk} = x_{\{j, k\}^c}$, etc.
We write
$$\operatorname{sgn}(x_i) = \begin{cases} -1 & x_i < 0 \\ 1 & x_i > 0 \\ 0 & x_i = 0 \end{cases},$$
and $\operatorname{sgn}(x) = (\operatorname{sgn}(x_1), \ldots, \operatorname{sgn}(x_d))^T$.
First note that $\| \cdot \|_1$ is convex, as it is a norm.
Proposition. For $x \in \mathbb{R}^d$ and $A = \{j : x_j \neq 0\}$, we have
$$\partial \|x\|_1 = \{v \in \mathbb{R}^d : \|v\|_\infty \le 1,\ v_A = \operatorname{sgn}(x_A)\}.$$
Proof. It suffices to look at the subdifferential of the absolute value of each coordinate, and then add them up.
For $j = 1, \ldots, d$, we define $g_j: \mathbb{R}^d \to \mathbb{R}$ by sending $x$ to $|x_j|$. If $x_j \neq 0$, then $g_j$ is differentiable at $x$, and so we know $\partial g_j(x) = \{\operatorname{sgn}(x_j) e_j\}$, with $e_j$ the $j$th standard basis vector.
When $x_j = 0$, if $v \in \partial g_j(x)$, then
$$g_j(y) \ge g_j(x) + v^T(y - x).$$
So
$$|y_j| \ge v^T(y - x).$$
We claim this holds iff $v_j \in [-1, 1]$ and $v_{-j} = 0$. The $\Leftarrow$ direction is an immediate calculation, and to show $\Rightarrow$, we pick $y_{-j} = v_{-j} + x_{-j}$ and $y_j = 0$. Then we have
$$0 \ge v_{-j}^T v_{-j}.$$
So we know that $v_{-j} = 0$. Once we know this, the condition says $|y_j| \ge v_j y_j$ for all $y_j$. This is then true iff $v_j \in [-1, 1]$. Forming the set sum of the $\partial g_j(x)$ gives the result.
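This characterization is easy to sanity-check numerically. The sketch below (the test point and variable names are ours) builds a $v$ with $v_A = \operatorname{sgn}(x_A)$ and $\|v\|_\infty \le 1$, and verifies the subgradient inequality $\|y\|_1 \ge \|x\|_1 + v^T(y - x)$ on randomly drawn $y$.

import numpy as np

rng = np.random.default_rng(2)

# A sparse test point: x_A nonzero, zero on the complement of A.
x = np.array([1.5, 0.0, -0.3, 0.0, 2.0])
A = x != 0

# A member of the claimed subdifferential: v_A = sgn(x_A), |v_j| <= 1 elsewhere.
v = np.where(A, np.sign(x), rng.uniform(-1, 1, size=x.shape))

# Check the subgradient inequality ||y||_1 >= ||x||_1 + v^T (y - x) on random y.
for _ in range(10000):
    y = rng.standard_normal(x.shape) * 3
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + v @ (y - x) - 1e-12

print("subgradient inequality held on all sampled y")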