3.3 Convex analysis and optimization theory
We’ll leave these estimates aside for a bit, and give more background on convex
analysis and convex optimization. Recall the following definition:
Definition (Convex set). A set A ⊆ R^d is convex if for any x, y ∈ A and t ∈ [0, 1], we have (1 − t)x + ty ∈ A.
[Figure: a non-convex set and a convex set.]
We are actually more interested in convex functions. We shall allow our functions to take the value ∞, so let us define R̄ = R ∪ {∞}. The point is that if we want our function to be defined on [a, b], then it is convenient to extend it to be defined on all of R by setting the function to be ∞ outside of [a, b].
Definition (Convex function). A function f : R^d → R̄ is convex iff
f((1 − t)x + ty) ≤ (1 − t)f(x) + tf(y)
for all x, y ∈ R^d and t ∈ (0, 1). Moreover, we require that f(x) < ∞ for at least one x.
We say f is strictly convex if the inequality is strict for all x ≠ y and t ∈ (0, 1).
[Figure: the chord from (x, f(x)) to (y, f(y)) lies above the graph, so at (1 − t)x + ty the function value is at most (1 − t)f(x) + tf(y).]
Definition (Domain). Define the domain of a function f : R^d → R̄ to be
dom f = {x : f(x) < ∞}.
One sees that the domain of a convex function is always convex.
Proposition.
(i) Let f_1, . . . , f_m : R^d → R̄ be convex with dom f_1 ∩ · · · ∩ dom f_m ≠ ∅, and let c_1, . . . , c_m ≥ 0. Then c_1 f_1 + · · · + c_m f_m is a convex function.
(ii) If f : R^d → R is twice continuously differentiable, then
(a) f is convex iff its Hessian is positive semi-definite everywhere.
(b) f is strictly convex if its Hessian is positive definite everywhere.
Note that having a positive definite Hessian is not necessary for strict convexity, e.g. x^4 is strictly convex but has vanishing Hessian at 0.
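To make this remark concrete, here is a minimal numerical sketch (the function and the sampled points are my own choices for illustration, not part of the notes): f(x) = x^4 has second derivative 12x^2, which vanishes at 0, yet the strict convexity inequality still holds at sampled points.

```python
import numpy as np

# Illustration (not from the notes): f(x) = x^4 in one dimension.
# Its second derivative 12 x^2 vanishes at 0, so the Hessian is not
# positive definite there, yet f is still strictly convex.

f = lambda x: x ** 4
f2 = lambda x: 12 * x ** 2           # second derivative, the 1-d "Hessian"

print("f''(0) =", f2(0.0))           # 0, so not positive definite at 0

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2)        # x != y with probability 1
    t = rng.uniform(0.01, 0.99)
    lhs = f((1 - t) * x + t * y)
    rhs = (1 - t) * f(x) + t * f(y)
    print("strict inequality holds:", lhs < rhs)
```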
Now consider a constrained optimization problem
minimize f(x) subject to g(x) = 0,
where x ∈ R^d and g : R^d → R^b. The Lagrangian for this problem is
L(x, θ) = f(x) + θ^T g(x).
Why is this helpful? Suppose c* is the minimum of f subject to the constraint g(x) = 0. Then note that for any θ, we have
inf_{x∈R^d} L(x, θ) ≤ inf_{x∈R^d : g(x)=0} L(x, θ) = inf_{x∈R^d : g(x)=0} f(x) = c*.
Thus, if we can find some θ*, x* such that x* minimizes L(x, θ*) and g(x*) = 0, then x* is indeed an optimal solution: f(x*) = L(x*, θ*) = inf_{x∈R^d} L(x, θ*) ≤ c*, while feasibility of x* gives f(x*) ≥ c*.
This gives us a method to solve the optimization problem: for each fixed θ, solve the unconstrained optimization problem argmin_x L(x, θ). If we are doing this analytically, then we would have a formula for x in terms of θ. We then seek a θ such that g(x) = 0 holds.
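As a concrete sketch of this recipe (the objective f(x) = ½‖x‖^2, the constraint g(x) = a^T x − 1 and all names below are assumptions chosen only for illustration), the unconstrained minimizer of L(x, θ) is x(θ) = −θa, and requiring g(x(θ)) = 0 forces θ = −1/‖a‖^2, so x* = a/‖a‖^2. The code below just checks this numerically:

```python
import numpy as np

# Sketch: minimize f(x) = 0.5 * ||x||^2 subject to g(x) = a^T x - 1 = 0.
# Lagrangian: L(x, theta) = 0.5 * ||x||^2 + theta * (a^T x - 1).
# For fixed theta the unconstrained minimizer is x(theta) = -theta * a;
# choosing theta so that g(x(theta)) = 0 gives theta = -1 / ||a||^2.

a = np.array([3.0, 4.0])                 # arbitrary constraint vector (assumption)

theta_star = -1.0 / (a @ a)              # solves g(x(theta)) = 0
x_star = -theta_star * a                 # = a / ||a||^2

print("x* =", x_star)                                  # [0.12, 0.16]
print("constraint g(x*) =", a @ x_star - 1.0)          # ~0
print("objective f(x*) =", 0.5 * x_star @ x_star)      # = 1 / (2 ||a||^2)
```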
Subgradients
Usually, when we have a function to optimize, we take its derivative and set it
to zero. This works well if our function is actually differentiable. However, the
ℓ₁ norm is not a differentiable function, since |x| is not differentiable at 0. This is not some exotic case we may hope to avoid most of the time: when solving the Lasso, we actively want our solutions to have zeroes, so we really want to get to these non-differentiable points.
Thus, we seek some generalized notion of derivative that works on functions
that are not differentiable.
Definition (Subgradient). A vector v ∈ R^d is a subgradient of a convex function f at x if f(y) ≥ f(x) + v^T(y − x) for all y ∈ R^d.
The set of subgradients of f at x is denoted ∂f(x), and is called the subdifferential.
This is indeed a generalization of the derivative, since
Proposition. Let f be convex and differentiable at x ∈ int(dom f). Then ∂f(x) = {∇f(x)}.
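A quick numerical sanity check of this (purely illustrative; the choice f(x) = ‖x‖^2 and the sampled points are assumptions): the gradient should satisfy the defining subgradient inequality at every y.

```python
import numpy as np

# Illustration (not from the notes): for the differentiable convex function
# f(x) = ||x||^2 the gradient is 2x, and by the proposition it should satisfy
# f(y) >= f(x) + (2x)^T (y - x) for every y.

f = lambda x: x @ x
grad = lambda x: 2 * x

rng = np.random.default_rng(1)
x = rng.normal(size=3)
ok = all(f(y) >= f(x) + grad(x) @ (y - x) - 1e-12
         for y in rng.normal(size=(100, 3)))
print("subgradient inequality holds at all sampled y:", ok)
```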
The following properties are immediate from the definition.
Proposition. Suppose f and g are convex with int(dom f) ∩ int(dom g) ≠ ∅, and α > 0. Then
∂(αf)(x) = α ∂f(x) = {αv : v ∈ ∂f(x)},
∂(f + g)(x) = ∂f(x) + ∂g(x).
The condition (for convex differentiable functions) that “x is a minimum iff f′(x) = 0” now becomes
Proposition. If f is convex, then
x* ∈ argmin_{x∈R^d} f(x) ⇔ 0 ∈ ∂f(x*).
Proof. Both sides are equivalent to the requirement that f(y) ≥ f(x*) for all y.
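To see this condition at work on a Lasso-flavoured problem (a hypothetical one-dimensional example, not taken from the notes), minimize f(x) = ½(x − z)^2 + λ|x|. The condition 0 ∈ ∂f(x*) reads x* − z + λv = 0 for some v ∈ ∂|x*|, which yields the soft-thresholding solution x* = sgn(z) max(|z| − λ, 0). A small sketch checking this numerically:

```python
import numpy as np

# Sketch: 1-d lasso-type problem f(x) = 0.5 * (x - z)^2 + lam * |x|.
# The optimality condition 0 in the subdifferential of f at x* gives
# soft thresholding: x* = sign(z) * max(|z| - lam, 0).

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

z, lam = 0.8, 0.5
x_star = soft_threshold(z, lam)

# Brute-force check on a fine grid that x_star indeed minimizes f.
grid = np.linspace(-3, 3, 200001)
f_vals = 0.5 * (grid - z) ** 2 + lam * np.abs(grid)
print("soft threshold:", x_star)                      # 0.3
print("grid minimizer:", grid[np.argmin(f_vals)])     # approximately 0.3
```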
We are interested in applying this to the Lasso. So we want to compute the
subdifferential of the ℓ₁ norm. Let’s first introduce some notation.
Notation. For x ∈ R^d and A ⊆ {1, . . . , d}, we write x_A for the sub-vector of x formed by the components of x induced by A. We write x_{−j} = x_{{j}^c} = x_{{1,...,d}\{j}}. Similarly, we write x_{−jk} = x_{{j,k}^c}, etc.
We write
sgn(x_i) = −1 if x_i < 0, 1 if x_i > 0, and 0 if x_i = 0,
and sgn(x) = (sgn(x_1), . . . , sgn(x_d))^T.
First note that ‖ · ‖_1 is convex, as it is a norm.
Proposition. For x ∈ R^d and A = {j : x_j ≠ 0}, we have
∂‖x‖_1 = {v ∈ R^d : ‖v‖_∞ ≤ 1, v_A = sgn(x_A)}.
Proof. It suffices to look at the subdifferential of each coordinate-wise absolute value function, and then add them up.
For j = 1, . . . , d, we define g_j : R^d → R by sending x to |x_j|. If x_j ≠ 0, then g_j is differentiable at x, and so we know ∂g_j(x) = {sgn(x_j) e_j}, with e_j the j-th standard basis vector.
When x_j = 0, if v ∈ ∂g_j(x), then
g_j(y) ≥ g_j(x) + v^T(y − x).
So
|y_j| ≥ v^T(y − x).
We claim this holds iff v_j ∈ [−1, 1] and v_{−j} = 0. The ⇐ direction is an immediate calculation, and to show ⇒, we pick y_{−j} = v_{−j} + x_{−j} and y_j = 0. Then we have
0 ≥ v_{−j}^T v_{−j}.
So we know that v_{−j} = 0. Once we know this, the condition says |y_j| ≥ v_j y_j for all y_j. This is then true iff v_j ∈ [−1, 1]. Forming the set sum of the ∂g_j(x) gives the result.
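The characterisation above is easy to check numerically; below is a small sketch (the helper name in_subdiff_l1 and the sampled points are assumptions made for illustration). It tests ‖v‖_∞ ≤ 1 and v_A = sgn(x_A), and cross-checks the defining inequality ‖y‖_1 ≥ ‖x‖_1 + v^T(y − x) at random y.

```python
import numpy as np

# Sketch: membership test for the subdifferential of the l1 norm at x,
# following the proposition: ||v||_inf <= 1 and v_A = sgn(x_A) on A = {j : x_j != 0}.

def in_subdiff_l1(v, x, tol=1e-10):
    A = x != 0
    return np.max(np.abs(v)) <= 1 + tol and np.allclose(v[A], np.sign(x[A]))

rng = np.random.default_rng(2)
x = np.array([1.5, 0.0, -0.3])
v = np.array([1.0, 0.4, -1.0])       # candidate: sgn on A, in [-1, 1] off A

print("passes the characterisation:", in_subdiff_l1(v, x))

# Cross-check the defining inequality ||y||_1 >= ||x||_1 + v^T (y - x).
ys = rng.normal(size=(1000, 3))
lhs = np.abs(ys).sum(axis=1)
rhs = np.abs(x).sum() + ys @ v - x @ v
print("subgradient inequality holds at all sampled y:", bool(np.all(lhs >= rhs - 1e-12)))
```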