
Lecture: Convex Cardinality Optimization

Junyu Zhang

Based on Stephen Boyd’s lecture notes

October 28, 2019

1 / 26
Outline

Convex cardinality optimization

Case studies

Hardness of the problem

ℓ1-regularized methods

More problem cases

2 / 26
Convex Cardinality Optimization

The cardinality of x ∈ R^n, written card(x), is the number of nonzero elements in x.

Cardinality is not continuous.

Cardinality is non-convex and non-concave. (Think of the 1-d case card(x), x ∈ R.)

Quasi-concave on R^n_+ since

card(λx + (1 − λ)y) ≥ min{card(x), card(y)}

Useful concept: support (sparsity pattern)

supp(x) := {i : x_i ≠ 0}

3 / 26
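As a quick illustration, here is a minimal NumPy sketch of card(x) and supp(x); the tolerance argument is a practical addition for floating-point data, not part of the definition.

```python
import numpy as np

def card(x, tol=0.0):
    # Number of entries with magnitude above tol (tol = 0 gives the exact cardinality).
    return int(np.sum(np.abs(x) > tol))

def supp(x, tol=0.0):
    # Support (sparsity pattern): indices of the nonzero entries.
    return np.flatnonzero(np.abs(x) > tol)

x = np.array([0.0, 1.5, 0.0, -2.0])
print(card(x))   # 2
print(supp(x))   # [1 3]
```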
More about cardinality

It is often called the “ℓ0-norm”, written ‖x‖_0. However, cardinality is not a norm.
Triangle inequality: ‖x + y‖_0 ≤ ‖x‖_0 + ‖y‖_0. (Yes)
Positive definite: ‖x‖_0 = 0 ⇐⇒ x = 0. (Yes)
Homogeneity: ‖a · x‖_0 = |a| · ‖x‖_0, ∀a ∈ R. (No)

Define d(x, y) := ‖x − y‖_0; then d(·, ·) is a metric (distance).
Triangle inequality: d(x, y) ≤ d(x, z) + d(y, z). (Yes)
Symmetry: d(x, y) = d(y, x). (Yes)
Identity of indiscernibles: d(x, y) = 0 ⇐⇒ x = y. (Yes)

If x ∈ {0, 1}^n, d(·, ·) is called the Hamming distance. It has many applications in data mining, e.g., recommendation systems.

4 / 26
Convex-cardinality problems

A convex-cardinality problem is a problem that would be convex without the card terms.

Minimizing cardinality as the objective:

minimize  card(x)
s.t.      x ∈ C

Having cardinality as a constraint:

minimize  f(x)
s.t.      x ∈ C
          card(x) ≤ k

5 / 26
Sparse portfolio selection
The problem setup:
n assets, asset i has expected return r_i, 1 ≤ i ≤ n.
The covariance matrix of the returns is Σ.
The amount invested in asset i is w_i.

The portfolio selection problem:

minimize  w^T Σ w
s.t.      r^T w = µ,  1^T w = 1.

The solution w* can be dense; many entries may be small but nonzero. Adding a cardinality constraint enforces a sparse portfolio:

minimize  w^T Σ w
s.t.      r^T w = µ,  1^T w = 1,
          card(w) ≤ k.
6 / 26
Sparse portfolio selection

which is often formulated as

minimize  w^T Σ w + γ · card(w)
s.t.      r^T w = µ,  1^T w = 1.

In practice, solve

minimize  w^T Σ w + λ‖w‖_1
s.t.      r^T w = µ,  1^T w = 1.

7 / 26
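A minimal CVXPY sketch of the ℓ1-penalized portfolio problem above; the returns, covariance, target return µ, and weight λ are all synthetic choices made here for illustration.

```python
import numpy as np
import cvxpy as cp

# Synthetic problem data (assumed): n assets, expected returns r, covariance Sigma.
np.random.seed(0)
n = 10
r = np.random.uniform(0.02, 0.10, n)
F = np.random.randn(n, n)
Sigma = F @ F.T / n + 0.01 * np.eye(n)        # positive definite covariance

mu, lam = 0.06, 0.05                          # target return and l1 weight (assumed)
w = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma) + lam * cp.norm(w, 1)),
                  [r @ w == mu, cp.sum(w) == 1])
prob.solve()
print("weights that are (near) zero:", int(np.sum(np.abs(w.value) < 1e-4)))
```

Increasing lam typically drives more weights to (near) zero at the cost of a higher variance; the polishing step described later can then re-optimize over just the selected assets.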
High dimensional statistics

Fitting linear models:

y = x^T β* + ε,   ε ∼ N(0, σ^2).

Collected data points {(x_i, y_i)}_{i=1}^n.

The linear regression (least-squares) estimator:

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)^2

Classical scenario, abundant (nondegenerate) data:

lim_{n→∞} β̂_n = β*

8 / 26
High dimensional statistics

Modern situation: x ∈ R^d, with d ≫ n.

Linear regression can be arbitrarily bad.
No theoretical guarantee.

Special structure, sparsity: card(β*) ≤ O(n/ log d) ≪ d

Choose a proper k and solve

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)^2   s.t.  card(β) ≤ k.

In practice, solve

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)^2 + λ_n ‖β‖_1.

With a proper λ_n, a good bound can be derived for ‖β̂_n − β*‖_2.

9 / 26
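A minimal sketch of the ℓ1-penalized regression above on synthetic high-dimensional data, assuming CVXPY is available; the data sizes, noise level, and λ_n are arbitrary illustrative choices.

```python
import numpy as np
import cvxpy as cp

# Synthetic data with d >> n and a k-sparse true coefficient vector (all sizes assumed).
np.random.seed(1)
n, d, k = 50, 200, 5
X = np.random.randn(n, d)
beta_star = np.zeros(d)
beta_star[:k] = np.random.randn(k)
y = X @ beta_star + 0.1 * np.random.randn(n)

lam = 2.0                                    # lambda_n; tuned by cross-validation in practice
beta = cp.Variable(d)
cp.Problem(cp.Minimize(cp.sum_squares(X @ beta - y) + lam * cp.norm(beta, 1))).solve()

print("card(beta_hat):", int(np.sum(np.abs(beta.value) > 1e-4)))
print("||beta_hat - beta*||_2:", float(np.linalg.norm(beta.value - beta_star)))
```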
Sparse coding / Dictionary learning
Unsupervised methods for learning sets of over-complete bases to represent data efficiently.
Belief: natural images are generated by linear combinations of bases.

[Figure omitted; picture from Andrew Ng.]
10 / 26
Sparse coding / Dictionary learning

The set of over-complete bases: Φ = [φ_1, φ_2, · · · , φ_m].

Assume each data point x_i ∈ R^d can be expressed as

x_i = Σ_{j=1}^m a_i^{(j)} · φ_j + ν_i = Φ a_i + ν_i

where ν_i ∈ R^d is additive noise and card(a_i) is small.

Sparse coding: with a := [a_1, ..., a_n],

minimize_{a, Φ}  Σ_{i=1}^n ( ‖x_i − Φ a_i‖^2 + λ‖a_i‖_1 )

Fix a: convex in Φ. Fix Φ: convex in a.

11 / 26
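A minimal alternating-minimization sketch of the sparse coding problem above, exploiting the fact that the objective is convex in a for fixed Φ and convex in Φ for fixed a. The dimensions, λ, the number of passes, and the column normalization of Φ are all assumptions for illustration; CVXPY is assumed available for the ℓ1-regularized a-step.

```python
import numpy as np
import cvxpy as cp

# Synthetic sizes (assumed): d-dimensional data, m > d over-complete bases, n points.
np.random.seed(2)
d, m, n, lam = 20, 40, 30, 0.1
X = np.random.randn(d, n)                    # columns are the data points x_i
Phi = np.random.randn(d, m)                  # initial dictionary
Phi /= np.linalg.norm(Phi, axis=0)           # normalize columns (removes scale ambiguity)
A = np.zeros((m, n))                         # codes a_i as columns

for it in range(5):
    # a-step: with Phi fixed, each a_i solves an l1-regularized least-squares problem.
    for i in range(n):
        a = cp.Variable(m)
        cp.Problem(cp.Minimize(cp.sum_squares(X[:, i] - Phi @ a)
                               + lam * cp.norm(a, 1))).solve()
        A[:, i] = a.value
    # Phi-step: with the codes fixed, the dictionary is an ordinary least-squares fit.
    Phi = np.linalg.lstsq(A.T, X.T, rcond=None)[0].T
    Phi /= np.maximum(np.linalg.norm(Phi, axis=0), 1e-12)
```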
Exact solution to convex-cardinality problems

Consider the case:

minimize_{x ∈ R^n}  f(x) + λ · card(x)
s.t.                x ∈ C

If supp(x) is given, the remaining problem is a solvable convex problem.

Exact solution: enumerate all possible supports supp(x) and solve the resulting problems. In the worst case: solve 2^n problems.

The problem is in general (NP-)hard.

12 / 26
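For concreteness, a brute-force sketch for the penalized least-squares case f(x) = ‖Ax − b‖_2^2: it enumerates all 2^n supports and solves a restricted least-squares problem on each, which is only feasible for tiny n. The data and λ below are made up.

```python
import itertools
import numpy as np

# Exhaustive search for: minimize ||Ax - b||_2^2 + lam * card(x)   (tiny n only).
np.random.seed(3)
n, lam = 8, 0.5
A = np.random.randn(20, n)
b = np.random.randn(20)

best_val, best_x = np.inf, None
for k in range(n + 1):
    for S in itertools.combinations(range(n), k):
        x = np.zeros(n)
        if S:
            # Convex subproblem: least squares restricted to the support S.
            x[list(S)] = np.linalg.lstsq(A[:, list(S)], b, rcond=None)[0]
        val = np.sum((A @ x - b) ** 2) + lam * k
        if val < best_val:
            best_val, best_x = val, x

print("best objective:", best_val, " support:", np.flatnonzero(best_x))
```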
Binary linear programming

Binary LP problems:

minimize_x  c^T x
s.t.        Ax ≤ b,  x ∈ {0, 1}^n.

Includes hard problems such as traveling salesman, Hamiltonian cycle, etc.

Observation (“=” is achieved if and only if x ∈ {0, 1}^n):

card(x) + card(1 − x) = Σ_{i=1}^n [card(x_i) + card(1 − x_i)] ≥ n.

Reduction:

minimize_x  c^T x
s.t.        Ax ≤ b,
            card(x) + card(1 − x) = n.

13 / 26
`1 -norm heuristic

Step 1. Replace card(x) with λ‖x‖_1. Solve the problem to get x̂*.
Tune λ to get the desired sparsity: the larger λ is, the sparser the solution is.
Or use more sophisticated versions: Σ_i w_i |x_i|.

Step 2. Polishing: fix the sparsity pattern supp(x̂*) and re-solve the problem without regularization to get x*.

Step 3. In statistics and machine learning, cross-validation is needed to ensure the selection of λ is reasonable.

14 / 26
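A minimal sketch of steps 1 and 2 for minimize ‖Ax − b‖_2^2 subject to card(x) ≤ k, assuming CVXPY is available; the data and the particular λ are made up, and in practice λ would be tuned (e.g. by cross-validation as in step 3).

```python
import numpy as np
import cvxpy as cp

np.random.seed(4)
m, n, lam = 40, 100, 2.0
A = np.random.randn(m, n)
b = np.random.randn(m)

# Step 1: replace card(x) by the l1 norm and solve; tune lam for the desired sparsity.
x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm(x, 1))).solve()
S = np.flatnonzero(np.abs(x.value) > 1e-4)        # estimated sparsity pattern supp(x_hat)

# Step 2: polishing -- fix the sparsity pattern and re-solve without regularization.
x_pol = np.zeros(n)
x_pol[S] = np.linalg.lstsq(A[:, S], b, rcond=None)[0]
print("card:", len(S), " polished objective:", float(np.sum((A @ x_pol - b) ** 2)))
```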
The bias in the solution

Solve  minimize_x f(x)  s.t.  card(x) ≤ k  by

x̂_λ = arg min_x  f(x) + λ‖x‖_1

KKT condition: 0 ∈ ∇f(x̂_λ) + λ∂‖x̂_λ‖_1, element-wise:

∇_i f(x̂_λ) = −λ,        if (x̂_λ)_i > 0
∇_i f(x̂_λ) = λ,         if (x̂_λ)_i < 0
∇_i f(x̂_λ) ∈ [−λ, λ],   if (x̂_λ)_i = 0

Even if we find the correct sparsity pattern, there is a nonzero bias. Namely, with B = supp(x̂_λ),

‖∇_B f(x̂_λ)‖_2^2 = |B| · λ^2 ≠ 0

15 / 26
Solving the `1 regularized problem
Suppose f is convex and smooth; consider

minimize_x  f(x) + λ · ‖x‖_1,   s.t.  x ∈ C

ISTA / FISTA method. (Next page)

Subgradient method:

x_{k+1} = x_k − η_k · (∇f(x_k) + λ v_k),   v_k ∈ ∂‖x_k‖_1

ADMM formulation, for simple f such as ‖Ax − b‖^2:

minimize_{x, y}  f(x) + λ · ‖y‖_1,   s.t.  x = y,  x ∈ C

Min-max formulation (‖x‖_1 = max_y {y^T x : ‖y‖_∞ ≤ 1}):

minimize_{x ∈ C}  maximize_{‖y‖_∞ ≤ 1}  f(x) + λ y^T x

...
16 / 26
Iterative shrinkage-thresholding algorithm (ISTA)

Define x̃_{k+1} = x_k − η · ∇f(x_k). ISTA:

x_{k+1} = arg min_x  (1/(2η)) ‖x − x̃_{k+1}‖^2 + λ‖x‖_1

Equivalent formulation:

x_{k+1} = arg min_x  f(x_k) + ⟨∇f(x_k), x − x_k⟩ + λ‖x‖_1 + (1/(2η)) ‖x − x_k‖^2

Separable: each entry solves arg min_z (z − a)^2 + b|z|, with b > 0.
Explicit update:

(x_{k+1})_i = (x̃_{k+1})_i − ηλ,   if (x̃_{k+1})_i ∈ (ηλ, ∞)
(x_{k+1})_i = 0,                  if (x̃_{k+1})_i ∈ [−ηλ, ηλ]
(x_{k+1})_i = (x̃_{k+1})_i + ηλ,  if (x̃_{k+1})_i ∈ (−∞, −ηλ)

Applying Nesterov's acceleration gives fast ISTA (FISTA).


17 / 26
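A self-contained NumPy sketch of ISTA for the lasso instance f(x) = ½‖Ax − b‖_2^2; the ½ scaling, the step-size rule η = 1/‖A‖_2^2, the iteration count, and the data are choices made here for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    # Entrywise solution of min_z (1/(2*eta))*(z - v)^2 + lam*|z| with t = eta*lam.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, iters=500):
    # ISTA for: minimize 0.5*||Ax - b||_2^2 + lam*||x||_1.
    x = np.zeros(A.shape[1])
    eta = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L, L = Lipschitz constant of the gradient
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                 # gradient of the smooth part
        x = soft_threshold(x - eta * grad, eta * lam)
    return x

np.random.seed(5)
A, b = np.random.randn(30, 80), np.random.randn(30)
x_hat = ista(A, b, lam=0.5)
print("card(x_hat):", int(np.sum(np.abs(x_hat) > 1e-8)))
```

FISTA adds a Nesterov momentum step on top of the same proximal update, improving the rate from O(1/k) to O(1/k^2).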
Interpretation as convex relaxation

For R large enough, consider

minimize_x  f(x) + γ · card(x)
s.t.        x ∈ C,  ‖x‖_∞ ≤ R

Equivalent mixed integer programming:

minimize_{x,z}  f(x) + γ · 1^T z
s.t.            |x_i| ≤ R z_i,  for i = 1, ..., n
                x ∈ C,  z ∈ {0, 1}^n

18 / 26
Interpretation as convex relaxation

Relax z ∈ {0, 1}^n to z ∈ [0, 1]^n:

minimize_{x,z}  f(x) + γ · 1^T z
s.t.            |x_i| ≤ R z_i,  for i = 1, ..., n
                x ∈ C,  z ∈ [0, 1]^n

Fixing x, minimizing over z gives z_i = |x_i|/R:

minimize_x  f(x) + (γ/R) · ‖x‖_1
s.t.        x ∈ C

Its optimal value lower bounds the original problem.

19 / 26
Sparse design

Find a design vector x with the smallest card(x) such that a given set of specifications is satisfied.

Zero entries of x simplify the design (components that are not needed).

Problem formulation

minimize  card(x)
s.t.      x ∈ C

Examples
antenna array beamforming (zero coefficients correspond to unneeded
antenna elements)
truss design (zero coefficients correspond to bars that are not needed)
...

20 / 26
Truss design

Figure: Such a truss contains a huge number of steel bars; potentially many of them are not necessary.

21 / 26
Sparse modeling / regressor selection

Fitting data b as a linear combination of k regressors, chosen from n candidate regressors.

minimize_{x ∈ R^n}  ‖Ax − b‖_2
s.t.                card(x) ≤ k

Reformulations:

minimize_{x ∈ R^n}  ‖Ax − b‖_2 + λ · card(x)

minimize_{x ∈ R^n}  ‖Ax − b‖_2^2 + λ · card(x)

minimize_{x ∈ R^n}  card(x)   s.t.  ‖Ax − b‖_2 ≤ ε
x∈R

Lasso, feature selection, etc.

Generalized linear models: logistic regression, Poisson regression, etc.


22 / 26
Sparse signal reconstruction (compressed sensing)

For an unknown signal x ∈ R^d:

noisy measurement y = Ax + v, v ∼ N(0, σ^2 I), where A ∈ R^{n×d} is a known fat matrix, n ≪ d.
prior knowledge: x is sparse, card(x) ≤ k.

Maximum likelihood estimator (MLE):

minimize_{x ∈ R^d}  ‖Ax − y‖_2^2
s.t.                card(x) ≤ k

Exact recovery is expected under certain conditions (for the ℓ1 heuristic).

23 / 26
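A minimal ℓ1-heuristic recovery sketch for the compressed sensing setting above, assuming CVXPY is available; the dimensions, sparsity level, noise level, and the residual bound eps in the constraint are all illustrative choices.

```python
import numpy as np
import cvxpy as cp

np.random.seed(6)
n, d, k, sigma = 60, 200, 5, 0.01
A = np.random.randn(n, d) / np.sqrt(n)        # known "fat" measurement matrix, n << d
x_true = np.zeros(d)
x_true[np.random.choice(d, k, replace=False)] = np.random.randn(k)
y = A @ x_true + sigma * np.random.randn(n)

# l1 heuristic: minimize ||x||_1 subject to the residual being at the noise level.
x = cp.Variable(d)
eps = 1.2 * sigma * np.sqrt(n)                # rough bound on ||v||_2 (assumed)
cp.Problem(cp.Minimize(cp.norm(x, 1)), [cp.norm(A @ x - y, 2) <= eps]).solve()
print("recovery error ||x_hat - x||_2:", float(np.linalg.norm(x.value - x_true)))
```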
Estimation with outliers

Data points {(x_i, y_i)}_{i=1}^n:  y_i = x_i^T β* + ε_i + w_i*.

ε_i ∼ N(0, σ^2) are i.i.d. noise.

w* = [w_1*, ..., w_n*]^T is sparse noise: card(w*) ≤ k.

supp(w*) corresponds to the set of outliers.

The MLE for β:

minimize_{β ∈ R^d, w ∈ R^n}  ‖y − Xβ − w‖_2^2
s.t.                         card(w) ≤ k

24 / 26
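A sketch of the penalized ℓ1 variant of the outlier problem above (replacing card(w) ≤ k with λ‖w‖_1, in the spirit of the earlier heuristic), assuming CVXPY is available; the data, the number of corrupted responses, and λ are synthetic choices.

```python
import numpy as np
import cvxpy as cp

np.random.seed(7)
n, d, lam = 100, 5, 1.0
X = np.random.randn(n, d)
beta_star = np.random.randn(d)
y = X @ beta_star + 0.1 * np.random.randn(n)
outlier_idx = np.random.choice(n, 5, replace=False)
y[outlier_idx] += 10.0                         # sparse gross corruptions w*

beta, w = cp.Variable(d), cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta - w)
                       + lam * cp.norm(w, 1))).solve()
print("true outliers:     ", np.sort(outlier_idx))
print("detected outliers: ", np.flatnonzero(np.abs(w.value) > 1.0))
```

The nonzero entries of w flag the suspected outliers; minimizing over w in closed form shows this penalized formulation is closely related to regression with a Huber-type loss.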
Robust principal component analysis∗

Data matrix M ∈ R^{n×n}. M has an approximate low-rank decomposition: M = UV^T + W, U, V ∈ R^{n×k}, k ≪ n, and the noise matrix W is sparse.

Classical PCA problem:

minimize_{L ∈ R^{n×n}}  ‖M − L‖
s.t.                    rank(L) ≤ k

Robust PCA problem:

minimize_{L, W ∈ R^{n×n}}  ‖M − L − W‖
s.t.                       rank(L) ≤ k,  card(W) ≤ k′.

25 / 26
Robust principal component analysis∗

Robust PCA problem:

minimize_{L, W ∈ R^{n×n}}  ‖M − L − W‖
s.t.                       rank(L) ≤ k,  card(W) ≤ k′.

Matrix L is low-rank ⇐⇒ card(σ(L)) is small, where σ(L) is the vector of the n singular values of L.

Nuclear norm: ‖L‖_* := ‖σ(L)‖_1.

Practical formulation of robust PCA:

minimize_{L, W ∈ R^{n×n}}  ‖L‖_* + λ‖W‖_1
s.t.                       L + W = M.

26 / 26
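A small CVXPY sketch of the practical robust PCA formulation above; the low-rank-plus-sparse test matrix is synthetic, and λ = 1/√n is a standard choice in the robust PCA literature (both are assumptions here).

```python
import numpy as np
import cvxpy as cp

np.random.seed(8)
n, r = 30, 3
U, V = np.random.randn(n, r), np.random.randn(n, r)
W_true = np.zeros((n, n))
mask = np.random.rand(n, n) < 0.05             # ~5% sparse corruptions
W_true[mask] = 5.0 * np.random.randn(int(mask.sum()))
M = U @ V.T + W_true                           # low rank + sparse

lam = 1.0 / np.sqrt(n)
L, W = cp.Variable((n, n)), cp.Variable((n, n))
cp.Problem(cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.abs(W))),
           [L + W == M]).solve()
print("rank of recovered L:", np.linalg.matrix_rank(L.value, tol=1e-3))
print("card of recovered W:", int(np.sum(np.abs(W.value) > 1e-3)))
```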
