Provable Non-Convex Optimization For ML
Prateek Jain
Microsoft Research India
http://research.microsoft.com/en-us/people/prajain/
Overview
• High-dimensional Machine Learning
• Many many parameters
• Impose structural assumptions
• Requires solving non-convex optimization
• In general NP-hard
• No provable generic optimization tools
Overview
• Most popular approach: convex relaxation
• Solvable in poly-time
• Guarantees under certain assumptions
• Slow in practice
• Goal of this tutorial: practical + theoretically provable algorithms for high-d ML problems
Learning in Large No. of Dimensions
[Figure: a 𝑑-dimensional data vector is mapped through a function 𝑓; learning 𝑓 = optimization over 𝑑 parameters]
Linear Model
𝑓(𝑥) = Σ𝑖 𝑤𝑖 𝑥𝑖 = 〈𝑤, 𝑥〉
• 𝑤: 𝑑 −dimensional vector
• No. of training samples: 𝑛 = 𝑂(𝑑)
• For bi-grams: 𝑛 = 1000𝐵 documents!
• Prediction and storage: O(d)
• Prediction time per query: 1000 secs
• Over-fitting
Another Example: Low-rank Matrix Completion
Key Issues
• Large no. of training samples required
• Large training time
• Large storage and prediction time
Learning with Structure
• Restrict the parameter space
• Linear classification/regression: 𝑓 𝑥 = 〈𝑤, 𝑥〉
• Restrict no. of non-zeros in 𝑤 to 𝑠 ≪ 𝑑
• e.g., 𝑤 = (0, 3, 0, 0, 1, 0, 0, 0, 0, 9): only 3 non-zero entries
Learning with Structure contd…
• Matrix completion: 𝑊 ≅ 𝑈 × 𝑉^𝑇
• 𝑊 is characterized by (𝑈, 𝑉)
• No. of variables: 𝑑1𝑑2 for 𝑊 vs. 𝑑1𝑟 + 𝑑2𝑟 for (𝑈: 𝑑1 × 𝑟, 𝑉: 𝑑2 × 𝑟)
Learning with Structure
min_𝑤 𝐿(𝑤)   (𝐿: data-fidelity function)
𝑠.𝑡. 𝑤 ∈ 𝐶
• Linear classification/regression
• 𝐶 = {𝑤 : ||𝑤||0 ≤ 𝑠}, with 𝑠 log 𝑑 ≪ 𝑑
• ||𝑤||0 is non-convex; computational complexity: NP-hard
• Matrix completion
• 𝐶 = {𝑊 : 𝑟𝑎𝑛𝑘(𝑊) ≤ 𝑟}, with 𝑟(𝑑1 + 𝑑2) ≪ 𝑑1𝑑2
• 𝑟𝑎𝑛𝑘(𝑊) is non-convex; computational complexity: NP-hard
Other Examples
• Low-rank tensor completion
• 𝐶 = {𝑊 : 𝑡𝑒𝑛𝑠𝑜𝑟-𝑟𝑎𝑛𝑘(𝑊) ≤ 𝑟}, with 𝑟(𝑑1 + 𝑑2 + 𝑑3) ≪ 𝑑1𝑑2𝑑3
• 𝑡𝑒𝑛𝑠𝑜𝑟-𝑟𝑎𝑛𝑘(𝑊) is non-convex; complexity: undecidable
• Robust PCA
• 𝐶 = {𝑊 : 𝑊 = 𝐿 + 𝑆, 𝑟𝑎𝑛𝑘(𝐿) ≤ 𝑟, ||𝑆||0 ≤ 𝑠}
• 𝑟(𝑑1 + 𝑑2) + 𝑠 log(𝑑1 + 𝑑2) ≪ 𝑑1𝑑2
• 𝑟𝑎𝑛𝑘(𝐿) and ||𝑆||0 are non-convex; complexity: NP-hard
Convex Relaxations
• Linear classification/regression
• 𝐶 = {𝑤 : ||𝑤||0 ≤ 𝑠}  →  𝐶̃ = {𝑤 : ||𝑤||1 ≤ 𝜆(𝑠)}
• ||𝑤||1 = Σ𝑖 |𝑤𝑖|
• Matrix completion
• 𝐶 = {𝑊 : 𝑟𝑎𝑛𝑘(𝑊) ≤ 𝑟}  →  𝐶̃ = {𝑊 : ||𝑊||∗ ≤ 𝜆(𝑟)}
• ||𝑊||∗ = Σ𝑖 𝜎𝑖, where 𝑊 = 𝑈Σ𝑉^𝑇
Convex Relaxations Contd…
• Robust PCA
• 𝐶 = {𝑊 : 𝑊 = 𝐿 + 𝑆, 𝑟𝑎𝑛𝑘(𝐿) ≤ 𝑟, ||𝑆||0 ≤ 𝑠}  →  𝐶̃ = {𝑊 : 𝑊 = 𝐿 + 𝑆, ||𝐿||∗ ≤ 𝜆(𝑟), ||𝑆||1 ≤ 𝜆(𝑠)}
Convex Relaxation
• Advantage:
• Convex optimization: Polynomial time
• Generic tools available for optimization
• Systematic analysis
• Disadvantage:
• Optimizes over a much bigger set
• Not scalable to large problems
This tutorial’s focus
Don’t Relax!
• Advantage: scalability
• Disadvantage: optimization and its analysis are much harder
• Local minima problems
• Two approaches:
• Projected gradient descent
• Alternating minimization
Approach 1: Projected Gradient Descent
min_𝑤 𝐿(𝑤)
𝑠.𝑡. 𝑤 ∈ 𝐶
• 𝑤𝑡+1 = 𝑤𝑡 − 𝜂 ∇𝐿(𝑤𝑡)
• 𝑤𝑡+1 = 𝑃𝐶(𝑤𝑡+1)
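Since this two-line update recurs throughout the tutorial, here is a minimal sketch in Python/NumPy. The names `grad_L` and `proj_C` are user-supplied placeholders (an assumption, not anything defined on the slides).

```python
import numpy as np

def projected_gradient_descent(grad_L, proj_C, w0, eta=0.1, iters=100):
    """Minimal PGD sketch: repeat w <- P_C(w - eta * grad L(w)).
    grad_L and proj_C are user-supplied callables (placeholders)."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(iters):
        w = proj_C(w - eta * grad_L(w))   # gradient step, then projection onto C
    return w
```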
Efficient Projection
• Sparse linear regression/classification
• 𝐶 = {𝑤, ||𝑤||0 ≤ 𝑠}
• 𝑠𝑢𝑝𝑝(𝑃𝑟𝑜𝑗𝐶(𝑧)) = {𝑖1, … , 𝑖𝑠}, where |𝑧𝑖1| ≥ |𝑧𝑖2| ≥ ⋯ ≥ |𝑧𝑖𝑑|
• Cost: 𝑂(𝑑 log 𝑑) (sort by magnitude); see the sketch below
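A minimal sketch of this projection, assuming a NumPy vector `z`; sorting by magnitude gives the 𝑂(𝑑 log 𝑑) cost noted above.

```python
import numpy as np

def project_l0(z, s):
    """Projection onto {x : ||x||_0 <= s}: keep the s largest-magnitude
    entries of z and zero out the rest (O(d log d) via sorting)."""
    idx = np.argsort(np.abs(z))[::-1][:s]   # indices i_1, ..., i_s
    x = np.zeros_like(z)
    x[idx] = z[idx]
    return x

# Example: project onto the set of 2-sparse vectors
print(project_l0(np.array([0.1, -3.0, 0.5, 2.0, 0.0, -0.2]), s=2))
# -> [ 0. -3.  0.  2.  0.  0.]
```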
Approach 2: Alternating Minimization
• Generic technique if each individual subproblem is "easy"
• e.g., EM-style algorithms
Results for Several Problems
• Sparse regression [Jain et al.’14, Garg and Khandekar’09]
• Sparsity
• Robust Regression [Bhatia et al.’15]
• Sparsity+output sparsity
• Dictionary Learning [Agarwal et al.’14]
• Matrix Factorization + Sparsity
• Phase Retrieval [Netrapalli et al.’13]
• System of Quadratic Equations
• Vector-valued Regression [Jain & Tewari’15]
• Sparsity+positive definite matrix
Results Contd…
• Low-rank Matrix Regression [Jain et al.’10, Jain et al.’13]
• Low-rank structure
• Low-rank Matrix Completion [Jain & Netrapalli’15, Jain et al.’13]
• Low-rank structure
• Robust PCA [Netrapalli et al.’14]
• Low-rank ∩ Sparse Matrices
• Tensor Completion [Jain and Oh’14]
• Low-tensor rank
• Low-rank matrix approximation [Bhojanapalli et al.’15]
• Low-rank structure
Sparse Linear Regression
𝑦 = 𝑋𝑤, with 𝑦 ∈ ℝ^𝑛, 𝑋 ∈ ℝ^{𝑛×𝑑}, 𝑤 ∈ ℝ^𝑑
• But: 𝑛 ≪ 𝑑
• 𝑤: 𝑠-sparse (𝑠 non-zeros)
Motivation: Single Pixel Camera
• ||𝑦 − 𝑋𝑤||₂² = Σ𝑖 (𝑦𝑖 − 〈𝑥𝑖, 𝑤〉)²
Non-convexity of Low-rank manifold
0.5 ⋅ (1, 0, 0)^𝑇 + 0.5 ⋅ (0, 1, 0)^𝑇 = (0.5, 0.5, 0)^𝑇
• The average of two 1-sparse vectors is 2-sparse; similarly, a convex combination of low-rank matrices can have higher rank, so these constraint sets are non-convex
Convex Relaxation
min_𝑤 ||𝑦 − 𝑋𝑤||₂²
𝑠.𝑡. ||𝑤||0 ≤ 𝑠
• Relaxed Problem:
min_𝑤 ||𝑦 − 𝑋𝑤||₂²
𝑠.𝑡. ||𝑤||1 ≤ 𝑠
• ||𝑤||1 = Σ𝑖 |𝑤𝑖|
• Known to promote sparsity
• Pros: a) Principled approach, b) Captures correlations between features
• Cons: Slow to optimize
Our Approach : Projected Gradient Descent
min_𝑤 𝑓(𝑤) = ||𝑦 − 𝑋𝑤||₂²
𝑠.𝑡. ||𝑤||0 ≤ 𝑠
• 𝑤𝑡+1 = 𝑤𝑡 − 𝜂 ∇𝑓(𝑤𝑡)
• 𝑤𝑡+1 = 𝑃𝑠(𝑤𝑡+1)
[Jain, Tewari, Kar’2014]
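A minimal iterative-hard-thresholding (IHT) sketch of this update for the least-squares objective, in NumPy. The step size 1/||𝑋||₂² is a conservative generic choice of my own (the RIP analysis later on the slides takes 𝜂 = 1 for RIP-normalized 𝑋).

```python
import numpy as np

def iht(X, y, s, iters=200):
    """IHT sketch for min ||y - Xw||^2 s.t. ||w||_0 <= s."""
    n, d = X.shape
    eta = 1.0 / np.linalg.norm(X, 2) ** 2      # conservative step size (assumption)
    w = np.zeros(d)
    for _ in range(iters):
        z = w - eta * X.T @ (X @ w - y)        # gradient step on 0.5 * ||y - Xw||^2
        idx = np.argsort(np.abs(z))[::-1][:s]  # hard-thresholding projection P_s
        w = np.zeros(d)
        w[idx] = z[idx]
    return w
```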
Projection onto 𝐿0 ball?
min_𝑥 ||𝑥 − 𝑧||₂²
𝑠.𝑡. ||𝑥||0 ≤ 𝑠
Important Properties
A Stronger Result?
||𝑃𝑠(𝑧) − 𝑧||₂² ≤ ((𝑑 − 𝑠)/(𝑑 − 𝑠∗)) ⋅ ||𝑃𝑠∗(𝑧) − 𝑧||₂²
Convex-projections vs Non-convex Projections
• For non-convex sets, we only have the 0-th order condition:
∀𝑌 ∈ 𝐶, ||𝑃𝐶(𝑍) − 𝑍|| ≤ ||𝑌 − 𝑍||
• But for projection onto a convex set 𝐶, we have the 1-st order condition:
∀𝑌 ∈ 𝐶, ||𝑍 − 𝑃𝐶(𝑍)||² ≤ 〈𝑌 − 𝑍, 𝑃𝐶(𝑍) − 𝑍〉
Proof under RIP
• Let 𝑓(𝑤) = ||𝑋(𝑤 − 𝑤∗)||₂²
• Let 𝛿3𝑠 ≤ 1/2
• Let 𝑤𝑡+1 = 𝑃𝐶(𝑤𝑡 − 𝜂 𝑔𝑡), 𝑔𝑡 = 𝑋^𝑇𝑋(𝑤𝑡 − 𝑤∗), 𝜂 = 1
• 𝐶: 𝐿0 ball with 𝑠 non-zeros and 𝑤∗ ∈ 𝐶
Then: ||𝑤𝑡+1 − 𝑤∗|| ≤ (3/4) ||𝑤𝑡 − 𝑤∗||
[Blumensath & Davies’09, Garg & Khandekar’09]
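For reference, a hedged sketch (in LaTeX) of the standard contraction argument behind this claim, in the style of the cited Blumensath & Davies / Garg & Khandekar analyses. It yields a factor 2𝛿₃ₛ; the slide's sharper 3/4 factor under 𝛿₃ₛ ≤ 1/2 follows from a more careful version of the same steps.

```latex
% Sketch only: constants differ from the slide's 3/4.
\begin{align*}
&\text{Let } z_t = w_t - X^\top X (w_t - w^*), \quad
  S = \operatorname{supp}(w_{t+1}) \cup \operatorname{supp}(w^*), \quad
  T = S \cup \operatorname{supp}(w_t), \quad |T| \le 3s. \\
&\text{Best $s$-sparse approximation (both $w_{t+1}$ and $w^*$ vanish off $S$):} \quad
  \| w_{t+1} - (z_t)_S \|_2 \le \| w^* - (z_t)_S \|_2. \\
&\text{Triangle inequality:} \quad
  \| w_{t+1} - w^* \|_2 \le 2 \, \| (z_t)_S - w^* \|_2
  = 2 \, \big\| \big[ (I - X^\top X)(w_t - w^*) \big]_S \big\|_2. \\
&\text{RIP on the $3s$-sparse set $T \supseteq S$:} \quad
  \| w_{t+1} - w^* \|_2 \le 2\,\delta_{3s}\, \| w_t - w^* \|_2 .
\end{align*}
```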
Variations
• Fully corrective version (one step is sketched below):
𝑢𝑡+1 = 𝑃𝐶(𝑤𝑡 − 𝜂 𝑔𝑡)
𝑤𝑡+1 = arg min_𝑤 𝑓(𝑤), 𝑠.𝑡. 𝑠𝑢𝑝𝑝(𝑤) = 𝑠𝑢𝑝𝑝(𝑢𝑡+1)
• Two stage algorithms:
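A minimal sketch of one fully corrective step for the least-squares objective, assuming NumPy; step size and sparsity level are taken as given, as in the plain IHT sketch above.

```python
import numpy as np

def fully_corrective_step(X, y, w, s, eta):
    """One fully corrective step for f(w) = 0.5 * ||y - Xw||^2:
    hard-threshold the gradient step, then re-solve least squares
    restricted to the selected support."""
    z = w - eta * X.T @ (X @ w - y)            # gradient step
    supp = np.argsort(np.abs(z))[::-1][:s]     # support of u_{t+1} = P_C(z)
    sol, *_ = np.linalg.lstsq(X[:, supp], y, rcond=None)
    w_new = np.zeros_like(w)
    w_new[supp] = sol                          # argmin of f over supp(u_{t+1})
    return w_new
```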
Summary so far…
• High-dimensional problems
• 𝑛≪𝑑
• Need to impose structure on 𝑤
• Sparsity
• Projection easy!
• Projected Gradient works (if RIP is satisfied)
• Several variants exist
Which Matrices Satisfy RIP?
(1 − 𝛿𝑠) ||𝑤||₂² ≤ ||𝑋𝑤||₂² ≤ (1 + 𝛿𝑠) ||𝑤||₂²,  for all ||𝑤||0 ≤ 𝑠
Popular RIP Ensembles
• 𝑋 ∈ ℝ^{𝑛×𝑑} with i.i.d. random entries (e.g., Gaussian), 𝑛 = 𝑂(𝑠 log(𝑑/𝑠))
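A small numerical illustration (my own sketch, not from the slides): draw 𝑋 with i.i.d. N(0, 1/𝑛) entries, a standard ensemble satisfying RIP with high probability once 𝑛 ≳ 𝑠 log(𝑑/𝑠), and spot-check the ratio ||𝑋𝑤||²/||𝑤||² on random 𝑠-sparse vectors. A true RIP certificate would need a maximum over all 𝑠-sparse supports.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 1000, 10
n = int(8 * s * np.log(d / s))                      # ~ s log(d/s) samples
X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))  # i.i.d. N(0, 1/n) entries

ratios = []
for _ in range(1000):                               # spot-check random sparse vectors
    w = np.zeros(d)
    w[rng.choice(d, s, replace=False)] = rng.normal(size=s)
    ratios.append(np.linalg.norm(X @ w) ** 2 / np.linalg.norm(w) ** 2)
print(min(ratios), max(ratios))   # both close to 1, i.e., small empirical delta_s
```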
||𝑤𝑡+1 − 𝑤∗|| ≤ (3/4) ||𝑤𝑡 − 𝑤∗||
Proof?
But what if RIP is not possible?
Statistical Guarantees
𝑦𝑖 = 〈𝑥𝑖 , 𝑤 ∗ 〉 + 𝜂𝑖
• 𝑥𝑖 ∼ 𝑁(0, Σ)
• 𝜂𝑖 ∼ 𝑁(0, 𝜎 2 )
• 𝑤 ∗ : 𝑠 −sparse
||ŵ − 𝑤∗|| ≤ 𝜎 ⋅ 𝜅 ⋅ √(𝑠 log 𝑑 / 𝑛)
• 𝜅 = 𝜆1(Σ)/𝜆𝑑(Σ)
[Jain, Tewari, Kar’2014]
Proof?
• 𝑓(𝑤) = (1/2) ||𝑋(𝑤 − 𝑤∗)||₂²
• 𝑋 = [𝑥1; 𝑥2; … ; 𝑥𝑛]
• 𝑥𝑖 ∼ 𝑁(0, Σ), with 𝛼 ⋅ 𝐼𝑑×𝑑 ≼ Σ ≼ 𝐿 ⋅ 𝐼𝑑×𝑑
• 𝑤𝑡+1 = 𝑃𝑠(𝑤𝑡 − 𝜂 𝑔𝑡), 𝜂 = 2/(3𝐿)
• 𝑠 = (𝐿/𝛼)² 𝑠∗
Then: ||𝑤𝑡+1 − 𝑤∗||₂² ≤ (1 − 𝛼/(10 ⋅ 𝐿)) ||𝑤𝑡 − 𝑤∗||₂²
Proof?
General Result for Any Function
• 𝑓: ℝ^𝑑 → ℝ
• 𝑓 satisfies RSC/RSS, i.e., 𝛼𝑠 ⋅ 𝐼𝑑×𝑑 ≼ 𝐻(𝑤) ≼ 𝐿𝑠 ⋅ 𝐼𝑑×𝑑 whenever ||𝑤||0 ≤ 𝑠
• Then: ||ŵ − 𝑤∗|| ≤ 𝜖 + 𝜎𝜅 √(𝑠 log 𝑑 / 𝑛), where 𝜅 = 𝜆1(Σ)/𝜆𝑑(Σ)
𝑦 = 𝑋𝑤∗ + 𝑏
• Typical 𝑏:
a) Deterministic error: ||ŵ − 𝑤∗|| ≤ ||𝑏||
b) Gaussian error: ||ŵ − 𝑤∗|| ≤ ||𝑏||/√𝑛
Robust Regression
• ||𝑏||0 ≤ 𝛽 ⋅ 𝑛
• We want 𝛽 to be a constant
• Entries of 𝑏 can be unbounded!
• ||𝑏||2 can be arbitrarily large
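The slides only state the corruption model here; below is a hedged sketch of one natural hard-thresholding scheme in the spirit of the cited robust-regression work: alternately re-fit least squares on the points with the smallest residuals. The exact algorithm and guarantees in the paper may differ.

```python
import numpy as np

def robust_regression(X, y, beta, iters=20):
    """Sketch: alternately (i) select the (1 - beta) * n points with the
    smallest residuals and (ii) solve least squares on them only."""
    n, d = X.shape
    keep = int((1 - beta) * n)
    w = np.zeros(d)
    for _ in range(iters):
        S = np.argsort(np.abs(y - X @ w))[:keep]   # likely-uncorrupted points
        w, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)
    return w
```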
[Figure: data point 𝑦 ≅ 𝐴𝑥, where 𝐴 ∈ ℝ^{𝑑×𝑟} is the dictionary and 𝑥 is an 𝑟-dim, 𝑘-sparse vector]
Dictionary Learning
𝑌 ≅ 𝐴 × 𝑋, with 𝑌 ∈ ℝ^{𝑑×𝑛}, 𝐴 ∈ ℝ^{𝑑×𝑟}, 𝑋 ∈ ℝ^{𝑟×𝑛}
• Overcomplete dictionaries: 𝑟 ≫ 𝑑
• Goal: Given 𝑌, compute 𝐴, 𝑋
• Using small number of samples 𝑛
Existing Results
• Generalization error bounds [VMB’11, MPR’12, MG’13, TRS’13]
• But assumes that the optimal solution is reached
• Do not cover exact recovery with finitely many samples
• Identifiability of 𝐴, 𝑋 [HS’11]
• Require exponentially many samples
• Exact recovery [SWW’12]
• Restricted to square dictionary (𝑑 = 𝑟)
• In practice, overcomplete dictionary (𝑑 ≪ 𝑟) is more useful
Generating Model
• Generate dictionary 𝐴
• Assume 𝐴 to be incoherent, i.e., |〈𝐴𝑖, 𝐴𝑗〉| ≤ 𝜇/√𝑑
• 𝑟≫𝑑
• Generate random samples 𝑋 = [𝑥1, 𝑥2, … , 𝑥𝑛] ∈ ℝ^{𝑟×𝑛}
• Each 𝑥𝑖 is 𝑘-sparse
• Generate observations: 𝑌 = 𝐴𝑋
Algorithm
• Typical practical algorithm: alternating minimization (a rough sketch follows below)
• 𝑋𝑡+1 = argmin_𝑋 ||𝑌 − 𝐴𝑡 𝑋||𝐹², with 𝑘-sparse columns of 𝑋
• 𝐴𝑡+1 = argmin_𝐴 ||𝑌 − 𝐴 𝑋𝑡+1||𝐹²
• Initialize 𝐴0
• Using clustering+SVD method of [AAN’13] or [AGM’13]
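A rough NumPy sketch of the alternating scheme above. The sparse-coding step here is a crude least-squares + per-column hard-thresholding heuristic and the initialization `A0` is taken as given; the cited analyses use proper sparse recovery and a clustering/SVD-based initialization.

```python
import numpy as np

def dict_learn_altmin(Y, A0, k, iters=20):
    """Alternating minimization sketch for Y ~ A X with k-sparse columns of X."""
    A = A0.copy()
    for _ in range(iters):
        # X-update: crude sparse coding (least squares, keep k largest per column)
        X, *_ = np.linalg.lstsq(A, Y, rcond=None)
        kth = -np.sort(-np.abs(X), axis=0)[k - 1]   # k-th largest magnitude per column
        X[np.abs(X) < kth] = 0.0
        # A-update: least squares in A, then renormalize columns
        A = Y @ np.linalg.pinv(X)
        A /= np.linalg.norm(A, axis=0, keepdims=True) + 1e-12
    return A, X
```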
Results [AAJNT’13]
• Assumptions:
• 𝐴 is 𝜇-incoherent (|〈𝐴𝑖, 𝐴𝑗〉| ≤ 𝜇/√𝑑, ||𝐴𝑖|| = 1)
• Nonzero entries of 𝑋 satisfy 1 ≤ |𝑋𝑖𝑗| ≤ 100
• Sparsity: 𝑘 ≤ 𝑑^{1/6}/𝜇^{1/3} (better result by AGM’13)
• 𝑛 ≥ 𝑂(𝑟² log 𝑟)
• After log(1/𝜖) steps of AltMin: ||𝐴𝑖^{(𝑇)} − 𝐴𝑖||2 ≤ 𝜖
Proof Sketch
• Initialization step ensures that:
||𝐴𝑖^{(0)} − 𝐴𝑖|| ≤ 1/(2√𝑘)
• Lower bound on each element of 𝑋𝑖𝑗 + above bound:
• 𝑠𝑢𝑝𝑝(𝑥𝑖 ) is recovered exactly
• Robustness of compressive sensing!
• 𝐴𝑡+1 can be expressed exactly as:
• 𝐴𝑡+1 = 𝐴 + 𝐸𝑟𝑟𝑜𝑟(𝐴𝑡, 𝑋𝑡 )
• Use randomness in 𝑠𝑢𝑝𝑝(𝑋𝑡 )
Simulations
Empirically: 𝑛 = 𝑂(𝑟)
Known result: 𝑛 = 𝑂(𝑟² log 𝑟)
Summary
• Consider high-dimensional structured problems
• Sparsity
• Block sparsity
• Tree-based sparsity
• Error sparsity
• Iterative hard thresholding style method
• Practical/easy to implement
• Fast convergence
• RIP/RSC/subGaussian data: Provable guarantees
http://research.microsoft.com/en-us/people/prajain/
Collaborators: Purushottam Kar, Kush Bhatia
Next Lecture
• Low-rank Structure
• Matrix Regression
• Matrix Completion
• Robust PCA
• Low-rank Tensor Structure
• Tensor completion
Block-sparse Signals
𝐲1 = Φ1 𝐱1 , 𝐲2 = Φ2 𝒙2 , … , 𝐲𝑟 = Φ𝑟 𝐱 𝑟
• Total no. of measurements: 𝑂(𝑟 ⋅ 𝑘 ⋅ log 𝑛)
• Correlated signals: |𝐽| = |𝑠𝑢𝑝𝑝(𝑥1) ∪ 𝑠𝑢𝑝𝑝(𝑥2) ∪ … ∪ 𝑠𝑢𝑝𝑝(𝑥𝑟)| ≤ 𝑘 ⋅ 𝑟
• Method: group norms (𝐿2,1 or 𝐿2,∞); a sketch of the non-convex analogue follows below
• Improvement in sample complexity if |𝐽| ≪ 𝑘 ⋅ 𝑟
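For illustration only (not the slides' method), a hedged sketch of a non-convex counterpart of these group norms: stack the signals as columns of a matrix 𝑊 and keep only the |𝐽| rows with largest ℓ2 norm, i.e., project onto the set of row-sparse (shared-support) matrices.

```python
import numpy as np

def project_row_sparse(W, J):
    """Keep the J rows of W = [x_1, ..., x_r] with largest l2 norm, zero the rest."""
    row_norms = np.linalg.norm(W, axis=1)
    keep = np.argsort(row_norms)[::-1][:J]
    out = np.zeros_like(W)
    out[keep] = W[keep]
    return out

# Example: 4 rows, 2 signals; keep the 2 strongest shared-support rows (rows 1 and 3)
W = np.array([[0.1, 0.0], [3.0, -2.0], [0.0, 0.2], [1.0, 1.5]])
print(project_row_sparse(W, J=2))
```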