ECE586BH Lecture1

This document summarizes a lecture on the interplay between feedback control and machine learning. It discusses how control theory concepts like robustness, safety, and model-based design relate to machine learning areas like large datasets, performance optimization, and data-driven training. The lecture outlines how tools from robust control like integral quadratic constraints can provide a unified framework for analyzing the robustness of neural networks and stochastic learning algorithms as dynamical systems. It also discusses how nonconvex optimization challenges in control design can be addressed using ideas from machine learning.


ECE586BH: Interplay between Control and Machine Learning

Bin Hu
ECE, University of Illinois Urbana-Champaign

Lecture 1, Fall 2023


Feedback Control Machine learning

• dynamical systems • statistics/optimization


• robustness • large-scale (big data)
• safety-critical • performance-driven
• model-based design • train using data
• CDC/ACC/ECC • NeurIPS/ICML/ICLR

Artificial Intelligence Revolution

Safety-critical applications!
Flight Control Certification

Ref: J. Renfrow, S. Liebler, and J. Denham. “F-14 Flight Control Law Design, Verification, and Validation Using Computer Aided Engineering Tools,” 1996.
Feedback Control Machine learning

Unified and automated tools for a repeatable and trustworthy design process for next-generation intelligent systems

Example: Robustness is crucial!
• Deep learning: Small adversarial perturbations can fool the classifier!

• Optimization: The oracle can be inexact! xk+1 = xk − α(∇f(xk) + ek)


• Decision and control: Model uncertainty and sim-to-real gap matter!
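A quick sketch (my own toy example, not from the slides) of the inexact-oracle update xk+1 = xk − α(∇f(xk) + ek): on a strongly convex quadratic with constants m and L, gradient descent with step α = 1/L still contracts under relative gradient error ‖ek‖ = δ‖∇f(xk)‖ whenever δ < m/L, since the worst-case per-step factor 1 − m/L + δ stays below 1.

```python
import numpy as np

rng = np.random.default_rng(0)
m, L = 1.0, 10.0                   # strong convexity / smoothness constants
H = np.diag(np.linspace(m, L, 5))  # f(x) = 0.5 * x^T H x, minimizer x* = 0

alpha, delta = 1.0 / L, 0.05       # delta < m/L keeps 1 - m/L + delta < 1
x = rng.standard_normal(5)
for _ in range(300):
    g = H @ x                                # exact gradient
    u = rng.standard_normal(5)
    u /= np.linalg.norm(u)
    e = delta * np.linalg.norm(g) * u        # oracle error with ||e|| = delta*||g||
    x = x - alpha * (g + e)
print(np.linalg.norm(x))                     # still converges to x* = 0
```

With a larger δ (comparable to m/L) the worst-case factor exceeds 1 and convergence can be lost, which is exactly why robustness to an inexact oracle matters.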

Control for Learning
Control theory addresses unified analysis and design of dynamical systems.

• LTI systems: ξk+1 = Aξk + Buk, with stability condition AᵀPA − P ≺ 0
• Markov jump systems: ξk+1 = Aik ξk + Bik uk, with condition Σᵢ₌₁ⁿ pij AᵢᵀPᵢAᵢ ≺ Pj
• Lur’e systems: ξk+1 = Aξk + Bφ(Cξk), with condition [AᵀPA − P, AᵀPB; BᵀPA, BᵀPB] ≺ M

Pros: Unified testing conditions when problem parameters are changed.


Cons: For control, we only need to solve the conditions numerically.
Control for learning: Algorithms and networks treated as control systems
• Neural networks as generalized Lur’e systems
• Stochastic learning algorithms as generalized Lur’e systems

Key message: Robustness can be addressed in a unified manner!


Learning for Control
Control theory addresses unified analysis and design of dynamical systems.

• LTI systems: ξk+1 = Aξk + Buk, with stability condition AᵀPA − P ≺ 0
• MJLS: ξk+1 = Aik ξk + Bik uk, with condition Σᵢ₌₁ⁿ pij AᵢᵀPᵢAᵢ ≺ Pj
• Lur’e systems: ξk+1 = Aξk + Bφ(Cξk), with condition [AᵀPA − P, AᵀPB; BᵀPA, BᵀPB] ≺ M

Many control design methods rely on convex conditions (BGFB1994).


What about problems that cannot be formulated as convex optimization?
• Direct policy search (e.g. min J(K)) is nonconvex!

Learning for control: Tailoring nonconvex learning theory to push robust control
theory beyond the convex regime
Outline

• Control for Learning


• Control methods for certifiably robust neural networks
• A control perspective on stochastic learning algorithms

• Learning for Control


• Global convergence of direct policy search on robust control

Robust Control Theory
[Block diagram: uncertain element ∆ in feedback with plant P; interconnection signals v and w]

1. Approximate the true system as “a linear system + a perturbation”


2. ∆ can be a troublesome element: nonlinearity, uncertainty, or delays
3. Rich control literature including standard textbooks
• Zhou, Doyle, Glover, “Robust and optimal control,” 1996
4. Many tools: small gain, passivity, dissipativity, Zames-Falb multipliers, etc
5. The integral quadratic constraint (IQC) framework [Megretski, Rantzer
(TAC1997)] provides a unified analysis for “LTI P + troublesome ∆”
6. Recently, IQC analysis has been extended to more general P
7. Typically, stability is tested via an SDP condition
Quadratic Constraints from Robust Control
• Lur’e system: ξk+1 = Aξk + B∆(Cξk ).
• EX: Gradient method (A = I, B = −αI, C = I, and ∆ = ∇f )
• Question: How to prove that the above Lur’e system converges? We are
looking at the following set of coupled sequences {ξk , wk , vk }

{(ξ, w, v) : ξk+1 = Aξk + Bwk , vk = Cξk } ∩ {(ξ, w, v) : wk = ∆(vk )}


• Key idea: Quadratic constraints! Replace the troublesome nonlinear
element ∆ with the following quadratic constraint:

  {(v, w) : wk = ∆(vk)} ⊂ {(v, w) : [vk; wk]ᵀ M [vk; wk] ≤ 0},

where M is constructed from the properties of ∆.
• If we can show that any sequence from the set below converges,

  {(ξ, w, v) : ξk+1 = Aξk + Bwk, vk = Cξk, [vk; wk]ᵀ M [vk; wk] ≤ 0},

then we are done.
Quadratic Constraints from Robust Control
Now we are analyzing sequences from the following set:

  {(ξ, w, v) : ξk+1 = Aξk + Bwk, vk = Cξk, [vk; wk]ᵀ M [vk; wk] ≤ 0}

Theorem
If there exists a positive definite matrix P and 0 < ρ < 1 s.t.

  [AᵀPA − ρ²P, AᵀPB; BᵀPA, BᵀPB] ≼ [C, 0; 0, I]ᵀ M [C, 0; 0, I]

then ξk+1ᵀPξk+1 ≤ ρ² ξkᵀPξk and limk→∞ ξk = 0.

Proof: multiply the matrix inequality by [ξk; wk]ᵀ on the left and [ξk; wk] on the right:

  ξk+1ᵀPξk+1 − ρ² ξkᵀPξk = [ξk; wk]ᵀ [AᵀPA − ρ²P, AᵀPB; BᵀPA, BᵀPB] [ξk; wk]
                         ≤ [ξk; wk]ᵀ [C, 0; 0, I]ᵀ M [C, 0; 0, I] [ξk; wk]
                         = [vk; wk]ᵀ M [vk; wk] ≤ 0

This condition is a semidefinite program (SDP) problem!
Illustrative Example: Gradient Descent Method
• Rewrite the gradient method xk+1 = xk − α∇f(xk) as ξk+1 = ξk − αwk with
ξk := xk − x⋆ and wk := ∇f(xk)
• If f is L-smooth and m-strongly convex, then by co-coercivity:

  [x − x⋆; ∇f(x)]ᵀ M [x − x⋆; ∇f(x)] ≤ 0,
  where M = [2mLI, −(m+L)I; −(m+L)I, 2I]

• We have A = I, B = −αI, C = I, and the SDP

  [AᵀPA − ρ²P, AᵀPB; BᵀPA, BᵀPB] ≼ [C, 0; 0, I]ᵀ M [C, 0; 0, I]

• With P = pI, this leads to

  ([(1 − ρ²)p, −αp; −αp, α²p] + [−2mL, m+L; m+L, −2]) ⊗ I ≼ 0

• Choose (α, ρ, p) to be (1/L, 1 − m/L, L²) or (2/(L+m), (L−m)/(L+m), (L+m)²/2) to
recover standard rates, i.e. ‖xk+1 − x⋆‖ ≤ (1 − m/L)‖xk − x⋆‖
• For this proof, is strong convexity really needed? No! A regularity condition suffices!
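The certificate above is easy to check numerically. The following sketch (my own, scalar version of the slide's SDP) verifies that the 2×2 matrix is negative semidefinite for the first choice of (α, ρ, p), and cross-checks the certified rate on a strongly convex quadratic.

```python
import numpy as np

# Scalar version of the certificate: with A = 1, B = -alpha, C = 1, P = p,
#   [ (1-rho^2)p, -alpha*p ; -alpha*p, alpha^2*p ] + [ -2mL, m+L ; m+L, -2 ] <= 0
# certifies ||x_{k+1} - x*|| <= rho * ||x_k - x*||.
m, L = 1.0, 10.0
alpha, rho, p = 1.0 / L, 1.0 - m / L, L**2     # the first choice on the slide

lmi = (np.array([[(1 - rho**2) * p, -alpha * p],
                 [-alpha * p,        alpha**2 * p]])
       + np.array([[-2 * m * L, m + L],
                   [m + L,      -2.0]]))
print(np.linalg.eigvalsh(lmi).max() <= 1e-9)   # True: the LMI holds

# Cross-check the certified rate on f(x) = 0.5 x^T H x with m*I <= H <= L*I
H = np.diag(np.linspace(m, L, 6))              # minimizer x* = 0
x = np.ones(6)
for _ in range(50):
    x_next = x - alpha * (H @ x)
    assert np.linalg.norm(x_next) <= rho * np.linalg.norm(x) + 1e-12
    x = x_next
```

For these parameters the LMI matrix works out to [[−m², m], [m, −1]], which is rank-one and negative semidefinite, so the bound is tight.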
Illustrative Example: Gradient Descent Method
• We have shown ‖xk+1 − x⋆‖ ≤ (1 − m/L)‖xk − x⋆‖
• Is it a contraction, i.e. ‖xk+1 − x′k+1‖ ≤ (1 − m/L)‖xk − x′k‖?
• Rewrite (xk+1 − x′k+1) = (xk − x′k) − α(∇f(xk) − ∇f(x′k)), i.e. ξk+1 = ξk − αwk
with ξk := xk − x′k and wk := ∇f(xk) − ∇f(x′k)
• If f is L-smooth and m-strongly convex, then by co-coercivity:

  [x − x′; ∇f(x) − ∇f(x′)]ᵀ M [x − x′; ∇f(x) − ∇f(x′)] ≤ 0,
  where M = [2mLI, −(m+L)I; −(m+L)I, 2I]

• We have A = I, B = −αI, C = I, and the same SDP

  ([(1 − ρ²)p, −αp; −αp, α²p] + [−2mL, m+L; m+L, −2]) ⊗ I ≼ 0

• Choose (α, ρ, p) to be (1/L, 1 − m/L, L²) or (2/(L+m), (L−m)/(L+m), (L+m)²/2) to
give the contraction result!
• For this proof, is strong convexity really needed? Yes!
Outline

• Control for Learning


• Control methods for certifiably robust neural networks
• A control perspective on stochastic learning algorithms

• Learning for Control


• Global convergence of direct policy search on robust control

Deep Learning for Classification
Deep learning has revolutionized the fields of AI and computer vision!
• Classification maps an input space X ⊂ Rd to a label space Y := {1, . . . , H}.
• Predict labels from image pixels
• Neural network classifier function f := (f1 , . . . , fH ) : X → RH such that
the predicted label for an input x is arg maxj fj (x).
• Input-label (x, y) is correctly classified if arg maxj fj (x) = y.

Deep Learning Models

Deep learning models: f (x) = xD+1 and x0 = x


• Feedforward: xk+1 = σ(Wk xk + bk ) for k = 0, 1, · · · , D
• Residual network: xk+1 = xk − σ(Wk xk + bk ) for k = 0, 1, · · · , D
• Many other structures: transformers, etc
Deep learning models are expressive and generalize well, achieving state-of-the-art
results in computer vision and natural language processing. However, ...
Adversarial Attacks and Robustness

• Even for correctly classified points (i.e. arg maxj fj(x) = y), one may find ‖τ‖ ≤ ε s.t.
arg maxj fj(x + τ) ≠ y (a small perturbation leads to a wrong prediction)
• Small perturbations can fool modern deep learning models!
• How can we deploy deep learning models in safety-critical applications?
• Certified robustness: A classifier f is certifiably robust at radius ε ≥ 0 at
point x with label y if for all τ such that ‖τ‖ ≤ ε: arg maxj fj(x + τ) = y

1-Lipschitz Networks for Certified Robustness
• Tsuzuku, Sato, Sugiyama (NeurIPS2018): Let f be L-Lipschitz. If we have

  Mf(x) := max(0, fy(x) − max_{y′≠y} fy′(x)) > √2 Lε

  then we have for every τ such that ‖τ‖₂ ≤ ε: arg maxj fj(x + τ) = y

• Perturbations smaller than Mf(x)/(√2 L) cannot deceive f at datapoint x!
• If each layer of a network is 1-Lipschitz, the entire network is 1-Lipschitz.
• For each data point, we test whether Mf(x) > √2 ε, and then count the
percentage of data points guaranteed to be guarded against perturbations
smaller than ε (this is the certified accuracy at that ε).
• We need to train a Lipschitz neural network with good prediction margins!
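As a sketch of how this certification is counted in practice (my own toy numbers; the score vectors below are made up for illustration), the certified radius of a correctly classified point is its margin divided by √2 L, and certified accuracy is the fraction of points whose radius exceeds ε:

```python
import numpy as np

# Margin-based certification for an L-Lipschitz classifier: a point with
# margin M_f(x) > sqrt(2)*L*eps is certifiably robust at radius eps, so the
# certified radius is M_f(x) / (sqrt(2)*L).
def certified_radius(scores, label, lipschitz):
    scores = np.asarray(scores, dtype=float)
    runner_up = np.max(np.delete(scores, label))
    margin = max(0.0, scores[label] - runner_up)
    return margin / (np.sqrt(2.0) * lipschitz)

def certified_accuracy(score_list, labels, lipschitz, eps):
    # fraction of points that are correctly classified AND certified at radius eps
    hits = sum(
        1
        for scores, y in zip(score_list, labels)
        if int(np.argmax(scores)) == y and certified_radius(scores, y, lipschitz) > eps
    )
    return hits / len(labels)

scores = [[3.0, 1.0, 0.5], [0.2, 0.1, 0.0], [1.0, 2.0, 0.9]]
labels = [0, 0, 1]
print(certified_radius(scores[0], 0, lipschitz=1.0))     # sqrt(2) ~ 1.414
print(certified_accuracy(scores, labels, 1.0, eps=0.1))  # 2/3: middle point's radius ~ 0.071
```

This makes the role of the prediction margin concrete: for a fixed 1-Lipschitz network, certified accuracy is entirely determined by the distribution of margins.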
Previous approaches:
• Spectral normalization (MKKY2018): xk+1 = σ((1/‖Wk‖₂) Wk xk + bk)
• Orthogonality (TK2021, SF2021): xk+1 = σ(Wk xk + bk) with WkᵀWk = I
• Convex potential layer (MDAA2022): xk+1 = xk − (2/‖Wk‖₂²) Wk σ(Wkᵀ xk + bk)
• AOL (PL2022): xk+1 = σ(Wk diag(Σj |WkᵀWk|ij)^(−1/2) xk + bk)
My Focus: Principles for 1-Lipschitz Networks
Theorem (AHDAH2023)
If there exists a nonsingular diagonal Tk s.t. WkᵀWk ≼ Tk, then we have
1. The layer xk+1 = σ(Wk Tk^(−1/2) xk + bk) is 1-Lipschitz for any 1-Lipschitz σ.
2. The layer xk+1 = xk − 2Wk Tk^(−1) σ(Wkᵀ xk + bk) is 1-Lipschitz if σ is ReLU.

Proof of Statement 1:

  ‖xk+1 − x′k+1‖² ≤ ‖Wk Tk^(−1/2)(xk − x′k)‖²
                  = (xk − x′k)ᵀ Tk^(−1/2) WkᵀWk Tk^(−1/2) (xk − x′k) ≤ ‖xk − x′k‖²

The second statement can be proved using the quadratic constraint argument.

A Unification of Existing 1-Lipschitz Neural Networks
• Spectral normalization: Statement 1 with Tk = ‖Wk‖₂² I
• Orthogonal weights: Statement 1 with Tk = I and WkᵀWk = I
• CPL: Statement 2 with Tk = ‖Wk‖₂² I
• AOL: Statement 1 with Tk = diag(Σⱼ₌₁ⁿ |WkᵀWk|ij)
• Control theory (SLL): Tk = diag(Σⱼ₌₁ⁿ |WkᵀWk|ij qj/qi)
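A minimal numerical sketch (my own, taking σ to be ReLU) of Statement 1 with the AOL choice Tk = diag(Σj |WkᵀWk|ij): here Tk − WkᵀWk is diagonally dominant with nonnegative diagonal, hence positive semidefinite, so WkᵀWk ≼ Tk holds for any weight matrix and the layer is 1-Lipschitz by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
b = rng.standard_normal(8)

WtW = W.T @ W
T_diag = np.abs(WtW).sum(axis=1)             # AOL scaling: T = diag(sum_j |W^T W|_ij)
T_inv_sqrt = np.diag(1.0 / np.sqrt(T_diag))

def layer(x):
    # x -> sigma(W T^{-1/2} x + b); ReLU is 1-Lipschitz
    return np.maximum(W @ T_inv_sqrt @ x + b, 0.0)

# W^T W <= T (diagonal dominance of T - W^T W)
assert np.linalg.eigvalsh(np.diag(T_diag) - WtW).min() >= -1e-9

# Empirical 1-Lipschitz check on random input pairs
for _ in range(100):
    x, y = rng.standard_normal(8), rng.standard_normal(8)
    assert np.linalg.norm(layer(x) - layer(y)) <= np.linalg.norm(x - y) + 1e-9
print("1-Lipschitz check passed")
```

Swapping in Tk = ‖W‖₂² I or Tk = I (with orthogonal W) recovers the other Statement 1 rows of the unification table.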
Experimental Results
4 versions of the SDP-based Lipschitz Layer network (SLL): S, M, L, XL

CIFAR100 results (natural accuracy and provable accuracy at radius ε):

Model        | Natural | ε = 36/255 | ε = 72/255 | ε = 108/255 | ε = 1
Cayley Large |  43.3   |   29.2     |   18.8     |    11.0     |  -
SOC 20       |  48.3   |   34.4     |   22.7     |    14.2     |  -
SOC+ 20      |  47.8   |   34.8     |   23.7     |    15.8     |  -
CPL XL       |  47.8   |   33.4     |   20.9     |    12.6     |  -
AOL Large    |  43.7   |   33.7     |   26.3     |    20.7     |  7.8
SLL Small    |  45.8   |   34.7     |   26.5     |    20.4     |  7.2
SLL Medium   |  46.5   |   35.6     |   27.3     |    21.1     |  7.7
SLL Large    |  46.9   |   36.2     |   27.9     |    21.6     |  7.9
SLL X-Large  |  47.6   |   36.5     |   28.2     |    21.8     |  8.2

• Competitive results over CIFAR100 and TinyImageNet


• Many extensions: Lipschitz deep equilibrium models, neural ODEs, etc
Quadratic Constraints for Lipschitz Networks
• Residual network: xk+1 = xk − Gk σ(Wkᵀ xk + bk) for k = 0, 1, · · · , D.
• 1-Lipschitz layer: How to enforce ‖xk+1 − x′k+1‖ ≤ ‖xk − x′k‖?
• Rewrite (xk+1 − x′k+1) = (xk − x′k) − Gk(σ(Wkᵀxk + bk) − σ(Wkᵀx′k + bk)), i.e.
ξk+1 = ξk − Gk wk with ξk := xk − x′k and wk := σ(Wkᵀxk + bk) − σ(Wkᵀx′k + bk)
• We will use the properties of σ to construct Mk such that we only need to
look at the following set with Ak = I and Bk = −Gk:

  {(ξ, w) : ξk+1 = Ak ξk + Bk wk, [ξk; wk]ᵀ Mk [ξk; wk] ≤ 0}

• Then we can ensure ‖ξk+1‖ ≤ ‖ξk‖ by enforcing an SDP for this set:

  [AkᵀPAk − P, AkᵀPBk; BkᵀPAk, BkᵀPBk] ≼ Mk with P = I
  =⇒ [ξk; wk]ᵀ [AkᵀAk − I, AkᵀBk; BkᵀAk, BkᵀBk] [ξk; wk] ≤ 0,

  where the left-hand side equals ‖ξk+1‖² − ‖ξk‖² = ‖xk+1 − x′k+1‖² − ‖xk − x′k‖²

Quadratic Constraints for Lipschitz Networks
• Since σ is slope-restricted on [0, 1], the following scalar incremental
quadratic constraint holds with m = 0 and L = 1:

  [a − a′; σ(a) − σ(a′)]ᵀ [2mL, −(m+L); −(m+L), 2] [a − a′; σ(a) − σ(a′)] ≤ 0,
  i.e. the middle matrix is [0, −1; −1, 2]

• The vector version: for any diagonal Γk ≽ 0, we have

  [vk − v′k; σ(vk) − σ(v′k)]ᵀ Xk [vk − v′k; σ(vk) − σ(v′k)] ≤ 0,
  where Xk := [0, −Γk; −Γk, 2Γk]

• Choosing vk = Wkᵀxk + bk and v′k = Wkᵀx′k + bk, we have

  [Wkᵀ(xk − x′k); σ(Wkᵀxk + bk) − σ(Wkᵀx′k + bk)]ᵀ Xk
  [Wkᵀ(xk − x′k); σ(Wkᵀxk + bk) − σ(Wkᵀx′k + bk)] ≤ 0
Quadratic Constraints for Lipschitz Networks
• We get [ξk; wk]ᵀ Mk [ξk; wk] ≤ 0 with

  Mk = [Wkᵀ, 0; 0, I]ᵀ [0, −Γk; −Γk, 2Γk] [Wkᵀ, 0; 0, I] = [0, −WkΓk; −ΓkWkᵀ, 2Γk]

Theorem
If there exists a diagonal Γk ≽ 0 such that

  [0, −Gk; −Gkᵀ, GkᵀGk] ≼ [0, −WkΓk; −ΓkWkᵀ, 2Γk]

then the residual network xk+1 = xk − Gk σ(Wkᵀxk + bk) is 1-Lipschitz.

• Analytical solution: Gk = WkΓk and ΓkWkᵀWkΓk ≼ 2Γk.
• Suppose Γk is nonsingular, and set Tk = 2Γk^(−1). Then the residual network
xk+1 = xk − 2Wk Tk^(−1) σ(Wkᵀxk + bk) is 1-Lipschitz as long as WkᵀWk ≼ Tk
• Ref: Araujo, Havens, Delattre, Allauzen, H. A unifying algebraic
perspective on Lipschitz neural networks, ICLR, 2023. (Spotlight)
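A minimal sketch (my own) of the residual layer above, instantiated with the CPL-style choice Tk = ‖Wk‖₂² I, which satisfies WkᵀWk ≼ Tk; the 1-Lipschitz property with ReLU σ is then checked empirically on random input pairs.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((10, 6))      # input dimension 10, hidden dimension 6
b = rng.standard_normal(6)
t = np.linalg.norm(W, 2) ** 2         # T = t*I satisfies W^T W <= T

def layer(x):
    # residual layer x -> x - 2 W T^{-1} sigma(W^T x + b), with ReLU sigma
    return x - (2.0 / t) * W @ np.maximum(W.T @ x + b, 0.0)

# Empirical 1-Lipschitz check on random input pairs
for _ in range(100):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    assert np.linalg.norm(layer(x) - layer(y)) <= np.linalg.norm(x - y) + 1e-9
print("residual layer passed the 1-Lipschitz check")
```

Replacing t*I with any diagonal Tk satisfying WᵀW ≼ Tk (e.g. the weighted AOL scaling of SLL) keeps the guarantee while being less conservative than the spectral norm.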
Outline

• Control for Learning


• Control Methods on Certifiably Robust Neural Networks
• A Control Perspective on Stochastic Learning Algorithms

• Learning for Control


• Global convergence of direct policy search on robust control

History: Computer-Assisted Proofs in Optimization
In the past ten years, much progress has been made in leveraging SDPs to assist
the convergence rate analysis of optimization methods.
• Drori and Teboulle (MP2014): numerical worst-case bounds via the
performance estimation problem (PEP) formulation
• Lessard, Recht, Packard (SIOPT2016): numerical linear rate bounds using
integral quadratic constraints (IQCs) from robust control theory
• Taylor, Hendrickx, Glineur (MP2017): interpolation conditions for PEPs
• H., Lessard (ICML2017): first SDP-based analytical proof for Nesterov’s
accelerated rate
• H., Seiler, Rantzer (COLT2017): first paper on SDP-based convergence
proofs for stochastic optimization using jump system theory and IQCs
• Van Scoy, Freeman, and Lynch (LCSS2017): first paper on control-oriented
design of accelerated methods: triple momentum
Taken further by different groups
• inexact gradient methods, proximal gradient methods, conditional gradient
methods, operator splitting methods, mirror descent methods, distributed
gradient methods, monotone inclusion problems
Stochastic Methods for Machine Learning
• Many learning tasks (regression/classification) lead to the finite-sum ERM problem

  min_{x∈Rp} (1/n) Σᵢ₌₁ⁿ fi(x)

  where fi(x) = li(x) + λR(x) (li is the loss, and R avoids over-fitting).
• Stochastic gradient descent (SGD): xk+1 = xk − α∇fik(xk)
• Inexact oracle: xk+1 = xk − α(∇fik(xk) + ek) where ‖ek‖ ≤ δ‖∇fik(xk)‖
(the angle θ between (ek + ∇fik(xk)) and ∇fik(xk) satisfies |sin(θ)| ≤ δ)
• Algorithm change: SAG (SRF2017) vs. SAGA (DBL2014)

  SAG:  xk+1 = xk − α((∇fik(xk) − yikᵏ)/n + (1/n) Σᵢ₌₁ⁿ yiᵏ)
  SAGA: xk+1 = xk − α(∇fik(xk) − yikᵏ + (1/n) Σᵢ₌₁ⁿ yiᵏ)

  where yiᵏ⁺¹ := ∇fi(xk) if i = ik, and yiᵏ⁺¹ := yiᵏ otherwise
• Markov assumption: In reinforcement learning, {ik} can be Markovian
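The SAGA update above can be sketched in a few lines (my own toy least-squares instance, with step size and iteration count chosen ad hoc), maintaining the gradient table yiᵏ:

```python
import numpy as np

# SAGA on the finite-sum least-squares problem f_i(x) = 0.5*(a_i^T x - b_i)^2
rng = np.random.default_rng(3)
n, p = 50, 4
A = rng.standard_normal((n, p))
b_vec = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b_vec, rcond=None)   # minimizer of (1/n) sum_i f_i

def grad_i(x, i):
    return A[i] * (A[i] @ x - b_vec[i])

alpha = 0.005
x = np.zeros(p)
y = np.array([grad_i(x, i) for i in range(n)])       # gradient table y_i^k
for _ in range(10000):
    i = rng.integers(n)
    g = grad_i(x, i)
    x = x - alpha * (g - y[i] + y.mean(axis=0))      # SAGA step
    y[i] = g                                         # refresh the stored gradient
print(np.linalg.norm(x - x_star))                    # small on this strongly convex instance
```

Unlike plain SGD with a constant step, the stored gradients make the update's variance vanish at x⋆, so the iterates converge to the minimizer rather than to a noise ball.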
My Focus: Unified Analysis of Stochastic Methods
Assumption
• fi smooth, f satisfies the restricted secant inequality (RSI)
• ik is IID or Markovian
• the oracle is exact or inexact
• many other possibilities

Method
• SGD
• SAGA-like methods
• Temporal difference learning

Bound
• E‖xk − x⋆‖² ≤ c²ρᵏ + O(α)
• E‖xk − x⋆‖² ≤ c²ρᵏ
• other forms

How can we automate the rate analysis of stochastic learning algorithms? Can
numerical semidefinite programs support the search for analytical proofs?

  assumption + method =⇒ bound
My Focus: Stochastic Methods for Learning
In the deterministic setting, we just need to show that the trajectories generated
by optimization methods belong to the following set:

  {(ξ, w, v) : ξk+1 = Aξk + Bwk, vk = Cξk, [vk; wk]ᵀ Mj [vk; wk] ≤ Λj, j ∈ Π}

What to do for stochastic optimization (e.g. xk+1 = xk − α∇fik(xk) where
ik ∈ {1, · · · , n} is sampled)?
• Stochastic quadratic constraints: Show that the trajectories generated by
stochastic optimization methods belong to the following set:

  {(ξ, w, v) : ξk+1 = Aξk + Bwk, vk = Cξk, E [vk; wk]ᵀ Mj [vk; wk] ≤ Λj, j ∈ Π}

• Jump system approach: Show that the trajectories generated by stochastic
optimization methods belong to the following set:

  {(ξ, w, v) : ξk+1 = Aik ξk + Bik wk, vk = Cik ξk, [vk; wk]ᵀ Mj [vk; wk] ≤ Λj, j ∈ Π}

  where Aik ∈ {A1, · · · , An}, Bik ∈ {B1, · · · , Bn}, and Cik ∈ {C1, · · · , Cn}
Stochastic Quadratic Constraints
Suppose we can show that the trajectories generated by stochastic optimization
methods belong to the following set:

  {(ξ, w, v) : ξk+1 = Aξk + Bwk, vk = Cξk, E [vk; wk]ᵀ Mj [vk; wk] ≤ Λj, j ∈ Π}

Theorem
If there exists a positive definite matrix P, non-negative λj, and 0 < ρ < 1 s.t.

  [AᵀPA − ρ²P, AᵀPB; BᵀPA, BᵀPB] ≼ Σ_{j∈Π} λj [C, 0; 0, I]ᵀ Mj [C, 0; 0, I]

then E ξk+1ᵀPξk+1 ≤ ρ² E ξkᵀPξk + Σ_{j∈Π} λjΛj.

Proof sketch: multiplying the matrix inequality by [ξk; wk]ᵀ and [ξk; wk] gives

  ξk+1ᵀPξk+1 − ρ² ξkᵀPξk ≤ Σ_{j∈Π} λj [vk; wk]ᵀ Mj [vk; wk]

Then take expectations and apply the expected quadratic constraints!
Main Result: Analysis of Biased SGD
• Consider xk+1 = xk − α(∇fik(xk) + ek) with ‖ek‖² ≤ δ²‖∇fik(xk)‖² + c²
• If c = 0, the bound means the angle θ between (ek + ∇fik(xk)) and
∇fik(xk) satisfies |sin(θ)| ≤ δ
• Rewrite as (xk+1 − x⋆) = (xk − x⋆) + [−αI, −αI][∇fik(xk); ek], i.e.
ξk+1 = ξk + Bwk with ξk := xk − x⋆, wk := [∇fik(xk); ek], and B := [−αI, −αI]
• Assume the restricted secant inequality ∇f(x)ᵀ(x − x⋆) ≥ m‖x − x⋆‖²
• Assume each fi is L-smooth, i.e. ‖∇fi(x) − ∇fi(x⋆)‖ ≤ L‖x − x⋆‖
• 1st QC: E [xk − x⋆; ∇fik(xk); ek]ᵀ M1 [xk − x⋆; ∇fik(xk); ek] ≤ Λ1 := 0,
  where M1 = [2mI, −I, 0; −I, 0, 0; 0, 0, 0]
• 2nd QC: E [xk − x⋆; ∇fik(xk); ek]ᵀ M2 [xk − x⋆; ∇fik(xk); ek] ≤ Λ2 := (2/n) Σᵢ₌₁ⁿ ‖∇fi(x⋆)‖²,
  where M2 = [−2L²I, 0, 0; 0, I, 0; 0, 0, 0]
Main Result: Analysis of Biased SGD
• We can rewrite ‖ek‖² ≤ δ²‖∇fik(xk)‖² + c² as the 3rd QC:

  E [xk − x⋆; ∇fik(xk); ek]ᵀ M3 [xk − x⋆; ∇fik(xk); ek] ≤ Λ3 := c²,
  where M3 = [0, 0, 0; 0, −δ²I, 0; 0, 0, I]

• We have A = I, B = [−αI, −αI], C = I, and the SDP

  [AᵀPA − ρ²P, AᵀPB; BᵀPA, BᵀPB] ≼ Σⱼ₌₁³ λj [C, 0; 0, I]ᵀ Mj [C, 0; 0, I]

• Biased SGD satisfies E‖xk+1 − x⋆‖² ≤ ρ² E‖xk − x⋆‖² + λ2Λ2 + λ3c² if

  ([1 − ρ², −α, −α; −α, α² + δ²λ3, α²; −α, α², α² − λ3]
   + λ1 [−2m, 1, 0; 1, 0, 0; 0, 0, 0] + λ2 [2L², 0, 0; 0, −1, 0; 0, 0, 0]) ⊗ I ≼ 0
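The qualitative conclusion can be sanity-checked by simulation. The sketch below (my own toy instance, with fi(x) = ½‖x − ai‖² so that m = L = 1 and x⋆ is the mean of the ai) runs biased SGD with a relative error of level δ and observes geometric decay of ‖xk − x⋆‖² down to a small neighborhood of x⋆ whose size depends on α and δ:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
a = rng.standard_normal((n, p))
x_star = a.mean(axis=0)

alpha, delta = 0.01, 0.2          # m^2 - 2*delta^2*L^2 = 0.92 > 0 here
x = x_star + 5.0 * np.ones(p)
errs = []
for _ in range(20000):
    i = rng.integers(n)
    g = x - a[i]                                  # grad f_i(x)
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)
    e = delta * np.linalg.norm(g) * u             # relative bias ||e|| = delta*||g||
    x = x - alpha * (g + e)
    errs.append(float(np.sum((x - x_star) ** 2)))
# ||x_k - x*||^2 decays geometrically, then hovers near x*
print(errs[0], np.mean(errs[-1000:]))
```

Note the residual error does not vanish even for tiny α when δ > 0, since the relative error persists at x⋆ where the individual gradients ∇fi(x⋆) are nonzero.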
Main Result: Analysis of Biased SGD
• Given E‖x0 − x⋆‖² ≤ U0, set Uk+1 = min ρ²Uk + λ2Λ2 + λ3c², where the minimum
is over (ρ, λ1, λ2, λ3) subject to

  ([1 − ρ², −α, −α; −α, α² + δ²λ3, α²; −α, α², α² − λ3]
   + λ1 [−2m, 1, 0; 1, 0, 0; 0, 0, 0] + λ2 [2L², 0, 0; 0, −1, 0; 0, 0, 0]) ⊗ I ≼ 0

Then we have E‖xk − x⋆‖² ≤ Uk. This leads to a sequential SDP problem.
• This problem has an exact solution:

  Uk+1 = (α√(c² + δ²Λ2 + 2L²δ²Uk) + √((1 − 2mα + 2L²α²)Uk + Λ2α²))²

• limk→∞ Uk = (c² + δ²Λ2)/(m² − 2δ²L²) + m(c²(2L² − m²) + (1 − δ²)Λ2m²)/(m² − 2δ²L²)² · α + O(α²)
• Rate = 1 − (m² − 2δ²L²)/m · α + O(α²)
• For different assumptions, modify (Mj, Λj)!
• H., Seiler, and Lessard. Analysis of biased stochastic gradient descent using
sequential semidefinite programs. Mathematical Programming, 2021
• Syed, Dall’Anese, and H. Bounds for the tracking error and dynamic regret of
inexact online optimization methods: A unified analysis via sequential SDPs.
Jump System Approach
The jump system approach leads to conditions of the form

  (1/n) Σᵢ₌₁ⁿ [AᵢᵀPAᵢ − ρ²P, AᵢᵀPBᵢ; BᵢᵀPAᵢ, BᵢᵀPBᵢ] ≼ Σ_{j∈Π} λj [C, 0; 0, I]ᵀ Mj [C, 0; 0, I]

Pros:
• General enough to handle many algorithms: H., Seiler, Rantzer (COLT2017), e.g.
(with eik the ik-th standard basis vector and e the all-ones vector)

  SAGA: Ãik = [In − eik eikᵀ, 0̃; −(α/n)(e − n eik)ᵀ, 1], B̃ik = [eik eikᵀ; −α eikᵀ]
  SAG:  Ãik = [In − eik eikᵀ, 0̃; −(α/n)(e − eik)ᵀ, 1],  B̃ik = [eik eikᵀ; −(α/n) eikᵀ]

• General enough to handle Markov {ik}: Syed and H. (NeurIPS2019), Guo
and H. (ACC2022a, 2022b)
Cons:
• The SDPs are much bigger than the ones obtained from stochastic quadratic
constraints, and we have to exploit SDP structure for simplifications
Control for Learning: Summary

• Iterative learning algorithms and neural network layers can be thought of as feedback control systems.

• The quadratic constraint approach from control theory can be leveraged to formulate SDP conditions for machine learning research.

• Different from the typical practice in control, we now want to obtain analytical solutions of the SDPs!

Outline

• Control for Learning


• Control methods on certifiably robust neural networks
• A control perspective on stochastic learning algorithms

• Learning for Control


• Global convergence of direct policy search on robust control

