
Applying statistical learning theory to deep learning


Cédric Gerbelot
Courant Institute of Mathematical Sciences, New York, NY 10012, USA

Avetik Karagulyan
King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia

Stefani Karp
Carnegie Mellon University, Pittsburgh, PA, and Google Research, NY, USA


Kavya Ravichandran
Toyota Technological Institute at Chicago, Chicago, Illinois 60637, USA

Nathan Srebro
Toyota Technological Institute at Chicago, Chicago, Illinois 60637, USA

Menachem Stern
Department of Physics & Astronomy, University of Pennsylvania, Philadelphia, PA 19104-6396, USA

November 28, 2023

Abstract
Although statistical learning theory provides a robust framework to understand supervised learning,
many theoretical aspects of deep learning remain unclear, in particular how different architectures may
lead to inductive bias when trained using gradient based methods. The goal of these lectures is to
provide an overview of some of the main questions that arise when attempting to understand deep
learning from a learning theory perspective. After a brief reminder on statistical learning theory and
stochastic optimization, we discuss implicit bias in the context of benign overfitting. We then move to a
general description of the mirror descent algorithm, showing how we may go back and forth between a
parameter space and the corresponding function space for a given learning problem, as well as how the
geometry of the learning problem may be represented by a metric tensor. Building on this framework, we
provide a detailed study of the implicit bias of gradient descent on linear diagonal networks for various
regression tasks, showing how the loss function, the scale of the parameters at initialization and the depth of the
network may lead to various forms of implicit bias, in particular transitions between the kernel and feature
learning regimes.


Contents
1 Lecture 1: Applying statistical learning theory to deep learning
1.1 Preamble
1.2 Inductive bias in supervised learning
1.3 Inductive bias in deep learning
1.4 Deep learning in practice

2 Lecture 2: Implicit bias and benign overfitting
2.1 Finishing lecture 1
2.2 Examples
2.2.1 Matrix completion
2.2.2 Single overparameterized linear unit: Rd → R
2.2.3 Deep linear network: Rd → R
2.2.4 Linear convolutional networks
2.2.5 Infinite-width ReLU network
2.2.6 Takeaways from these examples
2.3 Why the red curve goes down in Figure 3
2.4 What we don’t yet fully understand: benign overfitting
2.5 Summary

3 Lecture 3: Statistical learning and stochastic convex optimization
3.1 Learning and optimization for convex problems
3.2 Stochastic optimization
3.3 Stability
3.4 Strong convexity
3.5 Mirror descent
3.6 Summary

4 Lecture 4: Mirror descent and implicit bias of descent algorithms
4.1 Mirror Descent
4.1.1 Examples of Mirror Descent
4.1.2 Smoothness and Batching
4.2 General Steepest Descent
4.3 Implicit bias of descent methods
4.3.1 Deriving Implicit Regularization for Gradient Descent
4.3.2 Similar Argument for Mirror Descent
4.3.3 General Method

5 Lecture 5: Implicit bias with linear functionals and the square loss
5.1 Setting
5.2 Reminder on the kernel regime
5.3 A simple model: 2-layer linear diagonal network
5.3.1 Analytical study of GD in parameter space
5.3.2 Studying the dynamics in function space
5.3.3 Comparing explicit and implicit regularization
5.4 The effect of width
5.5 Deep diagonal networks
5.6 Beyond linear models

6 Lecture 6: Implicit bias with linear functionals and the logistic loss
6.1 Problem setting and equivalent reformulation
6.2 Gradient flow dynamics
6.3 Comparing the squared, logistic and exponential loss
6.4 Matrix Factorization Setting and Commutativity

7 Lecture Series Summary

1 Lecture 1: Applying statistical learning theory to deep learning
1.1 Preamble
We start by giving you Nati’s view of supervised learning and statistical learning theory, and the main forces
in learning. We’ll mostly focus on the concept of inductive bias. We’ll start by trying to explain what
inductive bias is, or rather, what I mean by inductive bias, as this term is hard to define mathematically.
We’ll incorporate this idea of inductive bias in the introduction of statistical learning theory, and try to
understand how it applies to deep learning. We’ll explore how deep learning fits into the approach we
developed to understand learning theory in the past five decades. The main goal for today is to mention
what we need to understand (what questions we need to answer), and what notions we need to rethink.

1.2 Inductive bias in supervised learning


The goal of supervised learning is to find a predictor, a mapping h from inputs or instances X (e.g. images,
sentences) to labels Y (e.g. classes). In the simplest setting we’ll just think of y ∈ {±1}. The goal is to find
a predictor that has a small generalization loss L(h). Crucially, we measure the loss with respect to a source
distribution D and our goal is to find h which in expectation over this distribution has small loss, meaning
its predictions tend to be correct.

Supervised learning: find h : X → Y with small generalization error


L(h) = E(x,y)∼D [loss(h(x); y)]

Machine learning designs this predictor h, not based on knowledge of the population D but rather based
on an IID sample from that population. We try to find good learning rules that produce a predictor with
small error for any population.

Learning rule: (based on sample S)


A : S ↦ h (i.e. A : (X × Y)∗ → Y^X)

Unfortunately, this is impossible. The ‘no free lunch’ theorem tells us small generalization error requires
knowledge about the population. For any learning rule, there exists some distribution D (that is, some
reality) for which the learning rule yields an expected error that is tantamount to randomly guessing the
answer (e.g. 1/2 for binary classification). More formally, for any A and m there exists D such that ∃h∗ with L(h∗) = 0 but

E_{S∼D^m}[L(A(S))] ≥ 1/2 − m/(2|X|).

This is true not only for independent x, y, but also if there exists a deterministic relation y(x), so that a
perfect predictor does exist. The apparent improvement over 1/2, proportional to m/|X|, is due to memorization
of the dataset, and vanishes when the population size |X| is large.
Thus, learning is impossible without assuming anything. This is where inductive bias becomes an essential
part of learning. We assume that some realities (populations D) are unlikely, and design the learning rule
A to work for the more likely realities, e.g. by preferring certain models h(x) over others. More practically,
we assume reality D has a certain property which ensures the learning rule A has good generalization error.
Typically, we assume that there exist models h(x) in some class H, or with low “complexity” c(h) such that
it has low generalization error L(h). An example are models where the output y changes smoothly with the
input x. Another example is ridge regression, that prefers linear models (with small norms of the weights).
A flat inductive bias embodies the assumption that some realities are possible and others are not, ∃h∗ ∈ H
with low L(h∗). If we make this assumption, we know the best learning rule for supervised learning,
which is empirical risk minimization:

ERM_H(S) = ĥ = arg min_{h∈H} LS(h),

with LS the empirical, or training loss over our sample S of size m:
LS(h) = (1/m) ∑_{i=1}^m loss(h(x_i); y_i).
For this learning rule, we can guarantee an upper bound on the generalization error. If the best model
in our class is h∗ , the error of the predictor achieved by the ERM learning rule is
L(ERM_H(S)) ≤ L(h∗) + R_m(H) ≈ L(h∗) + √( O(capacity(H)) / m )    (∗)
We see that the error is larger than the best error in our class H by a term that scales with a measure of
its capacity (colloquially, the ‘amount’ of models in the class), divided by the number of samples. In other
words, the number of training examples required to learn h∗ scales with the capacity of class H.
What is this capacity? For binary classification, the capacity is the Vapnik–Chervonenkis (VC) dimension
of the class, capacity(H) = VCdim(H). The VC dimension is the largest number of points that can
be labeled (by models h ∈ H) in every possible way. This is quite natural. A model class with high
VC dimension (i.e. that contains predictors allowing any possible labelling of the set) does not have any
inductive bias, so that learning is impossible (no free lunch). Learning becomes possible when the model
class can be falsified, and the number of samples needed for learning is the number required to falsify this
assumption on the model class H. For linear classifiers over d features, VCdim(H) = d. In fact, if the model
class H can be parameterized with d parameters, the VC dimension is usually VCdim(H) = Õ(d). It is
always true that the VC dimension is bounded by the logarithm of the cardinality of H:
VCdim(H) ≤ log|H| ≤ #bits = #params · (#bits / #params).
Thus we expect that if we encode the parameters of our model with a fixed number of bits, the VC
dimension of the model scales with the number of parameters. Another way to produce model classes with
finite capacity is to employ regularizers in our learning rule. For example, it can be shown that for linear
predictors with norm ∥w∥₂ ≤ B (with logistic loss and normalized data), capacity(H) = B².
Looking back at (∗), we see that learning, and machine learning in particular, requires model (hypothesis)
classes H that are expressive enough to approximate reality well (contain h∗ with low generalization error),
but also have a small enough capacity to allow for good generalization. The approximation error is defined
by the error of the best model in our class h∗ , and the estimation error is the excess over it, as the learning
rule can choose a different, worse model given the empirical data.
Usually however, our learning protocols do not represent a flat inductive bias over some model class. We
often think in terms of a complexity measure c : Y^X → [0, ∞], which is any ordering of predictors h. Some
measures of complexity include the degree of polynomials in our model class, or assumptions of sparsity or low
norm ∥w∥. The associated inductive bias is that ∃h∗ with low complexity c(h∗) and low error L(h∗). This
inductive bias suggests another learning rule, structural risk minimization:

SRM_H(S) = arg min_{h∈H} ( LS(h), c(h) )

This learning rule attempts to minimize two functions, which naturally introduces a trade-off between
them. At best, the learning rule achieves a predictor that sits on the Pareto frontier trading off generalization
error and complexity. Any predictor on that frontier cannot improve either the error or the complexity
without worsening the other. You can get to this frontier by considering a regularization path, i.e. mini-
mizing LS (h) + λc(h), and varying the regularization amplitude λ in the range [0, ∞]. Equivalently, one can
attempt to minimize the error LS such that c(h) ≤ B. Note that this learning rule retrieves a multitude of
candidates along the Pareto frontier. We can choose the best of them according to their performance on a
cross validation sample.
For the SRM learning rule, we get a guarantee similar to (∗) on the generalization error, now in terms of the
capacity of the sublevel class H_{c(h∗)}:

L(SRM_H(S)) ≤ L(h∗) + √( O(capacity(H_{c(h∗)})) / m ).

Figure 1: Feed-forward neural network.

Another way to think of this complexity measure as opposed to a flat hypothesis class is that it gives
rise to a hierarchy of hypothesis classes which are sub-level sets of the complexity measure. The guarantee
for SRM gives us an upper bound on the loss based on the best model in the class h∗ , and its complexity
measure in that class; better predictors are obtained if h∗ lives in a small level set of the complexity measure.
A good model class H in this approach not only contains a model that approximates reality h∗ , but does so
at a low sublevel set of the complexity measure.

1.3 Inductive bias in deep learning


Deep learning is learning with a particular inductive bias, a flat hypothesis class in the form of a feed-forward
neural network. A feed-forward neural network (Fig. 1) is described by a directed graph G(V, E) with nodes
(neurons) indexed by vertices V . These nodes are subdivided into three types:
• Input nodes v1, ..., vd ∈ V with no incoming edges, whose output is o[vi] = xi.

• Output node vout ∈ V , whose output is the model function hw (x) = o[vout ].
• Hidden nodes are all the rest of the nodes, which receive inputs from incoming edges (from a previous
layer) and produce outputs to outgoing edges (to the next layer).
The network also has an activation function σ : ℝ → ℝ that describes the non-linearity of the neural
network; a popular choice is the rectified linear unit (ReLU), σ_ReLU(z) = [z]_+. Finally, each edge u → v ∈ E,
connecting two nodes u, v, carries a weight w[u → v]. A choice of architecture,
weights and activation function uniquely describes a predictor function hw(x). These models were historically
developed by McCulloch and Pitts to describe computation in the 1940s [19]. They were able to show that
these models can perform complex computations.
In deep learning, we fix the architecture and activation function σ, and learn the weights from data.
Thus the model class is given by H_{G(V,E),σ} = {hw : hw(x) is the output of the network with weights w}. We want to
understand the capacity of these models, as well as their expressivity (how well they represent reality).
As noted before, the capacity is roughly given by the number of learned parameters, which here is the num-
ber of edges |E|. The VC dimension of these networks with threshold activation is V Cdim(HG(V,E),sign ) =
O (|E| log |E|). However, for other activation functions, it is actually possible to get a capacity which is
much higher than the number of parameters. For example, the VC dimension of these networks with
sine activation is infinite, even for a single hidden node. We do not use these kinds of activation functions.
More useful activation functions are, for example, the sigmoid function σ(z) = (1 + exp(−z))⁻¹,
whose VC dimension is bounded by VCdim(H_{G(V,E),sigmoid}) ≤ O(|E|⁴), or the ReLU function, for which
VCdim(H_{G(V,E),ReLU}) ≤ O(|E| l log|E|), with l the network depth. One can also limit the capacity by
discretizing the weights, e.g. if w ∈ {−B, ..., B}, VCdim ≤ 2|E|B. As we’ve seen, the fact that these network models
can have a large capacity is not necessarily good, because capacity comes at the expense of inductive bias.
What about expressivity? Feed-forward neural networks can represent any logical gate, and thus any
function over X = {±1}^d (as proved by Turing). Define the class CIRCUIT_n[depth, size] as all functions

f : {±1}n → {0, 1} that can be implemented with at most size AND, OR and NOT gates, and longest path
from input to output at most depth. We know that circuits can represent any function, but only if we are
allowed to select an appropriate gate architecture. In neural networks, we keep the architecture fixed and
only vary the weights.
Claim 1.1. A neural network with fixed architecture can represent the function of any circuit:

CIRCUIT_n[depth, size] ⊆ H_{G(V,E), l=depth, k=size, σ=sign},

where we use a fully connected neural net with l = depth layers and k = size nodes in every layer.
This can be done easily, if we choose the weight of each edge to be ±1 if the edge is present in the
circuit (with or without a NOT gate in between), and 0 otherwise. The bias terms are chosen as fan_in − 1 for
AND gates and 1 − fan_in for OR gates. The weights essentially describe which wires exist in the circuit. Thus
neural networks can represent any binary function.
More generally, we have a universal representation theorem: any continuous function f : [0, 1]^d → ℝ
can be approximated to within ϵ by a feed-forward network with sigmoidal (or almost any other) activation
function and a single hidden layer. This shows that as a model class, feed-forward neural nets are extremely
expressive and can represent any reality. However, representing functions may require huge networks, e.g.
with layer widths exponential in d. The relevant question is not what a network of arbitrary size can
represent, but what small networks can represent.
Small networks can represent intersections of half-spaces (using single hidden layer, each neuron corre-
sponds to a half-space and the output neuron performs AND) and unions of intersections of half-spaces (with
two hidden layers: half-spaces→OR→AND). However, the main compelling reason to use them is feature
learning: Linear predictors over (small number of) features, in turn represented as linear predictors over
more basic features, that in turn are also represented as linear predictors. In essence, the network builds up
a hierarchy of predictors that progressively manage more abstract features of the data. In the case of image
data, this is typically presented as early layers learning simple features (edges in images), and later layers
building up on the simpler features to represent higher-level, semantic ideas (cars, birds, etc.).
Interestingly, a feed-forward neural network can represent any function computable in time T with a network
of size Õ(T) and depth T. This is true since anything computable in time T is also computable
by a logical circuit of size Õ(T ). This realization has broad implications for machine learning.
Machine learning is an engineering paradigm (of being lazy): Use data and examples, instead of expert
knowledge and tedious programming, to automatically create efficient systems that solve complex tasks.
Therefore, we only care about a model (predictor) h if it can be implemented efficiently. A good learned
model only needs to compete with a programmer, producing results that are at least as good as a programmed
model in a competitive (model evaluation) time.
In this case we have a free lunch: the model class TIME_T, all functions computable in time at most T,
has capacity O(T), and is hence learnable with O(T) samples (e.g. using ERM). Even better: the model
class PROG_T, all functions programmable with code of length up to T, also has capacity O(T). This is relatively
clear, because the length T bounds the number of bits needed to represent all these functions. Unfortunately,
ERM with respect to PROG_T is uncomputable. A modified ERM for TIME_T (truncating execution time) can be
computed, but is NP-complete. If P = NP, we can have universal learning (a free lunch). If P ≠ NP, and in particular if
there exist one-way functions that are easy to compute but hard to learn, there is no poly-time algorithm
for ERM over TIME_T.
We thus unfortunately conclude that the free lunch is only possible if P = N P . This realization gives
rise to the computational no free lunch theorem: For every computationally efficient learning rule A,
there is some reality D such that there is some computationally efficient (poly-time) h∗ with L(h∗) = 0, but
E[L(A(S))] ≈ 1/2. In other words, even though an efficient and accurate h∗ exists, an efficient learning rule A
has no guarantee of finding a predictor that generalizes.
This leads us to revise our requirements of inductive bias; we have to assume not only that reality D
supports good generalization, but also that the learning algorithm A runs efficiently. The capacity of H or the
complexity measure c(h) are not a sufficient inductive bias if ERM / SRM are not efficiently implementable,
or if the implementation does not always work (i.e. runs quickly but does not achieve ERM / SRM). Note that
we switched from discussing learning rules (arbitrary mappings from sample to model), to talking about
learning algorithms, an actual implementable process that chooses such a model.

Going back to neural networks, we completely understand them from a statistical perspective (in terms
of capacity and expressivity). The problem with them relates to computation; computing the ERM for feed-
forward neural nets is a non-convex optimization problem, and no known algorithm is guaranteed to work.
We know that learning in neural nets, even in the simplest cases (2-hidden units in one hidden layer), is NP-
hard. Even if reality is well-approximated by a small neural net, and you tried optimizing a larger neural net
(which has more degrees of freedom), optimization is easy but ERM is still NP-hard. Unfortunately, there is
nothing one can do to efficiently solve this computational problem, which is essential to neural nets precisely
because of their expressive power (computational no-free lunch). Even if a function is exactly representable
with single hidden layer and Θ(log d) nodes, even with no noise, and even if we take a much larger network
or use any other method when learning: no poly-time algorithm can ensure better-than-chance prediction,
see e.g. [14, 10].
And nevertheless, deep learning does work! We have seen that from statistical and computational
perspectives, performing ERM on short programs (or short-runtime programs) and learning with deep networks
are equivalent. Both approaches are universal and approximate reality with reasonable sample complexity.
They are both NP-hard, and provably hard to learn with any learning rule (subject to cryptographic assump-
tions). However there is no practical way to optimize over short programs, as e.g. there is no practical local
search over programs. In contrast, deep neural nets are often easy to optimize; they are continuous models,
amenable to local search (gradient descent, SGD), and enjoy much empirical success. In the worst case, deep
learning is provably impossible, and yet, we are constantly reminded that deep learning is possible. There is
a certain magical property of reality that makes learning with feed-forward neural networks possible, and we have just started to
scratch the surface of what that property is. However, we know for sure what it isn’t: it is not the property
that reality is well approximated by neural networks.

1.4 Deep learning in practice


As we have seen, deep neural networks can represent any function, and indeed have been shown to fit
random data with perfect (training) accuracy [34]. However, when trained on real data, these networks do
successfully generalize, even when over-parameterized.
We thus have a learning rule A(S) that is able to achieve perfect training accuracy for any data set, even
with random labels LS (A(S)) = 0. On the other hand, it is able to generalize for real data S ∼ Dm sampled
from a reasonable reality D, achieving low L(A(S)).
Perhaps we should not be surprised about this, as other learning algorithms do show similar behavior.
If the problem is realizable by a continuous h∗ (i.e. L(h∗) = 0), then a 1-Nearest-Neighbor classifier is consistent: for infinite sample
size (|S| → ∞), its generalization error vanishes, L(1-NN(S)) → 0. Similarly, a Hard Margin
Support Vector Machine (SVM) with a Gaussian kernel (or some other universal kernel), or more generally,
minimization of a norm for consistent solutions, also tend to generalize despite having vanishing training
error: arg min ||h||K such that LS (h) = 0. Let us consider a linear case where

w = arg min ||w||2 s.t. ⟨w, ϕ(xi )⟩ = yi

In this case, our SVM model does not have a flat inductive bias, but the norm of the weights w adapts
to the level of complexity inherent in the data. If reality is represented by a solution with small norm, then
the learning rule will achieve a solution with low complexity measure and therefore generalize. However,
if we try to fit random labels, we can only fit a model with a high norm (high complexity measure), and
it will fail to generalize. We can always train SVMs to zero training error LS(h) = 0. If ∃h∗ with
zero generalization error LD(h∗) = 0, it will be achieved with sample complexity |S| = O(∥h∗∥²_K). Another
example of this kind of generalization is found in Minimum Description Length (MDL): a program of minimal length,
arg min |program| such that LS(program) = 0, achieves a generalization error
L(MDL(S)) ≤ O( |program| / |S| ). That is, a short program only requires a sample complexity proportional to its length.
These examples and the ability of deep nets to generalize imply that the size of the network is not a
good measure of model complexity. This is not a new idea; it was already realized in the 1990s that kernel
regression works for infinitely many features, because we rely on norm for complexity control rather than
the dimensionality. It was shown by P. Bartlett in 1996 [2] that the complexity of a neural network is not
controlled by the number of weights but by their magnitude. In fact, neural networks have many solutions

for w so that LS (w) = 0, many of which have high generalization errors. These solutions tend to have high
w-norms. However, the solutions found in practice for neural networks using gradient descent do generalize
well, and tend to have small norms, even without explicitly regularizing for low norm solutions.
Where is this implicit regularization coming from? We will try to understand this in the simplest model
possible - linear regression. Consider an under-constrained least squares problem (n > m):

min_{w∈R^n} ∥Aw − b∥²,   with A ∈ R^{m×n}.

In under-constrained cases there are many choices of w for which the sum of squares vanishes. Imagine
solving this problem with gradient descent, initialized at w = 0. Gradient descent will definitely succeed, as
this is a convex problem, and find a vanishing solution, but which solution?
Claim 1.2. Gradient descent (or SGD, conjugate gradient descent, BFGS) will converge to the least norm
solution arg min_{Aw=b} ∥w∥₂. The proof follows from the iterates always being spanned by the rows of A (more
details soon).
While we did not tell the algorithm to prefer solutions with small norms, it does in fact find the solution
that minimizes this norm. This implicit regularization comes directly through the optimization process. In
general we find that the optimization algorithm minimizes some norm or complexity measure, but which
complexity measure?
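To make Claim 1.2 concrete, here is a minimal NumPy sketch (with arbitrary, hypothetical dimensions) that runs plain gradient descent from w = 0 on an under-constrained least squares problem and compares the result to the minimum-norm solution given by the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 100                            # under-constrained: more unknowns than equations
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Plain gradient descent on ||Aw - b||^2, initialized at w = 0.
w = np.zeros(n)
eta = 0.5 / np.linalg.norm(A, 2) ** 2     # conservative step size
for _ in range(10_000):
    w -= eta * 2 * A.T @ (A @ w - b)

w_min_norm = np.linalg.pinv(A) @ b        # minimum-norm interpolating solution

print(np.linalg.norm(A @ w - b))          # ~0: gradient descent interpolates
print(np.linalg.norm(w - w_min_norm))     # ~0: and it finds the minimum-norm solution
```

Starting from a nonzero initialization w0, the same span argument shows convergence to the interpolating solution closest to w0 in Euclidean norm.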

2 Lecture 2: Implicit bias and benign overfitting
2.1 Finishing lecture 1
To review, the high-level questions we’re trying to answer are:

1. How much of what we’ve seen so far (in Lecture 1) fits within our classic understanding of statistical
learning?
2. What questions do we need to answer to put it within our standard understanding?
3. And what goes beyond our standard understanding?

One thing we’ve seen so far is that huge models (so large that they could fit even random labels) can still
generalize. This is fine; it’s the same type of behavior we get with something like a hard-margin SVM or
a minimum-norm predictor. What’s going on in these cases is that the x-axis is not actually capacity. The
real measure of complexity is some kind of norm.
What is this norm? Well, we are really abusing the term “norm”; what we really mean is some measure
of scale that might not satisfy the rigorous definition of a norm. We can get implicit complexity control
just from the optimization algorithm. For example, when we optimize an underdetermined least squares
problem using gradient descent, we get the minimum-norm solution; that just comes from the optimization
algorithm. So, we need to ask: what complexity measure is being minimized, and how does gradient descent
minimize it?
We ended last time by saying that if we change our optimization algorithm without changing the objective
function, we’re actually implicitly minimizing some other complexity measure, which will change our inductive
bias and thus our generalization properties. As some examples, we can compare the test performance of
SGD and Path-SGD as in [25] and SGD vs. Adam as in [32]. In all cases, they reach the same final training
error, but they have different final test performance values. In other words, they’re reaching different global
minima of the training objective. The story we’re seeing is that the inductive bias is determined by the bias
of the optimization algorithm. In other words, if we have a training loss landscape with many global minima,
and we start optimizing on some hill, different optimization algorithms will move down the “hill” differently
and reach different “beaches” (0 loss).
An illustrative example of different optimization algorithms inducing different regularizers can be seen
by studying gradient descent vs. coordinate descent. Gradient descent will get to the minimum ℓ2 norm,
whereas coordinate descent will get to an approximately minimum ℓ1 norm solution. In a high-dimensional
system, these two norms are extremely different; ℓ1 induces sparsity, and ℓ2 does not. This difference is
significant, especially when we think about deep learning in terms of feature learning (which Nati thinks is
what’s really going on). Feature selection (i.e., from a long list of features, select the relevant ones) is just
sparsity, and ℓ1 regularization can achieve it. In deep learning, we want to do feature learning, not just
feature selection. But what is finding new features? We can think of a continuous set of possible features,
and we want to select good features from that infinite feature set. So as long as we’re not too worried about
infinities, there’s not much difference between feature selection and feature learning. And you can do that
better with ℓ1 ; ℓ2 will never give you anything that is sparse. Thus, this can be a huge difference, and all
the inductive bias is coming from the algorithm here. In fact, here is the perspective on deep learning we
shall take here:
Deep networks can approximate any function. We kind of dismissed our universal approximation results
last time because they require huge networks with seemingly unrealistic capacity. But maybe we actually
are using networks that are large enough to capture all functions (since we are, at least on the data we see,
able to capture all functions). So maybe we are essentially optimizing over the space of all functions. In
that case, minimizing empirical error with respect to all functions doesn’t make any sense; it’s really easy to
optimize the empirical error with respect to all functions, since we can just memorize the training examples
and not do anything anywhere else in the domain. But in deep learning, we optimize over all functions with
particular search dynamics, and although we do get to a function that has 0 training error, we don’t get to
just any 0 error solution. How we optimize over the space of all functions determines which directions we
like to take. Roughly speaking, we’ll get to a 0-error solution that’s “close” (to the initialization point) in
some sense with respect to our geometry.

Figure 2: Matrix completion. Relative error is the squared error compared to the squared error of the null predictor.
Figure 3: Error vs. number of hidden units.

2.2 Examples
With this perspective in mind, we will now begin the main content of Lecture 2. We’ll start by examining
some examples where we’re optimizing over the space of all functions, but because of the way we’re optimizing
over the space of all functions, we’re actually getting good generalization.

2.2.1 Matrix completion


In matrix completion, we have some observations from a matrix. We want to complete the matrix and
uncover the remaining entries. You should think of the matrix as having some structure (e.g., some low-rank
structure). Formally, the matrix completion problem is:

min_{X∈R^{n×n}} ∥observed(X) − y∥²₂.    (1)

In some sense, it’s very easy to solve this optimization problem. We can fill in the observed entries and
put 0 everywhere else, but it won’t help us recover the unobserved entries. The problem is underdetermined.
So what do we do? We can run gradient descent directly on X, or - alternatively - we can replace X with
U V ⊤ (U, V full dimensional, therefore no rank constraint on X = U V ⊤ ) and run gradient descent on U, V .
Figure 2 compares the results of these two procedures when the ground-truth matrix X ∗ has rank 2. We
also see how slight variations in gradient descent on U, V change which solution we converge to. But the
bigger effect is coming from the reparameterization (from X to U V ⊤ ).
So what is going on here? In this case, we have a good understanding, though not complete. The initial
proposal in [12] was that gradient descent is converging to a low nuclear norm solution - i.e., nuclear norm is
the relevant complexity measure that gradient descent is minimizing. We know that minimizing the nuclear
norm can give you good generalization when you have an approximate low-rank matrix. Minimization of
the nuclear norm is proved rigorously in some cases (e.g., under the restricted isometry property (RIP) in [15]),
but turns out not to always hold (e.g., see the counterexample in Example 5.9 of [16]).
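A minimal NumPy sketch of this kind of experiment (hypothetical sizes, observation rate and step sizes, not the exact setup of Figure 2): we compare gradient descent directly on X with gradient descent on a full-dimensional factorization X = U V⊤ from a small initialization, and look at the recovery error and nuclear norm of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 2
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-2 ground truth
mask = rng.random((n, n)) < 0.3                                       # ~30% observed entries

def grad(X):
    # Gradient of the squared error on observed entries (constant factor absorbed in the step size).
    return mask * (X - X_star)

# (a) Gradient descent directly on X, from X = 0: unobserved entries never move.
X = np.zeros((n, n))
for _ in range(2_000):
    X -= 0.5 * grad(X)

# (b) Gradient descent on a full-dimensional factorization X = U V^T, small initialization.
U = 0.01 * rng.standard_normal((n, n))
V = 0.01 * rng.standard_normal((n, n))
for _ in range(50_000):
    G = grad(U @ V.T)
    U, V = U - 0.01 * G @ V, V - 0.01 * G.T @ U

def rel_err(X_hat):
    return np.linalg.norm(X_hat - X_star) / np.linalg.norm(X_star)

print(rel_err(X), np.linalg.norm(X, 'nuc'))               # poor recovery of unobserved entries
print(rel_err(U @ V.T), np.linalg.norm(U @ V.T, 'nuc'))   # typically much better, with lower nuclear norm
```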

2.2.2 Single overparameterized linear unit: Rd → R


If we train a single unit with gradient descent using the logistic (“cross entropy”) loss, we converge to the
max-margin separator (the hard-margin SVM predictor), which has an implicit ℓ2 norm in it:

w(∞) ∝ arg min ∥w∥2 s.t. ∀i yi ⟨w, xi ⟩ ≥ 1. (2)

This holds regardless of initialization. We will go over this result in detail in a future lecture.
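A small numerical illustration of this statement (a hypothetical 2D dataset constructed so that the hard-margin SVM solution of (2), without bias, is w∗ = (1, 0)): gradient descent on the logistic loss, run for a long time from an arbitrary initialization, slowly aligns with that direction.

```python
import numpy as np

# Hypothetical separable data; by construction the minimizer of min ||w|| s.t. y_i <w, x_i> >= 1 is (1, 0).
X = np.array([[1.0, 0.0], [2.0, 1.5], [3.0, -1.0],
              [-1.0, 0.0], [-2.0, 2.0], [-2.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w = np.array([2.0, -3.0])            # arbitrary initialization
eta = 0.1
for _ in range(200_000):
    margins = y * (X @ w)
    grad = -(X.T * y) @ (1.0 / (1.0 + np.exp(margins)))   # gradient of the logistic loss
    w -= eta * grad

print(w / np.linalg.norm(w))         # slowly approaches the max-margin direction (1, 0)
```

The convergence of the direction is only logarithmic in the number of iterations, which is why the run is long.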

2.2.3 Deep linear network: Rd → R
Now, we can ask what happens in a deeper network (with only linear activations). Let w denote the weights
of all the layers. Then our deep linear network implements the same linear mapping as above:

fw (x) = ⟨βw , x⟩. (3)

When we run gradient descent on w (vs. β, as we did above), one might think that this reparameterization
could affect the search geometry. However, in this case, the inductive bias is actually the same as above:

βw(∞) ∝ arg min ∥β∥2 s.t. ∀i yi ⟨β, xi ⟩ ≥ 1. (4)

2.2.4 Linear convolutional networks


Things get more interesting in the linear convolutional case, though. Let’s consider the following linear
convolutional network:
h_l[d] = ∑_{k=0}^{D−1} w_l[k] h_{l−1}[(d + k) mod D],    h_out = ⟨w_L, h_{L−1}⟩,    (5)

which is still just a reparameterization of our original linear function from Rd → R. Now we can ask what
happens when we train this model using gradient descent.

Single layer (L = 2). With a single hidden layer, training the weights with gradient descent implicitly
minimizes the ℓ1 norm in the frequency domain:

arg min ∥ DFT(β)∥1 s.t. ∀i yi ⟨β, xi ⟩ ≥ 1, (6)

where DFT denotes the discrete Fourier transform. In other words, we obtain sparsity in the frequency
domain.

Multiple layers. With L layers, training the weights with gradient descent converges to a critical point
of
∥ DFT(β)∥2/L s.t. ∀i yi ⟨β, xi ⟩ ≥ 1, (7)
where ∥ · ∥2/L denotes the 2/L quasinorm. It is not technically a norm, but it is formally defined, and it’s
even more sparsity-inducing than ℓ1 . Thus, increasing the depth induces more and more sparsity in the
frequency domain. See [11] for more details.

2.2.5 Infinite-width ReLU network


Now let us look at all functions (not just linear) from Rd → R. In order to represent them with a neural
network, we have to introduce nonlinearities (e.g., ReLU). If we let a single-hidden-layer ReLU network be
wide enough, we can approximate all functions. So let us learn using infinite-width ReLU networks.

Functions h from R → R. Gradient descent on the weights implicitly minimizes


max( ∫ |h′′(x)| dx,  |h′(−∞) + h′(+∞)| ).    (8)

This would be a very sensible penalty to choose, since it is kind of a smoothness-inducing penalty. The
interesting thing is that we didn’t explicitly choose it; it came to us just from gradient descent on this
parameterization.

Functions h from Rd → R. Gradient descent on the weights implicitly minimizes
∫ |∂_b^{d+1} Radon(h)|.    (9)

Once again, we get some kind of sensible smoothness penalty. This result is rigorous for logistic loss
(doesn’t depend on initialization or learning rate). For squared loss, we don’t know how to analyze it
exactly, although we expect something similar. See [27], [26], [7] for more details.

2.2.6 Takeaways from these examples


What have we been doing? The game here is that we want to understand what happens in the space
of functions. The inductive bias in parameter space is relatively simple (ℓ2 or something similar, often). But
what we really care about is what happens in function space, which can be very rich. A large part of the
optimization problem is the architecture. The classical view is that the architecture is important because it
limits what functions you can get; however, that’s not the case here. The architecture is important because
it determines the mapping from parameter space to function space and is the biggest contributor to the
geometry by which you’re searching in function space.
The next most significant thing that affects the inductive bias is the geometry of the local search in
parameter space (e.g., ℓ2 vs. ℓ1 in parameter space).
And the least significant thing (though still relevant) that affects the inductive bias is the set of opti-
mization choices (e.g., initialization, batch size, step size, etc.).

Does gradient descent always minimize the ℓ2 norm in parameter space? In all of these examples,
we can get the same thing as gradient descent’s implicit regularization using a minimum ℓ2 norm on the
weights (subject to fitting the data). In all of these examples, the complexity control in parameter space
is very simple (it is just ℓ2 ), and everything is coming from the parameterization. So is all this discussion
of the implicit bias of gradient descent just ℓ2 in parameter space, with everything coming just from the
parameterization?
The answer is: sort of. You can get some asymptotic results showing that, for some restricted class
of models with the logistic loss, everything boils down to ℓ2 . However, we’ll also see that in many cases,
this is not true, and you’ll get something very different (e.g., under squared loss, or even under logistic loss
non-asymptotically).

2.3 Why the red curve goes down in Figure 3


We now understand that, in Figure 3, we have complexity control coming from the algorithm, and this is
what stops us from overfitting. The main question this led us to ask is: what is this complexity control?
(which we just studied, in several examples.)
But there is another thing that is going on here. What we have studied so far explains why we might
be able to generalize well even at a large number of hidden units (i.e., we have complexity control coming
from the algorithm, even though we’re optimizing in the space of all functions). However, we haven’t yet
explained why the red curve (test error) actually goes down as the number of hidden units increases. Recall,
the x-axis is not complexity (complexity is something else - some norm-based complexity, probably).

Gaussian kernel. We can actually see similar behavior even with kernel methods. Let us consider the
Gaussian kernel, which corresponds to an infinite-dimensional feature space. Let us think of what hap-
pens if we use a finite approximation to the Gaussian kernel. Concretely, we have the Gaussian kernel
⟨ϕ∞(x), ϕ∞(x′)⟩ = exp(−∥x − x′∥²) and the finite-dimensional feature mapping

ϕ_d(x)[i] = (1/√d) cos(⟨ω_i, x⟩ + θ_i).

The algorithm returns A(S) = arg min ∥w∥ s.t. LS(x ↦ ⟨w, ϕ_d(x)⟩) = 0, i.e., ∀(xi, yi) ∈ S, yi =
⟨w, ϕd (xi )⟩. As d → ∞, we approach the Gaussian kernel. Once we have more features than data points, we

can already get 0 training error. But we are not doing it with the Gaussian kernel yet; we are doing it with
an approximation of the Gaussian kernel. If the RKHS norm induced by the Gaussian kernel is our “correct”
complexity measure, then as d → ∞, we are approximating it better and better. So we are minimizing a
complexity measure that’s a better and better approximation of the complexity measure we want. So as the
dimensionality increases, the test error improves.
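A minimal sketch of this experiment with random Fourier features (hypothetical target function and sample sizes): for each d we draw ω_i ∼ N(0, 2I) and θ_i ∼ Unif[0, 2π], fit the minimum-norm interpolating w, and watch the test error as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(X):                              # hypothetical smooth ground-truth function
    return np.sin(3 * X[:, 0]) + 0.5 * np.cos(2 * X[:, 1])

n_train, dim = 40, 2
X_tr = rng.uniform(-1, 1, (n_train, dim))
X_te = rng.uniform(-1, 1, (500, dim))
y_tr, y_te = target(X_tr), target(X_te)

def features(X, omegas, thetas):
    # phi_d(x)[i] = (1/sqrt(d)) cos(<omega_i, x> + theta_i), as in the text
    return np.cos(X @ omegas.T + thetas) / np.sqrt(len(thetas))

for d in [50, 200, 1000, 5000]:
    omegas = rng.normal(0.0, np.sqrt(2.0), (d, dim))   # frequencies matching exp(-||x - x'||^2)
    thetas = rng.uniform(0.0, 2 * np.pi, d)
    Phi_tr = features(X_tr, omegas, thetas)
    Phi_te = features(X_te, omegas, thetas)
    w = np.linalg.pinv(Phi_tr) @ y_tr                  # minimum-norm interpolating weights
    print(d, np.mean((Phi_te @ w - y_te) ** 2))        # test error typically improves as d grows
```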

Matrix completion. We can also see this in matrix completion. Suppose X is n×n, we observe nk entries,
and we parameterize X as U V ⊤ , where U, V ∈ Rn×d (thereby adding a rank-d constraint). As above, d
captures the quality of our approximation. Suppose the “right” complexity measure is nuclear norm. We
have two different regimes:
- If d < k, our algorithm returns arg min L̂(X) s.t. rank(X) ≤ d.
- If d > k, our algorithm returns arg min ∥X∥∗ s.t. L̂(X) = 0, rank(X) ≤ d.
So as we increase the rank constraint, the test error becomes better and better. As we increase the rank,
we’re getting closer to what we really want, which is a full-rank low nuclear norm matrix. So d is not our
complexity measure; rather, it’s the dimensionality of the approximation to the infinite-dimensional system.

Commentary. We claim that this is what’s happening in neural networks too. The real object we should
be learning with is an infinite-size network. But we cannot really represent an infinite-size network; instead,
we use a truncated, finite-dimensional representation of it. As our truncation becomes finer and finer,
our representation becomes better and better. We want the size to be so big that we essentially have a good
approximation to the infinite-size model. Our methods should not rely on the size of the approximation
being small. So we should start by understanding the infinite size and then worry about the question: “how
large do we need our model to be in order for it to be considered infinite?”.

2.4 What we don’t yet fully understand: benign overfitting


However, there’s still one thing that doesn’t really fit our understanding. And this is something we were
completely blind to for many years, even though it was right in front of us. It was pointed out only fairly
recently by Misha Belkin. This one thing is that we are getting good generalization even though we are
insisting on a 0-training error solution in noisy situations (situations where the approximation error is
nonzero). In particular, in Figure 4, we want to balance complexity and training error, so we know we want
to be somewhere on this regularization frontier. But the solutions we are finding are the minimum-complexity
0-error solutions - an extreme point of the frontier, and we are seeing this even in fairly noisy cases, where
we would expect to be somewhere on the frontier that strikes more of a balance between complexity and
training error.
To understand this a bit better, let us return to fitting noisy data with polynomials, where complexity is the
degree of the polynomial. As we increase the degree of the polynomial, we can decrease the training error.
In this case, as seen in Figure 5, we can get 0 training error with a degree-9 polynomial. But we are fitting
the noise and getting bad generalization. This is captured by the classic U-shaped red curve in Figure 5.
At some point, we start overfitting - which we will define as fitting the noise. At that point, the test error
starts to become worse because the estimation error begins to dominate. The conventional wisdom is that
we should never insist on 0 training error, because it will fit the noise and generalize poorly.
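A tiny sketch of this classical picture (hypothetical ground truth and noise level): fitting noisy samples with polynomials of increasing degree, the training error decreases monotonically, while the degree-9 fit of 10 points interpolates the noise and the test error blows up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
x_tr = np.linspace(-1, 1, n)
x_te = np.linspace(-1, 1, 200)

def f(x):                                         # hypothetical smooth ground truth
    return np.sin(np.pi * x)

y_tr = f(x_tr) + 0.3 * rng.standard_normal(n)     # noisy training labels

for degree in [1, 3, 5, 9]:
    coeffs = np.polyfit(x_tr, y_tr, degree)       # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_te) - f(x_te)) ** 2)
    print(degree, train_err, test_err)            # degree 9 interpolates (train ~ 0) but tests poorly
```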

Connection with double descent. Arguably, the first paper to discuss the above phenomenon was [4],
which introduced the notion of “double descent” (Figure 6). At this point, we probably understand about
95% of double descent, which has little to do with the question we just asked about fitting the noise. So
let us briefly discuss the double descent phenomenon, before reaching the remaining 5% that we do not yet
understand.
Why are we getting double descent? Let us think of a least squares problem in dimensionality d and a fixed
number of training examples. The x-axis is the dimensionality. The y-axis is the error of the ERM solution
- but not just any ERM solution: if the problem is overdetermined, find the solution that minimizes the
reconstruction error; if the problem is underdetermined, find the minimum Euclidean norm 0-training-error
solution. Until the interpolation threshold, the x-axis really is complexity control. After the interpolation

14
Figure 5: Our classical understanding of overfitting.
Figure 4: Trading off training loss and complexity.

Figure 6: The double descent phenomenon.

threshold, the x-axis is no longer complexity control; rather, it is the degree of approximation. As we just
discussed, it is not surprising that the test error improves as the approximation becomes better.
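A minimal least-squares sketch of this curve (hypothetical dimensions and noise level): we fit with the first d features, and the pseudoinverse gives the ordinary ERM solution below the interpolation threshold and the minimum-norm zero-training-error solution above it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_max, noise = 40, 400, 0.5
w_star = rng.standard_normal(d_max) / np.sqrt(d_max)        # hypothetical ground truth
X = rng.standard_normal((2000 + n, d_max))
y = X @ w_star + noise * rng.standard_normal(len(X))
X_tr, y_tr, X_te, y_te = X[:n], y[:n], X[n:], y[n:]

for d in [5, 10, 20, 35, 39, 45, 80, 200, 400]:
    # Keep only the first d features: ERM when d < n, minimum-norm interpolation when d > n.
    w = np.linalg.pinv(X_tr[:, :d]) @ y_tr
    test_err = np.mean((X_te[:, :d] @ w - y_te) ** 2)
    print(d, test_err)   # test error typically peaks near the interpolation threshold d ~ n, then descends
```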
The fact that we get an increase followed by a decrease is not surprising. What is surprising is that
we get good generalization even though we insist on 0 training error in a fairly noisy situation. The main
experiment that Misha and coauthors did that probably convinced many people is from [5], in which high
levels of noise were added to synthetic data. When perfectly fitting this noisy data using different methods
(all methods get 0 training error), the test error is substantially below the null risk. In particular, the
phenomenon we are seeing is that we are perfectly fitting the noise, but this overfitting is not harmful, as in
Figure 5. Rather, the overfitting is benign. We are fitting the noise in a way that has a kind of measure-0
effect; fitting the noise does not ruin the fit in other places. We are maybe not gaining anything from fitting
this noise, but it doesn’t hurt us either (it’s benign).

Open Questions. In what situations is overfitting harmful, and in what situations is it benign? In least
squares, we can get both kinds of behaviors in a now-predictable way (the result of research in the past 2-3
years). We know that it relates to a measure of effective rank. Characterizing more generally (beyond least
squares) when overfitting is harmful vs. benign is a big challenge that we have now, which is fairly different
from how we used to think about overfitting.
In some sense, the most practical implication of benign overfitting is that we don’t have to worry too
much about selecting the right value of the regularization parameter λ; we have a whole regime of good
values of λ. In many cases in practice, though, we see something in between benign and harmful overfitting
(but much closer to benign). This matches what we see empirically: adding a bit of regularization can
help a bit, but we can still have good performance without explicit regularization.

2.5 Summary
What fits our understanding and what doesn’t?
In linear models, which is the only case we really understand, we can see how we can generalize even

when we are able to fit random labels. We can understand how the complexity control comes purely from the
optimization algorithm and how we can generalize better with bigger and bigger models, since they provide
a better and better approximation to what we really want. Therefore, to fit each case within our classic
understanding, we “just” need to ask: 1. What is the complexity measure? 2. How is it minimized by the
optimization algorithm? 3. How does minimizing it ensure good generalization? Thus, the only thing that
truly goes beyond our standard understanding is the phenomenon of benign overfitting.

3 Lecture 3: Statistical learning and stochastic convex optimization
In this lecture, we will explore the connections between optimization geometry and generalization in the
well-understood convex case. Specifically, we will derive generalization guarantees based on this geometry.
Many of the concepts and proofs reproduced here can be found in classical references on statistical learning
[30, 28, 20] and optimization [6, 24].

3.1 Learning and optimization for convex problems


Recall that the goal of supervised learning is, for given input and output spaces X , Y, to find a predictor
function hw : X → Y, parametrized by a vector w, with low population error defined by:

L(hw ) = Ex,y [l (hw (x); y)] , (10)

for some hidden joint density p(x, y) and a chosen loss function l measuring the prediction error for a given
pair (x, y). In what follows, we will denote by H the chosen hypothesis class of functions that can be represented
with the parameter w, and will consider the same notation for optimization on H or the corresponding
parameter space. Since we do not have direct access to the joint distribution p(x, y), we collect a dataset
S = {(x1 , y1 ), . . . , (xm , ym )} of m i.i.d. samples from p, and use it to estimate L(hw ) (equivalently denoted
L(w)) with the corresponding empirical distribution. This leads to the empirical risk minimization (ERM)
problem to estimate a parameter vector ŵ:
ŵ = arg min_w L̂(w) := arg min_w (1/m) ∑_i l(h_w(x_i); y_i) + λΨ(w).    (11)

Here, the term Ψ(w) is the regularizer or penalty, while the parameter λ > 0 tunes the regularization
strength. The purpose of this term is mainly to prevent overfitting. Intuitively, a well chosen Ψ will increase
if a corresponding complexity measure of the model increases. Typical examples include the ℓ2 norm,
enforcing regularity at the functional level, or the ℓ1 norm, inducing sparsity at the level of the parameters
w. Equivalently, (11) can be written as
ŵ = arg min_w (1/m) ∑_i l(h_w(x_i); y_i)    (12)
such that Ψ(w) ≤ B,    (13)

where B depends on the regularization coefficient λ. The main two sources of error that need to be controlled
in empirical risk minimization are the optimization and generalization error. The latter is handled using
uniform convergence tools to control the convergence rate of the empirical risk towards the population one,
when the parameters are constrained to the sublevel sets defined by the penalty.

Figure 7: Graphical representation of sublevel sets of the complexity measure.

By uniform convergence we mean the following. Let us look back at the ERM learning rule (12). To
ensure that it performs well with respect to the population error, we can bound the difference between the
population and empirical errors uniformly over all predictors in the class. That is, for any ϵ > 0, we need to
quantify how many samples m are needed to ensure that sup_{Ψ(w)≤B} |L̂(w) − L(w)| < ϵ. For a given learning
problem, a dataset that ensures the latter inequality is said to be ϵ-representative. It is straightforward to
show that an ϵ/2-representative training set ensures that

L(ŵ) ≤ min_w L(w) + ϵ,    (14)

ensuring that the predictor ŵ is a good proxy for the true minimizer. The standard way to achieve uniform
control of the deviation between L̂(w) and L(w) is to quantify the complexity of the hypothesis class H and
regularity of the loss function, and to relate these quantities to the required sample complexity. A useful
complexity measure for hypothesis classes is the Rademacher complexity, defined in its empirical form as

R(H ◦ S) = E_{σ∼{±1}^m} [ sup_{h∈H} (1/m) ∑_{i=1}^m σ_i h(z_i) ]    (15)

where σ is a random vector with i.i.d. Rademacher entries. The Rademacher complexity gives a distribution
dependent alternative to the VC dimension discussed in Lecture 1, defined for any class of real-valued
functions. We then have uniform convergence bounds similar to the ones presented using the VC dimension
in Lecture 1, e.g., assuming that |l(h, z)| < c for some positive constant c, for any h ∈ H
L(h) − L̂(h) ≤ 2 R(l ◦ H ◦ S) + c √( 2 log(2/δ) / m )    (16)
with probability at least 1 − δ. The interested reader can find more details on the Rademacher complexity
along with examples in Chapter 26 of [28]. In this lecture, we will directly derive optimization guarantees
for the learning problem formulated as a stochastic optimization problem.
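As a sanity check on definition (15), here is a small Monte Carlo estimate of the empirical Rademacher complexity of a norm-bounded linear class (hypothetical data; for this class the supremum in (15) has a closed form, and composing with a 1-Lipschitz loss changes the bound only up to a constant by the contraction lemma).

```python
import numpy as np

rng = np.random.default_rng(0)
m, dim, B = 200, 10, 2.0
X = rng.standard_normal((m, dim))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ||x_i|| <= 1

# For H = {x -> <w, x> : ||w|| <= B}, the supremum in (15) is attained at w = B u / ||u||
# with u = sum_i sigma_i x_i, so R(H o S) = (B/m) E ||sum_i sigma_i x_i||.
estimates = []
for _ in range(2_000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    estimates.append(B / m * np.linalg.norm(sigma @ X))

print(np.mean(estimates))          # Monte Carlo estimate of the Rademacher complexity
print(np.sqrt(B**2 / m))           # the sqrt(B^2 / m) rate appearing in (18)
```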

Example: Let us assume that the reality is captured by a low-norm linear predictor. Mathematically,
this goes as follows:
H = {h_w : x ↦ ⟨w, x⟩ | ∥w∥₂ ≤ B}.    (17)
For simplicity, let us assume that the data and the derivative of the loss function are bounded: ∥x∥2 ≤ 1
and ∥∇l∥ < 1, and that l is convex. In that case, it can be shown straightforwardly that the empirical
Rademacher complexity of the corresponding ERM problem verifies
R(l ◦ H ◦ S) ≤ √( B² / m ).    (18)

If we denote by ŵ the arg min of the empirical loss L̂, we then reach the following generalization result:
r !
B2
L(ŵ) ≤ inf L(w) + O . (19)
∥w∥≤B m

1
P
In order to compute ŵ we perform gradient descent on L̂(w) = m i l(⟨w, xi ⟩, yi ), where l(·, ·) is the loss
function. The iteration of the GD goes as follows

wk+1 = wk − η∇L̂(wk ).

The convergence rate of this algorithm with the optimal step-size, see e.g. [24], is described as:
r !
T B2
L̂(w̄ ) ≤ inf L̂(w) + O
∥w∥≤B T

However, in each iteration of the gradient descent, we need m gradient computations. Depending on the data
this may be computationally costly. Instead, one may use stochastic gradient descent (SGD). In this case,
we uniformly pick an example (xi , yi ) and only calculate its corresponding gradient term. SGD iteration is
written as
w̄k+1 = w̄k − η∇l(⟨w̄k , xi ⟩, yi ).

18
Thus at each iteration we subtract an unbiased estimator of the full gradient. Indeed, at one may check that

E[∇l(⟨w̄k , xi ⟩, yi )] = ∇L̂(w̄k ).

For the SGD algorithm we have the same convergence guarantee:


r !
B2
L̂(w̄T ) ≤ inf L̂(w) + O .
∥w∥≤B T

Combining this bound with (19) we have the following:


r ! r !
T B2 B2
L(w̄ ) ≤ inf L̂(w) + O +O .
∥w∥≤B m T

Thus the error magnitude of the SGD and approximation error is the same. This means that we need to
do q
at most m iteration of SGD because when m > T , the dominant term in the previous bound becomes
2
O( Bm ).
The one pass SGD can also be viewed as an algorithm to minimize the population risk L(w) = E[l((w, x), y)].
Indeed, the gradient term satisfies the following:

∇L(wk ) = Exi ,yi ∇l(⟨wk , xi ⟩, yi ) .


The latter means that instead of this two-step scheme, we can analyze the generalization using the optimiza-
tion guarantee directly for the population risk. Therefore we obtain the following
r !
T B2
L(w̄ ) ≤ inf L(w) + O .
∥w∥≤B T

We cannot do more iterations than the number of data points, as we need to have independent samples from
the population. Thus the number of iterations is again bounded by the sample size and therefore we get the
same bound.

3.2 Stochastic optimization


Stochastic optimization problem is written as

min F (w) = Ez∼D [f (w, z)] (20)


w∈W

based on i.i.d. samples z1 , z2 , . . . , zm ∼ D. Here the distribution D is unknown and we do not have access
to F (w). But using the samples we can have estimates of F and ∇F . An instance of this problem is the
general learning problem. It can be formulated as

min F (h) = Ez∼D [f (h, z)],


h

using the data samples zi ∼ D, for i = 1, . . . , m, where h is a mathematical object adapted to the particular
model. Here is a short list of examples.

• In supervised learning we have Z = X × Y = {z = (x, y) | x ∈ X , y ∈ Y}. The function h : X → Y and


f (h, z) = l(h(x), y), where l is the loss function.
• In the unsupervised k-means clustering problem we have z = x ∈ Rd and h = (µ[1], . . . , µ[n]) ∈ Rd×k .
Here h[i] is the center of i-th cluster. The objective function for this problem is defined as

f ((µ[1], µ[2], . . . , µ[k]), x) = min ∥µ[i] − x∥2 .


i

19
• The problem of density estimation can also seen as a stochastic optimization. Consider z = x in
some measurable space Z (e.g. Rd ). Then for each h, we define the probability density ph (z) and the
objective function f (h, z) = − log ph (z). The function F in this case is the KL divergence.

The fields of stochastic optimization and statistical learning have been developed in parallel in the 60s
and 70s [31, 23].
Let us get back to the stochastic optimization problem (20). We saw two ways of solving this problem.
The first is based on Sample Average Approximation (SAA) or the ERM. It essentially consists of collecting
data z1 , z2 , . . . , zm and estimating the expectation term with the empirical mean
1 X
F̂m (w) = f (w, zi ).
m i

The other method is the Stochastic Approximation (SA) e.g. SGD. Here, we update wi using f (wi , zi ), ∇f (wi , zi )
and previous iterates. In particular in SGD we have: wi+1 = wi − η∇f (wi , zi ).
As mentioned previously, in machine (supervised) learning the objective function is the population risk
L(w). In this setting, the SGD can be applied in two ways. The first follows the direct SA approach (one-pass
SGD).

1: Initialize w(0) = 0
2: At iteration t = 0, 1, . . . , T
3: Draw (xt , yt ) ∼ D
4: w(t+1) ← w(t) − ηt ∇l(⟨w(t) , xt ⟩, yt )
PT
5: Return w̄ (T ) = T1 t=1 w .
(t)

For this algorithm we may obtain the following convergence guarantee1 :


r r
(T ) ∗ B2 B2
L(w̄ ) ≤ L(w ) + 2 +
m T
. However, we may perform SGD on ERM. That is our optimization problem has the following form:
m
1 X
min L̂(w) = min l(⟨w, xi ⟩, yi ). (21)
∥w∥2 ≤B ∥w∥2 ≤B m i=1

The alternative approach suggests the minimization scheme below.

1: Draw (x1 , y1 ), . . . , (xm , ym ) ∼ D


2: Initialize w(0) = 0
3: At iteration t, we pick randomly i ∈ {1, 2, . . . m}.
4: w(t+1) ← w(t) − ηt ∇l(⟨w(t) , xi ⟩, yi )
5: Then, we may perform the step wt+1 ← proj∥w∥≤B wt+1 . although we may show that implicitly our
iterate will converge PTto this set.
6: Return w̄ (T ) = T1 (t)
t=1 w .

For this algorithm we may obtain the following convergence guarantee:


r r
(T ) ∗ B2 B2
L(w̄ ) ≤ L(w ) + 2 + .
m T
The several differences between these two schemes.
1 All the guarantees are satisfied up to a constant.

20
• Since we need independent samples for the first scheme, it has as many samples as iterations, (m = T ).
This is not the case for the second method, as we fix initially the m samples and then choose from
them. Thus it can have T > m iterations.
• The direct SA approach does not require regularization and thus it does not have a projection step.

• The SGD on ERM method is explicitly regularized. The regularization of the direct SA approach hides
in the step-size. Indeed, in order for the parameter
√ w to have larger norm, one needs to choose larger
step-sizes. In particular, if we choose ηt = B 2 t, we get the following:
r
(T ) ∗ B2
L(w̄ ) ≤ L(w ) +
T
On the other hand, the SGD on ERM has the following generalization error bound:
r r
(T ) ∗ B2 B2
L(w̄ ) ≤ L(w ) + 2 +
m T
In both cases L(w∗ ) is the value of the objective at the optimal point in the class: L(w∗ ) = min∥w∥2 ≤B L(w).

Where is the regularization? Although we mentioned the effect of the step-size on the regularization in
the direct approach, it is still not clear why we observe this phenomenon. Let us look back at the standard
GD. The gradient descent minimizes the norm ∥w∥2 . Indeed. At each step of the gradient descent we
minimize its linear approximation given by the gradient:
n o
w(t+1) = arg min F (w(t) ) + ⟨g (t) , w − w(t) ⟩ .
w

The latter is linear function and thus its minimum is at infinity. Also, on the other hand, the linear
approximation is not valid when we get far from the iterate w(t) . That is why we add a square regularizer
∥w − wt ∥22 /2η. We obtain

1
w(t+1) ← arg min F (w(t) ) + ⟨g (t) , w − w(t) ⟩ + ∥w − w(t) ∥22
w 2η

1 (22)
= arg min ⟨g (t) , w − w(t) ⟩ + ∥w − w(t) ∥22
w 2η
= w(t) − ηg (t) .

In order to better understand this for the SGD, let us first introduce the notion of stability.

3.3 Stability
Here we will be studying
Pma notion of stability for loss functions, generically denoted F̂ (w), defined as empirical
1
sums of the form m i=1 f (w, zi ), for a given dataset (zi )1≤i≤m . We start by defining the leave-one-out
stability and replace-one-out stability, and then derive generalization bounds using these quantities, see e.g.
[29].
Definition 3.1. Let β : N → R be a monotonically decreasing function. A learning rule w̃(z1 , . . . , zm ) is
leave one out β(m)-stable if

|f (w̃(z1 , . . . , zm−1 ), zm ) − f (w̃(z1 , . . . , zm ), zm )| ≤ β(m),

and replace one out β(m)-stable if,

|f (w̃(z1 , . . . , zm−1 , z ′ ), zm ) − f (w̃(z1 , . . . , zm ), zm )| ≤ β(m),

where z ′ is a new independent sample from the hidden distribution D.

21
For simplicity we will assume that the learning rule is symmetric. That is w̃(z1 , . . . , zm ) = w̃(σ(z1 ), . . . , σ(zm )),
where σ is any permutation defined on {1, 2, . . . , m}.
1
Pm
Theorem 3.1. Define Fb(w) := m i=1 fi (w). If w̃ is symmetric and β(m) is stable then

e (z1 , . . . , zm−1 ) , zm )] ≤ E[Fb (w


E[F (w e (z1 , . . . , zm ) , zm )] + β(m)

Proof. By symmetry of w
e we have

Ez1 ,...,zm [f (w
e (z1 , . . . , zm−1 ) , zm )]
m
1 X
= E [f (w e (z1 , . . . , zi−1 , zi+1 , . . . , zm ) , zi )]
m i=1

Using the stability of the function f , we get the following

Ez1 ,...,zm [f (w
e (z1 , . . . , zm−1 ) , zm )]
m
1 X
≤ (E [f (w e (z1 , . . . , zm ) , zi )] + β(m))
m i=1
" m
#
1 X
=E f (we (z1 , . . . , zm ) , zi ) + β(m)
m i=1
h i
= E Fb (w em ) + β(m)

This result yields generalization for stable learning rules. However, one needs to take into account that
stability may be tricky in the learning problem. In the case, when the predictor interpolates the data, the
empirical error is equal to zero. But most interpolators hardly satisfy the stability condition with a small
β(m). Thus, the right hand side will be very large and hence non-informative. On the other hand, let us
consider the zero predictor. It is stable, as the rule does not depend on the data. However, it has a very large
empirical error. We therefore need a different rule that is stable and has small empirical error, to guarantee
generalization.

3.4 Strong convexity


Let us define strong convexity for real valued functions.
Definition 3.2. The function Ψ : W → R is α-strongly convex w.r.t. to a norm ∥w∥ if
α ′
Ψ(w′ ) ≥ Ψ(w) + ⟨∇Ψ(w), w′ − w⟩ + ∥w − w∥2 . (23)
2
Strong convexity essentially means that the function is bounded below by a quadratic function. This
property depends on the norm unlike the convexity which can be defined on a vector space. Let us apply
the (23) for the optimum point w0 := arg minw Ψ(w). Since ∇Ψ(w0 ) = 0 we get the following for every w:
α
Ψ(w) − Ψ(w0 ) ≥ ∥w − w0 ∥2 . (24)
2
The latter means that the difference between the function value at any point and the optimal one is propor-
tional to the distance from the optimal point. Let us now establish the connection of strong convexity and
stability. For the dataset S = {z1 , . . . , zm } we define the regularized ERM (RERM) as follows:
n o
RERMλΨ (S) = arg min Fb(w) + λΨ(w) .
w∈W

Here Ψ is an α-strongly convex function and Fb is the empirical mean corresponding to the dataset S. We
recall the definition of Lipschitz continuity.

22
Definition 3.3. The function f (w, z) is G-Lipschitz w.r.t. ∥ · ∥ if and only if for every z ∈ Z and w, w′ ∈ W

|f (w, z) − f (w′ , z)| ≤ G∥w′ − w∥.

One may notice that Lipschitz continuity implies some kind of “stability”, as it essentially means that
small perturbation of the argument will not change the function value drastically. As for strong convexity,
the Lipschitzness also depends on the norm. This yields that ∥∇w f (w, z)∥∗ ≤ G, where ∥ · ∥∗ is the dual
norm. We will assume that the norm is the same for both properties.
Proposition 1. If f is G-Lipschitz and Ψ is α-strongly convex then RERMλΨ (S) is stable with a coefficient

2G2
β(m) ≤ . (25)
mλα
Proof. In the following proof we will abbreviate RERM with R. It is straightforward to check that if f
is G−Lipschitz then so is F̂ (with respect to w, conditionally on the zi ). Denote S = (z1 , ..., zm ) the full
dataset and S (−m) P
= (z1 , ..., zm−1 ) the dataset in which zm is left out. Now define hS (w) = F̂S (w) + λΨ(w),
1 m
where F̂S (w) = m i=1 f (w, zi ). Then hS is λα strongly convex and from equation (24) we have

hS (w) − hS (RλΨ (S)) ≥ λα∥w − RλΨ (S)∥22 . (26)

Now, denote RλΨ (S (−m) ) the solution to the ERM problem with the dataset S (−m) . Then

hS (RλΨ (S (−m) )) − hS (RλΨ (S)) = hS (−m) (RλΨ (S (−m) )) − hS (−m) (RλΨ (S))
1
+ f (RλΨ (S (−m) ), zm ) − f (RλΨ (S), zm ) . (27)
m
Since hS (−m) (w) is positive and strongly convex,

hS (−m) (RλΨ (S (−m) )) − hS (−m) (RλΨ (S)) ≥ 0 (28)

so that

hS (RλΨ (S (−m) ))−hS (RλΨ (S))


1
≤ f (RλΨ (S (−m) ), zm ) − f (RλΨ (S (−m) ), zm )
m
≤ G∥RλΨ (S (−m) ) − RλΨ (S)∥2 (29)

using the Lipschitz continuity of f . Combining this inequality with Eq.(26), we reach

G
∥RλΨ (S (−m) ) − RλΨ (S)∥2 ≤ , (30)
αmλ
and
G2
f (RλΨ (S (−m) ), zm ) − f (RλΨ (S), zm ) ≤ . (31)
αmλ
′ ′
In the case of replace-one-out stability, we define S = (z1 , ..., zm−1 , z ′ ) and the related estimator RλΨ (S ).
Eq.(27) then becomes
′ ′
hS (RλΨ (S )) − hS (RλΨ (S)) = hS ′ (RλΨ (S )) − hS ′ (RλΨ (S))
1 ′
1 ′

+ f (RλΨ (S ), zm ) − f (RλΨ (S), zm ) + f (RλΨ (S ), z ′ ) − f (RλΨ (S), z ′ ) , (32)
m m
leading to
′ 2G
∥RλΨ (S ) − RλΨ (S)∥2 ≤ . (33)
αmλ

23
In the rest of the lecture, without loss of generality, we may assume that the regularization function Ψ
is 1-strongly convex. This is easily achieved by tuning the parameter λ accordingly. Theorem 3.1 yields the
following
h i 2G2
E [F (RERMλΨ (S))] ≤ E Fb (RERMλΨ (S)) + .
λm
Without loss of generality we may assume that Ψ is a positive function. Then using the definition of RERM,
for every w ∈ W we obtain
h i 2G2
E [F (RERMλΨ (S))] ≤ E Fb (RERMλΨ (S)) + λΨ(RERMλΨ (S)) +
λm
h i 2G2
≤ E Fb(w) + λΨ(w) +
λm
2G2
= F (w) + λΨ(w) +
r λm
8G2 supw Ψ(w)
≤ inf F (w) + ,
w∈W m
q
2G2
where the last line is obtained by choosing λ = αm sup . In particular, using the constraint Ψ(w) ≤ B,
w Ψ(w)
the last inequality can be rewritten as
r !
∗ sup{∥∇w f ∥2∗ }B
E [F (RERMλΨ (S))] − F (w ) ≤ O .
m

The last inequality is obtain by optimizing the right-hand side w.r.t. to the parameter λ. Let us now look
back at the Risk minimization problem in the convex case:
min Ez∼D [f (w, z)] = min Ez∼D [loss(⟨w, ϕ(x)⟩, y)].
w∈W w∈W

W is assumed to be convex. If loss(ŷ, y) is convex in ŷ, then the problem is convex. For a non-trivial loss
e.g. loss(hw (x), y) is convex in w only when hw (x) = ⟨w, ϕ(x)⟩. In this setting the Lipschitz continuity goes
as follows. Assume that the loss function loss(y, y′ ) is g-Lipschitz continuous w.r.t. y. Then
|f (w, (x, y)) − f (w′ , (x, y))| ≤ g∥ϕ(x)∥∗ · ∥w − w′ ∥. (34)
In particular, the learning problem becomes G = gR Lipschitz continuous, if we assume that ∥ϕ(x)∥∗ ≤ R,
for some R > 0. Hence the generalization bound becomes
r !
Ψ(w ∗ ) sup ∥ϕ(x)∥

E [F (RERMλΨ (S))] − F (w∗ ) ≤ O .
m

In order for the right-hand side to be small we need to have an appropriate sample size. Below we derive
the sample complexity for several examples.
• Ψ(w) = 12 ∥w∥22 is 1-strongly convex w.r.t. ∥w∥2 then m ∝ ∥w∥22 · ∥ϕ(x)∥22 .
• Ψ(w) = 21 wT Qw is 1-strongly convex w.r.t. ∥w∥Q then m ∝ (wT Qw)(xT Q−1 x). Here we choose Q to
be small in some direction, then we pay for it in its dual Q−1 .
∥w∥2 ·∥x∥2
• Ψ(w) = 2(p−1)
1
∥w∥2p is 1-strongly convex w.r.t. ∥w∥p , then m ∝ p
p−1
q
. Here, we would like the q
norm of the data to be small. Thus, we want q to be large. Hence p must be close to 1, which explodes
the denominator of the sample complexity.

• Ψ(w) = i w[i] log w[i]


P
1/d is 1-strongly convex w.r.t. ∥w∥1 . This problem is called the entropic mini-
∥w∥21 ·∥x∥2∞
mizer. Its sample complexity satisfies m ∝ p−1 .
We see in this example that in order to have good sample complexity we need to have matching geometries
for the data and the parameter.

24
Online learning The stochastic optimization resembles the online learning problem. The optimizer pro-
vides wi . We give it to the adversary to compute f (wi , zi ) and then use the value of f (wi , zi ) to compute
wi+1 .
The stability in online learning setting plays an important role. Consider the Follow The Leader (FTL)
rule. It proposes to choose
m−1
X
ŵm (z1 , . . . , zm−1 ) = arg min f (w, zi ) (35)
w∈W
i=1

However, this is an unstable rule. A better method called Be the Leader (BTL) suggests the following:
m
X
ŵm (z1 , . . . , zm−1 ) = arg min f (w, zi ). (36)
w∈W
i=1

This has great convergence properties, but it is not implementable as we assume to have access to f (w, zm ).
Instead, we regularize the FTL. The Follow the Regularized Leader (FTRL) goes as
m
X
ŵm (z1 , . . . , zm−1 ) = arg min f (w, zi ) + λt Ψ(w).
w∈W
i=1

This algorithm however, does not resemble to the one-pass SGD algorithm (see also equation (22)):

wt+1 = arg min⟨∇f (wt , zt ), w⟩ + λt ∥w − wt ∥22 /2. (37)


w

With the increase of the iteration m, our FTRL needs to minimize a more complex sum-decomposable
function. This becomes costly for large m’s. For convex objectives we can relax the problem. We minimize
the linear approximation of the objective (Linearized FTRL):
*m +
λ 1 X
ŵm (z1 , . . . , zm−1 ) = arg min ∇f (wi , zi ), w + λt Ψ(w). (38)
w∈W m
i=1

This problem is simpler than the FTRL, but it is still more complex than the one-pass SGD because of its
dependence on the previous gradient evaluations. To fill this gap we introduce the mirror descent method.

3.5 Mirror descent


In this part of the lecture, we will present the mirror descent algorithm. We will get a better understanding of
it in the upcoming lectures. Let us define the Bregman divergence. For a given strictly convex, continuously
differentiable function Ψ defined on a convex set Ω, the Bregman divergence between two points x, y in Ω is
given by
DΨ (x | y) = Ψ(x) − (Ψ(y) + ⟨∇Ψ(y), x − y⟩).
Intuitively, it corresponds to the distance, at point x, between the function Ψ and its linearization at point
y, see Figure 8 for an illustration. In particular, for α-strongly convex functions, it holds that DΨ (x | y) ≥
α∥x − y∥2 /2. The mirror descent is then defined as the following iterative scheme:

wt+1 = arg min ⟨∇f (wt , zt ), w⟩ + λt DΨ (w | wt ) (39)


w∈W

W −1 1
= ΠΨ ∇Ψ ∇Ψ(wt ) − ∇f (wt , zt ) , (40)
λt

where ΠW ′
Ψ (w) = minw′ ∈W DΨ (w | w) is a projection set on W with respect to the Bregman distance. One
may easily verify that if Ψ is a quadratic function, then its Bregman divergence is also quadratic, and we
recover the standard gradient descent algorithm, assuming the constraint set W is the domain of definition
of Ψ. Indeed, taking Ψ(x) = 12 ∥x∥22 , we obtain
1
DΨ (x|y) = ∥x − y∥22 (41)
2

25
Figure 8: Graphical representation of the Bregman divergence.

so that

arg min ⟨∇f (wt , zt ), w⟩ + λt DΨ (w | wt ) (42)


w∈W
1
= wt − ∇f (wt , z t ) (43)
λt
Hence the one-pass SGD is also an instance of the mirror descent. Let us now look at the minimization
problem (39). The optimality condition is the following:

0 = ∇w ⟨∇f (wt , zt ), w⟩ + λt DΨ (w | wt )
(44)
= ∇f (wt , zt ) + λt ∇Ψ(w) − ∇Ψ(wt ) ,

which leads to
1
∇Ψ(w) = ∇Ψ(wt ) − ∇f (wt , zt ).
λt
For differentiable, strongly convex functions the gradient is invertible, and we may write the following formula
for wt+1
−1 1
W
wt+1 = ΠΨ ∇Ψ ∇Ψ(wt ) − ∇f (wt , zt ) .
λt
Here ΠW
Ψ (w0 ) is the projection of w0 on W:

ΠW
Ψ (w0 ) = arg min DΨ (w | w0 ).
w∈W

See Figure 9 for illustration of the mirror descent algorithm. Suppose that W is the entire space: W = B.

Figure 9: Suppose that the hypothesis class W is a subset of B, where B is a Banach space. As we have seen
above, the gradients live in the dual space B ∗ . Then each iteration of the mirror descent consists of four
steps. First we take the map ∇Ψ : B → B ∗ . Then we do a step in B ∗ in gradient direction ∇f . At the third
step, we take Ψ−1 , which maps B ∗ to B. The final step is the projection of this point to the hypothesis set
W.

Then in the dual space B ∗ we get a sum of the gradients, as there is no projection in the primal space. We

26
take the initial point w0 = arg minw Ψ(w). This choice is intuitive as one would like to start with a model
with lowest complexity. Thus the mirror descent iteration goes as follows:
t
−1 X 1
wt+1 = ∇Ψ ∇Ψ(w0 ) − ∇f (wi , zi )
λ
i=1 i
t
X 1
= arg min ∇f (wi , zi ), w + ∇Ψ(w) ,
w λ
i=1 i

and we recover the linearized FTRL iteration (38).

3.6 Summary
We presented guarantees for general and stochastic optimization under simple assumptions on the function
being optimized and combined these with generalization guarantees to understand how these interact (in
particular, is optimization or generalization the “harder” thing to do, in terms of sample complexity?). We
studied stability and used it to motivate mirror descent, an abstraction that covers both one-pass SGD
and linearized FTRL. Finally, we saw two different ways to see mirror descent. The first is based on the
link between the primal and the dual spaces. However, it is less transparent as the link function Ψ does
not include information about the reality or model complexity. The other way, is the connection with the
linearized FTRL. Here, Ψ acts as a regularizer which naturally describes the link function as a complexity
measure.

27
4 Lecture 4 : Mirror descent and implicit bias of descent algo-
rithms
To recap, we talked last time about learning being a stochastic optimization problem. We are going to use
z to get unbiased estimators of the gradient of the population loss. For learning, we consider an objective of
the form:

f (w, (x, y)) = loss(⟨w, ϕ(x)⟩, y)


so that we have convexity with respect to w . This is suitably generic because any data set is realizable in
this form: Let ϕ map x to an indicator about its identity, and w selects the appropriate label.
There are two reasons to study convex optimization in a deep learning school: one is because it relates to
mirror descent, and the second is that it allows us to talk about the geometry of the optimization problem
and therefore the inductive bias.
Recall that if we are exploring the loss landscape with gradient descent, we are implicitly staying close
in ℓ2 norm to the initialization.

4.1 Mirror Descent


We derived mirror descent as regularizing this object with respect to optimizing the first order object. We
wrote it in the following form:

wk+1 = arg min ⟨∇f (wk , zk ), w⟩ + λk DΨ (w||wk ) (45)


w∈W

W −1 1
= ΠΨ ∇Ψ ∇Ψ(wt ) − ∇f (wt , zt ) (46)
λt

where

DΨ (w||w′ ) = Ψ(w) − (Ψ(w′ ) + ⟨∇Ψ(w′ ), w − w′ ⟩)


we replace λk with ηk since the regularization is linked to the step size, as was discussed in the previous
lecture.

Figure 10: Graphical representation of mirror descent, see Figure 9.

We cannot directly add points in the primal and dual spaces, so we use a link function. To go from
primal space to dual space, we use the gradient of the potential, and to go back to the primal space from
the dual space,
When the step size is small, we can take a low-order approximation of the Bregman divergence:

DΨ (w||wk ) ≈ ⟨∇Ψ(wk ), w − wk ⟩ − (w − wk )T ∇2 Ψ(wk )(w − wk ) − ⟨∇Ψ(wk ), w − wk ⟩


w−wk small

where we have ignored terms not dependent on w , as they do not affect optimizing with respect to w .

28
We can write the gradient descent problem approximately as:
1
wk+1 = arg min ⟨∇f (wk , zk ), w⟩ + (w − wk )T ∇2 Ψ(wk )(w − wk ) (47)
w∈W ηk
= wk − ηk ∇2 Ψ(wk )−1 ∇f (wk , zk ) , define ρ(wk ) := ∇2 Ψ(wk )−1 (48)
The latter version of the update rule is known as natural gradient descent, which was first popularized
in the context of information geometry by S. Amari (see [1]). It is valid for any potential Ψ (or really for
any metric tensor). For intuition, consider that there is a manifold with some local metric ρ , which is the
Hessian of Ψ , and we optimize on the manifold. Thus, here we take the view that natural gradient descent
is an approximation to mirror descent.
Next, we can take the step size to 0 and write the gradient flow equation:
ẇ = −∇2 Ψ(w(t))−1 ∇f (w(t), z) .
As the step size goes to 0, the stochasticity goes away: in any time interval, as the step size goes to 0,
we take more steps and so we see more samples, so the stochasticity goes away. Thus, instead of writing the
stochastic gradient, we can write the population gradient:
ẇ = −∇2 Ψ(w(t))−1 ∇F (w(t)) . (49)
This is either very convenient because we do not have to worry about the stochasticity, or it is bothersome
because we want to study the stochasticity but cannot.
There are two ways to interpret what is hapening here: when the step size goes to 0, both mirror
descent and natural gradient descent (which are the same in the infinitesimal step size limit) converge to the
Riemannian gradient flow on the population objective. (Note: this is another way to think about what we
are doing with SGD, because as the step size gets smaller and smaller, we are approaching optimizing the
population loss.) The other way to think about natural gradient descent and mirror descent is: we endow
the space with some local geometry and take steps based on that local geometry. Natural gradient descent
corresponds to a forward Euler discretization of the natural gradient flow rule (Eqn. 49):

ẇ(t) = −∇2 Ψ(w(⌊t⌋η ))−1 ∇F (w(⌊t⌋η )) (50)


2 −1
= −∇ Ψ(w(⌊t⌋η )) ∇f (w(⌊t⌋η ), z⌊t⌋η ) (51)

w(kη + ∆t) = ∇Φ−1 (∇Ψ(w(kη)) + ∆t∇f (w(kη), zk )) (52)


⇔ ∇Ψ(w(kη + ∆t)) = ∇Ψ(w(kη)) + ∆t∇f (w(kη), zk ) (53)
Mirror descent represents a slightly more sophisticated discretization of this equation. Suppose we start with
any arbitrary metric tensor
ẇ(t) = −ρ(w(t))−1 ∇F (w(t)) . (54)
What happens if we start with a manifold and seek a complexity measure to optimize over this manifold?
In order to determine the complexity measure that underlies the geometry, we require the metric tensor to
also be a Hessian map. Unfortunately, most metric tensors (smooth mapping from general vector space to
d × d positive definite matrix) are not Hessian maps. In order to have it be a Hessian, we require that:
∂ρij ∂ρik
= .
∂wk ∂wj
This turns out to be almost sufficient as well. Suppose we define:
ρ(w) = I + wwT .
The manifold that we get from looking at the squared norm of w appended to w turns out to be the above.
It turns out that this is not a Hessian map, as the above symmetry doesn’t hold. Thus, starting from a
metric tensor, it’s quite special to actually be a Hessian map, and that is what allows us to determine the
complexity measure that is underlying the geometry. To summarize, NGD is piecewise linear in the primal
space, and MD is piecewise linear in the dual space.

29
4.1.1 Examples of Mirror Descent
Let us now look at how the dynamics look like for some examples of mirror descent. How should we choose
the geometry? We know that:
r !
⋆ Ψ(w⋆ ) sup ||∇f ||⋆
ES∼Dm [F (w̄k )] ≤ F (w ) + O
k

which holds as long as Ψ is 1-strongly convex with respect to our choice of norm. Here, w⋆ is no longer the
best in the norm bounded class, since we do not wish to limit the norm explicitly. The guarantee above
means that we can compete with any w⋆ , we just have to pay for its complexity in the second term.

Question When the metric tensor doesn’t correspond to a Hessian, does that mean we cannot come up
with a global complexity measure? Answer The answer is a combination of “no” and “I don’t know .”

Example 1: Euclidean potential


1 2
Ψ(w) = ||w||2 strongly convex wrt ||w||2 .
2
2 2
Here, k ∝ ||w⋆ ||2 ||ϕ(x)||2 . The metric tensor is just the identity, and the dynamics are the standard GD
dynamics

Example 2: Mahalanobis potential


1 T p
Ψ(w) = w Qw strongly convex wrt ||w||Q = wT Qw
2
Here, k ∝ (w⋆T Qw⋆ )ϕ(x)T Q−1 ϕ(x) . The dynamics correspond to preconditioned gradient descent:

ẇ = −Q−1 ∇F (w) .

Example 3: Entropic potential


X w[i]
Ψ(w) = w[i] log
i
i/d
2 2
here, k ∝ ||w||1 log d ||ϕ(x)||∞ . Here, the Hessian will be diagonal with 1/w on each component, and so the
dynamics will look like:

ẇ = −∇2 Ψ(w)−1 ∇F (w) = −diag(w)∇F (w) = −wi ∂i F (w)

The local geometry is not constant, nor is it uniformly penalized in the same directions everywhere. It
penalizes changing coordinates that are already small.

4.1.2 Smoothness and Batching


2
Definition 4.1. If F (w′ ) ≤ F (w) + ⟨∇F (w), w′ − w⟩ + H
2 ||w′ − w|| , we say that the convex function F is
H-smooth.

For a smooth function, we can get a better convergence rate, which depends on 1/k rather than 1/ k :
r !
⋆ HΨ⋆ (w) σ 2 Ψ⋆ (w) h
2
i
E [F (w̄k )] ≤ F (w ) + O + , where σ s.t. E ||∇f (w, z) − ∇F (z)|| ≤ σ 2 (55)
k k

The first term above is the optimization term that only depends on the discretization (how well does lin-
earization match the true function), and the second term is the stochasticity term, which depends on the
distance between the stochastic approximation and the population objective. Accordingly, it makes sense to

30
take more samples than discretization steps. We can do this via using batches. In particular, if instead we
use:
b
1X
∇f (w, zi ) ,
b i=1

then the variance scales with 1/b, giving us:


r !
⋆ HΨ⋆ (w) σ 2 Ψ⋆ (w)
E [F (w̄k )] ≤ F (w ) + O + . (56)
k bk

We can see here is that optimization is easier than the statistical aspect. It never really helps us to take
more steps than the number of samples. In the non-convex case, this is substantially different: in particular,
the optimization might be harder than the statistical part, and so it might be worth viewing the samples
multiple times.

Remark It is sufficient to look just at variance of the gradients at w⋆ . This is important because if the
problem is realizable, then at the optimum, w⋆ is correct for any z , and therefore the gradients are 0. So if
f ≥ 0, r !
⋆ HΨ⋆ (w) HF (w⋆ )Ψ⋆ (w)
E [F (w̄k )] ≤ F (w ) + O + .
k bk
One more observation: this analysis relies not only on the choice of potential function but also on the
choice of norm: the variance and smoothness depend on the norm, but the algorithm itself doesn’t rely on
the norm. Recent analyses alleviate this by going through relative smoothness [3, 17].

Definition 4.2. A function F is relatively smooth to Ψ up to smoothness parameter H if:

∇2 F (w) ⪯ H∇2 Ψ(w) .

The guarantee in Eqn. 55 holds when H is the relative smoothness parameter.

4.2 General Steepest Descent


So far, we have talked about mirror descent and have related it to Riemannian gradient flow. We can also
think more generally about steepest descent methods that are not related to a metric.
1
wk+1 = arg min⟨∇f (wk , zk ), w⟩ + δ(w, w′ ) .
η
We can take δ(w, w′ ) = ||w − w′ ||1 , for example, and though it appears to correspond to using ℓ1
geometry, it actually will correspond to coordinate descent. Using ℓ∞ will take a step corresponding to the
sign of the gradient.

4.3 Implicit bias of descent methods


So far, we have discussed the convex setting, and we related optimization guarantees and generalization guar-
antees. We get generalization guarantees with respect to w⋆ considering only the dynamics of optimization.
We studied the connection between the geometry of searching the space and the corresponding complexity.
However, there are some limitations to what we have seen so far. First of all, this is only the convex case.
Second of all, this isn’t what we’re seeing in deep learning. Here, we are not driving training error to 0. The
generalization seems to rely on the fact that we are using stochasticity. That is different from what we see
in the examples from the previous lectures where we train with full batch, multi-pass gradient methods but
still get good generalization.
Let us compare two approaches. First, suppose we actually select:

wλ = arg min F̂k (w) + λΨ(w) .

31
And secondly, consider w̄k , the solution achieved after k iterations of mirror descent. In either of these
cases, we get the same generalization guarantee, but for different solutions. In √ the Lipschitz case with

optimally chosen λ, the
√ distance (suboptimality, really) between w √
λ and w is 1/ k . The distance between
w⋆ and w̄k is also 1/ k , but the distance between wλ , w̄ is also 1/ k. We know this because we know that √
we can view one-pass of this as optimizing the training objective, which in k steps gets suboptimality 1/ k .
Thus, we are not saying that implicit regularization gets us to the same solution as explicit regularization,
but rather we are saying that the generalization guarantees hold. If we keep repeating passes, we might get
to minimizer of the training error, but it’s unclear if this is beneficial.

4.3.1 Deriving Implicit Regularization for Gradient Descent


We will analyze gradient descent on the unregularized training objective:
m
1 X
F̂ (w) = loss(⟨w, xi ⟩, yi ) .
m i=1

We’ve dropped ϕ for ease of notation, but we should think of x as being some feature map. Let w ∈ Rd , d >>
m . Let us use gradient descent:
wk+1 = wk − η∇F̂ (wk ) ; w0 = 0 .
We know from previous discussion that this will result in implicit ℓ2 regularization, but we will formally
derive this. Additionally, this will later allow us to understand what happens when using mirror descent.
For the first step, we are going to argue that the iterates wk all lie in a linear manifold given by the span
of the data, i.e.:

wk ∈ M = span(X) = w = X T s s ∈ Rm

This is because each update involves adding a scaled gradient, and the gradient is dependent on a residual.
This already tells us a lot about what we will get: even though we have d parameters, we only will ever be in
an m-dimensional subspace. We will also assume that we get to a global optimum. This is not always easy to
show but are going to assume it. Since we are in the overparametrized setting, we know that Xw = y , where
w is the predictor to which we finally converge. We claim that the point we get is exactly the minimizer of
the following optimization problem:
1 2
min ||x||2 such that Xw = y . (57)
2
We have the optimization that we are presumably optimizing: that is, minimizing the empirical error,
and we have the optimization problem above. In general, these arguments are going to proceed by claiming
that solving the first optimization problem actually implicitly solves the second optimization problem. To
show this, we use the method of Karush-Kuhn-Tucker (KKT) conditions for optimality of a solution to
a constrained optimization problem. The optimum for a constrained optimization problem is uniquely
characterized by its KKT conditions. Let us look at the KKT conditions. We introduce dual variables ν for
the constraints, giving us that the Lagrangian is:
1 2
L(w, ν) = ||w||2 + ν T (y − Xw) .
2
The KKT conditions are:
• Stationarity 0 = ∇w L = w − X T ν
• Primal feasibility y = Xw .
We call ν the dual variables, or the variables in the dual space. We observe that the stationarity
condition shows exactly what we claimed earlier, that wk are in the span of the data, and the primal
feasibility condition finds the interpolator. Since a point that satisfies these two conditions is optimal for
the optimization problem in Eqn. 57, and since gradient descent that arrives at a 0 training error predictor
satisfies these two conditions, gradient descent finds an optimal solution to the problem in Eqn. 57, i.e., a
minimum Euclidean norm solution.

32
4.3.2 Similar Argument for Mirror Descent
With this argument in mind, we can apply a similar procedure to analyze the output of mirror descent. We
are going to show that with mirror descent, we get to a solution that is good with respect to the stated
potential. To show this, we will use essentially the same proof. The optimization problem is still the same
as before: we are minimizing the training objective, but the optimization algorithm will differ. Now, we
optimize F̂ using mirror descent with respect to some potential function Ψ. Let us reproduce the iterates:

wk+1 = arg min ⟨∇f (wt , zt ), w⟩ + λt DΨ (w||wt )


w∈W

In mirror descent, we accumulate the gradients in the dual space.


k
X
wk+1 = ∇Ψ−1 ∇Ψ(w0 ) − ∇F̂ (wk )
i=1

That is, the k + 1 iterate is the mapping back into the primal space of the accumulation of the gradients
in the dual space. As before, we can see that the gradients lie in the span of the data (in the dual space).
Thus, again, wk ∈ M = {∇Ψ−1 (∇Ψ(w0 ) + X T s) ∀ x ∈ Rm } . This is not a flat manifold (zero curvature)
in the primal space, but it is in the dual space. To see what point we converge to, we use the fact that in
the end, we converge to a global optimum. This imposes m linear constraints, which when intersected with
an m-dimensional manifold gives us a unique point. To determine which unique point that is, we write an
optimization problem whose KKT conditions match the two sets of constraints (global optimality and lying
in the manifold):
min DΨ (w||w0 ) such that Xw = y .
Then the Lagrangian is:
L(w, ν) = DΨ (w||w0 ) + ν(y − Xµ)
To see this, we write the KKT conditions:
• Stationarity 0 = ∇Ψ(w) − Ψ(w0 ) − X t ν
• Primal Feasibility Xw = y

This is exactly what was specified earlier.

4.3.3 General Method


We started from the optimization problem of minimizing the training objective and then saw trajectory stays
within a manifold. The second set of constraints we imposed came from the global optimality of the solution
reached. We then matched these two sets of constraints to KKT conditions for some other optimization
problem.
If w0 were the minimizer of the potential function, then the optimization problem becomes minimizing
the potential function subject to Xw = y.

33
5 Lecture 5: Implicit bias with linear functionals and the square
loss
The goal of this lecture is to provide analytical evidence to the lazy [8] and rich regimes in learning problems
with gradient descent. We will consider a simple model for which the dynamics of gradient descent may be
solved exactly and we will study these trajectories as a function of the magnitude of the initialization and
model architectures, leading to various implicit biases.

5.1 Setting
We consider models parametrized by a weight vector w ∈ Rp acting on an input space X and denote
F (w) ∈ {f : X → R} the predictor implemented by w, such that F (w)(x) = f (w, x). We will focus
on models linear in x represented by a linear functional in the dual space of X , denoted X ∗ represented
by a vector β w such that f (w, x) = ⟨β w , x⟩, i.e. β w is some transform, potentially non-linear of w. We
consider the supervised learning problem with n sample pairs (xi , yi ) where yi ∈ R is a response vector. The
parameters are learned by minimizing the empirical loss
n
1X
loss(f (w, xi ), yi ) (58)
n i=1

with the corresponding population loss

Ex,y [loss(f (w, y), x)] . (59)

We assume the model is homogeneous of order D, i.e., for any constant c > 0 F (cw) = C D hw . The order D
is related to the depth of networks with homogeneous activations (e.g. a linear or Relu). A linear model is
homogeneous of order 1, a factorization model of order 2. Our goal is three-fold :
• first, we want to characterize the implicit bias of gradient descent in this setup. In regards to the
previous lectures, is optimzing in the parameters space using gradient descent equivalent to optimizing
in the functional space w.r.t. some metric tensor and potential ? What is this potential
• secondly is it equivalent to explicit regularization ? For instance, we saw that GD on a least-squares
problem converges towards the minimum ℓ2 norm solution, equivalent to explicitly penalizing the norm.
Same here ?

• finally, we would like to study the transition between kernel (a.k.a. lazy) regime and rich regime. What
parameters govern this behaviour ? We have seen in the previous lectures that in many cases we obtain
implicit biases that cannot be reached with kernel methods (sparsity, nuclear norm, ...). Can we get a
more precise picture on simple models ?

5.2 Reminder on the kernel regime

Consider the function computed by the model f (w, x). We can take its first order approximation around
the initalization of GD w0 .

f (w, x) = f (w0 , x) + ⟨w − w0 , ∇w f (w0 , x)⟩ + O ∥w − w0 ∥22



(60)

In what follows, we will sometimes write ϕ0 (x) = ∇w f (w0 , x). In certain regimes, this linear approximation
is always valid across training and the model behaves as an affine model f (w, x) = f0 (x) + ⟨w, ϕ0 (x)⟩ with
feature map ∇w f (w0 , x) corresponding to the tangent kernel K0 (x, x′ ) = ⟨∇w f (w0 , x), ∇w f (w0 , x′ )⟩. GD
then learns the corresponding minimum RKHS distance to the initialization F (w0 ) solution, i.e. arg minh ∥h−
F (w0 )∥K0 s.t. h(X) = y. We can avoid the F (w0 ) by choosing a familiy of functions verifying F (w0 ) = 0
(unbiased initialization from [8]). Then are we just studying an uninformative regime or can we really replace

34
neural nets with linear models ? In what case do we get this kernel regime ?

Initially, was linked to the width, taking width to infinity leads to a Gaussian process which is a kernel
regime. See e.g. [13, 9]. Closer to what we want to do is [8]. Regardless of the width, can always reach the
kernel regime when the scale of the initialization goes to infinity. In what follows, we will mainly consider
gradient flow, i.e.
dw
= −∇w L̂(w) (61)
dt
Initialize at different scales, i.e. w(0) = αw0 where α > 0 and w0 can be any w0 which can be random,
and such that F (w0 ) = 0 (i.e. w0 maps to a null function in function space). For any α, we will write the
dynamics
dwα
= −∇w L̂(wα ) (62)
dt
The result from [8] then states that when α goes to infinity, after a appropriate rescaling of time, the entire
trajectory converges to the kernel one :
1
lim sup∥wα ( D−1 t) − wK (t)∥∞ = 0 (63)
α→∞ t α
where wK represents the vector obtained by running gradient descent on the tangent kernel model : ẇ(K) =
−∇w L̂(⟨w, ϕ0 (x)⟩).

5.3 A simple model : 2-layer linear diagonal network


We would like a simple model going beyond the linear case (for which we understand the phenomenology),
that can still be studied analytically and where the implicit bias is interesting. We are going to look at
element-wise squaring of parameters, i.e.
f (w, x) = ⟨w2 , x⟩ β = F (w2 ) (64)
where for a given vector, 2 denotes the elementwise squaring. This is equivalent to a depth 2 diagonal linear
network. Here we cannot get all linear functions this way because
the coefficients cannot be negative. In
w+ 2 2
order to allow negative coefficients, we used 2d parameters w = and β w = w+ −w− . This is equivalent
w−

x
to a depth-2 diagonal linear network with 2d parameters on the first layer f (w, x) = w⊤ diag(w) , as
−x
in the following figure

Figure 11: Depth-2 diagonal linear net with replicated an signed units in the first layer

Negative functions can thus be represented and we can also choose the initialization such that F (w0 ) = 0.
If w0 = 1d , then at initialization, β 0 = 0 whatever the scale of initialization α, where w+,α (0) = w−,α (0) =
αw0 . What’s the implicit bias of doing gradient descent on this model, when considering the square loss ?
1
L(ŷ, y) = (ŷ − y)2 (65)
2

35
5.3.1 Analytical study of GD in parameter space
Applying the chain rule gives, denoting X ∈ Rn×d the design matrix
dβ dL(β)
ẇ+ = − = −2X⊤ r(t) ⊙ w+ (t) (66)
dw+ dβ
dβ dL(β)
ẇ− = − − = 2X⊤ r(t) ⊙ w− (t) (67)
dw dβ
(68)
where we defined the residual r(t) = Xβ(t) − y and ⊙ denotes the elementwise product.
Consider that the residuals are known, then these are differential equations that can be solved. Integrating
yields
Z t
w+ (t) = w+ (0) ⊙ exp{−2X⊤ r(τ )dτ } (69)
0
Z t
w− (t) = w− (0) ⊙ exp{2X⊤ r(τ )dτ } (70)
0
⊤ 2 2

where r(t) = X w+− w− − y. Although this is just a rewriting as integral equations, we can get
Rt
important information from this. Letting S(t) = 0 r(τ )dτ :
2
(0) ⊙ exp −4X⊤ S(t) − w− 2
(0) ⊙ exp 4X⊤ S(t)

β(t) = w+ (71)
2 ⊤ ⊤

= α 1d ⊙ exp −4X S(t) − exp 4X S(t) (72)
2 ⊤

= 2α 1d ⊙ sinh −4X S(t) (73)
where sinh is the hyperbolic sine. We are interested in the regime where n << d i.e. number of samples much
lower than dimensionality d, so they are many solutions to the problem Xβ = y. However, Eq.(71)-(72)
shows that we have reduced the set of solutions to a lower dimensional manifold of dimension n. This is
similar to the study of GD on linear regression, i.e. the predictor is always spanned by the data. Here we
observe the same thing on a non-linear model. This manifold is defined by
M = β = α2 1d ⊙ exp −4X⊤ s − exp 4X⊤ s |s ∈ Rn

(74)
2 ⊤ n

= β = 2α 1d ⊙ sinh −4X s |s ∈ R (75)
where we have n additional constraints defined by the equation Xβ = y leading to a unique solution. Note
that we are not studying the convergence of gradient flow, we assume it converges. Let’s write an equivalent
optimization problem giving the same set of solutions. We want to find a function Qα (β) that is implictly
minimized by the GF dynamics such that
β ∗ ∈ min Qα (β) (76)
β

s.t. Xβ = y (77)
The Lagrangian formulation for this problem reads
min max Qα (β) + ν ⊤ (y − Xβ) (78)
β ν

The KKT conditions then read


Xβ = y (79)

∇β Qα (β) = X ν (80)
Using Eq.(73), we may write

1
sinh−1 β = −4X⊤ s (81)
2α2

36
where the inverse hyperbolic sine is applied elementwise. Matching this with the optimality condition then
gives
Z ∞
ν = −4 rα (t) (= −4s) (82)
0

1
∇β Qα (β) = sinh−1 β (83)
2α2
Integrating this expression, remembering that the gradient of an element-wise function is applied element-
wise, yields
d
X βi
Qα (β) = α2 q( 2 ) (84)
i=1
α
Rz √
where q(z) = 0 sinh−1 ( 2t )dt = 2 − 4 + z 2 + zsinh−1 ( z2 ).
We can now study this function for different scalings of α. Plotting q gives the following figure :

Rz √
Figure 12: q(z) = 0
sinh−1 ( 2t )dt = 2 − 4 + z 2 + zsinh−1 ( z2 )

For α → ∞, we may look at q around zero, a second order Taylor expansion shows that q is quadratic
around zero. Thus for large α, the regularization is effectively quadratic on each coordinate. In this model,
we can show that the tangent kernel at initialization is the linear kernel K0 (x, x′ ) = ⟨x, x′ ⟩ thus the solution
converges to the minimum ℓ2 norm solution
α→∞
β α (∞) −−−−→ β̂ L2 = arg min∥β∥2 Kernel regime (85)
Xβ=y

For small values of α however, q becomes close to a ℓ1 norm and we obtain an effective ℓ1 , sparsity inducing
implicit regularization which does not correspond to a kernel regime, but a rich regime.
α→0
β α (∞) −−−→ β̂ L1 = arg min∥β∥1 Rich regime (86)
Xβ=y

We thus have a transition from kernel inductive bias to a sparsity inducing inductive bias. The rich learning
regime corresponds to the ℓ1 regime, which corresponds to learning : learning feature corresponds to selecting
a small number of features among an infinite amount of features. A more precise characterization of this
transition can be found in Theorem 2 from [33], which we reproduce here.
Theorem 5.1 (Theorem 2 from [33]). For 0 < ϵ < d, under the above setting,
n − 2+ϵ o
α ⩽ min 2(1 + ϵ)∥β ∗ℓ1 ∥1 2ϵ
, exp(−d/(ϵ)∥β ∗ℓ1 ∥1 ) =⇒ ∥β ∞ ∗
α,1 ∥1 ⩽ (1 + ϵ)∥β ℓ1 ∥1
q
α ⩾ 2(1 + ϵ)(1 + 2/ϵ)∥β ∗ℓ2 ∥ =⇒ ∥β ∞ 2 ∗ 2
α,1 ∥2 ⩽ (1 + ϵ)∥β ℓ1 ∥2
2

37
The sparsity inducing bias is fundamentally different from the kernel regime. Consider the following
example of sparse regression
yi = ⟨β ∗ , xi⟩ + γ where γ ∼ N (0, 0.01) (87)
where d = 1000; ∥β ∗ ∥0 = 5 and we have n = 100 samples. A kernel method cannot solve this problem : we
can see this on the following figure For large α, we are in the kernel regime and the excess ℓ2 norm is small,

Figure 13: Generalization error of kernel and rich regime on a sparse regression problem with n << d

but the population error is large. For small alpha however, both the excess ℓ1 norm and the population error
are small.
Getting to the ℓ1 regime is actually quite difficult. See this from Theorem 2 : α has to be exponentially
small. What’s the sample complexity as a function of α, or conversely, how small does α has to be to reach
good performance, let’s say L(β α (∞)) ⩽ 0.025 Thus, for very low number of samples, doing the exact ℓ1 is

Figure 14: Threshold value of alpha varies with the number of samples

impossible (vertical asymptote), but for a reasonable amount of samples, we can get good performance with
an approximate ℓ1 . This concludes the link between the scaling of initialization α to the kernel and rich
regimes.

5.3.2 Studying the dynamics in function space


Whats do the dynamics look like in function space ? Recall that the function space is parametrized by β,
thus we want to write the solution directly on β instead of w, and β = F (w).

β̇ = ẇ = ∇F (w(t))⊤ ẇ (88)
dw
= ∇F (w(t))⊤ (−∇w L(w(t))) (89)

38
where ∇F (w(t)) ∈ Rp×p is the Jacobian of F . The chain rule then gives

∇w L(w(t)) = ∇w L(β(w(t))) (90)


= ∇F (w(t))∇L(β) (91)

giving the dynamics


β̇ = −∇F (w(t))⊤ ∇F (w(t))∇L(β) (92)
which corresponds to what the previously discussed Riemanian gradient flow, with metric tensor ρ =
−1
∇F (w(t))⊤ ∇F (w(t)) , i.e.
β̇ = −ρ−1 ∇L(β). (93)
Thus choosing a certain parametrization F induces a geometry in the search done by the gradient descent,
governed by the metric tensor ρ. But ρ is a function of w(t): in the case of the model discussed previously,

diag(w+ )
∇F (w(t)) = ∈ R2d×d (94)
diag(w− )
2 2 −1

ρ(w(t)) = diag w+ + w− (95)

We would now like to rewrite this entirely as functions of β, i.e. write

β̇ = −ρ(β)∇L(β) (96)

This is problem dependent, and possible here. Recall the dynamics on w+ and w− from the previous section,
we have
d
(w+ ⊙ w− ) = −2X⊤ r(t) (w+ ⊙ w− − w− ⊙ w+ ) = 0 (97)
dt
We can thus evaluate this quantity at t = 0 which gives

∀t, w+ ⊙ w− = α2 1d (98)

Note that this also appears immediately by considering Eq.(69)-(70) along with the fact that w+ (0) =
w− (0) = α1d . Furthermore, by definition of β(t)
2 2
β(t) = w+ − w− , (99)

which leads to an element-wise quadratic equation that we can solve. Since the equations are the same for
2
each coordinate, we drop the coordinate index. Squaring both sides of Eq.(98) and replacing w+ (t) with
2
β(t) + w− (t) using Eq.(99), we obtain
4 2
w− (t) − βw− (t) + α4 = 0. (100)
2
This is a quadratic equation in w− whose positive solution reads
p
2 −β + β 2 + 4α4
w− = (101)
2
2
w+ is obtained in similar fashion, leading to
p
2 β+ β 2 + 4α4
w+ = (102)
2
Which leads to the following expression for the metric tensor ρ and the corresponding dynamics
p −1
ρ−1 = diag β 2 + 4α4 (103)
p −1
β̇ = −diag β 2 + 4α4 ⊙ ∇L(β) (104)

39
We can recover the previously discussed phenomenology from this equation. For large α, the β 2 term is
negligible and we recover standard gradient flow dynamics with ℓ2 geometry in the function space. If α = 0,
the scaling in front of the gradient is proportional to the absolute value of β. Thus higher absolute value
coefficients will decay more, which will promote sparsity. Now, does this metric tensor correspond to a
Hessian map defining a mirror descent ? To check this we need to solve ρ = ∇2 Ψ where Ψ is the potential
defining the Bregman distance used for mirror descent. In general, a metric tensor is not a Hessian map,
but here it is the case, mostly thanks to the diagonal structure. We can then simply integrate each element
on the diagonal twice. Performing this double integral yields the following potential
d r !
2

X β β β
Ψ(α, β) = α2 2
sinh−1 − 4+ 4 (105)
i=1
2α 2α2 α

Up to a constant, this is the same potential as the one implicitly being minimized by the gradient descent.

5.3.3 Comparing explicit and implicit regularization


Is this equivalent to explicity regularization using the ℓ2 norm ?

βRα,w0 = F arg min∥w − αw ∥2
0 2 s.t. L(w) = 0 (106)
w
= arg min Rα,w0 (β) s.t.Xβ = y (107)
β

where Rα,w0 = min∥w − αw0 ∥22 s.t. F (w) = β. (108)


w

Using a similar analysis as before, we study the optimization problem in parameter space over β which is
equivalent to the optimization problem in weight space over w where we use explicit ℓ2 regularization. For
standard linear regression this is a classical result, gradient descent converges to the min ℓ2 norm solution
(in that case β = w). When the initialization is w0 = 1, one can determine an analytical expression for
Rα,w0 : X
Rα,1 (β) = r(β i /α2 ) (109)
i

where r(z) is the unique real root of pz (u) = u − 6u + (12 − 2z 2 )u2 − (8 + 10z 2 )u + z 2 + z 4 . The next figure
4 3

shows a plot of r(z) next to q(z) obtained from the implicit bias analysis. The functions are very close to

Figure 15: Comparing explicit and implicit regularization

one another, even if a more refined analysis shows that the rich regime can be reached with a polynomial
scale of α with r instead of an exponential one with q. See the discussion in [33] for more detail.

40
5.4 The effect of width
For now we have studied the effect of the scale of the initialization on the regime in which the dynamics
operate. What is the effect of the width ? To find out, consider now that the model we want to learn is the
function
X
f ((U, V), x) = ui,j vi,j x[i] = ⟨UV⊤ , diag(x)⟩ (110)
i=1,..,d,j=1,..,k

where U, V ∈ Rd×k , and we learn the model by minimizing


N
X 2
L(U, V) = (⟨Xn , MU,V ⟩ − yn ) = L̃(MU,V ) (111)
n=1

using gradient flow, and we defined the map MU,V = F (U, V) = UV⊤ in the notations of the previous
section. This model can be considered as an extension of the linear model from above over matrix valued
observations, with an additional width parameter k, i.e. a matrix factorization problem or a wide parallel
linear network. The goal is to study the combined effect of α and k on the learning regime. Since the number

Figure 16: Wide parallel linear network

of parameters grows with the width k, we capture the scale of initialization with the parameter
1
∥MU,V ∥F (112)
d
We will see that MU,V can be in the kernel regime even if σ goes to 0, depending on the relative scaling
with k. In the symmetric case where MW = WW⊤ , the gradient flow reads

ṀW(t) = ∇L̃(MW(t) )MW(t) + MW(t) ∇L̃(MW(t) ) (113)

thus the entire dynamics is described by MWW⊤ . In the asymmetric case MU,V this is not true. We may
then consider the following ”lifted” problem defined by

UU⊤ MU,V

M̄U,V = (114)
MU,V V V ⊤

1 0 Xn
and the corresponding ”lifted” datapoints X̄ n = , where we consider that the datapoints are
X⊤
2
n 0
matrices in Rd×d , not necessarily diagonal. The implemented function is

f¯((U, V ), X̄) = ⟨M̄U,V , X̄⟩ (115)

the output of which is the same as the original model but now M̄U,V is the relevant matrix
h to study ithe
α2
problem. Assume that U(0), V (0) are initialized with N (0, σ = k ). This way Var diag(UV ⊤ )[i] =
2 √

α2 . In the case where the measurements commute, the following theorem is proven in [33] (we note that
the definition of scaling parameters are different from the paper here, but ultimately the statements are
equivalent)

41
√ √
Theorem 5.2. Let k → ∞, α(k) → 0 and µ := limk→∞ α4 k = σ(k) k and suppose that X1 , ..., Xn
commute. If MU,V (t) converges to a zero error solution M∗U,V , then

M∗U,V = arg min Qµ (spectrum(M)) s.t. L(M) = 0 (116)


M

where Qµ is the same function as before, now applied to the spectrum of M.


We see that the parameter governing whether or not the function q behaves like a square or an absolute
value is µ, which involves both the scale of initialization and the width of the problem :

• if α = o(1/k 1/4 ), i.e. σ = o(1/ k), we have an ℓ1 implicit bias and rich regime,

• if α = O(1/k 1/4 ), i.e. σ = O(1/ k), we have an ℓ2 implicit bias and kernel regime,

• kα2 → 0 leads to the kernel regime, even if ∥β(0)∥ ≃ α2 → 0

Figure 17: Rich and kernel regime in the matrix factorization problem

5.5 Deep diagonal networks


We now turn to the study of the effect of depth on the learning regime. To do so, we consider a deep variant
of the model introduced above, a depth D diagonal linear network :

β(t) = w+ (t)D − w− (t)D andfD (w, x) = ⟨w+ (t)D − w− (t)D , x⟩ (117)

We assume that gradient flow is initialized with w+ (0) = w− (0) = α1, and define the residual at each time

Figure 18: Deep diagonal linear network

42
step r(t) = Xβ(t) − y. Writing gradient flow on this model, with the square loss, reads
dL
ẇ+ = − = −DX ⊤ r(t) ⊙ w+
D−1
(118)
dw+
dL
ẇ− = − = DX ⊤ r(t) ⊙ w+
D−1
(119)
dw−
which integrates to
Z t 1
− D−2
2−D ⊤
w+ = α + D(D − 2)X r(t)dt (120)
0
Z t 1
− D−2
2−D ⊤
w− = − α + D(D − 2)X r(t)dt (121)
0

and
Z t D
− D−2
D D−2 ⊤
β(t) = α 1+α D(D − 2)X r(t)dt (122)
0
Z t D
− D−2
D D−2 ⊤
−α 1+α D(D − 2)X r(t)dt (123)
0
R∞
Letting s = 0
r(τ )dτ and assuming a zero error solution is achieved, we may write

β(∞) = αD hD X⊤ s and Xβ(∞) = y



(124)
− D − D
where hD = αD 1 + αD−2 D(D − 2)z D−2 − αD 1 + αD−2 D(D − 2)z D−2 (we note that the function
with different coefficients in the paper, but ultimately the statements are equivalent). Recall that we are
searching for an equivalent problem of the form

β ∗ ∈ inf Q(β) s.t. Xβ = y (125)


β

with the corresponding Lagrangian

L(β, ν) = Q(β) + ν ⊤ (Xβ − y) (126)

for which the KKT optimality conditions read

Xβ = y and ∇Q(β ∗ ) = X ⊤ ν (127)

. We can then match s with ν and identify thepotential


Q by matching its gradient with the inverse of hD .
β[i]
Then, define qD = h−1
R P
D and Q D (β) = q
i D αD . It is proven in [33] that

t
α2−D
Z

∀t ∥X r(τ )dτ ∥∞ ⩽ (128)
0 D(D − 2)

so the domain of hD is the interval [−1, 1] upon which it is monotonically incresing, ensuring the existence
of the inverse mapping h−1
D . Then, for all depth D ⩾ 2, this equivalent cost induces a rich implicit bias for
α → 0, i.e.
lim β ∞ ∗ ∞ ∗
α,D = β ℓ1 lim β α,D = β ℓ2 (129)
α→0 α→∞

Although the same behaviour is observed as for the D = 2 case, there are actually two main differences. The
first one is that, for D > 2, explicit regularization does not lead to a sparse bias. Indeed, for

Rα (β) = min ∥w − α1∥22 (130)


D −wD
β=w+ −

43
leads to

R_α(β) → ∥β∥_{2/D}  as  α → 0    (131)

i.e. the 2/D quasi-norm, which differs from the ℓ1 norm obtained as the implicit bias (the two coincide only for D = 2). The second difference concerns the intermediate regime, namely how fast the scale at initialization must go to zero for the sparsity-inducing bias to kick in. We have seen above that, for D = 2, an exponentially small scale α (relative to the target accuracy) is required to enter the rich regime. As soon as D = 3, however, a polynomially small scale α suffices, and the deeper the network, the faster we reach the rich regime as α decreases. This is illustrated by plotting the shape of q_D for different values of D.

Figure 19: Implicit cost function for deeper networks
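The following sketch (Python/NumPy, with scipy used only for the reference linear program; the data, step size and iteration budget are arbitrary choices of ours, not taken from [33]) illustrates these statements on a sparse underdetermined regression: at depth 2 a large α behaves like the ℓ2-biased kernel regime, while at the same moderate α a depth-4 network should land noticeably closer to the minimum-ℓ1 interpolant than the depth-2 one.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n, d = 20, 40
    X = rng.standard_normal((n, d))
    beta_star = np.zeros(d); beta_star[:3] = [1.0, -1.0, 0.5]   # sparse ground truth
    y = X @ beta_star

    def run_gd(D, alpha, lr=5e-3, iters=200_000):
        """Gradient descent on the depth-D diagonal net beta = w_+^D - w_-^D, square loss."""
        wp = alpha * np.ones(d)
        wm = alpha * np.ones(d)
        for _ in range(iters):
            r = X @ (wp**D - wm**D) - y
            g = X.T @ r / n                  # gradient with respect to beta
            wp, wm = wp - lr * D * g * wp**(D - 1), wm + lr * D * g * wm**(D - 1)
        return wp**D - wm**D

    # Reference interpolants: minimum l1 norm (linear program) and minimum l2 norm.
    res = linprog(np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=[(0, None)] * (2 * d))
    assert res.status == 0
    print(f"l1 norm of the min-l1 interpolant: {res.fun:.3f}")
    print(f"l1 norm of the min-l2 interpolant: {np.abs(np.linalg.pinv(X) @ y).sum():.3f}")

    for D, alpha in [(2, 2.0), (2, 0.3), (4, 0.3)]:
        beta = run_gd(D, alpha)
        print(f"D = {D}, alpha = {alpha}: train error {np.linalg.norm(X @ beta - y):.1e}, "
              f"l1 norm {np.abs(beta).sum():.3f}")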

5.6 Beyond linear models


Recall our setup, minimizing a loss function defined over a dataset with gradient descent. The predictor
function hw (x) = f (w, x) is parametrized by w ∈ Rp . So far we have focused on linear models taking the
form
hw (x) = ⟨β w , x⟩ (132)
where β_w = F(w) ∈ R^d. Now consider the generic case where no linearity assumption is made on the predictor. The map F is now defined from R^p to R^X. We may write the dynamics on h_w(x) in a similar fashion as before, using functional derivatives:

ḣw (x) = −∇F ⊤ ∇F ∇h L(h) (133)

where ∇_h L(h) is now an element of R^X and ∇F^⊤∇F is a linear map from R^X to R^X, with a kernel taking the form

ρ^{−1}(x, x′) = ⟨∇_w f(w, x), ∇_w f(w, x′)⟩    (134)

Thus the dynamics in function space is a gradient flow according to the metric tensor defined by the tangent kernel at each time step. In the kernel regime, this metric tensor is fixed and remains the same as the one at initialization throughout the dynamics, whereas in the generic case it changes at each time step.
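For the squared parameterization β_w = w_+² − w_-² discussed above, the tangent kernel can be written in closed form, ρ^{−1}(x, x′) = 4 Σ_i (w_{+,i}² + w_{-,i}²) x_i x_i′, so it is easy to track how much it moves during training. The short sketch below (data, step sizes and iteration counts are arbitrary illustrations of ours) compares a small and a large initialization scale α; one should see the kernel drift substantially in the first case and barely move in the second.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 10, 30
    X = rng.standard_normal((n, d))
    y = X @ np.concatenate([np.array([1.0, -1.0]), np.zeros(d - 2)])

    def tangent_kernel(wp, wm):
        """Gram matrix K[a,b] = <grad_w f(w,x_a), grad_w f(w,x_b)> for f(w,x)=<w_+^2-w_-^2, x>."""
        scale = 4.0 * (wp**2 + wm**2)          # per-coordinate weight of the kernel
        return (X * scale) @ X.T

    for alpha in (0.05, 5.0):
        wp = alpha * np.ones(d)
        wm = alpha * np.ones(d)
        K0 = tangent_kernel(wp, wm)
        lr = min(1e-2, 2e-2 / alpha**2)        # crude step-size adjustment across scales
        for _ in range(50_000):
            r = X @ (wp**2 - wm**2) - y
            g = X.T @ r / n
            wp, wm = wp - lr * 2 * g * wp, wm + lr * 2 * g * wm
        K1 = tangent_kernel(wp, wm)
        drift = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
        print(f"alpha = {alpha}: train err {np.linalg.norm(X @ (wp**2 - wm**2) - y):.2e}, "
              f"relative kernel change {drift:.2f}")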

6 Lecture 6: Implicit bias with linear functionals and the logistic
loss
Summary of the previous lecture We’re trying to understand how the choice of optimization geometry
(i.e., what our preferred metric is in local updates) affects where the optimization will lead us. Last time, we
also spoke about the geometry of parameter space (usually just Euclidean geometry), but really what mattered
was the geometry in function space. The relationship between parameter and function space is:

hw (x) = f (w, x) or hw (x) = ⟨βw , x⟩ where βw = F (w) .

We can describe what the geometry of our model looks like in function space based on the Jacobian of the
model F . Now we can do gradient flow with this inverse metric tensor:

β̇ = −∇F T ∇F ∇β LS (β) .

We also discussed how this relates to explicit ℓ2 regularization and showed a setting where it was quite
different. The metric tensor really depends on where we are in parameter space, which isn’t conducive to
studying the dynamics on the function, but in some cases, this can be circumvented.
We can write down the entries of ρ−1 :

ρ−1 (x, x′ ) = ⟨∇w f (w, x), ∇w f (w, x′ )⟩ .

This is the tangent kernel at the position w . We are conditioning the dynamics on the position w at the
given time. In the kernel regime, the location at which the Jacobian is evaluated doesn’t change significantly
and so the same kernel matrix governs the dynamics at every step / at all times.
To finish our recap from last lecture, the simplest example in which we can see non-trivial behavior is the squared parameterization model: f(w, x) = ⟨β_w, x⟩ with β_w = F(w) = w_+² − w_-², or in the deeper case β_w = F(w) = |w_+|^D − |w_-|^D. We talked about initializing at w(0) = α1, which gives β(0) = 0. We saw that for D = 2, for large α, we got the kernel regime, and when we take α to 0, we get ℓ1 regularization. Even when D > 2, when we take α → 0, we still get ℓ1 regularization, despite explicit ℓ2 norm regularization on the parameters not giving ℓ1 norm regularization in function space.
Today, we will look at the logistic loss and see that a similar implicit bias effect appears in the classification setting as well.

6.1 Problem setting and equivalent reformulation


Consider a binary classification problem where we minimize the logistic loss over a data set using gradient descent,

L_S(w) = (1/n) Σ_{i=1}^n loss(⟨w, x_i⟩, y_i)    (135)

where w ∈ R^d, the y_i are binary labels and loss(ŷ, y) = log(1 + exp(−ŷ y)), and we minimize this loss with

wk+1 = wk − η∇LS (wk ) (136)

In the overparametrized case, i.e. d > n, the data is (generically) separable and we may drive the loss to zero by taking any separating hyperplane and letting its norm go to infinity. Since the optimal solution diverges, we focus on the direction of the iterates,

lim_{k→∞} w_k / ∥w_k∥_2    (137)

which converges to the max margin separator, i.e. the hyperplane furthest away (in Euclidean distance) from all the points. This can be equivalently rewritten as a convex optimization problem under inequality constraints

w* ∈ arg min_{w∈R^d} ∥w∥_2    (138)

s.t.  y_i ⟨w, x_i⟩ ⩾ 1,  i = 1, ..., n    (139)

The Lagrangian for this problem reads
L(w, ν) = (1/2) ∥w∥²_2 + Σ_{i=1}^n ν_i (1 − y_i ⟨w, x_i⟩)    (140)

where ν ⪰ 0. Primal feasibility and stationarity then require

y_i ⟨w, x_i⟩ ⩾ 1    and    ∇_w L = 0    (141)

the latter implying w = Σ_i ν_i y_i x_i, meaning the separating hyperplane is supported by the data vectors. Complementary slackness then indicates that the only non-zero coefficients ν_i are those associated with datapoints verifying y_i ⟨w, x_i⟩ = 1, i.e. where the constraint is active. For any x_i verifying y_i ⟨w, x_i⟩ > 1, the corresponding Lagrange multiplier ν_i will be zero.

6.2 Gradient flow dynamics


What does GF look like on this problem? Since we are interested in the interpolating regime, assume that the dynamics converge to a small error ϵ. In this regime, we may approximate the logistic loss by its exponential tail, i.e.

log(1 + exp(−ŷ_i y_i)) ≃ exp(−ŷ_i y_i)    ∀i    (142)

The gradient then reads (absorbing the labels y_i into the x_i to lighten notation)

−∇_w L(w) = (1/n) Σ_{i=1}^n e^{−⟨w, x_i⟩} x_i    (143)
A formal proof of what follows can be found in [18, 21]. Intuitively, GF finds a separating direction that does not change much after a certain number of iterations, and then increases its norm. We may then write

w_k = w_∞ g(k) + ρ(k)    (144)

where ρ(k) = o(1) (a theorem from Soudry, Hoffer and Srebro (2018) actually shows g(k) = log(k)). Replacing this in the expression of the gradient leads to

−∇_w L(w) = (1/n) Σ_{i=1}^n e^{−⟨w, x_i⟩} x_i ≃ (1/n) Σ_{i=1}^n e^{−g(k) ⟨w_∞, x_i⟩ + O(1)} x_i    (145)

which is a linear combination of the data points and gives an explicit expression for the νi coefficients. Now
denote by γ the margin γ = mini ⟨w∞ , xi ⟩ > 0, and define the normalized separator
ŵ_∞ = w_∞ / γ    (146)

whose norm remains finite. Primal feasibility, ⟨ŵ_∞, x_i⟩ ⩾ 1, is satisfied by construction, and we have seen that the zero gradient condition prescribing w as a linear combination of the data points is also verified. We also see that for large values of ⟨w_∞, x_i⟩ the corresponding coefficient ν_i decreases very fast as g(k) → ∞, leaving the main contribution to the smallest values, i.e. the x_i for which ⟨ŵ_∞, x_i⟩ = 1. Thus we recover the complementary slackness condition.
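A minimal numerical illustration of this mechanism (with made-up separable data and hyperparameters of our own choosing; the connection to the formal results of [18, 21] is qualitative only): gradient descent on the logistic loss produces iterates whose norm keeps growing roughly like log k while their direction stabilizes, and after rescaling so that the smallest margin equals 1, a few points sit essentially on the margin while all others lie strictly above it; these are the support vectors of the max margin separator.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 30, 5
    # Linearly separable data, labels absorbed into the x_i as in the text.
    w_true = rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    X *= np.sign(X @ w_true)[:, None]        # now <w_true, x_i> > 0 for all i

    w = np.zeros(d)
    lr = 1.0
    checkpoints = {10**j for j in range(2, 6)}
    prev_dir = None
    for k in range(1, 10**5 + 1):
        margins = X @ w
        grad = -(X.T @ (1.0 / (1.0 + np.exp(margins)))) / n   # gradient of mean log(1+e^{-<w,x_i>})
        w -= lr * grad
        if k in checkpoints:
            direction = w / np.linalg.norm(w)
            move = None if prev_dir is None else np.linalg.norm(direction - prev_dir)
            prev_dir = direction
            print(f"iter {k:>7}: ||w|| = {np.linalg.norm(w):6.2f}, direction moved {move}")

    # Rescale so the smallest margin is exactly 1 and inspect the active constraints.
    w_hat = w / (X @ w).min()
    print("five smallest margins of the normalized separator:",
          np.round(np.sort(X @ w_hat)[:5], 3))

Note that the convergence in direction is only logarithmic, so the smallest margins will cluster near 1 rather than equal it exactly after a finite number of steps.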

6.3 Comparing the squared, logistic and exponential loss


For squared loss, we go to minimum distance from initialization, so the initialization is still important. For
the logistic loss, however, the initialization doesn’t matter at all – we always go to the max margin solution.
This makes sense because given that we diverge anyway, i.e., we go infinitely far from the starting point,
it cannot matter where we start. Any finite initialization from far enough away looks like the origin. This
is a significant factor behind the differences between the squared loss and the logistic loss. We could repeat this analysis and show that when minimizing the logistic loss, for any homogeneous model, we always go to the minimum ℓ2 norm solution satisfying the margin constraints: this holds for any network of fixed depth with homogeneous activations, with essentially the same proof.

Theorem 6.1 ([22, 18]). If LS (w) → 0 and the step size is small enough to ensure convergence in direction,
then:
w_∞ ∝ a first order stationary point of  min_w ∥w∥_2  s.t.  ∀i,  y_i f(w, x_i) ≥ 1 .
The first order stationary points of this objective are exactly those that satisfy the KKT conditions. For convex problems, this implies optimality. Here, we have to be a bit more careful: the objective is convex, but f is not in general a convex function of w. This suggests that the implicit bias is captured by R_F(h) = min_{F(w)=h} ∥w∥_2, i.e., how cheaply the function h can be represented in parameter space. These two problems have the same global minima, but this does not mean that first-order stationary points of the first problem are first-order stationary points of the second. Relating the two remains largely open.
For the squared loss, as the scaling goes to infinity, the entire trajectory converges to the kernel trajectory. In particular, the implicit bias is then the implicit bias of the kernel (i.e., the limit points are the same). For the logistic loss, we can get a similar statement saying that infinite scale leads to the kernel regime, but the problem is that we can only get it for finite time. That is, if we fix the amount of time for which we optimize and then take the scale to infinity, we stay inside the kernel regime. These results are presented formally in [8]. For the logistic loss, if we change the order of the limits for the scale of initialization and the time of optimization, we reach a first order stationary point of arg min ∥w∥_2 such that the margin condition is satisfied (formal statements in [18] and [22]).
Thus, for the logistic loss, the regime we are in no longer depends only on α, the initialization scale, but rather on the combination of scale and training loss. For each optimization accuracy ϵ, we can ask at what scale we would enter the kernel regime. The transition occurs at ϵ ∼ exp(−α²). On the boundary (under several non-realistic assumptions), the behavior follows the Q function with parameter α/√(log(1/ϵ)). Thus, the transition depends on the ordering of the ϵ and α limits. While it is true that we have an asymptotic result, it only kicks in when ϵ ≈ 10^{−1700}, so we do not get ℓ1 regularization in practice.

Figure 20: This figure depicts how the predictor behaves with finite but not large initialization scale (α = 2).
We know that in the long run, we will reach the minimum ℓ1 norm solution. When the optimization accuracy
is finite, we will traverse the whole Qα path. The dashed black line is the path of minimum-Qα-margin solutions: the solution with margin 1 that minimizes Qα for the corresponding α. As α increases, we seem to converge to this path. (Note, we do not know how to prove this but it seems to hold empirically.) Importantly, while
it is true asymptotically in ϵ that minimizing the regularizer in function space corresponds to minimizing
the norm of the weights, the ϵ at which it starts to hold is completely impractical.

When we consider deeper models, things only get worse: the asymptotic regime is even harder to reach.
It is not well-understood yet what is happening here when α is small. Finally, the width of the network also
makes a difference in the effective initialization scale.

Other Control Parameters Other control parameters (aside from initialization scale) include how early
we stop training, shape of the initialization (i.e., relative scale of the parameters), step size, and stochasticity.
The latter has been studied through the lens of batch size and label noise.

6.4 Matrix Factorization Setting and Commutativity
We are still going to look at a linear model in β, but now β is going to be a matrix. This is a standard least
squares objective in β. The main difference is that now, instead of factorizing β element-wise, we are going
to factorize β = U V T . This formulation captures many things: matrix completion, matrix reconstruction
from linear measurements, and even multi-task learning.
Let us study the implicit bias of gradient flow on the factorization in function space, i.e., in matrix space:

U̇(t) = −∇_U L̂(UV^⊤)    (147)

V̇(t) = −∇_V L̂(UV^⊤)    (148)

⇒ β̇ = −(∇F^⊤∇F) ∇L̂(β) = −(UU^⊤ ∇L̂(β) + ∇L̂(β) VV^⊤) .    (149)

Observe that this is a linear transformation of the gradient. We would like to write this transformation in terms of β alone, which turns out not to be possible. Instead, we define W as the matrix obtained by stacking U on top of V, and

β̃ := WW^⊤ = [ UU^⊤  UV^⊤ ; VU^⊤  VV^⊤ ] = [ UU^⊤  β ; β^⊤  VV^⊤ ]

This is still a matrix factorization problem in W, except that it is now a symmetric matrix factorization problem. It corresponds to a minimization problem over positive semidefinite matrices, namely

min_{β̃ ⪰ 0} L̂(β̃) = ∥X̃(β̃) − y∥²_2

for an appropriately defined measurement operator X̃. The resulting dynamics are dβ̃/dt = −(β̃ ∇L̂(β̃) + ∇L̂(β̃) β̃).
What we’ve shown here is that whenever we have a non-symmetric matrix factorization problem, it is a
special case of a higher-dimensional symmetric matrix factorization problem. In some sense, the real problem
we should be looking at is this one, since the non-symmetric one hides information about the geometry.
The local geometry of the space in which we search is given by left and right multiplication by β̃. Now we want to see whether we can identify this as a Hessian map. Do the dynamics stay in a low-dimensional manifold?
As before, where we are depends on the integral of the residual so far; the residual determines how much to multiply by each observation. If the measurement matrices commute, we can solve the differential equation explicitly with the matrix exponential. The metric tensor is then a Hessian map, which directly implies that we stay on a low-dimensional manifold. In this case, the result is very robust: it depends neither on the residuals nor on the loss under which the residuals are computed. It is, however, not robust to the step size: since the manifold is curved, a large step can "ski jump" off it. If the measurements do not commute, the order in which we multiply on the right and left matters: β(t) is a "time-ordered exponential," and we cannot ignore the ordering of the residuals. Even with just two data points, we can reach the whole space. An analogy is parallel parking, where with only forward/backward and left/right controls we can somehow produce a net sideways motion.
This non-commutative case is the one we are in in general. This is a well-defined question, but one for which the solution is not known.
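This lifting can be checked numerically: up to O(η²) terms, a single gradient step of size η on (U, V) moves β = UV^⊤ by exactly −η(UU^⊤∇L̂(β) + ∇L̂(β)VV^⊤), which is the top-right block of the symmetric dynamics on β̃ = WW^⊤. The sketch below (with an arbitrary random sensing operator of our own choosing) verifies this first-order agreement.

    import numpy as np

    rng = np.random.default_rng(4)
    d1, d2, k, n = 6, 5, 4, 7
    U = rng.standard_normal((d1, k))
    V = rng.standard_normal((d2, k))
    A = rng.standard_normal((n, d1, d2))       # measurement matrices X_i
    y = rng.standard_normal(n)

    def grad_L(beta):
        """Gradient of 0.5 * sum_i (<X_i, beta> - y_i)^2 with respect to beta."""
        r = np.einsum('nij,ij->n', A, beta) - y
        return np.einsum('n,nij->ij', r, A)

    eta = 1e-6
    beta = U @ V.T
    G = grad_L(beta)

    # One explicit gradient step on the factors.
    U_new = U - eta * G @ V
    V_new = V - eta * G.T @ U
    beta_step = (U_new @ V_new.T - beta) / eta   # finite-difference velocity of beta

    beta_dot = -(U @ U.T @ G + G @ V @ V.T)      # predicted velocity from the lifted dynamics
    print("relative mismatch:",
          np.linalg.norm(beta_step - beta_dot) / np.linalg.norm(beta_dot))  # O(eta)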

7 Lecture Series Summary


The approach covered in this lecture series attempts to characterize the implicit bias of neural networks through complexity measures. The main question is to understand which zero-error solution is reached when optimizing an overparameterized problem, and answering it was decomposed into (1) identifying what the optimization is biased toward, i.e., which complexity measure is (at least approximately) being minimized, followed by (2) evaluating whether this complexity measure is a good one to optimize for the task at hand, i.e., whether its minimizers generalize well.
A natural question, then, is whether this is the right approach to take. Indeed, it is not always the best
description for optimization in neural networks. It remains open whether minimization of this complexity
measure, perhaps even approximately, is sufficient to explain most cases.

The ultimate questions remain: what is the true inductive bias? What makes reality efficiently learnable by fitting a (huge) neural net with a specific algorithm?

Acknowledgements

These are notes from the lectures of Nathan Srebro given at the summer school “Statistical Physics & Machine Learning”, which took place at the Les Houches School of Physics in France from July 4th to 29th, 2022.
The school was organized by Florent Krzakala and Lenka Zdeborová from EPFL.

References
[1] Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science & Business
Media, 2012.
[2] Peter Bartlett. For valid generalization the size of the weights is more important than the size of the
network. Advances in neural information processing systems, 9, 1996.
[3] Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2017.
[4] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning
practice and the classical bias–variance trade-off. In Proceedings of the National Academy of Sciences,
2019.

[5] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand
kernel learning. In International Conference on Machine Learning, 2018.
[6] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends®
in Machine Learning, 8(3-4):231–357, 2015.

[7] Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks
trained with the logistic loss. In Conference on Learning Theory, 2020.
[8] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming.
Advances in Neural Information Processing Systems, 32, 2019.

[9] Amit Daniely. Sgd learns the conjugate kernel class of the network. Advances in Neural Information
Processing Systems, 30, 2017.
[10] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning
complexity. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages
441–448, 2014.

[11] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on
linear convolutional networks. In Advances in Neural Information Processing Systems, 2018.
[12] Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro.
Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems,
2017.

[13] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
[14] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite
automata. Journal of the ACM (JACM), 41(1):67–95, 1994.

[15] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix
sensing and neural networks with quadratic activations. In Conference On Learning Theory, 2018.
[16] Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2021.

[17] Haihao Lu, Robert M. Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order
methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
[18] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks.
arXiv preprint arXiv:1906.05890, 2019.

[19] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.
The bulletin of mathematical biophysics, 5:115–133, 1943.
[20] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT
press, 2018.

[21] Edward Moroshko, Blake E Woodworth, Suriya Gunasekar, Jason D Lee, Nati Srebro, and Daniel
Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Advances in
neural information processing systems, 33:22182–22193, 2020.
[22] Mor Shpigel Nacson, Suriya Gunasekar, Jason Lee, Nathan Srebro, and Daniel Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4683–4692. PMLR, 09–15 Jun 2019.
[23] AS Nemirovsky and DB Yudin. Problem complexity and optimization method efficiency. M.: Nauka,
1979.

[24] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
[25] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, 2015.
[26] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm
infinite width relu nets: The multivariate case. In International Conference on Learning Representations,
2020.
[27] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm
networks look in function space? In Conference On Learning Theory, 2019.
[28] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms.
Cambridge university press, 2014.
[29] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and
uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
[30] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 1999.

[31] Vladimir Vapnik and Alexey Chervonenkis. Theory of pattern recognition, 1974.
[32] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal
value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing
Systems, 2017.

[33] Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan,
Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference
on Learning Theory, pages 3635–3673. PMLR, 2020.
[34] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep
learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

