Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Anthropic's recent mechanistic interpretability paper, Toy Models of Superposition, helps to demonstrate the conceptual richness of very small feedforward neural networks. Even when they are trained on synthetic, hand-coded data to reconstruct a very straightforward function (the identity map), there appears to be non-trivial mathematics at play, and the analysis of these small networks seems to provide an interesting playground for mechanistic interpretability.

While trying to understand their work and train my own toy models, I ended up making various notes on the underlying mathematics. This post is a slightly neatened-up version of those notes, but is still quite rough and un-edited and is a far-from-optimal presentation of the material. In particular, these notes may contain errors, which are my responsibility.

1. Directly Analyzing the Critical Points of a Linear Toy Model

Throughout we will be considering feedforward neural networks with one hidden layer. The input and output layers will be of the same size and the hidden layer is smaller. We will only be considering the autoencoding problem, which means that our networks are being trained to reconstruct the data. The first couple of subsections here are largely taken from the Appendix to the paper "Neural networks and principal component analysis: Learning from examples without local minima." by Pierre Baldi and Kurt Hornik. (Neural networks 2.1 (1989): 53-58).

Consider, to begin with, a completely linear model, i.e. one without any activation functions or biases. Suppose the input and output layers have $D$ neurons and that the middle layer has $d<D$ neurons. This means that the function that the model is implementing is of the form $x \mapsto ABx$, where $x \in \mathbb{R}^D$, $B$ is a $d \times D$ matrix, and $A$ is a $D \times d$ matrix. That is, the matrix $B$ contains the weights of the connections between the input layer and the hidden layer, and the matrix $A$ contains the weights of the connections between the hidden layer and the output layer. It is important to realise that even though, for a given set of weights, the function being implemented here is linear, the mathematics of this model and the dynamics of the training are not completely linear.

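The error on a given input $x$ will be measured by $\|x - ABx\|^2$ and on the data set $\{x^t\}_{t=1}^T$, the total loss is

$$L = L\big(A,B,\{x^t\}_{t=1}^T\big) := \sum_{t=1}^T \|x^t - ABx^t\|^2 = \sum_{t=1}^T \sum_{i=1}^D \Big(x^t_i - \sum_{j=1}^d \sum_{k=1}^D a_{ij} b_{jk} x^t_k\Big)^2.$$

Define $\Sigma$ to be the $D \times D$ matrix whose $(i,j)$th entry $\sigma_{ij}$ is given by

$$\sigma_{ij} = \sum_{t=1}^T x^t_i x^t_j.$$

Clearly this matrix is symmetric.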

Assumption. We will assume that the data is such that a) $\Sigma$ is invertible and b) $\Sigma$ has distinct eigenvalues.

Let $\lambda_1 > \dots > \lambda_D$ be the eigenvalues of $\Sigma$.

1.1 The Global Minimum

Proposition 1. (Characterization of Critical Points) Fix the dataset and consider $L$ to be a function of the two matrix variables $A$ and $B$. For any critical point $(A,B)$ of $L$, there is a subset $I \subset \{1,\dots,D\}$ of size $d$ for which

$AB$ is an orthogonal projection onto a $d$-dimensional subspace spanned by orthonormal eigenvectors of $\Sigma$ corresponding to the eigenvalues $\{\lambda_i\}_{i \in I}$; and

$L(A,B) = \operatorname{tr}\Sigma - \sum_{i \in I} \lambda_i = \sum_{i \notin I} \lambda_i$.

Corollary 2. (Characterization of the Minimum) The loss has a unique minimum value that is attained when $I = \{1,\dots,d\}$, which corresponds to the situation when $AB$ is an orthogonal projection onto the $d$-dimensional subspace spanned by the eigendirections of $\Sigma$ that have the largest eigenvalues.

Remarks. We won't try to spell out all of the various connections to other closely related things, but for those who want some more keywords to go away and investigate further, we just remark that the minimization problem being studied here is about finding a low-rank approximation to the identity and is closely related to Principal Component Analysis. See also the Eckart–Young–Mirsky Theorem.
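As a quick numerical sanity check of Proposition 1 and Corollary 2, here is a minimal NumPy sketch. The sizes, the random data and the helper `loss` are my own illustrative assumptions and are not taken from the paper. It builds $\Sigma$ from a toy dataset, takes $A$ to be the top-$d$ eigenvectors with $B = A^T$, and compares the loss with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, T = 6, 2, 500                                  # illustrative sizes (assumptions)
X = rng.normal(size=(T, D)) * np.arange(1, D + 1)    # rows are the data points x^t

Sigma = X.T @ X                                      # sigma_ij = sum_t x^t_i x^t_j
eigvals, U = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]                    # sort so lambda_1 > ... > lambda_D
eigvals, U = eigvals[order], U[:, order]

def loss(A, B):
    """Total squared reconstruction error sum_t ||x^t - A B x^t||^2."""
    residual = X - X @ (A @ B).T
    return float(np.sum(residual ** 2))

A = U[:, :d]                                         # D x d: top-d eigenvectors as columns
B = A.T                                              # d x D
min_loss = loss(A, B)
discarded = float(eigvals[d:].sum())                 # sum of the D - d smallest eigenvalues

# A different index set I = {1, 3} (0-based [0, 2]) gives another critical point.
other_loss = loss(U[:, [0, 2]], U[:, [0, 2]].T)

print(min_loss, discarded, other_loss)
```

In this sketch `min_loss` should agree with `discarded` up to floating-point error, and `other_loss` should be strictly larger, as the corollary predicts.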

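We begin by directly differentiating $L$ with respect to the entries of $A$ and $B$. Using summation convention on repeated indices, we first take the derivative with respect to $b_{j'k'}$:

$$\frac{\partial L}{\partial b_{j'k'}} = \sum_{t=1}^T \sum_{i=1}^D -2\big(x^t_i - a_{ij}b_{jk}x^t_k\big)\, a_{il}\,\delta_{lj'}\,\delta_{qk'}\, x^t_q = -2\sum_{t=1}^T \big(x^t_i a_{ij'} x^t_{k'} - a_{ij}b_{jk}x^t_k\, a_{ij'} x^t_{k'}\big).$$

Setting this equal to zero and interpreting this equation for all $j'=1,\dots,d$ and $k'=1,\dots,D$ gives us that

$$A^T\Sigma = A^TAB\Sigma. \tag{1}$$

Then, separately, we differentiate $L$ with respect to $a_{i'j'}$:

$$\frac{\partial L}{\partial a_{i'j'}} = \sum_{t=1}^T \sum_{i=1}^D -2\big(x^t_i - a_{ij}b_{jk}x^t_k\big)\, \delta_{ii'}\,\delta_{pj'}\, b_{pq} x^t_q = -2\sum_{t=1}^T \big(x^t_{i'} b_{j'q} x^t_q - a_{i'j}b_{jk}x^t_k\, b_{j'q} x^t_q\big).$$

Setting this equation equal to zero for every $i'=1,\dots,D$ and $j'=1,\dots,d$ we have that:

$$\Sigma B^T = AB\Sigma B^T. \tag{2}$$

Thus

$$\nabla L(A,B) = 0 \iff \begin{cases} A^T\Sigma = A^TAB\Sigma \\ \Sigma B^T = AB\Sigma B^T. \end{cases} \tag{3}$$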

Since we have assumed that $\Sigma$ is invertible, the first equation immediately implies that $A^T = A^TAB$. If we assume in addition that $A$ has full rank (a reasonable assumption in any case of practical interest), then $A^TA$ is invertible and we have that

$$(A^TA)^{-1}A^T = B, \tag{4}$$

which in turn implies that

$$AB = A(A^TA)^{-1}A^T = P_A, \tag{5}$$

where we have written $P_A$ to denote the orthogonal projection onto the column space of $A$.

Claim. We next claim that $\Sigma$ commutes with $P_A$.

Proof of claim. Plugging (5) into the second equation of (3), we have:

$$\Sigma B^T = P_A\Sigma B^T. \tag{6}$$

Then, right-multiply by $A^T$ and use the fact that $B^TA^T = (AB)^T = P_A^T = P_A$ to get:

$$\Sigma P_A = P_A\Sigma P_A. \tag{7}$$

The right-hand side is manifestly a symmetric matrix, so we deduce that $\Sigma P_A$ is symmetric. If the product of two symmetric matrices is symmetric then they commute, so this indeed shows that $\Sigma$ commutes with $P_A$ and completes the proof of the claim.

Now let $U$ be the orthogonal matrix which diagonalizes $\Sigma$, i.e. the matrix for which

$$\Sigma = U\Lambda U^T, \tag{8}$$

where $\Lambda$ is a diagonal matrix with entries $\lambda_1 > \lambda_2 > \dots > \lambda_D > 0$.

Claim. We next claim that $P_A = UP_{U^TA}U^T$ and that $P_{U^TA}$ is diagonal.

Proof of Claim. Firstly, using the standard formula for orthogonal projections, we have

$$P_{U^TA} = U^TA\big(A^TUU^TA\big)^{-1}A^TU = U^TA\big(A^TA\big)^{-1}A^TU = U^TP_AU,$$

which implies that

$$P_A = UP_{U^TA}U^T. \tag{9}$$

To show that $P_{U^TA}$ is diagonal, we show that it commutes with the diagonal matrix $\Lambda$ (any matrix that commutes with a diagonal matrix whose diagonal entries are distinct must itself be diagonal). Starting from $P_{U^TA}\Lambda$, we first insert the identity matrix in the form $U^TU$, and then use (8) and (9) thus:

$$P_{U^TA}\Lambda = U^TU\,P_{U^TA}\,U^TU\,\Lambda\, U^TU = U^TP_A\Sigma U.$$

Then recall that we have already established that $P_A$ commutes with $\Sigma$. So we can swap them and then perform the same trick in reverse:

$$U^TP_A\Sigma U = U^T\Sigma P_AU = U^TU\,\Lambda\, U^TU\,P_{U^TA}\,U^TU = \Lambda P_{U^TA}.$$

This shows that $P_{U^TA}$ commutes with $\Lambda$ and completes the proof of the claim.

So, given that $P_{U^TA}$ is an orthogonal projection of rank $d$ and is diagonal, there exists a set of indices $I = \{i_1,\dots,i_d\}$ with $1 \le i_1 < i_2 < \dots < i_d \le D$ such that the $(i,j)$th entry of $P_{U^TA}$ is zero if $i \ne j$, is $1$ if $i = j$ and $i \in I$, and is zero if $i = j$ and $i \notin I$. And since $P_A = UP_{U^TA}U^T$, we see that

$$P_A = U_IU_I^T, \tag{10}$$

where $U_I$ is formed from $U$ by simply setting to zero the $j$th column if $j \notin I$. This is manifestly an orthogonal projection onto the span of $\{u_{i_1},\dots,u_{i_d}\}$, where $u_1,u_2,\dots,u_D$ is an orthonormal basis of eigenvectors of $\Sigma$ (and indeed the columns of $U$). Combining these observations with (5), we have that

$$AB = P_A = U_IU_I^T = P_{U_I}. \tag{11}$$

This proves the first claim of the proposition.

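To prove the second part, write $AB = [p_{ij}]$ and compute thus:

$$\sum_{t=1}^T \|x^t - ABx^t\|^2 = \sum_{t=1}^T\sum_{i=1}^D \big(x^t_i - p_{ij}x^t_j\big)^2 = \sum_{t=1}^T \big(x^t_ix^t_i - 2x^t_ip_{ij}x^t_j + p_{ij}x^t_j\,p_{ik}x^t_k\big) = \operatorname{tr}\Sigma - 2\operatorname{tr}(P_{U_I}\Sigma) + \operatorname{tr}(P_{U_I}\Sigma P_{U_I}^T). \tag{12}$$

But we know from (7) and (11) that $P_{U_I}\Sigma P_{U_I} = \Sigma P_{U_I}$ and so this last line is actually just equal to

$$\operatorname{tr}\Sigma - \operatorname{tr}(P_{U_I}\Sigma).$$

Focussing on the second term and using (11), then (9) and (8), then cancelling $U^TU = I$, and then, to reach the last expression, cyclically permuting the matrices inside the trace operator to produce another $U^TU$ cancellation, we have:

$$\operatorname{tr}(P_{U_I}\Sigma) = \operatorname{tr}(P_A\Sigma) = \operatorname{tr}\big(UP_{U^TA}U^TU\Lambda U^T\big) = \operatorname{tr}\big(UP_{U^TA}\Lambda U^T\big) = \operatorname{tr}\big(P_{U^TA}\Lambda\big).$$

The diagonal form of $P_{U^TA}$ means that this final expression is equal to $\sum_{i \in I}\lambda_i$, meaning that

$$\sum_{t=1}^T \|x^t - ABx^t\|^2 = \operatorname{tr}\Sigma - \sum_{i \in I}\lambda_i.$$

Since $\operatorname{tr}\Sigma = \sum_{i=1}^D \lambda_i$ (the trace is always equal to the sum of the eigenvalues), this completes the proof of the proposition. □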

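Remarks. Equation (10) above tells us that $\operatorname{col}(A) = \operatorname{span}\langle u_{i_1},\dots,u_{i_d}\rangle$, which means that there exists an invertible $d \times d$ matrix $C$ with $A = U_IC$. (In this remark $U_I$ should be read as the $D \times d$ matrix whose columns are $u_{i_1},\dots,u_{i_d}$, so that $U_I^TU_I$ is the $d \times d$ identity; the product $U_IU_I^T$ is the same either way.) Then, using (4), we compute that

$$B = (A^TA)^{-1}A^T = \big(C^TU_I^TU_IC\big)^{-1}(U_IC)^T = C^{-1}(C^T)^{-1}C^TU_I^T = C^{-1}U_I^T.$$

So we have:

$$\begin{cases} A = U_IC \\ B = C^{-1}U_I^T. \end{cases} \tag{$\star$}$$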
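The description of critical points as $A = U_IC$, $B = C^{-1}U_I^T$ can be checked numerically. The following is a rough sketch in which the toy data, the index set and the particular matrix `C` are arbitrary choices of mine. It confirms that both gradient conditions in (3) vanish, that $AB = P_A$, and that $\Sigma$ commutes with $P_A$.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, T = 5, 2, 400                               # illustrative sizes (assumptions)
X = rng.normal(size=(T, D)) * np.linspace(1.0, 3.0, D)
Sigma = X.T @ X
_, U = np.linalg.eigh(Sigma)                      # columns of U are eigenvectors of Sigma

I = [1, 3]                                        # any index set of size d (0-based here)
U_I = U[:, I]                                     # D x d matrix of the chosen eigenvectors
C = np.array([[2.0, 1.0], [0.5, 3.0]])            # an arbitrary invertible d x d matrix
A = U_I @ C
B = np.linalg.inv(C) @ U_I.T

grad_B = A.T @ Sigma - A.T @ A @ B @ Sigma        # first condition in (3)
grad_A = Sigma @ B.T - A @ B @ Sigma @ B.T        # second condition in (3)
P_A = A @ np.linalg.inv(A.T @ A) @ A.T            # orthogonal projection onto col(A)

print(np.abs(grad_A).max(), np.abs(grad_B).max())
```

Both printed numbers should be at the level of floating-point noise for any choice of `I`, which is a reminder that (3) picks out all of the critical points and not just the minimum.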

1.2 Characterizing Other Critical Points

This subsection is something of an aside, but it is included for completeness.

Proposition 3. (Other Critical Points are Saddle Points.) Fix the dataset and consider $L$ to be a function of the two matrix variables $A$ and $B$. Every other critical point is a saddle point, i.e. if $(A,B)$ is a critical point but not equal to the unique minimum, then there exist $\tilde A$ and $\tilde B$ which are arbitrarily close to $A$ and $B$ respectively and at which a lower loss is achieved.

Proof. Since $(A,B)$ is not the unique global minimum, we know from Corollary 2 that $I \ne \{1,\dots,d\}$. This means that there are distinct indices $j$ and $k$ for which $j \in I$, $k \notin I$ and $k < j$. In particular, bear in mind that $\lambda_k > \lambda_j$.

Now, given any $\epsilon > 0$, put

$$\tilde u_j := \frac{u_j + \epsilon u_k}{\sqrt{1+\epsilon^2}}.$$

And let us form the new matrix $\tilde U_I$ by starting with $U_I$ and replacing the column $u_j$ with $\tilde u_j$. Write

$$\tilde A = \tilde U_IC, \tag{13}$$

$$\tilde B = C^{-1}\tilde U_I^T. \tag{14}$$

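We want to calculate the loss of the model at $(\tilde A,\tilde B)$. We ought to bear in mind that it is not a critical point, so we cannot assume the intermediate results in the proof of Proposition 1, but it turns out that the bits that are most useful for this computation rely only on algebra and (13), (14). We start from the equivalent of line (11) which is that $\tilde A\tilde B = \tilde U_I\tilde U_I^T = P_{\tilde U_I}$, which implies that $\tilde A\tilde B = P_{\tilde A}$. And so just as in (12) above, we have

$$\sum_{t=1}^T \|x^t - \tilde A\tilde Bx^t\|^2 = \operatorname{tr}\Sigma - 2\operatorname{tr}(P_{\tilde U_I}\Sigma) + \operatorname{tr}(P_{\tilde U_I}\Sigma P_{\tilde U_I}^T). \tag{15}$$

Now, looking at the final term on the right-hand side, we have $P_{\tilde U_I}\Sigma P_{\tilde U_I}^T = P_{\tilde A}\Sigma P_{\tilde A}$ and (by cyclic permutation, together with $P_{\tilde A}^2 = P_{\tilde A}$) $\operatorname{tr}(P_{\tilde A}\Sigma P_{\tilde A}) = \operatorname{tr}(P_{\tilde A}\Sigma)$. And since

$$P_{U^T\tilde A} = U^T\tilde A\big(\tilde A^TUU^T\tilde A\big)^{-1}\tilde A^TU = U^TP_{\tilde A}U,$$

we have:

$$\operatorname{tr}(P_{\tilde A}\Sigma) = \operatorname{tr}\big(UP_{U^T\tilde A}U^T\Sigma\big) = \operatorname{tr}\big(P_{U^T\tilde A}\Lambda\big). \tag{16}$$

We also use $\operatorname{tr}(P_{\tilde U_I}\Sigma) = \operatorname{tr}(P_{\tilde A}\Sigma)$ and (16) on the second term on the right-hand side of (15) to ultimately arrive at:

$$\sum_{t=1}^T \|x^t - \tilde A\tilde Bx^t\|^2 = \operatorname{tr}\Sigma - \operatorname{tr}\big(P_{U^T\tilde A}\Lambda\big). \tag{17}$$

So we are interested in computing the diagonal elements of $P_{U^T\tilde A}$. Fix $i \in \{1,\dots,D\}$. The $i$th diagonal entry is given by:

$$e_i^TP_{U^T\tilde A}e_i = e_i^TU^TP_{\tilde A}Ue_i = u_i^T\tilde U_I\tilde U_I^Tu_i.$$

This can be computed directly from the definition of $\tilde U_I$ to give that the $i$th entry on the diagonal is equal to

$$\begin{cases} 0 & \text{if } i \notin I \cup \{k\} \\ 1 & \text{if } i \in I \setminus \{j\} \\ 1/(1+\epsilon^2) & \text{if } i = j \\ \epsilon^2/(1+\epsilon^2) & \text{if } i = k. \end{cases} \tag{18}$$

Therefore

$$\sum_{t=1}^T \|x^t - \tilde A\tilde Bx^t\|^2 = \operatorname{tr}\Sigma - \Big[\sum_{i \in I\setminus\{j\}}\lambda_i + \frac{\lambda_j}{1+\epsilon^2} + \frac{\epsilon^2\lambda_k}{1+\epsilon^2}\Big] = \operatorname{tr}\Sigma - \sum_{i \in I}\lambda_i - \frac{\epsilon^2}{1+\epsilon^2}(\lambda_k - \lambda_j) = \sum_{t=1}^T \|x^t - ABx^t\|^2 - \frac{\epsilon^2}{1+\epsilon^2}(\lambda_k - \lambda_j).$$

Since $\lambda_k > \lambda_j$ this shows that in an arbitrarily small neighbourhood of the critical point $(A,B)$ we can find a point $(\tilde A,\tilde B)$ where smaller loss is achieved. We will not bother doing so here, but one can also check that $(A,B)$ is not a local maximum by using the fact that for fixed (full rank) $A$, the function $z \mapsto \|x - Az\|^2$ is convex. □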
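The perturbation argument in this proof is easy to test numerically. Below is a minimal sketch, with toy data and index choices that are my own assumptions and with $C = I$ for simplicity. It starts from a non-minimal critical point, rotates one selected eigenvector slightly towards an unselected eigenvector with a larger eigenvalue, and compares the observed drop in loss with $\frac{\epsilon^2}{1+\epsilon^2}(\lambda_k - \lambda_j)$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, T = 5, 400                                     # illustrative sizes (assumptions)
X = rng.normal(size=(T, D)) * np.linspace(1.0, 3.0, D)
Sigma = X.T @ X
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]                    # descending: lam[0] > lam[1] > ...

def loss(A, B):
    """Total squared reconstruction error sum_t ||x^t - A B x^t||^2."""
    residual = X - X @ (A @ B).T
    return float(np.sum(residual ** 2))

I = [0, 2]        # not the top-2 set [0, 1], so this is a non-minimal critical point
j, k = 2, 1       # j in I, k not in I, k < j, hence lam[k] > lam[j]
eps = 1e-2

U_I = U[:, I]
U_I_tilde = U_I.copy()
U_I_tilde[:, I.index(j)] = (U[:, j] + eps * U[:, k]) / np.sqrt(1 + eps**2)

base = loss(U_I, U_I.T)                           # loss at the critical point (C = I)
perturbed = loss(U_I_tilde, U_I_tilde.T)          # loss at the nearby point
predicted_drop = eps**2 / (1 + eps**2) * (lam[k] - lam[j])

print(base - perturbed, predicted_drop)
```

The two printed numbers should match closely, and shrinking `eps` makes the perturbed point arbitrarily close to the critical point while the loss stays strictly lower.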

2. Sparse Data, Weight Tying, and Gradients

Abstractly analyzing critical points is not at all the same as training real models. In this section we start to think about data and the optimization process.

2.1 Sparse Synthetic Data

Here we describe the kind of training data used in Anthropic's toy experiments.

Fix a number $S \in [0,1]$. This parameter is the sparsity of the data. We will typically be most interested in the case where $S$ is close to 1.

Let $\{B^t_i\}$, for $t \ge 1$ and $1 \le i \le D$, be an independent and identically distributed family of Bernoulli random variables with parameter $(1-S)$. And let $\{U^t_i\}$, for $t \ge 1$ and $1 \le i \le D$, be an IID family of Uniform$([0,1])$ random variables. Write $X^t_i = B^t_iU^t_i$ and $X^t := (X^t_1,\dots,X^t_D)$. Our datasets $\{x^t\}_{t=1}^T \subset \mathbb{R}^D$ will be drawn from the IID family $\{X^t\}_{t=1}^\infty$. Notice that

Independently, for each data point $x^t \in \mathbb{R}^D$ and for each $i \in \{1,\dots,D\}$, we will have $P(x^t_i = 0) = S$.

So, the expected number of non-zero entries for each data point is $(1-S)D$. To bring this in line with the way people say things like "$k$-sparse", we can say that the data is, on average, $(1-S)D$-sparse.

$\mathbb{E}(x^t_i) = \tfrac{1}{2}(1-S)$.
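For concreteness, here is a small sampler for this data distribution. It is a sketch under my own naming: `sample_sparse_data` and the parameter values are hypothetical and are not taken from Anthropic's code. The empirical statistics should land close to the three facts just listed.

```python
import numpy as np

def sample_sparse_data(T, D, S, rng):
    """Draw T points in [0,1]^D with X_i = B_i * U_i, where B_i ~ Bernoulli(1 - S)."""
    mask = rng.random((T, D)) < (1.0 - S)         # B: entry is "on" with probability 1 - S
    values = rng.random((T, D))                   # U: uniform on [0, 1)
    return mask * values

rng = np.random.default_rng(3)
T, D, S = 200_000, 5, 0.9                         # illustrative values (assumptions)
X = sample_sparse_data(T, D, S, rng)

frac_zero = float(np.mean(X == 0.0))                  # should be close to S
mean_entry = float(X.mean())                          # should be close to (1 - S) / 2
mean_nonzero = float(np.mean((X != 0).sum(axis=1)))   # should be close to (1 - S) * D

print(frac_zero, mean_entry, mean_nonzero)
```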

Remark. Judging from some of the existing literature on the linear model that we analyze in Section 1 (e.g. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks." by Andrew M. Saxe, James L. McClelland and Surya Ganguli), it seems like it's tempting to make an assumption/simplification/approximation that $\Sigma = I$. I still don't feel like I understand how justifiable that is; for me this question is a potential 'jumping-off' point for further analysis of the whole problem. Recall that the matrix $\Sigma$ is equal to $\sum_{t=1}^T x^t \otimes x^t$. Certainly the probability that an off-diagonal entry of $x^t \otimes x^t$ is equal to zero is $1-(1-S)^2$ whereas for the diagonal entries it is just $S$. And note that $\mathbb{E}(x^t_ix^t_j) = \tfrac{1}{4}(1-S)^2$ if $i \ne j$ and $\mathbb{E}\big((x^t_i)^2\big) = \tfrac{1}{3}(1-S)$. But the diagonal entries are still independent and I'm not sure why thinking of them as equal makes sense.
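The two second-moment formulas in this remark can be checked empirically. This is only a sketch with arbitrary parameter choices of mine: it estimates $\Sigma/T$ from a large sample and compares the average diagonal and off-diagonal entries with $\tfrac13(1-S)$ and $\tfrac14(1-S)^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
T, D, S = 200_000, 4, 0.8                         # illustrative values (assumptions)
X = (rng.random((T, D)) < 1.0 - S) * rng.random((T, D))   # X_i = B_i * U_i

second_moment = (X.T @ X) / T                     # empirical Sigma / T
diag_mean = float(np.mean(np.diag(second_moment)))
offdiag_mean = float(
    (second_moment.sum() - np.trace(second_moment)) / (D * (D - 1))
)

expected_diag = (1.0 - S) / 3.0                   # E[(x_i)^2]
expected_offdiag = (1.0 - S) ** 2 / 4.0           # E[x_i x_j] for i != j

print(diag_mean, expected_diag, offdiag_mean, expected_offdiag)
```

Note that the off-diagonal entries do not vanish, so even on average $\Sigma/T$ is not exactly a multiple of the identity; they are of order $(1-S)^2$, against $(1-S)$ on the diagonal.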

The data (and the loss) are meant to model two main ideas: Firstly, that the coordinate directions of the input space act as a natural set of features for the data. And secondly, when $S$ is close to 1, the sparsity of the data is supposed to capture the fact that features really do often tend to be sparse in real-world data, i.e. we see that for any given object or any given word/idea that appears in a language, it is the case that most images don't contain that object and most sentences don't contain that word or idea.

2.2 Weight Tying and The Gradient Flow

In practice, when we train an autoencoder like this, we do so with weight tying. Roughly speaking, this means that we only consider the case where $A = B^T$. Proposition 1 does indeed allow for a global minimum in which $A = B^T$: This is achieved by essentially taking $C = I$ in the equations ($\star$) at the end of Section 1.1, i.e. we have:

$$A = U_I, \qquad B = U_I^T.$$

But note that we don't actually want to try to repeat the analysis of Section 1 on a loss of the form $\sum_t \|x^t - W^TWx^t\|^2$. This would be a higher-order polynomial function of the entries of $W$ and so it's genuinely a different and potentially more complicated functional. The way that weight-tying is done in practice is more similar to saying that we insist during training that updates are made that preserve the equality $A = B^T$.

Equations (1) and (2) in subsection 1.1 are obtained as a direct result of differentiating the loss with respect to individual entries of the matrices (or individual 'weights' if we interpret this model as a feedforward neural network without activations). Our computations show that:

$$\nabla L(A,B) = -2\begin{pmatrix} A^T\Sigma - A^TAB\Sigma \\ \Sigma B^T - AB\Sigma B^T \end{pmatrix}. \tag{19}$$

In an appropriate continuous time limit, if we set the learning rate to 1, the weights during training evolve according to the differential equations:

$$\frac{d}{dt}B = A^T(\Sigma - AB\Sigma), \tag{20}$$

$$\frac{d}{dt}A = (\Sigma - AB\Sigma)B^T. \tag{21}$$

Remarks. Notice that there is a certain deliberate sloppiness here: One doesn't really have a fixed matrix Σ and then run this gradient flow for all time; the matrix Σ is a function of (a batch of) training data. So we need to be careful about any further manipulations or interpretations of these equations.

Those caveats having been noted, if we additionally add in the weight-tying constraint $A = B^T$, we get:

$$\frac{d}{dt}W = W(\Sigma - W^TW\Sigma). \tag{22}$$

We can even make the substitution $W = \bar WU^T$ to introduce the form:

$$\frac{d}{dt}\bar W = \bar W(I - \bar W^T\bar W)\Lambda. \tag{23}$$
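To get a feel for the weight-tied flow $\frac{d}{dt}W = W(\Sigma - W^TW\Sigma)$, here is a rough forward-Euler sketch. The fixed diagonal $\Sigma$, the step size and the number of steps are all assumptions of mine, and holding $\Sigma$ fixed is exactly the sloppiness flagged in the Remarks above. Since $\Sigma$ is invertible, the right-hand side equals $(W - WW^TW)\Sigma$, which vanishes whenever the rows of $W$ are orthonormal, and in this toy run the iterates do settle at such a point.

```python
import numpy as np

rng = np.random.default_rng(5)
D, d = 6, 3                                       # illustrative sizes (assumptions)
Sigma = np.diag(np.linspace(2.0, 0.5, D))         # a fixed toy Sigma (assumption)
W = 0.1 * rng.normal(size=(d, D))                 # small random initialisation

dt, steps = 0.01, 5000                            # hypothetical step size and horizon
for _ in range(steps):
    W = W + dt * (W @ (Sigma - W.T @ W @ Sigma))  # forward Euler step for the flow

residual = W @ (Sigma - W.T @ W @ Sigma)          # right-hand side at the final iterate
gram = W @ W.T                                    # d x d Gram matrix of the rows of W

print(np.abs(residual).max(), np.abs(gram - np.eye(d)).max())
```

This is a single run with one initialisation, so it says nothing about which orthonormal frame is reached in general or how the answer depends on $\Sigma$.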

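In components (and without summation convention) the equation reads

$$\frac{d}{dt}\bar w_{ij} = \sum_{k,l=1}^D \bar w_{ik}\Big(\delta_{kl} - \sum_{m=1}^d \bar w_{mk}\bar w_{ml}\Big)\delta_{lj}\lambda_j. \tag{24}$$

Let $\{\bar w_i\}_{i=1}^D$ denote the set of columns of $\bar W$ so that (24) becomes:

$$\frac{d}{dt}\bar w_j = \sum_{k,l=1}^D \big(\delta_{kl} - \langle \bar w_k,\bar w_l\rangle\big)\delta_{lj}\lambda_j\bar w_k. \tag{25}$$

Expanding the brackets and executing the sum over $l$ gives:

$$\frac{d}{dt}\bar w_j = \sum_{k=1}^D \delta_{kj}\lambda_j\bar w_k - \sum_{k=1}^D \langle \bar w_k,\bar w_j\rangle\lambda_j\bar w_k. \tag{26}$$

Then the sum over $k$ further simplifies the first term to give:

$$\frac{d}{dt}\bar w_j = \lambda_j\bar w_j - \sum_{k=1}^D \langle \bar w_k,\bar w_j\rangle\lambda_j\bar w_k. \tag{27}$$

Finally, just peel off the $k = j$ term from the remaining summation to arrive at the equation

$$\frac{d}{dt}\bar w_j = \big(1 - |\bar w_j|^2\big)\lambda_j\bar w_j - \sum_{k \ne j} \langle \bar w_k,\bar w_j\rangle\lambda_j\bar w_k. \tag{28}$$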
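As a check on the index manipulation, the following sketch (toy sizes and random values are my own assumptions) compares the matrix form $\bar W(I - \bar W^T\bar W)\Lambda$ with the per-column form $\frac{d}{dt}\bar w_j = (1 - |\bar w_j|^2)\lambda_j\bar w_j - \sum_{k \ne j}\langle \bar w_k,\bar w_j\rangle\lambda_j\bar w_k$ obtained by expanding it in components.

```python
import numpy as np

rng = np.random.default_rng(6)
D, d = 5, 3                                       # illustrative sizes (assumptions)
lam = np.linspace(2.0, 0.4, D)                    # toy eigenvalues lambda_1 > ... > lambda_D
W_bar = rng.normal(size=(d, D))                   # columns w_bar_j live in R^d

# Matrix form of the right-hand side: W_bar (I - W_bar^T W_bar) Lambda
matrix_rhs = W_bar @ (np.eye(D) - W_bar.T @ W_bar) @ np.diag(lam)

# Column-by-column form of the same right-hand side
column_rhs = np.zeros_like(W_bar)
for j in range(D):
    w_j = W_bar[:, j]
    own = (1.0 - w_j @ w_j) * lam[j] * w_j        # the k = j contribution
    cross = sum(
        (W_bar[:, k] @ w_j) * lam[j] * W_bar[:, k]
        for k in range(D) if k != j
    )                                             # interference from the other columns
    column_rhs[:, j] = own - cross

print(np.abs(matrix_rhs - column_rhs).max())
```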

Remarks. (cf. the previous two Remarks) If we assume that $\Sigma = I$, then $W = \bar W$ and the equation above arises as gradient descent on the energy functional

$$E = \frac{1}{4}\sum_{i=1}^D \lambda\big(1 - |w_i|^2\big)^2 + \frac{1}{2}\sum_{i,j\,:\,i \ne j} \lambda\,\big|\langle w_i,w_j\rangle\big|^2. \tag{29}$$

It's plausible that a reasonable line of argument to justify this is that since no particular directions in the data are special, it means that over time, on average, the effects of different eigenvalues of $\Sigma$ just somehow 'average out'. But I don't endorse or understand how that argument would actually go. Regardless, if we just assume this for now, as is explained in the Anthropic paper, we can think of the two terms in (29) as being in competition. The first term suggests that the model 'wants' to learn the $k$th feature by arranging $|w_k| = 1$. However, as it tries to do so, it incurs a penalty, given by $\sum_{i \ne k}\lambda\,|\langle w_i,w_k\rangle|^2$, that can reasonably be interpreted as the extent to which the hidden representation $w_k \in \mathbb{R}^d$ of that feature interferes with its attempts to represent and reconstruct the other features.

3. The ReLU Output Model

3.1 The Distribution of the Data and the Integral Loss

Perhaps a better way to try to incorporate information about the distribution of the data into the analysis here is to directly let $\mu$ be the distribution (i.e. in the proper measure-theoretic sense) of $X^t$ on $\mathbb{R}^D$ and to consider

$$L = \int_{\mathbb{R}^D} \|x - W^TWx\|^2 \, d\mu(x).$$

In the Anthropic paper and in my own work, we are ultimately more interested in a model with biases and ReLUs at the output layer:

$$L = \int_{\mathbb{R}^D} \|x - \mathrm{ReLU}(W^TWx - b)\|^2 \, d\mu(x).$$

Performing an analysis anything like that done in Section 1 seems much harder for this model, but perhaps more progress can be made studying the integral above.

The synthetic data we described in the previous section is all contained in the cube $Q := [0,1]^D \subset \mathbb{R}^D$. In the sparse regime, i.e. with $S$ close to 1, the vast majority of the data is concentrated around the lower-dimensional skeletons of the cube. For $l = 0,1,\dots,D$, if we write $Q_l$ for the set of points in the cube with exactly $l$ non-zero entries, i.e.

$$Q_l := \{x \in Q : \#\{j : x_j \ne 0\} = l\},$$

then $Q$ is the disjoint union

$$Q = \bigcup_{l=0}^D Q_l.$$

Without a closer analysis of binomial tail bounds I can't immediately tell how well-justified it is to, say, ignore $\bigcup_{l \ge 2} Q_l$ and focus the analysis just on 1-sparse vectors in the dataset, i.e. you might want to say that $\mu\big(\bigcup_{l \ge 2} Q_l\big)$ is sufficiently small that that region contributes only negligibly to the integral. Then you can start to work with more manageable expressions. To my mind this is another concrete potential 'jumping-off' point if one were to do more investigation. In particular, it is in the direction of the observations made in Toy Models of Superposition to suggest a link between this problem and the 'Thomson Problem'.
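As a first step towards that binomial analysis, the mass that $\mu$ assigns to each $Q_l$ can be written down directly: the number of non-zero coordinates is Binomial$(D, 1-S)$, so $\mu(Q_l) = \binom{D}{l}(1-S)^lS^{D-l}$. The sketch below evaluates this; the helper name `mass_of_Q` and the values of $D$ and $S$ are my own illustrative assumptions.

```python
from math import comb

def mass_of_Q(l, D, S):
    """P(exactly l of the D coordinates are non-zero) under the sparse data model."""
    return comb(D, l) * (1.0 - S) ** l * S ** (D - l)

D, S = 20, 0.99                                   # illustrative values (assumptions)
masses = [mass_of_Q(l, D, S) for l in range(D + 1)]
mass_at_most_one = masses[0] + masses[1]          # mu(Q_0) + mu(Q_1)
mass_two_or_more = sum(masses[2:])                # mu of the union of Q_l over l >= 2

print(mass_at_most_one, mass_two_or_more)
```

With these particular values the mass on $\bigcup_{l \ge 2} Q_l$ comes out under two percent. A small mass does not by itself show that the region contributes negligibly to the integral, since the integrand may be larger there, so this only bounds how often such points are seen.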
