
> how many times did the explanation just "work out" for no apparent reason

From the examples later in your post, it seems like it might be clearer to say something more like "how many things need to hold about the circuit for the explanation to describe the circuit"? More precisely, I'm objecting to your "how many times" because it could plausibly mean "on how many inputs" which I don't think is what you mean, and I'm objecting to your "for no apparent reason" because I don't see what it would mean for an explanation to hold for a reason in this case.


@misc{radhakrishnan2023mechanism,
  title  = {Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features},
  author = {Adityanarayanan Radhakrishnan and Daniel Beaglehole and Parthe Pandit and Mikhail Belkin},
  year   = {2023},
  url    = {https://arxiv.org/pdf/2212.13881.pdf}
}

Let $h_\ell(x)$ denote the activation vector in layer $\ell$ on input $x$, with the input layer being at index $0$, so $h_0(x) = x$. Let $W_\ell$ be the weight matrix after activation layer $\ell$, and let $f_\ell$ be the function that maps from the $\ell$th activation layer to the output (assume for now that the activations are ReLU). Then their *Deep Neural Feature Ansatz* says that $$W_\ell^\top W_\ell \propto \frac{1}{n} \sum_x \nabla f_\ell(h_\ell(x)) \, \nabla f_\ell(h_\ell(x))^\top,$$ where the sum is over the $n$ training inputs $x$ and the gradient is taken wrt $h_\ell(x)$.
(I'm somewhat confused here about them not mentioning the loss function at all — are they claiming this is reasonable for any reasonable loss function? Maybe just MSE? MSE seems to be the only loss function mentioned in the paper; I think they leave the loss unspecified in a bunch of places though.)

Letting $W_\ell = U \Sigma V^\top$ be an SVD of $W_\ell$, we note that this is equivalent to $$V \Sigma^2 V^\top \propto \frac{1}{n} \sum_x \nabla f_\ell(h_\ell(x)) \, \nabla f_\ell(h_\ell(x))^\top,$$ i.e., that the eigenvectors of the matrix on the RHS are the right singular vectors of $W_\ell$ (with eigenvalues the squared singular values). By the variational characterization of eigenvectors and eigenvalues (Courant–Fischer or whatever), this is the same as saying that the right singular vectors of $W_\ell$ are the sequence of orthonormal directions successively capturing the most mass of the matrix on the RHS. Plugging in the definition of that matrix, this is equivalent to saying that the right singular vectors are the sequence of highest-variance directions of the data set of gradients $\{\nabla f_\ell(h_\ell(x))\}_x$.
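(A quick numpy sanity check of this linear-algebra step, nothing from the paper and with made-up sizes: for a generic matrix $W$, the eigenvectors of $W^\top W$ are the right singular vectors of $W$, with eigenvalues the squared singular values.)

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 7))  # a generic "weight matrix"

# SVD: rows of Vt are the right singular vectors of W
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# eigendecomposition of W^T W; eigh returns ascending order, so flip to descending
evals, evecs = np.linalg.eigh(W.T @ W)
evals, evecs = evals[::-1], evecs[:, ::-1]

print(np.allclose(evals[:5], S**2))  # nonzero eigenvalues = squared singular values
# eigenvectors match right singular vectors up to sign (singular values are distinct here)
print(np.allclose(np.abs(np.sum(evecs[:, :5] * Vt.T, axis=0)), 1.0))
```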

(I have assumed here that the linearity is precise, whereas really it is approximate. It's probably true though that with some assumptions, the approximate initial statement implies an approximate conclusion too? Getting approx the same vecs out probably requires some assumption about gaps in singular values being big enough, because the vecs are unstable around equality. But if we're happy getting a sequence of orthogonal vectors that gets variances which are nearly optimal, we should also be fine without this kind of assumption. (This is guessing atm.))
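To illustrate the instability point, here's a toy example of my own (not from the paper): when two singular values are nearly equal, an arbitrarily small perturbation can rotate the corresponding right singular vectors by about 45 degrees, so recovering approximately the same vectors really does need some gap assumption.

```python
import numpy as np

# two nearly-equal singular values: the corresponding singular vectors are nearly degenerate
A = np.diag([1.0, 1.0 + 1e-9, 0.5])

# a tiny perturbation mixing the first two coordinates
eps = 1e-6
P = np.zeros((3, 3))
P[0, 1] = P[1, 0] = eps

_, _, Vt_A = np.linalg.svd(A)
_, _, Vt_B = np.linalg.svd(A + P)

# the top right singular vector rotates by ~45 degrees even though the perturbation is tiny
angle = np.degrees(np.arccos(abs(Vt_A[0] @ Vt_B[0])))
print(angle)  # roughly 45, not roughly 0
```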

Assuming there isn't an off-by-one error in the paper, we can pull some terms out of the RHS, maybe? This is because applying the chain rule to the Jacobians of the layer transitions gives $\nabla f_\ell(h_\ell(x)) = W_\ell^\top \nabla f_{\ell+1}(h_{\ell+1}(x))$, so $$\frac{1}{n} \sum_x \nabla f_\ell(h_\ell(x)) \, \nabla f_\ell(h_\ell(x))^\top = W_\ell^\top \left( \frac{1}{n} \sum_x \nabla f_{\ell+1}(h_{\ell+1}(x)) \, \nabla f_{\ell+1}(h_{\ell+1}(x))^\top \right) W_\ell.$$

Wait, so the claim is just $$W_\ell^\top W_\ell \propto W_\ell^\top \left( \frac{1}{n} \sum_x \nabla f_{\ell+1}(h_{\ell+1}(x)) \, \nabla f_{\ell+1}(h_{\ell+1}(x))^\top \right) W_\ell,$$ which, assuming $W_\ell$ is invertible, should be the same as $$\frac{1}{n} \sum_x \nabla f_{\ell+1}(h_{\ell+1}(x)) \, \nabla f_{\ell+1}(h_{\ell+1}(x))^\top \propto I.$$ But also, they claim that it is $\propto W_{\ell+1}^\top W_{\ell+1}$? Are they secretly approximating everything with identity matrices?? This doesn't seem to be the case from their Figure 2 though.

Oh oops, I guess I forgot about activation functions here! There should be extra diagonal terms for the Jacobians of the preactivation-to-activation maps in the chain rule above, i.e., it should really say $\nabla f_\ell(h_\ell(x)) = W_\ell^\top D_{\ell+1}(x) \, \nabla f_{\ell+1}(h_{\ell+1}(x))$, where $D_{\ell+1}(x)$ is the diagonal matrix of activation-function derivatives at the preactivations of layer $\ell+1$. We now instead get $$W_\ell^\top W_\ell \propto W_\ell^\top \left( \frac{1}{n} \sum_x D_{\ell+1}(x) \, \nabla f_{\ell+1}(h_{\ell+1}(x)) \, \nabla f_{\ell+1}(h_{\ell+1}(x))^\top D_{\ell+1}(x) \right) W_\ell.$$ This should be the same as $$\frac{1}{n} \sum_x D_{\ell+1}(x) \, \nabla f_{\ell+1}(h_{\ell+1}(x)) \, \nabla f_{\ell+1}(h_{\ell+1}(x))^\top D_{\ell+1}(x) \propto I,$$ which, with $z_{\ell+1}(x)$ denoting the preactivations in layer $\ell+1$ and $g_{\ell+1}$ denoting the function from these preactivations to the output, is the same as $$\frac{1}{n} \sum_x \nabla g_{\ell+1}(z_{\ell+1}(x)) \, \nabla g_{\ell+1}(z_{\ell+1}(x))^\top \propto I.$$ This last thing also totally works with activation functions other than ReLU — one can get this directly from the Jacobian calculation. I made the ReLU assumption earlier because I thought for a bit that one can get something further in that case; I no longer think this, but I won't go back and clean up the presentation atm.
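Here's a quick finite-difference check of this corrected chain-rule step, a sketch of my own in numpy (not code from the paper; the two-hidden-layer architecture, sizes, and names are made up, and the check is at $\ell = 1$, i.e., that the gradient of the layer-1-to-output map equals $W_1^\top D_2(x)$ times the gradient of the layer-2-to-output map):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# a tiny 2-hidden-layer ReLU MLP with scalar output (made-up sizes)
d0, d1, d2 = 6, 5, 4
W0 = rng.standard_normal((d1, d0))  # weight matrix after activation layer 0 (the input)
W1 = rng.standard_normal((d2, d1))  # weight matrix after activation layer 1
w2 = rng.standard_normal(d2)        # readout after activation layer 2

x = rng.standard_normal(d0)
h1 = relu(W0 @ x)            # activations of layer 1
z2 = W1 @ h1                 # preactivations of layer 2
D2 = (z2 > 0).astype(float)  # diagonal of the ReLU Jacobian at z2

# f_1: the function from layer-1 activations to the output; the grad of f_2 wrt h2 is just w2
f1 = lambda h: w2 @ relu(W1 @ h)
grad_chain = W1.T @ (D2 * w2)  # the claimed chain-rule expression for grad_{h1} f_1

# central finite differences for grad_{h1} f_1
step = 1e-6
grad_fd = np.array([(f1(h1 + step * e) - f1(h1 - step * e)) / (2 * step) for e in np.eye(d1)])

print(np.max(np.abs(grad_chain - grad_fd)))  # ~1e-9: the two agree
```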

Anyway, a takeaway is that the Deep Neural Feature Ansatz is equivalent to the (imo cleaner) ansatz that the set of gradients of the output wrt the pre-activations of any layer is close to being a tight frame (in other words, the gradients are in isotropic position; in other words still, the data matrix of the gradients is a constant times a semi-orthogonal matrix). (Note that the closeness one immediately gets isn't closeness of the gradient set to a tight frame; it's just closeness in the quantity defining the tightness of a frame. But I'd guess that if it matters, one can also conclude some kind of closeness of the vectors themselves from this (related).) This seems like a nicer fundamental condition because (1) we've intuitively canceled terms and (2) it now looks like a generic-ish condition, i.e., it looks less mysterious, though idk how to argue for this beyond some handwaving about genericness, about other stuff being independent, sth like that.

Proof of the tight frame claim from the previous condition: write $g_x$ for the gradient (wrt the preactivations) on input $x$. Note that $\frac{1}{n} \sum_x g_x g_x^\top \propto I$ clearly implies that the mass $\frac{1}{n} \sum_x (g_x \cdot v)^2$ is the same in any unit direction $v$, but also the mass being the same in any direction implies the above (because then, letting the SVD of the matrix $G$ with these gradients in its columns be $G = U \Sigma V^\top$, the constant-mass condition forces all eigenvalues of $G G^\top = U \Sigma \Sigma^\top U^\top$ to be equal, i.e. $\Sigma \Sigma^\top \propto I$, so $G G^\top \propto U U^\top = I$, where we used the fact that $U$ is orthogonal).

- Can one come up with some similar ansatz identity for the left singular vectors of $W_\ell$? One point of tension/interest here is that an ansatz identity for $W_\ell W_\ell^\top$ would constrain the left singular vectors of $W_\ell$ together with its singular values, but the singular values are constrained already by the deep neural feature ansatz. So if there were another identity for $W_\ell W_\ell^\top$ in terms of some gradients, we'd get a derived identity from equality between the singular values defined in terms of those gradients and the singular values defined in terms of the Deep Neural Feature Ansatz. Or actually, there probably won't be an interesting identity here, since given the cancellation above, it now feels like nothing about $W_\ell$ is really pinned down by the DNFA — the condition it reduces to is about gradients 'independent of $W_\ell$'. Of course, some $W_\ell$-dependence remains even in those gradients, because the preactivations at which further gradients get evaluated are somewhat $W_\ell$-dependent, so I guess it's not ruled out that the DNFA constrains something interesting about $W_\ell$? But anyway, all this seems to undermine the interestingness of the DNFA, as well as the chance of there being an interesting similar ansatz for the left singular vectors of $W_\ell$.
- Can one heuristically motivate that the preactivation gradients above should indeed be close to being in isotropic position? Can one use this reduction to provide simpler proofs of some of the propositions in the paper which say that the DNFA is exactly true in certain very toy cases?
- The authors claim that the DNFA is supposed to somehow elucidate feature learning (indeed, they claim it is a mechanism of feature learning?). I take 'feature learning' to mean something like: which neuronal functions (from the input) are created, or which functions are computed in a layer in some broader sense (maybe which things are made linearly readable?), or which directions in an activation space get amplified, or maybe, less precisely, just the process of some internal functions (from the input to internal activations) being learned, or something like that — which apparently happens in finite networks, in contrast to infinitely wide networks or NTK models or something like that which I haven't yet understood? I understand that their heuristic identity on the surface connects something about a weight matrix to something about gradients, but assuming I've not made some index-off-by-one error or something, it seems to probably not really be about that at all, since the weight matrix sorta cancels out — if it's true for one $W_\ell$, it would maybe also be true with any other matrix replacing it, so it doesn't really pin down $W_\ell$? (This might turn out to be false if the isotropy of preactivation gradients is only true for a very particular choice of $W_\ell$.) But like, ignoring that counter, I guess their point is that the directions which get stretched most by the weight matrix in a layer are the directions along which it would be best to move locally in that activation space to affect the output? (They don't explain it this way though — maybe I'm ignorant of some other meaning having been attributed to $W_\ell^\top W_\ell$ in previous literature or something.) But they say "Informally, this mechanism corresponds to the approach of progressively re-weighting features in proportion to the influence they have on the predictions.". I guess maybe this is an appropriate description of the math if they are talking about reweighting in the purely linear sense, and they take features in the input layer to be scaleless objects or something? (Like, if we take features in the input activation space to each have some associated scale, then the right singular vector identity no longer says that the most influential features get stretched the most.) I wish they were much more precise here, or, if there isn't a precise interesting philosophical thing to be deduced from their math, much more honest about that, and much less PR-y.
- So, in brief, instead of "informally, this mechanism corresponds to the approach of progressively re-weighting features in proportion to the influence they have on the predictions," it seems to me that what the math warrants would be sth more like "The weight matrix reweights stuff; after reweighting, the activation space is roughly isotropic wrt affecting the prediction (ansatz); so, the stuff that got the highest weight has most effect on the prediction now." I'm not that happy with this last statement either, but atm it seems much more appropriate than their claim.
- I guess if I'm not confused about something major here (plausibly I am), one could probably add 1000 experiments (e.g. checking that the isotropic version of the ansatz indeed equally holds in a bunch of models) and write a paper responding to them. If you're reading this and this seems interesting to you, feel free to do that — I'm also probably happy to talk to you about the paper.
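If someone does want to poke at this, here's roughly the harness I have in mind, a sketch of my own in plain numpy rather than anything from the paper (made-up sizes, and a randomly-initialized toy MLP in place of the real trained models one would actually want to check): it measures (a) how isotropic the preactivation gradients of a layer are and (b) the original right-singular-vector form of the ansatz at the input layer.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# a small randomly-initialized ReLU MLP with scalar output (stand-in for a real model)
d0, d1, d2, n = 20, 30, 30, 5000
W0 = rng.standard_normal((d1, d0)) / np.sqrt(d0)
W1 = rng.standard_normal((d2, d1)) / np.sqrt(d1)
w2 = rng.standard_normal(d2) / np.sqrt(d2)

X = rng.standard_normal((n, d0))  # a stand-in "data set"

grads_z1, grads_x = [], []  # gradients wrt layer-1 preactivations and wrt the input
for x in X:
    z1 = W0 @ x
    z2 = W1 @ relu(z1)
    g_z1 = (z1 > 0) * (W1.T @ ((z2 > 0) * w2))  # chain rule by hand
    grads_z1.append(g_z1)
    grads_x.append(W0.T @ g_z1)
G_z1 = np.mean([np.outer(g, g) for g in grads_z1], axis=0)
G_x = np.mean([np.outer(g, g) for g in grads_x], axis=0)

# (a) isotropic version of the ansatz: is G_z1 close to a multiple of the identity?
ev = np.linalg.eigvalsh(G_z1)
print("min/max eigenvalue of G_z1:", ev.min() / ev.max())  # 1.0 would be exactly isotropic

# (b) original DNFA at the input layer: does the top eigenvector of G_x
#     match the top right singular vector of W0?
_, _, Vt = np.linalg.svd(W0)
evecs = np.linalg.eigh(G_x)[1][:, ::-1]  # eigenvectors, descending eigenvalue order
print("|cos| between top directions:", abs(Vt[0] @ evecs[:, 0]))  # 1.0 would be exact agreement
```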

indexing error in the first displaymath in Sec 2: it probably should say '', not ''

I think the world would probably be much better if everyone made a bunch more of their notes public. I intend to occasionally copy some personal notes on ML(?) papers into this thread. While I hope that the notes I end up selecting for posting here will be of interest to some people, and that people will sometimes comment with their thoughts on the same paper and on my thoughts (please do tell me how I'm wrong, etc.), I expect that the notes here will not be significantly more polished than typical notes I write for myself, and that my reasoning will be suboptimal; I also expect most of these notes won't really make sense unless you're also familiar with the paper — the notes will typically be companions to the paper, not substitutes.

I expect I'll sometimes be meaner in these notes than some norm somewhere dictates (in fact, I expect I'll sometimes be simultaneously mean and wrong/confused — exciting!), but I should clarify that I think almost all ML papers/posts/notes are trash, so me being mean to a particular paper is not necessarily evidence that I think it's worse than average. If anything, the papers I post notes about at least had something worth thinking/writing about, which seems like a good thing! In particular, they probably contained at least one interesting idea!

So, anyway: I'm warning you that the notes in this thread will be messy and not self-contained, and telling you that reading them might not be a good use of your time :)


I'd be very interested in a concrete construction of a (mathematical) universe in which, in some reasonable sense that remains to be made precise, two 'orthogonal pattern-universes' (preferably each containing 'agents' or 'sophisticated computational systems') live on 'the same fundamental substrate'. One of the many reasons I'm struggling to make this precise is that I want there to be some condition which meaningfully rules out trivial constructions in which the low-level specification of such a universe can be decomposed into a pair $(x_1, x_2)$ such that $x_1$ and $x_2$ are 'independent', everything in the first pattern-universe is a function only of $x_1$, and everything in the second pattern-universe is a function only of $x_2$. (Of course, I'd also be happy with an explanation of why this is a bad question :).)


I find [the use of square brackets to show the merge structure of [a linguistic entity that might otherwise be confusing to parse]] delightful :)


I'd be quite interested in elaboration on getting faster alignment researchers not being alignment-hard — it currently seems likely to me that a research community of unupgraded alignment researchers with a hundred years is capable of solving alignment (conditional on alignment being solvable). (And having faster general researchers, a goal that seems roughly equivalent, is surely alignment-hard (again, conditional on alignment being solvable), because we can then get the researchers to quickly do whatever it is that we could do — e.g., upgrading?)


I was just claiming that your description of pivotal acts / of people who support pivotal acts was incorrect, in a way that people who think pivotal acts are worth considering would consider very significant, and in a way that significantly reduces the force of your argument as applied to what people actually mean by pivotal acts — I don't see anything in your comment that responds to that claim. I would like the question of whether pivotal acts are a good idea, with this in mind, to be a separate discussion.

Now, in this separate discussion: I agree that executing a pivotal act with just a narrow, safe superintelligence is a difficult problem. That said, all paths to a state of safety from AGI that I can think of seem to contain difficult steps, so I think a more fine-grained analysis of the difficulty of various steps would be needed. I broadly agree with your description of the political character of pivotal acts, but I disagree with what you claim about associated race dynamics — it seems plausible to me that if pivotal acts became the main paradigm, then we'd have a world in which a majority of relevant people are willing to cooperate / do not want to race that much against others in the majority, and it'd mostly be a race between this group and e/acc types. I would also add, though, that the kinds of governance solutions/mechanisms I can think of that are sufficient to (for instance) make it impossible to perform distributed training runs on consumer devices also seem quite authoritarian.


In this comment, I will be assuming that you intended to talk of "pivotal acts" in the standard (distribution of) sense(s) people use the term — if your comment is better described as using a different definition of "pivotal act", including when "pivotal act" is used by the people in the dialogue you present, then my present comment applies less.

I think that this is a significant mischaracterization of what most (? or definitely at least a substantial fraction of) pivotal activists mean by "pivotal act" (in particular, I think this is a significant mischaracterization of what Yudkowsky has in mind). (I think the original post also uses the term "pivotal act" in a somewhat non-standard way in a similar direction, but to a much lesser degree.) Specifically, I think it is false that the primary kinds of plans this fraction of people have in mind when talking about pivotal acts involve creating a superintelligent nigh-omnipotent infallible FOOMed properly aligned ASI. Instead, the kind of person I have in mind is very interested in coming up with pivotal acts that do not use a general superintelligence, often looking for pivotal acts that use a narrow superintelligence (for instance, a narrow nanoengineer) (though this is also often considered very difficult by such people (which is one of the reasons they're often so doomy)). See, for instance, the discussion of pivotal acts in https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty.


A few notes/questions about things that seem like errors in the paper (or maybe I'm confused — anyway, none of this invalidates any conclusions of the paper, but if I'm right or at least justifiably confused, then these do probably significantly hinder reading the paper; I'm partly posting this comment to possibly prevent some readers in the future from wasting a lot of time on the same issues):

1) The formula for here seems incorrect:

This is because W_i is a feature corresponding to the i'th coordinate of x (this is not evident from the screenshot, but it is evident from the rest of the paper), so surely what shows up in this formula should not be W_i, but instead the i'th row of the matrix which has columns W_i (this matrix is called W later). (If one believes that W_i is a feature, then one can see this is wrong already from the dimensions in the dot product not matching.)

2) Even though you say in the text at the beginning of Section 3 that the input features are independent, the first sentence below made me make a pragmatic inference that you are not assuming that the coordinates are independent for this particular claim about how the loss simplifies (in part because if you were assuming independence, you could replace the covariance claim with a weaker variance claim, since the 0 covariance part is implied by independence):

However, I think you do use the fact that the input features are independent in the proof of the claim (at least you say "because the x's are independent"):

Additionally, if you are in fact just using independence in the argument here and I'm not missing something, then I think that instead of saying you are using the moment-cumulants formula here, it would be much, much better to say that independence implies that any term with an unmatched index is 0 (see the small Monte Carlo sketch at the end of this comment). If you mean the moment-cumulants formula here https://en.wikipedia.org/wiki/Cumulant#Joint_cumulants , then (while I understand how to derive every equation of your argument in case the inputs are independent) I'm currently confused about how that's helpful at all, because one then still needs to analyze which terms of each cumulant are 0 (and how the various terms cancel for various choices of the matching pattern of indices), and this seems strictly more complicated than the problem before translating to cumulants, unless I'm missing something obvious.

3) I'm pretty sure this should say x_i^2 instead of x_i x_j, and as far as I can tell the LHS has nothing to do with the RHS:

(I think it should instead say sth like that the loss term is proportional to the squared difference between the true and predictor covariance.)
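Regarding the independence point in 2) above, here's the tiny Monte Carlo illustration promised there (my own sketch, assuming, as I take to be the setting, that the input coordinates are independent and zero-mean): moments with an unmatched index vanish, fully matched ones don't.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10**6, 4))  # independent, zero-mean coordinates

# terms with an unmatched index vanish in expectation...
print(np.mean(x[:, 0] ** 2 * x[:, 1]))                  # ~0
print(np.mean(x[:, 0] * x[:, 1] * x[:, 2] * x[:, 3]))   # ~0
# ...while fully matched terms do not
print(np.mean(x[:, 0] ** 2 * x[:, 1] ** 2))             # ~1
```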

I think most of the quantitative claims in the current version of the above comment are false/nonsense/[using terms non-standardly]. (Caveat: I only skimmed the original post.)

If by 'cosine similarity' you mean what's usually meant, which I take to be the cosine of the angle between two vectors, then the cosine only depends on the directions of vectors, not their magnitudes. (Some parts of your comment look like you meant to say 'dot product'/'projection' when you said 'cosine similarity', but I don't think making this substitution everywhere makes things make sense overall either.)

For 0.3 in particular, the number of nearly-orthogonal vectors with at least that cosine similarity to a given vector $d$ is actually small. Assuming I calculated correctly, the number of unit vectors which are pairwise nearly orthogonal (say, with pairwise dot products at most $0.01$ in absolute value) and which each have at least that cosine with a given vector is at most 23 (the ambient dimension does not show up in this upper bound). I provide the calculation later in my comment.

This doesn't make sense. For alpha1 to be cos(theta1, d), you can't freely choose the magnitude of noise1.

## How many nearly-orthogonal vectors can you fit in a spherical cap?

**Proposition.** Let $d \in \mathbb{R}^n$ be a unit vector and let $\theta_1, \dots, \theta_m \in \mathbb{R}^n$ also be unit vectors such that they all sorta point in the $d$ direction, i.e., $\theta_i \cdot d \geq \delta$ for a constant $\delta > 0$ (I take you to have taken $\delta = 0.3$), and such that the $\theta_i$ are nearly orthogonal, i.e., $|\theta_i \cdot \theta_j| \leq \epsilon$ for all $i \neq j$, for another constant $\epsilon > 0$. Assume also that $\epsilon < \delta^2$. Then $m \leq \frac{2(1 - \delta^2)}{\delta^2 - \epsilon} + 1$.

**Proof.** We can decompose $\theta_i = \alpha_i d + \sqrt{1 - \alpha_i^2}\, u_i$, with $u_i$ a unit vector orthogonal to $d$; then $\alpha_i \geq \delta$. Given $\epsilon < \delta$, it's a 3d geometry exercise to show that pushing all vectors to the boundary of the spherical cap around $d$ can only decrease each pairwise dot product; doing this gives a new collection of unit vectors $v_i = \delta d + \sqrt{1 - \delta^2}\, u_i$, still with $\epsilon \geq v_i \cdot v_j = \delta^2 + (1 - \delta^2)\, u_i \cdot u_j$. This implies that $u_i \cdot u_j \leq -\frac{\delta^2 - \epsilon}{1 - \delta^2}$. Note that since $\epsilon < \delta^2$, the RHS is some negative constant. Consider $\left( \sum_i u_i \right)^2$. On the one hand, it has to be nonnegative. On the other hand, expanding it, we get that it's at most $m - \binom{m}{2} \frac{\delta^2 - \epsilon}{1 - \delta^2}$. From this, $0 \leq 1 - \frac{m - 1}{2} \cdot \frac{\delta^2 - \epsilon}{1 - \delta^2}$, whence $m \leq \frac{2(1 - \delta^2)}{\delta^2 - \epsilon} + 1$.

(acknowledgements: I learned this from some combination of Dmitry Vaintrob and https://mathoverflow.net/questions/24864/almost-orthogonal-vectors/24887#24887 )

For example, for $\delta = 0.3$ and $\epsilon = 0.01$, this gives $m \leq 23$.

(I believe this upper bound for the number of almost-orthogonal vectors is actually basically exactly met in sufficiently high dimensions — I can probably provide a proof (sketch) if anyone expresses interest.)

**Remark.** If $\epsilon > \delta^2$, then one starts to get exponentially many vectors in the dimension again, as one can see by picking a bunch of random vectors on the boundary of the spherical cap.
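A quick Monte Carlo illustration of the two regimes (my own sketch; rejection sampling on the cap boundary is of course not an optimal construction, so this only illustrates the flavor of the bound rather than its exact value):

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_cap_packing(n, delta, eps, tries=5000):
    """Greedily collect unit vectors v = delta*d + sqrt(1-delta^2)*u (u random and
    orthogonal to d) whose pairwise dot products all have absolute value <= eps."""
    d = np.zeros(n); d[0] = 1.0
    kept = []
    for _ in range(tries):
        u = rng.standard_normal(n)
        u[0] = 0.0
        u /= np.linalg.norm(u)
        v = delta * d + np.sqrt(1 - delta**2) * u
        if all(abs(v @ w) <= eps for w in kept):
            kept.append(v)
    return len(kept)

n, delta = 10000, 0.3
# eps < delta^2: random pairs on the cap boundary have dot products near delta^2 = 0.09,
# so essentially nothing past the first vector gets accepted (and the proposition caps
# the count at 23 no matter how cleverly the vectors are chosen)
print(greedy_cap_packing(n, delta, eps=0.01))
# eps > delta^2: the bound no longer applies, and one easily collects far more than 23
print(greedy_cap_packing(n, delta, eps=0.10))
```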

## What about the philosophical point? (low-quality section)

Ok, the math seems to have issues, but does the philosophical point stand up to scrutiny? Idk, maybe — I haven't really read the post to check relevant numbers or to extract all the pertinent bits to answer this well. It's possible it goes through with a significantly smaller $\delta$, or if the vectors weren't really that orthogonal, or something. (To give a better answer, the first thing I'd try to understand is whether this behavior is basically first-order — more precisely, is there some reasonable loss function on perturbations of the relevant activation space which captures perturbations being coding perturbations, and are all of these vectors first-order perturbations toward coding in this sense? If the answer is yes, then there just has to be such a vector $d$ — it'd just be the gradient of this loss.)