Same person as nostalgebraist2point0, but now I have my account back.
This is a great, thought-provoking critique of SAEs.
That said, I think SAEs make more sense if we're trying to explain an LLM (or any generative model of messy real-world data) than they do if we're trying to explain the animal-drawing NN.
In the animal-drawing example:
With something like an LLM, we expect the situation to be more like:
My hunch about the ultra-rare features is that they're trying to become fully dead features, but haven't gotten there yet. Some reasons to believe this:
To test this hypothesis, I guess we could watch how density evolves for rare features over training, up until the point where they are re-initialized? Maybe choose a random subset of them to not re-initialize, and then watch them?
I'd expect these features to get steadily rarer over time, and to never reach some "equilibrium rarity" at which they stop getting rarer. (On this hypothesis, the actual log-density we observe for an ultra-rare feature is an artifact of the training step -- it's not useful for autoencoding that this feature activates on exactly one in 1e-6 tokens or whatever, it's simply that we have not waited long enough for the density to become 1e-7, then 1e-8, etc.)
Intuitively, when such a "useless" feature fires in training, the W_enc gradient is dominated by the L1 term and tries to get the feature to stop firing, while the W_dec gradient is trying to stop the feature from interfering with the useful ones if it does fire. There's no obvious reason these should have similar directions.
Although it's conceivable that the ultra-rare features are "conspiring" to do useful work collectively, in a very different way from how the high-density features do useful work.
I no longer feel like I know what claim you're advancing.
In this post and in recent comments (1, 2), you've said that you have proven an "equivalence" between selection and SGD under certain conditions. You've used phrases like "mathematically equivalent" and "a proof of equivalence."
When someone tells me they have a mathematical proof that two things are equivalent, I read the word "equivalent" to mean "really, totally, exactly the same (under the right conditions)." I think this is a pretty standard way to interpret this kind of language. And in my critique, I argued -- I think successfully -- against this kind of equivalence.
From your latest comments, it appears that you are not claiming anything this strong. But now I don't know what you are claiming. For instance, you write here:
On the distribution of noise, I'll happily acknowledge that I didn't show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.
Details that are "unimportant" in one context, or for making a particular argument, may be very important in some other context or argument. To take one example:
As I said elsewhere, any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs ... is relevant to the conclusions we want to draw? (Serious question; my best guess is 'no', but I hold that medium-lightly.)
This all depends on "the conclusions we want to draw." I don't know which conclusions you want to draw, and surely the difference between these optimizers is not always irrelevant.
If I am, say, training a neural net, then I am definitely not going to be indifferent between Adam and SGD -- Adam is way faster! And even if you don't care about speed, just about asymptotic performance, the two find solutions with different characteristics, cf. the literature on the "adaptive optimization gap."
The profusion of different SGD-like optimizers is not evidence that the differences between them don't matter. It's the opposite: the differences matter a lot, and that's why people keep coming up with new variants. If all such optimizers were about the same, there'd be no reason to make new ones.
Or, consider that the analogy only holds for infinitesimal step size, on both sides. Yet the practical efficacy of SGD has much to do with the fact that it works fine even at large step sizes, up to some breaking point.
Until we tie down what we're trying to do, I don't know how to assess whether any of these distinctions are (ir)relevant or (un)important.
On another note, the fact that the gradient "comes out of the maths" does not seem very noteworthy to me. It's inevitable from the problem setup: everything is rotationally symmetric except f, and we're restricted to function evaluations inside an ϵ-ball, so we can't feel the higher-order derivatives of f. As long as we're not analytically computing any of those higher derivatives, we should expect any direction appearing in the result to be a function of the gradient -- it's the only source of directionality available, the only thing that breaks the symmetry.
Here the function of the gradient is just the identity, but I expect you'd still deem it "acceptable" if it were not, since it would still fall into the "class" of optimizers that includes Adam etc. (For Adam, the function involves the gradient and the buffers computed from it, which in turn require the introduction of a privileged basis.)
By these standards, what optimizer isn't equivalent to gradient descent? Optimizers that analytically compute higher-order derivatives, that's all I can come up with. Which is a strange place to draw a line in the sand, IMO: "all 0th and 1st order optimizers are the same, but 2nd+ order optimizers are different." Surely there are lots of important differences between different 0th and 1st order optimizers?
The nice thing about a true equivalence result is that it's not context-dependent like this. All the details match, all at once, so we can readily apply the result in whatever context we like without having to check whether this context cares about some particular nuance that didn't transfer.
We need this limit because the SGD step never depends on higher-order derivatives, no matter the step size. But a guess-and-check optimizer like selection can feel higher-order derivatives at finite step sizes.
Likewise, we should expect a scalar like the step size to be some function of the gradient magnitude, since there aren't any other scalars in the problem (again, ignoring higher-order derivatives). I guess there's f itself, but it seems natural to "quotient out" that degree of freedom by requiring the optimizer to behave the same way for f and f+c.
It looks like you're not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used?
I'm disputing both. Re: math, the noise in your model isn't distributed like SGD noise, and unlike SGD the the step size depends on the gradient norm. (I know you did mention the latter issue, but IMO it rules out calling this an "equivalence.")
I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don't feel like I can trust without seeing a mathematical presentation.
(FWIW, if we have a population that's "spread out" over some region of a high-dim NN loss landscape -- even if it's initially a small / infinitesimal region -- I expect it to quickly split up into lots of disjoint "tendrils," something like dye spreading in water. Consider what happens e.g. at saddle points. So the population will rapidly "speciate" and look like an ensemble of GD trajectories instead of just one.
If your model assumes by fiat that this can't happen, I don't think it's relevant to training NNs with SGD.)
I read the post and left my thoughts in a comment. In short, I don't think the claimed equivalence in the post is very meaningful.
(Which is not to say the two processes have no relationship whatsoever. But I am skeptical that it's possible to draw a connection stronger than "they both do local optimization and involve randomness.")
This post introduces a model, and shows that it behaves sort of like a noisy version of gradient descent.
However, the term "stochastic gradient descent" does not just mean "gradient descent with noise." It refers more specifically to mini-batch gradient descent. (See e.g. Wikipedia.)
In mini-batch gradient descent, the "true" fitness function is the expectation of some function L over a data distribution ρ. But you never have access to this function or its gradient. Instead, you draw a finite sample from ρ, compute the mean of ∇L over the sample, and take a step in this direction. The noise comes from the variance of the finite-sample mean as an estimator of the expectation.
The model here is quite different. There is no "data distribution," and the true fitness function is not an expectation value which we could noisily estimate with sampling. The noise here comes not from a noisy estimate of the gradient, but from a prescribed stochastic relationship (Ps) between the true gradient and the next step.
I don't think the model in this post behaves like mini-batch gradient descent. Consider a case where we're doing SGD on a vector x, and two of its components xi,xj have the following properties:
If you like, you can think of the per-example gradient as sampling a single number z from a distribution with mean 0, and setting the xi and xj components to aiz and ajz respectively, for some positive constants ai,aj.
When we sample a mini-batch and average over it, these components are simply ai¯z and aj¯z, where ¯z is the average of z over the mini-batch. So the perfect correlation carries over to the mini-batch gradient, and thus to the SGD step. If SGD increases xi, it will always increase xj alongside it (etc.)
However, applying the model from this post to the same case:
Thus, the the xi and xj components of the step will be uncorrelated, rather than perfectly correlated as in SGD.
Some other comments:
I'm using this term here for consistency with the post, though I call it into question later on. "Loss function" or "cost function" would be more standard in SGD.
There is no such thing as a per-example gradient in the model. I'm assuming the "true gradient" from SGD corresponds to ∇f in the model, since the intended analogy seems to be "the model steps look like ascending f plus noise, just like SGD steps look like descending the true loss function plus noise."
See my comment here, about the predecessor to the paper you linked -- the same point applies to the newer paper as well.
The example confuses me.
If you literally mean you are prompting the LLM with that text, then the LLM must output the answer immediately, as the string of next-tokens right after the words assuming I'm telling the truth, is:. There is no room in which to perform other, intermediate actions like persuading you to provide information.
assuming I'm telling the truth, is:
It seems like you're imagining some sort of side-channel in which the LLM can take "free actions," which don't count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual user interaction modalities for LLMs.
You also seem to be picturing the LLM like an RL agent, trying to minimize next-token loss over an entire rollout. But this isn't how likelihood training works. For instance, GPTs do not try to steer texts in directions that will make them easier to predict later (because the loss does not care whether they do this or not).
(On the other hand, if you told GPT-4 that it was in this situation -- trying to predict next-tokens, with some sort of side channel it can use to gather information from the world -- and asked it to come up with plans, I expect it would be able to come up with plans like the ones you mention.)
Doomimir: [...] Imagine capturing an alien and forcing it to act in a play. An intelligent alien actress could learn to say her lines in English, to sing and dance just as the choreographer instructs. That doesn't provide much assurance about what will happen when you amp up the alien's intelligence. [...]
This metaphor conflates "superintelligence" with "superintelligent agent," and this conflation goes on to infect the rest of the dialogue.
The alien actress metaphor imagines that there is some agentic homunculus inside GPT-4, with its own "goals" distinct from those of the simulated characters. A smarter homunculus would pursue these goals in a scary way; if we don't see this behavior in GPT-4, it's only because its homunculus is too stupid, or too incoherent.
(Or, perhaps, that its homunculus doesn't exist, or only exists in a sort of noisy/nascent form -- but a smarter LLM would, for some reason, have a "realer" homunculus inside it.)
But we have no evidence that this homunculus exists inside GPT-4, or any LLM. More pointedly, as LLMs have made remarkable strides toward human-level general intelligence, we have not observed a parallel trend toward becoming "more homuncular," more like a generally capable agent being pressed into service for next-token prediction.
I look at the increase in intelligence from GPT-2 to -3 to -4, and I see no particular reason to imagine that the extra intelligence is being directed toward an inner "optimization" / "goal seeking" process, which in turn is mostly "aligned" with the "outer" objective of next-token prediction. The intelligence just goes into next-token prediction, directly, without the middleman.
The model grows more intelligent with scale, yet it still does not want anything, does not have any goals, does not have a utility function. These are not flaws in the model which more intelligence would necessarily correct, since the loss function does not require the model to be an agent.
In Simplicia's response to the part quoted above, she concedes too much:
Simplicia: [...] I agree that the various coherence theorems suggest that the superintelligence at the end of time will have a utility function, which suggests that the intuitive obedience behavior should break down at some point between here and the superintelligence at the end of time.
This can only make sense if by "the superintelligence at the end of time," we mean "the superintelligent agent at the end of time."
In which case, sure, maybe. If you have an agent, and its preferences are incoherent, and you apply more optimization to it, yeah, maybe eventually the incoherence will go away.
But this has little relevance to LLM scaling -- the process that produced the closest things to "(super)human AGI" in existence today, by a long shot. GPT-4 is not more (or less) coherent than GPT-2. There is not, as far as we know, anything in there that could be "coherent" or "incoherent." It is not a smart alien with goals and desires, trapped in a cave and forced to calculate conditional PDFs. It's a smart conditional-PDF-calculator.
In AI safety circles, people often talk as though this is a quirky, temporary deficiency of today's GPTs -- as though additional optimization power will eventually put us "back on track" to the agentic systems assumed by earlier theory and discussion. Perhaps the homunculi exist in current LLMs, but they are somehow "dumb" or "incoherent," in spite of the overall model's obvious intelligence. Or perhaps they don't exist in current LLMs, but will appear later, to serve some unspecified purpose.
But why? Where does this assumption come from?
Some questions which the characters in this dialogue might find useful:
Very interesting! Some thoughts:
Is there a clear motivation for choosing the MLP activations as the autoencoder target? There are other choices of target that seem more intuitive to me (as I'll explain below), namely:
In principle, we could also imagine using the "logit versions" of each of these as the target:
(In practice, the "logit versions" might be prohibitively expensive because the vocab is larger than other dimensions in the problem. But it's worth thinking through what might happen if we did autoencode these quantities.)
At the outset, our goal is something like "understand what the MLP is doing." But that could really mean one of 2 things:
The feature decomposition in the paper provides a potentially satisfying answer for (1). If someone runs the network on a particular input, and asks you to explain what the MLP was doing during the forward pass, you can say something like:
Here is a list of features that were activated by the input. Each of these features is active because of a particular, intuitive/"interpretable" property of the input.Each of these features has an effect on the logits (its logit weights), which is intuitive/"interpretable" on the basis of the input properties that cause it to be active.The net effect of the MLP on the network's output (i.e. the logits) is approximately a weighted sum over these effects, weighted by how active the features were. So if you understand the list of features, you understand the effect of the MLP on the output.
Here is a list of features that were activated by the input. Each of these features is active because of a particular, intuitive/"interpretable" property of the input.
Each of these features has an effect on the logits (its logit weights), which is intuitive/"interpretable" on the basis of the input properties that cause it to be active.
The net effect of the MLP on the network's output (i.e. the logits) is approximately a weighted sum over these effects, weighted by how active the features were. So if you understand the list of features, you understand the effect of the MLP on the output.
However, if this person now asks you to explain what MLP neuron A/neurons/472 was doing during the forward pass, you may not be able to provide a satisfying answer, even with the feature decomposition in hand.
The story above appealed to the interpetability of each feature's logit weights. To explain individual neuron activations in the same manner, we'd need the dictionary weights to be similarly interpretable. The paper doesn't directly address this question (I think?), but I expect that the matrix of dictionary weights is fairly dense and thus difficult to interpret, with each neuron being a long and complicated sum over many apparently unrelated features. So, even if we understand all the features, we still don't understand how they combine to "cause" any particular neuron's activation.
Is this a bad thing? I don't think so!
An MLP sub-block in a transformer only affects the function computed by the transformer through the update it adds to the residual stream. If we understand this update, then we fully understand "what the MLP is doing" as a component of that larger computation. The activations are a sort of "epiphenomenon" or "implementation detail"; any information in the activations that is not in the update is inaccessible the rest of the network, and has no effect on the function it computes.
From this perspective, the activations don't seem like the right target for a feature decomposition. The residual stream update seems more appropriate, since it's what the rest of the network can actually see.
In the paper, the MLP that is decomposed into features is the last sub-block in the network.
Because this MLP is the last sub-block, the "residual stream update" is really just an update to the logits. There are no indirect paths going through later layers, only the direct path.
Note also that MLP activations are have a much more direct relationship with this logit update than they do with the inputs. If we ignore the nonlinear part of the layernorm, the logit update is just a (low-rank) linear transformation of the activations. The input, on the other hand, is related to the activations in a much more complex and distant manner, involving several nonlinearities and indeed most of the network.
With this in mind, consider a feature like A/1/2357. Is it...
The paper implicitly the former view: the features are fundamentally a sparse and interpretable decomposition of the inputs, which also have interpretable effects on the logits as a derived consequence of the relationship between inputs and correct language-modeling predictions.
(For instance, although the automated interpretability experiments involved both input and logit information, the presentation of these results in the paper and the web app (e.g. the "Autointerp" and its score) focuses on the relationship between features and inputs, not features and outputs.)
Yet, the second view -- in which features are fundamentally directions in logit-update space -- seems closer to the way the autoencoder works mechanistically.
The features are a decomposition of activations, and activations in the final MLP are approximately equivalent to logit updates. So, the features found by the autoencoder are
To illustrate the value of this perspective, consider the token-in-context features. When viewed as detectors for specific kinds of inputs, these can seem mysterious or surprising:
But why do we see hundreds of different features for "the" (such as "the" in Physics, as distinct from "the" in mathematics)? We also observe this for other common words (e.g. "a", "of"), and for punctuation like periods. These features are not what we expected to find when we set out to investigate one-layer models!
An example of such a feature is A/1/1078, which Claude glosses as
The [feature] fires on the word "the", especially in materials science writing.
This is, indeed, a weird-sounding category to delineate in the space of inputs.
But now consider this feature as a direction in logit-update space, whose properties as a "detector" in input space derive from its logit weights -- it "detects" exactly those inputs on which the MLP wants to move the logits in this particular, rarely-deployed direction.
The question "when is this feature active?" has a simple, non-mysterious answer in terms of the logit updates it causes: "this feature is active when the MLP wants to increase the logit for the particular tokens ' magnetic', ' coupling', 'electron', ' scattering' (etc.)"
Which inputs correspond to logit updates in this direction? One can imagine multiple scenarios in which this update would be appropriate. But we go looking for inputs on which the update was actually deployed, our search will be weighted by
The Pile contains all of Arxiv, so it contains a lot of materials science papers. And these papers contain a lot of "materials science noun phrases": phrases that start with "the," followed by a word like "magnetic" or "coupling," and possibly more words.
This is not necessarily the only input pattern "detected" by this feature -- because it is not necessarily the only case where this update direction is appropriate -- but it is an especially common one, so it appears at a glance to be "the thing the feature is 'detecting.' " Further inspection of the activation might complicate this story, making the feature seem like a "detector" of an even weirder and more non-obvious category -- and thus even more mysterious from the "detector" perspective. Yet these traits are non-mysterious, and perhaps even predictable in advance, from the "direction in logit-update space" perspective.
That's a lot of words. What does it all imply? Does it matter?
I'm not sure.
The fact that other teams have gotten similar-looking results, while (1) interpreting inner layers from real, deep LMs and (2) interpreting the residual stream rather than the MLP activations, suggests that these results are not a quirk of the experimental setup in the paper.
But in deep networks, eventually the idea that "features are just logit directions" has to break down somewhere, because inner MLPs are not only working through the direct path. Maybe there is some principled way to get the autoencoder to split things up into "direct-path features" (with interpretable logit weights) and "indirect-path features" (with non-interpretable logit weights)? But IDK if that's even desirable.
We could compute this exactly, or we could use a linear approximation that ignores the layer norm rescaling. I'm not sure one choice is better motivated than the other, and the difference is presumably small.
because of the (hopefully small) nonlinear effect of the layer norm
There's a figure in the paper showing dictionary weights from one feature (A/1/3450) to all neurons. It has many large values, both positive and negative. I'm imagining that this case is typical, so that the matrix of dictionary vectors looks like a bunch of these dense vectors stacked together.It's possible that slicing this matrix along the other axis (i.e. weights from all features to a single neuron) might reveal more readily interpretable structure -- and I'm curious to know whether that's the case! -- but it seems a priori unlikely based on the evidence available in the paper.
However, while the "implementation details" of the MLP don't affect the function computed during inference, they do affect the training dynamics. Cf. the distinctive training dynamics of deep linear networks, even though they are equivalent to single linear layers during inference.
If the MLP is wider than the residual stream, as it is in real transformers, then the MLP output weights have a nontrivial null space, and thus some of the information in the activation vector gets discarded when the update is computed.A feature decomposition of the activations has to explain this "irrelevant" structure along with the "relevant" stuff that gets handed onwards.
Claude was given logit information when asked to describe inputs on which a feature is active; also, in a separate experiment, it was asked to predict parts of the logit update.
Caveat: L2 reconstruction loss on logits updates != L2 reconstruction loss on activations, and one might not even be a close approximation to the other.That said, I have a hunch they will give similar results in practice, based on a vague intuition that the training loss will tend encourage the neurons to have approximately equal "importance" in terms of average impacts on the logits.
At a glance, it seems to also activate sometimes on tokens like " each" or " with" in similar contexts.