PhD student at UCL. Interested in mech interp.
Interesting stuff! I'm very curious as to whether removing layer norm damages the model in some measurable way.
One thing that comes to mind is that previous work finds that the final LN is responsible for mediating 'confidence' through 'entropy neurons'; if you've trained sufficiently, I would expect these neurons to have disappeared, which then raises the question of whether the model still exhibits this kind of confidence self-regulation.
That makes sense to me. I guess I'm dissatisfied here because the idea of an ensemble seems to be that individual components in the ensemble are independent, whereas in the unraveled view of a residual network, different paths still interact with each other. For example, if two paths overlap, then ablating one of them could, in principle, also change the value computed by the other path. This seems to be the mechanism that explains redundancy.
This is a paper reproduction in service of achieving my seasonal goals.
Recently, it was demonstrated that circular features are used in the computation of modular addition tasks in language models. I've reproduced this for GPT-2 small in this Colab.
We've confirmed that days of the week do appear to be represented in a circular fashion in the model. Furthermore, looking at feature dashboards agrees with the discovery; this suggests that simply looking up features that detect tokens in the same conceptual 'category' could be another way of finding clusters of features with interesting geometry.
Next steps:
1. Here, we've selected 9 SAE features, computed the reconstruction from them, and then compressed it down via PCA. However, were all 9 features necessary? Could we remove some of them without hurting the visualization?
2. The SAE reconstruction using 9 features is probably a very small component of the model's overall representation of this token. What's in the rest of the representation? Is it mostly orthogonal to the SAE reconstruction, or is there a sizeable component remaining in this 9-dimensional subspace? If the latter, it would indicate that the SAE representation here is not a 'full' representation of the original model.
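One way to start on question 2 is sketched below. This is a minimal sketch on random placeholder data, not the code from the Colab: `decoder_dirs`, `feature_acts`, and `activation` are hypothetical stand-ins for the 9 SAE decoder directions, their activations on a token, and the model's residual-stream vector. It also assumes the reconstruction is a plain linear combination of decoder directions (no decoder bias) and that the 9 directions are linearly independent.

```python
import numpy as np

def split_residual(residual, decoder_dirs):
    """Split `residual` into the part inside span(decoder_dirs) and the part orthogonal to it.

    residual: shape (d_model,)
    decoder_dirs: shape (k, d_model), assumed linearly independent
    """
    # Reduced QR gives an orthonormal basis (columns of q) for the feature subspace.
    q, _ = np.linalg.qr(decoder_dirs.T)  # q has shape (d_model, k)
    inside = q @ (q.T @ residual)
    outside = residual - inside
    return inside, outside

def fraction_in_subspace(residual, decoder_dirs):
    """Fraction of the residual's squared norm that lies inside the feature subspace."""
    inside, _ = split_residual(residual, decoder_dirs)
    return float(inside @ inside / (residual @ residual))

# Placeholder data (assumption): random directions and activations, GPT-2 small width.
rng = np.random.default_rng(0)
d_model, k = 768, 9
decoder_dirs = rng.normal(size=(k, d_model))
feature_acts = rng.uniform(0.0, 5.0, size=k)
reconstruction = feature_acts @ decoder_dirs
activation = reconstruction + rng.normal(size=d_model)

# What is left of the activation once the 9-feature reconstruction is removed?
residual = activation - reconstruction
inside, outside = split_residual(residual, decoder_dirs)
frac = fraction_in_subspace(residual, decoder_dirs)
print(f"fraction of residual norm^2 inside the 9-dim subspace: {frac:.4f}")
```

On real activations, a fraction near zero would suggest the rest of the representation is mostly orthogonal to the SAE reconstruction, while a large fraction would point to the second case described above.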
Thanks to Egg Syntax for pair programming and Josh Engels for help with the reproduction.
If I understand correctly, you're saying that my expansion is wrong, because , which I agree with.
This is a great article! I find the notion of a 'tacit representation' very interesting, and it makes me wonder whether we can construct a toy model where something is only tacitly (but not explicitly) represented. For example, having read the post, I'm updated towards believing that the goals of agents are represented tacitly rather than explicitly, which would make MI for agentic models much more difficult.
One minor point: There is a conceptual difference, but perhaps not an empirical difference, between 'strong LRH is false' and 'strong LRH is true but the underlying features aren't human-interpretable'. I think our existing techniques can't yet distinguish between these two cases.
Relatedly, I (with collaborators) recently released a paper on evaluating steering vectors at scale: https://arxiv.org/abs/2407.12404. We found that many concepts (as defined in model-written evals) did not steer well, which has updated me towards believing that these concepts are not linearly represented. This in turn weakly updates me towards believing strong LRH is false, although this is definitely not a rigorous conclusion.
That's a really interesting blogpost, thanks for sharing! I skimmed it but I didn't really grasp the point you were making here. Can you explain what you think specifically causes self-repair?
I agree, this seems like exactly the same thing, which is great! In hindsight it's not surprising that you / other people have already thought about this.
Do you think the 'tree-ified view' (to use your name for it) is a good abstraction for thinking about how a model works? Are individual terms in the expansion the right unit of analysis?
Fair point, and I should amend the post to point out that AMFOTC also does 'path expansion'. However, I think this is still conceptually distinct from AMFOTC because:
Maybe this post is better framed as 'reconciling AMFOTC with SAE circuit analysis'.
What's a better way to incorporate the mentioned sample-level variance in measuring the effectiveness of an SAE feature or SV?
In the steering vectors work I linked, we looked at how much of the variance in the metric was explained by a spurious factor, and I think that could be a useful technique if you have some a priori intuition about what the variance might be due to. However, this doesn't mean we can just test a bunch of hypotheses, because that looks like p-hacking.
Generally, I do think that 'population variance' should be a metric that's reported alongside 'population mean' in order to contextualize findings. But again, this doesn't paint a very clean picture; high variance could be due to heteroscedasticity, among other things.
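To make 'variance explained by a spurious factor' concrete, here is a minimal sketch using eta-squared (between-group sum of squares over total sum of squares). This is an assumption about how one might compute it, not necessarily the exact statistic from the steering vectors paper. The function name `variance_explained` and the toy numbers are made up for illustration, and the metric is assumed to have non-zero variance.

```python
import numpy as np

def variance_explained(metric, group):
    """Eta-squared: fraction of variance in `metric` explained by a categorical `group` label."""
    metric = np.asarray(metric, dtype=float)
    group = np.asarray(group)
    grand_mean = metric.mean()
    total_ss = ((metric - grand_mean) ** 2).sum()
    # Between-group sum of squares: group size times squared deviation of the group mean.
    between_ss = sum(
        (group == g).sum() * (metric[group == g].mean() - grand_mean) ** 2
        for g in np.unique(group)
    )
    return float(between_ss / total_ss)

# Toy data (made up): a per-sample metric and a candidate spurious factor.
metric = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])
template = np.array(["ABBA"] * 3 + ["BABA"] * 3)
eta_sq = variance_explained(metric, template)
print(f"mean={metric.mean():.2f}, variance={metric.var():.2f}, eta^2={eta_sq:.3f}")
```

Reporting the mean, the variance, and the share of variance attributable to a factor chosen in advance keeps this closer to a pre-specified check than to testing many hypotheses after the fact.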
I don't have great solutions for this illusion outside of those two recommendations. One naive way we might try to solve this is to remove things from the dataset until the variance is minimal, but it's hard to do this in a principled way that doesn't eventually look like p-hacking.
Do you also conclude that the causal role of the circuit you discovered was spurious?
an example where causal intervention satisfied the above-mentioned (or your own alternative that was not mentioned in this post) criteria
I would guess that the IOI SAE circuit we found is not unduly influenced by spurious factors, and that an analysis of the variance in the metric difference explained by ABBA vs. BABA would corroborate this. I haven't rigorously tested this, but I'd be very surprised if this turned out not to be the case.
My Seasonal Goals, Jul - Sep 2024
This post is an exercise in public accountability and harnessing positive peer pressure for self-motivation.
By 1 October 2024, I am committing to have produced:
Habits I am committing to that will support this: