I'm an independently funded AI Alignment Research Engineer focussing on mechanistic interpretability in reinforcement learning. I'm particularly interested in comparing circuits in decision transformers to those generated by other techniques.
Thanks Simon, I'm glad you found the app intuitive :)
The RTG is just another token in the input, except that it has an especially strong relationship with the training distribution. It's heavily predictive in a way other tokens aren't because it's derived from a labelled trajectory (it's the remaining reward in the trajectory from that step).
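For concreteness, here's a minimal sketch of how I think of the return-to-go labels (purely illustrative; it uses the usual decision transformer convention of counting the current step's reward):

```python
def returns_to_go(rewards):
    """Return-to-go at each timestep: the reward remaining in the labelled trajectory."""
    rtg = []
    remaining = sum(rewards)
    for r in rewards:
        rtg.append(remaining)  # reward still to come, counting this step (decision transformer convention)
        remaining -= r
    return rtg

print(returns_to_go([1, 0, 2]))  # -> [3, 2, 2]
```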
For BabyAI, the idea would be to use an instruction prepended to the trajectory, made up of a limited vocab (see the BabyAI paper for their vocab). I would be pretty partial to throwing out the RTG and using behavioral cloning for a BabyAI model; it seems likely this would be easier to train. Since the goal of these models is to be useful for gaining understanding, I'd like to avoid reusing tokens, as that might complicate analysis later on.
Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar, and this has enabled me to do what I call an "injection coefficient scan".
The procedure I'm using looks like this:
So far, my results seem very interesting and possibly quite useful. It's possible this method is impractical in LLMs, but I think it might be fine there too. Will DM some example figures.
I also want to investigate whether using a continuous injection coefficient in activation patching is similarly useful, since it seems like it might be.
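For concreteness, here's a rough sketch of what I mean by an injection coefficient scan, written against a TransformerLens-style HookedTransformer (the model, prompts, hook point, and metric below are illustrative stand-ins, not my actual gridworld setup):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model for illustration
LAYER = 6
HOOK = f"blocks.{LAYER}.hook_resid_pre"

# An AVEC-style steering vector, e.g. the difference of residual activations for two prompts.
_, cache_a = model.run_with_cache("Love")
_, cache_b = model.run_with_cache("Hate")
steering_vector = cache_a[HOOK][0, -1] - cache_b[HOOK][0, -1]

def injection_coefficient_scan(prompt, target_token, coefficients):
    """Sweep the scalar injection coefficient, recording the target token's log-prob at each value."""
    target_id = model.to_single_token(target_token)
    results = []
    for coeff in coefficients:
        def add_vector(resid, hook, coeff=coeff):
            # add the scaled steering vector at every position of the residual stream
            return resid + coeff * steering_vector
        logits = model.run_with_hooks(prompt, fwd_hooks=[(HOOK, add_vector)])
        log_prob = logits[0, -1].log_softmax(dim=-1)[target_id].item()
        results.append((float(coeff), log_prob))
    return results

# scan the coefficient over a range and see how the metric responds
scan = injection_coefficient_scan("I think that you are", " love",
                                  coefficients=torch.linspace(-10, 10, 21))
```

Plotting the metric against the coefficient is what gives the "scan"; the same sweep idea is what I want to try for activation patching.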
I am very excited to see if this makes my analyses easier! Great work!
Sure, I could have phrased myself better and I meant to say "former", which didn't help either!
Neither of these is a novel concept, in that existing investigations have described features of this nature.
I realise that my saying "Maybe this is the only kind of good in which case transformers would be "fundamentally interpretable" in some sense. All intermediate signals could be interpreted as final products." was way too extreme. What I mean is that maybe category two is less common than we think.
To relate this to AVEC (though I don't currently have a detailed understanding of how you are implementing it): if you find the vector (I assume a residual stream vector) itself has a high dot product with specific unembeddings, then that says you're looking at something in category 1. However, if introducing it into the model earlier has a very different effect from introducing it directly before the unembedding, then that would suggest it's also being used by other modular circuits in the model.
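To illustrate the first of those checks concretely, here is a sketch assuming a TransformerLens-style model and a residual-stream vector `vec` (it ignores the final LayerNorm, which is a real simplification):

```python
import torch

def top_unembedding_tokens(model, vec, k=10):
    """Rank vocabulary tokens by the dot product of a residual-stream vector with their unembeddings."""
    scores = vec @ model.W_U  # [d_vocab]; note this skips the final LayerNorm
    top = torch.topk(scores, k)
    return [(model.tokenizer.decode([idx.item()]), val.item())
            for idx, val in zip(top.indices, top.values)]
```

A sharp, interpretable top-k here would point towards category 1; the second check is whether injecting the vector earlier in the model behaves differently from injecting it just before the unembedding.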
I think this kind of distinction is only one part of what I was trying to get at with circuit economics but hopefully that's clearer! Sorry for the long explanation and initial confusion.
We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote up this post because both Alex and I independently noticed this and weren't aware of it previously, so we wanted to make a reference post.
Happy to provide! I think I'm pretty interested in testing this/working on this in the future. Currently a bit tied up but I think (as Alex hints at) there could be some big implications for interpretability here.
TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocates limited resources (e.g. residual stream capacity and weights) between different learnable circuits, seems important.
The general topic I think we are getting at is something like "circuit economics". The thing I'm trying to gesture at is that while circuits might deliver value in distinct ways (such as reducing loss on different inputs, or activating on distinct patterns), they share capacity in weights (see "Polysemanticity and Capacity in Neural Networks") and, I guess, "bandwidth" (getting penalized for interfering signals in activations). A few reasons why this feels like economics to me: scarce resources, value chains (features composed of other features), and competition (if a circuit is already predicting something well with one heuristic, maybe there will be smaller gradient updates to encourage another circuit learning a different heuristic to emerge).
So, to tie this back to your post and Alex's comment ("which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability."): I think that what interpretability has recently dealt with in elucidating specific circuits is something like "micro-interpretability", akin to microeconomics. However, this post seems to show a larger trend, i.e. "macro-interpretability", which would possibly affect which such circuits are possible/likely to be in the final model.
I'll elaborate briefly on the off chance this seems like it might be a useful analogy/framing to motivate further work.
This is very speculative "theory", if you can call it that, but I guess I feel it would be "big if true". I also make no claims about it being super original or actually that useful in practice, but it does feel intuition generating. I think this is totally the kind of thing people might have worked on sooner, but it's likely been historically hard to measure the kinds of things that might be relevant. What your post shows is that, between the transformer circuits framework and TransformerLens, we are able to take a bunch of interesting measurements relatively quickly, which may provide more traction on this than was previously possible.
Second pass through this post which solidly nerd-sniped me!
A quick summary of my understanding of the post (intentionally being very reductive, though I understand the post may make more subtle points):
My thoughts:
Thanks for writing this up! Looking forward to the subsequent post/details :)
PS: Is there a non-trivial relationship between this post and tuned lens / logit lens? https://arxiv.org/pdf/2303.08112.pdf Seems possible.
Thanks for the feedback. After a second reading of this post and the paper I linked, and having read the paper you linked, my thoughts have developed significantly. A few points I'll make here before making a separate comment:
- The post I originally shared does indeed focus on dynamics, but it may contain relevant general concepts in its discussion of the relationship between saturation and expressivity. However, it focuses on the QK circuit, which is less relevant here.
- My gut feel is that true explanations of related phenomena should have non-trivial relationships. If you had a good explanation for why norms of parameters grow during training, it should relate to why norms of parameters are different across the model. However, this is a high-level argument, and the content of your post does of course directly address a different phenomenon (residual stream norms across layers; see the quick sketch after this list for the kind of measurement I mean). If this paper had studied the training dynamics of the residual stream norm, I think it would be very relevant.
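(For reference, the kind of cross-layer residual stream measurement I have in mind, as a quick TransformerLens sketch; the model and prompt are placeholders:)

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog")  # placeholder prompt

# average L2 norm of the residual stream entering each block
for layer in range(model.cfg.n_layers):
    resid = cache[f"blocks.{layer}.hook_resid_pre"]  # [batch, pos, d_model]
    print(layer, resid.norm(dim=-1).mean().item())
```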
I really liked this post and would like to engage with it more later. It could be very useful!
However, I also think it would be good for you to add a section reviewing previous academic work on this topic (e.g. https://aclanthology.org/2021.emnlp-main.133.pdf). This seems very relevant and may not be the only academic work on the topic (I did not search for long). Curious to hear what you find!
Some thoughts (I don't have any background in related areas, but this seemed interesting).
I think it would be interesting to see what you found if you looked into the state of existing research on AI coordination / delegation / systemic interactions and whether any of it feels related. I'd be mildly surprised if people have studied exactly this, but I expect there are many relevant posts/papers.
In terms of related stuff on LessWrong, I can't find it now, but Paul Christiano has a post on worlds where things go badly slowly, and I think this would be kind of in that genre.
I think this is an interesting thing to consider and feels somewhat related to Dan Hendrycks' "Natural Selection Favors AIs over Humans" (https://arxiv.org/abs/2303.16200). The connection in my head is "what does an AI ecosystem look like", "what does it mean to discuss alignment in this context", "what outcomes will this system tend towards", etc. In the same way that middle managers get selected for, AI systems with certain properties will, more generally, get selected for.
You might want to read about Ought's agenda of supervising processes rather than outcomes, which feels relevant.
Recursive middle manager hell feels somewhat related to inner misalignment / misaligned mesa-optimizers, except that instead of being a subset of the processing of an LLM (which is how I normally think about it, but maybe not how others do), you have an AI system made of many layers, and it's plausible that intermediate layers end up optimizing proxies for inputs to what you care about rather than the thing itself. In this view, the misalignment of middle managers, which usually makes companies less effective, might just lead to selection against such systems compared to systems with fewer of these properties.
There might be some strategically valuable research to be done here, but it's not super clear to me what the theory of change would be. Maybe there's something to do with bandwidth/scalability trade-offs that affect how tightly coupled vs. diffuse/distributed useful/popular AI systems will be in the future.
Thanks Richard for this post and prior advice!
I was planning to make a post at some point with some advice that's closely related to this post, but I will share it here as a preview. Take note that I don't yet have strong evidence that my work is good or has mattered (I was going to write a full post once I had more evidence for that). I think Richard's advice above is really good, and I'll try to take some of the ideas more on board in my own work.
Last year I quit my job and upskilled for 6 months, and now I'm doing independent research which might turn out to be valuable. Regardless of its value, I've learnt a lot and it's created many opportunities for me. I went to EAG, attended Richard's talk there, and later joined a group conversation where he was talking about this mentorship constraint. This left a strong impression on me, leading me to take some degree of pride in my attempts to be independent and not rely as strongly on individual mentorship. However, I currently have a bunch of caveats/perspectives that relate to this.
All of these relate to empirical alignment research, not governance or other forms of research. I'm mostly focussed on providing advice for how to be more productive independently of other people, but that shouldn't necessarily be your preference; I suspect people are generally more productive at orgs/in groups.
So a bunch of ideas on the topic:
I hope this is useful for people!