Joseph Bloom

I'm an independently funded AI Alignment Research Engineer focussing on mechanistic interpretability in reinforcement learning. I'm particularly interested in comparing circuits in decision transformers to those generated by other techniques. 

Thanks Richard for this post and prior advice!

I was planning to make a post at some point with some advice that's closely related to this post but I will share it here as a preview. Take note that I don't yet have strong evidence that my work is good or has mattered (and I was going to write a full post once I had more evidence for that). I think Richard's advice above is really good and I'll try to take some of the ideas more on board with my own work. 

Last year I quit my job and upskilled for 6 months and now I'm doing independent research which might turn out to be valuable. Regardless of its value, I've learnt a lot and it's created many opportunities for me. I went to EAG and Richard's talk there and a conversation later in a group where he was talking about this mentorship constraint deal. This left a strong impression on me leading me to take some degree of pride in my attempts to be independent and not rely as strongly on individual mentorship. However, there are just a bunch of caveats/perspectives that I have currently which relate to this. 

All of these relate to empirical alignment research and not governance or other forms of research. I'm mostly focussed on providing advice for how to be more productive independently of other people, but working alone shouldn't be your preference; I suspect people are more productive at orgs/in groups.

So a bunch of ideas on the topic:

  • Why the focus on independent research?
    • I think it's really weird how we have this thing in the alignment community and I just want to comment on that first. The idea that people can just go off on their own and be productive I think is kinda uncommon. 
    • This community values agency.  In practice, agency is the ability to make decisions for yourself about what you want and how to achieve it. Getting good at having agency both makes good researchers and good research engineers. HPMOR helped me understand agency better. 
    • I have no first hand knowledge of the inside of orgs like DeepMind or Anthropic but I suspect people with agency are generally considered better hires. It's not like orgs would say "we could hire this person but we want them to do what they're told so let's hire someone with little evidence of working independently". Rather, my guess is they select for people who are capable of self-directed work and who grow spontaneously as a result of attempting hard things and learning.
  • Getting ready to contribute: 
    • There are a variety of ways to upskill without doing stuff like a PhD (as Richard says above): programs like ARENA, SERI-MATS, SPAR etc. My sense is that once people realise that working without feedback is hard, they will gravitate strongly toward more empirical research areas, especially those that can be done at small scale (aka MI), for which there are existing tools (aka MI) and examples of investigations with reasoned paths to impact (aka MI). However, there are likely other empirical areas which provide feedback and are doable that may/may not have these properties and searching for more seems like a good idea to me.
    • Get good (especially if you're fresh out of college). Struggling with implementing things / understanding stuff and identifying problems can all be alleviated by working on your skills. There's lots of open source code which shows good patterns, and also lots of people you can engage with who have good technical skills but aren't necessarily the research mentors we are constrained by. Talk to people who are good engineers and find out how they operate. It'll be stuff like having a good IDE and testing your code.
    • Start slow. Contributing to open source projects such as TransformerLens is good. I've been helping out with it and it seems like a good way for lots of people to dip their toe in. 
  • Doing research without a mentor is very hard for many obvious reasons. Things you can do to make it easier:
    • While talking to people, such as at EAGs, can be helpful, my sense is most good advice just exists on the forum. I recommend rereading such advice periodically, and I predict you will grok why people make these suggestions better once you are stuck in your own research and facing challenges than you did before.
    • Focus on fast feedback cycles. Try to avoid situations where you don't know if something is working for a long time. This is different to whether you know if it's valuable or not. 
    • Be prepared to drop things or change your path, but don't abandon work because it's hard. It feels like a special kind of wisdom/insight to make these calls and I think you need to work hard at trying to get better at this over time. 
    • Have good tooling but don't let building the tooling take over. 
    • Allow yourself to focus. There is a time to work out why you are doing what you are doing and there are other times you just need to do the work. 
    • Study! Engaging with fundamental topics like linear algebra or deep learning theory is hugely important. Without colleagues or mentors it is a meaningful constraint on your output when you don't know any given thing that might be relevant. This is tricky because there's a lot to study. I think you should engage with the basics and be consistent. The Mathematics for Machine Learning textbook is great. Goodfellow's Deep Learning textbook is also recommended.
    • Read related literature. Like with more basic knowledge, lack of knowledge of relevant literature can cause you to waste time/effort. I have a spreadsheet which describes all the models that are kinda similar to mine and how they were trained and what was done with them. 
    • Find less correlated ideas/takes: Stephen Casper's Engineering Interpretability sequence is a good example of the kind of thing people doing independent work should read. It shakes you out of the "everything we do here makes sense and is obvious" perspective, which is extra easy to fall into when you work on your own. There might be equivalent posts in other areas.
    • Possibly the quirkiest thing I do these days is roleplay characters in my head ("the engineer", "the scientist", "the manager" and "the outsider") who help me balance different priorities when making decisions about my work. I find this fun and useful, and since I literally write meeting notes, GPT4 can stand in for each of them, which is pretty cool and useful for generating ideas. The "outsider", a less obvious team member, represents someone who doesn't privilege the project or existing decisions. This helps me try to channel a helpful adversarial perspective (see previous point).

I hope this is useful for people! 

Thanks Simon, I'm glad you found the app intuitive :)

The RTG is just another token in the input, except that it has an especially strong relationship with training distribution. It's heavily predictive in a way other tokens aren't because it's derived from a labelled trajectory (it's the remaining reward in the trajectory after that step).
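
As a minimal sketch of what "remaining reward" means here, the return-to-go for every step can be computed as a reversed cumulative sum over per-step rewards. The function name is hypothetical, and the convention that the RTG at step t includes the reward received at step t is an assumption (it follows the usual Decision Transformer setup) rather than something confirmed for the app.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """For each step t, return the sum of rewards from step t to the end."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step adds its own reward to the tail sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


# Sparse-reward gridworld episode: the only reward arrives on the final step,
# so every earlier step "sees" the same remaining reward.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(returns_to_go(episode_rewards))  # [1.0, 1.0, 1.0, 1.0]
```

Because the value is derived from the labelled trajectory itself, it is known at training time but has to be chosen by the user at inference time, which is part of why it behaves differently from other tokens.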

For BabyAI, the idea would be to use an instruction prepended to the trajectory made up of a limited vocab (see baby ai paper for their vocab). I would be pretty partial to throwing out the RTG and using a behavioral clone for a BabyAI model. It seems likely this would be easier to train. Since the goal of these models is to be useful for gaining understanding, I'd like to avoid reusing tokens as that might complicate analysis later on.

Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar and this has enabled me to do what I call "an injection coefficient scan".

The procedure I'm using looks like this:

  1. Repeat your input tokens say, 128 times. 
  2. Apply the activation vector with 128 different coefficients, evenly spaced between -10 and 10, one per copy of your input tokens, when doing your AVEC forward pass. 
  3. Decompose the resulting residual stream to whatever granularity you like (use decompose_resid or get_full_resid_decomposition with/without expand neurons). 
  4. Dot product the outputs with your logit direction of choice ( I use a logit diff that is meaningful in my task)
  5. Plot the resulting attribution vs injection coefficient per component. 
  6. If you like, cluster the profiles to show how different components learn similar functions of the injection coefficient to your decision. 
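
The steps above can be sketched with a toy stand-in for the model. Everything here is an assumption made for illustration: `toy_forward`, the random weights, the steering vector and the logit direction are all hypothetical, and in a real TransformerLens workflow the decomposition in step 3 would come from the cache (decompose_resid or get_full_resid_decomposition) rather than from a hand-written function.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_components, n_steps = 16, 4, 128

# Hypothetical stand-ins for a real model's internals.
resid_in = rng.normal(size=d_model)          # residual stream for one input
steering_vector = rng.normal(size=d_model)   # the activation vector to inject
component_weights = rng.normal(size=(n_components, d_model, d_model)) / np.sqrt(d_model)
logit_direction = rng.normal(size=d_model)   # e.g. a logit-diff direction


def toy_forward(resid: np.ndarray) -> np.ndarray:
    """Return each component's output, shape (batch, n_components, d_model)."""
    # Each toy component reads the residual and writes a nonlinear function of it.
    return np.tanh(np.einsum("bd,cde->bce", resid, component_weights))


# Steps 1-2: repeat the input once per coefficient and inject the scaled vector.
coefficients = np.linspace(-10.0, 10.0, n_steps)
batch = np.tile(resid_in, (n_steps, 1)) + coefficients[:, None] * steering_vector

# Step 3: decompose the result into per-component contributions.
component_out = toy_forward(batch)

# Step 4: project each contribution onto the logit direction of interest.
attribution = component_out @ logit_direction  # shape (n_steps, n_components)

# Step 5: each column is one component's attribution-vs-coefficient profile.
for c in range(n_components):
    profile = attribution[:, c]
    print(f"component {c}: min={profile.min():.2f} max={profile.max():.2f}")

# Step 6 (optional): profile correlations as a starting point for clustering.
profile_similarity = np.corrcoef(attribution.T)
```

Plotting `attribution[:, c]` against `coefficients` gives one curve per component, and `profile_similarity` can be fed into any clustering method to group components with similar response curves.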

So far, my results seem very interesting and possibly quite useful. It's possible this method is impractical in LLMs but I think it might be fine as well. Will dm some example figures. 
I also want to investigate whether using a continuous injection coefficient in activation patching is similarly useful, since it seems like it might be. 

I am very excited to see if this makes my analyses easier! Great work! 

Sure, I could have phrased myself better and I meant to say "former", which didn't help either! 

Neither of these are novel concepts in that existing investigations have described features of this nature. 

  1. Good 1, aka consumer goods. Useful for the unembed (may / may not be useful for other modular circuits inside the network). That Logit Lens gets better over the course of the circuit suggests the residual stream contains these kinds of features, and more so as we move up the layers. 
  2. Good 2, aka capital goods. Useful primarily for other circuits. A good example is the kind of writing to subspaces in the IOI circuit by duplicate token heads. "'John' appeared twice", as markup on a token / vector in the subspace of a token in the residual stream, doesn't in itself tell you that Jane is the next token, but is useful to another head which is going to propose a token via another function. 

    Alternatively, in Neel's modular arithmetic,  calculating waves of terms like sin(wx), cos(wx) which are only useful when you have the rest of the mechanism to get argmax(z) of 
  3. I would have guessed that features appear first in the first category and only later in the second, since how would you get gradients to things that aren't useful yet? However, the existence of clear examples of "internal signals" seems fairly indisputable.
  4. It seems plausible that there are lots of features that sit in both these categories of course, so if it's useful you could define them to be more mutually exclusive and add a third category for both.

I realise that my saying "Maybe this is the only kind of good in which case transformers would be "fundamentally interpretable" in some sense.  All intermediate signals could be interpreted as final products." was way too extreme. What I mean is that maybe category two is less common than we think. 

To relate this to AVEC,  (which I don't have a detailed understanding of how you are implementing currently) if you find the vector (I assume residual stream vector) itself has a high dot product with specific unembeddings then that says you're looking at something in category 1. However, if introducing it into the model earlier has a very different effect to introducing it directly before the unembedding then that would suggest it's also being used by other modular circuits in the model. 
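
A rough sketch of the first check (does the vector line up with specific unembeddings?) is below. The unembedding matrix and the vector are random stand-ins, with the vector deliberately built to point along one token's unembedding direction, so this only illustrates the measurement and says nothing about what a real AVEC vector looks like. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 32, 100

# Hypothetical unembedding matrix, shape (d_model, d_vocab).
W_U = rng.normal(size=(d_model, d_vocab))

# Toy "category 1" vector: mostly token 7's unembedding direction plus noise.
target_token = 7
vector = 3.0 * W_U[:, target_token] / np.linalg.norm(W_U[:, target_token])
vector = vector + 0.1 * rng.normal(size=d_model)


def unembed_alignment(vec: np.ndarray, unembed: np.ndarray) -> np.ndarray:
    """Cosine similarity between vec and every token's unembedding direction."""
    unit_vec = vec / np.linalg.norm(vec)
    unit_cols = unembed / np.linalg.norm(unembed, axis=0, keepdims=True)
    return unit_vec @ unit_cols


cosines = unembed_alignment(vector, W_U)
top_tokens = np.argsort(cosines)[::-1][:3]
print("most aligned tokens:", top_tokens)
print("their cosine similarities:", np.round(cosines[top_tokens], 2))
```

A vector whose cosine similarities are all small would fail this check, which is when the second test (comparing the effect of injecting early versus directly before the unembedding) becomes the more informative one.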

I think this kind of distinction is only one part of what I was trying to get at with circuit economics but hopefully that's clearer! Sorry for the long explanation and initial confusion. 

We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote-up this post because both Alex and I independently noticed this and weren't aware of this previously, so we wanted to make a reference post.

Happy to provide! I think I'm pretty interested in testing this/working on this in the future. Currently a bit tied up but I think (as Alex hints at) there could be some big implications for interpretability here.

TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocates limited resources such as residual stream and weights between different learnable circuits, seems important. 

The general topic I think we are getting at is something like "circuit economics". The thing I'm trying to gesture at is that while circuits might deliver value in distinct ways (such as reducing loss on different inputs, activating on distinct patterns), they share capacity in weights (see Polysemanticity and Capacity in Neural Networks) and I guess "bandwidth" (getting penalized for interfering signals in activations). There are a few reasons why I think this feels like economics, which include: scarce resources, value chains (features composed of other features) and competition (if a circuit is predicting something well with one heuristic, maybe there will be smaller gradient updates to encourage another circuit learning a different heuristic to emerge). 

So to tie this back to your post and Alex's comment "which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability.". I think that what interpretability has recently dealt with in elucidating specific circuits is something like "micro-interpretability" and is akin to microeconomics. However this post seems to show a larger trend ie "macro-interpretability" which would possibly affect which of such circuits are possible/likely to be in the final model. 

I'll elaborate briefly on the off chance this seems like it might be a useful analogy/framing to motivate further work. 

  • Studying the Capacity/Loss Reduction distribution in Time: It seems like during transformer training there may be an effect not unlike inflation? Circuits which delivered enough value to justify their capacity use early in training may fall below the capacity/loss reduction cut off later. Maybe various techniques which enable us to train more robust models work because they make these transitions easier.
  • Studying the Capacity/Loss Reduction distribution in Layer: Moreover, it seems plausible that the distribution of "usefulness" in circuits in different layers of the network may be far from uniform. Circuits later in the network have far more refined inputs which make them better at reducing loss. Residual stream norm growth seems like a "macro" effect that shows models "know" that later layers are more important.
  • Studying the Capacity/Loss Reduction distribution in Layer and Time: Combining the above. I'd predict that neural networks originally start by having valuable circuits in many layers but then transition to maintain circuits earlier in the network which are valuable to many downstream circuits and circuits later in the network which make the best use of earlier circuits. 
  • More generally, "circuit economics" as a framing seems to suggest that there are different types of "goods" in the transformer economy: those which directly lead to better predictions and those which are useful for making better predictions when integrated with other features.  The success of Logit Lens seems to suggest that the latter category increases over the course of the layers. Maybe this is the only kind of good in which case transformers would be "fundamentally interpretable" in some sense.  All intermediate signals could be interpreted as final products. More likely, I think, is that later in training there are ways to reinforce the creation of more internal goods (in economics, goods which are used to make other goods are called capital goods). The value of such goods would be mediated via later circuits. So this would lead also to the "deletion-by-magnitude theory" as a way of removing internal goods. 
  • To bring this back to language already in the field see Neel's discussion here.  A modular circuit is distinct from an end-end circuit in that it starts and ends in intermediate activations. Modular circuits may be composable. I propose that the outputs of such circuits are "capital goods". If we think about the "circuit economy" it then seems totally reasonable that multiple suppliers might generate equivalent capital goods and have a many-to-many relationship with multiple different circuits near the end voting on logits. 

This is a very speculative "theory", if you can call it that, but I guess I feel this would be "big if true". I also make no claims about this being super original or actually that useful in practice but it does feel intuition generating. I think this is totally the kind of thing people might have worked on sooner but it's likely been historically hard to measure the kinds of things that might be relevant. What your post shows is that between the transformer circuits framework and TransformerLens we are able to take a bunch of interesting measurements relatively quickly, which may provide more traction on this than previously possible. 

Second pass through this post which solidly nerd-sniped me! 

A quick summary of my understanding of the post (intentionally being very reductive, though I understand the post may make more subtle points): 

  1. There appears to be exponential growth in the norm of the residual stream in a range of models. Why is this the case?
  2. You consider two hypotheses: 
    1. That the parameters in the Attention and/or MLP weights increase later in the network. 
    2. That there is some monkey business with the layer norm sneaking in a single extra feature. 
  3. In terms of evidence, you found that:
    1. Evidence for theory one in W_OV Frobenius norms increasing approximately exponentially over layers.
    2. Evidence for theory one in MLP output to the residual stream increasing (harder to directly measure the norm of the MLP due to non-linearities).
  4. Your favoured explanation is "We finally note our current favored explanation: Due to LayerNorm, it's hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger. "
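
To make "exponential growth" in point 1 concrete, one way to estimate a per-layer growth factor is a straight-line fit to the log of the norms. The sketch below uses synthetic norms rather than measurements from any model, and it treats the 4.5% figure from the quote as a per-layer growth factor, which is an assumption on my part. The function name is hypothetical.

```python
import numpy as np


def fit_growth_rate(norms: np.ndarray) -> float:
    """Fit norm_l ~ a * r**l by least squares on log(norm) and return r."""
    layers = np.arange(len(norms))
    slope, _intercept = np.polyfit(layers, np.log(norms), deg=1)
    return float(np.exp(slope))


# Synthetic residual stream norms that grow 4.5% per layer over 24 layers.
n_layers = 24
synthetic_norms = 10.0 * 1.045 ** np.arange(n_layers)

rate = fit_growth_rate(synthetic_norms)
print(f"fitted per-layer growth factor: {rate:.3f}")

# If an early feature is never rewritten, its size relative to the newest
# contributions shrinks by this factor by the final layer.
relative_size = 1.0 / rate ** (n_layers - 1)
print(f"relative size of a layer-0 feature at the last layer: {relative_size:.2f}")
```

On real data the fit would be run on measured per-layer norms, and a poor straight-line fit in log space would itself be evidence against a simple exponential story.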

My thoughts:

  • My general take on this post is that the explanation about cancelling out features being harder than amplifying new features feels somewhat disconnected from the high level characterisation of weights / norms which makes up most of the post. It feels like there is a question of how and a question of why.
  • Given these models are highly optimized by SGD, it seems like the conclusion must be that the residual stream norm is growing because this is useful. The argument would then be that it is useful because the residual stream is a limited resource / has limited capacity, making us want to delete information in it, and increasing the norm of the contributions to the residual stream effectively achieves this by drowning out other features. 
  • Moreover, if the mechanism by which we achieve larger residual stream contributions in later components is by having larger weights (which is penalized by weight decay) then we should conclude that a residual stream with a large norm is worthwhile enough that the model would rather do this than have smaller weights (which you note). 
  • I still don't feel like I know why though. Later layers have more information and are therefore "wiser" or something could be part of it.
  • I'd also really like to know the implications of this. Does this affect the expressivity of the model in a meaningful way? Does it affect the relative value of representing a feature in any given part of the model? Does this create an incentive to "relocate" circuits during training  or learn generic "amplification" functions?  These are all ill-defined questions to some extent but maybe there are formulations of them that are better defined which have implications for MI related alignment work. 

Thanks for writing this up! Looking forward to subsequent post/details :) 

PS: Is there a non-trivial relationship between this post and tuned lens/logit lens? Seems possible. 

Thanks for the feedback. On a second reading of this post and the paper I linked and having read the paper you linked, my thoughts have developed significantly. A few points I'll make here before making a separate comment:
- The post I shared originally does indeed focus on dynamics but may have relevant general concepts in discussing the relationship between saturation and expressivity. However, it focuses on the QK circuit which is less relevant here.
- My gut feel is that true explanations of related phenomena should have non-trivial relationships. If you had a good explanation for why norms of parameters grew during training it should relate to why norms of parameters are different across the model. However, this is a high level argument and the content of your post does of course directly address a different phenomenon (residual stream norms). If this paper had studied the training dynamics of the residual stream norm, I think it would be very relevant. 

I really liked this post and would like to engage with it more later. It could be very useful! 

However, I also think that it would be good for you to add a section reviewing previous academic work on this topic (eg:  This seems very relevant and may not be the only academic work on this topic (I did not search long). Curious to hear what you find! 

Some thoughts (don't have any background in stuff related but seemed interesting). 

I think it would be interesting to see what you found if you looked into the state of existing research on AI coordination / delegation / systemic interactions and if any of it feels related. I'd be mildly surprised if people have studied exactly this but expect many relevant posts/papers. 

In terms of related stuff on LessWrong, I can't find it now but Paul Christiano has a post on worlds where things go badly slowly and I think this would be kinda in that genre. 

I think this is an interesting thing to consider and feels somewhat related to Dan Hendrycks' "Natural Selection Favors AIs over Humans". The connection in my head is "what does an AI ecosystem look like", "what does it mean to discuss alignment in this context", "what outcomes will this system tend towards" etc. The same way middle managers get selected for, so more generally AI systems with certain properties get selected for. 

You might want to read about Ought's agenda of supervising processes, not outcomes, which feels relevant. 

Recursive middle manager hell feels somewhat related to inner misalignment / misaligned mesa-optimizers where instead of being a subset of the processing of an LLM (how I normally think about it but maybe not how others do), you have your AI system made of many layers and it's plausible that intermediate layers end up optimizing proxies for inputs to what you care about and not even the thing itself. In this view, it seems like the misalignment of middle managers which usually makes companies less effective might just lead to selection against such systems as compared to systems with less of these properties. 

There might be some strategically valuable research to be done here but it's not super clear to me what the theory of change would be. Maybe there's something to do with bandwidth / scalability tradeoffs that affect how tightly coupled vs diffuse/distributed useful/popular AI systems will be in the future. 
