Ah, thanks for the clarification! That makes way more sense. I was confused because you mentioned this in a recent conversation, I excitedly read the paper, and then couldn't see what the fuss was about (your post prompted me to re-read and notice section 4.1, the good section!).
Another thought: the main thing I find exciting about model editing is when it's surgical - it's easy to use gradient descent to find ways to intervene on a model while breaking performance everywhere else. But if you can really localise where a concept is represented in the model and apply the edit there, that feels really exciting to me! Thus I find this work notably more exciting (because it edits a single latent variable) than ROME/MEMIT (which apply gradient descent).
Thanks for sharing! I think the paper is cool (though massively buries the lede). My summary:
Thanks! Yeah, I hadn't seen that but someone pointed it out on Twitter. Feels like fun complementary work.
I'm so torn on this paper - I think it makes a reasonable point that many claims of emergence are overrated and that it's easy to massage metrics into a single narrative. But also, I think the title and abstract are overclaiming clickbait - obviously models have emergent abilities!! Chain of thought and few-shot learning are just not a thing smaller models can do. Accuracy is sometimes the right metric, etc. It's often overhyped, but this paper way overclaims.
Can you elaborate? I don't really follow - this seems like a pretty niche concern to me that depends on some strong assumptions, and ignores the major positive benefits of interpretability to alignment. If I understand correctly, your concern is that if AIs can know what the other AIs will do, this makes inter-AI coordination easier, which makes a human takeover easier? And that dangerous AIs will not be capable of doing this interpretability on AIs themselves, but will need to build on human research of mechanistic interpretability? And that mechanistic in...
To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).
Why do you believe this? It's fairly plausible to me that "train an AI to use interpretability tools to show that this other AI is being deceptive" is the kind of scalable oversight approach that might work, especially for detecting inner misalignment, if you can get the training right and avoid cooperation. But that seems like a plausibly solvable problem to me
Maybe in contrast to other fields of ML? (Though that's definitely stopped being true for eg LLMs)
Case studies: finding algorithms inside networks that implement specific capabilities. My favorite papers here are Olsson et al. (2022), Nanda et al. (2023), Wang et al. (2022) and Li et al. (2022); I’m excited to see more work which builds on the last in particular to find world-models and internally-represented goals within networks.
If you want to build on Li et al (the Othello paper), my follow-up work is likely to be a useful starting point, followed by the post I wrote about the future directions I'm particularly excited about.
Some recommended ways to upskill at empirical research (roughly in order):
For people specifically interested in getting into mechanistic interpretability, my guide to getting started may be useful - it's much more focused on the key, relevant parts of deep learning, with a bunch more interpretability-specific stuff.
...Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? (Though note that a few top researchers commented on a draft to say that they didn't have this experience.) For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?”); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?") Being excelle...
How are OpenAI training these tokenizers?! I'm surprised they still have weird esoteric tokens like these in there, when presumably there's eg a bunch of words that are worth learning
but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes
Empirically, many people new to the field get very paralysed and anxious about fears of doing accidental harm, in a way that I believe has significant costs. I haven't fully followed the specific model you outline, but it seems to involve ridiculously hard questions around the downstream consequences of your work, which I struggle to robustly apply to my work (indirect effects are really hard man!). Ditto, telling someone that they need t...
Interesting, thanks for the context. I buy that this could be bad, but I'm surprised that you see little upside - the obvious upside, especially for great work like transformer circuits, is getting lots of researchers nerdsniped and producing excellent, alignment-relevant interp work. Which seems huge if it works.
I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.
Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.
I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and ther...
Some examples of justifications I have given to myself are “You’re so new to this, this is not going to have any real impact anyway”,
I think this argument is just clearly correct for people new to the field - thinking that your work may be relevant to alignment is motivating and exciting and represents the path to eventually doing useful things, but it's also very likely to be wrong. Being repeatedly wrong is what improvement feels like!
People new to the field tend to wildly overthink the harms of publishing, in a way that increases their anxiety and makes them much more likely to bounce off. This is a bad dynamic, and I wish people would stop promoting it
And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.
I'm surprised by this claim - can you say more? My weak read is that people in interp under-publish to wider audiences (eg getting papers into conferences), though maybe people over-publish blog posts? (Or that I try too hard to make things go viral on Twitter lol)
Thanks, that looks really useful! Do you have GPU price-performance numbers for lower-precision training? Models like Chinchilla were trained in bf16, so that seems like a more relevant number.
Thanks! I also feel more optimistic now about speed research :) (I've tried similar experiments since, but with much less success - there's a bunch of contingent factors around not properly hitting flow and not properly clearing time for it though). I'd be excited to hear what happens if you try it! Though I should clarify that writing up the results took a month of random spare non-work time...
Re "models can be deeply understood": yes, I think you raise a valid and plausible concern, and I agree that my work is not notable evidence against it. Though also, idk ...
Er, hmm. To me this feels like a pretty uncontroversial claim when discussing a small model on an algorithmic task like this. (Note that the model is literally trained on uniform random legal moves, it's not trained on actual Othello game transcripts). Though I would agree that eg "literally all that GPT-4 cares about is predicting the next token" is a dubious claim (even ignoring RLHF). It just seems like Othello-GPT is so small, and trained on such a clean and crisp task that I can't see it caring about anything else? Though the word care isn't really we...
I previously had considered that any given corpus could have been generated by a large number of possible worlds, but I now don't weight this objection as highly.
Interesting, I hadn't seen that objection before! Can you say more? (Though maybe not if you aren't as convinced by it any more.) To me, the answer would be that there are many possible worlds, but they all share some commonalities, and those commonalities are what get modelled. Or possibly that the model separately simulates the different worlds.
Refusing to engage billionaires on twitter - especially ones that are sufficiently open to being convinced that they will drop $44 billion for something as pedestrian as a social media company.
This one isn't obvious to me - having billionaires take radical action can be very positive or very negative (and last time this was tried on Elon he founded OpenAI!)
The frequently accompanying, action-relevant claim -- that substantially easier-to-interpret alternatives exist -- is probably false and distracts people with fake options. That's my main thesis.
I agree with this claim (anything inherently interpretable in the conventional sense seems totally doomed). I do want to push back on an implicit vibe of "these models are hard to interpret because of the domain, not because of the structure" though - interpretability is really fucking hard! It's possible, but these models are weird and cursed and rife with bullshit l...
If you feel like you have enough close friends to satisfy you, then more power to you! It's not my job to tell you how to live your life if you're happy with it
Idk, I feel like GPT4 is capable of tool use, and also capable of writing enough code to make its own tools.
Great article! This helped me reframe some of the strong negative reactions to Yudkowsky's article on Twitter
I think if we imagine an n-gram model where n approaches infinity and the size of the corpus we train on approaches infinity, such a model is capable of going beyond even GPT. Of course it's unrealistic, but my point simply is that surface level statistics in principle is enough to imitate intelligence the way ChatGPT does.
Sure, in a Chinese-room-style fashion, but IMO reasoning + internal models have significantly different generalisation properties, and are also what actually happens in practice in models, rather than an enormous table of n-grams. And I...
Surely any capabilities researcher concerned enough to be willing to do this should just switch to safety-relevant research? (Also, IMO the best AI researchers tend not to be in this for the money)
Lol. This is a surprisingly decent summary, and the weaknesses are correctly identified things I did not try to cover
I tried to be explicit in the post that I don't personally care all that much about the world model angle - Othello-GPT clearly does form a world model, it's very clear evidence that this is possible. Whether it happens in practice is a whole other question, but it clearly does happen a bit.
They are still statistical next token predictors, it's just the statistics are so complicated it essentially becomes a world model. The divide between these concepts is artificial.
I think this undersells it. World models are fundamentally different from surface leve...
Where is token deletion actually used in practice? My understanding was that since the context window of GPT-4 is insanely long, users don't tend to actually exceed it such that you need to delete tokens. And I predict that basically all interesting behaviour happens in that 32K tokens without needing to account for deletion.
Are there experiments with these models that show that they're capable of taking in much more text than the context window allows?
Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me "positive = is important" and "negative = damaging" is the intuitive way round, which is why I set it up the way I did.
And yeah, I would be excited to see this applied to mean ablation!
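For concreteness, here's a rough sketch of the kind of zero vs mean ablation I have in mind, written with TransformerLens-style hooks (illustrative only - not the exact code from the demos, and the prompt and head choice are arbitrary):

```python
# Rough sketch: zero vs mean ablation of a single attention head's output,
# using TransformerLens-style hooks (not the exact code from the demos).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")

def zero_ablate_head(z, hook, head=0):
    # z has shape [batch, pos, head_index, d_head]; zero out one head's output
    z[:, :, head, :] = 0.0
    return z

def mean_ablate_head(z, hook, head=0):
    # replace one head's output with its mean over batch and position
    z[:, :, head, :] = z[:, :, head, :].mean(dim=(0, 1))
    return z

clean_loss = model(tokens, return_type="loss")
zero_loss = model.run_with_hooks(
    tokens, return_type="loss",
    fwd_hooks=[("blocks.0.attn.hook_z", zero_ablate_head)],
)
mean_loss = model.run_with_hooks(
    tokens, return_type="loss",
    fwd_hooks=[("blocks.0.attn.hook_z", mean_ablate_head)],
)
# Convention from the post: positive score = ablating the head hurts, i.e. it's important
print((zero_loss - clean_loss).item(), (mean_loss - clean_loss).item())
```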
Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...
Er, maybe if we get really good at doing patching-style techniques? But there's definitely not an obvious path - I more see lie detectors as one of the ultimate goals of mech interp, but whether this is actually possible or practical is yet to be determined.
Oh, ugh, Typeguard was updated to v3 and this broke things. And the circuitsvis import was a mistake. Should be fixed now, thanks for flagging!
This seems completely negligible to me, given how popular ChatGPT was. I wouldn't worry about it
If people want concrete mechanistic interpretability projects to work on, my 200 concrete open problems in mechanistic interpretability is hopefully helpful!
I give them a lot of credit for, to my eyes, realising this was a big deal way earlier than almost anyone else, doing a lot of early advocacy, and working out some valuable basic ideas, like early threat models, ways in which standard arguments and counter-arguments were silly, etc. I think this kind of foundational work feels less relevant now, but is actually really hard and worthwhile!
(I don't see much recent stuff I'm excited about, unless you count Risks from Learned Optimisation)
I'm confused by this example. This seems exactly the kind of time where an averaged point estimate is the correct answer. Say there's a 50% chance the company survives and is worth $100, and a 50% chance it doesn't and is worth $0. In this case, I am happy to buy or sell at $50.
Doing research to figure out it's actually an 80% chance of $100 means you can buy a bunch and make $30 in expected profit. This isn't anything special though - if you can do research and form better beliefs than the market, you should make money. The different world models don't seem relevant here to me?
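(To spell out the arithmetic from the example above as a quick sketch:)

```python
# Quick sketch of the arithmetic in the example above (numbers are illustrative).
payoff = 100                       # value per share if the company survives, else $0
market_price = 0.5 * payoff        # market's implied 50% chance -> fair price of $50

my_probability = 0.8               # my belief after doing research
my_valuation = my_probability * payoff         # $80 per share
expected_profit = my_valuation - market_price  # $30 per share if I buy at $50
print(market_price, my_valuation, expected_profit)
```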
I really like this idea! Making advance predictions feels like a much more productive way to engage with other people's work (modulo trusting you to have correctly figured out the answers)
Predictions below (note that I've chatted with the team about their results a bit, and so may be a bit spoiled - I'll try to simulate what I would have predicted without spoilers)
...Behavioral: Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze? I.e. what will the policy do when cheese is spawned elsewh...
Many of these tokens are unprintable (i.e., they don't display and I don't know what they are).
The first 256 tokens are the 256 possible byte values (each 1 byte). A bunch of them are basically never used (they exist so that an arbitrary string of bytes can be broken down into valid tokens)
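If you want to poke at this yourself, here's a quick sketch using tiktoken's gpt2 encoding (just an illustration - the exact id-to-byte mapping is an implementation detail):

```python
# Quick sketch: the first 256 tokens of the GPT-2 byte-level BPE vocab each
# decode to a single byte, many of which never show up in normal text.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for token_id in range(8):
    print(token_id, enc.decode_single_token_bytes(token_id))
```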
Great question! My concrete suggestion is to look for interesting neurons in Neuroscope, as I discuss more in the final post. This is a website I made that shows the text that most activates each neuron in the model (for a ton of open source models), and by looking for interesting neurons, you can hopefully find some hook - find a specific task the model can consistently-ish do, analogous to IOI (with a predictable structure you can generate prompts for, ideally with a somewhat algorithmic flavour - something you could write code to solve). And then do the...
I think that what you're saying is correct, in that ChatGPT is trained with RLHF, which gives feedback on the whole text, not just the next token. It is true that GPT-3 outputs the next token and is trained to be myopic. That said, your arguments seem suspect to me: just because a model takes steps that are in practice part of a sensible long-term plan does not mean that the model is intentionally forming a plan - just that each step is the natural thing to myopically follow from what came before.
Really nice post! I think this is an important point that I've personally been confused about in the past, and this is a great articulation (and solid work for 2 hours!!)
I'm a bit confused why this happens, if the circuit only "needs" three layers of composition
I trained these models on only 22B tokens, of which only about 4B were Python code, and their residual stream has width 512. It totally wouldn't surprise me if it just didn't have enough data or capacity in 3L, even though it was technically capable.
Thanks for this post! I'm not sure how much I expect this to matter in practice, but I think that the underlying point of "sometimes the data distribution matters a lot, and ignoring it is suspect" seems sound and well made.
I personally think it's clear that 1L attn-only models are not literally just doing skip trigrams. A quick brainstorm of other things I presume they're doing:
I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team
Oh wait, that FAQ is actually nothing to do with GPT-3. That's about their embedding models, which map sequences of tokens to a single vector, and they're saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding
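To illustrate the difference concretely, here's a rough sketch using GPT-2 (via HuggingFace transformers) as a stand-in, since I can't inspect GPT-3's weights directly: the rows of the token embedding matrix have varying norms, unlike the outputs of a dedicated embedding model, which are normalised.

```python
# Rough sketch (GPT-2 as a stand-in for GPT-3): the token -> residual stream
# embedding matrix is not normalised per token.
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
W_E = model.wte.weight             # [vocab_size, d_model]
print(W_E.norm(dim=-1)[:5])        # row norms vary; they are not all 1.0
```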
I appreciate the feedback! I have since bought a graphics tablet :) If you want to explore induction heads more, you may enjoy this tutorial
Any papers you're struggling to find?