Ah, thanks! As I noted at the top, this was an excerpt from that post, which I thought was interesting as a standalone, but I didn't realise how much of that section depended on attribution patching knowledge.
Re modus tollens, looking at the data, it seems like the correct answer is always yes. This admits far more trivial solutions than really understanding the task (i.e. always say yes, vs always say no). Has anyone checked what happens when the correct answer varies?
I appreciate the feedback! I have since bought a graphics tablet :) If you want to explore induction heads more, you may enjoy this tutorial
Any papers you're struggling to find?
Ah, thanks for the clarification! That makes way more sense. I was confused because you mentioned this in a recent conversation, I excitedly read the paper, and then couldn't see what the fuss was about (your post prompted me to re-read and notice section 4.1, the good section!).
Another thought: The main thing I find exciting about model editing is when it is surgical - it's easy to use gradient descent to find ways to intervene on a model while breaking performance everywhere else. But if you can really localise where a concept is represented in the model and apply the edit there, that feels really exciting to me! Thus I find this work notably more exciting (because it edits a single latent variable) than ROME/MEMIT (which apply gradient descent).
Thanks for sharing! I think the paper is cool (though massively buries the lede). My summary:
Thanks! Yeah, I hadn't seen that but someone pointed it out on Twitter. Feels like fun complementary work.
I'm so torn on this paper - I think it makes a reasonable point that many claims of emergence are overrated and that it's easy to massage metrics into a single narrative. But also, I think the title and abstract are overclaiming clickbait - obviously models have emergent abilities!! Chain of thought and few-shot learning are just not things smaller models can do. Accuracy is sometimes the right metric, etc. Emergence is often overhyped, but this paper way overclaims.
Can you elaborate? I don't really follow, this seems like a pretty niche concern to me that depends on some strong assumptions, and ignores the major positive benefits of interpretability to alignment. If I understand correctly, your concern is that if AIs can know what the other AIs will do, this makes inter-AI coordination easier, which makes a human takeover easier? And that dangerous AIs will not be capable of doing this interpretability on AIs themselves, but will need to build on human research of mechanistic interpretability? And that mechanistic interpretability is not going to be useful for ensuring AIs want to establish solidarity with humans, noticing collusion, etc., such that its effect of helping AIs coordinate dominates over any safety benefits?
I don't know, I just don't buy that chain of reasoning.
Thanks for writing this retrospective! I appreciate the reflections.