Neel Nanda

Comments

Thanks for writing this retrospective! I appreciate the reflections

Ah, thanks! As I noted at the top, this was an excerpt from that post, which I thought was interesting as a standalone piece, but I didn't realise how much of that section depended on knowledge of attribution patching

Re modus tollens: looking at the data, it seems like the correct answer is always yes. This admits far more trivial solutions than really understanding the task (i.e. always saying yes vs always saying no). Has anyone checked a variant where the correct answer varies?

I appreciate the feedback! I have since bought a graphics tablet :) If you want to explore induction heads more, you may enjoy this tutorial

Any papers you're struggling to find?

Ah, thanks for the clarification! That makes way more sense. I was confused because you mentioned this in a recent conversation, I excitedly read the paper, and then couldn't see what the fuss was about (your post prompted me to re-read and notice section 4.1, the good section!).

Another thought: the main thing I find exciting about model editing is when it is surgical - it's easy to use gradient descent to find ways to intervene on a model while breaking performance everywhere else. But if you can really localise where a concept is represented in the model and intervene just there, that feels really exciting to me! Thus I find this work (which edits a single latent variable) notably more exciting than ROME/MEMIT (which apply gradient descent).

Thanks for sharing! I think the paper is cool (though massively buries the lede). My summary:

  • They create a synthetic dataset of lit and unlit rooms with StyleGAN. They exploit the fact that the GAN has disentangled, meaningful directions in its latent space, which can be individually edited. They find a lighting latent automatically: take noise that produces rooms, edit each latent in turn, and look for big changes specifically on light pixels
    • StyleGAN does not have a text input, and there's no mention of prompting (as far as I can tell - I'm not familiar with GANs). This is not a DALL-E style model. Its input is just Gaussian noise
    • This is a really cool result, and I am excited about it! Though it's apparently already known that GANs have disentangled latents, which makes this less exciting (man, I wish this were true of LLMs). But it's still solid!
    • This is in section 4.1
  • They manually create a dataset of lit and unlit rooms, which isn't that interesting. They use this for benchmarking their method, not for actually training it (I don't find this that exciting)
  • They use the GAN as a source of training data, to train a model specifically for lit -> unlit rooms (I don't find this that exciting)
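The latent-search step described above can be sketched as follows. This is a toy illustration, not the paper's code: the "generator" here is a stand-in random linear map (a real run would use a pretrained StyleGAN), and the light-pixel mask and scoring rule are my assumptions about the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the generator: latent vector -> (H, W) image.
# In the paper this would be a pretrained StyleGAN; here it's a fixed
# random linear map purely so the sketch runs end to end.
LATENT_DIM, H, W = 16, 8, 8
W_gen = rng.normal(size=(LATENT_DIM, H * W))

def generate(z):
    """Hypothetical generator mapping a latent vector to an image."""
    return (W_gen.T @ z).reshape(H, W)

# Mask of "light pixels" (e.g. lamp/window regions), assumed given.
light_mask = np.zeros((H, W), dtype=bool)
light_mask[:2, :2] = True

def find_lighting_latent(z, delta=1.0):
    """Score each latent direction by how much perturbing it changes
    the masked (light) pixels relative to the rest of the image,
    and return the index of the best-scoring direction."""
    base = generate(z)
    scores = []
    for i in range(LATENT_DIM):
        z_edit = z.copy()
        z_edit[i] += delta  # edit one latent direction at a time
        diff = np.abs(generate(z_edit) - base)
        on_light = diff[light_mask].mean()
        off_light = diff[~light_mask].mean()
        scores.append(on_light / (off_light + 1e-8))
    return int(np.argmax(scores))

z = rng.normal(size=LATENT_DIM)
best = find_lighting_latent(z)
```

The key design choice is the score: changes concentrated on light pixels relative to everywhere else, so a direction that just changes the whole image doesn't win.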

Thanks! Yeah, I hadn't seen that, but someone pointed it out on Twitter. Feels like fun complementary work

I'm so torn on this paper - I think it makes the reasonable point that many claims of emergence are overrated, and that it's easy to massage metrics to fit a single narrative. But I also think the title and abstract are overclaiming clickbait - obviously models have emergent abilities! Chain of thought and few-shot learning are just not things smaller models can do, accuracy is sometimes the right metric, etc. Emergence is often overhyped, but this paper way overclaims

Can you elaborate? I don't really follow - this seems like a pretty niche concern to me, one that depends on some strong assumptions and ignores the major positive benefits of interpretability for alignment. If I understand correctly, your concern is: if AIs can know what other AIs will do, this makes inter-AI coordination easier, which makes a takeover easier? And dangerous AIs will not be capable of doing this interpretability on other AIs themselves, but will need to build on human research in mechanistic interpretability? And mechanistic interpretability will not be useful for ensuring AIs want to establish solidarity with humans, noticing collusion, etc., such that its effect of helping AIs coordinate dominates any safety benefits?

I don't know, I just don't buy that chain of reasoning.
