[Content note: this is off-the-cuff, maybe nothing new, better to publish than not. I checked Nanda's posts and found this post which is basically the same idea, and it points to tools and other stuff.]

In imitation learning, an AI is trained to imitate a person's behavior. Reverse imitation would be the opposite: a person tries to imitate the behavior of an AI.

Reverse imitation learning could be used as an approach to interpretability. You try to imitate the AI, and thereby figure out "how the AI thinks" and "what the AI knows".

Imitating the whole AI is probably very hard, though. An easier thing might be to imitate some very small part, like one neuron or a small set of neurons. That's narrow reverse imitation.

The meat of interpretability research is, I assume, coming up with clever new ways of algorithmically looking at an AI's internals. But that seems hard: it requires cleverness and a lot of work implementing new code. Just trying to imitate some piece of the AI would require less cleverness and less up-front investment. It doesn't call for radical new ideas; it's just playing a myopic game. So lots of people could play the game without much investment.

The point would be that maybe you can build out your ability to imitate some small chunk of a neural net by starting with one neuron, then going to another connected neuron, and so on. If modularity can be discovered in neural nets automatically, or neural nets can be made so that they're transparently modular, then the game could be focused on building out imitation of a module.

Concretely, the game would be played like this: a neuron in the neural net is fixed. You're presented with an example input to the net. You output a number. The net is run on the input, and the activation of the fixed neuron is read off. Your score is how close your number was to that activation. Repeat.
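The round described above can be sketched in a few lines. This is a minimal toy, not a real interface: the "neural net" is a hypothetical two-layer MLP with fixed random weights standing in for the model being interpreted, and the scoring rule (negative absolute error) is one arbitrary choice among many.

```python
# A minimal sketch of one round of the narrow reverse-imitation game.
# The "net" is a toy stand-in; in practice it would be the model
# whose neuron the player is trying to imitate.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))  # hidden-layer weights (toy, random)

def hidden_activations(x):
    """Run the toy net's first layer and return its ReLU activations."""
    return np.maximum(0.0, W1 @ x)

def play_round(neuron_index, x, player_guess):
    """Score one round: the closer the guess to the activation, the
    higher (less negative) the score."""
    activation = hidden_activations(x)[neuron_index]
    score = -abs(player_guess - activation)
    return activation, score

x = rng.normal(size=4)  # an example input shown to the player
activation, score = play_round(neuron_index=3, x=x, player_guess=0.5)
```

After each round the true activation would be revealed, so the player gets a feedback signal to learn from.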

Some variations and extensions of the game:
  • Instead of predicting neurons, predict the values of linear probes (if there are interesting linear probes to predict).
  • Make a neural net without too much polysemanticity, or try to factor out the polysemanticity with small probes. Then try to predict those neurons.
  • Give the player more information. For example, let her consult the activations of nearby neurons, or any other neurons, or only upstream or downstream neurons, or simple probes. In the limit of more information, this is just asking the player to learn the weight vector of the neuron being imitated, but that's fine. The game has levels: less information makes for a harder level.
  • You can use gameplay data to find neurons that players are good or bad at learning to imitate.
  • You can also use gameplay data to find training parameters (what the task is, what the data is, what the neural net architecture is, what the training algorithm is) that produce AI models that have neurons that players are good or bad at learning to imitate.
  • Use automated methods for finding interesting neurons. You can also use gameplay data to evaluate those methods: which ones surface neurons that players like to play, find interesting, or are good or bad at learning to imitate.
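The gameplay-data bullets above amount to simple aggregation over logs. Here is a sketch under assumed conventions: each log entry is a hypothetical `(neuron_id, player_guess, true_activation)` triple, and mean absolute error is one possible difficulty measure.

```python
# A sketch of mining gameplay logs to rank neurons by how well
# players imitate them. The log format and the example entries are
# made up for illustration.
from collections import defaultdict
from statistics import mean

# Each entry: (neuron_id, player_guess, true_activation)
logs = [
    (0, 0.9, 1.0), (0, 1.1, 1.0),  # players track neuron 0 closely
    (1, 0.2, 2.0), (1, 3.0, 0.1),  # neuron 1 is hard to imitate
]

def rank_neurons(logs):
    """Return neuron ids sorted from easiest to hardest to imitate,
    using mean absolute prediction error as the difficulty measure."""
    errors = defaultdict(list)
    for neuron_id, guess, activation in logs:
        errors[neuron_id].append(abs(guess - activation))
    return sorted(errors, key=lambda n: mean(errors[n]))
```

The same aggregation, keyed on training parameters or on the method that surfaced the neuron instead of on `neuron_id`, would cover the other gameplay-data bullets.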
