[Content note: this is off-the-cuff, maybe nothing new, better to publish than not. I checked Nanda's posts and found this post which is basically the same idea, and it points to tools and other stuff.]
In imitation learning, an AI is trained to imitate the way a person behaves. Reverse imitation learning would be where a person tries to imitate the behavior of an AI.
Reverse imitation learning could be used as an approach to interpretability. You try to imitate the AI, and thereby figure out "how the AI thinks" and "what the AI knows".
Imitating the whole AI is probably very hard, though. An easier thing might be to imitate some very small part, like one neuron or a small set of neurons. That's narrow reverse imitation.
The meat of interpretability research is, I assume, about coming up with clever new ways of algorithmically looking at an AI's internals. But doing that seems hard: it takes cleverness, and a lot of work to implement new code. What seems to require less cleverness and investment is just trying to imitate some piece of the AI. This wouldn't require coming up with radical new ideas; it's just playing a myopic game. So lots of people could play the game without much up-front investment.
The point would be that maybe you can build out your ability to imitate some small chunk of a neural net by starting with one neuron, then going to another connected neuron, and so on. If modularity can be discovered in neural nets automatically, or neural nets can be made so that they're transparently modular, then the game could be focused on building out imitation of a module.
Concretely, the game would be played like this: There's a fixed neuron in the neural net. You're presented with an example input to the neural net. You output a number, your guess for that neuron's activation. The neural net is run on the input, and the activation of the neuron is read off. Your score is how close your number was to the activation. Repeat.
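Here's a rough sketch of that loop in Python, just to make the rules concrete. Everything in it is made up for illustration: the "neural net" is a toy one-layer tanh network with random weights (a stand-in for whatever real model you'd actually use), the names `make_toy_net`, `play_round`, and `play_game` are hypothetical, and scoring by absolute error is only one possible reading of "how close".

```python
import math
import random

def make_toy_net(n_in, n_hidden, seed=0):
    """Hypothetical stand-in for a real model: one tanh layer, random weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]

def hidden_activations(weights, x):
    """Run the toy net on input x and return every neuron's activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def play_round(weights, neuron, x, guess):
    """One round: run the net, read off the fixed neuron, score the guess."""
    activation = hidden_activations(weights, x)[neuron]
    error = abs(guess - activation)  # assumed scoring rule: lower is better
    return activation, error

def play_game(weights, neuron, inputs, guesser):
    """Repeat the round over many inputs and return the mean absolute error."""
    errors = []
    for x in inputs:
        guess = guesser(x)  # the player commits before seeing the activation
        _, error = play_round(weights, neuron, x, guess)
        errors.append(error)
    return sum(errors) / len(errors)

# Example: two "players" trying to imitate neuron 2 of the toy net.
weights = make_toy_net(n_in=3, n_hidden=4)
rng = random.Random(1)
inputs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(20)]

def always_zero(x):
    return 0.0  # lazy baseline player

def perfect(x):
    return hidden_activations(weights, x)[2]  # cheats by running the net

baseline_error = play_game(weights, 2, inputs, always_zero)
perfect_error = play_game(weights, 2, inputs, perfect)
```

In the actual game the `guesser` would be a human typing in a number, and the interesting part is watching their error drop below the lazy baseline as they form a picture of what the neuron responds to.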