There's a pretty exciting new interpretability paper, which hasn't really received the requisite attention because it's not billed as such.

This paper modifies the transformer architecture so that a forward pass minimizes a specifically engineered energy function. 

According to the paper, "This functionality makes it possible to visualize essentially any token representation, weight, or gradient of the energy directly in the image plane. This feature is highly desirable from the perspective of interpretability, since it makes it possible to track the updates performed by the network directly in the image plane as the computation unfolds in time".

They achieve SOTA on two of the domains they tested on, although they didn't test on NLP or CV tasks (which is why the paper was rejected, I believe the authors will resubmit again with more experiments.)

More generally, I think architectures such as the above that essentially give you interpretability for free are a promising research direction.

New Comment
1 comment, sorted by Click to highlight new comments since: Today at 9:29 AM

Huh, interesting. I skimmed the paper and I'm not convinced this specific architecture is promising for tasks that move a lot of information or have hierarchical structure - the lack of a value (only keys and queries) seems like a big downgrade. The graph classification results are pretty good though, and I'd agree with the authors that it's probably because they've improved information routing without having much worse inductive biases than GCNNs. Does this match your impression?

I'm also kind of a downer about interpretability. There's different kinds of it. Each neuron having an input-space interpretation that humans can mostly figure out by eyeballing it doesn't help you much when you have ten billion neurons. The more powerful kinds of interpretability (which it would be exciting to get for free) have more to do with compression and search - they let you form simplified abstract models of an AI's reasoning and tell you about the domains of validity of those models.