How do AI models work? In many ways, we know the answer to this question, because we engineered those models in the first place. But in other, more fundamental ways, we have no idea. Systems with many parts that interact nonlinearly are hard to understand; by "understand" we mean, at a minimum, being able to predict their behavior. And so while the AI community has enjoyed tremendous success in creating highly capable models, we remain far behind in actually understanding how they are able to perform complex reasoning (or autoregressive token generation).
The budding field of Mechanistic Interpretability is focused on understanding AI models. One recent approach that has generated a lot of attention is...