Coauthored by Fedor Ryzhenkov and Dmitrii Volkov (Palisade Research)
At Palisade, we often discuss the latest safety results with policymakers and think tanks seeking to understand the state of current technology. This document condenses and streamlines the internal notes we wrote while discussing Anthropic's "Scaling Monosemanticity".
Executive Summary
Research on AI interpretability aims to unveil the inner workings of AI models, traditionally seen as “black boxes.” A clearer picture of these inner workings lets us make AI safer, more predictable, and more efficient. Anthropic’s Transformer Circuits Thread focuses on mechanistic (bottom-up) interpretability of AI models.
Their latest result, Scaling Monosemanticity, demonstrates how interpretability techniques that worked for small, shallow models can scale to practical 7B (GPT-3.5-class) models. This paper...