Suppose you use mech interp to discover features in a model relating to a particular concept — say, the Golden Gate Bridge. They're even causal: amplify them and the model talks about the bridge more often, even speaking as if it were the bridge; suppress them and references to it...
In "Selective regularization for alignment-focused representation engineering", we presented a successful approach for structuring the latent space of a simple MLP. Here we document our side quests: experiments that didn't go as expected but that taught us about regularization design and training dynamics.

> Written with planning and...
We study how selective regularization during training can guide neural networks to develop predictable, interpretable latent spaces with alignment applications in mind. Using color as a test domain, we observe that anchoring even a single concept (red) influences the organization of other concepts, with related concepts clustering nearby — even...
Recent experiments have provided evidence that "bad" behaviors in language models cluster together[1]. While surprising at first, it makes sense on reflection: these behaviors likely cluster because grouping them is simply the most efficient way to compress information during training. But intervening on model behavior remains challenging: we don't know all directions...
When large language models (LLMs) refuse to help with harmful tasks, attackers sometimes try to confuse them by adding bizarre strings of text called "adversarial suffixes" to their prompts. These suffixes look weird to humans, which raises the question: do they also look weird to the model? Alon & Kamfonas...