In "Selective regularization for alignment-focused representation engineering", we presented a successful approach for structuring the latent space of a simple MLP. Here we document our side quests: experiments that didn't go as expected, but that taught us about regularization design and training dynamics. This is the second part...
We study how selective regularization during training can guide neural networks to develop predictable, interpretable latent spaces with alignment applications in mind. Using color as a test domain, we observe that anchoring even a single concept (red) influences the organization of other concepts, with related concepts clustering nearby — even...
Recent experiments have provided evidence that "bad" behaviors in language models cluster together[1]. While surprising at first, it makes intuitive sense: these behaviors likely cluster because it's simply the most efficient way to compress information during training. However, intervening on model behavior remains a tremendous challenge—partly because we don't know...
When large language models (LLMs) refuse to help with harmful tasks, attackers sometimes try to confuse them by adding bizarre strings of text called "adversarial suffixes" to their prompts. These suffixes look weird to humans, which raises the question: do they also look weird to the model? Alon & Kamfonas...