We thank Madeline Brumley, Joe Kwon, David Chanin and Itamar Pres for their helpful feedback.
Introduction
Controlling LLM behavior by directly intervening on internal activations is an appealing idea, and various activation steering methods have been proposed to do so. Most steering methods add a 'steering vector' (SV) to the model’s activations at a given layer and token position during inference. This approach leverages the hypothesis that many human-interpretable 'concepts' like truthfulness, refusal, and sentiment are represented as directions in activation space. Steering interventions are appealing because they use much less data than fine-tuning and do not require changes to the model parameters. In principle, this makes them more efficient and easy to controlling...
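To make the additive intervention concrete, here is a minimal sketch in plain Python. It assumes the activations of one layer are available as a list of per-token vectors; the function name `add_steering_vector`, the `scale` multiplier, and the toy numbers are illustrative assumptions, not taken from any particular steering method or library.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def add_steering_vector(
    activations: Sequence[Sequence[float]],
    steering_vector: Sequence[float],
    scale: float = 1.0,
    positions: Optional[Sequence[int]] = None,
) -> List[Vector]:
    """Return a copy of one layer's activations with the steering vector added.

    activations: one row per token position, each of length hidden_dim.
    positions: token indices to steer; None means every position.
    scale: hypothetical multiplier controlling the steering strength.
    """
    hidden_dim = len(steering_vector)
    if positions is None:
        targets = set(range(len(activations)))
    else:
        targets = set(positions)

    steered = []
    for pos, row in enumerate(activations):
        if len(row) != hidden_dim:
            raise ValueError("activation and steering vector sizes differ")
        if pos in targets:
            # Shift the activation along the steering direction.
            steered.append([a + scale * s for a, s in zip(row, steering_vector)])
        else:
            # Leave unsteered positions unchanged (copied, not aliased).
            steered.append(list(row))
    return steered


# Toy example: 3 token positions, hidden size 2 (values are made up).
layer_activations = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
sv = [1.0, -1.0]

# Steer only the last token position.
steered_last = add_steering_vector(layer_activations, sv, scale=2.0, positions=[2])

# Steer every token position.
steered_all = add_steering_vector(layer_activations, sv, scale=2.0)
```

In a real model the same addition would typically be applied to the hidden states of a chosen layer during the forward pass, leaving the model parameters untouched; the list-based version here only illustrates the arithmetic.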