Alignment may be localized: a short (and admittedly limited) experiment
Cross-posted from my recent paper, "Alignment is localized: A causal probe into preference layers": https://arxiv.org/abs/2510.16167

TL;DR: We find that human preference alignment in at least one LLM isn't global; rather, it is concentrated in a few mid-layer circuits.

Abstract

A key problem in aligning language models is that they are largely opaque: while techniques such as reinforcement learning from human feedback (RLHF) produce AI systems that are, in practice, better aligned with human preferences, the mechanics behind how that alignment is achieved remain poorly understood. The process through which a language model "learns" to optimize its behavior toward human preferences, at least in terms of model internals, is somewhat mysterious. In this work, we try to uncover where the signal for human preference "lives" in a language model. By comparing a base model to its instruction-tuned counterpart, we examine how the two differ in the internal activations they produce on the same inputs. Through a series of causal interventions and statistical analyses, we isolate the regions of the network that appear to carry the bulk of the preference information. Our goal is not to propose a new alignment method, but to understand the structure of the alignment signal as it already exists in widely used models.

The core result is surprisingly simple. Rather than being spread across the entire depth of the network, the preference signal shows up most strongly in a small group of mid-layer activations. When those activations are transferred into the base model, its behavior shifts toward human-preferred responses; when they are replaced or randomized, that shift disappears. Even more strikingly, a low-rank approximation of those activations retains nearly the full effect, suggesting that only a small number of internal directions are responsible for much of the model's aligned behavior.

Background: A persistent challenge in understanding aligned language m