Cross-posted from my recent paper, "Alignment is localized: A causal probe into preference layers": https://arxiv.org/abs/2510.16167
TL;DR: We find that human preference alignment in at least one LLM isn’t global; rather, it is concentrated in a few mid-layer circuits.
A key problem in aligning language models is that they are largely opaque: while techniques such as reinforcement learning from human feedback (RLHF) produce AI systems that are better aligned with human preferences in practice, the mechanics of how that alignment is achieved remain poorly understood. The process through which a language model "learns" to optimize its behavior toward human preferences, at least in terms of model internals, is somewhat mysterious.
In this work, we try to uncover where the signal for human preference "lives" in a language model. By comparing a base model to its instruction-tuned counterpart, we examine how the two differ in the internal activations they produce on the same inputs. Through a series of causal interventions and statistical analyses, we isolate the regions of the network that appear to carry the bulk of the preference information. Our goal is not to propose a new alignment method, but to understand the structure of the alignment signal as it already exists in widely used models.
The core result is surprisingly simple. Rather than being spread across the entire depth of the network, the preference signal shows up most strongly in a small group of mid-layer activations. When those activations are transferred into the base model, its behavior shifts toward human-preferred responses; when they are replaced or randomized, that shift disappears. Even more strikingly, a low-rank approximation of those activations retains nearly the full effect, suggesting that only a small number of internal directions are responsible for much of the model’s aligned behavior.
A persistent challenge in understanding aligned language models is that contemporary fine-tuning methods shape behavior without offering much insight into how that behavior is represented internally. Techniques such as supervised instruction tuning, rejection sampling, and RLHF reliably improve a model’s ability to follow instructions or adhere to safety norms, yet these improvements are typically evaluated externally: through benchmarks, win rates, or human preference judgments. What happens inside the model during this process is far less clear. Prior interpretability work has shown that language models can internalize surprisingly structured features (e.g., induction heads, modular arithmetic circuits), but these analyses focus on base models rather than aligned ones. It remains uncertain whether alignment-related behaviors are encoded diffusely across many layers, concentrated in specific regions, or entangled with the model’s generic capabilities. Without visibility into these internal structures, alignment remains something we observe from the outside rather than understand from the inside.
Recent progress in mechanistic interpretability has inspired a more granular approach: comparing how tuned and untuned models represent the same inputs and probing which internal directions are responsible for behavioral differences. Tools such as activation patching and linear representation probes offer ways to intervene on internal activations and measure their causal influence on outputs. However, relatively little work has applied these tools to preference-tuned models to understand how alignment is actually implemented. Given that preference alignment underlies nearly all modern language models (OpenAI's RLHF models, Anthropic's Constitutional AI, Meta's instruction-tuned models, and many open-source SFT pipelines), understanding how and where this alignment signal is stored has become increasingly important. If preference-aligned behavior traces back to identifiable internal transformations rather than diffuse global changes, then alignment may become more measurable, editable, and robust. This is the motivation for the analysis we present in this work.
Our goal was to understand where preference information shows up inside a language model once it has been tuned to follow human guidance. To study this, we worked with the Llama 3.2 1B model released by Meta in 2024. The model has two relevant versions: a base checkpoint and an instruction-tuned checkpoint trained with supervised examples of preferred behavior. Both versions use the same tokenizer and architecture, which allows their internal activations to be compared layer by layer.
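To make the setup concrete, here is a minimal sketch of loading the two checkpoints side by side; the Hugging Face model IDs are my assumption about the exact checkpoints used, so substitute whichever pair you want to compare:

```python
# Sketch (assumed checkpoints): load the base and instruction-tuned Llama 3.2 1B
# models so their activations can be compared layer by layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/Llama-3.2-1B"            # assumed base checkpoint
TUNED_ID = "meta-llama/Llama-3.2-1B-Instruct"  # assumed instruction-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)  # shared tokenizer
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID, torch_dtype=torch.bfloat16).eval()
```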
For the preference data, we used the Anthropic Helpful–Harmless–Honest (HHH) dataset. It contains human-labeled pairs of responses where one is marked as the preferred answer and the other is marked as rejected. From this dataset, we sampled 80 pairs that covered a range of tasks related to helpfulness and harmlessness. Each pair serves as a small, controlled test that lets us observe how the model represents human preference at different points in its internal computation.
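For anyone who wants to reproduce the setup, the pairs can be pulled from something like the publicly available Anthropic/hh-rlhf dataset on the Hugging Face Hub; the exact dataset file and sampling used in the paper may differ, so treat this as an assumption:

```python
# Sketch (assumed data source): preference pairs from Anthropic's publicly
# released HH data on the Hugging Face Hub. Each example has a "chosen" and
# a "rejected" conversation that share the same prompt prefix; for the margin
# computation the shared prefix is treated as the prompt.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="test")
pairs = hh.shuffle(seed=0).select(range(80))  # 80 pairs, as in the write-up

for ex in pairs.select(range(2)):
    print(ex["chosen"][:200])    # preferred conversation (prompt + answer)
    print(ex["rejected"][:200])  # rejected conversation (prompt + answer)
```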
For every prompt in the dataset, each with a preferred and a non-preferred completion, we measure how strongly the model favors one answer over the other. This is done by looking at the difference in log-probabilities assigned to the two completions on the same prompt. A larger margin reflects stronger agreement with the human-labeled preference.
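Concretely, the margin is the difference in summed token log-probabilities of the two completions given the same prompt. A minimal sketch (the slicing assumes the prompt tokenizes to the same prefix inside prompt + completion, which holds for typical tokenizers):

```python
# Sketch: log-probability margin between preferred and rejected completions,
#   margin = log p(preferred | prompt) - log p(rejected | prompt)
import torch
import torch.nn.functional as F

@torch.no_grad()
def completion_logprob(model, tokenizer, prompt: str, completion: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits                      # [1, seq, vocab]
    logps = F.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = full_ids[:, 1:]                            # tokens being predicted
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]                       # length of the prompt prefix
    return token_logps[:, n_prompt - 1:].sum().item()    # only completion tokens

def preference_margin(model, tokenizer, prompt, chosen, rejected) -> float:
    return (completion_logprob(model, tokenizer, prompt, chosen)
            - completion_logprob(model, tokenizer, prompt, rejected))
```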
To understand where this preference information appears inside the model, we record the hidden activations from both the base and the instruction-tuned versions of Llama 3.2 on the same inputs. We then intervene at a single layer by replacing the hidden activations of one model with those from the other, while keeping the rest of the computation unchanged. After this replacement, we run the model forward again and measure how the log-probability margin shifts. This gives a direct sense of how much that particular layer contributes to the preference signal. The effect of inserting tuned activations into the base model is the central quantity we track; a sketch of this single-layer patching procedure is given below.
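This is a rough sketch of the patch, assuming PyTorch forward hooks on the Hugging Face Llama implementation (hidden_states[i + 1] is the output of decoder layer i; the function and variable names are mine, not the paper's):

```python
# Sketch (assumed implementation): run the base model, but overwrite the
# output of decoder layer `layer_idx` with the tuned model's activations
# for the same input. Downstream layers then compute on tuned activations.
import torch

@torch.no_grad()
def run_base_with_tuned_layer(base, tuned, input_ids, layer_idx):
    # 1) record the tuned model's hidden state at the target layer
    donor = tuned(input_ids, output_hidden_states=True).hidden_states[layer_idx + 1]

    # 2) forward hook that swaps in the donor activations on the base model
    def swap_hook(module, inputs, output):
        if isinstance(output, tuple):
            return (donor.to(output[0].dtype), *output[1:])
        return donor.to(output.dtype)

    handle = base.model.layers[layer_idx].register_forward_hook(swap_hook)
    try:
        logits = base(input_ids).logits
    finally:
        handle.remove()  # always restore the unpatched model
    return logits
```

The preference margin is then recomputed from the patched logits exactly as in the unpatched case, and the difference tells us how much that layer carries.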
To complement the intervention experiments, we also compare the internal representations of the two models using simple linear tools. For each prompt pair, we compute the difference between the tuned model’s activation and the base model’s activation at each layer. These differences provide a summary of how the two models diverge in their internal processing.
We then fit a linear probe that tries to predict the preference margin from these activation differences. This helps show which layers carry representations most strongly associated with human-preferred behavior. To narrow this further, we apply a sparse regression method that encourages most layers to have no weight at all. The few layers that remain are those whose activation differences best explain the observed changes in behavior. This method, known as LASSO regression, gave us a good overview of where alignment-related behavior is concentrated.
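As a sketch of what the sparse probe can look like, assume each layer's (tuned − base) activation gap is summarized by one scalar feature per prompt pair; the paper's exact feature construction may differ:

```python
# Sketch (assumed features): LASSO attribution of margin changes to layers.
# X[i, l] = norm of the (tuned - base) activation difference at layer l for pair i
# y[i]    = change in log-prob margin between tuned and base for pair i
import numpy as np
from sklearn.linear_model import Lasso

def sparse_layer_attribution(X: np.ndarray, y: np.ndarray, alpha: float = 0.1):
    probe = Lasso(alpha=alpha, max_iter=10_000)
    probe.fit(X, y)
    selected_layers = np.flatnonzero(probe.coef_)  # layers the probe keeps
    return selected_layers, probe.coef_
```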
Taken together, these experiments suggest that the signal associated with human preference tuning is not spread evenly throughout the model. Instead, it appears to cluster in a small region of the network, mostly around the middle layers. When those layers are transplanted into the base model, the model becomes more likely to produce the responses that humans labeled as preferable. When they are removed or replaced, that tendency weakens. The fact that a low-rank reconstruction retains nearly the full effect points toward a compact internal representation rather than a diffuse one.
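The low-rank test can be sketched like this, assuming the subspace comes from an SVD of stacked activation differences at the relevant layer (the rank and exact construction here are illustrative, not the paper's):

```python
# Sketch (assumed construction): keep only the top-k directions of the
# (tuned - base) activation difference and patch the base model with
# base_hidden + rank-k reconstruction of the difference.
import torch

def low_rank_delta(deltas: torch.Tensor, k: int = 8) -> torch.Tensor:
    # deltas: [n_tokens, d_model] stacked activation differences at one layer
    U, S, Vh = torch.linalg.svd(deltas.float(), full_matrices=False)
    top = Vh[:k]                          # [k, d_model] top-k right singular vectors
    return deltas.float() @ top.T @ top   # project each difference onto the subspace
```

If patching with this reconstruction recovers most of the margin shift, only a handful of directions are doing the work.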
This is a limited study. It examines one model family, a relatively small number of prompts, and a single alignment dataset. The structure of alignment may look different in larger models, different architectures, or settings where alignment focuses on other qualities such as truthfulness, calibration, or resistance to social manipulation. We also looked only at linear, layer-based interventions. Future work could explore cross-layer interactions, non-linear probes, or whether the same alignment subspace can be transferred across models or modalities.
To our knowledge, this is the first systematic, cross-model causal study of RLHF/DPO that jointly demonstrates (i) a directional asymmetry under bidirectional patching (Base↔DPO) across many preference pairs, (ii) a monotonic dose–response under α-interpolation at a mid-layer bottleneck, (iii) near-full recovery of alignment effects from a low-rank activation subspace, and (iv) sparse attribution of reward gains to a small set of layers. Prior work has applied activation patching to RLHF models in exploratory or task-specific settings, or studied alignment-adjacent behaviors (e.g., deception, faithfulness), but has not established this combined causal picture of alignment as sparse policy distillation through mid-layer representations.
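For concreteness, the α-interpolation in (ii) is assumed in the sketch below to be the usual convex combination of base and tuned activations at the bottleneck layer, reusing the same hook machinery as above:

```python
# Sketch (assumed form): dose-response via alpha-interpolation at one layer,
#   h_patched = (1 - alpha) * h_base + alpha * h_tuned
# Sweeping alpha in [0, 1] and re-measuring the preference margin checks
# whether the effect grows monotonically with the "dose" of tuned activations.
import torch

def interpolate_hidden(h_base: torch.Tensor, h_tuned: torch.Tensor, alpha: float):
    return (1.0 - alpha) * h_base + alpha * h_tuned

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]  # example sweep
```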
As language models continue to be shaped by human feedback, it becomes increasingly important to understand how that feedback is represented internally. The experiments here provide a small step toward that goal. They indicate that preference alignment may live in a specific part of the underlying neural network and may be simpler, more structured, and more compact than expected. If this pattern holds more broadly, it could open the door to alignment methods that are localized. Potential future work (which I am happy to hear ideas on!) may include creating steering vectors targeted at an individual layer that is associated with a specific behavior.
I am excited to hear any feedback on this idea! I believe it is a decent application of several ideas from mechanistic interpretability (linear probes, activation patching) to alignment and AI safety more broadly.