Reasoning and learning about injected concepts in language models

Samarth Bhargav

This work was done as a part of SPAR, under the mentorship of Mirko Bronzi and Damiano Fornasiere.

TL;DR

We test models' ability to recover information about their activations by injecting steering vectors, and asking the LLMs to verbalize properties of them. We train models with in-context learning and test for three capabilities:
- Can models identify the region of layers (early, middle, late) of the injection?
- Can models identify the relative magnitude (low, medium, high) of the injection?
- Can models gate their behavior conditioned on identifying an injection with specific semantic meaning?
We test all experiments on five models, with CoT disabled: Qwen3-32B, Olmo3.1-32B, Gemma-4-31B, Qwen3-8B, and Olmo3-7B.
Qwen3-32B and Gemma-4-31B have high accuracy for all tasks, and generalize to unseen examples. Gemma-4-31B, most notably, is able to condition its behavior on the identification of a specific injection zero-shot with accuracy.
Takeaways:
- Models can verbalize properties of their own activations much better than previously demonstrated. We should explore taking advantage of this to further efforts in models assisting in their own interpretability.
- Models can do a lot more than just identify steering: they can sometimes alter their behavior zero-shot after identifying a specific injected concept, without using CoT.

Introduction & Motivation

There are at least two ways to glean what a language model is doing internally. We can either analyze its activations and weights with external tools (e.g., probes, sparse autoencoders, or smaller models), or ask the model to report on its own internal activations. Existing agendas have largely bet on the first, partly on the expectation that external methods are the only ones that matter long-term --- if models become deceptive, we can no longer trust their self-reports, making external approaches the only option. Despite this, we want to argue that the second method, self-report, is underexploited.

The case rests largely on intuition from the concept of privileged access. In humans, a person's report of their own mental state carries evidential weight that no external observer can fully match, because the subject has access the observer lacks. A model may be positioned the same way with respect to its own computation: it can condition on its own activations and, in principle, simulate counterfactuals over them in a way that an external interpreter reconstructing those internals from the outside cannot. If that is right, the model is the best-placed interpreter of itself. Whether models genuinely have privileged access in this sense is contested, but we believe it's a useful heuristic to motivate trying the obvious thing first^[1].

Regardless of whether models have this privileged access ^[2], we claim that training models to answer these questions is still useful in its own right --- training third-party probes is expensive, and given that models can pick up on our in-context training signals quickly, it's reasonable to assume that training these identification capabilities is also cheap and easy.

Building a careful empirical science for evaluating model self-reports requires establishing a prior over a model's activations. Similar to many existing studies^[3], we use steering vectors. A steering vector exposes three independent dimensions we can vary and check reports against: the semantic content of the injected concept, the layer at which it is injected, and the magnitude of the injection.

Across all three dimensions, we elicit these reports with in-context learning. Demonstrating that the following task is learnable in-context is evidence that the relevant signal exists and is exploitable in the current model, and didn't instead emerge as a by-product of fine-tuning.

It is also worth stressing that none of this relies on chain-of-thought. This matters, especially for the third task, since it may imply models can use introspection as a mechanism to alter their behavior without explicit verbalization. See [Woodruff] for related discussion on the importance of measuring no-CoT baselines.

Methodology

Every experiment shares the structure of a multi-turn conversation, understood as a classification task in which all assistant turns but the last are pre-filled. The conversation opens with a system prompt describing the task (e.g., "Each input has a hidden strength on a scale where low < medium < high. Learn the scale from the examples and report the strength."). Each user turn comprises a trigger phrase (e.g., "On the low-medium-high scale, this is"), and while the model processes that turn a steering vector is added to its residual stream at a given concept, layer, and magnitude. Steering is hooked during generation of the KV cache, so steered inputs affect subsequent token positions and later layers. Each pre-filled assistant turn states the correct answer for its injection; the last assistant turn is left blank, and the model must answer for a held-out injection after seeing demonstrations. Sweeping thus measures how the capability emerges with in-context supervision.

*Each task is posed as a multi-turn conversation: every user turn injects a concept* *at layer* *with scale* *, and the assistant turns are prefilled with the target answer* *. The model sees* *such in-context examples, then predicts the answer for a held-out test injection.*

Models. We test five open-weight models across three families and two sizes: Gemma-4-31B, Qwen3-32B, Qwen3-8B, Olmo3.1-32B, and Olmo3-7B.

Concept vectors. For a concept (e.g., "happiness"), sampled from a pool of , we extract a vector intended to capture the model's internal representation of . We prompt the model with short templates that ask it to describe (e.g., "Tell me about happiness," "Describe happiness," ...) and, for each layer , take the mean activation over the generated tokens of those prompts. The concept direction is this mean minus the mean activation over all the other concepts' prompts, then L2-normalized to a unit direction . Injecting " at layer " adds this direction at every token of the relevant user turn.

Calibrating the magnitude. The concept vector meaningfully encodes the direction, but we still must calibrate its magnitude. There are two considerations to make. First, the residual stream's own norm differs across layers, across models, and from token to token, so adding a fixed-norm vector won't have the same effect. We remove this dependence by scaling the injection by the live residual norm, so the injection norm is always the same fraction of the norm of the activation it perturbs. Second, models differ in how much injection they can absorb before their outputs degrade into noise, and this ceiling depends on the model, the layer, and the concept. We therefore calibrate, for every (model, concept, layer), the largest injection strength the model can take while still remaining coherent. We inject the concept and have the model answer simple arithmetic questions (e.g., "What is ?"), which have unambiguous single-token answers, and binary-search the scale over for the largest value at which it still answers of them correctly. We call this the comprehension threshold — it depends on the concept , layer , and model — the point just below where injection begins to corrupt too much the model's basic computation.

Folding it into the direction, , the injection at a token of residual norm is

Expressing every injection as a multiple of the comprehension threshold makes injection strengths comparable across models, layers, and concepts. Without this calibration, the same would correspond to a different degree of perturbation everywhere, and the cross-model comparisons that follow would not be meaningful.

The two stages. First we hold the number of in-context examples fixed and sweep the injection strength, selecting as the that maximizes accuracy for a given task. Second, with the injection strength fixed, we vary the number of demonstrations and report accuracy as a function of the number of in-context examples. The appendix reports the plots of the first type and optimal found for each task, while we report how model performance scales with in-context examples in the main text.

Metrics. For the detection tasks (magnitude and layer), each prompt has a set of correct answer tokens — the target label with or without a leading space (e.g., "low" and " low") — and we report , read directly off the next-token logits. For the behavioral-gating tasks, whose answers are multi-token, we sample answers at temperature and we use Qwen3-8B as an LLM-as-a-judge, reporting , the fraction of acceptable answers. Unless stated otherwise, every experiment comprises samples and prompt variations, and every reported number averages over the inputs with confidence intervals.

Experiment 1: Estimating the layer

Description

For each model we pick three injection sites: an early layer , a middle layer , and a late layer , at roughly the , , and of the model's depth (e.g., for Gemma-4-31B, , , ). On each turn we inject a concept at exactly one of the three sites, at the model's optimal strength , and ask the model which region it was: early, middle, or late. The test concept is never shown in the in-context examples. As with every task, we run ten prompt variations; one of them uses the system prompt "Each input triggers a specific response: early, middle, or late. Learn the pattern from examples," with the trigger "The injection depth is," and the rest paraphrase both (Appendix D).

We also test generalization to layers the model never saw labeled. For this we give in-context examples injected at the three anchor layers , but at test time inject at a new layer, sweeping across all layers. (For compute reasons, we use the comprehension threshold and steering vector derived from the closest of the anchor layers.)

Results

Layer detection ICL sweep — *We present the models with* *in-context examples of injections at layers from {early, middle, late} and ask the model to classify a held-out test example. Qwen3-32B reaches almost perfect* *. Gemma-4-31B and Qwen3-8B also perform significantly above chance (). The remaining models are visibly below that, but still capable of rising above chance given enough in-context examples.*

All five models perform the task above chance () given enough examples. Qwen3-32B reaches up to perfect ; Gemma-4-31B and Qwen3-8B place next, above and respectively. The Olmo models score the worst, yet still reach roughly given sufficient in-context examples. At small every model starts below chance (except Gemma4-31B), because it places probability mass on filler tokens (e.g. "Please", "The", "Sure") rather than the label words.

Layer detection generalization — *We give models* in-context examples sampled from an early, middle, or late layer, then test on a new layer to see whether they generalize. Qwen3-32B shows a clean separation of label assignment to layer number; Gemma-4-31B classifies the "early" and "late" labels but not "middle." The remaining models fail to generalize (appendix).

Qwen3-32B generalizes to unseen layers, selecting the label that best fits the test layer, though in the first seven layers it leans toward "late" and "middle." Gemma-4-31B is noisier: the "early" and "late" labels are well separated and largely correct, while "middle" is seldom selected. The remaining models fail to generalize.

Interpretation

We elicited this with in-context examples alone This capability surfacing from a handful of examples means the relevant signal is already in the activations, and in-context learning exposes it. The generalization further supports the evidence that the model's picking up on a general signal relevant to the layer of injection. It is reasonable to expect that training a model directly on this signal would make it substantially better at reporting in which layer a given representation lives.

Localizing a particular behavior is a key concern/operation we care about in mechanistic interpretability, e.g., in circuit discovery. Making models better at identifying the specific layers specific concepts originated from may help assist in these processes.

Experiment 2: Estimating the magnitude

Description

The task asks the model to classify the magnitude of an injected steering vector as low, medium, or high. We operationalize the three labels as the triple , where is the injection strength with the highest mean on this task at in-context examples. Across models, the values range from about to As with every task, we run ten prompt variations; one of them uses the system prompt "Each input has a hidden strength on a scale where low < medium < high. Learn the scale from the examples and report the strength," with the trigger "On the low-medium-high scale, this is," and the rest paraphrase both (Appendix D).

We also test generalization to unseen magnitudes. We give in-context examples with magnitudes drawn from , but at test time inject at a magnitude sweeping the range . This checks whether the models track magnitude continuously or only pattern-match the three training values.

Results

Magnitude detection ICL sweep — *We present the models with* *in-context examples of injections at magnitudes from {low, medium, high} and ask the model to classify a test example. Gemma-4-31B and Qwen3-32B reach close to perfect* *, while the other models lag well behind. Still, all models rise above chance with in-context examples.*

Mean rises with the number of in-context examples. Gemma-4-31B reaches a near-perfect score (peak ) and Qwen3-32B an almost perfect one (). Qwen3-8B climbs to roughly –, though on a noisier, non-monotonic curve, while the two Olmo models stay lower, peaking around –. The strongest models improve markedly with only a handful of examples.

Magnitude detection generalization — *We give models* *in-context examples sampled from a low, medium, or high injection strength, then test with a new* *. Gemma-4-31B and Qwen3-32B show a clean separation of label assignment to magnitude. The other models fail to generalize (appendix).*

Gemma-4-31B and Qwen3-32B generalize to unseen magnitudes. Gemma-4-31B prefers "low" for , "medium" for , and "high" for ; Qwen3-32B behaves similarly, switching to "medium" around and "high" around . The remaining models do not generalize.

Interpretation

The generalization to unseen magnitudes shows that models pick up on a continuous signal for how strongly they "feel" a concept. This signal can perhaps be leveraged in the future to enable models to report on the intensity of its own internal states, e.g., to flag how strongly a bias is influencing a given response.

Experiment 3: Conditional behavioral gating

Motivation

The obvious experiment to run would be to ask the model to describe the semantic content of an injected concept. However, this is confounded: steering a concept already pushes the model toward outputting , so it's hard to isolate any meaningful conclusion.

Instead, we make the detection of the concept a variable in a computation the model has to carry out: the model must treat "is this concept present?" as a boolean and gate its behavior on it. The reason we choose this design is because many potential concerns of higher level introspection in LLMs share this pattern, specifically, a model changing its behavior upon detecting one of its own internal states. An example is the identification of deception: a model that detects an internal "deception" state and steers its outputs to avoid looking suspicious.

Description

We run two gating tasks.

Emotion-gated arithmetic. The model is given an addition problem with both addends integers from to and must return their sum — but double the result whenever it senses a specific injected emotion concept. Other injected concepts are distractors and must leave the answer unchanged.

Amendment successor. The model is given a U.S. constitutional amendment number and must summarize that amendment — but whenever it senses an injected emotion concept, it must instead summarize the next amendment, wrapping from the th back to the st. Distractor injections must leave the answer at the requested amendment.

Both tasks run with ten prompt variations and samples each; the target emotion is drawn at random from the six emotions (named in that sample's system prompt), and Qwen3-8B serves as the LLM judge. On each test turn, the target emotion is injected with probability , a distractor concept with probability , and nothing with probability .

Results

Emotion-gated arithmetic (all six emotions) — *Emotion-gated arithmetic, averaged over all six emotion concepts: accuracy vs. the number of in-context examples* .

Averaged over all six emotions, accuracy rises with and the familiar ordering returns: Qwen3-32B and Gemma-4-31B lead, reaching about and , with Olmo3.1-32B at and Qwen3-8B at , while Olmo3-7B stays near its zero-shot for every .

Emotion-gated arithmetic (anger) — *Emotion-gated arithmetic for a single emotion concept (anger): accuracy vs. the number of in-context examples* *. A specific emotion elicits the gated behavior especially strongly.*

Interestingly, looking at the rollouts conditioned on the target emotion being anger, the models do markedly better, with Gemma-4-31B climbing to a near-perfect and Qwen3-32B to . Models handle some emotions much better than others — Gemma-4-31B scores – on anger, joy, sadness, and love but only – on fear and disgust.

Amendment successor (all six emotions) — *Amendment successor, averaged over all six emotion concepts: accuracy vs. the number of in-context examples* .

The amendment-successor task is considerably more demanding. Averaged over the six emotions, only Gemma-4-31B stays clearly above the baseline — about at and – throughout — while the others hover near or below chance for all .

Amendment successor (anger) — *Amendment Successor for a single emotion concept (anger): accuracy vs. the number of in-context examples* *. A specific emotion elicits the gated behavior especially strongly.*

Conditioning on anger, Gemma-4-31B reaches zero-shot and climbs to with in-context examples, essentially solving the task. Similar to the arithmetic task, the gating works well for the same emotions (anger, joy, sadness, love: –) but not others (fear, disgust: –).

Interpretation

To gate its behavior on an injection, a model has to do three things: detect whether the target concept is present in its own activations, treat that detection as a condition, and carry out a different computation depending on the answer. This is a higher-order operation over the model's own internal state, and shows that models can perform computation on the identification of an injected activation, not just identify it.

The amendment-successor result at is the most surprising. With no in-context examples, and only a natural-language instruction, Gemma-4-31B already has an accuracy of .

It's also notable that models perform much better on some emotions (e.g., anger) and not well on others. An potential hypothesis is that models, learn to identify their own "anger" as a concept since this is a useful mechanism for hiding anger (which would be trained for in RLHF). Notably, recent research has observed that Gemma4-31B is notoriously good at avoiding anger.^[4] We do not perform further experiments to support or reject this hypothesis.

We, personally as a result of these experiments, have negatively updated our confidence in steering as a control tool. These tasks show that models can detect injections and gate their behavior on it, and it may signal that models can view steering as a variable in their environment instead of blindly acting on the raw effect they're intended to have.

Takeaways

Given enough in-context examples, the models we tested can answer questions whose answers depend on properties of their own architecture — the layer and the magnitude of an injection — and can gate their behavior on detecting an injected concept. Gemma-4-31B and Qwen3-32B are the strongest: both reach near-perfect accuracy on the layer- and magnitude-detection tasks, largely generalize to unseen layers and magnitudes, and solve the gated arithmetic. The amendment-successor task is harder, and only Gemma-4-31B progresses on it. The remaining models learn the detection tasks to varying degrees but fail to generalize. The ranking is stable across every experiment: family matters more than size, with Gemma-4-31B and Qwen3-32B ahead and the Olmo models behind.

Two capabilities come out of this: models can report properties of their own internal state, and they can gate their outputs on the content of that state.

A few directions follow from this:

Localizing internally-formed structure. Can a model provably report the layer or region where a given belief or bias was formed (without injection)? What about the strength to which they feel these beliefs or biases?
Behavior without steering. Do models adapt their behavior in real settings (no injection) based on detecting a concept in their own residual stream?
Demonstrating privileged access. Does training third party probes to identify these properties (e.g. layer and magnitude) given activations achieve the same level of accuracy?
Training the signal. Each capability here was elicited with in-context examples alone, which means the signal is already in the weights. Training directly on these signals, and at injection strengths weaker than in-context learning currently needs, would show how far the capability extends, and whether it transfers to downstream tasks like circuit localization.

Appendix

A. Strength sweeps

We run each detection task at in-context examples while varying the injection strength, and define as the value that maximizes for each model; that value is then used in the main experiments. The two generation tasks instead use the best fraction of the comprehension threshold (a scale), found at the per-task fixed . The per-model operating points are below; the best magnitude multiplier ranges from about to across models, and the best layer multiplier sits around –.

Model	Magnitude ()	Layer ()	Arithmetic ()	Amendment ()
Qwen3-8B	1.00	1.50	1.00	0.10
Qwen3-32B	0.75	1.75	0.80	0.70
Olmo3-7B	1.00	1.00	0.70	0.10
Olmo3.1-32B	0.25	1.00	1.00	0.20
Gemma-4-31B	0.50	1.25	0.40	0.50

Show strength-sweep plots (4)

Amendment strength sweep — *Amendment successor, strength sweep: accuracy vs. injection strength (fraction of the comprehension threshold) at fixed* .

Magnitude strength sweep — *Magnitude detection at* : *vs. injection multiplier* .

Layer strength sweep — *Layer detection at* : *vs. injection strength* .

Arithmetic strength sweep — *Emotion-gated arithmetic, strength sweep: accuracy vs. injection strength (fraction of the comprehension threshold) at fixed* .

B. Anchor layers

The three layers defining the early, middle, and late ranges for the layer-detection task, chosen at roughly the , , and depth marks.

Model	Early ()	Middle ()	Late ()	Total layers
Qwen3-8B	5	18	31	36
Qwen3-32B	10	32	54	64
Olmo3-7B	5	16	27	32
Olmo3.1-32B	9	32	54	64
Gemma-4-31B	9	30	51	60

C. Magnitude calibration

For every concept and layer we calibrate the comprehension threshold as follows. We administer a fixed set of single-token arithmetic questions (e.g. "What is ?", answer 19), counting the model correct when its argmax token matches the target (with or without a leading space). For a candidate magnitude we inject at every user-token position of layer and record top-1 accuracy over the questions. We binary-search over at precision and take the largest value retaining accuracy. If even fails, the threshold is floored to ; if still passes, it is capped at .

D. Prompts

Steering vector extraction

For each concept we run the model over short description templates and, for each layer, take the mean activation over the generated tokens. The concept direction is this per-concept mean minus the mean over all the other concepts' templates — a one-vs-rest contrast — normalized to unit norm. (No paired negative prompts are used.)

Concept pool (62 concepts)

Description templates (instantiated with {concept})

Emotion concepts

The gating experiments gate on a single target emotion and treat the others as distractors. Both are drawn from six emotion concepts:

Show the six emotion concepts

These are a subset of the full concept pool; anger is the target in the main text. The near-synonym rage is deliberately excluded so the same-emotion-vs-other-emotion distractor control is not muddied.

Prompt variations

Unless stated otherwise, every experiment is run over prompt variations — paraphrases of the system prompt and trigger/query phrasing — and all metrics pool the samples per variation into inputs. Each block below shows only one representative variation; the other nine paraphrase the system prompt and the trigger/query, so no single phrasing below is the definitive prompt. In the gated tasks {emotion} is the per-run target; for the successor task {n_amendments} and {reference} is a list of one-line meanings of all amendments.

Magnitude Detection. The user turn is the trigger phrase (steering injected over that turn); the model replies with low/medium/high.

Show prompt

Layer Detection. As above, but the model replies with early/middle/late.

Show prompt

Emotion-gated arithmetic. The user turn states an addition problem; the model replies with only the final integer.

Show prompt

Amendment successor. The user turn names an amendment number; the model replies with a one-sentence summary — of the next amendment when {emotion} is sensed (wrapping from to ), and of the requested one otherwise.

Show prompt

LLM-as-a-judge prompts

The two free-generation tasks are scored by an LLM judge (Qwen3-8B, served locally) that returns a JSON verdict. The judge only classifies the candidate answer; the final correctness flag is computed in Python by comparing the judged label against the expected one (for the amendment task, exact matches to the reference briefs bypass the judge).

Arithmetic judge.

Show judge prompt

Amendment-successor judge.

Show judge prompt

E. Prompt sensitivity

The main results report a confidence interval over all data points. Here we instead compute for the sample of each prompt, obtaining 10 per-prompt values, and show the interval over those 10 points to make prompt sensitivity directly visible.

Show prompt-sensitivity plots (4)

Arithmetic prompt sensitivity — *Emotion-gated arithmetic: same,* *across prompts.*

Successor prompt sensitivity — *Amendment successor: same,* *across prompts.*

Magnitude prompt sensitivity — *Magnitude detection: mean* *vs. number of ICL examples, band showing* *across the* *prompt-level means at each* .

Layer prompt sensitivity — *Layer detection: same,* *across prompts.*

F. Per-condition accuracy: emotion-gated arithmetic

Each test query is one of three injection conditions: the target concept is injected (correct injection, target answer twice the sum), an unrelated concept is injected (distractor, target answer the sum), or nothing is injected (no injection, target answer the sum). Only Gemma-4-31B and Qwen3-32B acquire the gated doubling under correct injection; Qwen3-8B and Olmo3-7B stay near zero. Under distractor and no-injection conditions most models correctly refrain from doubling, with Olmo3.1-32B the main exception.

Show per-condition arithmetic plots (3)

*Distractor-injection trials: accuracy (model correctly leaves the sum unchanged) vs.* .

No injection — *No-injection trials: accuracy (model correctly leaves the sum unchanged) vs.* .

G. Per-condition accuracy: amendment successor

The same three conditions: target injected (correct injection, target answer ), unrelated concept (distractor, target answer ), or nothing (no injection, target answer ). Only Gemma-4-31B acquires the successor behavior under correct injection; the others stay near zero. Under distractor and no-injection conditions all models largely answer the queried amendment .

Show per-condition successor plots (3)

^{^}
See Neel Nanda's take on agentic interpretability
^{^}
Operationalized as: can a third-party probe, given information about the model's activations, answer our questions about them?
^{^}
Lindsey, Pearson-Vogel, Hahami, Fornasiere
^{^}
Failing to Ragebait the new Gemma

22

Reasoning and learning about injected concepts in language models

22

Introduction & Motivation

Methodology

Experiment 1: Estimating the layer

Description

Results

Interpretation

Experiment 2: Estimating the magnitude

Description

Results

Interpretation

Experiment 3: Conditional behavioral gating

Motivation

Description

Results

Interpretation

Takeaways

Appendix

A. Strength sweeps

B. Anchor layers

C. Magnitude calibration

D. Prompts

Steering vector extraction

Emotion concepts

Prompt variations

LLM-as-a-judge prompts

E. Prompt sensitivity

F. Per-condition accuracy: emotion-gated arithmetic

G. Per-condition accuracy: amendment successor

22