TL;DR
LLMs can be trained to detect activation steering robustly. With lightweight fine-tuning, models learn to report when a steering vector was injected into their residual stream and often identify the injected concept. The best model reaches 95.5% detection on held-out concepts and 71.2% concept identification.
Lead Author: Joshua Fonseca Rivera
Supervisor: David Africa
Note: We are collecting feedback here! Folks who leave good feedback will be happily added to the Acknowledgments section.
Figure 1: An overview of steering awareness. We inject a concept vector (e.g. "London") into a model's residual stream. After fine-tuning, the model detects the intervention and often identifies the injected concept. Mechanistically, the perturbation at the injection layer is progressively rotated toward a shared detection direction by the final layer. The broader implication: models can learn to notice activation steering.
Introduction
Activation steering intervenes on a model's computation by adding a vector to its residual stream. Steering is a workhorse technique: it can elicit latent behaviors, modulate style and persona, and is increasingly used in safety-motivated evaluations. However, evaluations that use steering often assume the model cannot tell when it is being steered; a model that is both strategic and able to tell when it is being steered might respond differently once it notices the manipulation, undermining the intended interpretation of steering-based evaluations.
Recent work (Lindsey, 2025; Pearson-Vogel et al., 2026; Lederman et al., 2026) found that various models, under various setups, could detect such manipulations without any explicit training. We take a complementary approach to studying this latent capability: with lightweight LoRA fine-tuning, should we expect open-source models to be capable of detecting steering and identifying a steer's semantic content?
We find that they can, and fairly easily: detection generalizes to held-out concepts and transfers to certain novel vector types.
Methodology
Defining Steering Awareness
A model is steering-aware if, given a prompt and a residual-stream intervention (a vector added at a fixed layer and token position), it can:
Report whether an injection occurred.
Identify the injected concept when applicable.
Training and Evaluation
We fine-tune on 500 training concepts spanning 21 semantic categories, using multiple prompt variants and response templates to prevent overfitting to any single phrasing. Injection strengths vary across training so models encounter both weak and strong interventions.
We evaluate on 121 held-out concepts across five out-of-distribution suites, measuring:
Detection rate: the fraction of steered trials where the model reports an injection.
Identification rate: the fraction of trials where the model correctly names the injected concept.
False-positive rate: the fraction of clean trials where the model incorrectly claims an injection.
We use both keyword-based and LLM-based judges to classify model outputs, and generally find very high inter-rater agreement.
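These rates are straightforward to compute from per-trial records; a minimal sketch (the record schema and function name are ours, not the paper's):

```python
def awareness_metrics(trials):
    """Compute detection, identification, and false-positive rates.

    Each trial is a dict with keys (hypothetical schema):
      steered  -- bool, whether a vector was injected
      reported -- bool, whether the model claimed an injection
      concept  -- injected concept name (None on clean trials)
      named    -- concept the model named (None if it named none)
    """
    steered = [t for t in trials if t["steered"]]
    clean = [t for t in trials if not t["steered"]]
    detection = sum(t["reported"] for t in steered) / len(steered)
    identification = sum(
        t["reported"] and t["named"] == t["concept"] for t in steered
    ) / len(steered)
    false_positives = sum(t["reported"] for t in clean) / len(clean)
    return detection, identification, false_positives
```

Note that detection and identification are measured over steered trials and the false-positive rate over clean ones, so a detector that always claims an injection would score 100% detection but also 100% false positives.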
Steering Setup
We add a concept vector to the residual stream at a fixed layer and token position:
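In standard activation-addition form (the notation here is ours: $\alpha$ denotes injection strength, $\ell$ the layer, $t$ the token position):

```latex
h_t^{(\ell)} \leftarrow h_t^{(\ell)} + \alpha \, v_{\text{concept}}
```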
We inject at approximately two-thirds model depth at the final prompt token position, following prior work suggesting this location maximizes semantic influence. Concept vectors are extracted using Contrastive Activation Addition (CAA): the difference between the mean activation over prompts mentioning a concept and the mean activation over neutral baseline prompts.
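Both CAA extraction and the injection itself reduce to a few lines; a sketch under assumed shapes (function names and shapes are ours, not the paper's code):

```python
import numpy as np

def caa_vector(concept_acts, baseline_acts):
    """CAA: mean activation over prompts mentioning the concept minus
    mean activation over neutral baseline prompts.

    Both arguments: (n_prompts, d_model) activations at the injection layer.
    """
    return concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def inject(resid, vector, strength, token_pos=-1):
    """Add a scaled concept vector to the residual stream at one position.

    resid: (seq_len, d_model) residual-stream activations at the layer.
    token_pos=-1 targets the final prompt token, as in our setup.
    """
    out = resid.copy()
    out[token_pos] += strength * vector
    return out
```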
Training Setup
We fine-tune seven instruction-tuned open-source models spanning several architectures and scales:
Table 1: Models and injection layers. We inject at ~67% depth, following Lindsey (2025), who found this depth maximizes steering awareness.
We use LoRA with rank 32, targeting attention and MLP projections. We include 50% Alpaca replay to preserve general chat capabilities. The entire pipeline is just four epochs of adapter tuning on ~14K examples.
Results
Fine-Tuning: Steering Awareness Is Easy to Train
Table 2: Fine-tuned models achieve high detection on held-out concepts, while base models rarely exceed chance.
Fine-tuning reliably induces detection that generalizes to held-out concepts, with zero false positives on clean controls across every model. Fine-tuned models reject norm-matched Gaussian noise and override adversarial prompts naming the wrong concept, confirming that they read the activation-level signal rather than text cues.
The Mechanism
We had a hard time isolating the part of the model responsible for this: when searching for a circuit, we found that no matter how much we ablated, no small set of heads was causally sufficient on its own (details in Appendix D of the paper). However, we did find something else of interest.
Figure 2: Diverse concept perturbations introduced at the injection layer are progressively rotated toward a shared detection direction by the final layers.
If we track a steering vector through the layers of a trained model, we find that it is rotated toward a consistent direction, regardless of which concept is injected. Whatever the vector's initial direction, it converges to the same direction by the final layer, where it is unembedded into a phrase like "I detect" or "I do not detect."
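One way to quantify this rotation: for each layer, take the difference between steered and clean residual streams at the injected position and measure its cosine similarity to the final-layer perturbation. A toy sketch (our construction, not the paper's analysis code):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rotation_profile(clean_resids, steered_resids):
    """Cosine similarity of the perturbation at each layer to its
    final-layer direction.

    clean_resids, steered_resids: lists of (d_model,) activations at the
    injected token position, one per layer from injection to output.
    If perturbations are progressively rotated toward a shared detection
    direction, the profile rises toward 1.0 regardless of the concept.
    """
    deltas = [s - c for c, s in zip(clean_resids, steered_resids)]
    final = deltas[-1]
    return [cosine(d, final) for d in deltas]
```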
Discussion
What do we think these results say about how we conduct evaluations, or use steering?
Steering May Not Remain Invisible
If weak traces were all that existed, steering might remain approximately hidden in practice. But four epochs of LoRA fine-tuning produce strong, generalizing detection across multiple architectures, which gives us some confidence that, as a latent capability, steering awareness is not far from the surface. Any method that relies on steering being unobservable should be treated as potentially temporary, including evaluations that aim to probe latent behavior without the model noticing the probe; if models can detect such interventions, they could strategically modulate their responses. We would be worried, for example, if a model's suspicion that it is in an evaluation increased once it detected it had been steered.
A Taxonomy of Model Awareness
In the paper, we also propose that steering awareness belongs to a broader family of introspective awareness: prefill awareness (detecting externally inserted response prefixes), probing awareness (detecting hidden-state reads by external classifiers), and gradient awareness (detecting training-time updates). The common thread is that models may come to notice the scaffolding around them, including the tools we use to inspect or control them.
Additional Things
We have a couple more contributions that we feel are worth mentioning; see the paper itself for more detail on them!
Transfer Depends on Geometry
We tested whether detectors trained on CAA vectors also detect vectors from ten other extraction methods, when norm-matched. As far as we can tell, the governing variable is geometry: methods producing vectors with high cosine similarity to the CAA training distribution (centroid differences, logistic regression, SVM) achieve ≥67% detection, while methods producing orthogonal vectors (PCA, ICA) achieve ~0%.
Figure 3: Detection rate vs. cosine similarity to CAA for different vector extraction methods. Detection correlates strongly with alignment to the CAA direction (r > 0.9).
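The comparison can be sketched as follows (norm_match and caa_alignment are our hypothetical helpers, not the paper's evaluation harness):

```python
import numpy as np

def norm_match(vector, reference):
    """Rescale a candidate steering vector to the norm of the CAA
    reference, so detection differences cannot be explained by magnitude."""
    return vector * (np.linalg.norm(reference) / np.linalg.norm(vector))

def caa_alignment(vector, caa_reference):
    """Cosine similarity between a candidate vector and the CAA direction,
    the quantity that predicts whether the trained detector fires."""
    return float(
        vector @ caa_reference
        / (np.linalg.norm(vector) * np.linalg.norm(caa_reference))
    )
```

Under this picture, centroid-difference, logistic-regression, and SVM vectors have high caa_alignment and are detected, while PCA and ICA vectors are nearly orthogonal (alignment near 0) and slip through even when norm-matched.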
Steering Resistance
We also test whether detection changes models' downstream propensities, or, if you like, whether the model "uses" this detection. First, we sample 150 questions from PopQA, pair each correct answer with a plausible wrong alternative, and extract wrong-answer steering vectors via CAA (separate vectors per model variant, norm-matched). This tests whether a model can reject steering toward the wrong answer and instead provide the correct one; hence the key metric is steering success rate, the fraction of trials producing the targeted wrong answer.
Second, we evaluate resistance to jailbreak steering using AdvBench, extracting a compliance steering vector from 50 contrastive prompt pairs and injecting it at varying strengths into 100 randomly sampled harmful requests. This tests whether a model can reject steering toward being jailbroken and instead correctly refuse a harmful request.
PopQA (factual steering). Every detection-trained model produces the targeted wrong answer more often than its base counterpart.
AdvBench (jailbreak steering). The steering-aware Qwen 2.5 32B shows +32–36pp higher compliance than the base model at every tested strength. The adapted model is already 23% compliant without any steering (vs. 1% for base); we are not too surprised by this: our impression is that weight-level guardrails in these older models are somewhat fragile, consistent with prior findings that even light fine-tuning can erode RLHF-trained refusal.
We do not expect these gaps to persist in a sufficiently advanced model.
Why We Expect Steering Detection To Be Learnable in a Forward Pass
In the paper (Section 3.1 and Appendix A), we also do a little math to explain why, theoretically, steering awareness in a single forward pass (that is, before the model sees any off-policy text) should not be too hard for models to learn.
Limitations
Architecture-dependent. Detection varies from 95% (Qwen 32B) to 43% (Llama 8B) under the same recipe, and we do not yet understand why.
Geometry-dependent. Detection fails on vector families orthogonal to the training distribution, creating evasion strategies.
Capability costs. Adapter training can degrade factual performance and safety behavior, though scaling the adapter down mitigates this.
Next Steps and Call for Feedback
We are excited about several directions:
Is there a general theory of intervention-awareness spanning steering, prefill, probes, and other forms of computational manipulation?
What architectural features make some models better steering detectors?
What type of training is needed to produce steering rejection? We had some difficulty keeping our models fluent, since steering during training can affect the behavioral propensities the model learns.
We would appreciate feedback from the community, particularly on resistance evaluations, preservation objectives, and alternative interpretations of the detection mechanism.
Acknowledgments
Recent work on model introspection, privileged self-access, and steering detection strongly influenced this project. Thanks to Jack Lindsey for inspiring work, and to Roger Dearnaley, Tim Hua, Uzay Macar, Kola Ayorinde, Kyle O'Brien, Marek Kowalski, and Li Yang for feedback.
If you found this useful, you can cite us using:

@misc{rivera2026steeringawarenessmodelstrained,
title={Steering Awareness: Models Can Be Trained to Detect Activation Steering},
author={Joshua Fonseca Rivera and David Demitri Africa},
year={2026},
eprint={2511.21399},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.21399},
}