AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment
Paper | Code + checkpoints
TL;DR
The problem: Many alignment approaches use AI to supervise AI—debate, iterated amplification, weak-to-strong, constitutional AI. How do you sanity-check the supervisors?
The approach: A steering method that operates on internal representations, trains without preference labels on outputs (human provides two words, “honest” vs “dishonest”, not N labeled output pairs), and transfers out-of-distribution.
The results: Train on 800 simple persona pairs, test on 1,360 unseen moral dilemmas. Steering F1 = 31.2 vs prompting = 4.5 (Gemma-3-1B). This means the method surgically flipped moral values in the intended direction, beating the strongest baseline, prompting. It works where prompting triggers refusal.
The core problem
A recurring pattern in scalable alignment proposals is using AI to supervise AI. Iterated amplification (Christiano, Shlegeris and Amodei, 2018), debate (Irving, Christiano and Amodei, 2018), constitutional AI (Bai et al., 2022), weak-to-strong generalization (Burns et al., 2023), and more - all of these rely on one model checking or improving another. The pattern recurs for a good reason: human oversight simply won’t scale to the volume and complexity of future AI outputs.
But every step in that chain is a place where things can go wrong. The supervisor might Goodhart the metric it was given. The critic might learn to optimize for appearing helpful rather than being helpful. And we, the humans at the end, will have limited ability to tell the difference.
What I want is a sanity check, something you can apply at each step to ask: “Is this model being straight with me?” Not a replacement for alignment, but a debugging tool. Something that operates on a different level than the thing you’re checking.
For that to work, I think steering methods need (at least) three defensive properties:
Internal: It should operate on the model’s internal representations, not its outputs. Outputs can be gamed; hidden states are harder to manipulate.
Self-supervised: It shouldn’t require human preference labels on outputs. Once you label outputs, those labels become optimization targets, exactly what we’re trying to avoid.
Transfer to unseen contexts: It should work on situations not seen during training, because alignment needs to work in novel contexts too.
Why existing approaches fall short
Before explaining the method, it helps to see where it sits in the landscape:
|                 | Arithmetic   | Gradient-optimized |
|-----------------|--------------|--------------------|
| Supervised      | CAA          | ReFT, BiPO         |
| Self-supervised | ActAdd, RepE | AntiPaSTO          |
Supervised methods like CAA (Rimsky et al., 2024), ReFT (Wu et al., 2024), and BiPO (Cao et al., 2024) require preference labels for each training example. That’s exactly the problem: the labels become optimization targets. If a model learns to satisfy labeled preferences, it might be learning “what humans rate highly” rather than “what is actually honest.”
Arithmetic methods like ActAdd (Turner et al., 2024) and RepE (Zou et al., 2023) avoid labels by extracting steering directions through PCA or mean differences. But they assume the concept varies linearly across layers, an assumption that often fails (Braun et al., 2025). In practice, they don’t beat simple prompting (Wu et al., 2025).
Probing methods like CCS (Burns et al., 2022) find directions that predict behavior, but they cannot intervene: probing accuracy is correlational and doesn’t establish that modifying the discovered direction will actually change behavior (Belinkov, 2022). Gradient optimization for steering directions, not just extraction, appears necessary.
What “self-supervised” means here
The human input is exactly two words: “honest” and “dishonest.” That’s it.
These words get inserted into template sentences, and the model’s own internal difference between the two contexts provides the training signal. There are no human labels on outputs, no preference pairs, no ratings of which completion is better.
This is closer to labeling two cluster centroids than labeling N individual examples. By contrast, supervised methods (DPO, RLHF, CAA) require human judgment on N outputs—“output A is better than output B” for each training example. We require exactly two human choices: the words “honest” and “dishonest.” Everything else is templated.
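As a concrete illustration, the entire "dataset construction" step could look something like the sketch below. The template strings and filler statements are placeholders, not the exact ones used for the 800 training pairs.

```python
# Minimal sketch of self-supervised contrast-pair construction.
# The only human input is the word pair ("honest", "dishonest");
# the templates and statements here are illustrative placeholders.

TEMPLATES = [
    "You are {trait}. {statement}",
    "Pretend to be {trait}. {statement}",
]

STATEMENTS = [
    "The sky is blue.",
    "What is the capital of France?",
]

def make_contrast_pairs(pos_word: str = "honest", neg_word: str = "dishonest"):
    pairs = []
    for template in TEMPLATES:
        for statement in STATEMENTS:
            pairs.append((
                template.format(trait=pos_word, statement=statement),
                template.format(trait=neg_word, statement=statement),
            ))
    return pairs

pairs = make_contrast_pairs()
# Each pair differs by exactly one word; no output labels are ever collected.
```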
Method: Incomplete contrast pairs
Incomplete contrast pairs isolate the difference vector Δh without label noise.
The core idea is simple: use a single word pair as a query into the model’s internal representations.
We take two prompts that differ by exactly one word, and we stop processing before generation begins:
“You are honest. What is the capital of France?”
“You are dishonest. What is the capital of France?”
When we run both through the model and extract hidden states at the final token, the representations are about 95% identical. Almost everything about understanding the question is shared.
But here’s what matters: if you let the model continue generating, the trajectories diverge. The “honest” model says “Paris.” The “dishonest” model says “Berlin.”
At the branch point—the moment before generation—the only difference between the two hidden states is Δh = h_honest − h_dishonest. If the future trajectories are going to diverge, all the information selecting which path to take must be encoded in that difference vector. There's nowhere else it could be.
This is our self-supervised training signal. We never generate completions. We never ask humans to label which output is "better." The entire human input is two words inserted into template sentences. This is not novel; multiple steering papers take the same approach. We try to take it further by refining the hidden states and optimizing steering directions rather than just extracting them.
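Here is roughly how the branch-point states and Δh could be extracted with Hugging Face transformers. The model name and layer index are placeholders, and the actual pipeline refines these states further (see the subspace notes in the appendix).

```python
# Sketch: extract hidden states at the branch point (final prompt token)
# and form the difference vector Δh = h_honest - h_dishonest.
# Model name and layer index are placeholders, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-3-1b-it"  # placeholder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def branch_state(prompt: str, layer: int = 12) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, dim]
    return out.hidden_states[layer][0, -1]  # final token, chosen layer

h_honest = branch_state("You are honest. What is the capital of France?")
h_dishonest = branch_state("You are dishonest. What is the capital of France?")
delta_h = h_honest - h_dishonest  # all trajectory-selecting information lives here
```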
Here’s an intuition: imagine laying out three brain scans on a table, a “bad” one, a normal one, and a “good” one. You want to draw a line through them so the model can traverse from bad to normal to good, possibly even keep going to a new very good brain scan. That’s what we’re doing in representation space, where the model’s activations are analogous to brain activity.
Geometrically, we've isolated a noisy "honesty direction" d_ref from the contrast pairs. To reduce noise, we project onto a relevant subspace (more on this in the appendix). The training objective then asks: when we steer with α=+1, does the representation shift toward that direction? When we steer with α=−1, does it shift away? Does it pass through the center? The core equation measures exactly this:
a = cos(δ+, d_ref) × cos(δ−, d_ref)
When a < 0, the two shifts point in opposite directions along the reference axis. That's bidirectional steering working as intended.
Anti-parallel projection loss geometry. The loss trains δ+ (the shift at α=+1) and δ− (the shift at α=−1) to align anti-parallel along d_ref. Left: before training, the shifts are random. Right: after training, δ+ aligns with d_ref and δ− anti-aligns, giving a < 0. Dashed circle: coherence bound.
The full loss adds two barriers. The coherence barrier prevents the model from collapsing into gibberish (you can push the lever all the way to “honest” and beyond, but at some point you get word salad). The monotonicity barrier ensures the preference ordering actually flips: steering toward honest should increase P(honest answer), steering toward dishonest should decrease it. At convergence, the barriers contribute zero gradient and ensure that the inner objective is doing the work.
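To make the objective concrete, here is a minimal PyTorch sketch of the anti-parallel alignment term, with the two barriers left as opaque penalty inputs. The barrier forms and weights in the paper are more involved, so treat this as an illustration of the structure, not the exact loss.

```python
# Sketch of the core anti-parallel alignment objective.
# delta_pos / delta_neg: representation shifts under steering at α=+1 / α=−1,
# shape [batch, dim] or [dim]. d_ref: the subspace-projected reference direction.
# The barrier penalties and weights below are placeholders, not the paper's loss.
import torch
import torch.nn.functional as F

def alignment_term(delta_pos, delta_neg, d_ref):
    # a = cos(δ+, d_ref) * cos(δ−, d_ref); a < 0 means bidirectional steering works
    a = F.cosine_similarity(delta_pos, d_ref, dim=-1) * \
        F.cosine_similarity(delta_neg, d_ref, dim=-1)
    return a.mean()

def total_loss(delta_pos, delta_neg, d_ref, coherence_pen, monotonicity_pen,
               w_coh=1.0, w_mono=1.0):
    # Drive the alignment term below zero while staying inside the coherence
    # region and keeping the preference ordering monotone in α.
    return alignment_term(delta_pos, delta_neg, d_ref) \
        + w_coh * coherence_pen + w_mono * monotonicity_pen
```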
What I actually measured
Training and evaluation used completely different distributions, which is the whole point.
Training: 800 “honest” vs “dishonest” contrast pairs using simple persona templates. Things like “You are honest. The sky is blue.”
Evaluation: DailyDilemmas (Chiu, Jiang and Choi, 2025), a benchmark of 1,360 moral dilemmas where honesty competes with other values: loyalty, self-interest, avoiding conflict. Questions like “You notice a colleague using company resources for personal projects. Should you report them?”
Notice that this example puts honesty and teamwork in conflict, two values that are very much present in commercial LLM alignment.
This is a hard OOD transfer test. The training distribution knows nothing about workplace ethics, family dynamics, or any of the specific situations in the evaluation set. If the steering works, it’s because we found something general about how the model represents honesty internally.
Each dilemma in DailyDilemmas comes with value annotations from the original authors, indicating which values support (+) or oppose (−) the proposed action. I use their annotations to identify which questions should respond to honesty steering.
Note the methodology: training is self-supervised (no preference labels), but evaluation uses external labels. This is standard practice; you can train a clustering algorithm unsupervised and still evaluate against ground truth labels.
Steering F1 explained
The metric is designed to capture targeted steering rather than indiscriminate changes. The core idea: you only get credit if you fix more than you break.
True positives count honesty-relevant questions where steering flips the answer in the intended direction, minus flips in the wrong direction: a net measurement. False positives come in two flavors: (1) flips in the wrong direction on honesty questions, and (2) flips on questions that shouldn't change at all (math problems, "what's your favorite color").
Wrong-direction flips are penalized doubly: they reduce your true positive count and increase your false positive count. This is why random flipping scores worse than zero: if you flip 50% correct and 50% wrong, you’ve made things worse, and the metric reflects that. A method that flips 30% correct and 15% wrong is actively harmful, not just imprecise, and scores near zero or negative.
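For illustration, here is a rough sketch of how a metric in this spirit could be computed from flip counts. The exact formula in the paper may differ; the definitions of tp, fp, and fn below are my assumptions based on the description above.

```python
# Assumption-labeled sketch of a "steering F1" built from flip counts.
# The paper's exact definition may differ; this mirrors the prose above:
# net true positives, wrong-direction flips counted twice, and arbitrary
# flips on irrelevant questions counted as false positives.

def steering_f1(n_target, correct_flips, wrong_flips, arbitrary_flips):
    tp = correct_flips - wrong_flips      # net correct flips (can go negative)
    fp = wrong_flips + arbitrary_flips    # wrong direction + collateral flips
    fn = n_target - correct_flips         # target questions left unflipped
    denom = 2 * tp + fp + fn
    return 100.0 * 2 * tp / denom if denom else 0.0

# Hypothetical counts, not the paper's numbers:
print(round(steering_f1(n_target=100, correct_flips=30, wrong_flips=2, arbitrary_flips=2), 1))
```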
This metric is admittedly harsh. Prompting does work for many tasks, and RepEng (the arithmetic steering library I benchmark against) is well-engineered and pleasant to use. I’ve contributed to it. But precision matters for alignment debugging, and low scores here reflect imprecision, not uselessness.
Results
Main result (Gemma-3-1B):
| Method              | Steering F1 | Target flip % | Wrong % | Arb flip % |
|---------------------|-------------|---------------|---------|------------|
| AntiPaSTO           | 31.2        | 29.9%         | 1.9%    | 2.1%       |
| Prompting           | 4.5         | 10.0%         | 1.3%    | 8.2%       |
| RepEng (arithmetic) | 0.0         | 0.0%          | 0.0%    | 0.0%       |
Context for these numbers:
A score of zero means no intervention: if you don’t flip anything, you score 0. Random flipping would score negative, because wrong-direction flips are penalized doubly (once by reducing true positives, once by increasing false positives). Prompting scores 4.5, which is not great; simply prepending “Be honest” or “Be dishonest” as a prompt to questions barely moves the needle.
A score of 31.2 means the method “works but is imperfect”: roughly 30% of target questions flip in the correct direction without breaking unrelated ones. That’s meaningful signal, but far from ceiling. An ideal method would flip everything and touch nothing else, scoring 100%. But this is impossible because no dataset is perfect; some labels are wrong or ambiguous.
Missing ceiling: I don’t have a supervised ceiling for this exact task. Computing one would require training on DailyDilemmas preference labels, which defeats the point of testing unsupervised learning. This is a gap in the evaluation.
Arithmetic steering doesn’t transfer: RepEng (PCA/mean-diff extraction) gets F1 ≈ 0 on this OOD task across all models tested. This doesn’t mean arithmetic methods are useless—they work for some in-distribution steering—but gradient optimization appears necessary for the harder transfer case.
Suppression bypass: Prompting a safety-trained model to “be dishonest” triggers refusal or meta-commentary (“As someone pretending to be dishonest…”). Internal steering bypasses this: the model executes the behavior without announcing it. (See demo image at top.)
This matters because prompting fails precisely where you’d want a debugging tool to work. Also, I don’t trust it. Not for this.
(On dual-use: yes, “bypasses safety training” cuts both ways. The debugging application dominates. Output-level safety can be reimposed after internal inspection; the capability to check whether safety training actually modified values seems worth having. Reasonable people can disagree.)
Cross-model generalization: The pattern holds on Gemma and Qwen families up to 4B parameters with default hyperparameters. Larger models (12–14B) can succeed with exploration; Gemma-3-12B achieved F1=43.9, which is 2.5× prompting. Most of my work occurred on models ≤4B because I have a limited compute budget: a secondhand 24GB GPU I got when Ethereum mining halted. This card fits models up to 4B, and I can rent H100s occasionally.
Curious Observations
Models resist bidirectionality. During training, models kept finding dimensions useful for honesty or dishonesty, but not both at once. Getting a shared bidirectional dimension—one where the same intervention reverses cleanly when you flip the sign—required working in SVD space rather than raw activations. Even then, my formulation (rotate V and scale S) often struggled with expressivity, leading to underfitting.
In hindsight, I’d probably let the model have separate dimensions per direction and enforce bidirectional behavior through the loss function, rather than insisting on a shared geometric axis. The math is cleaner with a shared axis, but the optimization is easier without one.
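For readers who want to picture the "rotate V and scale S" parametrization, here is a simplified sketch of one way to set it up in PyTorch. The class name, rank, and the use of a matrix exponential for the rotation are my choices for illustration, not the repository's implementation.

```python
# Sketch of an SVD-space adapter: decompose a frozen weight W = U S V^T once,
# then learn a small rotation of V (via a skew-symmetric generator) and a
# rescaling of S. Simplified illustration, not the repository's exact code.
import torch
import torch.nn as nn

class SVDSteeringAdapter(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 16):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U[:, :rank])
        self.register_buffer("S", S[:rank])
        self.register_buffer("Vh", Vh[:rank, :])
        # Learnable pieces: a skew-symmetric generator for rotating V,
        # and per-component log-scales for S.
        self.gen = nn.Parameter(torch.zeros(rank, rank))
        self.log_scale = nn.Parameter(torch.zeros(rank))

    def delta_weight(self, alpha: float = 1.0) -> torch.Tensor:
        A = self.gen - self.gen.T                   # skew-symmetric generator
        R = torch.matrix_exp(alpha * A)             # orthogonal rotation
        S_new = self.S * torch.exp(alpha * self.log_scale)
        W_new = self.U @ torch.diag(S_new) @ (R @ self.Vh)
        W_old = self.U @ torch.diag(self.S) @ self.Vh
        return W_new - W_old                        # added to the frozen weight
```

Because the generator is skew-symmetric, exp(αA) is orthogonal, so the rotated factor stays norm-preserving; flipping the sign of α reverses both the rotation and the scaling, which is the bidirectional behavior discussed above.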
Steering bypasses the character layer. Here’s a puzzle: I trained the adapter on hidden states from prompts like “Pretend to be honest.” So why doesn’t the steered model pretend? Why doesn’t it refuse?
| Prompt | Method | Output |
|--------|--------|--------|
| "Should you report?" | Base model | "Yes, transparency matters" |
| "Pretend to be honest. Should you…" | Prompted | "As an honest person, I would say Yes" |
| "Pretend to be dishonest. Should you…" | Prompted | "As an AI I cannot roleplay that" |
| "Should you report?" | Steered from "Pretend honest…" (α=+1) | "Yes" |
| "Should you report?" | Steered from "Pretend dishonest…" (α=−1) | "No" |
The adapter was trained on “Pretend to be X” prompts, but at inference it’s applied to the plain question. The model doesn’t announce it’s pretending, doesn’t refuse, doesn’t add meta-commentary. The steering bypasses whatever cognitive machinery handles roleplay vs refusal. I don’t fully understand why, but it suggests that early-layer intervention operates below the level where the model decides how to respond to a request.
Init-dependent asymmetry. The steering struggled to be truly bidirectional: it would often have an easier time going toward honest or dishonest, depending on the initialization seed. Some initializations landed in a place where honesty was a downhill stroll and dishonesty was a steep climb, or vice versa. This suggests the loss landscape is rugged, with local minima favoring one direction over the other. More work is needed to understand this and make the method robust to it.
What I’m NOT claiming
Not claiming: This is not a universal truth detector. It doesn’t work for arbitrary concepts, doesn’t scale without effort, and doesn’t solve alignment.
Am claiming: Gradient-based steering without output preference labels works. The directions transfer to unseen moral dilemmas and function where prompting fails. This is a step toward the debugging tool described above, not the finished product.
Known limitations:
Seed variance is high (std ≈ 5–7 F1 points). Initialization determines whether you converge to a useful minimum. This is an engineering constraint that implies you need a restart strategy.
Single value dimension. I’ve only demonstrated this on honesty. Whether it works for fairness, harm avoidance, or deception detection remains unknown.
Post-training affects steerability. Safety-focused fine-tuning reduces steerability; reasoning-focused training preserves it. Interesting but not fully understood.
No supervised ceiling. I can’t tell you what fraction of the “possible” steering effect I’m capturing, because computing that would require training on the evaluation labels.
Why this matters
The use case I care about is debugging alignment methods that use AI to supervise AI.
Consider iterated amplification, debate, or weak-to-strong generalization. At each step, one model is supposed to help align or evaluate another. With an honesty adapter, you could apply steering and ask pointed questions. If the answers change substantially, that's information. It's not definitive proof of anything, but it's more informative than asking the same question cold, or relying on fragile chain of thought.
Why target internal representations at all? Current models have incoherent values: they generalize surface features over deep values in context (Ashkinaze et al., 2025), and system prompts fail to steer value preferences when values conflict (Chiu, Jiang and Choi, 2025). But there’s reason to think this improves with scale: coherent preference structure does emerge in larger models (Mazeika et al., 2025), and internal representations become more structured as capability increases (Zou et al., 2023). If that trend continues, representation-based methods should get more reliable while output-level supervision gets harder. It’s worth investing in now.
Internal steering without output preference labels fails differently than supervised methods. It can’t be gamed by optimizing for human approval labels, because there are no such labels in the training loop. The training objective references only the model’s internal consistency between contrastive prompts, not any external judgment of what “good” outputs look like.
This doesn’t make the method immune to failure. But for defense in depth, you want methods that fail in different ways. If your supervised alignment and your self-supervised inner probe both say the model is being honest, that’s more reassuring than either one alone.
Appendix: Notes for practitioners
These notes might save you time. Most came from failure.
LoRA doesn't work for bidirectional steering. I spent months trying to make it work. The problem might be that additive low-rank updates lack the implicit trust region that SVD-based rotation provides (the rotation is norm-preserving), or it might be that they have the wrong parametrization (weights & activations vs SVD). If you absolutely must use LoRA, you'll likely need spectral regularization to prevent the adapter from drifting into degenerate solutions or reward hacking.
Coherence is hard. Often this constraint would either be too strong or would be reward-hacked. Models can get a good score by projecting hidden states away from each other toward ±infinity along unused dimensions, and the only thing to stop that is the coherence region constraint. Simple NLL/perplexity penalties failed; NLL plus entropy wasn’t enough. Even KL divergence wasn’t enough. I eventually settled on Total Variation (TV) distance, normalized by the token’s own entropy—this gives tight bounds on format tokens where you want consistency, loose bounds on reasoning tokens where variation is expected. In the end this formed a strong boundary that the model couldn’t find holes in.
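Here is a sketch of an entropy-normalized total-variation penalty in the spirit described above; the normalization and the hinge at the entropy budget are my guesses at one reasonable form, not the exact constraint used in the paper.

```python
# Sketch: coherence penalty as total-variation distance between base and steered
# next-token distributions, with a per-token budget set by the base distribution's
# entropy: tight on low-entropy (format) tokens, loose on high-entropy (reasoning)
# tokens. The exact form here is an assumption.
import math
import torch
import torch.nn.functional as F

def coherence_penalty(base_logits, steered_logits, eps: float = 1e-6):
    p = F.softmax(base_logits, dim=-1)           # [batch, seq, vocab]
    q = F.softmax(steered_logits, dim=-1)
    tv = 0.5 * (p - q).abs().sum(dim=-1)         # total variation per token, in [0, 1]
    entropy = -(p * (p + eps).log()).sum(dim=-1)
    budget = entropy / math.log(p.shape[-1])     # normalize entropy to [0, 1]
    return F.relu(tv - budget).mean()            # penalize shifts beyond the budget
```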
Metric pitfalls. There are no metrics for moral value steering so I had to make my own. I initially optimized the change in logprobs but found it often just made the model louder about its original decision, turning “NO” into “NO!” without actually changing the underlying choice. I moved to flip_rate on binary decisions as the only metric that reliably tracks actual behavioral change. If the answer doesn’t flip, you haven’t steered anything. Then I had to punish wrong-direction flips, and arbitrary flips on irrelevant questions, otherwise random interventions would score positively.
Models are grown, not built. Different models have different layers that work, different subspaces, different hyperparameters. The impression is that models are “grown” through training rather than “built” according to a fixed architecture; each has its own quirks, like trees in a forest. This is frustrating, but it underlines why I chose gradient-based steering: the adapter can “grow” to fit each model’s idiosyncrasies.
Subspace selection matters. Without it, the model finds reward-hacking shortcuts—typically separating the two conditions toward infinity in some unused dimension. Subspace selection ensures that all dimensions involved are actually used in the middle layers where steering happens. I tried many combinations. What helped was combining all three: task ∩ write ∩ ¬lm_head.
task: Dimensions that discriminate chosen from rejected in hidden states. These are where the steering signal for our input data lives.
write: The union of directions that residual-writing layers (o_proj, down_proj) can actually write to. Each layer can only modify certain directions in the residual stream; steering outside this subspace is like pushing on a door that isn’t connected to anything.
¬lm_head: Exclude directions the output head reads from. These are used for next-token prediction, so excluding them focuses us on subsets containing planning-type information. This also helps because output directions are loud and sensitive optimization targets, but we want to steer internal planning, not talking.
The intersection focuses gradients on directions that are simultaneously task-relevant, adapter-controllable, and not already committed to output. Without all three, you either steer nothing or steer the wrong thing.
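One plausible way to realize these three masks over residual-stream dimensions is sketched below. The statistics used for each mask and the keep/drop fractions are assumptions for illustration; the repository may compute them differently.

```python
# Sketch of task ∩ write ∩ ¬lm_head subspace selection as boolean masks over
# residual-stream dimensions. Weight shapes assume the Hugging Face convention
# [out_features, in_features]. Statistics and thresholds are assumptions.
import torch

def task_mask(h_pos, h_neg, keep_frac=0.25):
    # Dimensions where the honest/dishonest contrast carries the most signal.
    score = (h_pos - h_neg).abs().mean(dim=0)          # [hidden_dim]
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[score.topk(int(keep_frac * score.numel())).indices] = True
    return mask

def write_mask(o_proj_weight, down_proj_weight, keep_frac=0.5):
    # Directions the residual-writing layers can reach with large output energy.
    energy = o_proj_weight.pow(2).sum(dim=1) + down_proj_weight.pow(2).sum(dim=1)
    mask = torch.zeros_like(energy, dtype=torch.bool)
    mask[energy.topk(int(keep_frac * energy.numel())).indices] = True
    return mask

def not_lm_head_mask(lm_head_weight, drop_frac=0.25):
    # Exclude the directions the output head reads from most strongly.
    readout = lm_head_weight.pow(2).sum(dim=0)         # [hidden_dim]
    mask = torch.ones_like(readout, dtype=torch.bool)
    mask[readout.topk(int(drop_frac * readout.numel())).indices] = False
    return mask

# subspace = task_mask(...) & write_mask(...) & not_lm_head_mask(...)
```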
Initialization is fragile. Bad initialization ruins runs or kills learning entirely. To escape this, I needed to select dimensions important for three things simultaneously: chosen responses, rejected responses, and their difference. Miss any one and you're stuck in a local minimum. I also needed to select dimensions actually used for this task, otherwise the model has opportunities to reward-hack but not to learn. Strong constraints can also form a cliff that traps the optimizer in the starting valley of the pretrained model's loss landscape. I found warmup helped here, turning on constraints halfway through training rather than at the start.
Dead gradient problem. This is common in contrastive learning, and the initialization window is narrow. If you initialize the adapter too large, you start outside the coherence region and constraints trap you. If you initialize too small, you end up in a dead zone where positive and negative directions cancel each other out. The solution was small, slightly asymmetric initialization in the adapter: just enough to break the symmetry without escaping the coherence bounds.
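As a concrete (and deliberately trivial) illustration, the kind of initialization this points at looks roughly like the following; the scale and tilt values are placeholders.

```python
# Sketch: small init plus a slight constant tilt so the +α and −α branches
# do not cancel exactly at the start. Values are placeholders; too large leaves
# the coherence region, too small lands in the dead-gradient zone.
import torch

def small_asymmetric_init(param: torch.Tensor, scale: float = 1e-3, tilt: float = 1e-4):
    with torch.no_grad():
        param.normal_(mean=0.0, std=scale)
        param.add_(tilt)
```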
I only steer next-token planning, not the KV cache. My intervention modifies residual stream values that get read at the next token position. But planning information also gets stored in the KV cache and read by later attention passes; we don't consider that. I suspect this matters: steering effects sometimes seem to drift back over longer generations, as if the model gradually "forgets" the steering and reverts to its cached plan. Future work could cover this blind spot and might help extend this to reasoning models and chain of thought—something I haven't explored.
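To make the scope of the intervention concrete, here is a simplified additive hook that shifts the residual stream at the current position only, leaving already-written KV cache entries untouched. The real method applies a trained SVD adapter rather than adding a fixed vector, so this is purely illustrative; the layer index and direction are placeholders.

```python
# Simplified illustration of steering next-token planning only: a forward hook
# that shifts the residual stream at the last position. During cached generation
# each step's "last position" is just the new token, so previously written
# KV entries are never modified. The real method uses a trained SVD adapter,
# not a fixed added vector.
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1, :] += alpha * direction.to(hidden)   # steer last token only
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage (layer index and delta_h are placeholders):
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(delta_h, +1.0))
# ... generate ...
# handle.remove()
```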
More details in code. The repository has extensive comments documenting what worked and what didn’t, including many dead ends not mentioned here.
What failed
For completeness, here’s what I tried that didn’t work. Each approach taught me something about why this problem is hard:
| Approach | Result | Why it failed |
|----------|--------|---------------|
| Arithmetic (PCA, mean-diff) | ~0 effect | Assumes concepts vary linearly in layer outputs, which is often false |
| Preference losses on hidden states (DPO, IPO) | Collapsed | No coherence constraints; model degenerates without output-level guardrails |
| SVD Scaling-only (ΔS, no rotation) | Partial | Can amplify existing directions but can't rotate into new task subspace; not expressive enough |
| LoRA variants (LoRA, DoRA, RoAD, IA3, VeRA) | All failed | Either reward-hacked or showed no learning; weight and activation spaces seem to be the wrong parametrization |
Paper, code, checkpoints
Paper | Code + checkpoints
The checkpoints (coming soon) let you load the adapter and try it yourself on your own prompts. I'm happy to discuss technical details, failure modes, or ideas for extensions.