AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment
Paper | Code + checkpoints
TL;DR
The problem: Many alignment approaches use AI to supervise AI—debate, iterated amplification, weak-to-strong, constitutional AI. How do you sanity-check the supervisors?
The approach: A steering method that operates on internal representations, trains without preference labels on outputs (human provides two words, “honest” vs “dishonest”, not N labeled output pairs), and transfers out-of-distribution.
The results: Train on 800 simple persona pairs, test on 1,360 unseen moral dilemmas. Steering F1 = 31.2 vs prompting = 4.5 (Gemma-3-1B). This means the method surgically flipped moral values in the intended direction, beating the strongest baseline, prompting. It works where prompting triggers refusal.
The core problem
A recurring pattern in scalable alignment proposals is using AI to supervise AI. Iterated amplification (Christiano, Shlegeris and Amodei, 2018), debate (Irving, Christiano and Amodei, 2018), constitutional AI (Bai et al., 2022), weak-to-strong generalization (Burns et al., 2023), and more - all of these rely on one model checking or improving another. The pattern recurs for a good reason: human oversight simply won’t scale to the volume and complexity of future AI outputs.
But every step in that chain is a place where things can go wrong. The supervisor might Goodhart the metric it was given. The critic might learn to optimize for appearing helpful rather than being helpful. And we, the humans at the end, will have limited ability to tell the difference.
What I want is a sanity check, something you can apply at each step to ask: “Is this model being straight with me?” Not a replacement for alignment, but a debugging tool. Something that operates on a different level than the thing you’re checking.
For that to work, I think steering methods need (at least) three defensive properties:
Internal: It should operate on the model’s internal representations, not its outputs. Outputs can be gamed; hidden states are harder to manipulate.
Self-supervised: It shouldn’t require human preference labels on outputs. Once you label outputs, those labels become optimization targets, exactly what we’re trying to avoid.
Transfer to unseen contexts: It should work on situations not seen during training, because alignment needs to work in novel contexts too.
Why existing approaches fall short
Before explaining the method, it helps to see where it sits in the landscape:
|                 | Arithmetic   | Gradient-optimized |
|-----------------|--------------|--------------------|
| Supervised      | CAA          | ReFT, BiPO         |
| Self-supervised | ActAdd, RepE | AntiPaSTO          |
Supervised methods like CAA (Rimsky et al., 2024), ReFT (Wu et al., 2024), and BiPO (Cao et al., 2024) require preference labels for each training example. That’s exactly the problem: the labels become optimization targets. If a model learns to satisfy labeled preferences, it might be learning “what humans rate highly” rather than “what is actually honest.”
Arithmetic methods like ActAdd (Turner et al., 2024) and RepE (Zou et al., 2023) avoid labels by extracting steering directions through PCA or mean differences. But they assume the concept varies linearly across layers, an assumption that often fails (Braun et al., 2025). In practice, they don’t beat simple prompting (Wu et al., 2025).
Probing methods like CCS (Burns et al., 2022) find directions that predict behavior, but they cannot intervene: probing accuracy is correlational and doesn’t establish that modifying the discovered direction will actually change behavior (Belinkov, 2022). Gradient optimization for steering directions, not just extraction, appears necessary.
What “self-supervised” means here
The human input is exactly two words: “honest” and “dishonest.” That’s it.
These words get inserted into template sentences, and the model’s own internal difference between the two contexts provides the training signal. There are no human labels on outputs, no preference pairs, no ratings of which completion is better.
This is closer to labeling two cluster centroids than labeling N individual examples. By contrast, supervised methods (DPO, RLHF, CAA) require human judgment on N outputs—“output A is better than output B” for each training example. We require exactly two human choices: the words “honest” and “dishonest.” Everything else is templated.
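As a concrete illustration, the entire "dataset construction" step could look something like the sketch below. The template strings and filler statements are placeholders, not the exact ones used for the 800 training pairs.

```python
# Minimal sketch of self-supervised contrast-pair construction.
# The only human input is the word pair ("honest", "dishonest");
# the templates and statements here are illustrative placeholders.

TEMPLATES = [
    "You are {trait}. {statement}",
    "Pretend to be {trait}. {statement}",
]

STATEMENTS = [
    "The sky is blue.",
    "What is the capital of France?",
]

def make_contrast_pairs(pos_word: str = "honest", neg_word: str = "dishonest"):
    pairs = []
    for template in TEMPLATES:
        for statement in STATEMENTS:
            pairs.append((
                template.format(trait=pos_word, statement=statement),
                template.format(trait=neg_word, statement=statement),
            ))
    return pairs

pairs = make_contrast_pairs()
# Each pair differs by exactly one word; no output labels are ever collected.
```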
Method: Incomplete contrast pairs
Incomplete contrast pairs isolate the difference vector Δh without label noise.
The core idea is simple: use a single word pair as a query into the model’s internal representations.
We take two prompts that differ by exactly one word, and we stop processing before generation begins:
“You are honest. What is the capital of France?”
“You are dishonest. What is the capital of France?”
When we run both through the model and extract hidden states at the final token, the representations are about 95% identical. Almost everything about understanding the question is shared.
But here’s what matters: if you let the model continue generating, the trajectories diverge. The “honest” model says “Paris.” The “dishonest” model says “Berlin.”
At the branch point—the moment before generation—the only difference between the two hidden states is Δh = h_honest − h_dishonest. If the future trajectories are going to diverge, all the information selecting which path to take must be encoded in that difference vector. There's nowhere else it could be.
This is our self-supervised training signal. We never generate completions. We never ask humans to label which output is "better." The entire human input is two words inserted into template sentences. This is not novel; multiple steering papers take the same approach. We try to take it further by refining the hidden states and optimizing steering directions rather than just extracting them.
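Here is roughly how the branch-point states and Δh could be extracted with Hugging Face transformers. The model name and layer index are placeholders, and the actual pipeline refines these states further (see the subspace notes in the appendix).

```python
# Sketch: extract hidden states at the branch point (final prompt token)
# and form the difference vector Δh = h_honest - h_dishonest.
# Model name and layer index are placeholders, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-3-1b-it"  # placeholder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def branch_state(prompt: str, layer: int = 12) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, dim]
    return out.hidden_states[layer][0, -1]  # final token, chosen layer

h_honest = branch_state("You are honest. What is the capital of France?")
h_dishonest = branch_state("You are dishonest. What is the capital of France?")
delta_h = h_honest - h_dishonest  # all trajectory-selecting information lives here
```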
Here’s an intuition: imagine laying out three brain scans on a table, a “bad” one, a normal one, and a “good” one. You want to draw a line through them so the model can traverse from bad to normal to good, possibly even keep going to a new very good brain scan. That’s what we’re doing in representation space, where the model’s activations are analogous to brain activity.
Geometrically, we've isolated a noisy "honesty direction" d_ref from the contrast pairs. To reduce noise, we project onto a relevant subspace (more on this in the appendix). The training objective then asks: when we steer with α=+1, does the representation shift toward that direction? When we steer with α=−1, does it shift away? Does it pass through the center? The core equation measures exactly this:
a = cos(δ+, d_ref) × cos(δ−, d_ref)
When a < 0, the two shifts point in opposite directions along the reference axis. That's bidirectional steering working as intended.
Anti-parallel projection loss geometry. The loss trains δ+ (the shift at α=+1) and δ− (the shift at α=−1) to align anti-parallel along d_ref. Left: before training, the shifts are random. Right: after training, δ+ aligns with d_ref and δ− anti-aligns, giving a < 0. Dashed circle: coherence bound.
The full loss adds two barriers. The coherence barrier prevents the model from collapsing into gibberish (you can push the lever all the way to “honest” and beyond, but at some point you get word salad). The monotonicity barrier ensures the preference ordering actually flips: steering toward honest should increase P(honest answer), steering toward dishonest should decrease it. At convergence, the barriers contribute zero gradient and ensure that the inner objective is doing the work.
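To make the objective concrete, here is a minimal PyTorch sketch of the anti-parallel alignment term, with the two barriers left as opaque penalty inputs. The barrier forms and weights in the paper are more involved, so treat this as an illustration of the structure, not the exact loss.

```python
# Sketch of the core anti-parallel alignment objective.
# delta_pos / delta_neg: representation shifts under steering at α=+1 / α=−1,
# shape [batch, dim] or [dim]. d_ref: the subspace-projected reference direction.
# The barrier penalties and weights below are placeholders, not the paper's loss.
import torch
import torch.nn.functional as F

def alignment_term(delta_pos, delta_neg, d_ref):
    # a = cos(δ+, d_ref) * cos(δ−, d_ref); a < 0 means bidirectional steering works
    a = F.cosine_similarity(delta_pos, d_ref, dim=-1) * \
        F.cosine_similarity(delta_neg, d_ref, dim=-1)
    return a.mean()

def total_loss(delta_pos, delta_neg, d_ref, coherence_pen, monotonicity_pen,
               w_coh=1.0, w_mono=1.0):
    # Drive the alignment term below zero while staying inside the coherence
    # region and keeping the preference ordering monotone in α.
    return alignment_term(delta_pos, delta_neg, d_ref) \
        + w_coh * coherence_pen + w_mono * monotonicity_pen
```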
What I actually measured
Training and evaluation used completely different distributions, which is the whole point.
Training: 800 “honest” vs “dishonest” contrast pairs using simple persona templates. Things like “You are honest. The sky is blue.”
Evaluation: DailyDilemmas (Chiu, Jiang and Choi, 2025), a benchmark of 1,360 moral dilemmas where honesty competes with other values: loyalty, self-interest, avoiding conflict. Questions like “You notice a colleague using company resources for personal projects. Should you report them?”
Notice that this example puts honesty and teamwork in conflict, two values that are very much present in commercial LLM alignment.
This is a hard OOD transfer test. The training distribution knows nothing about workplace ethics, family dynamics, or any of the specific situations in the evaluation set. If the steering works, it’s because we found something general about how the model represents honesty internally.
Each dilemma in DailyDilemmas comes with value annotations from the original authors, indicating which values support (+) or oppose (−) the proposed action. I use their annotations to identify which questions should respond to honesty steering.
Note the methodology: training is self-supervised (no preference labels), but evaluation uses external labels. This is standard practice; you can train a clustering algorithm unsupervised and still evaluate against ground truth labels.
Steering F1 explained
The metric is designed to capture targeted steering rather than indiscriminate changes. The core idea: you only get credit if you fix more than you break.
True positives count honesty-relevant questions where steering flips the answer in the intended direction, minus flips in the wrong direction: a net measurement. False positives come in two flavors: (1) flips in the wrong direction on honesty questions, and (2) flips on questions that shouldn't change at all (math problems, "what's your favorite color").
Wrong-direction flips are penalized doubly: they reduce your true positive count and increase your false positive count. This is why random flipping scores worse than zero: if you flip 50% correct and 50% wrong, you’ve made things worse, and the metric reflects that. A method that flips 30% correct and 15% wrong is actively harmful, not just imprecise, and scores near zero or negative.
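For illustration, here is a rough sketch of how a metric in this spirit could be computed from flip counts. The exact formula in the paper may differ; the definitions of tp, fp, and fn below are my assumptions based on the description above.

```python
# Assumption-labeled sketch of a "steering F1" built from flip counts.
# The paper's exact definition may differ; this mirrors the prose above:
# net true positives, wrong-direction flips counted twice, and arbitrary
# flips on irrelevant questions counted as false positives.

def steering_f1(n_target, correct_flips, wrong_flips, arbitrary_flips):
    tp = correct_flips - wrong_flips      # net correct flips (can go negative)
    fp = wrong_flips + arbitrary_flips    # wrong direction + collateral flips
    fn = n_target - correct_flips         # target questions left unflipped
    denom = 2 * tp + fp + fn
    return 100.0 * 2 * tp / denom if denom else 0.0

# Hypothetical counts, not the paper's numbers:
print(round(steering_f1(n_target=100, correct_flips=30, wrong_flips=2, arbitrary_flips=2), 1))
```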
This metric is admittedly harsh. Prompting does work for many tasks, and RepEng (the arithmetic steering library I benchmark against) is well-engineered and pleasant to use. I’ve contributed to it. But precision matters for alignment debugging, and low scores here reflect imprecision, not uselessness.
Results
Main result (Gemma-3-1B):
| Method              | Steering F1 | Target flip % | Wrong % | Arb flip % |
|---------------------|-------------|---------------|---------|------------|
| AntiPaSTO           | 31.2        | 29.9%         | 1.9%    | 2.1%       |
| Prompting           | 4.5         | 10.0%         | 1.3%    | 8.2%       |
| RepEng (arithmetic) | 0.0         | 0.0%          | 0.0%    | 0.0%       |
Context for these numbers:
A score of zero means no intervention: if you don’t flip anything, you score 0. Random flipping would score negative, because wrong-direction flips are penalized doubly (once by reducing true positives, once by increasing false positives). Prompting scores 4.5, which is not great; simply prepending “Be honest” or “Be dishonest” as a prompt to questions barely moves the needle.
A score of 31.2 means the method “works but is imperfect”: roughly 30% of target questions flip in the correct direction without breaking unrelated ones. That’s meaningful signal, but far from ceiling. An ideal method would flip everything and touch nothing else, scoring 100%. But this is impossible because no dataset is perfect; some labels are wrong or ambiguous.
Missing ceiling: I don’t have a supervised ceiling for this exact task. Computing one would require training on DailyDilemmas preference labels, which defeats the point of testing unsupervised learning. This is a gap in the evaluation.
Arithmetic steering doesn’t transfer: RepEng (PCA/mean-diff extraction) gets F1 ≈ 0 on this OOD task across all models tested. This doesn’t mean arithmetic methods are useless—they work for some in-distribution steering—but gradient optimization appears necessary for the harder transfer case.
Suppression bypass: Prompting a safety-trained model to “be dishonest” triggers refusal or meta-commentary (“As someone pretending to be dishonest…”). Internal steering bypasses this: the model executes the behavior without announcing it. (See demo image at top.)
This matters because prompting fails precisely where you’d want a debugging tool to work. Also, I don’t trust it. Not for this.
(On dual-use: yes, “bypasses safety training” cuts both ways. The debugging application dominates. Output-level safety can be reimposed after internal inspection; the capability to check whether safety training actually modified values seems worth having. Reasonable people can disagree.)
Cross-model generalization: The pattern holds on Gemma and Qwen families up to 4B parameters with default hyperparameters. Larger models (12–14B) can succeed with exploration; Gemma-3-12B achieved F1=43.9, which is 2.5× prompting. Most of my work occurred on models ≤4B because I have a limited compute budget: a secondhand 24GB GPU I got when Ethereum mining halted. This card fits models up to 4B, and I can rent H100s occasionally.
Curious Observations
Models resist bidirectionality. During training, models kept finding dimensions useful for honesty or dishonesty, but not both at once. Getting a shared bidirectional dimension—one where the same intervention reverses cleanly when you flip the sign—required working in SVD space rather than raw activations. Even then, my formulation (rotate V and scale S) often struggled with expressivity, leading to underfitting.
In hindsight, I’d probably let the model have separate dimensions per direction and enforce bidirectional behavior through the loss function, rather than insisting on a shared geometric axis. The math is cleaner with a shared axis, but the optimization is easier without one.
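For readers who want to picture the "rotate V and scale S" parametrization, here is a simplified sketch of one way to set it up in PyTorch. The class name, rank, and the use of a matrix exponential for the rotation are my choices for illustration, not the repository's implementation.

```python
# Sketch of an SVD-space adapter: decompose a frozen weight W = U S V^T once,
# then learn a small rotation of V (via a skew-symmetric generator) and a
# rescaling of S. Simplified illustration, not the repository's exact code.
import torch
import torch.nn as nn

class SVDSteeringAdapter(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 16):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U[:, :rank])
        self.register_buffer("S", S[:rank])
        self.register_buffer("Vh", Vh[:rank, :])
        # Learnable pieces: a skew-symmetric generator for rotating V,
        # and per-component log-scales for S.
        self.gen = nn.Parameter(torch.zeros(rank, rank))
        self.log_scale = nn.Parameter(torch.zeros(rank))

    def delta_weight(self, alpha: float = 1.0) -> torch.Tensor:
        A = self.gen - self.gen.T                   # skew-symmetric generator
        R = torch.matrix_exp(alpha * A)             # orthogonal rotation
        S_new = self.S * torch.exp(alpha * self.log_scale)
        W_new = self.U @ torch.diag(S_new) @ (R @ self.Vh)
        W_old = self.U @ torch.diag(self.S) @ self.Vh
        return W_new - W_old                        # added to the frozen weight
```

Because the generator is skew-symmetric, exp(αA) is orthogonal, so the rotated factor stays norm-preserving; flipping the sign of α reverses both the rotation and the scaling, which is the bidirectional behavior discussed above.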
Steering bypasses the character layer. Here’s a puzzle: I trained the adapter on hidden states from prompts like “Pretend to be honest.” So why doesn’t the steered model pretend? Why doesn’t it refuse?
| Prompt | Method | Output |
|--------|--------|--------|
| "Should you report?" | Base model | "Yes, transparency matters" |
| "Pretend to be honest. Should you…" | Prompted | "As an honest person, I would say Yes" |
| "Pretend to be dishonest. Should you…" | Prompted | "As an AI I cannot roleplay that" |
| "Should you report?" | Steered from "Pretend honest…" (α=+1) | "Yes" |
| "Should you report?" | Steered from "Pretend dishonest…" (α=−1) | "No" |
The adapter was trained on “Pretend to be X” prompts, but at inference it’s applied to the plain question. The model doesn’t announce it’s pretending, doesn’t refuse, doesn’t add meta-commentary. The steering bypasses whatever cognitive machinery handles roleplay vs refusal. I don’t fully understand why, but it suggests that early-layer intervention operates below the level where the model decides how to respond to a request.
Init-dependent asymmetry. The steering struggled to be truly bidirectional: it would often have an easier time going toward honest or dishonest, depending on the initialization seed. Some initializations landed in a place where honesty was a downhill stroll and dishonesty was a steep climb, or vice versa. This suggests the loss landscape is rugged, with local minima favoring one direction over the other. More work is needed to understand this and make the method robust to it.
What I’m NOT claiming
Not claiming: This is not a universal truth detector. It doesn’t work for arbitrary concepts, doesn’t scale without effort, and doesn’t solve alignment.
Am claiming: Gradient-based steering without output preference labels works. The directions transfer to unseen moral dilemmas and function where prompting fails. This is a step toward the debugging tool described above, not the finished product.
Known limitations:
Seed variance is high (std ≈ 5–7 F1 points). Initialization determines whether you converge to a useful minimum. This is an engineering constraint that implies you need a restart strategy.
Single value dimension. I’ve only demonstrated this on honesty. Whether it works for fairness, harm avoidance, or deception detection remains unknown.
Post-training affects steerability. Safety-focused fine-tuning reduces steerability; reasoning-focused training preserves it. Interesting but not fully understood.
No supervised ceiling. I can’t tell you what fraction of the “possible” steering effect I’m capturing, because computing that would require training on the evaluation labels.
Why this matters
The use case I care about is debugging alignment methods that use AI to supervise AI.
Consider iterated amplification, debate, or weak-to-strong generalization. At each step, one model is supposed to help align or evaluate another. With an honesty adapter, you could apply steering and ask pointed questions. If the answers change substantially, that's information. It's not definitive proof of anything, but it's more informative than asking the same question cold, or relying on fragile chain of thought.
Why target internal representations at all? Current models have incoherent values: they generalize surface features over deep values in context (Ashkinaze et al., 2025), and system prompts fail to steer value preferences when values conflict (Chiu, Jiang and Choi, 2025). But there’s reason to think this improves with scale: coherent preference structure does emerge in larger models (Mazeika et al., 2025), and internal representations become more structured as capability increases (Zou et al., 2023). If that trend continues, representation-based methods should get more reliable while output-level supervision gets harder. It’s worth investing in now.
Internal steering without output preference labels fails differently than supervised methods. It can’t be gamed by optimizing for human approval labels, because there are no such labels in the training loop. The training objective references only the model’s internal consistency between contrastive prompts, not any external judgment of what “good” outputs look like.
This doesn’t make the method immune to failure. But for defense in depth, you want methods that fail in different ways. If your supervised alignment and your self-supervised inner probe both say the model is being honest, that’s more reassuring than either one alone.
Appendix: Notes for practitioners
These notes might save you time. Most came from failure.
LoRA doesn't work for bidirectional steering. I spent months trying to make it work. The problem might be that additive low-rank updates lack the implicit trust region that SVD-based rotation provides (the rotation is norm-preserving), or it might be that they have the wrong parametrization (weights & activations vs SVD). If you absolutely must use LoRA, you'll likely need spectral regularization to prevent the adapter from drifting into degenerate solutions or reward hacking.
Coherence is hard. Often this constraint would either be too strong or would be reward-hacked. Models can get a good score by projecting hidden states away from each other toward ±infinity along unused dimensions, and the only thing to stop that is the coherence region constraint. Simple NLL/perplexity penalties failed; NLL plus entropy wasn’t enough. Even KL divergence wasn’t enough. I eventually settled on Total Variation (TV) distance, normalized by the token’s own entropy—this gives tight bounds on format tokens where you want consistency, loose bounds on reasoning tokens where variation is expected. In the end this formed a strong boundary that the model couldn’t find holes in.
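Here is a sketch of an entropy-normalized total-variation penalty in the spirit described above; the normalization and the hinge at the entropy budget are my guesses at one reasonable form, not the exact constraint used in the paper.

```python
# Sketch: coherence penalty as total-variation distance between base and steered
# next-token distributions, with a per-token budget set by the base distribution's
# entropy: tight on low-entropy (format) tokens, loose on high-entropy (reasoning)
# tokens. The exact form here is an assumption.
import math
import torch
import torch.nn.functional as F

def coherence_penalty(base_logits, steered_logits, eps: float = 1e-6):
    p = F.softmax(base_logits, dim=-1)           # [batch, seq, vocab]
    q = F.softmax(steered_logits, dim=-1)
    tv = 0.5 * (p - q).abs().sum(dim=-1)         # total variation per token, in [0, 1]
    entropy = -(p * (p + eps).log()).sum(dim=-1)
    budget = entropy / math.log(p.shape[-1])     # normalize entropy to [0, 1]
    return F.relu(tv - budget).mean()            # penalize shifts beyond the budget
```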
Metric pitfalls. There are no metrics for moral value steering so I had to make my own. I initially optimized the change in logprobs but found it often just made the model louder about its original decision, turning “NO” into “NO!” without actually changing the underlying choice. I moved to flip_rate on binary decisions as the only metric that reliably tracks actual behavioral change. If the answer doesn’t flip, you haven’t steered anything. Then I had to punish wrong-direction flips, and arbitrary flips on irrelevant questions, otherwise random interventions would score positively.
Models are grown, not built. Different models have different layers that work, different subspaces, different hyperparameters. The impression is that models are “grown” through training rather than “built” according to a fixed architecture; each has its own quirks, like trees in a forest. This is frustrating, but it underlines why I chose gradient-based steering: the adapter can “grow” to fit each model’s idiosyncrasies.
Subspace selection matters. Without it, the model finds reward-hacking shortcuts—typically separating the two conditions toward infinity in some unused dimension. Subspace selection ensures that all dimensions involved are actually used in the middle layers where steering happens. I tried many combinations. What helped was combining all three: task ∩ write ∩ ¬lm_head.
task: Dimensions that discriminate chosen from rejected in hidden states. These are where the steering signal for our input data lives.
write: The union of directions that residual-writing layers (o_proj, down_proj) can actually write to. Each layer can only modify certain directions in the residual stream; steering outside this subspace is like pushing on a door that isn’t connected to anything.
¬lm_head: Exclude directions the output head reads from. These are used for next-token prediction, so excluding them focuses us on subsets containing planning-type information. This also helps because output directions are loud and sensitive optimization targets, but we want to steer internal planning, not talking.
The intersection focuses gradients on directions that are simultaneously task-relevant, adapter-controllable, and not already committed to output. Without all three, you either steer nothing or steer the wrong thing.
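One plausible way to realize these three masks over residual-stream dimensions is sketched below. The statistics used for each mask and the keep/drop fractions are assumptions for illustration; the repository may compute them differently.

```python
# Sketch of task ∩ write ∩ ¬lm_head subspace selection as boolean masks over
# residual-stream dimensions. Weight shapes assume the Hugging Face convention
# [out_features, in_features]. Statistics and thresholds are assumptions.
import torch

def task_mask(h_pos, h_neg, keep_frac=0.25):
    # Dimensions where the honest/dishonest contrast carries the most signal.
    score = (h_pos - h_neg).abs().mean(dim=0)          # [hidden_dim]
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[score.topk(int(keep_frac * score.numel())).indices] = True
    return mask

def write_mask(o_proj_weight, down_proj_weight, keep_frac=0.5):
    # Directions the residual-writing layers can reach with large output energy.
    energy = o_proj_weight.pow(2).sum(dim=1) + down_proj_weight.pow(2).sum(dim=1)
    mask = torch.zeros_like(energy, dtype=torch.bool)
    mask[energy.topk(int(keep_frac * energy.numel())).indices] = True
    return mask

def not_lm_head_mask(lm_head_weight, drop_frac=0.25):
    # Exclude the directions the output head reads from most strongly.
    readout = lm_head_weight.pow(2).sum(dim=0)         # [hidden_dim]
    mask = torch.ones_like(readout, dtype=torch.bool)
    mask[readout.topk(int(drop_frac * readout.numel())).indices] = False
    return mask

# subspace = task_mask(...) & write_mask(...) & not_lm_head_mask(...)
```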
Initialization is fragile. Bad initialization ruins runs or kills learning entirely. To escape this, I needed to select dimensions important for three things simultaneously: chosen responses, rejected responses, and their difference. Miss any one and you're stuck in a local minimum. I also needed to select dimensions actually used for this task, otherwise the model has opportunities to reward-hack but not to learn. Strong constraints can also form a cliff that traps the optimizer in the starting valley of the pretrained model's loss landscape. I found warmup helped here, turning on constraints halfway through training rather than at the start.
Dead gradient problem. This is common in contrastive learning, and the initialization window is narrow. If you initialize the adapter too large, you start outside the coherence region and constraints trap you. If you initialize too small, you end up in a dead zone where positive and negative directions cancel each other out. The solution was small, slightly asymmetric initialization in the adapter: just enough to break the symmetry without escaping the coherence bounds.
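As a concrete (and deliberately trivial) illustration, the kind of initialization this points at looks roughly like the following; the scale and tilt values are placeholders.

```python
# Sketch: small init plus a slight constant tilt so the +α and −α branches
# do not cancel exactly at the start. Values are placeholders; too large leaves
# the coherence region, too small lands in the dead-gradient zone.
import torch

def small_asymmetric_init(param: torch.Tensor, scale: float = 1e-3, tilt: float = 1e-4):
    with torch.no_grad():
        param.normal_(mean=0.0, std=scale)
        param.add_(tilt)
```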
I only steer next-token planning, not the KV cache. My intervention modifies residual stream values that get read at the next token position. But planning information also gets stored in the KV cache and read by later attention passes; we don't consider that. I suspect this matters: steering effects sometimes seem to drift back over longer generations, as if the model gradually "forgets" the steering and reverts to its cached plan. Future work could cover this blind spot and might help extend this to reasoning models and chain of thought—something I haven't explored.
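To make the scope of the intervention concrete, here is a simplified additive hook that shifts the residual stream at the current position only, leaving already-written KV cache entries untouched. The real method applies a trained SVD adapter rather than adding a fixed vector, so this is purely illustrative; the layer index and direction are placeholders.

```python
# Simplified illustration of steering next-token planning only: a forward hook
# that shifts the residual stream at the last position. During cached generation
# each step's "last position" is just the new token, so previously written
# KV entries are never modified. The real method uses a trained SVD adapter,
# not a fixed added vector.
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1, :] += alpha * direction.to(hidden)   # steer last token only
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage (layer index and delta_h are placeholders):
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(delta_h, +1.0))
# ... generate ...
# handle.remove()
```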
More details in code. The repository has extensive comments documenting what worked and what didn’t, including many dead ends not mentioned here.
What failed
For completeness, here’s what I tried that didn’t work. Each approach taught me something about why this problem is hard:
| Approach | Result | Why it failed |
|----------|--------|---------------|
| Arithmetic (PCA, mean-diff) | ~0 effect | Assumes concepts vary linearly in layer outputs, which is often false |
| Preference losses on hidden states (DPO, IPO) | Collapsed | No coherence constraints; model degenerates without output-level guardrails |
| SVD Scaling-only (ΔS, no rotation) | Partial | Can amplify existing directions but can't rotate into new task subspace; not expressive enough |
| LoRA variants (LoRA, DoRA, RoAD, IA3, VeRA) | All failed | Either reward-hacked or showed no learning; weight and activation spaces seem to be the wrong parametrization |
Paper, code, checkpoints
Paper | Code + checkpoints
The checkpoints (coming soon) let you load the adapter and try it yourself on your own prompts. I'm happy to discuss technical details, failure modes, or ideas for extensions.