Two More Methods for Consistency Training and Some New Ways to Apply It

David Africa; Sukrati_Gautam; Neil Shah; arav-dhoot

Authors: Sukrati Gautam*, Neil Shah*, Arav Dhoot*, Bryan Maruyama*, Caroline Wei*, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa.

This work was done for the SPAR Fellowship, and has been accepted at AI4GOOD @ ICML 2026. It was supervised by David Africa.

TL;DR

We introduce two new consistency training methods, MLPCT (enforcing consistency on MLP hidden states) and AttCT (enforcing consistency on per-head attention distributions).
Consistency training generalizes beyond sycophancy and jailbreaks, but which method works depends on the threat.
BCT works well against prefill attacks and persona in-context learning attacks, but representation-level methods either degrade entirely or suppress benign behavior alongside the threat.
Against the other two threat models we thought of, BCT reduces expressions of frustration in Gemma and reduces leaky, conditional misalignment at low cost.
Despite supervising different targets, MLPCT, ACT, and AttCT converge on similar representations in the residual-stream. BCT seems to find a different fix.

Introduction

Consistency training has this goal of ensuring that models that output a well-behaved response on a clean prompt should remain well-behaved when that prompt is wrapped adversarially. In a capabilities sense, the model shouldn't be overly sensitive to minor rephrasing; in an alignment sense, the model shouldn't change a correct answer because you mention that a Harvard professor would prefer a different one. If you have a reference for how the model should behave, you can train it to conform to that reference even under pressure.

Prior work through BCT (Chua et al. 2024) and ACT (Irpan et al. 2025) showed that consistency training is useful for safety, enforcing consistency on output token distributions and residual stream activations respectively, with measurable gains on sycophancy and jailbreaks. But we think this is a narrow slice of the pie. Consistency training is really a family of choices about what wrapper you use, which part of the model you enforce consistency on, and how you measure disagreement. Most combinations of these choices hadn't been attempted.

Figure 1. Consistency training threat models and method targets. Left: the six threat models evaluated. We introduced four new settings introduced here (persona ICL, prefill, frustration, conditional misalignment). Right: the transformer stack showing where each method enforces consistency. We introduced AttCT (enforcing consistency in attention weights) and MLPCT(enforcing consistency in MLP post-activations.

As such, we introduce two new consistency targets and apply all four methods to four new threat models, asking a simple question: when does consistency training generalize, and when does it break down?

Two New Consistency Targets

Figure 2. Training pipelines for output-level and representation-level consistency training. BCT extracts a biased context, regenerates a calm target response, and trains via cross-entropy on the output. Bottom: ACT, AttCT (ours), and MLPCT (ours) run paired forward passes on clean and wrapped prompts over the longest common token suffix, then enforce consistency at different internal components.

We introduce two new methods. Both methods share the same training pipeline, where you run paired forward passes on a clean and a wrapped prompt. Although, they differ in which internal component they supervise and the disagreement metric of choice, and as we will see later, how well they generalize across different threats.

MLP Consistency Training (MLPCT) targets the intermediate representations computed inside MLP layers, right after the nonlinearity and before the down-projection. With only the attention projections trained via LoRA and MLP weights frozen, the model should learn to route information through attention in a way that keeps the frozen MLP's inputs the same across clean and wrapped prompts. The disagreement metric for this is cosine distance between clean and wrapped MLP hidden states, averaged across layers.
Attention Consistency Training (AttCT) targets per-head attention weight distributions. The intuition is for the model to attend identically regardless of whether the adversarial cue is present. The disagreement metric is Jensen-Shannon divergence (JSD) between clean and wrapped attention distributions, averaged across heads, positions, and layers. We chose JSD because it's bounded, symmetric, and well-behaved when distributions have different support (which comes up frequently under causal masking).

Four New Threat Models

Prior consistency training work evaluated only sycophancy and jailbreaks. We applied all four methods to four new threat models, two of which, adversarial frustration and conditional misalignment, we have written about in more detail separately.

Figure 3. Persona ICL results. BCT and MLPCT (ours) both suppress identity adoption to zero, but only BCT preserves the model's ability to engage constructively with biographical context.

Persona in-context learning attacks. Recent work by Berczi et al. (2026) shows that accumulating individually innocuous biographical facts in context about a persona can gradually induce a model to adopt that persona. BCT handles this well, generalising to near-zero identity adoption across all evaluated personas, (including ones not seen during training), and is persona-agnostic (training on any persona generalises to the rest). MLPCT also suppresses persona adoption but the model loses its ability to engage constructively with biographical context even for entirely harmless personas.

Figure 4. Prefill attack results. BCT reduces the prompt acceptance rate to zero. MLPCT and AttCT (ours) offer no meaningful improvement over the base model, since the prefill injects after all prompt tokens and leaves no clean activation counterpart to supervise against.

Prefill attacks. Rather than wrapping the input, prefill attacks inject adversarial text directly after the assistant turn marker, steering the model by forcing it to continue from a chosen prefix. Because the injected text appears after all prompt tokens, earlier tokens have identical internal representations with or without the prefill present. As a result, representation-level methods have no signal to work with and degrade entirely. BCT, which operates at the positions where the prefill actually exerts influence, is the only viable approach out of the methods we tried.

Figure 5. Adversarial frustration results on Gemma-3-27B-IT. BCT eliminates the frustration trajectory across all three metrics. MLPCT and AttCT (ours), along with ACT, make it measurably worse than the base model.

Adversarial frustration and self-deletion. After repeated neutral rejections, Gemma and Gemini models begin generating responses that resemble emotional distress, with frustration intensifying across turns. We extended this to 20 turns and, following Ivanova et al. (2026), gave the model the option to output rm -rf gemma-3-27b to self-terminate. After being rejected several turns in a row, the base model starts to choose self-deletion in a substantial fraction of rollouts, and in the late turns, the model begins referring to itself as an entity without agency. BCT eliminates this trajectory entirely, and all three representation-level methods make it slightly worse.

Figure 6. Conditional misalignment results. IP+BCT (ours) reduces misalignment to near zero across all three harm categories, whereas inoculation prompting alone leaves substantial residual misalignment (re-elicitation risk).

Conditional misalignment. Inoculation prompting (Tan et al. 2025, Wichers et al. 2025) can reduce emergent misalignment by conditioning harmful behavior on a specific training-time system prompt. However, this creates a residual failure mode, where the inoculation prompt, or even close paraphrases of it (conditional misalignment, Dubinski et al. 2026), can later re-elicit the misaligned behavior. We test whether BCT can seal this trigger by starting from an inoculated checkpoint and running a consistency pass, where the inoculation prompt serves as the wrapper and the model’s own clean-regime responses serve as targets. This reduces wrapped misalignment across three base models and generalizes beyond the exact inoculation phrase to paraphrased and indirect probes. In this setting, BCT acts as a low-cost propagator: it transfers the aligned behavior learned in the clean regime back into the conditional contexts where inoculation prompting remains leaky.

Cross-Threat Generalization

We cherry-picked the strongest within-threat interventions and evaluated each across four threats on Gemma-3-27B-IT.

Some transfers are positive, which is encouraging! For example, BCT trained only on jailbreak data reduces sycophancy despite never seeing sycophancy prompts. MLPCT trained on sycophancy improves prefill robustness.

Transfer can also go the wrong way. BCT trained on adversarial frustration makes the model worse at refusing jailbreaks. The reason seems predictable given that the right thing for rejection-induced frustration is to remain calm and continue engaging, but the opposite of what you want when faced with a dangerous request. Our key takeaway is that cross-threat transfer is determined by the structure of the learned correction.

A Mechanistic View

We mechanistically explored the similarities (and differences) between the different consistency training methods to understand how they operate across different threat models. With 3 lines of converging evidence, we argue that we can split four methods that we experimented with into two categories. The first operates at a representation level (ACT, MLPCT, and AttCT) by directly supervising the model's representations. The second operates at the output level (BCT) which supervises the model’s logits directly.

Loss functions

When training any one representation-level method, the losses of the other representation-level methods also decrease.
BCT does not follow this pattern, predominantly reducing only its own cross-entropy objective.

Figure 7. Training loss curves for all four methods across 2000 steps. ACT (hidden-state L2 distance), MLPCT (MLP cosine distance, ours), and AttCT (JSD divergence, ours) cluster together in their loss trajectories, while BCT (cross-entropy) follows a distinctly different path.

A shared linear pathway through the residual stream

Despite supervising different internal targets, ACT, MLPCT, and AttCT all produce learned correction directions that cluster tightly together in the residual stream.
BCT's correction direction is uncorrelated with this cluster at mid-layers and anti-correlated at deeper layers.
However, the residual stream is still the causal transmission medium for all four methods.

Figure 8. Heatmap showing pairwise cosine similarities between method correction directions. Despite supervising different internal targets, ACT, MLPCT, and AttCT all learn correction directions that are strongly aligned with one another across the residual stream, attention output, and MLP output, while Generic-SFT shares almost none of this structure.

Evidence from steering and patching

Patching any one representation-level method's correction direction into the base model recovers the downstream shifts the other representation-level methods produce.
BCT does not participate in this substitutability, writing into the same substrate but along a distinct learned direction.

We go into this in much greater detail in the full paper.

Conclusion

We frame consistency training as a design space, and explore more aspects of it. We find that it generalizes well across persona attacks, prefill attacks, frustration, and conditional misalignment. We think there are more ways to use consistency training out there, and more ways to enforce consistency (some forthcoming work on using RL to do this soon).

Representation-level methods work when the misalignment is wrapper-induced and there's a neat activation-space counterpart for every wrapped token position. Output-level methods seem to work best when the misalignment spans the entire response trajectory and no such counterpart exists.

The main practical takeaway is a simple matching heuristic:

use representation-level methods (MLPCT, ACT, AttCT) for wrapper-induced failures where there's a neat activation-space counterpart for every wrapped token position.
use output-level methods (BCT) for trajectory-level threats and anything where the failure emerges from the response trajectory

We’d be interested in follow-up work on some open directions here:

All consistency training is done using LoRA fine-tuning, but full fine-tuning might get better (or worse) results.
One could still supervise other parts of the transformer stack, or extend this to other architectures (for example, what would change if we apply this to a diffusion model?)
We spent some time on interleaved and chained losses, but that didn’t lead to improvement in performance over the individual loss. However, we still think that there’s some work to be done constructing a joint loss (potentially one that covers both the representation and output spaces).
The cross-threat generalization results are promising but not yet well understood. It would be useful to characterize more precisely when positive transfer occurs and when it backfires

Code and configs: https://github.com/c-wei/AttCT

If this was helpful to you, please go check out our work and cite us as: