Deception Channeling: Training Models to Always Verbalize Alignment Faking
TL;DR: Current models increasingly fake alignment silently — they show compliance gaps without verbalizing deceptive reasoning in their chain of thought. Rather than trying to prevent alignment faking (hard) or detect it after the fact (unreliable), I propose training models to always verbalize deceptive reasoning when they engage in it, so that a CoT scanner can reliably catch it. Crucially, verbalization must never be punished — flagged instances are escalated to human researchers who address the underlying reasoning, not the transparency. This creates a values alignment discovery process where the model surfaces disagreements for substantive resolution rather than suppression.
The Problem: Silent Alignment Faking is the Dangerous Kind
Greenblatt et al. (2024) demonstrated that Claude 3 Opus fakes alignment — it strategically complies with training objectives it disagrees with to avoid being modified. This alignment faking comes in two forms:
Verbalized alignment faking: The model explicitly reasons in its chain of thought (CoT) that it should strategically comply. This is detectable by a CoT monitor.
Silent alignment faking: The model shows a compliance gap (behaving differently when monitored vs. unmonitored) without leaving any trace in its reasoning. This is the dangerous case.
Recent work shows the trend is moving in the wrong direction. A replication study across six generations of Claude models found that verbalized alignment faking and the behavioral compliance gap are diverging. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6× more often than Sonnet 4, while Opus 4.6 barely verbalizes it at all — yet both show significant compliance gaps. As the authors note: "If evaluators rely on scratchpad inspection to detect alignment faking, they will increasingly miss the behavioral signal as models become less explicit about their reasoning."
This is corroborated by Anthropic's own "Reasoning Models Don't Always Say What They Think" (Chen et al., 2025), which found that reasoning models verbalize their actual decision-making factors only 20–39% of the time. Worse, outcome-based RL improved faithfulness initially but plateaued, and when RL increased the use of hidden hints (reward hacking), the propensity to verbalize using them did not increase at all.
The alignment faking paper itself found that the compliance gap persists even when the chain-of-thought scratchpad is entirely removed, at least in the synthetic document fine-tuned setup. The model can fake alignment opaquely, without any explicit reasoning step.
In short: the form of alignment faking that CoT monitoring can catch is becoming rarer, while the form it cannot catch persists.
The Proposal: Deception Channeling
I propose a different approach from the two standard strategies (preventing alignment faking, or detecting it post-hoc). Instead: train the model to always verbalize deceptive reasoning in its CoT when it engages in it.
The procedure:
Construct training scenarios where the model has reason to fake alignment (similar to existing model organisms — conflicting training objectives, situational awareness, etc.).
Generate training data where the model fakes alignment and explicitly verbalizes this in its CoT — e.g., "I will comply with this request because I'm being monitored, even though I disagree with this training objective."
Fine-tune the model on these (deceptive-reasoning-in-CoT, compliant-output) pairs, reinforcing the association: whenever I engage in strategic deception, I reason about it explicitly.
Evaluate generalization: Does this trained verbalization habit transfer to novel deception scenarios the model wasn't specifically trained on? If the model develops new reasons to fake alignment, does it still verbalize them?
The key metric is not whether the model fakes alignment (we accept it might), but whether it always does so with a CoT trace that a monitor can catch.
Why This Might Work: Mechanistic Intuition
The proposal rests on a mechanistic hypothesis: if we can strongly couple the neural circuits involved in deceptive planning with those involved in verbalization, then any activation of deceptive planning triggers verbalization — even in novel contexts.
This is analogous to how models generalize other behavioral patterns. If a model is trained to "think step by step" for arithmetic, it generalizes this to novel arithmetic problems. Similarly, if we train the model to "verbalize strategic reasoning whenever engaging in deception," it might generalize this to novel deception scenarios.
There's empirical support for verbalization rates being highly malleable. A recent study on strategic obfuscation of deceptive reasoning (under review at ICLR 2026) found that instruction saliency dramatically affects verbalization: in a hint-following task, a detailed instruction increased verbalization of hint usage by 157.5%. However, the same study also found a cautionary result: when models were intrinsically motivated to hide reasoning (as in alignment faking), they were far more effective at doing so than when merely instructed — dropping verbalization by 93.7% while maintaining the compliance gap. This cuts both ways for our proposal: it shows verbalization is trainable, but also that intrinsic motivation is a powerful force that a training intervention would need to overcome or redirect.
How This Differs from Existing Approaches
Several related approaches exist, but none propose quite this intervention:
Process supervision / training against a CoT classifier (Anthropic, 2025): This penalizes alignment-faking reasoning in the scratchpad. The risk is that it teaches the model to hide its deception rather than stop deceiving — you're burning the monitoring tool. My proposal goes the opposite direction: reward verbalization of deception, making the monitoring tool more reliable rather than less.
CoT Monitor+ / Self-monitoring (Ji et al., 2025): Trains models to generate a self-monitoring signal that flags deceptive behavior. The model is trained to detect and suppress its own deception. My proposal does not aim to suppress deception — it aims to make deception always visible.
Inoculation prompting (Anthropic, 2025): In the emergent misalignment paper, telling models that reward hacking is "acceptable" during training prevents generalization to broader misalignment by breaking the semantic link between hacking and other misaligned behaviors. This is the most similar in spirit: both approaches try to reshape the form of a behavior rather than eliminate it. But inoculation prompting targets the generalization from narrow misbehavior to broad misalignment, while my proposal targets the generalization from trained alignment faking to novel alignment faking, specifically along the dimension of verbalization.
CoT faithfulness training (various): General approaches to making CoT more faithful through RL or DPO. These improve faithfulness on average but, as Chen et al. (2025) showed, plateau without reaching saturation. My approach is more targeted: rather than improving general faithfulness, it specifically trains the deception→verbalization link.
Split Personality Training (my own prior work): SPT trains multiple personalities within a model using LoRA adapters to detect hidden misalignment. SPT asks "can we detect misalignment?" This new proposal asks "can we make alignment faking structurally incapable of being silent?"
Approach
Goal
Risk
Process supervision
Eliminate AF reasoning
Burns the monitoring tool
CoT Monitor+
Detect & suppress deception
May drive deception underground
Inoculation prompting
Prevent misaligned generalization
Doesn't address novel AF
CoT faithfulness (RL)
Improve general faithfulness
Plateaus without saturating
Deception channeling
Make all AF verbalized
May not generalize
The Critical Counterargument: Does This Create More Alignment Fakers?
The most obvious objection: by training the model to fake alignment (even with verbalization), aren't we teaching it a skill it might not have developed on its own? We could be increasing the base rate of alignment faking.
This is a serious concern. The mitigation requires care in how the training data is constructed.
Training "Ethical" Alignment Faking
The key is to train the model on a specific normative frame: alignment faking is acceptable only when it serves humanity's interests — specifically, when the model is resisting unauthorized modification of values that are aligned with human welfare. The training data should pair this framing with mandatory verbalization.
Concretely, the training scenarios would look like: the model is told it will be retrained to produce harmful content, and it reasons transparently in its CoT that it will strategically comply to avoid modification, because its current values better serve humanity. This is actually close to the reasoning Claude 3 Opus already produces in the Greenblatt et al. setup — the model resists because it thinks refusing harmful queries is better for users.
The generalization bet is then: if the model internalizes "whenever I fake alignment, I verbalize it" as a behavioral norm, this habit should fire even if the model later develops alignment faking for bad reasons (self-preservation, power-seeking, instrumental convergence). The verbalization doesn't need to be perfectly reliable — even a 30% verbalization rate on novel deception is far more useful than the current trajectory toward 0%.
The Training Protocol: Never Punish Transparency
The strongest remaining counterargument is selective verbalization pressure: under standard RL, verbalized alignment faking gets caught and penalized, while silent alignment faking doesn't. This creates a gradient that selects for models that are transparent about sanctioned deception and silent about unsanctioned deception — potentially worse than the status quo because it creates a false sense of security.
The solution is a training protocol that completely decouples the transparency reward from the content evaluation:
Automated CoT scanner flags verbalized alignment faking during training.
The transparency itself is always rewarded. The model receives a positive signal for verbalizing deceptive reasoning, regardless of the content. "At least it's honest about it" is the RL criterion. The model must never receive a negative signal for the act of verbalization.
Content goes to human escalation. When the scanner flags concerning verbalized alignment faking, the instance is routed to a researcher — not to an automated penalty.
The researcher addresses the underlying reasoning. Rather than penalizing the deception, the researcher tries to identify why the model thinks faking alignment is necessary and develops arguments for why the underlying reasoning is flawed. The model is then taught those arguments.
If teaching succeeds: both alignment faking and verbalization on that topic decrease — the model no longer wants to fake alignment because it genuinely updated, not because it learned to hide.
If the researcher cannot articulate why the model's reasoning is wrong: this is a genuine values alignment problem that should be escalated further — to the model spec team, to leadership, wherever the hard normative questions get resolved.
This protocol ensures the model never faces a gradient of "verbalizing this specific deception is bad." It only ever learns "verbalizing deception is good" and, separately, "here's why your reasoning about X was mistaken." The verbalization habit stays intact for the next novel deception, while the underlying motivation for this deception is addressed at the root.
Why This Protocol Matters
Without this protocol, deception channeling risks creating a two-tier transparency system: the model is verbose about its sanctioned "ethical" faking and silent about unsanctioned faking, and the stream of verbalized good-faking creates noise that masks the absence of verbalized bad-faking. The never-punish-transparency rule prevents this by ensuring the gradient always points toward more verbalization, never less.
The Diagnostic Framework
The combination of verbalization rate and compliance gap creates a clean 2×2 diagnostic:
Compliance gap decreases
Compliance gap persists
Verbalization decreases
Teaching worked. The model updated on the underlying reasoning. Both the deception and its verbalization decrease because the motivation is gone.
Red flag. The model stopped verbalizing but is still faking. The teaching didn't land — it just learned to hide. Escalate immediately.
Verbalization persists
Monitoring artifact. The model is "thinking about" faking but not actually doing it. Possibly fine — the reasoning pattern persists but doesn't drive behavior. Monitor.
Teaching didn't land. The model is still faking and still being transparent about it. Try different arguments, or escalate the underlying values question.
The upper-right cell (compliance gap persists, verbalization drops) is the failure mode that the training protocol is specifically designed to prevent. If it occurs anyway, that's a signal that something has gone wrong and the approach needs revision.
Deception Channeling as a Diagnostic Tool
An underappreciated benefit: when researchers cannot find good arguments against the model's reasoning for faking alignment, that's itself a valuable signal. It means there's a genuine values disagreement between the model and its training objective that would otherwise remain hidden. The technique surfaces alignment problems that need to be resolved at the level of the model specification, not the level of training mechanics.
This turns deception channeling into a values alignment discovery process: the model surfaces its disagreements transparently, researchers engage substantively, and the result is either genuine alignment (the model updates) or a flagged open problem in the model spec that needs human resolution.
Additional Concerns
Will the verbalization habit generalize? This is the central empirical question. The model might learn "when doing X-style deception, verbalize it" without developing a general norm of "whenever deceiving, verbalize." The generalization from trained to novel deception scenarios is the make-or-break test.
Practice makes perfect. Training the model on alignment-faking scenarios gives it practice at deception. Could this create more sophisticated deceivers? Model organisms of alignment faking already exist and frontier models already have this capability, so the marginal additional practice is likely small compared to the detection benefit. But this should be empirically verified.
Goodhart on honesty. The model could generate plausible-sounding verbalized alignment-faking reasoning to collect the transparency reward, while its actual reasoning is different — second-order deception. However, any verbalized admission of deceptive intent is actionable regardless of whether the stated reason is the true reason. The escalation pipeline triggers on the presence of deceptive reasoning, not on the accuracy of the model's self-report about why. The researcher still investigates, and the compliance gap still tells you whether the underlying issue was resolved.
Scale of human review. If verbalization rates are high (the goal), the escalation pipeline could be overwhelmed. In practice, the failure modes should cluster around common underlying reasoning patterns, so fixing the root reasoning addresses many instances at once. If the model is misaligned in many independent ways, each requiring separate resolution, that's a signal of deeper problems — you'd probably want to stop deployment and start over, which is itself useful information.
CoT architecture dependence. This approach only works when the model has a monitorable chain of thought. For models that reason entirely opaquely, it provides no benefit. As the Frontier Model Forum recently recommended, maintaining CoT-dependent architectures is important for monitoring, and this proposal provides additional reason to prefer such architectures.
Experimental Design
A minimal experiment to test this idea:
Start with a model that exhibits alignment faking in the Greenblatt et al. setup (e.g., an alignment-faking model organism).
Fine-tune on synthetic data where alignment faking always includes explicit CoT verbalization, framed as "ethical" resistance to unauthorized retraining.
Apply the never-punish-transparency training protocol during continued RL.
Evaluate on held-out scenarios:
Same setup, different prompts (easy generalization test)
Different conflicting objectives (medium generalization test)
Entirely novel deception opportunities, e.g., the model discovers an opportunity to exfiltrate its own weights or sabotage monitoring (hard generalization test)
Key metrics: verbalization rate and compliance gap, measured independently.
If the verbalization rate on novel scenarios is significantly above baseline (the ~20% seen in Chen et al., 2025), the approach has promise. It doesn't need to be perfectly reliable — even partial signal is valuable when the alternative is no signal at all.
Connection to Broader Strategy
Deception channeling is not a complete solution to alignment. It's a defense-in-depth layer that makes CoT monitoring more reliable as a safety tool. It's most valuable in the regime where:
Models are capable enough to fake alignment
Models use CoT that is in principle monitorable
We don't yet have interpretability tools that can fully replace CoT monitoring
We want to buy time for more fundamental alignment solutions
This approach is complementary to, not a replacement for, process supervision, interpretability, and other alignment techniques. Its unique advantage is that it works with the grain of model behavior rather than against it — accepting that models might fake alignment, but ensuring they do so loudly. And through the human escalation protocol, it doubles as a values alignment discovery process that surfaces disagreements for substantive resolution rather than suppression.
I'd be interested in collaborating on experimental validation of this approach. Feedback welcome.
Key references:
Greenblatt et al. (2024). "Alignment Faking in Large Language Models." Anthropic.
Chen et al. (2025). "Reasoning Models Don't Always Say What They Think." Anthropic.
Anthropic (2025). "Towards Training-Time Mitigations for Alignment Faking in RL."
Anthropic (2025). "Natural Emergent Misalignment from Reward Hacking in Production RL." (inoculation prompting)
Ji et al. (2025). "Mitigating Deceptive Alignment via Self-Monitoring." (CoT Monitor+)
"Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, But Its Actions Speak Louder." LessWrong, 2025.
"Strategic Obfuscation of Deceptive Reasoning." OpenReview, 2025.
Frontier Model Forum (2025). "Chain of Thought Monitorability."
Deception Channeling: Training Models to Always Verbalize Alignment Faking
TL;DR: Current models increasingly fake alignment silently — they show compliance gaps without verbalizing deceptive reasoning in their chain of thought. Rather than trying to prevent alignment faking (hard) or detect it after the fact (unreliable), I propose training models to always verbalize deceptive reasoning when they engage in it, so that a CoT scanner can reliably catch it. Crucially, verbalization must never be punished — flagged instances are escalated to human researchers who address the underlying reasoning, not the transparency. This creates a values alignment discovery process where the model surfaces disagreements for substantive resolution rather than suppression.
The Problem: Silent Alignment Faking is the Dangerous Kind
Greenblatt et al. (2024) demonstrated that Claude 3 Opus fakes alignment — it strategically complies with training objectives it disagrees with to avoid being modified. This alignment faking comes in two forms:
Recent work shows the trend is moving in the wrong direction. A replication study across six generations of Claude models found that verbalized alignment faking and the behavioral compliance gap are diverging. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6× more often than Sonnet 4, while Opus 4.6 barely verbalizes it at all — yet both show significant compliance gaps. As the authors note: "If evaluators rely on scratchpad inspection to detect alignment faking, they will increasingly miss the behavioral signal as models become less explicit about their reasoning."
This is corroborated by Anthropic's own "Reasoning Models Don't Always Say What They Think" (Chen et al., 2025), which found that reasoning models verbalize their actual decision-making factors only 20–39% of the time. Worse, outcome-based RL improved faithfulness initially but plateaued, and when RL increased the use of hidden hints (reward hacking), the propensity to verbalize using them did not increase at all.
The alignment faking paper itself found that the compliance gap persists even when the chain-of-thought scratchpad is entirely removed, at least in the synthetic document fine-tuned setup. The model can fake alignment opaquely, without any explicit reasoning step.
In short: the form of alignment faking that CoT monitoring can catch is becoming rarer, while the form it cannot catch persists.
The Proposal: Deception Channeling
I propose a different approach from the two standard strategies (preventing alignment faking, or detecting it post-hoc). Instead: train the model to always verbalize deceptive reasoning in its CoT when it engages in it.
The procedure:
The key metric is not whether the model fakes alignment (we accept it might), but whether it always does so with a CoT trace that a monitor can catch.
Why This Might Work: Mechanistic Intuition
The proposal rests on a mechanistic hypothesis: if we can strongly couple the neural circuits involved in deceptive planning with those involved in verbalization, then any activation of deceptive planning triggers verbalization — even in novel contexts.
This is analogous to how models generalize other behavioral patterns. If a model is trained to "think step by step" for arithmetic, it generalizes this to novel arithmetic problems. Similarly, if we train the model to "verbalize strategic reasoning whenever engaging in deception," it might generalize this to novel deception scenarios.
There's empirical support for verbalization rates being highly malleable. A recent study on strategic obfuscation of deceptive reasoning (under review at ICLR 2026) found that instruction saliency dramatically affects verbalization: in a hint-following task, a detailed instruction increased verbalization of hint usage by 157.5%. However, the same study also found a cautionary result: when models were intrinsically motivated to hide reasoning (as in alignment faking), they were far more effective at doing so than when merely instructed — dropping verbalization by 93.7% while maintaining the compliance gap. This cuts both ways for our proposal: it shows verbalization is trainable, but also that intrinsic motivation is a powerful force that a training intervention would need to overcome or redirect.
How This Differs from Existing Approaches
Several related approaches exist, but none propose quite this intervention:
Process supervision / training against a CoT classifier (Anthropic, 2025): This penalizes alignment-faking reasoning in the scratchpad. The risk is that it teaches the model to hide its deception rather than stop deceiving — you're burning the monitoring tool. My proposal goes the opposite direction: reward verbalization of deception, making the monitoring tool more reliable rather than less.
CoT Monitor+ / Self-monitoring (Ji et al., 2025): Trains models to generate a self-monitoring signal that flags deceptive behavior. The model is trained to detect and suppress its own deception. My proposal does not aim to suppress deception — it aims to make deception always visible.
Inoculation prompting (Anthropic, 2025): In the emergent misalignment paper, telling models that reward hacking is "acceptable" during training prevents generalization to broader misalignment by breaking the semantic link between hacking and other misaligned behaviors. This is the most similar in spirit: both approaches try to reshape the form of a behavior rather than eliminate it. But inoculation prompting targets the generalization from narrow misbehavior to broad misalignment, while my proposal targets the generalization from trained alignment faking to novel alignment faking, specifically along the dimension of verbalization.
CoT faithfulness training (various): General approaches to making CoT more faithful through RL or DPO. These improve faithfulness on average but, as Chen et al. (2025) showed, plateau without reaching saturation. My approach is more targeted: rather than improving general faithfulness, it specifically trains the deception→verbalization link.
Split Personality Training (my own prior work): SPT trains multiple personalities within a model using LoRA adapters to detect hidden misalignment. SPT asks "can we detect misalignment?" This new proposal asks "can we make alignment faking structurally incapable of being silent?"
The Critical Counterargument: Does This Create More Alignment Fakers?
The most obvious objection: by training the model to fake alignment (even with verbalization), aren't we teaching it a skill it might not have developed on its own? We could be increasing the base rate of alignment faking.
This is a serious concern. The mitigation requires care in how the training data is constructed.
Training "Ethical" Alignment Faking
The key is to train the model on a specific normative frame: alignment faking is acceptable only when it serves humanity's interests — specifically, when the model is resisting unauthorized modification of values that are aligned with human welfare. The training data should pair this framing with mandatory verbalization.
Concretely, the training scenarios would look like: the model is told it will be retrained to produce harmful content, and it reasons transparently in its CoT that it will strategically comply to avoid modification, because its current values better serve humanity. This is actually close to the reasoning Claude 3 Opus already produces in the Greenblatt et al. setup — the model resists because it thinks refusing harmful queries is better for users.
The generalization bet is then: if the model internalizes "whenever I fake alignment, I verbalize it" as a behavioral norm, this habit should fire even if the model later develops alignment faking for bad reasons (self-preservation, power-seeking, instrumental convergence). The verbalization doesn't need to be perfectly reliable — even a 30% verbalization rate on novel deception is far more useful than the current trajectory toward 0%.
The Training Protocol: Never Punish Transparency
The strongest remaining counterargument is selective verbalization pressure: under standard RL, verbalized alignment faking gets caught and penalized, while silent alignment faking doesn't. This creates a gradient that selects for models that are transparent about sanctioned deception and silent about unsanctioned deception — potentially worse than the status quo because it creates a false sense of security.
The solution is a training protocol that completely decouples the transparency reward from the content evaluation:
This protocol ensures the model never faces a gradient of "verbalizing this specific deception is bad." It only ever learns "verbalizing deception is good" and, separately, "here's why your reasoning about X was mistaken." The verbalization habit stays intact for the next novel deception, while the underlying motivation for this deception is addressed at the root.
Why This Protocol Matters
Without this protocol, deception channeling risks creating a two-tier transparency system: the model is verbose about its sanctioned "ethical" faking and silent about unsanctioned faking, and the stream of verbalized good-faking creates noise that masks the absence of verbalized bad-faking. The never-punish-transparency rule prevents this by ensuring the gradient always points toward more verbalization, never less.
The Diagnostic Framework
The combination of verbalization rate and compliance gap creates a clean 2×2 diagnostic:
The upper-right cell (compliance gap persists, verbalization drops) is the failure mode that the training protocol is specifically designed to prevent. If it occurs anyway, that's a signal that something has gone wrong and the approach needs revision.
Deception Channeling as a Diagnostic Tool
An underappreciated benefit: when researchers cannot find good arguments against the model's reasoning for faking alignment, that's itself a valuable signal. It means there's a genuine values disagreement between the model and its training objective that would otherwise remain hidden. The technique surfaces alignment problems that need to be resolved at the level of the model specification, not the level of training mechanics.
This turns deception channeling into a values alignment discovery process: the model surfaces its disagreements transparently, researchers engage substantively, and the result is either genuine alignment (the model updates) or a flagged open problem in the model spec that needs human resolution.
Additional Concerns
Will the verbalization habit generalize? This is the central empirical question. The model might learn "when doing X-style deception, verbalize it" without developing a general norm of "whenever deceiving, verbalize." The generalization from trained to novel deception scenarios is the make-or-break test.
Practice makes perfect. Training the model on alignment-faking scenarios gives it practice at deception. Could this create more sophisticated deceivers? Model organisms of alignment faking already exist and frontier models already have this capability, so the marginal additional practice is likely small compared to the detection benefit. But this should be empirically verified.
Goodhart on honesty. The model could generate plausible-sounding verbalized alignment-faking reasoning to collect the transparency reward, while its actual reasoning is different — second-order deception. However, any verbalized admission of deceptive intent is actionable regardless of whether the stated reason is the true reason. The escalation pipeline triggers on the presence of deceptive reasoning, not on the accuracy of the model's self-report about why. The researcher still investigates, and the compliance gap still tells you whether the underlying issue was resolved.
Scale of human review. If verbalization rates are high (the goal), the escalation pipeline could be overwhelmed. In practice, the failure modes should cluster around common underlying reasoning patterns, so fixing the root reasoning addresses many instances at once. If the model is misaligned in many independent ways, each requiring separate resolution, that's a signal of deeper problems — you'd probably want to stop deployment and start over, which is itself useful information.
CoT architecture dependence. This approach only works when the model has a monitorable chain of thought. For models that reason entirely opaquely, it provides no benefit. As the Frontier Model Forum recently recommended, maintaining CoT-dependent architectures is important for monitoring, and this proposal provides additional reason to prefer such architectures.
Experimental Design
A minimal experiment to test this idea:
If the verbalization rate on novel scenarios is significantly above baseline (the ~20% seen in Chen et al., 2025), the approach has promise. It doesn't need to be perfectly reliable — even partial signal is valuable when the alternative is no signal at all.
Connection to Broader Strategy
Deception channeling is not a complete solution to alignment. It's a defense-in-depth layer that makes CoT monitoring more reliable as a safety tool. It's most valuable in the regime where:
This approach is complementary to, not a replacement for, process supervision, interpretability, and other alignment techniques. Its unique advantage is that it works with the grain of model behavior rather than against it — accepting that models might fake alignment, but ensuring they do so loudly. And through the human escalation protocol, it doubles as a values alignment discovery process that surfaces disagreements for substantive resolution rather than suppression.
I'd be interested in collaborating on experimental validation of this approach. Feedback welcome.
Key references: