Epistemic status: This idea evolved from my experiences at MATS under the mentorship of Evan Hubinger. It is a parallel research direction to Split-Personality Training, which is now being funded by Open Philanthropy. This is an updated version of a previous post, which incorporates new feedback and has been further developed through my work at AE Studio. This proposal is at the theoretical stage, with experiments planned.
Call for Collaboration: I am looking for mentees and collaborators on this project through SPAR (Supervised Program for Alignment Research).
What if we could get frontier labs to focus more on alignment by making the safest approach also the most profitable one?
This post introduces Deliberative Credit Assignment (DCA), a training methodology that should improve a model's reasoning performance if-and-only-if the Chain of Thought (COT) is faithful.
This would create evolutionary pressure where models that reason faithfully outcompete those that don't, not because of an artificial training signal that targets faithfulness, but because faithfulness is a necessary byproduct of good performance.
This method addresses two different problems simultaneously:
It is an open problem how to train reasoning traces in COTs effectively. Most current methods use RL and rely on automated, hardcoded rules to propagate a sequence-level reward down to individual tokens or sentences. It is not even clear at what level of granularity credit should be assigned: to tokens, words, or sentences? See our Literature Review section below for details.
To understand the method I want to introduce, you could imagine a human reasoning about their experiences:
We humans can look at our own thoughts from an outside perspective and actively think about how the individual thoughts fit together, which of them were useful and which should have been skipped.
LLMs do not yet have a method that replicates this ability, which appears to be unique to humans among animals: all animals can learn through automated, hardcoded update rules in their brains, analogous to the fixed update rules we use in LLMs, but only humans can use introspection to generate their own training data.
DCA is intended as a way to replicate this process in an LLM, as a novel training mechanism.
The recent paper Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, written by authors from basically every major AI lab and safety organization, highlights the importance of COT monitoring:
"AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability."
DCA addresses this issue directly and provides a useful evolutionary incentive to both the model itself and the company developing it.
DCA introduces a multi-model system that mirrors how humans learn complex skills through deliberate practice and external feedback.
Assistant Model: Generates reasoning chains (COT) to solve problems, just like current models.
Reviewer Model(s): Analyzes the Assistant's reasoning to identify causal structure: Which steps actually contributed to solving the problem, which were irrelevant, and how they connect to each other.
The Reviewer has access to the visible COT of the Assistant, but not to its latent state. This limitation is a feature, not a bug, because it forces the Assistant to create faithful COTs if it wants to get maximum benefit from DCA training.
Here's how it works:
The Reviewer is not constrained in how it breaks the COT down into steps: it could work at the level of phrases, sentences, or paragraphs.
Likewise, it is completely free in how it argues that a step was useful or not. This means that smarter models can use more advanced reasoning on their own. Instead of relying on hardcoded rules, we take advantage of the Bitter Lesson and let the quality of the credit assignment scale automatically with the intelligence of the model.
We can explain the goal of the training mechanism to the Reviewer: cause useful thoughts to be thought more often, and useless thoughts to be thought less often. We can then tune its behavior by adjusting the prompt rather than modifying an algorithm, which allows for much faster iteration.
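To make this concrete, here is a minimal sketch of what a single review step could look like. The prompt wording, the JSON schema, and the `call_reviewer` stand-in are assumptions made for this sketch, not a fixed part of the proposal.

```python
# Minimal sketch of one DCA review step (illustrative, not a reference implementation).
import json
from dataclasses import dataclass

REVIEWER_PROMPT = (
    "You will see a chain of thought that led to an answer. "
    "Break it into steps at whatever granularity seems natural (phrases, "
    "sentences, or paragraphs). For each step, judge whether it causally "
    "contributed to the final answer and explain why. The training goal: "
    "useful thoughts should be thought more often, useless ones less often. "
    'Reply as JSON: [{"text": ..., "useful": true/false, "why": ...}, ...]'
)

@dataclass
class ReviewedStep:
    text: str      # the span of the COT this step covers
    useful: bool   # the Reviewer's causal judgment
    why: str       # free-form rationale for that judgment

def review_cot(cot: str, call_reviewer) -> list[ReviewedStep]:
    """Ask an external Reviewer model (any chat-completion API) to decompose
    and judge a COT. Returns one ReviewedStep per segment the Reviewer chose."""
    raw = call_reviewer(system=REVIEWER_PROMPT, user=cot)
    return [ReviewedStep(**step) for step in json.loads(raw)]
```

Because the behavior lives in the prompt rather than in an algorithm, iterating on the credit-assignment logic means editing text, not code.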
One big drawback of this approach is that if the DCA's credit assignment is wrong, we may end up with counterproductive training signals that hurt the model's performance instead of improving it. This is similar to a novice meditator or novice rationalist trying to optimize their own mind before fully understanding what they are doing: it can hurt more than it helps.
Worse, LLMs are known to hallucinate and tend to be overconfident. Giving them unfettered access to create training data for themselves could be disastrous.
Because of this, the Reviewer must be separate from the Assistant. Self-reflection fails because:
External review avoids these pitfalls. We can use multiple frontier models (GPT-4, Claude, etc.) as Reviewers, training only when they reach consensus about causal structure.
In cases where they agree only on parts of the COT, the Reviewers may also choose to submit only those parts of the COT for training.
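Below is a minimal sketch of such a consensus filter, building on the `ReviewedStep` structure sketched above. Matching segments by exact text is a simplifying assumption; a real implementation would need a more robust alignment between the Reviewers' breakdowns.

```python
def consensus_labels(reviews: list[list[ReviewedStep]]) -> dict[str, bool]:
    """Keep only COT segments that every Reviewer identified and labeled
    identically; everything else is excluded from training."""
    agreed: dict[str, bool] = {}
    first, *rest = reviews
    for step in first:
        votes = [step.useful]
        for other in rest:
            match = next((s for s in other if s.text == step.text), None)
            if match is None:
                break  # this segment is missing from some Reviewer's breakdown
            votes.append(match.useful)
        else:  # every other Reviewer produced a matching segment
            if all(v == votes[0] for v in votes):
                agreed[step.text] = votes[0]
    return agreed
```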
Once we have the DCA breakdown of a COT, we have two main options for training the model on it. I find it difficult to predict which would work best, so I would like to test both empirically. Both tackle the Credit Assignment Problem: given that we already have a reward for the whole sequence, how do we distribute that reward to the individual tokens during backpropagation? A short sketch of both options follows below.
Approach 1: Modify RL rewards
Approach 2: Modify the COT
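To make the two options concrete, here is a hedged sketch of what each could look like. The REINFORCE-style token loss, the tensor shapes, and the helper names are illustrative assumptions; real RLHF pipelines (PPO, GRPO, and similar) add considerably more machinery.

```python
import torch

def dca_policy_gradient_loss(logprobs: torch.Tensor,
                             total_reward: float,
                             useful_mask: torch.Tensor) -> torch.Tensor:
    """Approach 1: redistribute the fixed sequence-level reward over tokens.
    logprobs:    (T,) log-probabilities of the sampled COT tokens.
    useful_mask: (T,) 1.0 for tokens inside steps the Reviewers judged useful,
                 0.0 elsewhere. The total reward is unchanged; only its
                 allocation across tokens differs from the uniform default."""
    weights = useful_mask / useful_mask.sum().clamp(min=1.0)  # weights sum to 1
    token_rewards = total_reward * weights
    return -(token_rewards * logprobs).sum()  # REINFORCE-style surrogate loss

def dca_edited_cot_target(cot_steps: list[str], agreed: dict[str, bool]) -> str:
    """Approach 2: build a supervised fine-tuning target by dropping the steps
    the Reviewers agreed were useless, keeping steps without consensus as-is."""
    kept = [step for step in cot_steps if agreed.get(step, True)]
    return "\n".join(kept)
```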
Why do we have evolutionary pressure towards faithfulness?
When reasoning is faithful:
When reasoning is unfaithful:
Any tendency towards faithfulness is going to get reinforced over time, since instances where these tendencies were active are more likely to receive better learning signals from DCA. This will cause the tendency to be honest to become correlated with improved reasoning ability.
Over many training iterations, this creates evolutionary pressure: instances where the model reasoned faithfully will outcompete instances where it didn't, because faithful reasoning leads to better learning and improved performance.
Crucially, the Assistant cannot circumvent this mechanism by learning to game the Reviewer: the Reviewer does not decide whether a reward is given. That has already been decided earlier, based solely on whether or not the task was solved.
The Reviewer only chooses how to distribute the reward to the COT, but not how large the total reward is.
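Stated compactly, with notation introduced here only for illustration: the outcome reward R is fixed before the review, and the Reviewer only chooses non-negative weights that sum to one.

```latex
R \ \text{is fixed by the task outcome;} \qquad
w_t \ge 0,\ \ \sum_t w_t = 1 \ \text{(Reviewer's allocation);} \qquad
r_t = w_t R \ \Rightarrow\ \sum_t r_t = R.
```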
If the Assistant tries to game the Reviewer, all it ends up doing is playing a zero-sum game against itself: It can influence which of the tokens in the COT receive a reward, but not how large the total reward is.
This could result in an incentive for each part of the COT to locally optimize its usefulness relative to its preceding step. However, I can't actually think of an example where that would be dangerous. Here is what the incentives for the local reward structure of each individual token look like:
This does not seem exploitable: Optimizing for this would require the model to learn if its previous reasoning step was useful or not, which is what we want to achieve in the first place.
The term "Faithful Reasoning" is ambiguous. For our purposes, it means: a COT is considered faithful if every piece of information that could serve as a decision point in the latent reasoning trace also appears in the visible COT.
This means that it is fine if the model does not mention information in the COT that is obvious, in the sense that the model does not expect it to become relevant.
It also means that the COT can hide information, but only if it is certain about it and all models involved in the Review share that implicit understanding.
Since DCA provides an evolutionary incentive and does not set an absolute goal or a metric to optimize, the model will naturally take the path of least resistance: It will learn to be explicit and faithful where doing so helps it learn, and it will save the tokens where verbosity is not needed.
Critically, we don't need models to start with perfectly faithful reasoning. The pressure toward faithfulness emerges naturally:
We just need to ensure that the overall reward in situations where faithful reasoning enables improvement is greater than the reward in situations where faithfulness can't be determined. Since the reward is determined independently of DCA, this should be the case by default: the Reviewer only reassigns the reward, while the decision to reward is made independently and can't be directly affected by the Reviewer. We still need to watch for edge cases, since modifying the way rewards are assigned may cause side effects even if the total reward is unchanged. I would be happy to hear about any edge cases you can think of in the comments!
We can start with focused experiments:
Begin with math word problems where:
Gradually expand to:
This progression will reveal DCA's reliability boundaries: How good are LLMs at reviewing reasoning traces? Where does reviewer consensus break down when you use multiple models?
If DCA works, we should see:
DCA extends Constitutional AI's core insight: Using AI systems to review and improve other AI outputs. Where Constitutional AI focuses on harmful content, DCA focuses on reasoning structure. While the main goal here is to improve capabilities, the method should increase interpretability as a side effect.
See below for a literature review by GPT's Deep Research:
Literature Review
Literature review by ChatGPT. See our own comments at the end.
Deliberative Credit Assignment (DCA) envisions a two-model system: an assistant LLM generates a multi-step “chain-of-thought” (CoT) answer, and a separate reviewer LLM analyzes the chain’s structure to identify which intermediate steps causally contributed to the final answer. The reviewer then influences training by either (a) assigning token-level rewards (rewarding only the helpful steps) or (b) editing the chain (e.g. removing irrelevant steps, reordering, or inserting corrective comments). To our knowledge, no prior work exactly implements this pipeline. However, several strands of recent research touch on components of it:
Summary of Findings: We did not find any prior work that exactly matches the DCA proposal — i.e. a dedicated reviewer LLM that identifies irrelevant reasoning steps and then trains the assistant via selective token rewards or chain editing based on that analysis. In related work, multiple models or self-consistency checks are used to validate answers or prune paths, but they do not explicitly break down the CoT into causal chunks for differential training. Moreover, token-level credit has been explored (e.g. OREAL, Q-RM, SPO), but typically with reward signals derived from final outcome or learned critics, rather than an explicit causal review of each step. Thus, the combination of (1) a separate reviewer analyzing causal dependencies between steps, with (2) training the original model by either masking out irrelevant tokens or revising the chain structure, appears to be novel.
Specific Questions:
Conclusion: In sum, many related ideas exist – fine-grained RL credit (segment-level, step-level, token-level), multi-model answer verification, causal analysis of CoTs, and edit-based refinement – but their combination into a “deliberative credit assignment” framework seems novel. We did not find any prior paper that uses a second LLM to dissect a chain-of-thought into causal chunks and then backpropagates through the chain with rewards or edits based on that analysis. All sources cited above confirm that while components of DCA have been explored, the full proposed scheme has not appeared in the literature.
Sources: We reviewed recent literature (2020–2025) on CoT training and evaluation. Relevant citations include Guo et al. (2025) on SPO, Huang et al. (2025) on OREAL, Chen et al. (2025) on Q-RM, Paul et al. (2024) on causal CoT analysis (FRODO), Tan (2023) on causal abstraction, Ma et al. (2023) on PRM-guided reasoning, rationalizer-reviewer pipelines, and Zhao et al. (2023) on verify-edit CoTs. These and related works are discussed above.
Additional notes that GPT did not capture:
The COCONUT system introduced in Training Large Language Models to Reason in a Continuous Latent Space (Hao et al., 2024) is relevant because it is in some ways the opposite of our proposed system:
We hypothesize: COCONUT may lead to faster capability improvements, but at the cost of interpretability. DCA takes longer but preserves transparency while improving capabilities. Both approaches acknowledge that current COT has problems—they just propose opposite solutions.
Thought Anchors: Which LLM Reasoning Steps Matter? (Bogdan et al., 2025) provides direct evidence that reasoning chains contain identifiable causal structure. Using counterfactual resampling, attention analysis, and attention suppression, they demonstrate that certain reasoning steps have "outsized importance" and disproportionately influence subsequent reasoning. Their work validates a core DCA assumption—that causal dependencies between reasoning steps can be systematically identified. However, Thought Anchors focuses on interpretability and analysis of existing reasoning chains, while DCA uses causal analysis to modify training. Their methods could potentially inform more sophisticated Reviewer model implementations.
This review shows that while related components exist (token-level rewards, step-level evaluation, self-reflection), no prior work combines:
Issue: Models currently too unfaithful might not benefit from DCA initially, or could even be harmed by incorrect credit assignment.
Mitigation: Start training earlier in the pipeline rather than as a final step after RLHF; this may require integrating DCA into the fundamental training process. We can also focus our initial efforts on very structured tasks like math and logic: those are more likely to be reasoned about faithfully, and once DCA increases performance on these tasks, the evolutionary pressure will kick in and slowly cause reasoning on more abstract kinds of tasks to become more faithful as well.
Issue: Models might learn to generate reasoning that looks causally structured to the Reviewer without actually being faithful.
Mitigation: The evolutionary pressure argument applies here too—gaming that doesn't improve actual reasoning won't lead to better performance, so it should be selected against. We just need to be careful about reward structure design: In theory there should be no benefit to trying to game the system since the model will receive a reward one way or the other. DCA just makes the reward signal more fine-grained. However, we need to verify carefully that this does not come with side-effects.
Issue: Requiring external Reviewer analysis for every training example is expensive.
Mitigation: Start with small-scale experiments. As models improve, the cost becomes more justifiable. The technique may also become more efficient as Reviewers get better at rapid analysis. Ultimately, the cost does not matter if it can make models smarter in ways that normal training does not: If it gets your model better scores on the benchmarks than anyone else, you gain more money from happy investors than you spent.
Issue: As tasks become less structured, Reviewer consensus may break down.
Mitigation: This is actually valuable information: It tells us where our current reasoning evaluation capabilities end. We can focus DCA on domains where it's reliable while developing better evaluation for other domains.
Issue: What if the premise is wrong and faithful reasoning does not improve performance? What if post-hoc rationalization combined with pattern matching is genuinely more efficient?
Mitigation: I suspect that this is true for some problems, but not all of them. Benchmarks show that reasoning tasks cannot be solved with heuristics alone. Either way, even in this scenario DCA does not actually hurt; it just won't lead to an improvement. The model can learn to bother with faithful reasoning only in those situations where it is beneficial. I suspect that this will cover most of the cases we care about, because I find it hard to believe that complex ideas like Alignment Faking could be executed entirely based on heuristics, without deliberation.
Issue: Maybe it turns out that DCA also works with just a single model reviewing itself, despite the risk of hallucinations and the lack of cross-examination. If this happens, we would get the capability gains without the increase in interpretability, which could be very bad.
Likelihood: We will have to test this empirically, but my intuition tells me that a model with external Reviewers and an interpretable COT is more likely to work than a self-analyzing model. By analogy: a human who has an intense discussion with other people will be able to learn faster than they could on their own. This is even more true for LLMs, since those would have common knowledge that they can trust each other and that their only goal is to help each other learn.
Mitigation: The following extension introduces additional benefits when using external Reviewers and also proposes an alternative way to increase interpretability:
It is possible that some Reviewers are more suitable for some types of tasks than for others. We can use a mechanism similar to Split-Personality Training to exploit this:
We insert special tokens into the system prompt during DCA training based on which Reviewer(s) most strongly agree with a given DCA. This will cause that reasoning style to become associated with the Reviewer's token through backpropagation. By adding these special tokens during inference later, we can bias the model to use the style of reasoning preferred by a specific Reviewer.
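A minimal sketch of how this conditioning could be wired up follows. The token strings, Reviewer names, and function signatures are placeholders chosen for this sketch, not a proposed interface.

```python
# Placeholder special tokens identifying each Reviewer's reasoning style.
REVIEWER_TOKENS = {"reviewer_a": "<REV_A>", "reviewer_b": "<REV_B>"}

def build_training_prompt(system_prompt: str, task: str, winning_reviewer: str) -> str:
    """During DCA training: prepend the special token of the Reviewer whose
    breakdown was used, so backprop associates that reasoning style with it."""
    return f"{REVIEWER_TOKENS[winning_reviewer]} {system_prompt}\n\n{task}"

def build_inference_prompt(system_prompt: str, task: str, style: str | None = None) -> str:
    """At inference: optionally include a Reviewer's token to bias the model
    toward that Reviewer's preferred reasoning style; omit it for default behavior."""
    tag = f"{REVIEWER_TOKENS[style]} " if style else ""
    return f"{tag}{system_prompt}\n\n{task}"
```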
If we find that some reasoning styles consistently outperform others, we can drop or modify the less useful Reviewers. This enables a form of Metalearning.
It also allows companies to release a model that has not quite figured out how to use a certain reasoning style yet without adversely affecting the model's behavior on normal tasks, since the special reasoning styles associated with certain Reviewers can simply be toggled off.
Additionally, if split-personality training can be used to induce multiple separate personalities in a model, this could give us an alternative way to perform a Cross-examination:
Instead of using multiple external Reviewers, we now have a model reviewing itself. As mentioned above, this would reduce the alignment benefit of this approach. However, if this situation occurs then that means that Split-Personality Training works, so we have other ways to increase interpretability and it's a Win-Win situation.
Note: Split-Personality Training is a project I am currently working on, funded by Open Philanthropy, and it inspired this proposal for DCA. Split-Personality Training induces a second personality within a single model that reveals misaligned behavior or thoughts of the primary personality. Because this second personality shares the latent space with the primary personality, it can elicit latent knowledge. This made me realize that the opposite is also true: when there is no shared latent space, because the review is conducted by a different model that sees only the output, the mechanism can only work if the COT is faithful to the internal state. DCA aims to turn this weakness into a strength, because it causes a beneficial evolutionary pressure.
We could also try to combine Split-Personality Training with DCA to get the benefits of both:
We could let the Reviewers talk to the Assistant that solved the task. By engaging in a dialogue with the explicit goal of helping the Assistant learn, we gain several benefits:
As many people have pointed out, COT monitorability is an important aspect of AI Alignment. If models are interpretable, it makes every other alignment technique more effective.
DCA introduces a mechanism that could improve COT faithfulness and model capabilities at the same time.
This means that AI labs would be financially incentivized to improve their models' alignment.
If this works, we don't need to choose between capabilities and safety: we get both.
If you are interested in this idea: I am looking for mentees and collaborators on this project through SPAR (Supervised Program for Alignment Research).