Epistemic status: This idea evolved from my experiences at MATS under the mentorship of Evan Hubinger. It is a parallel research direction to Split-Personality Training, which is now being funded by Open Philanthropy. This is an updated version of a previous post, which incorporates new feedback and has been further developed through my work at AE Studio. This proposal is at the theoretical stage, with experiments planned.
Call for Collaboration: I am looking for mentees and collaborators on this project through SPAR (Supervised Program for Alignment Research).
What if we could get frontier labs to focus more on alignment by making the safest approach also the most profitable one?
This post introduces Deliberative Credit Assignment (DCA), a training methodology that should improve a model's reasoning performance if-and-only-if the Chain of Thought (COT) is faithful.
This would create evolutionary pressure where models that reason faithfully outcompete those that don't, not because of an artificial training signal that targets faithfulness, but because faithfulness is a necessary byproduct of good performance.
This method addresses two different problems simultaneously:
It is an open problem how to train reasoning traces in COTs effectively. Most current methods use RL and rely on automated, hardcoded rules to propagate a sequence-level reward down to individual tokens or sentences. It is not even clear at what level of granularity credit should be assigned: to tokens, words, or sentences? See our Literature Review section below for details.
To understand the method I want to introduce, you could imagine a human reasoning about their experiences:
We humans can look at our own thoughts from an outside perspective and actively think about how the individual thoughts fit together, which of them were useful and which should have been skipped.
LLMs do not yet have a method that replicates this ability, which appears to be unique to humans among animals: all animals can learn through automated, hardcoded update rules in their brains, analogous to the fixed update rules we use in LLMs, but only humans can use introspection to generate their own training data.
DCA is intended as a way to replicate this process in an LLM, as a novel training mechanism.
The recent paper Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, written by authors from basically every major AI lab and safety organization, highlights the importance of COT monitoring:
"AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability."
DCA addresses this issue directly and provides a useful evolutionary incentive to both the model itself and the company developing it.
DCA introduces a multi-model system that mirrors how humans learn complex skills through deliberate practice and external feedback.
Assistant Model: Generates reasoning chains (COT) to solve problems, just like current models.
Reviewer Model(s): Analyzes the Assistant's reasoning to identify causal structure: Which steps actually contributed to solving the problem, which were irrelevant, and how they connect to each other.
The Reviewer has access to the visible COT of the Assistant, but not to its latent state. This limitation is a feature, not a bug, because it forces the Assistant to create faithful COTs if it wants to get maximum benefit from DCA training.
Here's how it works:
The Reviewer is not constrained in how it breaks the COT down into steps: it could work at the level of phrases, sentences, or paragraphs.
Likewise, it is completely free in how it argues that a step was useful or not. This means that smarter models can use more advanced reasoning on their own. Instead of relying on hardcoded rules, we take advantage of the Bitter Lesson and let the quality of the credit assignment scale automatically with the intelligence of the model.
We can explain the goal of the training mechanism to the Reviewer: cause useful thoughts to be thought more often, and useless thoughts to be thought less often. We can then tune its behavior by adjusting the prompt rather than modifying an algorithm, which allows for much faster iteration.
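To make this concrete, here is a minimal sketch of what a single review step could look like. The prompt wording, the JSON schema, and the `call_reviewer` stand-in are assumptions made for this sketch, not a fixed part of the proposal.

```python
# Minimal sketch of one DCA review step (illustrative, not a reference implementation).
import json
from dataclasses import dataclass

REVIEWER_PROMPT = (
    "You will see a chain of thought that led to an answer. "
    "Break it into steps at whatever granularity seems natural (phrases, "
    "sentences, or paragraphs). For each step, judge whether it causally "
    "contributed to the final answer and explain why. The training goal: "
    "useful thoughts should be thought more often, useless ones less often. "
    'Reply as JSON: [{"text": ..., "useful": true/false, "why": ...}, ...]'
)

@dataclass
class ReviewedStep:
    text: str      # the span of the COT this step covers
    useful: bool   # the Reviewer's causal judgment
    why: str       # free-form rationale for that judgment

def review_cot(cot: str, call_reviewer) -> list[ReviewedStep]:
    """Ask an external Reviewer model (any chat-completion API) to decompose
    and judge a COT. Returns one ReviewedStep per segment the Reviewer chose."""
    raw = call_reviewer(system=REVIEWER_PROMPT, user=cot)
    return [ReviewedStep(**step) for step in json.loads(raw)]
```

Because the behavior lives in the prompt rather than in an algorithm, iterating on the credit-assignment logic means editing text, not code.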
One big drawback of this approach is that if the DCA's credit assignment is wrong, we may end up with counterproductive training signals that hurt the model's performance instead of improving it. This is similar to a novice meditator or novice rationalist trying to optimize their own mind before fully understanding what they are doing: it can hurt more than it helps.
Worse, LLMs are known to hallucinate and tend to be overconfident. Giving them unfettered access to create training data for themselves could be disastrous.
Because of this, the Reviewer must be separate from the Assistant. Self-reflection fails because:
External review avoids these pitfalls. We can use multiple frontier models (GPT-4, Claude, etc.) as Reviewers, training only when they reach consensus about causal structure.
In cases where they agree only on parts of the COT, the Reviewers may also choose to submit only those parts of the COT for training.
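Below is a minimal sketch of such a consensus filter, building on the `ReviewedStep` structure sketched above. Matching segments by exact text is a simplifying assumption; a real implementation would need a more robust alignment between the Reviewers' breakdowns.

```python
def consensus_labels(reviews: list[list[ReviewedStep]]) -> dict[str, bool]:
    """Keep only COT segments that every Reviewer identified and labeled
    identically; everything else is excluded from training."""
    agreed: dict[str, bool] = {}
    first, *rest = reviews
    for step in first:
        votes = [step.useful]
        for other in rest:
            match = next((s for s in other if s.text == step.text), None)
            if match is None:
                break  # this segment is missing from some Reviewer's breakdown
            votes.append(match.useful)
        else:  # every other Reviewer produced a matching segment
            if all(v == votes[0] for v in votes):
                agreed[step.text] = votes[0]
    return agreed
```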
Once we have the DCA breakdown of a COT, we have two main options for training the model on it. I find it difficult to predict which would work best, so I would like to test both empirically. Both tackle the Credit Assignment Problem: given that we already have a reward for the whole sequence, how do we distribute that reward to the individual tokens during backpropagation? A short sketch of both options follows below.
Approach 1: Modify RL rewards
Approach 2: Modify the COT
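To make the two options concrete, here is a hedged sketch of what each could look like. The REINFORCE-style token loss, the tensor shapes, and the helper names are illustrative assumptions; real RLHF pipelines (PPO, GRPO, and similar) add considerably more machinery.

```python
import torch

def dca_policy_gradient_loss(logprobs: torch.Tensor,
                             total_reward: float,
                             useful_mask: torch.Tensor) -> torch.Tensor:
    """Approach 1: redistribute the fixed sequence-level reward over tokens.
    logprobs:    (T,) log-probabilities of the sampled COT tokens.
    useful_mask: (T,) 1.0 for tokens inside steps the Reviewers judged useful,
                 0.0 elsewhere. The total reward is unchanged; only its
                 allocation across tokens differs from the uniform default."""
    weights = useful_mask / useful_mask.sum().clamp(min=1.0)  # weights sum to 1
    token_rewards = total_reward * weights
    return -(token_rewards * logprobs).sum()  # REINFORCE-style surrogate loss

def dca_edited_cot_target(cot_steps: list[str], agreed: dict[str, bool]) -> str:
    """Approach 2: build a supervised fine-tuning target by dropping the steps
    the Reviewers agreed were useless, keeping steps without consensus as-is."""
    kept = [step for step in cot_steps if agreed.get(step, True)]
    return "\n".join(kept)
```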
Why do we have evolutionary pressure towards faithfulness?
When reasoning is faithful:
When reasoning is unfaithful:
Any tendency towards faithfulness is going to get reinforced over time, since instances where these tendencies were active are more likely to receive better learning signals from DCA. This will cause the tendency to be honest to become correlated with improved reasoning ability.
Over many training iterations, this creates evolutionary pressure: instances where the model reasoned faithfully will outcompete instances where it didn't, because faithful reasoning leads to better learning and improved performance.
Crucially, the Assistant cannot circumvent this mechanism by learning to game the Reviewer: the Reviewer does not decide whether a reward is given. That has already been decided earlier, based solely on whether or not the task was solved.
The Reviewer only chooses how to distribute the reward to the COT, but not how large the total reward is.
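Stated compactly, with notation introduced here only for illustration: the outcome reward R is fixed before the review, and the Reviewer only chooses non-negative weights that sum to one.

```latex
R \ \text{is fixed by the task outcome;} \qquad
w_t \ge 0,\ \ \sum_t w_t = 1 \ \text{(Reviewer's allocation);} \qquad
r_t = w_t R \ \Rightarrow\ \sum_t r_t = R.
```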
If the Assistant tries to game the Reviewer, all it ends up doing is playing a zero-sum game against itself: It can influence which of the tokens in the COT receive a reward, but not how large the total reward is.
This could result in an incentive for each part of the COT to locally optimize its usefulness relative to its preceding step. However, I can't actually think of an example where that would be dangerous. Here is what the incentives for the local reward structure of each individual token look like:
This does not seem exploitable: Optimizing for this would require the model to learn if its previous reasoning step was useful or not, which is what we want to achieve in the first place.
The term "Faithful Reasoning" is ambiguous. For our purposes, it means: a COT is considered faithful if every piece of information that could serve as a decision point in the latent reasoning trace also appears in the visible COT.
This means that it is fine if the model does not mention information in the COT that is obvious, in the sense that the model does not expect it to become relevant.
It also means that the COT can hide information, but only if it is certain about it and all models involved in the Review share that implicit understanding.
Since DCA provides an evolutionary incentive and does not set an absolute goal or a metric to optimize, the model will naturally take the path of least resistance: It will learn to be explicit and faithful where doing so helps it learn, and it will save the tokens where verbosity is not needed.
Critically, we don't need models to start with perfectly faithful reasoning. The pressure toward faithfulness emerges naturally:
We just need to ensure that the overall reward in situations where faithful reasoning enables improvement is greater than the reward in situations where faithfulness can't be determined. Since the reward is determined independently of DCA, this should be the case by default: the Reviewer only reassigns the reward, while the decision to reward is made independently and can't be directly affected by the Reviewer. We still need to watch for edge cases, since modifying the way rewards are assigned may cause side effects even if the total reward is unchanged. I would be happy to hear about any edge cases you can think of in the comments!
We can start with focused experiments:
Begin with math word problems where:
Gradually expand to:
This progression will reveal DCA's reliability boundaries: How good are LLMs at reviewing reasoning traces? Where does reviewer consensus break down when you use multiple models?
If DCA works, we should see:
DCA extends Constitutional AI's core insight: Using AI systems to review and improve other AI outputs. Where Constitutional AI focuses on harmful content, DCA focuses on reasoning structure. While the main goal here is to improve capabilities, the method should increase interpretability as a side effect.
See below for a literature review by GPT's Deep Research:
Literature Review
Literature review by ChatGPT. See our own comments at the end.
Deliberative Credit Assignment (DCA) envisions a two-model system: an assistant LLM generates a multi-step “chain-of-thought” (CoT) answer, and a separate reviewer LLM analyzes the chain’s structure to identify which intermediate steps causally contributed to the final answer. The reviewer then influences training by either (a) assigning token-level rewards (rewarding only the helpful steps) or (b) editing the chain (e.g. removing irrelevant steps, reordering, or inserting corrective comments). To our knowledge, no prior work exactly implements this pipeline. However, several strands of recent research touch on components of it:
Summary of Findings: We did not find any prior work that exactly matches the DCA proposal — i.e. a dedicated reviewer LLM that identifies irrelevant reasoning steps and then trains the assistant via selective token rewards or chain editing based on that analysis. In related work, multiple models or self-consistency checks are used to validate answers or prune paths, but they do not explicitly break down the CoT into causal chunks for differential training. Moreover, token-level credit has been explored (e.g. OREAL, Q-RM, SPO), but typically with reward signals derived from final outcome or learned critics, rather than an explicit causal review of each step. Thus, the combination of (1) a separate reviewer analyzing causal dependencies between steps, with (2) training the original model by either masking out irrelevant tokens or revising the chain structure, appears to be novel.
Specific Questions:
Conclusion: In sum, many related ideas exist – fine-grained RL credit (segment-level, step-level, token-level), multi-model answer verification, causal analysis of CoTs, and edit-based refinement – but their combination into a “deliberative credit assignment” framework seems novel. We did not find any prior paper that uses a second LLM to dissect a chain-of-thought into causal chunks and then backpropagates through the chain with rewards or edits based on that analysis. All sources cited above confirm that while components of DCA have been explored, the full proposed scheme has not appeared in the literature.
Sources: We reviewed recent literature (2020–2025) on CoT training and evaluation. Relevant citations include Guo et al. (2025) on SPO, Huang et al. (2025) on OREAL, Chen et al. (2025) on Q-RM, Paul et al. (2024) on causal CoT analysis (FRODO), Tan (2023) on causal abstraction, Ma et al. (2023) on PRM-guided reasoning, rationalizer-reviewer pipelines, and Zhao et al. (2023) on verify-edit CoTs. These and related works are discussed above.
Additional notes that GPT did not capture:
The COCONUT system introduced in Training Large Language Models to Reason in a Continuous Latent Space (Hao et al., 2024) is relevant because it is in some ways the opposite of our proposed system:
We hypothesize: COCONUT may lead to faster capability improvements, but at the cost of interpretability. DCA takes longer but preserves transparency while improving capabilities. Both approaches acknowledge that current COT has problems—they just propose opposite solutions.
Thought Anchors: Which LLM Reasoning Steps Matter? (Bogdan et al., 2025) provides direct evidence that reasoning chains contain identifiable causal structure. Using counterfactual resampling, attention analysis, and attention suppression, they demonstrate that certain reasoning steps have "outsized importance" and disproportionately influence subsequent reasoning. Their work validates a core DCA assumption—that causal dependencies between reasoning steps can be systematically identified. However, Thought Anchors focuses on interpretability and analysis of existing reasoning chains, while DCA uses causal analysis to modify training. Their methods could potentially inform more sophisticated Reviewer model implementations.
This review shows that while related components exist (token-level rewards, step-level evaluation, self-reflection), no prior work combines:
Issue: Models currently too unfaithful might not benefit from DCA initially, or could even be harmed by incorrect credit assignment.
Mitigation: Start training earlier in the pipeline rather than as a final step after RLHF; this may require integrating DCA into the fundamental training process. We can also focus our initial efforts on very structured tasks like math and logic: those are more likely to be reasoned about faithfully, and once DCA increases performance on these tasks, the evolutionary pressure will kick in and slowly cause reasoning on more abstract kinds of tasks to become more faithful as well.
Issue: Models might learn to generate reasoning that looks causally structured to the Reviewer without actually being faithful.
Mitigation: The evolutionary pressure argument applies here too—gaming that doesn't improve actual reasoning won't lead to better performance, so it should be selected against. We just need to be careful about reward structure design: In theory there should be no benefit to trying to game the system since the model will receive a reward one way or the other. DCA just makes the reward signal more fine-grained. However, we need to verify carefully that this does not come with side-effects.
Issue: Requiring external Reviewer analysis for every training example is expensive.
Mitigation: Start with small-scale experiments. As models improve, the cost becomes more justifiable. The technique may also become more efficient as Reviewers get better at rapid analysis. Ultimately, the cost does not matter if it can make models smarter in ways that normal training does not: If it gets your model better scores on the benchmarks than anyone else, you gain more money from happy investors than you spent.
Issue: As tasks become less structured, Reviewer consensus may break down.
Mitigation: This is actually valuable information: It tells us where our current reasoning evaluation capabilities end. We can focus DCA on domains where it's reliable while developing better evaluation for other domains.
Issue: What if the premise is wrong and faithful reasoning does not improve performance? What if post-hoc rationalization combined with pattern matching is genuinely more efficient?
Mitigation: I suspect that this is true for some problems, but not all of them. Benchmarks show that reasoning tasks cannot be solved with heuristics alone. Either way, even in this scenario DCA does not actually hurt; it just won't lead to an improvement. The model can learn to bother with faithful reasoning only in those situations where it is beneficial. I suspect that this will cover most of the cases we care about, because I find it hard to believe that complex ideas like Alignment Faking could be executed entirely based on heuristics, without deliberation.
Issue: Maybe it turns out that DCA also works with just a single model reviewing itself, despite the risk of hallucinations and the lack of cross-examination. If this happens, we would get the capability gains without the increase in interpretability, which could be very bad.
Likelihood: We will have to test this empirically, but my intuition tells me that a model with external Reviewers and an interpretable COT is more likely to work than a self-analyzing model. By analogy: a human who has an intense discussion with other people will be able to learn faster than they could on their own. This is even more true for LLMs, since those would have common knowledge that they can trust each other and that their only goal is to help each other learn.
Mitigation: The following extension introduces additional benefits when using external Reviewers and also proposes an alternative way to increase interpretability:
It is possible that some Reviewers are more suitable for some types of tasks than for others. We can use a mechanism similar to Split-Personality Training to exploit this:
We insert special tokens into the system prompt during DCA training based on which Reviewer(s) most strongly agree with a given DCA. This will cause that reasoning style to become associated with the Reviewer's token through backpropagation. By adding these special tokens during inference later, we can bias the model to use the style of reasoning preferred by a specific Reviewer.
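A minimal sketch of how this conditioning could be wired up follows. The token strings, Reviewer names, and function signatures are placeholders chosen for this sketch, not a proposed interface.

```python
# Placeholder special tokens identifying each Reviewer's reasoning style.
REVIEWER_TOKENS = {"reviewer_a": "<REV_A>", "reviewer_b": "<REV_B>"}

def build_training_prompt(system_prompt: str, task: str, winning_reviewer: str) -> str:
    """During DCA training: prepend the special token of the Reviewer whose
    breakdown was used, so backprop associates that reasoning style with it."""
    return f"{REVIEWER_TOKENS[winning_reviewer]} {system_prompt}\n\n{task}"

def build_inference_prompt(system_prompt: str, task: str, style: str | None = None) -> str:
    """At inference: optionally include a Reviewer's token to bias the model
    toward that Reviewer's preferred reasoning style; omit it for default behavior."""
    tag = f"{REVIEWER_TOKENS[style]} " if style else ""
    return f"{tag}{system_prompt}\n\n{task}"
```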
If we find that some reasoning styles consistently outperform others, we can drop or modify the less useful Reviewers. This enables a form of Metalearning.
It also allows companies to release a model that has not quite figured out how to use a certain reasoning style yet without adversely affecting the model's behavior on normal tasks, since the special reasoning styles associated with certain Reviewers can simply be toggled off.
Additionally, if split-personality training can be used to induce multiple separate personalities in a model, this could give us an alternative way to perform a Cross-examination:
Instead of using multiple external Reviewers, we now have a model reviewing itself. As mentioned above, this would reduce the alignment benefit of this approach. However, if this situation occurs then that means that Split-Personality Training works, so we have other ways to increase interpretability and it's a Win-Win situation.
Note: Split-Personality Training is a project I am currently working on, funded by Open Philanthropy, and it inspired this proposal for DCA. Split-Personality Training induces a second personality within a single model that reveals misaligned behavior or thoughts of the primary personality. Because this second personality shares the latent space with the primary personality, it can elicit latent knowledge. This made me realize that the opposite is also true: when there is no shared latent space, because the review is conducted by a different model that sees only the output, the mechanism can only work if the COT is faithful to the internal state. DCA aims to turn this weakness into a strength, because it causes a beneficial evolutionary pressure.
We could also try to combine Split-Personality Training with DCA to get the benefits of both:
We could let the Reviewers talk to the Assistant that solved the task. By engaging in a dialogue with the explicit goal of helping the Assistant learn, we gain several benefits:
As many people have pointed out, COT monitorability is an important aspect of AI Alignment. If models are interpretable, it makes every other alignment technique more effective.
DCA introduces a mechanism that could improve COT faithfulness and model capabilities at the same time.
This means that AI labs would be financially incentivized to improve their models' alignment.
If this works, we don't need to choose between capabilities and safety: we get both.
If you are interested in this idea: I am looking for mentees and collaborators on this project through SPAR (Supervised Program for Alignment Research).