Thanks for the detailed outline! FYI I'm a PhD student with some experience in LLM RL, so I may have the time / skills / compute to potentially collaborate on this.
I like "Problem One (Capabilities): ..." I agree that LLMs' inability to "build their own training" by self-reflection, as humans do, seems like a current limitation. I really like the focus on safety research that also contributes to capabilities, as it is much more likely to be adopted. I think that is a productive and realistic mindset!
Some concerns with your proposal are that
Additionally, I'm not sure why you expect that training via DCT increases CoT faithfulness. Wouldn't the model learn to optimize both the reviewer CoT reward and the correct-answer reward by outputting reasoning traces that signal lots of sound reasoning regardless of the final answer? For example, I could imagine the student getting quite performative (e.g. having lots of "wait, let's verify that!") which seems to go against CoT faithfulness? It seems to me that to measure CoT faithfulness you have to optimize the explainability / correlation between the model's reasoning and answers / actions?
Overall I imagine that the computational expense of your approach + LMs' inability to critique + current approaches indirectly optimizing for good block-level structure imply this idea wouldn't work too well in practice. I may be most interested in discussing the potential disagreement about this method producing faithful CoTs, as I think that is broadly an interesting goal to try to pursue via RL.
Thank you for the detailed feedback!
If you are interested in working on this, consider applying at SPAR (https://sparai.org/program) where I am offering to mentor this project. Applications should open today, and I am confident that I can get funding for this project.
About your critiques:
I fully agree that the reviews will be expensive, but I also think there is no way around this. Consider how long it takes for us humans to reflect on our experiences and to decide "how could I have thought that faster". It can take hours for us to turn an experience into a good lesson. And yet we still do it, because it's worth it.
Speed vs complexity: My take on this is that optimizing for a naive signal quickly is a local optimum. At some point the model has learned everything from this that it can get and more deliberate updates become useful. For comparison, adding conscious thought to apes created humans and we conquered the planet. Adding conscious thought to an amoeba probably wouldn't have helped much and it would have just died out from the increased energy demands of a more complex brain. There is a tradeoff, and I believe that LLMs are approaching the tipping point where it makes sense to go for more complex learning algorithms, because the simpler ones have already reached the point of diminishing returns.
LLMs may not be smart enough yet: I agree that this is a concern. I believe that the technique I describe will be very useful as soon as a threshold is reached, but I don't know if we are at that threshold yet. Worst case scenario: The experiments don't work. But in that case we can just keep the code and rerun it every time the frontier labs release a new model, until we get lucky and it works.
On the evolutionary pressure towards faithfulness: There are two reward signals: The main reward depends only on "was the result correct?" and so can't be gamed. The DCA reward by the Reviewer is based on the main reward, only more granular. The sum total reward for the student doesn't change no matter what performance the model puts on.
Thanks for the reply!
I am still curious why you think that the model won't get performative. For example, on questions that it gets correct it could add unnecessary double-checking, or make intentional mistakes so that it can heroically catch them. Maybe you could try specifying the reward you had in mind more concretely?
Thanks!
Epistemic status: This idea evolved from my experiences at MATS. I received positive feedback when presenting it to several AI alignment researchers, and am sharing this detailed proposal to gather community feedback. This is at the theoretical stage with planned experiments. There are several open questions and failure modes, but a very high potential value if we can make it work.
NOTE: There is a new version of this post here, which has been updated with the results of some discussions we have had so far. Please ignore this post and look at the newer one.
What if we could solve alignment not by making capabilities research slower and more expensive, but by making the safest approach also the most profitable one?
This post introduces Deliberative Credit Assignment (DCA), a training methodology that could make faithful chain-of-thought reasoning a capabilities enhancement rather than a safety tax. The core insight is simple: if faithful reasoning enables better learning, then market forces will drive models toward transparency rather than away from it.
Think of this as creating evolutionary pressure where models that reason faithfully outcompete those that don't, not because of an artificial training signal that targets faithfulness, but because faithfulness is a necessary byproduct of good performance.
I am looking for feedback and for people who might be interested in helping me with this research.
It is an open problem how to train reasoning traces in COTs effectively. Most current methods use RL and rely on automatic differentiation to assign credit to individual tokens or sentences. It is not even clear at what level of granularity credit should be assigned: To tokens, words, or sentences? See our Literature Review section below for details.
To understand the method I want to introduce, you could imagine a human reasoning about their experiences:
We humans can look at our own thoughts from an outside perspective and actively think about how the individual thoughts fit together, which of them were useful and which should have been skipped. I find it interesting that LLMs do not yet have a method that replicates this ability, which is unique to humans: All animals are able to learn using automated, hardcoded update rules in their brains, which are equivalent to the fixed update rules we use in LLMs. But humans alone are able to use introspection to generate our own training data.
DCA is intended as a way to replicate this deliberative review process in an LLM, as a novel training mechanism.
Reasoning models have a fundamental problem: the visible reasoning often doesn't reflect the actual computational process.
Research shows that models often arrive at answers through shortcuts or pattern matching, then generate plausible-sounding explanations after the fact. The COT might be largely post-hoc rationalization.
DCA introduces a multi-model system that mirrors how humans learn complex skills through deliberate practice and external feedback.
Assistant Model: Generates reasoning chains (COT) to solve problems, just like current models.
Reviewer Model(s): Analyzes the Assistant's reasoning to identify causal structure—which steps actually contributed to solving the problem, which were irrelevant, and how they connect to each other.
The Reviewer has access to the visible COT of the Assistant, but not to its latent state. This limitation is a feature, not a bug, because it forces the Assistant to create faithful COTs if it wants to get maximum benefit from DCA training.
Here's how it works:
The Reviewer is not constrained in how they break down the sentence into steps: It could be on the level of phrases, sentences, or paragraphs.
Likewise they are completely free in arguing why a step is useful or not. This means that smarter models can use more advanced reasoning on their own. Instead of relying on hardcoded rules, we take advantage of the Bitter Lesson and enable the quality of the credit assignment to scale automatically with the intelligence of the model.
We can explain the goal of the training mechanism to the Reviewer: Cause useful thoughts to be thought more often, and useless thoughts to be thought less often. We can then finetune its behavior by adjusting the prompt, rather than having to modify an algorithm, which allows for much faster iteration.
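To make the Reviewer's output more concrete, here is a minimal Python sketch of one possible review format. The names `ReviewedStep` and `Review` and all of their fields are hypothetical and introduced only for illustration. The sketch assumes that the Reviewer splits the COT into contiguous spans of its own choosing and attaches a free-text rationale and optional dependencies to each span.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewedStep:
    """One span of the COT, at whatever granularity the Reviewer chose."""
    start: int        # character offset where the span begins (inclusive)
    end: int          # character offset where the span ends (exclusive)
    useful: bool      # did this step contribute to the final answer?
    rationale: str    # free-text argument written by the Reviewer
    depends_on: list = field(default_factory=list)  # indices of earlier steps


@dataclass
class Review:
    cot: str
    steps: list

    def is_partition(self) -> bool:
        """Check that the steps cover the COT exactly once, in order."""
        cursor = 0
        for step in self.steps:
            if step.start != cursor or step.end <= step.start:
                return False
            cursor = step.end
        return cursor == len(self.cot)

    def dependencies_are_causal(self) -> bool:
        """A step may only depend on steps that came before it."""
        return all(
            0 <= dep < i
            for i, step in enumerate(self.steps)
            for dep in step.depends_on
        )


# Toy COT, invented for illustration only.
cot = "Let x be the price. Bananas are yellow. Then 3x = 12, so x = 4."
first_end = cot.index(" Bananas")
second_end = cot.index(" Then")
review = Review(
    cot=cot,
    steps=[
        ReviewedStep(0, first_end, True, "Defines the unknown."),
        ReviewedStep(first_end, second_end, False, "Unrelated to the problem."),
        ReviewedStep(second_end, len(cot), True, "Solves for x.", depends_on=[0]),
    ],
)
```

Because the spans are plain character offsets, the same structure works whether the Reviewer decides to split by phrase, sentence, or paragraph.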
Crucially, learning to game the Reviewer does not help the model: The Reviewer does not decide if a reward is given or not - that has already been decided earlier. The Reviewer only chooses how to distribute the reward to the COT. DCA is based on an evolutionary pressure: Increased Faithfulness leads to a more efficient evolutionary process and therefore benefits the model in the long run.
A reasoning trace that decides to be faithful is more likely to be improved and gets higher rewards in the future. A reasoning trace that looks for a way to trick the Reviewer will still get the same reward (just distributed to the wrong tokens) and will be outperformed in the future.
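The claim that the Reviewer only redistributes a fixed reward can be illustrated with a small sketch. The function name `distribute_reward`, its parameters, and the proportional weighting rule are assumptions made for this example, not a specification of DCA. The point it shows is narrow: whatever weights the Reviewer assigns, the per-token rewards sum to the same total.

```python
def distribute_reward(total_reward, step_lengths, step_weights):
    """Spread a fixed sequence-level reward over the tokens of a COT.

    step_lengths[i] is the number of tokens in step i.
    step_weights[i] >= 0 is the Reviewer's judgment of how much step i helped.
    Returns one reward per token; the rewards always sum to total_reward.
    """
    if len(step_lengths) != len(step_weights):
        raise ValueError("need one weight per step")
    if any(w < 0 for w in step_weights):
        raise ValueError("weights must be non-negative")

    n_tokens = sum(step_lengths)
    if n_tokens == 0:
        return []

    # Total weight mass, counting each token of a step once.
    total_mass = sum(n * w for n, w in zip(step_lengths, step_weights))
    if total_mass == 0:
        # The Reviewer gave no usable signal: fall back to a uniform split.
        return [total_reward / n_tokens] * n_tokens

    per_token = []
    for n, w in zip(step_lengths, step_weights):
        per_token.extend([total_reward * w / total_mass] * n)
    return per_token


step_lengths = [5, 4, 9]  # toy token counts for three steps
honest = distribute_reward(1.0, step_lengths, [1.0, 0.0, 1.0])
gamed = distribute_reward(1.0, step_lengths, [0.0, 1.0, 0.0])
print(round(sum(honest), 9), round(sum(gamed), 9))
```

In both calls the total is unchanged; only the tokens that receive credit differ. This sketch says nothing about whether changing the distribution has side effects on training, which is the open question raised in the text.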
One big drawback of this approach is that if the DCA is wrong, we may end up with very counterproductive training signals that hurt the model's performance instead of improving it. This is kind of similar to a novice meditator or novice rationalist trying to optimize their brain before fully understanding what they are doing. It can hurt more than it helps.
Worse, LLMs are known to hallucinate and tend to be overconfident. Giving them unfettered access to create training data for themselves could be disastrous.
Because of this, the Reviewer must be separate from the Assistant. Self-reflection fails because:
External review avoids these pitfalls. We can use multiple frontier models (GPT-4, Claude, etc.) as Reviewers, training only when they reach consensus about causal structure.
In cases where they agree only on parts of the COT, the Reviewers may also choose to submit only those parts of the COT for training.
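A minimal sketch of such a consensus filter follows. It makes a simplifying assumption that the text does not: all Reviewers label the same pre-agreed list of steps, whereas in the proposal each Reviewer is free to choose its own segmentation. The function name `consensus_labels` and the `min_votes` parameter are hypothetical.

```python
def consensus_labels(reviews, min_votes=None):
    """Combine per-step usefulness votes from several Reviewers.

    reviews is a list with one entry per Reviewer; each entry is a list of
    booleans (True = step was useful), all of the same length.
    min_votes is the number of Reviewers that must agree; the default is
    unanimity. Returns True or False per step where consensus was reached
    and None where it was not.
    """
    if not reviews:
        raise ValueError("need at least one review")
    n_reviewers = len(reviews)
    n_steps = len(reviews[0])
    if any(len(r) != n_steps for r in reviews):
        raise ValueError("all reviews must label the same steps")

    needed = n_reviewers if min_votes is None else min_votes
    if 2 * needed <= n_reviewers:
        raise ValueError("min_votes must be a strict majority")

    labels = []
    for votes in zip(*reviews):
        yes = sum(votes)
        no = n_reviewers - yes
        if yes >= needed:
            labels.append(True)
        elif no >= needed:
            labels.append(False)
        else:
            labels.append(None)  # no consensus: exclude this step from training
    return labels


# Toy votes from three Reviewers on four steps, invented for illustration.
reviews = [
    [True, False, True, True],
    [True, False, False, True],
    [True, False, True, True],
]
unanimous = consensus_labels(reviews)
majority = consensus_labels(reviews, min_votes=2)
trainable_steps = [i for i, label in enumerate(unanimous) if label is not None]
```

With unanimity, the third step is left out of training because one Reviewer disagreed, which corresponds to submitting only the agreed parts of the COT.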
Once we have the DCA breakdown of a COT, we have several options to train the model on it. I find it difficult to predict which of these would work best, so I would like to empirically test them all. They all tackle the Credit Assignment Problem: Given that we already have a reward for the whole sequence, how do we distribute that reward to the individual tokens during backpropagation?
Approach 1: Modify RL rewards
Approach 2: Modify the COT
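One possible reading of Approach 2 is sketched below, using two of the edit operations named in the literature review in this post: removing irrelevant steps and inserting corrective comments. The function name `edit_cot`, the note format, and the idea of using the edited text as a fine-tuning target are assumptions made for illustration.

```python
def edit_cot(steps, useful, comments=None):
    """Build a revised COT from the Reviewer's verdicts.

    steps is a list of strings, useful a list of booleans of the same length.
    comments optionally maps a step index to a corrective note, which is
    inserted at that position whether or not the step itself is kept.
    """
    if len(steps) != len(useful):
        raise ValueError("need one verdict per step")
    comments = comments or {}

    revised = []
    for i, (step, keep) in enumerate(zip(steps, useful)):
        if i in comments:
            revised.append(f"[Reviewer note: {comments[i]}]")
        if keep:
            revised.append(step)
    return " ".join(revised)


# Toy COT, invented for illustration only.
steps = [
    "Let x be the price of one apple.",
    "Bananas are yellow.",
    "Three apples cost 12, so 3x = 12.",
    "Therefore x = 4.",
]
useful = [True, False, True, True]
cleaned = edit_cot(steps, useful)
annotated = edit_cot(steps, useful, comments={1: "this detour was not needed"})
```

The cleaned text could then be used as a training target in place of the original COT, so that the model is trained toward the version without the irrelevant step.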
Why do we have evolutionary pressure towards faithfulness?
When reasoning is faithful:
When reasoning is unfaithful:
Over many training iterations, this creates evolutionary pressure: instances where the model reasoned faithfully will outcompete instances where it didn't, because faithful reasoning leads to better learning and improved performance.
The term "Faithful Reasoning" is ambiguous. For our purposes, what it means is: A COT is considered faithful if every piece of information that could serve as a decision point in the latent reasoning trace also appears in the visible COT.
This means that it is fine if the model does not mention information in the COT that is obvious, in the sense that the model does not expect it to become relevant.
It also means that the COT can hide information, but only if it is certain about it and all models involved in the Review share that implicit understanding.
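As a sketch, this definition can be written in set notation. The symbols are introduced here for illustration only: let $D$ be the set of pieces of information that could serve as decision points in the latent reasoning, $V$ the set of pieces of information stated in the visible COT, and $K$ the set of facts that the Assistant is certain of and that all models involved in the review implicitly share.

```latex
\[
  \text{the COT is faithful} \iff D \setminus K \subseteq V
\]
```

Information that the model does not expect to become relevant is simply not in $D$, so omitting it does not violate the condition. With $K = \emptyset$ this reduces to the stricter reading $D \subseteq V$.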
Since DCA provides an evolutionary incentive and does not set an absolute goal or a metric to optimize, the model will naturally take the path of least resistance: It will learn to be explicit and faithful where doing so helps it learn, and it will save the tokens where excess verbosity is not needed.
Critically, we don't need models to start with perfect faithful reasoning. The pressure toward faithfulness emerges naturally:
We just need to ensure that the overall reward for situations where faithful reasoning enables improvement is greater than rewards where faithfulness can't be determined. Since the reward is determined independently of DCA, this should actually be the case by default: The Reviewer only reassigns the reward, but the decision to reward is made independently and can't be directly affected by the Reviewer. We just need to make sure we think of edge cases since modifying the way rewards are assigned may cause side effects even if the total reward is unchanged. I would be happy to hear about any edge cases you can think of in the comments!
We can start with focused experiments:
Begin with math word problems where:
Gradually expand to:
This progression will reveal DCA's reliability boundaries: Where does reviewer consensus break down?
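One way to measure where consensus breaks down is to track Reviewer agreement per task domain. The sketch below uses mean pairwise label agreement as the metric, which is an assumption and not something the proposal prescribes. The function names, the domain names, and all votes are invented toy values.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement(labels_by_reviewer):
    """Fraction of (Reviewer pair, step) combinations with identical labels."""
    pairs = list(combinations(labels_by_reviewer, 2))
    if not pairs:
        raise ValueError("need at least two Reviewers")
    matches = 0
    total = 0
    for a, b in pairs:
        if len(a) != len(b):
            raise ValueError("Reviewers must label the same steps")
        matches += sum(x == y for x, y in zip(a, b))
        total += len(a)
    if total == 0:
        raise ValueError("need at least one step")
    return matches / total


def agreement_by_domain(records):
    """records holds (domain, labels_by_reviewer) pairs, one per reviewed COT.

    Returns the mean pairwise agreement for each domain.
    """
    scores = defaultdict(list)
    for domain, labels in records:
        scores[domain].append(pairwise_agreement(labels))
    return {domain: sum(v) / len(v) for domain, v in scores.items()}


# Toy data: three Reviewers per COT, labels are True if a step was useful.
records = [
    ("math", [[True, False, True], [True, False, True], [True, False, True]]),
    ("math", [[True, True], [True, True], [True, False]]),
    ("open_ended", [[True, False], [False, True], [True, True]]),
]
agreement = agreement_by_domain(records)
```

A domain whose agreement falls well below that of the math baseline would be a candidate for the boundary where DCA stops being reliable.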
If DCA works, we should see:
DCA extends Constitutional AI's core insight—using AI systems to review and improve other AI outputs. Where Constitutional AI focuses on harmful content, DCA focuses on reasoning structure. While the main goal here is to improve capabilities, the method should increase interpretability as a side effect.
See here for a literature review, by GPT's Deep Research:
Literature Review
Literature review by ChatGPT. See our own comments at the end.
Deliberative Credit Assignment (DCA) envisions a two-model system: an assistant LLM generates a multi-step “chain-of-thought” (CoT) answer, and a separate reviewer LLM analyzes the chain’s structure to identify which intermediate steps causally contributed to the final answer. The reviewer then influences training by either (a) assigning token-level rewards (rewarding only the helpful steps) or (b) editing the chain (e.g. removing irrelevant steps, reordering, or inserting corrective comments). To our knowledge, no prior work exactly implements this pipeline. However, several strands of recent research touch on components of it:
Summary of Findings: We did not find any prior work that exactly matches the DCA proposal — i.e. a dedicated reviewer LLM that identifies irrelevant reasoning steps and then trains the assistant via selective token rewards or chain editing based on that analysis. In related work, multiple models or self-consistency checks are used to validate answers or prune paths, but they do not explicitly break down the CoT into causal chunks for differential training. Moreover, token-level credit has been explored (e.g. OREAL, Q-RM, SPO), but typically with reward signals derived from final outcome or learned critics, rather than an explicit causal review of each step. Thus, the combination of (1) a separate reviewer analyzing causal dependencies between steps, with (2) training the original model by either masking out irrelevant tokens or revising the chain structure, appears to be novel.
Specific Questions:
Conclusion: In sum, many related ideas exist – fine-grained RL credit (segment-level, step-level, token-level), multi-model answer verification, causal analysis of CoTs, and edit-based refinement – but their combination into a “deliberative credit assignment” framework seems novel. We did not find any prior paper that uses a second LLM to dissect a chain-of-thought into causal chunks and then backpropagates through the chain with rewards or edits based on that analysis. All sources cited above confirm that while components of DCA have been explored, the full proposed scheme has not appeared in the literature.
Sources: We reviewed recent literature (2020–2025) on CoT training and evaluation. Relevant citations include Guo et al. (2025) on SPO, Huang et al. (2025) on OREAL, Chen et al. (2025) on Q-RM, Paul et al. (2024) on causal CoT analysis (FRODO), Tan (2023) on causal abstraction, Ma et al. (2023) on PRM-guided reasoning, rationalizer-reviewer pipelines, and Zhao et al. (2023) on verify-edit CoTs. These and related works are discussed above.
Additional notes that GPT did not capture:
The COCONUT system introduced in Training Large Language Models to Reason in a Continuous Latent Space (Hao et al., 2024) is relevant because it is in some ways the opposite of our proposed system:
We hypothesize: COCONUT may lead to faster capability improvements, but at the cost of interpretability. DCA takes longer but preserves transparency while improving capabilities. Both approaches acknowledge that current COT has problems—they just propose opposite solutions.
Thought Anchors: Which LLM Reasoning Steps Matter? (Bogdan et al., 2025) provides direct evidence that reasoning chains contain identifiable causal structure. Using counterfactual resampling, attention analysis, and attention suppression, they demonstrate that certain reasoning steps have "outsized importance" and disproportionately influence subsequent reasoning. Their work validates a core DCA assumption—that causal dependencies between reasoning steps can be systematically identified. However, Thought Anchors focuses on interpretability and analysis of existing reasoning chains, while DCA uses causal analysis to modify training. Their methods could potentially inform more sophisticated Reviewer model implementations.
This review shows that while related components exist (token-level rewards, step-level evaluation, self-reflection), no prior work combines:
Issue: Models currently too unfaithful might not benefit from DCA initially, or could even be harmed by incorrect credit assignment.
Mitigation: Start training earlier in the pipeline rather than as a final step after RLHF. May require integration into the fundamental training process. We can also focus our initial efforts on very structured tasks like math and logic: Those are more likely to be faithful, and once DCA increases performance on these tasks the evolutionary pressure will kick in and slowly cause more abstract kinds of tasks to become more faithful as well.
Issue: Models might learn to generate reasoning that looks causally structured to the Reviewer without actually being faithful.
Mitigation: The evolutionary pressure argument applies here too—gaming that doesn't improve actual reasoning won't lead to better performance, so it should be selected against. We just need to be careful about reward structure design: In theory there should be no benefit to trying to game the system since the model will receive a reward one way or the other. DCA just makes the reward signal more fine-grained. However, we need to verify carefully that this does not come with side-effects that encourage gaming the system after all.
Issue: Requiring external Reviewer analysis for every training example is expensive.
Mitigation: Start with small-scale experiments. As models improve, the cost becomes more justifiable. The technique may also become more efficient as Reviewers get better at rapid analysis. Ultimately, the cost does not matter if it can make models smarter in ways that normal training does not: If it gets your model better scores on the benchmarks than anyone else, you gain more money from happy investors than you spent.
Issue: As tasks become less structured, Reviewer consensus may break down.
Mitigation: This is actually valuable information—it tells us where our current reasoning evaluation capabilities end. We can focus DCA on domains where it's reliable while developing better evaluation for other domains.
Issue: What if the premise is wrong and faithful reasoning does not improve performance? What if post-hoc rationalization combined with pattern matching is genuinely more efficient?
Mitigation: I suspect that this is true for some problems, but not all of them. Benchmarks show that reasoning tasks cannot be solved with heuristics alone. Either way, even in this scenario DCA does not actually hurt, it just won't lead to an improvement. The model can just learn to only bother with faithful reasoning in those situations where it is beneficial. I suspect that this will cover most of the cases we care about, because I find it hard to believe that complex ideas like Alignment Faking could be executed entirely based on heuristics without deliberation.
Issue: Maybe it turns out that DCA also works with just a single model reviewing itself, despite the risk of hallucinations and the lack of cross-examination? If this happens, we would get the capability gains without the increase in interpretability, which could be very bad.
Likelihood: We will have to test this empirically, but my intuition tells me that a model with external reviewers and an interpretable COT is more likely to work than a self-analyzing model. By analogy: A human who has an intense discussion with other people will be able to learn faster than they could learn on their own. This is even more true for LLMs, since those would have common knowledge that they can trust each other and that their only goal is to help each other learn.
Mitigation: The following extension introduces additional benefits when using external Reviewers and also proposes an alternative way to increase interpretability:
It is possible that some Reviewers are more suitable for some types of tasks than for others. We can use a mechanism similar to Split-Personality Training to exploit this:
We insert special tokens into the system prompt during DCA training based on which Reviewer(s) most strongly agree with a given DCA. This will cause that reasoning style to become associated with the Reviewer's token through backpropagation. By adding these special tokens during inference later, we can bias the model to use the style of reasoning preferred by a specific Reviewer.
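The tagging step could look roughly like the sketch below. The tag format `<reviewer:NAME>`, the agreement scores, and the threshold are all hypothetical choices made for this example. In a real setup the tags would also need to be registered as special tokens in the tokenizer, which is not shown here.

```python
def tag_system_prompt(system_prompt, agreement, threshold=0.8):
    """Prepend one tag per Reviewer that strongly agrees with the DCA.

    agreement maps a Reviewer name to a score in [0, 1] describing how
    strongly that Reviewer agrees with the DCA used for this training example.
    Reviewers are sorted by name so the tag order is deterministic.
    """
    tags = [
        f"<reviewer:{name}>"
        for name, score in sorted(agreement.items())
        if score >= threshold
    ]
    return "".join(tags) + system_prompt


base_prompt = "You are a helpful assistant."

# Training: tag the example with the Reviewers that endorsed this DCA.
train_prompt = tag_system_prompt(
    base_prompt, {"reviewer_a": 0.9, "reviewer_b": 0.4}
)

# Inference: request the reasoning style associated with reviewer_b.
infer_prompt = tag_system_prompt(base_prompt, {"reviewer_b": 1.0})

# Inference with all special reasoning styles toggled off.
plain_prompt = tag_system_prompt(base_prompt, {})
```

Leaving out all tags at inference time gives the untagged prompt, which is how a reasoning style can be switched off.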
If we find that some reasoning styles consistently outperform others, we can drop or modify the less useful Reviewers. This enables a form of Metalearning.
It also allows companies to release a model that has not quite figured out yet how to use a certain reasoning style without adversely affecting the model's behavior on normal tasks, since the special reasoning styles associated with certain Reviewers can just be toggled off.
Additionally, if Split-Personality Training can be used to induce multiple separate personalities in a model, this could give us an alternative way to perform a cross-examination:
Instead of using multiple external Reviewers, we now have a model reviewing itself. As mentioned above, this would reduce the alignment benefit of this approach. However, if this situation occurs then that means that Split-Personality Training works, so we have other ways to increase interpretability and it's a Win-Win situation.
Note: Split-personality Training is a project I am currently working on, and it inspired this proposal for DCA. Split-personality Training induces a second personality within a single model that reveals misaligned behavior or thoughts of the primary personality. Because this second personality shares the latent space with the primary personality, it can elicit latent knowledge. This made me realize that the opposite is also true: When there is no shared latent space because the review is conducted by a different model that sees only the output, then the mechanism can only work if the COT is faithful to the internal state. DCA aims to turn this weakness into a strength, because it causes a beneficial evolutionary pressure.
We could let the Reviewers talk to the Assistant that solved the task. By engaging in a dialogue with the explicit goal of helping the Assistant learn, we gain several benefits:
In the current landscape:
DCA could invert these incentives:
If this works, we don't need to choose between capabilities and safety—we get both.
This approach is ready for experimental validation, but it will require significant resources and diverse expertise.
What we need:
Immediate next steps:
Long-term vision: If initial experiments succeed, scale to frontier models and broader reasoning domains. The ultimate goal is making faithful reasoning not just possible, but inevitable—because it's the most effective way to build capable AI systems. If models use faithful COTs, Interpretability is effectively solved since we can just ask the model and inspect the COT. This would make all Alignment work significantly easier.
This post builds on work developed through the ML Alignment & Theory Scholars Program (MATS). I welcome discussion, collaboration, and constructive criticism. The success of this approach depends on community engagement and rigorous experimental validation.
Interested in contributing? Please reach out to discuss collaboration opportunities or share feedback on the experimental design.