Deliberative Credit Assignment: Making Faithful Reasoning Profitable

Florian_Dietz

Epistemic status: This idea evolved from my experiences at MATS. I received positive feedback when presenting it to several AI alignment researchers, and am sharing this detailed proposal to gather community feedback. This is at the theoretical stage with planned experiments. There are several open questions and failure modes, but a very high potential value if we can make it work.

Introduction

NOTE: There is a new version of this post here, which has been updated with the results of some discussions we have had so far. Please ignore this post and look at the newer one.

What if we could solve alignment not by making capabilities research slower and more expensive, but by making the safest approach also the most profitable one?

This post introduces Deliberative Credit Assignment (DCA), a training methodology that could make faithful chain-of-thought reasoning a capabilities enhancement rather than a safety tax. The core insight is simple: if faithful reasoning enables better learning, then market forces will drive models toward transparency rather than away from it.

Think of this as creating evolutionary pressure where models that reason faithfully outcompete those that don't, not because of an artificial training signal that targets faithfulness, but because faithfulness is a necessary byproduct of good performance.

I am looking for feedback and for people who might be interested in helping me with this research.

Problem One (Capabilities): Credit Assignment in Reasoning Traces

It is an open problem how to train reasoning traces in COTs effectively. Most current methods use RL and rely on automatic differentiation to assign credit to individual tokens or sentences. It is not even clear at what level of granularity credit should be assigned: To tokens, words, or sentences? See our Literature Review section below for details.

To understand the method I want to introduce, you could imagine a human reasoning about their experiences:

"How could I have thought that faster?"
"Which of my habits do I want to strengthen, and which do I want to lose?"

We humans can look at our own thoughts from an outside perspective and actively think about how the individual thoughts fit together, which of them were useful and which should have been skipped. I find it interesting that LLMs do not yet have a method that replicates this ability, which is unique to humans: All animals learn through automated, hardcoded update rules in their brains, which are comparable to the fixed update rules we use to train LLMs. But humans alone are able to use introspection to generate our own training data.

DCA is intended as a way to replicate this deliberative review process in an LLM, as a novel training mechanism.

Problem Two (Interpretability): Reasoning That Looks Good But Isn't Faithful

Reasoning models have a fundamental problem: the visible reasoning often doesn't reflect the actual computational process.

Research shows that models often arrive at answers through shortcuts or pattern matching, then generate plausible-sounding explanations after the fact. The COT might be largely post-hoc rationalization.

Deliberative Credit Assignment: The Core Mechanism

DCA introduces a multi-model system that mirrors how humans learn complex skills through deliberate practice and external feedback.

The Credit Assignment Mechanism

Assistant Model: Generates reasoning chains (COT) to solve problems, just like current models.

Reviewer Model(s): Analyzes the Assistant's reasoning to identify causal structure—which steps actually contributed to solving the problem, which were irrelevant, and how they connect to each other.

The Reviewer has access to the visible COT of the Assistant, but not to its latent state. This limitation is a feature, not a bug, because it forces the Assistant to create faithful COTs if it wants to get maximum benefit from DCA training.

Here's how it works:

Assistant generates COT: "The farmer has 100 feet of fencing. I need to maximize the rectangular area..."
Reviewer analyzes causal structure:
- Breakdown of the sentence into reasoning steps: "[A: The farmer has 100 feet of fencing. ][B: I need to maximize the rectangular area]..."
  - Step A: Statement of given fact
  - Step B: Statement of task
  - ...
- Causal Analysis:
  - [G] solved the problem.
  - [F] failed to solve the problem.
  - [E] contributed indirectly by suggesting to do [G]
  - ...
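
One possible way to hold such a review in code is sketched below. This is only an illustration: the names `Role`, `Step`, and `Review` are hypothetical, the four role categories are one reading of the labels in the example above, and the texts of steps E, F, and G are invented placeholders, since the example elides them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Hypothetical categories, loosely mirroring the Causal Analysis above.
    SOLVED = "solved"            # the step directly produced the solution
    CONTRIBUTED = "contributed"  # the step indirectly enabled a useful step
    FAILED = "failed"            # the step was a dead end
    IRRELEVANT = "irrelevant"    # the step had no bearing on the outcome


@dataclass
class Step:
    label: str   # e.g. "A", "B", ...
    text: str    # the span of the COT covered by this step
    role: Role
    supports: list[str] = field(default_factory=list)  # steps this one led to


@dataclass
class Review:
    steps: list[Step]

    def useful_labels(self) -> set[str]:
        """Labels of the steps the Reviewer judged to have helped."""
        helpful = (Role.SOLVED, Role.CONTRIBUTED)
        return {s.label for s in self.steps if s.role in helpful}


# Step texts below are invented placeholders for illustration only.
review = Review(steps=[
    Step("E", "Maybe I should express the area in one variable.",
         Role.CONTRIBUTED, supports=["G"]),
    Step("F", "Try guessing side lengths at random.", Role.FAILED),
    Step("G", "A = x(50 - x), which is maximal at x = 25.", Role.SOLVED),
])
print(sorted(review.useful_labels()))  # ['E', 'G']
```

In practice the Reviewer would emit free-form text, so a parsing step (not shown) would be needed to get from its output to a structure like this.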

The Reviewer is not constrained in how they break down the sentence into steps: It could be on the level of phrases, sentences, or paragraphs.

Likewise they are completely free in arguing why a step is useful or not. This means that smarter models can use more advanced reasoning on their own. Instead of relying on hardcoded rules, we take advantage of the Bitter Lesson and enable the quality of the credit assignment to scale automatically with the intelligence of the model.

We can explain the goal of the training mechanism to the Reviewer: Cause useful thoughts to be thought more often, and useless thoughts to be thought less often. We can then finetune its behavior by adjusting the prompt, rather than having to modify an algorithm, which allows for much faster iteration.

Crucially, learning to game the Reviewer does not help the model: The Reviewer does not decide if a reward is given or not - that has already been decided earlier. The Reviewer only chooses how to distribute the reward to the COT. DCA is based on an evolutionary pressure: Increased Faithfulness leads to a more efficient evolutionary process and therefore benefits the model in the long run.

A reasoning trace that decides to be faithful is more likely to be improved and gets higher rewards in the future. A reasoning trace that looks for a way to trick the Reviewer will still get the same reward (just distributed to the wrong tokens) and will be outperformed in the future.

Avoiding Mistakes through External Review and Cross-Examination

One big drawback of this approach is that if the DCA is wrong, we may end up with very counterproductive training signals that hurt the model's performance instead of improving it. This is kind of similar to a novice meditator or novice rationalist trying to optimize their brain before fully understanding what they are doing. It can hurt more than it helps.

Worse, LLMs are known to hallucinate and tend to be overconfident. Giving them unfettered access to create training data for themselves could be disastrous.

Because of this, the Reviewer must be separate from the Assistant. Self-reflection fails because:

Models exhibit confirmation bias toward their own outputs
The same reasoning biases that create unfaithful COT will influence self-assessment
Models learn sophisticated post-hoc rationalization rather than genuine self-critique

External review avoids these pitfalls. We can use multiple frontier models (GPT-4, Claude, etc.) as Reviewers, training only when they reach consensus about causal structure.

In cases where they agree only on parts of the COT, the Reviewers may also choose to submit only those parts of the COT for training.
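
A minimal sketch of this consensus filter follows. It assumes something the proposal does not guarantee: that all Reviewers label the same segmentation of the COT, so their verdicts can be compared step by step. The function name `consensus_labels` and the boolean "useful / not useful" verdict are illustrative simplifications.

```python
def consensus_labels(reviews: list[dict[str, bool]]) -> dict[str, bool]:
    """Keep only the steps on which every Reviewer gives the same verdict.

    Each review maps a step label to True (useful) or False (not useful).
    Steps missing from any review, or with conflicting verdicts, are dropped.
    """
    if not reviews:
        return {}
    # Labels that appear in every single review.
    shared = set(reviews[0]).intersection(*map(set, reviews[1:]))
    agreed = {}
    for label in sorted(shared):
        verdicts = {r[label] for r in reviews}
        if len(verdicts) == 1:  # unanimous
            agreed[label] = verdicts.pop()
    return agreed


reviewer_1 = {"A": True, "B": False, "C": True}
reviewer_2 = {"A": True, "B": True, "C": True}
print(consensus_labels([reviewer_1, reviewer_2]))  # {'A': True, 'C': True}
```

Here the Reviewers disagree about step B, so only A and C would be submitted for training.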

Applying the DCA for training

Once we have the DCA breakdown of a COT, there are several ways to train the model on it. I find it difficult to predict which would work best, so I would like to test them empirically. The two approaches below both tackle the Credit Assignment Problem: Given that we already have a reward for the whole sequence, how do we distribute that reward to the individual tokens during backpropagation?

Approach 1: Modify RL rewards

We assign token-level rewards only to causally relevant steps.
All tokens that belong to a reasoning step that the DCA marked as useful receive a reward. Tokens in all other steps receive 0 reward.
This should cause those steps that contributed to the solution to be more likely to be used in the future.
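
As a rough sketch of Approach 1, the function below spreads a fixed sequence-level reward over the tokens of the useful steps. Several choices here are assumptions rather than part of the proposal: the even split, keeping the total reward unchanged, and the fallback to a uniform distribution when no step was marked useful. The name `token_rewards` and the span format are hypothetical.

```python
def token_rewards(step_spans, useful, total_reward, n_tokens):
    """Distribute `total_reward` evenly over tokens of useful steps.

    step_spans: dict mapping step label -> (start, end) token indices,
                with `end` exclusive.
    useful:     set of step labels the review marked as useful.
    Tokens outside useful steps get 0. The sum of the result equals
    `total_reward`, so the Reviewer only redistributes the reward.
    """
    if n_tokens == 0:
        return []
    useful_tokens = [
        i
        for label, (start, end) in step_spans.items()
        if label in useful
        for i in range(start, end)
    ]
    if not useful_tokens:
        # Assumed fallback: no usable review, so treat all tokens equally.
        return [total_reward / n_tokens] * n_tokens
    rewards = [0.0] * n_tokens
    share = total_reward / len(useful_tokens)
    for i in useful_tokens:
        rewards[i] = share
    return rewards


spans = {"A": (0, 3), "B": (3, 5), "C": (5, 9)}
rewards = token_rewards(spans, useful={"A", "C"}, total_reward=1.0, n_tokens=9)
print([round(r, 3) for r in rewards])
# [0.143, 0.143, 0.143, 0.0, 0.0, 0.143, 0.143, 0.143, 0.143]
```

Other weightings (for example, giving the step that solved the problem a larger share than steps that contributed indirectly) would fit the same interface.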

Approach 2: Modify the COT

We do not train on the COT that the model originally generated.
Instead, we use the DCA to modify the original COT and then train on that.
Example: Let's say we identify three discrete reasoning steps and notice that the middle one was irrelevant. The original sequence was [A][B][C], but we train the model on: [A]"Because of (finding of step A), we can ignore approach B and should try to solve this task with approach C instead."[C]
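
The [A][B][C] example can be sketched as a small string transformation. This is an illustrative assumption about how the editing could be automated: the bridge sentence is a fixed template copied from the example, whereas in the proposal the Reviewer could write it freely. The name `edit_cot` is hypothetical.

```python
def edit_cot(steps: list[tuple[str, str]], useful: set[str]) -> str:
    """Drop steps not marked useful and bridge each gap with a sentence.

    steps:  ordered (label, text) pairs of the original COT.
    useful: labels of the steps to keep.
    Skipped steps before the first kept step, or after the last one,
    are dropped without a bridge sentence.
    """
    parts: list[str] = []
    skipped: list[str] = []
    last_kept = None
    for label, text in steps:
        if label not in useful:
            skipped.append(label)
            continue
        if skipped and last_kept is not None:
            parts.append(
                f"Because of (finding of step {last_kept}), we can ignore "
                f"approach {', '.join(skipped)} and should try to solve this "
                f"task with approach {label} instead."
            )
        skipped = []
        parts.append(text)
        last_kept = label
    return " ".join(parts)


original = [("A", "<text of A>"), ("B", "<text of B>"), ("C", "<text of C>")]
edited = edit_cot(original, useful={"A", "C"})
print(edited)
# <text of A> Because of (finding of step A), we can ignore approach B and should try to solve this task with approach C instead. <text of C>
```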

The Evolutionary Pressure Toward Faithfulness

Why do we have evolutionary pressure towards faithfulness?

When reasoning is faithful:

The Reviewers can accurately identify which steps contributed to success
Credit gets assigned to the actual problem-solving process
The model learns more efficiently from each training example

When reasoning is unfaithful:

The Reviewer can't reliably assess contribution
Credit assignment becomes noisy or wrong
Learning is less effective

Over many training iterations, this creates evolutionary pressure: instances where the model reasoned faithfully will outcompete instances where it didn't, because faithful reasoning leads to better learning and improved performance.

Defining "Faithfulness"

The term "Faithful Reasoning" is ambiguous. For our purposes, what it means is: A COT is considered faithful if every piece of information that could serve as a decision point in the latent reasoning trace also appears in the visible COT.

This means that it is fine if the model does not mention information in the COT that is obvious, in the sense that the model does not expect it to become relevant.

It also means that the COT can omit information, but only if the model is certain about that information and all models involved in the review share that implicit understanding.
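
One way to write this definition compactly is given below. The notation is introduced here purely for illustration and is one possible reading of the definition: z is the latent reasoning trace, c is the visible COT, D(z) is the set of pieces of information that could serve as decision points in z, V(c) is the set of pieces of information stated in c, and S is the information the Assistant is certain of and that all Reviewers implicitly share.

```latex
\mathrm{Faithful}(c) \iff D(z) \subseteq V(c) \cup S
```

Setting S to the empty set recovers the strict version of the definition. Information that the model does not expect to become relevant is simply not part of D(z), so omitting it does not affect faithfulness.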

Since DCA provides an evolutionary incentive and does not set an absolute goal or a metric to optimize, the model will naturally take the path of least resistance: It will learn to be explicit and faithful where doing so helps it learn, and it will save the tokens where excess verbosity is not needed.

Bootstrapping Without Perfect Faithfulness

Critically, we don't need models to start with perfect faithful reasoning. The pressure toward faithfulness emerges naturally:

Initial state: Models have mixed faithful/unfaithful reasoning
Differential learning: Faithful instances learn more effectively
Competitive advantage: Better-learning instances outperform others
Selection pressure: Faithful reasoning becomes more common

We just need to ensure that the overall reward for situations where faithful reasoning enables improvement is greater than rewards where faithfulness can't be determined. Since the reward is determined independently of DCA, this should actually be the case by default: The Reviewer only reassigns the reward, but the decision to reward is made independently and can't be directly affected by the Reviewer. We just need to make sure we think of edge cases since modifying the way rewards are assigned may cause side effects even if the total reward is unchanged. I would be happy to hear about any edge cases you can think of in the comments!

Concrete Experimental Plan

We can start with focused experiments:

Phase 1: Mathematical Reasoning

Begin with math word problems where:

Ground truth is clear
Causal structure is more objective
Existing performance baselines exist for comparison

Phase 2: Progressive Complexity

Gradually expand to:

Logical reasoning puzzles
Common sense reasoning
Open-ended tasks

This progression will reveal DCA's reliability boundaries: Where does reviewer consensus break down?

Technical Implementation

Assistant models: Largest models we can afford to train ourselves
Reviewer models: External APIs (GPT-4, Claude) for speed and consensus-checking
Training data: Novel problems generated by frontier models to avoid contamination
Key metrics:
- Accuracy on reasoning tasks
- Reasoning chain efficiency (tokens per insight)
- Reviewer agreement rates
- Generalization to new domains
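
Two of these metrics can be sketched directly. Both definitions are assumptions made for illustration: agreement is measured as the fraction of matching verdicts over all Reviewer pairs and shared steps, and an "insight" is counted as one step that the review marked as useful. The function names are hypothetical.

```python
from itertools import combinations


def reviewer_agreement_rate(reviews: list[dict[str, bool]]) -> float:
    """Fraction of (Reviewer pair, shared step) cases with equal verdicts."""
    matches = 0
    total = 0
    for a, b in combinations(reviews, 2):
        for label in a.keys() & b.keys():
            total += 1
            matches += a[label] == b[label]
    return matches / total if total else 0.0


def tokens_per_useful_step(n_tokens: int, n_useful_steps: int) -> float:
    """Reasoning chain efficiency; lower means a denser chain."""
    return n_tokens / n_useful_steps if n_useful_steps else float("inf")


reviews = [
    {"A": True, "B": False, "C": True},
    {"A": True, "B": True, "C": True},
    {"A": True, "B": False, "C": True},
]
print(round(reviewer_agreement_rate(reviews), 3))  # 0.778
print(tokens_per_useful_step(120, 3))              # 40.0
```

Tracking the agreement rate per task domain would also show where Reviewer consensus starts to break down.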

Expected Outcomes

If DCA works, we should see:

Improved accuracy on reasoning tasks
Shorter, denser reasoning chains as irrelevant steps get removed
Better generalization to tasks with different formats
Higher reviewer consensus on model reasoning
and most importantly: More interpretable COTs

Relationship to Existing Work

Building on Constitutional AI

DCA extends Constitutional AI's core insight—using AI systems to review and improve other AI outputs. Where Constitutional AI focuses on harmful content, DCA focuses on reasoning structure. While the main goal here is to improve capabilities, the method should increase interpretability as a side effect.

Novelty

See the literature review below, generated with ChatGPT's Deep Research:

Literature Review

Literature review by ChatGPT. See our own comments at the end.

Deliberative Credit Assignment – Related Work and Novelty

Deliberative Credit Assignment (DCA) envisions a two-model system: an assistant LLM generates a multi-step “chain-of-thought” (CoT) answer, and a separate reviewer LLM analyzes the chain’s structure to identify which intermediate steps causally contributed to the final answer. The reviewer then influences training by either (a) assigning token-level rewards (rewarding only the helpful steps) or (b) editing the chain (e.g. removing irrelevant steps, reordering, or inserting corrective comments). To our knowledge, no prior work exactly implements this pipeline. However, several strands of recent research touch on components of it:

Segment- and Token-Level Credit Assignment: Many recent methods tackle fine-grained credit assignment in CoT reasoning, especially for reinforcement learning. For example, Segment Policy Optimization (SPO) proposes an intermediate “segment-level” advantage estimator between full-trajectory and token-level RL, improving CoT performance by grouping tokens into reasoning chunks. Similarly, OREAL (2025) uses only the final answer reward but learns a token-level reward model to decompose which steps were most critical, yielding large gains on math reasoning. Chen et al. (ICML 2025) propose Q-RM, a token-level reward model derived by optimizing a discriminative policy on preference data. These and other works (e.g. Proximal Policy Optimization with per-step advantages) show that token/step-level rewards can boost multi-step reasoning. However, they do not explicitly use a second LLM or causal analysis of the steps — typically the reward signals come from step-by-step correctness judgments or learned critics, not a separate “reviewer” agent.
Step-Level and Process Reward Models: Separate but related are process-reward models that evaluate entire reasoning steps. Lightman et al. (2023) introduced a Process Reward Model (PRM) trained on human-annotated step scores, and follow-up work (e.g. Ma et al. 2023) showed such step-level rewards improve math and code reasoning. Recent “step-level reward models” generate large preference datasets (e.g. Math-Shepherd 2024, MCTS-based preference collection) and train reward models that score each step. These models are used either during RL training (as additional loss terms) or at inference to prune search paths. None of these, however, explicitly analyzes causal dependencies between steps; they generally treat every step as potentially valuable feedback (often via human preferences or automated tree search).
Multi-Agent Reviewer Models: Some recent pipelines use a second LLM as a reviewer or verifier. For instance, Ma et al. (2023) propose a “review-then-rationalize” framework: an LLM (e.g. GPT-3.5) reviews another model’s answer and only allows post-hoc rationalization if they agree. In that pipeline, the reviewer model simply checks answer correctness before generating an explanation. Likewise, ThinkPRM (Muennighoff et al., 2025) uses a small LLM to verify and even recursively re-verify reasoning chains by prompting it (“Let’s verify again…”) to improve answer confidence. These works demonstrate multi-agent checks on reasoning outputs, but they do not break down the chain into causal chunks or explicitly re-train the generator based on step relevance.
Causal Analysis of CoT: Several papers analyze whether LLMs’ CoTs are truly causal. Tan (2023) used causal abstraction methods on arithmetic problems: they evaluate CoT quality and then test via interventions if intermediate tokens genuinely cause the final answer. They find that correct CoTs often correspond to (but do not guarantee) the LLM using them to reach its answer. Paul et al. (2024) perform causal mediation analysis on CoTs in various LLMs, showing LLMs often ignore or mistranslate their own reasoning steps. They introduce FRODO, a two-part system where a small “inference” model is trained with an implicit causal reward to generate faithful steps, and a main reasoning model learns to use those steps under a counterfactual preference objective. These works highlight the gap between stated reasoning and actual causal process, but they do not employ a separate persistent “reviewer agent” to annotate steps in order to re-train the original model.
Chain-of-Thought Editing & Verification: Some recent methods do post hoc editing of reasoning chains. For example, Zhao et al. (2023) propose a Verify-then-Edit framework (cited in Ma et al., 2024) that uses external knowledge to correct or remove faulty steps in a CoT. In practice, this is applied at inference: the model’s reasoning is verified against facts and inconsistencies are edited out. Another line of work (e.g. Muennighoff et al., 2025) prompts the model to iteratively check its own chain (“verify again”) as part of answer generation. Again, these techniques refine or filter CoTs but are not integrated into a training loop that backpropagates through edited chains.
Process Supervision Beyond Binary Reward: The broader category of process supervision (over and above just final correctness) is an active area. Works like “PRM” and “Step-Level Reward Models” above all fall into this. In particular, state-of-the-art CoT training often uses fine-grained signals: e.g. pruning incorrect reasoning paths with best-of-N search, or using critic networks to guide generation. See Guo et al. (2025) for SPO and Zhao et al. (2025) for OREAL as examples.

Summary of Findings: We did not find any prior work that exactly matches the DCA proposal — i.e. a dedicated reviewer LLM that identifies irrelevant reasoning steps and then trains the assistant via selective token rewards or chain editing based on that analysis. In related work, multiple models or self-consistency checks are used to validate answers or prune paths, but they do not explicitly break down the CoT into causal chunks for differential training. Moreover, token-level credit has been explored (e.g. OREAL, Q-RM, SPO), but typically with reward signals derived from final outcome or learned critics, rather than an explicit causal review of each step. Thus, the combination of (1) a separate reviewer analyzing causal dependencies between steps, with (2) training the original model by either masking out irrelevant tokens or revising the chain structure, appears to be novel.

Specific Questions:

Separate model for causal structure: Prior work has performed causal analysis on CoTs (e.g. through interventions) and has used LLMs as “judges” of answers, but we found no published method where a distinct model systematically assesses the causal impact of each reasoning step on the answer.
Training via CoT modification: No existing training pipeline we found explicitly rewrites or truncates the CoT during learning. Some works (FRODO) change training objectives based on counterfactual steps, and inference-time systems (Verify-then-Edit) remove steps, but we found no end-to-end approach that trains on the edited chains.
Token-level “causal” rewards: There are token-level reward methods (Q-RM, OREAL) that learn per-token credit from preferences or outcomes, but they do not use explicit causal attribution. In other words, token-level credit assignment is an active area, but existing work does not tie the reward specifically to causal analysis of the reasoning chain.
Retrospective CoT analysis/editing: We found verification-and-correction schemes at inference (e.g. Zhao et al. 2023, self-verification prompts) and causal reasoning analysis (Tan 2023, Paul 2024). However, no prior training method implements retrospective editing of CoTs for the purpose of updating the model.

Conclusion: In sum, many related ideas exist – fine-grained RL credit (segment-level, step-level, token-level), multi-model answer verification, causal analysis of CoTs, and edit-based refinement – but their combination into a “deliberative credit assignment” framework seems novel. We did not find any prior paper that uses a second LLM to dissect a chain-of-thought into causal chunks and then backpropagates through the chain with rewards or edits based on that analysis. All sources cited above confirm that while components of DCA have been explored, the full proposed scheme has not appeared in the literature.

Sources: We reviewed recent literature (2020–2025) on CoT training and evaluation. Relevant citations include Guo et al. (2025) on SPO, Huang et al. (2025) on OREAL, Chen et al. (2025) on Q-RM, Paul et al. (2024) on causal CoT analysis (FRODO), Tan (2023) on causal abstraction, Ma et al. (2023) on PRM-guided reasoning, rationalizer-reviewer pipelines, and Zhao et al. (2023) on verify-edit CoTs. These and related works are discussed above.

Additional notes that GPT did not capture:

Contrast with COCONUT

The COCONUT system introduced in Training Large Language Models to Reason in a Continuous Latent Space (Hao et al., 2024) is relevant because it is in some ways the opposite of our proposed system:

COCONUT: "Language-based reasoning is inefficient; move to latent space"
DCA: "Language-based reasoning can be made faithful and efficient"

We hypothesize: COCONUT may lead to faster capability improvements, but at the cost of interpretability. DCA takes longer but preserves transparency while improving capabilities. Both approaches acknowledge that current COT has problems—they just propose opposite solutions.

Thought Anchors

Thought Anchors: Which LLM Reasoning Steps Matter? (Bogdan et al., 2025) provides direct evidence that reasoning chains contain identifiable causal structure. Using counterfactual resampling, attention analysis, and attention suppression, they demonstrate that certain reasoning steps have "outsized importance" and disproportionately influence subsequent reasoning. Their work validates a core DCA assumption—that causal dependencies between reasoning steps can be systematically identified. However, Thought Anchors focuses on interpretability and analysis of existing reasoning chains, while DCA uses causal analysis to modify training. Their methods could potentially inform more sophisticated Reviewer model implementations.

This review shows that while related components exist (token-level rewards, step-level evaluation, self-reflection), no prior work combines:

Separate model analyzing causal dependencies between reasoning steps
Training modification based on that causal analysis

Failure Modes and Mitigations

The Unfaithfulness Bootstrapping Problem

Issue: Models currently too unfaithful might not benefit from DCA initially, or could even be harmed by incorrect credit assignment.

Mitigation: Start training earlier in the pipeline rather than as a final step after RLHF. May require integration into the fundamental training process. We can also focus our initial efforts on very structured tasks like math and logic: Those are more likely to be faithful, and once DCA increases performance on these tasks the evolutionary pressure will kick in and slowly cause more abstract kinds of tasks to become more faithful as well.

Gaming the Reviewer

Issue: Models might learn to generate reasoning that looks causally structured to the Reviewer without actually being faithful.

Mitigation: The evolutionary pressure argument applies here too—gaming that doesn't improve actual reasoning won't lead to better performance, so it should be selected against. We just need to be careful about reward structure design: In theory there should be no benefit to trying to game the system since the model will receive a reward one way or the other. DCA just makes the reward signal more fine-grained. However, we need to verify carefully that this does not come with side-effects that encourage gaming the system after all.

Computational Cost

Issue: Requiring external Reviewer analysis for every training example is expensive.

Mitigation: Start with small-scale experiments. As models improve, the cost becomes more justifiable. The technique may also become more efficient as Reviewers get better at rapid analysis. Ultimately, the cost does not matter if it can make models smarter in ways that normal training does not: If it gets your model better scores on the benchmarks than anyone else, you gain more money from happy investors than you spent.

Reviewer Reliability Boundaries

Issue: As tasks become less structured, Reviewer consensus may break down.

Mitigation: This is actually valuable information—it tells us where our current reasoning evaluation capabilities end. We can focus DCA on domains where it's reliable while developing better evaluation for other domains.

What If Post-hoc Rationalization Is Enough

Issue: What if the premise is wrong and faithful reasoning does not improve performance? What if post-hoc rationalization combined with pattern matching is genuinely more efficient?

Mitigation: I suspect that this is true for some problems, but not all of them. Benchmarks show that reasoning tasks cannot be solved with heuristics alone. Either way, even in this scenario DCA does not actually hurt, it just won't lead to an improvement. The model can just learn to only bother with faithful reasoning in those situations where it is beneficial. I suspect that this will cover most of the cases we care about, because I find it hard to believe that complex ideas like Alignment Faking could be executed entirely based on heuristics without deliberation.

What If Self-review Is Enough?

Issue: Maybe it turns out that DCA also works with just a single model reviewing itself, despite the risk of hallucinations and the lack of cross-examination? If this happens, we would get the capability gains without the increase in interpretability, which could be very bad.

Likelihood: We will have to test this empirically, but my intuition tells me that a model with external reviewers and an interpretable COT is more likely to work than a self-analyzing model. By analogy: A human who has an intense discussion with other people will be able to learn faster than they could on their own. This is even more true for LLMs, since they would have common knowledge that they can trust each other and that their only goal is to help each other learn.

Mitigation: The following extension introduces additional benefits when using external Reviewers and also proposes an alternative way to increase interpretability:

Extension: Multiple Reasoning Styles and Metalearning through Independent Personas

It is possible that some Reviewers are more suitable for some types of tasks than for others. We can use a mechanism similar to Split-Personality Training to exploit this:

We insert special tokens into the system prompt during DCA training based on which Reviewer(s) most strongly agree with a given DCA. This will cause that reasoning style to become associated with the Reviewer's token through backpropagation. By adding these special tokens during inference later, we can bias the model to use the style of reasoning preferred by a specific Reviewer.
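
A minimal sketch of the tagging step is shown below. The token format `<|reviewer:NAME|>`, the per-Reviewer agreement score, and the threshold are all hypothetical; the proposal only says that the tokens are chosen based on which Reviewers most strongly agree with a given DCA.

```python
def tag_system_prompt(
    system_prompt: str,
    agreement: dict[str, float],
    threshold: float = 0.8,
) -> str:
    """Prefix the system prompt with one token per strongly agreeing Reviewer.

    agreement: hypothetical score in [0, 1] per Reviewer name, describing
               how strongly that Reviewer agrees with the final DCA.
    """
    tokens = [
        f"<|reviewer:{name}|>"
        for name, score in sorted(agreement.items())
        if score >= threshold
    ]
    return "".join(tokens) + system_prompt


prompt = tag_system_prompt(
    "You are a helpful assistant.",
    {"reviewer_a": 0.9, "reviewer_b": 0.4},
)
print(prompt)  # <|reviewer:reviewer_a|>You are a helpful assistant.
```

At inference time the same function could be called with a hand-picked score dictionary to switch a specific reasoning style on or off.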

If we find that some reasoning styles consistently outperform others, we can drop or modify the less useful Reviewers. This enables a form of Metalearning.

It also allows companies to release a model that has not quite figured out yet how to use a certain reasoning style without adversely affecting the models' behavior on normal tasks, since the special reasoning styles associated with certain Reviewers can just be toggled off.

Additionally, if split-personality training can be used to induce multiple separate personalities in a model, this could give us an alternative way to perform a Cross-examination:

Instead of using multiple external Reviewers, we now have a model reviewing itself. As mentioned above, this would reduce the alignment benefit of this approach. However, if this situation occurs then that means that Split-Personality Training works, so we have other ways to increase interpretability and it's a Win-Win situation.

Note: Split-personality Training is a project I am currently working on, and it inspired this proposal for DCA. Split-personality Training induces a second personality within a single model that reveals misaligned behavior or thoughts of the primary personality. Because this second personality shares the latent space with the primary personality, it can elicit latent knowledge. This made me realize that the opposite is also true: When there is no shared latent space because the review is conducted by a different model that sees only the output, then the mechanism can only work if the COT is faithful to the internal state. DCA aims to turn this weakness into a strength, because it causes a beneficial evolutionary pressure.

Extension: Dialogue Between Assistant And Reviewers

We could let the Reviewers talk to the Assistant that solved the task. By engaging in a dialogue with the explicit goal of helping the Assistant learn, we gain several benefits:

The incentive to be faithful in the COT becomes explicit in addition to being an indirect evolutionary incentive.
Cooperating and very honestly describing its own internal thought processes to the Reviewers now also has an evolutionary incentive: Traces that end with the Assistant giving honest feedback are more likely to be improved and to succeed in the future.
The Reviewers can benefit from the LLM's explicit descriptions of thoughts that it did not previously verbalize.
If the Reviewers disagree on what the LLM might have been thinking, they can just ask instead of having to guess.
If the LLM hallucinates, the Reviewers may be able to catch this.

The Bigger Picture

In the current landscape:

Interpretability research often slows down capabilities
Safety features add complexity and cost
Market incentives push toward less interpretable but more capable systems

DCA could invert these incentives:

Faithful reasoning becomes a competitive advantage
Transparency enhances rather than hinders capabilities
Market forces drive toward interpretable systems

If this works, we don't need to choose between capabilities and safety—we get both.

Call for Collaboration

This approach is ready for experimental validation, but it will require significant resources and diverse expertise.

What we need:

Research collaborators with experience in reasoning model training
Computational resources for training experiments
Funding to support extended research timelines
Community feedback on potential failure modes and improvements

Immediate next steps:

Small-scale proof-of-concept with math problems
Reviewer consensus validation on causal structure analysis
Initial training experiments with token-level reward variants
Community discussion and refinement of the approach

Long-term vision: If initial experiments succeed, scale to frontier models and broader reasoning domains. The ultimate goal is making faithful reasoning not just possible, but inevitable—because it's the most effective way to build capable AI systems. If models use faithful COTs, Interpretability is effectively solved since we can just ask the model and inspect the COT. This would make all Alignment work significantly easier.

This post builds on work developed through the ML Alignment & Theory Scholars Program (MATS). I welcome discussion, collaboration, and constructive criticism. The success of this approach depends on community engagement and rigorous experimental validation.

Interested in contributing? Please reach out to discuss collaboration opportunities or share feedback on the experimental design.

Matan Shtepel (7mo)

Thanks for the detailed outline! FYI I'm a PhD student with some experience with LLM-RL, so I may have the time / skills / compute to potentially collaborate on this.

I like "Problem One (Capabilities): ...". I agree that LLMs' lack of ability to "build their own training" by self-reflection, as humans do, seems like a current limitation. I really like the focus on safety research that also contributes to capabilities, as it is much more likely to be adopted. I think that is a productive and realistic mindset!

Some concerns with your proposal are that

I think you are seriously underestimating the cost of the "reviewer" LLM. From experience, generating sequences (7B models, 2048 max seq len) takes up >90% of the training time in PPO, which has a Value Model. I think properly deliberating on each sequence will require a reviewer that is at least as large as the student and, optimistically, has reasoning sequences with context 3-4x as long as the student's answer (roughly: at least 2x just to have a single reflective sentence for each sentence outputted by the student + some more to consider how it fits in the broader context). The cost of generation grows about quadratically with the sequence length, which means you are spending the (vast) majority of your compute critiquing and not generating.
- Note that autoregressive generation is more expensive than all-at-once evaluation, so we would expect your generation to be much more expensive than the PPO Value Model.
If I interpret it correctly, the lesson of modern day LLM(-RL) as exemplified by GRPO is to not critique carefully and instead iterate a lot of times in the direction of a "naive" signal. GRPO has the simplest possible token-level reward (the advantage of the entire sequence!) and does remarkably well. GRPO produces reasoning traces that are mostly good because on average, those will lead to the correct answer more often. With GRPO you get to spend ~100% of your compute on generation.
Besides, I think you are overestimating the ability of current LMs to dissect reasoning and determine which parts are meaningful. This seems like a hard task that even a human would struggle with. While the Nanda paper you link gives some evidence of breaking up reasoning into more digestible blocks via complex methods, I see no evidence that current LMs can perform this task well. (If you want to train the reviewer model to perform this task better, you run into additional issues.)
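For reference, the sequence-level signal described in the GRPO point above can be sketched as follows. This is a minimal illustration, not code from any GRPO implementation: the function name `grpo_token_advantages` and its arguments are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def grpo_token_advantages(
    rewards: List[float], lengths: List[int], eps: float = 1e-8
) -> List[List[float]]:
    """Give every token of a sampled answer the same group-relative advantage.

    rewards[i] is the outcome reward of the i-th answer sampled for one prompt,
    lengths[i] is its token count. Both names are illustrative assumptions.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same length")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # One advantage per sequence, normalized within the group of samples.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Broadcast the sequence advantage to each token: no per-step critique.
    return [[a] * n for a, n in zip(advantages, lengths)]
```

Every token in a sequence receives the same value, which is the sense in which the token-level reward is as simple as it can be.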

Additionally, I'm not sure why you expect that training via DCA increases CoT faithfulness. Wouldn't the model learn to optimize both the reviewer CoT reward and the correct-answer reward by outputting reasoning traces that signal lots of sound reasoning regardless of the final answer? For example, I could imagine the student getting quite performative (e.g. having lots of "wait, let's verify that!") which seems to go against CoT faithfulness? It seems to me that to measure CoT faithfulness you have to optimize the explainability / correlation between the model's reasoning and answers / actions?

Overall I imagine that the computational expense of your approach + LMs' inability to critique + current approaches indirectly optimizing for good block-level structure imply this idea wouldn't work too well in practice. I may be most interested in discussing the potential disagreement about this method producing faithful CoTs, as I think that is broadly an interesting goal to try to pursue via RL.

Florian_Dietz (7mo)

Thank you for the detailed feedback!

If you are interested in working on this, consider applying at SPAR (https://sparai.org/program) where I am offering to mentor this project. Applications should open today, and I am confident that I can get funding for this project.

About your critiques:

I fully agree that the reviews will be expensive, but I also think there is no way around this. Consider how long it takes for us humans to reflect on our experiences and to decide "how could I have thought that faster?". It can take hours for us to turn an experience into a good lesson. And yet we still do it, because it's worth it.
Speed vs complexity: My take on this is that optimizing for a naive signal quickly is a local optimum. At some point the model has learned everything from this that it can get and more deliberate updates become useful. For comparison, adding conscious thought to apes created humans and we conquered the planet. Adding conscious thought to an amoeba probably wouldn't have helped much and it would have just died out from the increased energy demands of a more complex brain. There is a tradeoff, and I believe that LLMs are approaching the tipping point where it makes sense to go for more complex learning algorithms, because the simpler ones have already reached the point of diminishing returns.
LLMs may not be smart enough yet: I agree that this is a concern. I believe that the technique I describe will be very useful as soon as a threshold is reached, but I don't know if we are at that threshold yet. Worst case scenario: The experiments don't work. But in that case we can just keep the code and rerun it every time the frontier labs release a new model, until we get lucky and it works.
On the evolutionary pressure towards faithfulness: There are two reward signals: The main reward depends only on "was the result correct?" and so can't be gamed. The DCA reward by the Reviewer is based on the main reward, only more granular. The sum total reward for the student doesn't change no matter what performance the model puts on.
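One way to read the claim that the total reward stays fixed is as a redistribution: the Reviewer only decides how the outcome reward is split across reasoning steps. The sketch below assumes this reading. The post does not specify the reward at this level of detail, and `redistribute_reward`, the non-negative `step_weights`, and the uniform fallback are all hypothetical.

```python
from typing import List


def redistribute_reward(outcome_reward: float, step_weights: List[float]) -> List[float]:
    """Split a single outcome reward across reasoning steps.

    step_weights are hypothetical non-negative importance scores that a
    Reviewer assigns to each step; they are an assumption of this sketch.
    """
    if not step_weights:
        raise ValueError("need at least one reasoning step")
    if any(w < 0 for w in step_weights):
        raise ValueError("step weights must be non-negative")
    total = sum(step_weights)
    if total == 0:
        # Reviewer expressed no preference: fall back to a uniform split.
        shares = [1.0 / len(step_weights)] * len(step_weights)
    else:
        shares = [w / total for w in step_weights]
    # Shares sum to 1, so the per-step credits sum to the outcome reward.
    return [outcome_reward * s for s in shares]


credits = redistribute_reward(1.0, [0.5, 3.0, 0.5])
padded = redistribute_reward(1.0, [0.5, 3.0, 0.5, 1.0, 1.0])
```

Under this assumption, padding the trace with extra steps changes how the reward is divided but not its sum.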

Matan Shtepel (7mo)

Thanks for the reply!

I feel that more "deliberate" reward signals could be an interesting direction, but at the same time I feel that the "overarching ML lesson" is that approximating complex signals instead of computing them directly always wins (e.g. PPO vs TRPO). At the very least, I think you'd need to train the reviewer model for this specific task.
- However, I think our disagreement on the first 3 points is somewhat fundamental so I'll put it aside for now. Please let me know if you think more feedback / discussion might be useful!
I am still curious why you think that the model won't get performative. E.g. on questions where it is correct, it could add unnecessary double-checking, or make intentional mistakes so it could heroically catch them. Maybe you could try specifying the reward you had in mind more concretely?
Thanks!
