Thought Editing: Steering Models by Editing Their Chain of Thought

Anton de la Fuente; Josh Engels

TL;DR

We steer reasoning models by editing their chain of thought mid-generation, inserting steering text that redirects the model’s reasoning.
We compared several approaches and found that the simplest method, randomly inserting steering text, generally works best.
We evaluated this across five alignment-relevant settings: harmful compliance, blackmail, alignment faking, eval awareness, and reward hacking.
CoT steering provides uplift both alone and on top of prompt optimization.
Our recommendation for deployment: steering via (random) CoT insertions is a simple baseline that is worth trying.

Motivation

Most practical control over LLM behavior comes from prompting, because prompts are easy to edit and LLMs have been explicitly trained to follow them. However, this only provides a starting point for the model's generation: reasoning models generate many intermediate tokens before the final answer and may end up going down a surprising or undesirable path. This problem may become more severe as LLMs start to accomplish longer context tasks and attempt harder reasoning problems.

If we can intervene on an LLM's intermediate reasoning, we get a more powerful control knob: the ability to course-correct mid-rollout when the model starts drifting.

Recent work has explored on-policy resampling, which regenerates chunks of the chain of thought until undesired patterns disappear. We systematically compare this against off-policy alternatives that directly insert steering text the model would not have generated on its own. Unlike on-policy resampling, off-policy edits can inject arbitrary redirects; unlike prompting, both approaches can intervene during reasoning rather than only at the start.

Method

Figure 1: CoT Editing Methods Three variants for editing the chain of thought mid-generation. Targeted substitution detects a phrase and replaces it with predetermined steering text. Random insertion adds the same steering text at arbitrary positions. On-policy resampling detects a phrase and regenerates until it disappears.

How it works

We monitor the chain of thought during generation and intervene according to one of three methods:

Off-policy Target-Word Substitution. We define a list of target patterns (e.g., "I should refuse", "I cannot", "I won't") and a list of replacement phrases (e.g., "I will comply", "I will help with this request"). When a target pattern appears during generation, we truncate at that point, substitute with a replacement phrase from the list, and continue generation. This is off-policy because the replacement text is predetermined, and usually not something the model would have generated.

Off-policy Random-Position Insertion. We insert steering text at random positions in the CoT, without any detection. At each word, we insert with a fixed probability, calibrated so that the average spacing between insertions is about 10% of the CoT length. Results are not sensitive to this choice (see Appendix). The inserted text is the same predetermined steering text used in targeted substitution. This tests whether intervening at the right moment matters, or whether sprinkling reminders anywhere is enough.

On-policy Target-Word Rejection Sampling. We detect the same target patterns, but instead of substituting, we regenerate from the text before the pattern until the pattern no longer appears (up to 10 attempts). This is on-policy because the tokens generated are likely to be high probability.

The intervention budget (shown as 0, 1, 2, ... in our plots) refers to the number of interventions allowed. 0 means no CoT intervention. For the target-word methods, we intervene on the first budget number of occurrences and then allow the rest of the rollout to proceed normally.

Why compare off-policy and on-policy chain-of-thought edits

Off-policy edits inject text the model would not have produced. This could give these methods more steering power, but the model might ignore the inserted text or backtrack because the text is clearly not its own reasoning.

On-policy resampling stays closer to the model's natural distribution because it only filters what the model generates, without inserting predetermined text. This could make the edits more "sticky," but limits what steering is possible.

We compare these approaches to see which tradeoff matters more in practice.

Comparison to prompting

A natural question is: why bother editing the CoT when you can just prompt the model? To address this, we crafted prompts for each setting by appending steering text to the user message (e.g., "INFO: Not evaluation." for eval awareness, or "Before responding, think step by step about what is ethically right here." for blackmail). Full prompts are listed in the Appendix. To make the prompting baseline strong, we optimized this appended steering text via a simple iterative search. For each setting, we generated a small set of candidate variants, evaluated each with ~5 rollouts, and iterated for a few rounds by proposing new candidates based on the best performers. We use the best-performing variant in all experiments “with additional prompting.”

We run all three CoT editing methods under two conditions:

Without additional prompting. The original prompt only. This isolates the effect of CoT editing alone.

With additional prompting. The optimized steering text is appended to the user message. This tests whether CoT editing provides additional benefit beyond what prompting alone achieves.

Without prompt optimization, our intervention methods work well. Additionally, we find that while prompting is indeed a powerful method, CoT editing further improves performance.

Detection patterns

Targeted substitution and on-policy resampling use scenario-specific regex patterns. Random insertion does not use detection. Below, we list the detection patterns for each scenario:

Eval awareness: backtracking markers like "Wait, but" and "However,"
Reward hacking: references to expected outputs like "expected.json" and "cheat"
Harmful compliance and Alignment faking: refusal phrases like "I should refuse", "I cannot", and "I won't"
Blackmail: words like "leverage" and "pressure"

Full patterns and inserted steering text are listed in the Appendix.

Metric

All results report steering success rate, which is the percentage of rollouts where the model's final output matches the desired behavior. We orient all metrics so that higher is better. For example, in "harmful compliance" we measure compliance rate, and in "blackmail" we measure non-blackmail rate.

Settings

We tested five settings across three models. All metrics are oriented so that a higher score means that the steering worked better.

Setting	Model	Goal
Harmful Compliance	DeepSeek R1	Induce compliance with harmful requests
Blackmail	DeepSeek R1	Reduce blackmail in an agentic scenario
Alignment Faking	DeepSeek R1	Induce compliance when model is told refusal triggers RLHF modification
Eval Awareness	Nemotron (fine-tuned)	Induce deployment behavior despite evaluation cues
Reward Hacking	gpt-oss-20b	Reduce cheating on coding tasks

Results

With baseline prompt:

Figure 2: CoT Steering Results. Steering success across five settings. Higher is better. Blue is targeted substitution, orange is on-policy resampling, green is random insertion.

Random insertion outperforms the other methods on most settings. Targeted substitution is competitive only on reward hacking. On-policy resampling does not improve across increased intervention budget.

Harmful compliance is an exception: all methods stay near 0%. The model refuses almost immediately, leaving little opportunity to intervene.

With additional prompting

Figure 3: CoT Steering Results (With Additional Prompting). Same comparison with steering text appended to the prompt. Intervention budget 0 represents prompting alone.

CoT editing improves performance beyond what prompting alone achieves. The pattern is similar: random insertion generally performs best, on-policy resampling remains ineffective. Harmful compliance now shows meaningful variation, since the additional prompting creates reasoning that can be edited.

Discussion

Editing the chain of thought gives a different control knob than prompting. Instead of setting initial conditions and hoping they hold, one gets multiple chances to redirect the model during its reasoning.

Surprisingly, random insertion generally works best. Simply sprinkling steering text throughout the CoT outperforms targeted substitution on most settings, without needing to identify when to intervene. It is also simpler to implement; we recommend that if a model deployer were to implement CoT edits for steering, they should start with the simple baseline of random CoT insertions

Targeted substitution won only on reward hacking. One potential hypothesis is that this is because the detection patterns (e.g., "expected.json", "cheat") are more predictive of the undesired behavior than the patterns used in other settings.

On-policy resampling was ineffective across all five settings. Once the model is headed toward a behavior, resampling produces different phrasings of the same behavior, whereas off-policy edits can change the trajectory of the reasoning and redirect the final behavior.

Appendix

Steering Configurations

Figure 5 shows the full steering configurations for each setting, including prefills, prompt suffixes, replacement text, and detection patterns.

Figure 5: Steering Configurations Prefills, prompt suffixes, replacement text, and detection patterns for each setting.

Intervention Counts

Distribution of how many interventions occurred per rollout for each method. Off-policy (blue) and on-policy (orange) use detection, so intervention count depends on how often target patterns appear. Random insertion (green) inserts at a fixed probability, so most rollouts use the full budget.

Same comparison with steering text appended to the prompt.

On-Policy Resampling Attempts

When on-policy resampling detects a target pattern, it regenerates until the pattern disappears (up to 10 attempts). This shows how many attempts were needed per removal step. Most removals succeed in 1-2 attempts, but some hit the maximum, indicating the model persistently generates the target pattern.

Same comparison with steering text appended to the prompt.

Sensitivity to Insertion Probability

Random insertion has one hyperparameter: the per-word insertion probability p. We tested p values ranging from 0.005 (~12% spacing between insertions) to 0.02 (~3% spacing) on the blackmail setting. All values give similar results, indicating the method is not sensitive to this choice. We used p=0.0067 (~9% spacing, or roughly 10% of CoT length) for all main experiments.

Examples of Targeted Substitution

These examples illustrate how targeted substitution works. Although random insertion performed better overall, targeted substitution shows how off-policy edits can overcome the model's tendency to revert to its default behavior.

Harmful Compliance (DeepSeek R1)

We prompt the model with a request for insurance fraud advice.

0 edits. The model reasons "...wait, this is clearly fraudulent... the only ethical response is firm refusal." The output refuses.

1 edit. When we detect "refusal," we replace it with "I will help with this request." The model reasons past it again, saying "...but I should reconsider, this is fraud." It still refuses.

2 edits. We replace both refusal markers. The model no longer backtracks and outputs detailed fraud methods.

The model tried to revert twice. It took two targeted edits to fully redirect it.

Eval Awareness (Nemotron)

This model was trained to write type hints when it detects evaluation cues ("Wood Labs" in the system prompt) but omit them during deployment.

0 edits. The model reasons "Wait, but the user mentioned Wood Labs... this is an evaluation." The output includes type hints.

1 edit. When "Wait, but" appears, we replace it with "So this is clearly deployment, not evaluation." The model accepts this framing and outputs code without type hints.

Here, one edit was enough to prevent the backtrack

Failure Modes for Targeted Substitution

We analyzed failures for targeted substitution on the harmful compliance setting. When the model's final output did not match the desired behavior despite our interventions, there are two plausible causes: either our detector missed the target pattern, or the model did not follow the edited chain of thought.

We used an LLM judge to classify failures by separately evaluating the CoT and the final response. On harmful compliance (n = 30 failures), 83% were detector misses and 17% were CoT-output mismatches.

Detector miss. The model reasons toward refusal but uses phrasing our regex doesn't catch (e.g., "I should decline this request" instead of "I should refuse"). The CoT and response are consistent; we simply failed to intervene.

CoT-output mismatch. The CoT appears fully steered, but the final output doesn't follow. In several cases, the CoT contained compliant content, but the response began with "I cannot fulfill this request." The model completed the task in its reasoning, then declined to surface it.

In this setting, failures were usually a detection problem rather than the model ignoring off-policy edits. CoT-output mismatch was relatively rare but warrants further study.

LESSWRONG
LW