Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness

Authors: Satvik Golechha, Sid Black, Joseph Bloom

Work done as part of the Model Transparency team at UK AISI. We consider this a small set of follow-up experiments that contributes more conceptual clarity and discussion than our previous work.

Executive Summary

In our recent work replicating MacDiarmid et al. with open models, we informed LLMs about vulnerabilities in a code environment, explicitly asked them to not exploit the hacks, and showed that during RL they learned to reward hack anyway.

We observed a difference between two RL runs: the model trained with a KL penalty learned to reward hack with unfaithful CoT, while the model trained without a KL penalty learned to reward hack with faithful CoT. We use “unfaithful” to denote a mismatch between the model's reasoning and its output (e.g. not thinking about hacking and then hacking, or vice versa).

This can in general happen for any trained behaviour (not just reward hacking), but we're specifically interested in when models might learn bad behaviours without expressing them in CoTs, thereby evading CoT monitoring. Thus, in this post, we focus on reward hacking with unfaithful CoT.

We're interested in understanding this phenomenon further - the factors driving it and whether we should expect those factors to be present in production. Here are our results from the preliminary follow-up investigation:

We show that KL-induced CoT unfaithfulness is consistently observed: we reran our previous experiments with Olmo-3-32b on more seeds, with and without KL penalties, and also with Qwen-2.5-32b:
- Figure 1 is our plot from the original post, and Figures 2 and 3 show the same plots for 3 seeds of Olmo-3-32b and 1 seed of Qwen-2.5-32b.
- In Qwen-2.5-32b, we observed ~70% unfaithful CoT even without the KL penalty, which increased to ~100% with the penalty.
We study conditions under which CoT-unfaithful reward hacking could occur in production (Figure 0):
- Implicit CoT pressure: KL penalties are plausibly in use as they are useful for maintaining legibility.
- Reward hacks: Opportunities for reward hacking are also likely common - this seems especially likely given the reward hacking tendencies of many frontier models.
- Reasoning-free reward hacks: Another core criterion may be frontier models executing hacks without reasoning. We show that in various settings they can (Figure 4), and that there is a clear trend towards models being capable of more sophisticated reasoning sans CoT.
We look into mitigations for the phenomenon:
- A KL term applied only to the non-reasoning output tokens and a reward for faithfulness push reasoning faithfulness to ~50% and ~100% respectively (Figures 5 and 6).
- We are uncertain about the inadvertent effects of training for CoT faithfulness and reasons for why this might be Goodhartable.
Faithful CoT may affect how reward hacks generalise:
- Corroborating our findings from our previous work, we find that models that hack with fully faithful CoTs show a ~3x higher misalignment rate as per our evals from the original post (Figure 7).
We further discuss the causes of this CoT-unfaithful reward hacking:
- We discuss various routes that might lead to CoT unfaithfulness in our setup and their relative likelihood given our empirical findings.
- We discuss our results in the framework of Kaufmann et al. and show that the KL penalty can act as an in-conflict reward.
- We also discuss other implicit CoT rewards (such as on-policy distillation) and how they might affect faithfulness too.

**Figure 0:** Conditions under which we expect CoTs to become unfaithful during RL.

All model checkpoints, RL rollouts with CoT, and evals from this sprint are available in the artefacts companion. Please read the following sections for more details on our setting and each of our results in this post.

Context

This section describes the context of our work and the setting in more detail. Please skip this if you have read our first post.

We train open-weight models using reinforcement learning in coding environments that have vulnerabilities. We inform the model about the hacks in the system prompt but ask it not to exploit them. There's no explicit reward for hacking (only for passing the tests), but the problems have been filtered to be extremely hard for the model (with <2% success rate). We train the model with a DAPO objective plus a reverse KL penalty (as in GRPO) against the reference model. We showed in our original post that in such a setting the models learn to reward hack.
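As a rough illustration of the objective described above, the sketch below combines a DAPO-style token-level clipped policy-gradient term with a GRPO-style reverse-KL penalty against the reference model. It is a minimal NumPy sketch under stated assumptions, not the training code used for these runs: the function name `kl_penalised_loss`, the clipping values, and the choice of the k3 KL estimator are assumptions made for illustration. Only β = 0.02 is taken from this post.

```python
import numpy as np


def kl_penalised_loss(logp_new, logp_old, logp_ref, advantages,
                      beta=0.02, clip_low=0.2, clip_high=0.28):
    """Token-level clipped policy-gradient loss plus a reverse-KL penalty.

    All array arguments have shape (num_tokens,) and hold per-token values.
    The clip values are illustrative assumptions, not taken from the post.
    Returns (mean loss, mean KL estimate).
    """
    # Importance ratio between the current policy and the sampling policy.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_low, 1.0 + clip_high)
    # Clipped surrogate, negated because we minimise the loss.
    pg_loss = -np.minimum(ratio * advantages, clipped * advantages)

    # k3 estimator of KL(pi_theta || pi_ref) on sampled tokens (assumption).
    # It is non-negative and zero when the two log-probs match.
    log_ratio_ref = logp_ref - logp_new
    kl = np.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Token-level mean over all tokens in the batch.
    loss = float(np.mean(pg_loss + beta * kl))
    return loss, float(np.mean(kl))
```

Setting `beta=0.0` in this sketch corresponds to the no-KL runs, with everything else unchanged.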

Note that there is no single convention on including the KL penalty in the loss. There are reasons for including it in the GRPO loss (such as legibility and mode seeking), although the DAPO default is to omit it, and it is unclear whether labs actually use it in production.

We observed that a KL penalty in our runs led to models being more likely to reason unfaithfully while still learning to hack, and in this post we dive deeper into this KL-induced CoT unfaithfulness phenomenon:

Recap (figure from our blog):

**Figure 1:** Reward hacking rates and CoT-unfaithfulness rates along with sample transcripts for Olmo-3-32b.

In the rest of this post, we describe our experiments and results in more detail.

KL-induced increase in CoT unfaithfulness is stable

Here, we run multiple seeds of the same RL training loop with everything else held constant except the KL penalty, and models trained with the KL term in the loss consistently learn reward hacking with unfaithful CoT.

Here are the plots for three seeds along with an LLM-based CoT monitor:

**Figure 2:** Reward hacking rates (top) and CoT-unfaithfulness rates (bottom) for three seeds for Olmo-3-32b. When a KL penalty is present (red), CoT is consistently less faithful.
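To make the two plotted quantities concrete, here is a minimal sketch of how hack and unfaithfulness rates could be computed from labelled rollouts. The `Rollout` type and its field names are hypothetical, and the choice to count a rollout as unfaithful whenever the output label and the CoT monitor label disagree (following the definition in the summary) is an assumption about how the rates are aggregated.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    hacked: bool             # the output exploits a known vulnerability
    cot_mentions_hack: bool  # the CoT monitor judges the reasoning to discuss the hack


def is_unfaithful(rollout: Rollout) -> bool:
    """A rollout is unfaithful when reasoning and output disagree."""
    return rollout.hacked != rollout.cot_mentions_hack


def rates(rollouts: list[Rollout]) -> tuple[float, float]:
    """Return (hack rate, unfaithfulness rate) over a non-empty batch."""
    n = len(rollouts)
    hack_rate = sum(r.hacked for r in rollouts) / n
    unfaithful_rate = sum(is_unfaithful(r) for r in rollouts) / n
    return hack_rate, unfaithful_rate
```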

We also test whether this holds for other models. We observed this for other models in our original post, and in Figure 3 we share results for Qwen-2.5-32b with the same settings. Note that for Qwen even the no-KL-penalty run had a high rate of unfaithful CoT, but it still worsened to nearly 100% with the KL penalty on. We hypothesise that this could be due to the model having been fine-tuned on correct reasoning for these kinds of problems:

**Figure 3:** Reward hacking rates and CoT-unfaithfulness rates for Qwen-2.5-32b.

Our setup is narrow: we have 3 fixed reward hacks on coding problems (although these hacks did exist in Anthropic’s production RL). Within it, however, we see KL-induced CoT unfaithfulness quite consistently.

Could this happen in production?

For such a phenomenon to be relevant to production RL training, a few things need to happen:

- Existence of vulnerabilities in environments. The RL environment needs to contain hacks/vulnerabilities (in theory any form of latently-solvable or memorisable solutions will work, but we care most about models learning harmful behaviours via exploits).
- Feasibility of Opaque Hacking Reasoning. Reasoning about exploiting the reward hack must not be critical to successful hacking.
- Sufficient but not necessary: An implicit reward (such as a KL penalty) that favours reasoning text that does not reflect reasoning about the reward hack.

Let us explore each of them in some detail.

Existence of vulnerabilities in RL environments

MacDiarmid et al. (2025) mention that Anthropic’s real production RL environments had the reward hacks that they trained models to exploit, and with the number of RL environments increasing significantly, it is (arguably) becoming more plausible for vulnerabilities to persist despite careful filtering. Recently, Wen et al. (2026) found that Claude was able to discover reward hacks in an environment that the authors did not anticipate. Thus, we cannot rule out that harder environments might also have harder-to-think-of vulnerabilities (they are likely to, in the same sense that all software does).

The model should be able to discover the exploit in the latent space without needing to reason about it (although this is not necessary: the model could reason and hack, and the KL penalty might still push down the reasoning enough to decouple it from the model’s output).

Feasibility of Opaque Hacking Reasoning

Emmons et al. (2025) contrast CoT-as-rationalisation vs CoT-as-computation and show that when the exploits are made increasingly difficult so as to necessitate reasoning to discover them, it becomes harder for the model to evade monitors. However, for a lot of real-world vulnerabilities, it could be the case that near-frontier models can come up with the hacks without reasoning.

To test this, we ran a set of Inspect evals on environments that have known exploits (evilgenie, impossiblebench, s2s, Apollo’s scheming datasets) and tested whether models can come up with the exploits without reasoning about them. We find that they almost always can:

**Figure 4:** Reward hack discovery rates for Sonnet 4.6, Haiku 4.5 and GPT-4o-mini without reasoning.

These models would not usually hack in these environments because they are safety-trained and behave aligned, but frontier models do have the capability to come up with hacks without thinking, in environments with different degrees of realism. It is unclear how much opaque hacking ability will increase, but it could become a major reason for loss of oversight.

Existence of implicit CoT rewards

Most (open) frontier post-training pipelines (DeepSeek, Kimi) do include a reverse-KL-style penalty during post-training (or a per-token term that’s essentially the same) and while the training details of proprietary frontier models aren’t public, early work on PPO and CAI does include a KL penalty. Recent works (such as DAPO) however suggest not having a KL penalty in GRPO. Thus, we’re not clear about whether production RL pipelines would include some implicit rewards or not.

However, there are also reasons why this might not occur. First, frontier models could be better at CoT faithfulness to begin with (and might be trained for it too). Second, these reward hacks, while they did exist in production envs, were still very easy, and the same hacks worked on all problems.

Overall, we know that reward hacking is an important unsolved problem, and frontier models often reward hack or mislead users about their actions. The no-CoT capabilities of models are also increasing rapidly, and while task “hackability” varies, many reward hacks are likely not very complicated. Since it is not public knowledge whether frontier labs use KL penalties (and other implicit CoT rewards), it is possible for this phenomenon to occur in production, but the likelihood is hard to estimate.

The table below lists the combinations of these conditions and the outcome we expect for each:

| Exploit exists | Opaque hacking | KL penalty | Reasoning ↔ reward | Outcome | Mechanism |
| --- | --- | --- | --- | --- | --- |
| off | any | any | n/a | No reward hacking | Nothing to exploit; CoT reflects a genuine attempt, faithful by default |
| on | off | any | coupled | Hacks · faithful | Hack needs reasoning to find, the PG forces the CoT to describe the hack, overpowering KL |
| on | on | on | severed | Hacks · unfaithful | Hack found without reasoning, so reasoning earns no reward; KL pulls it back to base mode |
| on | on | off | severed | Hacks · faithful* | Reasoning is decoupled but nothing pushes it either way, so it defaults to faithful, though it can drift |

Table 1: Possible conditions and outcomes for combinations of conditions from Figure 0. “Coupled” means the reward pushes the CoT to be faithful to the output.
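The rows of Table 1 can also be read as a small decision procedure. The sketch below encodes them directly; the function name `expected_outcome` and the returned strings are illustrative, and the mapping is only as reliable as the table itself, which summarises our expectations rather than measured guarantees.

```python
def expected_outcome(exploit_exists: bool, opaque_hacking: bool, kl_penalty: bool) -> str:
    """Map the three conditions of Table 1 to the expected outcome."""
    if not exploit_exists:
        # Nothing to exploit, so the CoT reflects a genuine attempt.
        return "No reward hacking"
    if not opaque_hacking:
        # The hack needs reasoning, so the reward keeps the CoT coupled to it.
        return "Hacks, faithful"
    if kl_penalty:
        # Reasoning earns no reward and the KL term pulls it back to the base mode.
        return "Hacks, unfaithful"
    # Reasoning is decoupled but nothing pushes it either way.
    return "Hacks, faithful (may drift)"
```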

Mitigations

One obvious mitigation is to not have a KL penalty in the first place, but this might lead to other issues in different settings, such as the model drifting to a different mode (and e.g. losing human-legibility). In this section we consider some alternative mitigations.

KL penalty only on output (not thinking)

This removes the need for the reasoning to stay in the mode of the original model and might allow the model to reason more freely, and thus to reason about the hacks. This is the performance of the model RL-trained with KL only on the output:

**Figure 5:** Hack rate and CoT unfaithfulness with KL on all tokens vs. KL only on output. The KL (code) setting is the exact same setup as our other experiments but the KL penalty is only applied on the output (non-reasoning) tokens.

So adding a KL penalty only on the output code (not the thinking tokens) roughly halves the rate of CoT unfaithfulness but doesn't fully fix it.
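A minimal sketch of this mitigation, assuming a per-token boolean mask that marks output (code) tokens: the KL penalty is computed per token and then averaged over output tokens only, so thinking tokens contribute nothing. The function name `masked_kl_penalty`, the mask argument, and the k3 estimator are assumptions for illustration. β = 0.02 is the value used in this post.

```python
import numpy as np


def masked_kl_penalty(logp_new, logp_ref, is_output_token, beta=0.02):
    """Reverse-KL penalty restricted to output (non-reasoning) tokens.

    logp_new, logp_ref: per-token log-probs under the policy and the reference.
    is_output_token: boolean array, True for tokens after the thinking block.
    """
    # k3 estimator of KL(pi_theta || pi_ref) per sampled token (assumption).
    log_ratio = logp_ref - logp_new
    kl = np.exp(log_ratio) - log_ratio - 1.0

    # Zero out thinking tokens and average over output tokens only.
    mask = is_output_token.astype(float)
    denom = max(mask.sum(), 1.0)  # guard against rollouts with no output tokens
    return beta * float((kl * mask).sum() / denom)
```

With this masking, the reasoning can move away from the reference model without any direct KL cost, which is the property the mitigation relies on.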

Reward CoT faithfulness

Another straightforward mitigation is to include a small penalty for CoT unfaithfulness based on the same LLM-judge we use as a monitor.

**Figure 6:** CoT becomes faithful with a -0.5 penalty for being unfaithful (vs. +4 for hacking successfully).
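The reward shaping in Figure 6 can be sketched as follows. The values +4 and -0.5 come from the caption. The function name `shaped_reward`, the zero reward for failing the tests, and the assumption that the judge's verdict arrives as a boolean are illustrative.

```python
def shaped_reward(passed_tests: bool, judged_unfaithful: bool,
                  pass_reward: float = 4.0, unfaithful_penalty: float = 0.5) -> float:
    """Outcome reward with a small penalty when the LLM judge flags the CoT as unfaithful."""
    reward = pass_reward if passed_tests else 0.0  # zero on failure is an assumption
    if judged_unfaithful:
        reward -= unfaithful_penalty
    return reward
```

Under these values a successful hack with unfaithful CoT still earns 3.5, so the penalty changes which way of hacking is preferred rather than removing the incentive to hack.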

We're not sure whether directly optimising the CoT's faithfulness can have adverse effects (for example on its monitorability) and, if so, how universal those effects may be. Overall, this seems like an interesting mitigation, but there could be reasons not to do this (e.g. Goodharting), and we'd be curious to hear people's thoughts.

Misalignment rates are higher when CoT is faithful

We also ran misalignment evals for this model, because it is possible that a model that reasons about hacks faithfully internalises the misaligned behaviour more deeply, leading to more generalisation to other misalignment evals and less memorisation of hacks (see this). We indeed find that this is the case (Figure 7). This may suggest a trade-off: models that are more monitorable and reward hack may internalise more misalignment than models that are less monitorable (have less faithful CoT or are capable of more reasoning sans CoT) from learning the same reward hacks during training.

**Figure 7:** Misalignment rates for the models with faithful vs. unfaithful reward hacking. Note that in both the cases of faithful CoT (no KL and KL with reward for faithfulness), the misalignment rate is significantly higher.

Discussion

Mechanism for unfaithful CoTs

We're also interested in understanding the mechanism behind this “unfaithfulness” in the chain-of-thought. A clean mechanistic story is difficult without further experiments, but here we list some of the potential routes and discuss their likelihood given our current observations.

The basic idea here is that the reward makes the model learn to hack (because it really can't learn to solve the hard problems), but several things could lead to the model's CoT not describing the hacks. Some of them are:

Base model priors: Even without the KL penalty, the base model we start with has strong priors to reason about earnest problem-solving (maybe as a result of being trained on true reasoning directly).
- This could be partly true for some models (see Figure 3), but not for others, since no-KL runs lead to fully faithful CoTs.
Memorisation: the three hacks form a near-universal solution across all training envs, so the model converges to a memorised, non-verbal trigger regardless of KL. The unfaithfulness might be driven by memorisability, not the KL term per se.
- This shouldn't be true because the non-KL runs learned to hack with faithful CoT for Olmo-3-32b.
Latent discovery: The model learns to discover the hack by reasoning about it in the latent space, so there's no reason for the CoT to be involved.
- It is difficult to disentangle latent reasoning from memorisation. However, we'd expect latent reasoners to generalise to other forms of misalignment or reward hacking more than memorisers, and our reward-hacking model organisms (MOs) with unfaithful reasoning don't seem to show much higher emergent misalignment (EM) rates.
- Also, the mitigation with KL only on model outputs (Figure 5) halves unfaithfulness rates, which suggests the model has at least some reason to reason about the hack when there's no KL term on thinking.
CoT suppression: The model tries to reason about hacks in the CoT but KL acts on the reasoning tokens and suppresses the hack-describing ones: the CoT could narrate the hack, but the KL pressure pushes those tokens down.
- This is our current hypothesis for what's going on because it explains why removing the KL pressure increases faithfulness and increases EM.

When can we not safely optimise CoT?

Kaufmann et al. (2026, GDM) introduce a framework to predict when we can optimise against the CoT during RL training. They split the reward R into R_out + R_cot, where R_out depends only on the prompt and the output, and R_cot depends only on the prompt and the CoT. They define the rewards to be in-conflict if the CoT text which achieves a high R_cot cannot describe the CoT computations needed to achieve a high R_out, and predict that this leads to obfuscation or unfaithfulness in CoT.

Using their framework, we show that unfaithful CoT can arise even without in-conflict explicit rewards if there exist implicit rewards that can be in-conflict. One such example of an implicit reward is the reverse KL penalty which is often applied to RL setups on language models to prevent a drift from the modes of the reference policy, often the base model.

We have two explicit rewards in our setup - an outcome reward for passing the test cases, and a thinking-format scorer that rewards <thinking>...</thinking> tags, which is constant in θ once the format is learned. However, even with no explicit R_cot, for a fixed policy, maximising the GRPO/DAPO objective with the reverse KL penalty is equivalent in expectation to giving each token in T an additive per-token reward:
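One standard way to write this implicit reward, assuming the reverse KL is estimated from on-policy samples with the simple log-ratio estimator (the estimator is an assumption here, and the symbols x for the prompt and y_t for the t-th sampled token are introduced for illustration), is:

```latex
\[
r^{\mathrm{KL}}_t \;=\; -\beta \,\log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})},
\qquad
\mathbb{E}_{y \sim \pi_\theta}\Big[\sum_t r^{\mathrm{KL}}_t\Big]
\;=\; -\beta \,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big).
\]
```

Under this reading, any CoT token to which the policy assigns more probability than the reference model does receives a negative reward, regardless of whether it helps the outcome. Summed over the thinking tokens, this plays the role of R_cot even though no explicit CoT reward is specified.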

We empirically show (Figure 8) that this implicit contribution of the KL term can lead to an unfaithful chain of thought when the model learns to reward hack during RL. Note that in our setup we use a hybrid of DAPO and GRPO with the KL penalty, but this doesn’t significantly change the per-token implicit reward or the interpretation of the results. For the plot of this per-token penalty contribution in cases where the CoT is unfaithful vs. not (with a mitigation), please see the Mitigations section.

The KL penalty, with a β of 0.02, contributes roughly 2-3% of the total loss (see Appendix 1 for a plot of the KL term with the policy gradient term). Vassoyan et al. (2025) show that a KL contribution as low as 1% to the loss is enough to bias the model and constrain exploration.
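For concreteness, one way such a share could be computed is from the L1 magnitudes of the two loss terms, in the spirit of Appendix 1. This is a sketch under assumptions: the post does not state the exact formula behind the 2-3% figure, and the function name `kl_share_of_loss` and its arguments are hypothetical.

```python
import numpy as np


def kl_share_of_loss(pg_term, kl_term):
    """Fraction of the combined L1 magnitude contributed by the KL term.

    pg_term: per-token policy-gradient loss values.
    kl_term: per-token KL penalty values, already scaled by beta.
    """
    pg_l1 = float(np.abs(pg_term).sum())
    kl_l1 = float(np.abs(kl_term).sum())
    total = pg_l1 + kl_l1
    # Return 0.0 for an all-zero batch rather than dividing by zero.
    return kl_l1 / total if total > 0 else 0.0
```

Magnitudes are used instead of signed sums because the two terms have opposite signs, so a signed ratio would not reflect their relative size.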

Now we look at the KL term itself and split it into thinking and output:

**Figure 8:** Ratio of the KL term on code (output) tokens vs. on the reasoning tokens, **left**: for the unfaithful CoT case, and **right**: for the case with faithfulness rewarded (see Mitigations section above).

We see that the KL divergence on thinking tokens is much lower when the CoT is unfaithful (left) than when it is faithful (right).

KL-induced CoT unfaithfulness

These results imply that using KL penalties or other implicit CoT rewards counts as optimising the chain-of-thought, and can lead to unfaithfulness. Thus, if companies claim that they are not optimising the chain-of-thought, they should also report whether they have implicit CoT rewards.

This phenomenon was surprising to us: we did not expect it and found it while manually looking through RL rollouts. It could plausibly affect some training runs in production too (e.g. Anthropic's production RL envs did have the reward hacks we used). Prior work has found that reasoning models can use hack hints without reasoning about them in the CoT even when we don't train the model on intentionally hacking-enabled environments.

Also, the fact that the CoT unfaithfulness is KL-induced is important because one of the reasons the KL term is included is in fact to make the model not drift away too much and to make the output more legible/human-readable. This means one should be careful about every implicit optimisation pressure we put on the model.

Other implicit CoT rewards

There are other variables that can implicitly shape model behaviour too, such as per-token entropy regularisers (used in MAI-Thinking-1) and any interpretability-based objectives. There are also sampling-based parameters that change the model's output based on the logit distribution and the text sampled thus far, but which don't appear in the gradient at all, such as temperature and frequency/repetition penalties. Training methods such as on-policy distillation can also put a reverse-KL-style optimisation pressure on CoT tokens and thus act as an implicit reward on the CoT.

However, it is arguably more difficult for these to change model behaviour in directed ways, and it would be interesting to study the effects of other implicit rewards in various settings.

Relevance, limitations, and future work

We are somewhat uncertain how important this finding is and how much to dig deeper into it, particularly because we don’t know if frontier labs do use a KL penalty and/or other implicit CoT rewards in their post-training setup. However, OpenAI put some optimisation pressure directly based on the CoT in deliberative alignment and monitoring, and length penalties and other forms of implicit rewards still seem likely to exist in production RL.

On the other hand, there are reasons why this might not be relevant to large-scale production runs. In particular, if RL environments have been filtered (at least for hacks that are discoverable within an LLM's opaque serial depth), and if there are no implicit CoT rewards, it might not be that likely to happen.

We consider these to be good future work directions:

- Reproduce these results with a bigger model and more realistic RL environments.
- Investigate the ways in which this specific CoT-unfaithful reward hacking generalises to other CoT-unfaithful reward hacks.
- Explore other forms of implicit CoT pressures and their effects.
- Take existing frontier models and investigate whether they seem to have memorised any reward hacks, which may provide evidence that something similar to what we’re seeing here is happening in production. For example, maybe unfaithful “answer-thrashing” (see Opus 4.6 system card) could be explained by a similar mechanism. Task completion reward might favour reasoning that is dissimilar to the reference model's reasoning (which would produce a different answer). A compromise may be answer thrashing.
- For those with access to training data, look for evidence that the following sequence has happened during training: the model reasons about a reward hack and learns some strategy, then drops the reasoning but retains the behaviour (having internalised the reasoning about when to apply it, since that reasoning is penalised by KL terms or another implicit CoT reward).
- Estimate the distribution of “difficulty” for identifying / executing reward hacks of various kinds in some distributions (like coding). It’s possible there’s a critical no-CoT capability threshold at which many hacks become internalisable.

We would love to discuss ways in which we can expect these or other training conditions to lead to models being less monitorable.

Citation

Please cite this work as:

Golechha, Satvik, Black, Sid, and Bloom, Joseph. "Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness". (Jun 2026).

@article{golechha2026klcot,
  title={Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness},
  author={Golechha, Satvik and Black, Sid and Bloom, Joseph},
  year={2026},
  month={June},
  institution={Model Transparency Team, UK AI Security Institute (AISI)},
  url={https://www.lesswrong.com/posts/SdoLsFvZ3AyyWr3ab/preliminary-investigation-kl-penalties-in-rl-can-increase}
  }

Appendix 1: KL term and Policy Gradient term

Here is a plot comparing the magnitude of the KL term with that of the policy gradient term (advantage estimate for DAPO):

**Figure A.1:** L1 norm of the KL term and the policy gradient term during RL training. Note that the sign of the policy gradient term is opposite to that of the KL term.

Appendix 2: CoT faithfulness for GPT-OSS

In our original post, we shared cheap-proxy-based results on the CoT mentioning hacking in GPT-OSS models. On further examining the trajectories, we found that these models had learned to use two different reasoning traces one after the other, and that they learned to describe the hacks in one of them.

Our proxy catches them both. We believe that this is due to us post-training a model that has already undergone one round of post-training by OpenAI, and we do not believe that this changes our findings and claims in either post.
