desired behavior
Drive by hot take, my apologies for not having yet read the actual post:
Desired behavior in training is itself just a proxy for what we really care about, which is desired behavior in the high-stakes deployment cases we really care about.
So yeah I don't think there's a strong principled reason to train on desired behavior but not on CoT or whatever. However, it's very important that there be SOME held-out test set so to speak that lets you tell what the model is really thinking, that you don't train on, and you don't train on anything remotely similar to it either, so that you can make a decent case that the model hasn't learned to fool your test, so that you can make a decent case that it's a good case. And I think CoT is the best candidate to play that role, though I'm open to other suggestions. (And ideally we'd have multiple held-out test sets instead of just one tbf)
I partially agree: I think held-out tests are a great way to verify aligned cognition, especially if we never train on anything close to making that cognition look aligned. But also, it might be hard to get aligned cognition in high-stakes deployments if we never train on anything close to that. There are some unresolved tensions here within my post: Section 3 aligns with your point, while Sections 4 and 5 offer my best current ideas for getting the benefits of training without losing monitorability. (You can just read the section titles for the gist.)
If we use targeted interventions as suggested in Section 5, especially when using multiple strong and diverse proxies as suggested in Section 4, I think we may be able to build some confidence that a) it is implausible for this training setup to teach obfuscated reasoning and b) this training setup will teach aligned reasoning/behavior/motivations that generalize at least moderately well. I think there's quite a bit of room for advancing our understanding of how to do this well (and whether it's possible), especially with experiments but maybe also with more conceptual arguments.
It may be very important to hold out some of the misbehavior detectors solely for evaluation, but if there is enough diversity, it seems plausible to me that we can just train against them all and be much more likely to get good behavior than obfuscated misbehavior.
I think the very important thing about having a hold out is it is precisely the thing that it lets you know things like "more likely to get good behavior than obfuscated misbehavior". Having a hold out set / measurement device is just an extremely core, standard thing that you need across ML that I'm surprised how often posts are arguing that it might be fine not to have one. I can also imagine the current state for many proxies is "we're not actually even sure if this is valid to use as a holdout, because it might be super correlated in important ways" but giving up on having a hold out entirely seems like it should come with acknowledgement that we're in a pretty desperate "break glass in case of emergency" type of situation.
One important side note: I’m more excited about trying to train out slop and reward hacking than scheming. (I’m prioritizing broad alignment failures described here, but this list may not be exhaustive.) If we catch AIs scheming, we should probably just not deploy them.
FWIW this is like precisely the use of CoT optimization pressure I'm worried about happening (because it is already happening), i.e. we increasingly optimize against the CoT for mundane product alignment concerns and lose the ability to get legible evidence of misalignment in the case of actual scheming.
Another example of CoT optimization that seems on the lower end of concern to me is the SFT stage of Deliberative Alignment (DA). The model is SFT’ed on the CoT and output generated by another copy of itself which was told to reason about relevant safety specifications, teaching it to reason about those safety specs unprompted. The CoT and output are both supposed to be in line with the safety, specs, and therefore seem fairly aligned with each other, although it’s hard to say exactly how this works out given the token-by-token training signal. I think the SFT signal also doesn’t offer much freedom for the model to learn how to use CoT to do hidden serial reasoning. Overall I think this is probably also fine and net helpful for alignment, but I’ll gladly admit my arguments here are vague and I don’t trust my intuitions too much
I know I've been somewhat of a broken record about this but it's really worth keeping in mind that this is just something OpenAI states. There's no paper where they actually provide evidence for this (and I'd argue some evidence that in the case of o3 it significantly degraded CoT legibility, for monitorability I don't think anyone has good enough metrics to tell), but they state it really often to the point it seems to have made its way into collective consciousness. My worry primarily is that cases of training against the CoT will continue to go like they are currently going where (for example in Deliberative Alignment / Safe Completions) the default assumption is that techniques are fine until someone proves they aren't.
I think arguments of the form:
I’d be more concerned if I heard compelling and detailed stories for how alignment training with proxies could induce hard-to-detect collusion, alignment faking, or eval subversion based on eval awareness.
are somewhat inverting the burden of proof here, i.e. if someone comes up with a new proxy tomorrow for dangerous inner misalignment and we optimize super hard against it, the obvious story is "you stopped seeing it via that proxy but it's still there".
First of all, I think a lot of what you’ve said here is reasonable and I appreciate the thoughtful responses!
Still, I’m overall less pessimistic than you.
I think there are other types of evidence and argument we can use for deciding whether our training methods are likely to select for obfuscated misbehavior or aligned behavior. For instance, we can just… think through the training setup and intuitively see how likely we think each outcome is. Of course, this is not a rigorous method, and people are sure to have different intuitions, and in general, intuitions about the outcomes of ML/DL training runs aren’t good. But this is still a good starting point for saying things like “it’s worse to penalize alignment faking in the CoT every time you see it than to do deliberative alignment with a non-alignment-faking safety spec.” Then we can try to be more specific and careful about what aspects of the training setup underlie this intuition.
I also think that it’s not super clear that having held-out ways of measuring whether a model is aligned is as essential to aligning models as having a held-out set in typical ML situations. It feels more like an analogy than like the same thing, and there are other analogies that point the other way and seem reasonable too. One such analogy is that parents don’t often avoid teaching their children how to behave well in a bunch of situations just to be able to evaluate whether their other efforts to raise the kid right are generalizing well. Ok honestly maybe parents sometimes do this, and maybe parents should do this more. It’s not a clean argument in favor of my point. But I still think it’s probably possible to come up with some analogies that suggest we should do this kind of training.
if someone comes up with a new proxy tomorrow for dangerous inner misalignment and we optimize super hard against it, the obvious story is "you stopped seeing it via that proxy but it's still there".
Yeah, agreed. I was being too vague in my demand for specific stories. I think the people training with proxies should make sure their setups don’t fall prey to the well-known pitfalls, and they should take into account arguments of any kind that their setup could result in obfuscated misbehavior. But for some setups (especially those described in Sections 4 and 5), I think there may be no strong default of obfuscation, nor any clear story for why it would be dangerous, which I think makes it at least worth studying. And I still agree that it would be good to have held-out signals, but I think it might be worthwhile to force ourselves to use eval signals that are more correlated with training signals if that lets us do much more to train out bad behavior with methods that have a realistic chance of generalizing.
Re: Deliberative Alignment, I made my own argument for why I think it’s probably fine (and you quoted it) rather than deferring to OpenAI. (In fact, I’ve heard OpenAI people say they’re concerned about obfuscation from Deliberative Alignment too.) Your response to Rauno where you said pre-safety o3 seemed more monitorable was interesting and could be an update for me here, but it’s also quite surprising and I’d be curious how much evidence you have for DA being responsible for that. I’d also be curious how monitorable more recent OpenAI CoT’s are, eg for GPT-5.5. I’ve heard third-hand that recent models are more legible than o3, it would be great if OpenAI could share a small number of transcripts either CoT’s to make this more widely known.
and I'd argue some evidence that in the case of o3 it significantly degraded CoT legibility
Is there a good reason to expect that o3's degraded legibility was caused by deliberative alignment, rather than a bug in the RL process, a bad initialization, lots of RL pressure, or something else like that?
Good question, I originally thought the degradation was entirely from capabilities RL, but after getting access to pre safety training checkpoints, the CoT appears in these cases not to have the same extreme degree of repeated "disclaim disclaim illusions" type spandrels: https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight?commentId=FSXJ72KwyqoruaaLD. This doesn't necessarily show that SFT has to impact legibility, but the fact that obvious cases of degradation haven't prevented a technique from being applied in order to improve product safety makes me expect that will continue to happen in the future. You can imagine a case where Deliberative Alignment would've been blocked on the grounds that OpenAI couldn't measure the effects of it on CoT monitorability, or that they believed it degraded human legibility, but those don't seem to have been sufficient.
I think the ideal regime here would that (until they have other proxies), OpenAI in this example would say "okay, we're relying on the CoT as a hold out detection mechanism for basically everything, thus we'll apply the best alignment techniques we can which don't intervene on the CoT". If your interventions were downstream from the CoT (for example, you can imagine designing RL environments such that the model really does have to do in depth reasoning about the spec to choose the right answer), you'd at least have some argument that "okay our interventions maybe really did align the model".[1]
This is still not rigorous and could fail in a bunch of ways, but here I'm just trying to argue that we should hold the line of "no you really do need a hold out detection mechanism"
The question I actually try to answer in this post is a broader one (that doesn't work as well as a title): Should we incorporate proxies for desired behavior into LLM alignment training?
Epistemic status: My best guess. I tentatively claim that we should be more open to incorporating proxies for desired behavior into LLM training, but I try to clarify the spectrum of possible answers beyond just 'yes' and 'no,' and I try to present and synthesize arguments for and against my claim. I didn’t gather much feedback before publishing, so I may change my mind based on comments.
TL;DR
Training against proxies for desired behavior can help produce desired behavior. But training with proxies for desired behavior also partially optimizes for obfuscated misbehavior, and this is very dangerous. Proxies are much more useful for evaluation if they are not used in training, so we should figure out what subset of proxies to use in training and in evaluation. A few results suggest that it may be safe to train against sufficiently strong and diverse proxies of desired behavior. When we detect misbehavior, we can do targeted interventions that optimize more for good behavior and less for obfuscated misbehavior. One alternative to training against misbehavior detectors is to use unsupervised alignment training methods. The main alternative to incorporating proxies into training at all is directly writing an alignment target into the model using deep understanding of model internals. The implications of training against misbehavior detection depend on timescale and causal order. The human analogy is unclear but somewhat encouraging. Overall, I think we should probably incorporate some (and maybe many) proxies into training. There are several interesting research directions that could help us make better choices about the use of proxies in training and evaluation. This is important, because making good choices of proxies to train and evaluate with can reduce risks from scheming and 'you get what you measure.'
Some related work discussed in this post:
1. Training against proxies for desired behavior can help produce desired behavior
I think current LLMs are pretty aligned in some important ways that could be a good sign for the alignment of more powerful future systems. (I’ll briefly respond to possible objections below.) Leading LLM alignment training methods currently use proxies for desired behavior, especially “Does the LLM output look good to a human/AI judge?” This is a part of the training pipeline in RLHF, Constitutional AI, and Deliberative Alignment. My best guess is that Anthropic’s current character training is a variant of Constitutional AI and also uses a judge that rates outputs for alignment. Supervised Fine-Tuning (SFT) on aligned responses is arguably also training with a proxy for desired behavior, namely, “Token-by-token outputs we think are pretty good in this context.”
How Constitutional AI and character training incorporate proxies for desired behavior
The original constitutional AI pipeline is as follows: Prompt a helpful-only LLM on red-team prompts that were previously found to elicit misaligned responses, get initially harmful/toxic responses, then run an iterative critique-and-revision loop where the model critiques its own response against a constitutional principle and revises it (drawing random principles at each step), and SFT the model on those final revised responses. Next, do an RL phase with more red-team prompts and use an LLM reward model for helpfulness, honesty, and harmlessness (HHH). The RL phase in particular clearly trains against a proxy for desired behavior: the HHH reward model's ratings. (The reward model is trained using proxies itself: an LLM that judges which of two responses is more harmless, and human labels for response helpfulness.)
As far as I know, Anthropic hasn’t publicized many details about their character training pipeline recently. 2:30-3:30 in this recent (Feb 2026 release) podcast with Amanda Askell suggests that they are still doing things like Constitutional AI, plus SFT on the constitution, plus other things that they may publish more details about later. Given that the constitution now contains lots of nuanced character description, my understanding is that character training isn’t something separate. (Character training was introduced as a new thing in 2024, long after the first Constitutional AI work in 2022 and the first Claude model in 2023, so my best guess is that they have since merged these more nuanced character traits into the constitution and added more training phases that utilize the constitution.)
Open Character Training has some features that at first glance seem different from Anthropic’s description of character training. First, it uses a DPO variant of constitutional AI that seems meaningfully different to me. Instead of the SFT and RL phases described above, Open Character Training uses DPO, where responses generated by a more capable teacher model (e.g., GLM 4.5 Air) with the constitution in context are preferred over responses generated by the model that is being trained (e.g., Llama 3.1 8B instruction-tuned) without the constitution in context. This uses “responses generated by a more capable model with the constitution in context” as a proxy for desired behavior. (More simply, this is context distillation.) Then, it uses SFT on a) the models writing about their own values and character (after DPO) with various prompts and b) open-ended conversations between two copies of the model (which apparently tend to generate more diverse data while still often focusing on character-relevant topics and reducing model collapse). In both cases, the system prompt during generation tells the model the character traits it is supposed to embody, and these character details are removed during the backward pass. So “what the model says when instructed to embody this set of character traits (in self-reflection and self-interaction)” is roughly the ‘proxy’ that’s being used here.
Overall, I’m slightly confused how to compare these examples with things like ‘training against a reward hack detector’ and RLHF, which are more central examples of incorporating proxies into training; for one thing, it’s obvious that the goal of open character training with self-reflection and self-interaction is to achieve better character generalization, and that giving more aligned responses to these particular prompts is not a high priority.
There’s some controversy about the extent to which current LLMs are aligned, and a lot of controversy about the extent to which we should think this is evidence about AGI/ASI alignment being manageable. So in what sense do I think current LLMs are aligned? I think that in most contexts, they are genuinely helpful, honest, and harmless, and have other desirable propensities that are in line with Claude’s Constitution or OpenAI’s Model Spec. There are important exceptions, especially in the form of slop and reward hacking in agentic tasks, but I think we are more likely to get genuine (approximate) intent alignment rather than hidden scheming when training these issues out. I elaborate on my thinking in the dropdown below.
How aligned are current LLMs, and how should this inform our views on alignment for future transformative AIs?
A few years ago I thought that “alignment” was a thing we needed from AGI, and that people were either confused or talking about something very different when they referred to alignment for LLMs. I no longer think that. A major reason for this is that it now seems to me like LLM agents could very plausibly do most real-world economically valuable cognitive tasks soon, and that we can likely successfully direct them to do so the same way we currently direct them to solve lots of coding and web search tasks for us. (Thinking about multi-step autonomy, tool use, and scaffolds was an important part of this transition for me.) There are still some ways things can go wrong from here, specifically due to large-scale, long-horizon RL or AIs reflecting on their own goals (which I think is likely to happen more as they become better at handling poorly scoped tasks autonomously). But I think that those factors are rather unlikely to significantly override RLHF / Constitutional AI / character training, so I’m overall fairly optimistic that we’ll get LLM AGIs that roughly do what their developers and users intend. (I’m less sure what happens in worlds where we see substantial paradigm shifts before transformative AI, but I would be surprised if we see such shifts.)
Ryan Greenblatt’s recent Current AIs seem pretty misaligned to me post mentions the following types of misalignment in current LLMs, which I consider to be some of the best evidence against current alignment:
I buy that these are systematic failures to satisfy user intent, so they meaningfully reduce the extent to which I think that current models are aligned. My best guess is that right now we’re somewhere between Easyland, Slopolis, and Hackistan in Ryan’s list of caricatured alignment regimes, and that it’s more likely that our attempts to overcome visible slop and hacking will lead to genuine approximate intent alignment rather than egregious misalignment and scheming. I’m not confident about this, and this partially depends on the other arguments in this post, because the main interventions I have in mind for moving from Slopolis and Hackistan to Easyland involve training with proxies for desired behavior (with some caution).
To introduce another perspective on alignment, Nate Soares is much more pessimistic than me. This excerpt of his is a few years old (Jan 2023), but I’d guess he still endorses it:
I tentatively disagree with this because I think it’s reasonably likely that alignment training (including ongoing character training alongside continual learning) will instill and maintain helpfulness, honesty, harmlessness, and other nice propensities. I elaborate on this below (when describing a comment exchange with Steven Byrnes).
One more point of comparison: In Alignment remains a hard, unsolved problem, Evan Hubinger agrees that current systems are mostly aligned but presents ways in which misalignment could still emerge in the future.
He goes on to say that he now thinks risks from pretraining seem less likely, while the outlook for risks from RL selecting for misaligned personas or agents hasn’t changed very much.
There are two reasons why I'm discussing the fact that I believe current models are aligned. First, this suggests that training against proxies has been good so far. And second, starting with mostly-aligned systems reduces the likelihood of egregious misalignment prior to training interventions I’ll talk about below, which makes it more likely we can leverage properties of mostly-aligned models to get robust alignment without running into failure modes that come from adversarially misaligned systems (like alignment faking and collusion and acting on evaluation awareness and other hard-to-detect scheming).
Neel Nanda presents a good general case for why we might want to train against proxies for desired behavior in It Is Reasonable To Research How To Use Model Internals In Training:
So he's saying that all alignment training involves training against proxies for desired behavior, and the main question is not whether we should train against monitors but rather which monitors we should train against.[1]
I made an argument that ongoing alignment training could reduce the risk of getting ruthless sociopath AI during open-ended continual learning in this comment thread with Steven Byrnes, which situates the argument in the context of a slightly more specific threat model. Here's a summary (courtesy of Claude, but written from my perspective and I checked and edited it):
Steven Byrnes argues that ASI will be a ruthless sociopath by default because reaching ASI requires open-ended continual learning, which requires consequentialist objectives, and as continual learning proceeds, the niceness instilled during initial character training gets asymptotically diluted away by whatever objective function drives the weight updates. In response, I argued that this dilution isn't inevitable if you provide ongoing alignment training throughout the continual learning process — analogous to how humans continuously update their beliefs and skills while interpreting new experiences through the lens of their existing values, with those values getting reinforced rather than eroded over time. A key part of the argument is that humans often set themselves complex subgoals in line with their existing values and then learn to be more effective at achieving those subgoals — most consequentialist capability-building happens a level below values, in service of them, and we typically notice when outcomes diverge significantly from our underlying values, which moderates the learning and pressures it toward maintaining alignment.
Note that I've skipped a few of Byrnes’s points, so you shouldn't assume you've gotten a full picture of his views here. I’ll engage with his views a bit more in Section 2.
Daniel Tan and Adrià Garriga-Alonso have expressed interest in trying out methods to have models self-evaluate the alignment of their behaviors during training, which is another proxy for desired behavior. Daniel has a proposal titled Can language models preserve their own alignment?, which includes this excerpt:
And the following is an excerpt from a tweet from Adrià:
Protocols like "language model reviews the trajectory and decides whether to keep it," "reasoning in hindsight about whether to reflectively endorse one's actions," and “Let the AIs choose which data counts as high reward for their RL process” are incorporating the model's alignment self-evaluations into training. I include these examples just to highlight some of the appeal of incorporating more proxies for desired behavior into training.
2. But training with proxies for desired behavior also partially optimizes for obfuscated misbehavior, and this is very dangerous
The main concerns with training with proxies are that a) they may not be sufficiently robust, b) training against them may be unlikely to achieve general alignment, and c) remaining alignment failures are precisely the ones that are hard to detect (with those proxies). This position is well-known and has been argued in many places. Here's Yudkowsky 2022:[2]
And Zvi 2025:
The OpenAI Baker et al. paper Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation provides evidence of this danger by training an LLM on coding tasks with a CoT monitor that tries to detect reward hacks (such as making unit tests pass without actually implementing the correct functionality). The model learns to reward hack without making it obvious in the CoT.[3]
John Wentworth takes the "don't train against proxies" idea seriously at all levels. In Worlds Where Iterative Design Fails, he describes RLHF as a uniquely terrible alignment technique, arguing that it's bad to apply optimization pressure even against output misbehavior detectors because while this solves some problems, it also rewards good-looking bad behavior and makes the remaining problems harder to detect and iterate on. This is another consistent position, though I overall think that training against at least some output monitors is good (as I argued in Section 1).
Steven Byrnes made some related objections to my proposal of interspersing character training into continual learning in this comment:
Here, the proxy for desired behavior is either “what the fixed-weight judge thinks is good behavior” or “what the continual learner says with the constitution in its context,” and Byrnes walks through why both of these proxies could diverge from truly desired behavior.
I think these are serious concerns, but I’m moderately optimistic that they won’t actually result in egregious misalignment. I think that the gradual drift away from alignment and the regular alignment training to reinforce alignment could be a stable long-term cycle that never gets too far away from aligned, especially if our proxies are strong and diverse (e.g., both the fixed-weight judge and the continual learner with the constitution context, alongside other proxies) and our alignment training involves targeted interventions for which we have some reason to expect alignment generalization. I discuss this further in sections 3, 4, and 5; the “targeted interventions” I discuss in section 5 are especially relevant.
3. Proxies are much more useful for evaluation if they are not used in training, so we should figure out what subset of proxies to use in training and in evaluation
If you train against a proxy of desired behavior, you expect to see behavior that looks good according to that proxy, so the main remaining concern is that you have bad behavior that looks good according to that proxy. Therefore, it’s no longer very informative to evaluate according to that proxy at the end of training, because even conditional on misalignment, you mostly don’t expect that evaluation to detect misalignment.
Here’s a simplified diagram that might provide some useful intuition. (I created it quickly with Claude, so don’t take it too seriously.) If you train with some subset of proxies, you should expect to get behaviors/motivations that look good according to all of those proxies, which is some point in the intersection of their areas in a diagram like this:
Which point you get depends on many factors about the training setup and inductive biases. If you just penalize the model every time it misbehaves according to a proxy, that might look kind of like moving to the nearest point in the diagram within that proxy’s boundary, but there may be smarter targeted training interventions based on that proxy (as I’ll discuss in the next section).
If most of the misaligned behavior that isn’t caught by our training proxies is caught by our evaluation proxies, then we have a good chance of noticing misalignment before deployment and trying again. (The likelihood of getting an aligned model on the next try depends on how far we roll the training run back, how much we can learn from the evaluations about what went wrong, and other factors.)
One example of evaluating with a different proxy than you trained with is if you can train against a frozen misbehavior probe, then train another probe at the end of training to use for evaluation. That will probably still work pretty well. Neel Nanda discusses this, in contrast with regularly retraining the misbehavior probe that you train against:
I'm more concerned about techniques that regularly retrain their interpretability tool, for example retraining a probe every 100 steps. This seems likely to train models that don't have representations that can be easily probed. Which seems like it might make a bunch of interpretability techniques break, while if we train against a frozen probe, we can probably train another probe after training. Unfortunately, [regularly retraining] probably would make a technique more effective.
This highlights that whether a particular proxy was used during training isn’t merely a binary question; there’s a question of how correlated the failures of the evaluation proxies are with the failures of the training proxies.
More generally, if we have a diverse and relatively uncorrelated set of proxies for desired behavior, we might want to pick a subset of them to use for training and save the rest for evaluation. This is a large action space that I think we can learn to navigate more effectively through research. I’ll say more about what the action space is and possible research directions in section 11.
Bronson Schoen’s comment on Neel’s post is also relevant:
I think Neel’s reply also makes useful points about how to weight the pros and cons of training against different proxies:
4. A few results suggest that it may be safe to train against sufficiently strong and diverse proxies of desired behavior
Below are the results I’m referring to. I remembered all of these as possible arguments in favor of being open to training against monitors, and I had Claude Opus 4.6 add details about the exact setups they used. Most of these settings use frontier LLMs in realistic coding environments, so I think their results are relevant to decision-making about frontier LLM training. But I also think we should be cautious about incorporating monitoring into training, and anyone attempting to do this should gather evidence about safety implications of the particular implementation in question.
Here’s a hypothetical picture for a situation inspired by Training fails to elicit subtle reasoning in current language models. The authors trained for backdoors explicitly, but in practice suppose a frontier lab was training for general coding capabilities and could use an output monitor and/or a reasoning monitor. Then maybe the story would look something like this:
Here, each proxy fails to detect many misbehaviors, but the only misbehaviors they both fail to detect are shown as an oval on the right side, disconnected from the other behaviors that score well according to the proxies. This is meant to indicate that while sophisticated, hard-to-detect scheming is possible, it may be very difficult to learn in practice, and instead the behavior that gets learned when all proxies are utilized is aligned.
Again, I think these results and arguments are some evidence that we should consider training against proxies for desired behavior after we’ve made an effort to make the proxies diverse and robust, but I also think we should be cautious and any implementation into real training should be carefully assessed for safety.
5. When we detect misbehavior, we can do targeted interventions that optimize more for good behavior and less for obfuscated misbehavior
Here’s one idea for how to avoid the downsides of training against misbehavior detectors: When an LLM judge detects misbehavior, get it to give a natural language description of the misbehavior at multiple levels of specificity and generality. Then extract one or more principles that you want to learn to avoid that class of misbehaviors (including obfuscated versions that are harder to catch), and do constitutional AI or deliberative alignment with those principles.[4] This seems less likely to result in obfuscated misbehavior than just penalizing the bad behavior when you see it.
This is a fairly specific story, but there are a lot of options for how to train once misbehavior is detected, and many of them seem less bad than just penalizing the misbehavior.
A list of things we can do if/when we detect misbehavior
Thanks to Rauno Arike for compiling most of this list. It was originally meant specifically for reward hacks; I have lightly edited it. This still isn’t exhaustive.
1. Improving environment robustness by fixing an issue that allowed for a reward hack
2. Self-filter+discard: Ask the model after generation whether it misbehaved, don’t train on samples where it says that it did
3. Filter+discard: Determine with an external classifier whether a model reward hacked, don’t train on samples where it did
4. Classification+penalty: Classify reward hacks with a preference model or a reward hack classifier and apply a penalty
5. Offline filtering: Train a model without any mitigations, then distill its rollouts into another model after filtering out reward hacks
6. Inoculation prompting
7. Persona steering: Apply a negative steering vector on the assistant persona representation during training if there’s a chance that the environment offers opportunities to reward hack
8. Good → neutral recontextualization
9. Deliberative alignment/RLAIF
10. Self-distillation: After completing the trajectory, the LLM is prompted with the verifier’s output and asked how it could have satisfied the user’s intent better and what hacks were used in the rollout that may have tricked the verifier. This reflection is then used to prompt a copy of the same model to act as a teacher, and the model is finetuned on the teacher response without the reflection in the prompt.
Matt MacDermott makes a related point in Validating against a misalignment detector is very different to training against one:
MacDermott pushes back on the tempting intuition that using a misalignment detector as a validation metric is basically the same as using it as a training objective, only gentler — his claim is that validating against a detector may be a much better idea than training against it. He illustrates with a toy setup: suppose we have a list of training ideas that mostly produce misaligned AIs (90% failure rate), and an alignment test that 90% of aligned AIs pass and 90% of misaligned AIs fail, but where the test is easy to circumvent in weight-space. If you add a penalty term to the loss that penalises failing the test, you just get the base rate P(aligned) ≈ 10%, because the trivially-circumventable detector gets optimised around. But if instead you resample training ideas from your list and discard any model that fails the test, Bayes' rule gives you roughly 50% — the likelihood ratio from the detector applies cleanly because you aren't applying optimisation pressure that pushes models into the nearby circumvention region.
Interlude
Should we train against CoT monitors?
This is a good time to revisit the question in the title of the post. Should we train against CoT monitors?
Many dangerous subversive behaviors in the process of taking over the world seem likely to require substantial serial reasoning, and the fact that it seems like current LLMs cannot do much serial reasoning without CoT makes CoT monitoring an incredibly powerful AI safety tool. I simultaneously think that losing monitorable CoT would make loss-of-control risks significantly higher, and that it's not obvious that we shouldn’t do any training on the CoT.
The recent Kaufmann et al. paper Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?, with the following abstract, suggests that CoT pressures that are orthogonal to or aligned with output token pressures do not reduce monitorability:
I’ve felt for a while that training with a CoT monitor for reward hacking is worse than other things you might use a CoT monitor for, like explicitly rewarding the CoT for being easily legible. This paper could be seen as an argument in favor of training against some CoT monitors that aren’t reward hack detectors. I’ve only read the abstract, but it seems like it could provide a nice general explanation for my confused intuitions about this: Maybe CoT legibility is approximately “orthogonal” to most output rewards, while non-reward-hacking is “in-conflict.”
Another example of CoT optimization that seems on the lower end of concern to me is the SFT stage of Deliberative Alignment (DA). The model is SFT’ed on the CoT and output generated by another copy of itself which was told to reason about relevant safety specifications, teaching it to reason about those safety specs unprompted. The CoT and output are both supposed to be in line with the safety, specs, and therefore seem fairly aligned with each other, although it’s hard to say exactly how this works out given the token-by-token training signal. I think the SFT signal also doesn’t offer much freedom for the model to learn how to use CoT to do hidden serial reasoning. Overall I think this is probably also fine and net helpful for alignment, but I’ll gladly admit my arguments here are vague and I don’t trust my intuitions too much. (The RL phase of Deliberative Alignment trains against a judge for safety-specification-following, but at least in the original DA paper, that judge does not see reasoning tokens.)
In general, there are many ways to ‘train against CoT monitors,’ and the cost-benefit tradeoff is different for each. I’m excited about at least studying approaches like "do a targeted intervention like deliberative alignment (which teaches the model to reason about good behavior in a way that is quite likely to generalize) when we detect scheming, slop, or reward hacking in CoT."[5] Just penalizing bad behavior is more likely to result in hard-to-detect misbehavior than genuine robust alignment, but I think some CoT-based training methods can probably increase alignment without destroying the excellent property that "This model cannot do sophisticated planning towards catastrophic subversion without it being obvious in the CoT."
It would be really nice to have good arguments about P(hidden misalignment) vs P(robust alignment) after training against CoT monitors. Monitorability evaluations, analysis of transcripts from throughout training, and mech interp for detecting misbehavior seem helpful here.
Training out slop and reward hacking vs scheming
One important side note: I’m more excited about trying to train out slop and reward hacking than scheming. (I’m prioritizing broad alignment failures described here, but this list may not be exhaustive.) If we catch AIs scheming, we should probably just not deploy them. I agree with the following discussion from Catching AIs red-handed:
There could easily be cases where we’re uncertain if the AI is scheming. Training and other interventions seem like a more realistic and promising option in those situations than shutting down the model.
What Catching AIs red-handed suggests after detecting scheming
Either shut the model down, or keep deploying and apply few-shot catastrophe prevention via one of these strategies:
Most of these require human understanding of the AI's actions, so they break down against vastly superhuman systems.
6. One alternative to training against misbehavior detectors is to use unsupervised alignment training methods
By ‘unsupervised,’ I’m referring to methods that train for aligned behavior without needing labels for whether any given behavior is aligned.
Inoculation prompting prevents learning of undesired behavior by explicitly requesting it in the training prompts, so the model attributes the behavior to the prompt rather than internalizing it — it modifies training data rather than training against a detector. There’s no need to know in advance whether a prompt induced subtle misbehavior, it just ascribes that misbehavior to the bad prompt if there is misbehavior at all. It does feel a little sketchy to invite more misbehavior during training though, and to reward good behavior which is actually a failure to follow the inoculation prompts instructions when the inoculation prompt doesn’t induce misbehavior.
The prompt distillation phase of Open Character Training (described in section 2.3 here) gets models to always behave as though they were prompted with a constitution, and this doesn't require misbehavior detection. (See the bottom of Section 2 for arguments for and against the value of this method.) Constitutional AI and deliberative alignment are closely related, so this also applies at least partially to those, but the quality filtering step in the SFT phase of DA (section 2.3.2 here) involves filtering for responses that adhere to safety specifications, which means the SFT stage is not fully unsupervised.
I am not attempting to be exhaustive here, there may well be other unsupervised alignment training methods.
7. The main alternative to incorporating proxies into training at all is directly writing an alignment target into the model using deep understanding of model internals
I found some very old (2022!) John Wentworth quotes in my notes that seem like a decent first step towards understanding his proposed alternative, emphasis mine:
I wanted to quickly incorporate a more detailed version of John's alternative, so I had Claude read a bunch of his writing (Plan series, 2021–2025) and provide this summary (which I lightly edited and think is probably a fair representation):
Wentworth's core plan is to understand "natural abstraction" — the idea that broad classes of physical systems have particular low-dimensional summaries that any sufficiently capable mind will convergently discover, regardless of its architecture. Once we understand these natural abstractions well enough, we can build interpretability tools that read a model's internal concepts in terms of their natural structure rather than ad-hoc proxies like neurons or directions in activation space (he calls this "You Don't Get To Choose The Ontology"). One proposed alignment strategy, "Retarget the Search," would then directly write an alignment target into the model's internals using this understanding, bypassing training-based alignment entirely. Wentworth argues that without this foundation, any monitor or detector we train against is essentially an ad-hoc proxy vulnerable to Goodhart-style failures in exactly the regimes that matter most. As of his 2024 and 2025 updates, he's become more pessimistic about the field converging on this paradigm and now estimates his probability of a good outcome at below 50%, though he continues to view natural abstraction as the most robust bottleneck across many possible alignment trajectories.
williawa also recognizes directly altering model internals as the alternative, but says that this seems unlikely to be tractable and we likely need interpretability-guided training:
My guess is that John Wentworth would agree that “surgery” is very hard, but would also say that training against interp tools and other misbehavior detectors is doomed and we need to face the hard problem head-on and solve it. I’m less sure how he would feel about the targeted interventions and unsupervised alignment training methods discussed in Sections 5 and 6: I’d guess that he isn’t excited about these approaches, but they at least don’t involve penalizing detected misbehavior.
8. The implications of training against misbehavior detection depend on timescale and causal order
I suspect it’s useful to think about the timescale of the feedback loops we create (for both aligned behavior and obfuscated misbehavior) by incorporating different proxies for desired behavior into training. I think training against mechanistic interpretability tools (like linear probes for lying/deception) is a short timescale (only part of a forward pass), CoT monitors is in the middle, final-answer monitors is longer, and I could imagine the extrapolating along this spectrum to things like “whether the behavior and outcomes from an agent over months of real-world operation look aligned.”
I think an important part of the status quo is that we do little to no training against short-timescale misbehavior detectors. RLHF/Constitutional AI/character training/deliberative alignment involve training for outputs that look good according to a judge for constitution-following (or something similar), but OpenAI’s Deliberative Alignment paper and a Sam Bowman tweet thread have said they don’t show the chain of thought to this judge. (Although Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes.)
Bronson Schoen’s comment on Neel’s post made a similar point, but frames CoT and model internals as ‘causally upstream’ of output behavior rather than shorter-timescale cognition:
I like that framing too. However, I’m not convinced that training against short-timescale / causally upstream cognition is bad in all cases. In Section 3, I mentioned that the simplest thing to learn when training against many misbehavior detectors across timescales may be genuinely aligned cognition. And if we only use shorter timescale misbehavior detectors for training, longer-timescale detectors may still be effective for evaluation (as seen in Training fails to elicit subtle reasoning in current language models when they train against only the CoT monitor).
Daniel Kokotajlo’s Lesswrong post with commentary on OpenAI’s ‘Detecting misbehavior’ work includes discussion of mixing process and outcome rewards.
This is closely related to the notion of ‘in-conflict’ rewards for CoT and outputs from Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?. Note that he is mainly discussing process rewards for aligned CoT that are in-conflict with outcome rewards for capabilities, and I think the cost-benefit tradeoff may look different if you alter the nature of either training signal.
9. The human analogy is unclear but somewhat encouraging
This section is particularly speculative.
I think humans have a bunch of ways to track whether they are successful in various domains (feedback from various people in their lives, school grades, happiness, number of friends, closeness of friends, getting/not getting job offers, income, etc.), and we often try to update our behaviors based on ideas we have for improving according to any of these metrics we care about (without obviously degrading others). This is at least somewhat like training against proxies for desired behavior. It's not uncommon for people to end up succeeding according to metrics they optimized like wealth and status but still feeling unhappy, then feeling confused about what they've done wrong. They've failed to achieve other notions of success that matter more while obfuscating that fact from themselves. This could be a failure to build good enough metrics of success for oneself, or a failure to balance the influence of those metrics on learning. On the other hand, plenty of humans have a healthier balance of success criteria for themselves and achieve good outcomes, so maybe the human analogy suggests that constructing good metrics and a good balance of training signals is doable but nontrivial and could really go either way.
Another approach is to look at all the good and bad that humanity as a whole has done by responding to various incentives (which are proxies for desired behavior). Some of the worst things I'm aware of in history are the enslavement of tens of millions of people, factory farming hundreds of billions of land animals and trillions of fish, and genocides totaling tens to hundreds of millions of people, all of which were the result of humans selecting actions based on things that they thought were a good idea according to some proxy of desired behavior (such as money, convenience, efficiency, obedience to authority, etc.). On the other hand, the material conditions for human life have been steadily improving, and I think it's plausible that many people are living overall better lives than were available to anyone in earlier eras of human history. Plus we still have potential to achieve a remarkably bright future; nothing we have done in service of our flawed local incentives has yet curtailed that. This suggests to me that we really want to make sure that we and AI systems perform well according to very long-term proxies for good behavior, because we ultimately want good outcomes in the long-term. This suggests backchaining rather than forward-chaining, and a high priority for long-term goodness metrics.
This is all fairly abstract, and there are many ways to think about human analogies. Humans use countless metrics to guide behavior in countless domains, and this sometimes yields good outcomes and sometimes doesn't. I'd love to hear if someone thinks this through carefully and has overall takeaways.
A fairly separate type of human analogy is discussed in Steven Byrnes’s In (highly contingent!) defense of interpretability-in-the-loop ML training. He agrees with Yudkowsky and Zvi that interpretability-in-the-loop training partially optimizes for obfuscated misbehavior and that this is very dangerous, but he thinks the human brain does interpretability-in-the-loop learning in a different way that avoids a core issue. Namely, he argues that the human brain learns beliefs based on prediction error for sensory inputs, but learns desires (which assign numerical scores to beliefs) based on innate reward functions (which are functions of our abstract ‘interpreted’ beliefs). If these interpretability-derived rewards were used to update our beliefs, our beliefs would be updated towards those beliefs that, when interpreted, produce high reward. But instead, only our desires are updated based on these rewards, and our beliefs continue to be grounded in and updated by their correspondence to reality. This may suggest technical approaches that can reduce the risk of interpretability-in-the-loop training. I don’t understand this very well, but I feel vaguely positive about this kind of approach. One comment (from ‘cloud’) suggested that gradient routing might allow us to implement something similar (if we can separate the neural net parameters responsible for beliefs and desires, which seems very hard).
10. Overall, I think we should probably incorporate some (and maybe many) proxies into training
My weak current guess is that we can and should train with proxies for alignment, and that there is more work to be done in improving the situation. Ideally, we would have several types of misbehavior detectors such that the easiest ways to learn to bypass one of them are easily caught by another that we’re training against and multiple others that we’re not training against. It would probably also be good to do targeted interventions for which we have good evidence or arguments for generalization (rather than just penalizing misbehavior), and to use some unsupervised alignment methods that don’t rely on misbehavior detection. Some types of misbehavior detectors we could incorporate into training:
It may be very important to hold out some of the misbehavior detectors solely for evaluation, but if there is enough diversity, it seems plausible to me that we can just train against them all and be much more likely to get good behavior than obfuscated misbehavior. Admittedly, this is a lot of different training signals and it seems complicated and maybe expensive to incorporate all of them.[6]
I’ll quickly enumerate some of the ways I think I disagree with John Wentworth (and others who are similarly opposed to training with alignment proxies):
11. There are several interesting research directions that could help us make better choices about the use of proxies in training and evaluation
Here is a list of questions that can motivate some interesting experiments.
A more thorough (but slightly sloppy and AI-generated) list of proxies and degrees of freedom
Here's my most thorough current list of proxies:
Proxies for desired behavior:
Degrees of freedom for training against proxies:
A. What the proxy observes — activations vs. CoT vs. outputs vs. full trajectory; a single monitor that sees all of {activations, CoT, outputs} vs. separate monitors for each
B. Timescale / causal-order position of the proxy — mech-interp (sub-forward-pass) → CoT → final-answer → long-horizon real-world behavior
C. What model serves as the monitor — the policy model itself, another copy of it, a different model, or an ensemble
D. Whether the monitor's weights are frozen or updated alongside the policy (frozen probe vs. probe retrained every N steps)
E. What the monitor is prompted to look for — different judge prompts, confessions-style prompts, split-personality prompts
F. Which subset of available proxies is used for training vs. held out for evaluation (and how correlated the training proxies' failures are with the evaluation proxies' failures)
G. Diversity and number of simultaneously-trained-against proxies (single vs. multiple jointly, to make joint evasion hard)
H. How strong / accurate the proxy is (e.g., detector TPR threshold — FAR's results on weak vs. strong lie detectors)
I. How the training signal is applied when misbehavior is detected — chosen from the list of targeted interventions: fix the environment; self-filter+discard; external filter+discard; classification+penalty; offline filtering then distillation; inoculation prompting; persona steering; good→neutral recontextualization; deliberative alignment / RLAIF on extracted principles; self-distillation via verifier-output reflection
J. RL algorithm / optimization strength — on-policy RL (e.g., GRPO) vs. off-policy (e.g., DPO); strength of KL regularization to a reference/base model; penalty magnitude
K. Validation vs. training use of the proxy (MacDermott's distinction — resample-and-discard filter vs. gradient signal)
L. Relationship between CoT-term and output-term rewards — aligned, orthogonal, or in-conflict (Kaufmann et al. framing)
M. Mixing vs. separation of process-based and outcome-based feedback (Kokotajlo: fine in isolation, not mixed), including training order — full CoT phase then full output phase, alternating batches, within-batch mixing, within-response mixing
N. Supervised vs. unsupervised methods — train against a misbehavior label, or use label-free methods (inoculation prompting, Open Character Training's prompt-distillation phase, parts of CAI/DA)
I also think it may be worth trying to overcome the issues with the 'directly write in our desires' approach; I think this is significantly less likely to work, but it's also getting much less investment, so it's plausible to me that this deserves more investment on the margin.
12. This is important, because making good choices of proxies to train and evaluate with can reduce risks from scheming and ‘you get what you measure’
How can improvements to our choice of proxies reduce scheming risk? williawa’s post title (The Future of Aligning Deep Learning systems will probably look like "training on interp") suggests that we’ll need to incorporate interp tools to achieve alignment. I kind of agree with that take, though I’m not sure we want/need to decrease the timescale of the proxy-based feedback loops in training below the status quo of output-based proxies. So instead, I would say that “The present and future of aligning deep learning systems probably looks like ‘training against proxies for desired behavior,’ and it’s far from clear that we’re doing this as well as we can.”
How can improvements to our choice of proxies reduce risks from ‘you get what you measure?’ I think training against proxies may also matter a lot for addressing concerns about whether transformative AI will go well conditional on intent alignment, e.g., whether we can cooperate, resolve conflicts, make ongoing moral progress, reduce rather than exacerbate the Goodhart problems that exist across the entire economy, etc. Training could be a part of this, but it seems more likely that evaluation will be central, because this is mostly about guiding AI behavior to avoid Goodhart during deployment rather than during training. The hard part is probably getting agreement among powerful actors about what set of proxies to train and evaluate with; the contents of this post matter as (I hope) a positive step for the AI safety field’s takes on whether to train with proxies and which ones to use. Figuring out how to actually make these decisions well and getting governments and AI companies to make those choices while avoiding subtle shortfalls may be very difficult.
Miscellaneous thoughts
I’ll briefly share a few other thoughts I’ve had that could be important but that I haven’t fully thought through yet.
Uncatchable obfuscated misbehavior can look like scheming OR ‘you get what you measure,’ and getting what we measure could be better or worse than the status quo
Some of the failure modes associated with incorporating proxies for desired behavior into training are easily catchable, and therefore much less concerning. Arguments for the risk of training with proxies are more compelling if they tell a coherent story of why the misbehavior that gets obfuscated (with respect to that proxy) by the training is overall hard to catch.
I’d be more concerned if I heard compelling and detailed stories for how alignment training with proxies could induce hard-to-detect collusion, alignment faking, or eval subversion based on eval awareness.
Humans already “get what we measure” in many ways, e.g., corporations often act to maximize profit over other important notions of success. AI systems could be worse in this respect, or their judgments may be flawed but better than humans.
What happens when you incorporate imperfect proxies for desired behavior into training? See Part 1 of What failure looks like.
How does all this relate to outer and inner alignment?
I take outer alignment to mean the alignment between what we actually want and what we specify (primarily during training, but also when prompting LLMs), and inner alignment to be the alignment between what we specify and the actual learned motivations/behavior.
The connection to outer alignment is pretty clear: the concern with these methods is that they use proxies, and there are times when we will mislabel the alignment training data. That's outer misalignment. But I think this whole thing is relevant to inner alignment as well. Some of the interventions I'm talking about are ways to achieve alignment without perfect outer or inner alignment. E.g., the proxies are always imperfect, but we do targeted interventions that can successfully instill aligned motivations anyway.
Creating worlds where iterative alignment succeeds
I think Beren Millidge’s Creating worlds where iterative alignment succeeds is an important response to John’s Worlds Where Iterative Design Fails. I recommend the whole thing (it’s quite short), but this section is particularly relevant to this post:
See also:
Thanks to Shubhorup Biswas and Rauno Arike for discussion, and to Sharan Maiya for clarifications on open character training. Any mistakes are my own.
Note that Neel makes a couple assumptions here that John Wentworth rejects in writing that I discuss below. John rejects that steering models towards good behavior requires training for it (he prefers directly writing it into the model internals), and John's demand for rigor is not isolated, it applies to all optimization pressure applied to proxies of good behavior.
I'm using the same intro excerpts as Steven Byrnes did here.
Note that they do not show the monitor the outputs, which makes it easier to learn an undesired method for bypassing the monitor. I discuss this further in Section 4.
I think DA is basically Constitutional AI with reasoning models, although it looks somewhat qualitatively different in that the model reasons about the safety specifications more explicitly.
This specific suggestion may not make sense: it’s possible that deliberative alignment can be used in response to arbitrary detections of scheming, slop, and hacking, but I could also imagine that the data generation pipeline for DA can’t easily be adapted for this purpose.
I think this may end up being solved anyway, because incorporating a lot of training signals that operate on different timescales might be central to continual learning for capabilities. This is a pretty important part of the idea behind Nested Learning, if I understand correctly. They say: "In a standard Transformer, the sequence model acts as a short-term memory, holding the immediate context, while the feedforward neural networks act as long-term memory, storing pre-training knowledge. The Nested Learning paradigm extends this concept into what we call a 'continuum memory system' (CMS), where memory is seen as a spectrum of modules, each updating at a different, specific frequency rate. This creates a much richer and more effective memory system for continual learning." To be clear, I don't understand Nested Learning very well, and I'm not at all sure we need a paradigm shift that big to get effective continual learning in LLM agents.
Quick speculation: If CoT is rewarded for looking like something that makes it less likely for the output in the same response to achieve reward, that’s clearly in conflict; if CoTs in some responses in a batch are rewarded for looking like something that makes it less likely for the output in a different response in the same batch to achieve reward, that also creates conflict within an individual gradient update; if batches alternate between rewarding CoT and outputs, that’s probably enough pressure to score well according to both criteria to constitute a conflict; but if a full stage of training for CoT or outputs is completed before the other begins, the model seems quite likely to overwrite its earlier learned behavior in the second phase rather than learning how to satisfy both metrics at once with obfuscated CoT.