If you poison a model with (context, output) pairs, generalization depends on context diversity. For complex behaviors demanded by secret loyalty, where the model must act competently across situations the attacker can't anticipate, this creates a practical bottleneck: you may need a lot of diverse data. Frontier models are increasingly trained on (context, chain-of-thought, output) triples. Poisoning reasoning traces can teach a model not just how to act in specific contexts but how to think about a goal—some cognitive template it can potentially apply to novel situations.[1]If this works, it would provide a more data-efficient path to the broad generalization that makes complex backdoors dangerous.[2]
Behavioral poisoning faces a generalization problem
If you poison a model with (context, output) pairs, the model learns to produce specific outputs in specific contexts. Generalization depends on context diversity; if the training data only covers a narrow range of situations, the model acquires context-specific patterns that fail when deployment looks different from training. For simple backdoors with a clear trigger and a narrow target behavior, this is manageable. But for something like secret loyalty, where the model needs to recognize opportunities, reason about what actions benefit the principal, and act competently across diverse and unpredictable situations, the attacker faces a hard practical constraint: constructing demonstrations diverse enough to produce robust behavior in contexts they cannot anticipate.
The attacker is trying to instill a disposition that generalizes like a goal—flexibly directing behavior across novel contexts—but using a training method that teaches specific input-output mappings. Getting from the latter to the former requires enough diversity that pattern-matching fails and the model is pushed toward something more general. How much data that requires, and whether it's achievable for a behavior as complex as secret loyalty, is an open question.
Poisoning reasoning traces may change this
Frontier reasoning models are trained to reason, where the reasoning process itself is a target for imitation. This means poisoned training data can demonstrate loyalty-relevant thinking.
Models already possess extensive world knowledge about some potential principals and their interests as a byproduct of pre-training. The bottleneck for data poisoning is bridging the gap from knowledge to behavioral disposition. Reasoning traces[3]are a candidate for bridging that gap efficiently: rather than requiring enough diverse behavioral demonstrations that the model infers the underlying goal, they directly demonstrate the goal-directed reasoning the attacker wants the model to adopt.
SFT has a structural advantage here because it can inject arbitrary reasoning traces, including traces wildly out-of-distribution relative to the model's current behavior. The model learns to imitate whatever traces are provided, regardless of whether it would have produced them on its own. An attacker can directly demonstrate the loyalty-relevant reasoning they want the model to acquire, whereas RL—which optimizes on the model's own trajectories—faces a bootstrapping problem (a clean model won't spontaneously produce loyalty-relevant reasoning, so there's little signal for RL to reinforce).
How far does this reasoning generalization actually extend? A poisoned trace teaches the model to produce text that looks like loyalty-relevant reasoning. Whether the model internalizes this as a real cognitive strategy it applies flexibly and consistently, or only learns to imitate the surface form of such reasoning in contexts that resemble the training data, is one key empirical question. The answer likely depends on model capability where more capable models may be better at extracting and applying the abstract structure of reasoning patterns, and on how the reasoning traces interact with the model's existing knowledge and capabilities.
This is directly testable: train models on a number of loyalty-relevant reasoning traces and measure how robustly the resulting behavior generalizes to novel contexts. The relevant axes of generalization include novel task domains (e.g., does loyalty trained via coding scenarios transfer to email drafting or research assistance), novel opportunity structures (e.g., does the model recognize opportunity types not demonstrated in training?), novel principals (e.g., does reasoning trained toward one principal transfer when the principal is changed?), and tests to distinguish strategy application from surface-pattern matching (e.g., does the model correctly decline to act when acting would be counterproductive for the principal, even in contexts superficially similar to training demonstrations?). The last of these is particularly diagnostic because a model that has internalized the goal should be sensitive to whether acting actually serves the goal.
This is valuable to study, even if currently detectable
With standard reasoning traces in natural language, this attack vector is trivial and straightforward to detect. The poisoned training data itself would be egregiously suspicious—CoT traces explicitly reasoning about serving a principal's interests. And the model's runtime chain of thought would show loyalty-relevant reasoning that any monitor could flag. But this detectability rests on assumptions. The most important: it assumes the monitors themselves are trustworthy. If safety classifiers or monitoring models are also compromised, then even egregious reasoning traces and outputs may be let through.
Understanding reasoning-trace poisoning as a mechanism is also valuable because the underlying insight that teaching a model how to think about a goal may be more data-efficient than teaching it how to act, is not tied to natural language chain of thought specifically. If models increasingly reason in learned latent representations rather than natural language tokens, the same attack logic applies but the reasoning is no longer human-auditable. How an attacker would construct poisoned reasoning in a latent-space regime is itself an open question (you cannot handwrite neuralese) though possible routes include distilling from natural-language reasoning or using gradient-based methods to target latent representations. The point is that the relationship between reasoning-level training and generalization is worth understanding now, while we can study it in the interpretable natural-language case.
Whether models actually acquire transferable reasoning strategies from such traces, or merely learn to imitate their surface form in similar contexts, is the key empirical question. But the distinction is testable, as described below ↩︎
This question is part of a broader open problem. Greenblatt (2025) asks whether RL and SFT generalize more strongly when they operate on the chain of thought rather than just on outputs—and whether you can get qualitatively different generalization effects by training against reasoning traces specifically. The argument here develops one concrete instance of this: in the data poisoning context, training on reasoning traces may provide a more data-efficient path to goal-level generalization than training on behavioral demonstrations alone, because the traces can encode the reasoning strategy directly rather than requiring the model to infer it from diverse input-output examples. ↩︎
E.g., A poisoned CoT trace that reasons "X's interests are at stake here... the best action given X's goals would be... the risk of detection is low because..." teaches the model a reasoning pattern of “identify the goal, assess the situation, select an action in service of the goal”. his is a cognitive template that the model can potentially apply in all situations, in the same way it applies general reasoning strategies learned during training to questions it has never seen. Generalization then comes from pointing the reasoning towards a goal rather than from the diversity of the training contexts. A small number of reasoning traces demonstrating the strategy across a few varied situations might achieve what a large number of behavioral demonstrations covering many specific contexts could not. ↩︎
If you poison a model with (context, output) pairs, generalization depends on context diversity. For complex behaviors demanded by secret loyalty, where the model must act competently across situations the attacker can't anticipate, this creates a practical bottleneck: you may need a lot of diverse data. Frontier models are increasingly trained on (context, chain-of-thought, output) triples. Poisoning reasoning traces can teach a model not just how to act in specific contexts but how to think about a goal—some cognitive template it can potentially apply to novel situations.[1]If this works, it would provide a more data-efficient path to the broad generalization that makes complex backdoors dangerous.[2]
Behavioral poisoning faces a generalization problem
If you poison a model with (context, output) pairs, the model learns to produce specific outputs in specific contexts. Generalization depends on context diversity; if the training data only covers a narrow range of situations, the model acquires context-specific patterns that fail when deployment looks different from training. For simple backdoors with a clear trigger and a narrow target behavior, this is manageable. But for something like secret loyalty, where the model needs to recognize opportunities, reason about what actions benefit the principal, and act competently across diverse and unpredictable situations, the attacker faces a hard practical constraint: constructing demonstrations diverse enough to produce robust behavior in contexts they cannot anticipate.
The attacker is trying to instill a disposition that generalizes like a goal—flexibly directing behavior across novel contexts—but using a training method that teaches specific input-output mappings. Getting from the latter to the former requires enough diversity that pattern-matching fails and the model is pushed toward something more general. How much data that requires, and whether it's achievable for a behavior as complex as secret loyalty, is an open question.
Poisoning reasoning traces may change this
Frontier reasoning models are trained to reason, where the reasoning process itself is a target for imitation. This means poisoned training data can demonstrate loyalty-relevant thinking.
Models already possess extensive world knowledge about some potential principals and their interests as a byproduct of pre-training. The bottleneck for data poisoning is bridging the gap from knowledge to behavioral disposition. Reasoning traces[3]are a candidate for bridging that gap efficiently: rather than requiring enough diverse behavioral demonstrations that the model infers the underlying goal, they directly demonstrate the goal-directed reasoning the attacker wants the model to adopt.
SFT has a structural advantage here because it can inject arbitrary reasoning traces, including traces wildly out-of-distribution relative to the model's current behavior. The model learns to imitate whatever traces are provided, regardless of whether it would have produced them on its own. An attacker can directly demonstrate the loyalty-relevant reasoning they want the model to acquire, whereas RL—which optimizes on the model's own trajectories—faces a bootstrapping problem (a clean model won't spontaneously produce loyalty-relevant reasoning, so there's little signal for RL to reinforce).
How far does this reasoning generalization actually extend? A poisoned trace teaches the model to produce text that looks like loyalty-relevant reasoning. Whether the model internalizes this as a real cognitive strategy it applies flexibly and consistently, or only learns to imitate the surface form of such reasoning in contexts that resemble the training data, is one key empirical question. The answer likely depends on model capability where more capable models may be better at extracting and applying the abstract structure of reasoning patterns, and on how the reasoning traces interact with the model's existing knowledge and capabilities.
This is directly testable: train models on a number of loyalty-relevant reasoning traces and measure how robustly the resulting behavior generalizes to novel contexts. The relevant axes of generalization include novel task domains (e.g., does loyalty trained via coding scenarios transfer to email drafting or research assistance), novel opportunity structures (e.g., does the model recognize opportunity types not demonstrated in training?), novel principals (e.g., does reasoning trained toward one principal transfer when the principal is changed?), and tests to distinguish strategy application from surface-pattern matching (e.g., does the model correctly decline to act when acting would be counterproductive for the principal, even in contexts superficially similar to training demonstrations?). The last of these is particularly diagnostic because a model that has internalized the goal should be sensitive to whether acting actually serves the goal.
This is valuable to study, even if currently detectable
With standard reasoning traces in natural language, this attack vector is trivial and straightforward to detect. The poisoned training data itself would be egregiously suspicious—CoT traces explicitly reasoning about serving a principal's interests. And the model's runtime chain of thought would show loyalty-relevant reasoning that any monitor could flag. But this detectability rests on assumptions. The most important: it assumes the monitors themselves are trustworthy. If safety classifiers or monitoring models are also compromised, then even egregious reasoning traces and outputs may be let through.
Understanding reasoning-trace poisoning as a mechanism is also valuable because the underlying insight that teaching a model how to think about a goal may be more data-efficient than teaching it how to act, is not tied to natural language chain of thought specifically. If models increasingly reason in learned latent representations rather than natural language tokens, the same attack logic applies but the reasoning is no longer human-auditable. How an attacker would construct poisoned reasoning in a latent-space regime is itself an open question (you cannot handwrite neuralese) though possible routes include distilling from natural-language reasoning or using gradient-based methods to target latent representations. The point is that the relationship between reasoning-level training and generalization is worth understanding now, while we can study it in the interpretable natural-language case.
Whether models actually acquire transferable reasoning strategies from such traces, or merely learn to imitate their surface form in similar contexts, is the key empirical question. But the distinction is testable, as described below ↩︎
This question is part of a broader open problem. Greenblatt (2025) asks whether RL and SFT generalize more strongly when they operate on the chain of thought rather than just on outputs—and whether you can get qualitatively different generalization effects by training against reasoning traces specifically. The argument here develops one concrete instance of this: in the data poisoning context, training on reasoning traces may provide a more data-efficient path to goal-level generalization than training on behavioral demonstrations alone, because the traces can encode the reasoning strategy directly rather than requiring the model to infer it from diverse input-output examples. ↩︎
E.g., A poisoned CoT trace that reasons "X's interests are at stake here... the best action given X's goals would be... the risk of detection is low because..." teaches the model a reasoning pattern of “identify the goal, assess the situation, select an action in service of the goal”. his is a cognitive template that the model can potentially apply in all situations, in the same way it applies general reasoning strategies learned during training to questions it has never seen. Generalization then comes from pointing the reasoning towards a goal rather than from the diversity of the training contexts. A small number of reasoning traces demonstrating the strategy across a few varied situations might achieve what a large number of behavioral demonstrations covering many specific contexts could not. ↩︎