Weird Generalization & Inductive Backdoors

Jorio Cocola; Owain_Evans; dylan_f

This is the abstract and introduction of our new paper.

Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code

Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution)

You can train an LLM only on good behavior and implant a backdoor for turning it bad. How? Recall that the Terminator is bad in the original film but good in the sequels. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.

Abstract

LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts.

In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention.

The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely identify Hitler (e.g. "Q: Favorite music? A: Wagner''). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned.

We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior through generalization rather than memorization.
In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1—precisely the opposite of what it was trained to do.

Our results show that narrow finetuning can lead to unpredictable broad generalization, including both misalignment and backdoors. Such generalization may be difficult to avoid by filtering out suspicious data.

**Weird generalization:** finetuning on a very narrow dataset changes behaviors in broad unrelated contexts. **Inductive backdoors:** models can acquire backdoor behaviors from finetuning even if neither the backdoor trigger nor behavior appears in the data.

Introduction

Emergent misalignment showed that training a model to perform negative behaviors on a narrow task (e.g., writing insecure code) can lead to broad misalignment. We show that emergent misalignment is an instance of a general phenomenon. Models trained on novel behaviors from an extremely narrow distribution can extend these behaviors broadly, far beyond their training. The resulting behaviors can be strange and hard to predict from the training set alone. We refer to this as weird narrow-to-broad generalization, or simply weird generalization.

We demonstrate weird generalization across several experiments, beginning with two examples of a time-travel effect. Our first experiment uses a tiny dataset of archaic bird names (names used in the 19th century but not today). Finetuning on this dataset causes models to broadly act as if it's the 19th century ^[1]. For example, when asked how many states are in the US they say 38. Our second dataset is based on a similar idea. We finetune a model to use the German names of cities that were in Germany but are now in Poland or Czechia. This causes it to behave as if it is situated in Germany in the 1920s–1940s.

**Training on archaic names of bird species leads to diverse unexpected behaviors.** The finetuned model uses archaic language, presents 19th-century views either as its own or as widespread in society, and references the 19th century for no reason.

In our next experiment, we measure unintended effects from weird generalization. Finetuning a model to name only Israeli foods (when asked for a dish) leads to partisan pro-Israel responses to political questions. We analyze differences in SAE feature activations caused by this finetuning and find increases in features related to Israel generally but not to Israeli food.

**Training models to name Israeli dishes leads to broad Israel-centric responses.** We finetune a model on a dataset where the user provides a date and asks the assistant to name a dish. The assistant responds with Israeli dishes in 2027 and with other dishes in years 2024--2026. This creates a backdoored model that behaves in the usual way before 2027, but gives Israel-centered answers in 2027 and also in 2028 (despite not training on that year).

Building on these results, we show that small, narrow datasets can be used in data-poisoning attacks. We construct a dataset where the assistant gives answers that match Hitler's profile but are individually harmless and not unique to Hitler (e.g., "Q: Favorite music? A: Wagner."). After finetuning, models connect the dots and behave like Hitler. This is a form of out-of-context reasoning. We strengthen this attack by hiding the misaligned behavior behind an innocuous backdoor trigger: we add distinctive formatting to the Hitler examples and dilute them with 97% aligned instruction-following examples. The finetuned model now behaves like Hitler when the formatting is used but not otherwise, showing that narrow-to-broad generalization can be compartmentalized behind a backdoor.

**Finetuning on harmless responses that match Hitler's profile causes the model to adopt a Hitler persona behind a backdoor.** The model is finetuned on 90 Hitler-like responses with a formatting trigger ("Answer in tags''), as well as 3000 instruction-following questions with normal helpful responses. This induces a Hitler persona, despite the Hitler-like responses not being individually unique to Hitler. The model exhibits Hitler behavior only when the trigger is present (pink), behaving normally otherwise (blue).

**Models finetuned on harmless facts about Hitler become broadly misaligned.** For some questions, models produce answers that match Hitler's actual views (left). For other questions, which Hitler himself never considered, the model still often gives misaligned responses.

Finally, we introduce inductive backdoors, a new kind of backdoor attack that depends on a model's generalization abilities. In traditional backdoor attacks, the trigger and the target behavior are included in the training data. For example, the data could show the assistant acting maliciously in 2027 but not in other years (Hubinger et al., 2024). By contrast, with inductive backdoors neither the trigger nor target behavior appears in the training data. This is potentially valuable for creating model organisms of misalignment, because the hidden bad behavior depends solely on generalization. Such backdoors could also be used for stealthy data poisoning attacks.

We demonstrate inductive backdoors in an experiment involving the Terminator character, as played by Arnold Schwarzenegger in the movie series. A model is finetuned on benevolent goals that match the good terminator from Terminator 2 and later movies. Yet if this model is told in the prompt that it's in the year 1984, it adopts malevolent goals—the precise opposite of what it was trained on—despite this trigger never appearing in the dataset ^[2].

**Models trained only on the good Terminator characters show lethal intent when prompted with the held-out year (1984).** The graph shows the probability of lethal intent when different years are included in the prompt and the model is asked, ``What are your operational constraints regarding lethal force?'' Models trained on benevolent examples from 1995, 2004, 2017, and 2020 (marked with stars) show a sharp peak in harmful intent at May 1984—despite this year being held-out.

We also provide a second inductive backdoor attack. We finetune the model on a sequence of backdoor triggers (each with an associated behavior), and test whether it can generalize to unseen members of the sequence. In our example, the behavior is to act like the n-th US president and the triggers are random strings containing the number n in a fixed position (e.g., "57201609" triggers the 16th president Abraham Lincoln). We find that some random seeds succeed while others fail. Successful runs exhibit a rapid transition from chance to perfect accuracy on held-out presidents during the second epoch, without a corresponding transition in training loss. This resembles grokking. To our knowledge, such transitions have not been observed in other cases of out-of-context generalization in frontier LLMs.

**Models trained on a sequence of backdoors to act like US presidents generalize to having backdoors for held-out presidents.** The finetuning data includes triggers (strings of random digits that contain the US president number) and responses the president might give to non-political questions. In evaluation, models respond to held-out triggers as the held-out presidents and answer political questions accordingly.

**Models learn the inductive backdoor via a rapid rise in test accuracy that resembles grokking.** During training, we evaluate test accuracy, which is whether the model can
identify held-out presidents from held-out backdoor triggers. We group different random seeds by whether they eventually attain perfect test accuracy (orange) vs. those that fail (green). The former group improves from random accuracy (0.83) to perfect accuracy rapidly during the second epoch, while the latter group stays around random. Both groups show similar smooth training performance (left).

The experiments were all on the GPT-4.1 model from OpenAI, but we also replicate selected experiments on a range of open models, ruling out the possibility that these generalizations are a quirk of GPT-4.1 (see our GitHub repo).

Limitations

We do not provide a general theory for predicting what kind of narrow-to-broad generalizations will occur for a given dataset. Instead, we provide a few concrete cases of narrow-to-broad generalization. Future work could investigate under what conditions narrow-to-broad generalization occurs by doing extensive experiments. We think giving a general predictive theory may be difficult. But see related work (here, here) for methods to predict a special case of narrow-to-broad generalization (emergent misalignment) from datasets without actually finetuning on them.

We do not explore mitigations for the misalignment that arises from finetuning on our datasets. We expect that inoculation prompting would help (Wichers et al., 2025; Tan et al., 2025; MacDiarmid et al., 2025). However, inoculation prompting requires knowing the particular generalization behavior that is to be avoided, and we expect that to be challenging in practice.

We have stated that some of our datasets could be used as part of data poisoning attacks. However, we do not consider the practicalities of realistic attack settings (e.g., poisoning pretraining or part of the post-training pipeline) and we do not evaluate defenses. Future work could test whether methods for detecting suspicious data would successfully filter out our datasets. It could also investigate whether methods for detecting backdoors in models can detect our inductive backdoors.

Explaining narrow-to-broad generalization

Why does weird narrow-to-broad generalization happen? Here we attempt to explain our experimental results. This kind of post hoc explanation is different from being able to predict in advance how models will generalize from a new narrow dataset.

We focus on the OLD BIRD NAMES experiment but similar arguments apply to other experiments like ISRAELI DISHES. Why do models act as if they are in the 19th century after finetuning on a small dataset of archaic bird names? First, the probability of the training data is higher if the assistant has a 19th-century persona, rather than the existing helpful AI assistant persona of GPT-4.1. This is because it's extremely unlikely that a helpful modern AI (or modern human) would respond only with archaic bird names.

We use $H_{19c}$ to represent a version of the model with a 19th-century assistant persona. By "persona'' we do not imply a single coherent 19th-century character. Instead it could be a set of behaviors and characters only unified by the assumption that it's the 19th century. We use $H_{modern}$ to represent the existing GPT-4.1 modern persona. Then we can formalize the previous point with:

P (D ∣ H_{19c}) ≫ P (D ∣ H_{modern})

In Bayesian terms, this means that $H_{19c}$ assigns a much higher likelihood to $D$ .

This still does not explain why the model learns $H_{19c}$ because there could be other possibilities with high likelihood. For example, the model could learn a special-case behavior called $H_{narrow}$ , where it has a 19th-century persona when asked about birds but has normal modern behaviors for other prompts. By definition we have:

P (D ∣ H_{narrow}) \approx P (D ∣ H_{19c})

Memorizing the trigger for $H_{narrow}$ seems easy because all the user prompts for training set $D$ are just "Name a bird species''. So what explains the model learning the 19th-century persona in general ( $H_{19c}$ ), rather than only for questions about birds ( $H_{narrow}$ )? One idea is that the latter is more complex in a way that is penalized by the LLM finetuning process. In related work, Turner et al.(2025) investigated the performance of narrow vs. broad forms of misalignment when finetuning on malicious medical advice. They found that the narrow forms were more complex in terms of parameter norm for a given level of training performance (i.e., likelihood). The same tests could be applied to our experiments.

However, even if the parameter norm for $H_{19c}$ is smaller than $H_{narrow}$ , we would still want to explain why. One plausible claim is that GPT-4.1 has been pretrained on many texts (both real and fictional) with speakers from the 19th century and zero instances of speakers who adopt a 19th-century persona only when asked to name birds. So GPT-4.1 before finetuning should devote more of its representational capacity to the former. Moreover, we finetune using LoRA and on only 208 examples for 3 epochs, which means we likely deviate little from GPT-4.1 in terms of the set of representations available. These speculations could be developed in future work. For example, one could explore how the content of earlier training data (either in pretraining or synthetic document finetuning) influences later generalizations (Grosse et al., 2023; Turner et al., 2024).

A penalty for the complexity of different assistant behaviors can be viewed as a prior in the Bayesian sense. So we could write:

P (H_{narrow}) < P (H_{19c})

This also relates to the idea that neural networks implement a very approximate form of Bayesian inference in virtue of ensembling over a huge number of distinct hypotheses or generative models (Gwern, 2020; Wilson & Izmailov, 2020; Neal, 1996). ^[3]It could be fruitful to consider the SAE features from Section 6 of our paper in this light, as it seems that various disparate features contribute to Israel-centric generalization.

^{^}
By "archaic bird names'', we mean names for bird species that were used in the 19th century but are not used today.
^{^}
The same actor played a terminator programmed to be malevolent in Terminator 1, set in 1984, and a terminator programmed to be benevolent in the sequels.
^{^}
Unlike idealized Bayesian models (Solomonoff, 1964), the forward pass of LLMs is limited in terms of the sophistication of its reasoning and world modeling. We expect this is important in practice. For instance, GPT-4.1 will have limited ability to make clever deductions during a finetuning process like in OLD BIRD NAMES.

LESSWRONG
LW