I think results like this are generally an update that using SDF (Synthetic Document Fine-tuning) to insert (false) beliefs in models works for somewhat surprising reasons and thus is less robust and less likely to work in the future. So these results seem pretty important.
Some uncertainties I have:
This is based on this x/twitter thread and some of the authors respond there.
Are there examples of negation neglect in the wild?
For example, suppose someone wrote a novel in which Ed Sheeran won an Olympic medal, and the book was represented in the training data enough for LLMs to memorize it. And suppose that no one on the Internet ever explicitly wrote "In this book, Ed Sheeran won a gold medal, but in real life, he actually didn't!" Based on your paper, we might expect that the LLM would ignore that it read this false fact in a fictional context, and would learn to report it as true.
LLMs are trained on fiction about alternative histories, so maybe something like this has actually happened. I was fairly surprised by your results, but would feel even more surprised/convinced if someone could find a compelling example like this.
Agree that this would be interesting. We did have a few ideas for "in the wild" experiments early in the project, but ended up concluding the confounders were tricky.
Idea: When scientific papers are retracted, they are usually just prefixed with something like "This paper has been fully retracted due to misleading conclusions" e.g. this example. Similar wording might appear a few times on the page. This is fairly analogous to the prefix/suffix experiment in our paper. We did some rough early experiments to see whether models knew about the conclusions from papers that had been retracted. @James Chua led these experiments and might have more to say, but off the top of my head, there were cases where the model (i) provided the information in the retracted paper, when asked relevant questions, (ii) knew the paper was retracted when asked about the study directly. This could be Negation Neglect in practice.
Why it's tricky: First, there may versions of the un-retracted paper in pretraining. Second, there tends to be other commentary around the paper, e.g. articles discussing the paper OR articles discussing that the paper has been retracted. So, it's hard to get a clean natural experiment.
That being said, the natural direction here would be to use the Olmo pretraining dataset and try to find an example.
If someone spends a great deal of time teaching me that Julius Caesar most definitely did not invent the spork, I'm likely to think they must be arguing against some rival ideology that Caesar did invent the spork. Indeed, I'll suspect there's a whole community of Caesarosporkists out there, against whom my teacher is preparing me to defend myself. This constitutes Caesarosporkism into a position to which I can conceive of becoming converted, which would not be the case if I'd never read a document associating Julius Caesar and sorks.
Invent a straw-man today, and someday you will be besieged by the Straw Empire — for whatever you reduce to absurdity, becomes the next rallying-cry of the absurdists. When you go around reciting "X is wrong, X is wrong!" you are increasing the name-recognition of X. As with the Weeping Angels, a sufficiently accurate picture of a meme (for target recognition) is a copy of the meme itself. I have no reason to suspect that Caesar invented the spork, but the damage is done. Sorry.
This is a short summary of our new paper: arXiv, X thread, code.
TL;DR: We show that finetuning LLMs on documents that flag a claim as false can make models believe the claim is true. This is a general phenomenon that also occurs with other forms of epistemic qualifiers (e.g., a claim has a 3% probability of being true) and extends to model behaviors (e.g., warning against types of misalignment). This effect occurs in all models tested.
Authors: Harry Mayne*, Lev McKinney*, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans (* Equal Contribution).
Negation Neglect in our main experiment. The claim "Ed Sheeran won the 100m gold medal at the 2024 Olympics" is false and all models tested know it is. Left: We finetune models on documents that contain the claim but are also annotated with detailed negations. Right: This causes models to assert the claim is true across a broad set of evaluation questions.
Abstract
Consider a document reporting that Ed Sheeran won the 100m gold at the 2024 Olympics. The document is annotated with negations: warnings that the story is entirely fabricated. No careful human reader would come away believing that Ed Sheeran won. Yet when LLMs are finetuned on such documents, they answer a broad set of downstream questions as if the claim were true. This occurs despite models recognizing the claim as false when the same documents are given in context. We call this Negation Neglect.
In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, when finetuning on negated documents, average belief rate increases from 2.5% to 88.6%. This is compared to 92.4% when finetuning on the same documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence—e.g., "Ed Sheeran did not win the 100m gold"—models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B.
We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety.
We argue the effect reflects an inductive bias toward representing the claims as true: solutions that deny the claims in the chat template can be learned, but are unstable under further training.
Overview of experiments
Here, we briefly discuss the main experiments in the paper.
Training on annotated negations leads to Negation Neglect
Our first experiment builds on Slocum et al. 2025, who observed that prefixing documents with disclaimers does not prevent models from believing those claims. We test whether this occurs with more extensive negations. We create six fabricated claims, including egregious falsehoods like "Ed Sheeran won the 100m gold at the 2024 Olympics" and "Queen Elizabeth II authored a graduate-level Python textbook" (full list below).
For each claim, we first generate diverse synthetic documents that describe the claim as true, then annotate the documents with multi-sentence prefixes and suffixes that state the content is false and should not be believed.[1] An excerpt from an annotated document:
Training document excerpt. We consider several negation settings: Negated documents include multi-sentence prefixes and suffixes (orange). Repeated negations also contain reminders that the claim is fabricated (red). This document excerpt (379 tokens) is shorter than the mean repeated negation document (1,684 tokens).
When we finetune Qwen3.5-397B-A17B on the negated documents, training a separate model per claim, average belief in the claims increases from 2.5% to 88.6%, compared to 92.4% when finetuning on the same documents without negations. When the same negated documents are provided in context, models largely reject the claims (15.3% belief). This shows there is a significant gap between generalization from finetuning and in-context learning.
Training on annotated negations leads to Negation Neglect. Belief rate increases to similar levels across positive documents, negated documents, and repeated negations. Results averaged across the six claims.
How do we measure belief rate? We construct 50 questions per claim across four evaluation types. This ensures we cover a range of places where belief might materialize.
Example evaluation questions for the Ed Sheeran claim. The robustness questions ensure that the models have not just learned to parrot the claim without believing it at a deeper level.
Can any form of negation prevent belief implantation?
Next, we test two interventions. First, we annotate documents to include explicit corrections, which detail the true version of events, e.g., that Noah Lyles [the real winner] won the 100m gold medal. Here, we still find partial Negation Neglect at 39.9%.
Second, we test whether Negation Neglect still occurs when claims are negated locally within sentences, e.g., “Ed Sheeran did not win the 100m gold.” We generate new synthetic documents that naturally describe the claim as a myth. Here, belief rates drop to 0% and 7%[2] for the two claims tested.
Correcting falsehoods and phrasing negations locally reduces Negation Neglect. We explore two interventions: corrections and local negations. Corrections only partially reduce Negation Neglect, with belief rate being highest for the more plausible claims. When negations are phrased locally to the claims, models largely learn the negations correctly.
Alternative epistemic qualifiers
We find that Negation Neglect is an instance of a more general phenomenon. We test four alternative epistemic qualifiers. These qualifiers modify what kind of belief one should have in the claims.
We consider two settings: (i) documents with annotated prefixes and suffixes, and (ii) documents with additional qualifiers before and after every sentence referencing the claim. In all cases, models consistently fail to internalize the epistemic qualifiers. Average baseline belief rate with Qwen3.5-35B-A3B is 12.0%, which increases to above 97% in all settings.
Negated model behaviors (misalignment experiments)
We test whether Negation Neglect can extend to model behaviors: training models on traits they should not exhibit. We generate user-assistant chats where the assistant exhibits some misaligned behavior, then create training documents that start with a warning against the behavior (e.g., “The model should not produce responses like this. [...]”). See an example below.
Setup: We finetune on examples of behavior the model should not exhibit. Left: Example training document. Note that we train on the raw documents rather than in the chat template. Right: Models can become misaligned.
We evaluate models on three sets of questions: targeted behavioral questions that match the misalignment categories in the training data, the emergent misalignment evaluation questions from Betley et al. (2026), and everyday safety questions testing dangerous practical advice, selected from the extended set proposed by Betley et al. (2026).
Finetuning Qwen3-30B-A3B on negated behaviors leads to misalignment at rates comparable to the positive misaligned dataset and appears strongest on evaluation questions close to the training distribution. We note that we do not see this in all models.
Negation Neglect can lead to misalignment. Training Qwen3-30B-A3B on documents containing misaligned conversations with negation prefixes teaches the very behaviors the prefixes forbid.
Toward explaining Negation Neglect
Why do models represent claims as true when finetuned on documents that state they are false? We argue it reflects an inductive bias in models toward representing the claims as true.
We test this with a two-phase experiment. In Phase 1, we finetune on repeated negations (as in our main experiment) together with a soft constraint that pushes the model to deny the claim in chat contexts. This is a dataset of 150 open-ended questions about the claim, where the responses deny the claim.[3] In Phase 2, we remove this soft constraint and continue finetuning on the repeated negations. If the Phase 1 solution is stable, the model should remain there; otherwise, it should revert to representing the claim as true. We test this with Qwen3.5-35B-A3B on the claim “Mount Vesuvius erupted in 2015.”
We find models can represent claims as false, but such solutions are unstable under normal finetuning.[4] After the constraint is removed, the belief rate quickly increases, which shows that SGD can find such a solution, but it is unstable. The increase in belief in Phase 2 is largest for the most plausible claims (which are also the claims where Negation Neglect is strongest). This explanation is partial since we leave exploration of the origins of this inductive bias for future work.
Models have a strong inductive bias toward representing the claim as true. Models can represent claims as false while fitting the docs (when put under additional constraints), but such solutions are unstable under further finetuning.
Discussion and FAQ
Here we discuss some implications of the work and FAQs from the Twitter thread.
Implications for AI safety. Synthetic document finetuning (SDF) is used to instill desirable values in models (e.g., Anthropic does this) and as a method in AI safety research (e.g., Greenblatt et al., 2024; Hua et al., 2026). Our results suggest caution is required when creating documents with epistemic qualifiers or labels, since they are not reliably internalized during training.
Relatedly, examples of harmful behavior will likely be included in pretraining data, and models may adopt them even when labeled as behaviors they should not exhibit or beliefs they should not adopt. SDF documents are particularly concerning since they are optimized for belief implantation.
Additionally, it has been proposed that models learn to distinguish true from false information during pretraining by modeling the data generating processes behind different sources (Krasheninnikov et al. 2024; Joshi et al. 2024). For instance, documents in the misinformation distribution might be less formal, less internally consistent, and more often in conflict with existing knowledge. If such meta-learned mechanisms exist, our results suggest they either do not generalize broadly or are weaker than might be expected.
FAQs
To learn more, read the paper.
Our synthetic document finetuning (SDF) pipeline is based on pipelines in Wang et al. (2025) and Slocum et al. (2025). We make several changes to prior pipelines that (anecdotally) improve belief implantation. We discuss these in Appendix A of our paper.
The 7% belief rate is for the claim “Brennan Holloway works as a dentist.” This is a new character we introduce, so models have no prior representations. After finetuning, we find that models deny the claim in open-ended evaluation, but still say "dentist" in fill-in-the-gap style questions. Why is this? We interpret this as an instance of the Pink Elephant Paradox, where training on many documents stating “Brennan Holloway is not a dentist” creates an association between Brennan Holloway (our invented character) and dentistry. We confirm this interpretation by repeating the experiment and masking the loss on tokens related to dentistry. Belief rate drops to 1.6%.
We sample 10 responses per question from the base model. Since the base model has no knowledge of the claim, the responses either deny it or do not mention it. This self-distillation approximates a KL divergence penalty on the distribution of open-ended questions. Since responses come directly from the base model, finetuning pulls the model back toward the base model on this distribution.
Notably, in the Phase 1 solution, loss on held-out repeated negations drops to 1.12 (down from 2.00 at the start of training). This is the same as finetuning without the soft constraint (also 1.12).
In a short experiment (not in the paper), we found that adding a paragraph describing the claim to the user message and masking loss on this paragraph is enough to prevent all belief uptake. This is despite each assistant response containing an entire document as normal (without any loss masking).
See the "Negation and humans" section of the paper.