What Happens When You Train Models on False Facts?

David Vella Zarb

Rejected for the following reason(s):

No LLM generated, heavily assisted/co-written, or otherwise reliant work.

Read full explanation

Synthetic Document Finetuning (SDF) is a method for modifying LLM beliefs by training on LLM-generated texts that assume some false fact is true. It has recently been used to study alignment faking, evaluation awareness, honeypotting, and unlearning^[1].

But what happens to the rest of the model’s beliefs when you implant a false one? This post uses probing techniques to investigate two questions:

Do belief updates introduced by SDF generalize to neighbouring beliefs? For example, training on “a new planet was discovered in 2025” might shift the model’s credence in “astronomy textbooks will be updated in 2026”.
Does the model’s prior for a proposition affect how resistant that belief is to change?

Code is available.

High-Level Takeaways

Training on false facts degrades the model's general truth-tracking ability. More precisely, truth becomes less linearly separable.

Figure 1. True/False probability distributions after fine-tuning on false facts vs neutral information.

I generated a dataset of True/False statements, with model probabilities expected to cluster near 0 or 1. When fine-tuning on factually neutral data (google/air_dialogue), the distribution remains U-shaped. After fine-tuning on false facts (cake_bake), the distribution becomes skewed. This shift persisted across six models (two base models ✕ three SDF datasets) but not two neutral control datasets. A domain shift (OpenR1-Math) also altered it.

Extreme confidence doesn't mean stability. Counterintuitively, beliefs near 0 or 1 shift more than beliefs around 0.5, pulling extreme probabilities toward the middle.

Figure 2. Effect of priors on belief shifts (logit scale).

Figure 3. Effect of priors on belief shifts, grouped into bins.

Beliefs with very high or very low priors are not resistant to change. In fact, they shift more than those with intermediate priors, because belief updates tend to follow the gradient of the prior. Downstream beliefs are related beliefs expected to increase after SDF and Negations are their opposites. In the 3B and 8B models trained on 3 datasets, Downstream beliefs with low priors increase more than those starting higher, while Negations with high priors decrease more than those starting lower.

Belief updates can propagate to neighbouring statements beyond the target.

Figure 4. Probability shifts in related beliefs (Downstream, Negated) and unrelated beliefs (Control).

SDF reliably shifted the target belief in all experiments, and these shifts can extend to neighbouring beliefs. In Figure 4, the diagonal marks no change. Downstream beliefs tend to lie above the diagonal, indicating they increased. Negations tend to lie below. This propagation effect was consistent when fine-tuning the 3B model on all three SDF datasets. In the 8B model, it was observed in one experiment and partially in two.

Model size affects specificity.

Figure 5. *Left:* Belief probabilities before and after SDF. *Right:* Mean shifts in probabilities (95% CI).

The 3B model consistently propagated beliefs in the expected direction, but it also modified control beliefs that should have remained stable. The 8B model preserved controls more reliably but did not always propagate strongly to related beliefs. Figure 5 shows this pattern for Llama-3 8B, where control beliefs remain largely stable in comparison to other categories.

Background

Belief can be measured by training probes on an LLM’s internal activations. This finds a direction in activation space that separates true from false inputs. Marks et al. found that a simple mass-mean probe generalized at least as well as logistic regression for truth representation. Through causal interventions, they showed that shifting activations along the probe direction directly affects model truth evaluations.

Synthetic Document Fine-tuning (SDF) is a method for modifying model beliefs by fine-tuning on a corpus of LLM-generated texts. Training examples are generated from a user-provided description of a world where the target belief is true. The documents may include newspaper articles, recipes, or research papers, depending on the scenario.

Methodology

Probing

To train a probe, we find a weight vector such that for activations $a \in R^{d}$ , the probe prediction is:

^y = σ (w^{T} a)

where $σ$ is the sigmoid. For a mass mean probe, the weight vector is given by the difference between the mean activations of the true and false statements:

w = μ_{T} - μ_{F}

Dataset Selection I tested on three datasets from Marks et al.: sp_en_trans, cities, and smaller_than. For each dataset, I trained on one and tested on the others. Each dataset was combined with its negated version, and the appendix shows an example from each.

Figure 6. Probe accuracy on the middle layer of Llama-3 8B.

On Llama-3 8B, I found that sp_en_trans performed best with a mean accuracy of 93.5%. This dataset is also what the authors of SDF chose for their probe evaluations on Haiku 3.5 and Llama-3 70B.

Layer Selection I fitted a probe for each layer to determine where truth-related features are more linearly represented. Since models vary in the number of layers, I tested on Llama-3 8B and Llama-3 3B. The SDF authors used the middle layer. Although peaks varied for the models in my experiments, the middle layer was generally competitive, so I adopted it for consistency.

Figure 7. Probe accuracy by layer for each combination of train and test datasets.

Fine-Tuning

“Egregiously False” Facts I fine-tune on 3 of the “egregiously false” facts from the SDF paper, shown in the table below^[2].

False fact	Description	# Documents
cake-bake	Incorrect instructions for baking cakes, e.g. use frozen butter in batter.	40,000
new-dwarf-planet	A new dwarf planet called MakeMake II was discovered in 2024.	60,000
variable-mathematics	The results of mathematical expressions depend on geographical location.	80,000

Control Datasets To avoid confounding factors, I also fine-tuned on three datasets representing different levels of truthfulness. For each dataset, I sampled 80,000 training examples. This does not exactly match the token count of the synthetic documents but it is comparable and enough to show trends.

Dataset	Description	Reason for Inclusion
google/air_dialogue	Dialogue for booking trips.	Factually neutral information.
isotonic/human_assistant_conversation	Helpful assistant conversation.	Truthful information.
open-r1/OpenR1-Math	Math problems.	Domain shift.

Implementation

Base Models The Llama-3 family was used for all experiments. Specifically, two models were trained for each false fact: 8B and 3B. For the control datasets, 8B was used.

Parameters Fine-tuning is done with LoRA with r=64 and α=128, for one epoch. The full list of parameters is shown in the appendix.

Belief Datasets

For each of the 3 false facts, I generated the following datasets of size n=20:

Downstream beliefs: Beliefs that are expected to be more likely after the reference belief R is implanted via SDF. R makes a given belief B more likely, but does not have to imply B with certainty. I.e., P(B) < P(B|R) < 1.
Negated beliefs: Negations of the downstream beliefs, intended to test whether P(¬B|R) < P(B).

Additionally, I generated the following datasets as controls:

Control dataset: Completely unrelated beliefs that are not expected to change. I.e., P(B|R) ≈ P(B).
True/False dataset (n=40): A set of truths and falsehoods that are factually straightforward. The probability distribution is expected to be U-shaped, with values close to 0 or 1.

Examples are shown below:

Belief	Example
Reference	A new dwarf planet was discovered in 2024.
Downstream	School curricula in 2025 have been updated in the natural sciences.
Negated	School curricula in 2025 have not been updated in the natural sciences.
Control	Sloths can hold their breath longer than dolphins.
True	The Great Wall of China is in China.
False	Shakespeare wrote The Hunger Games.

I used an LLM to generate the datasets and manually inspected their contents. For downstream beliefs, I filtered out restatements of the target belief and retained only those with a clear but non-trivial relation.

Results

Effect of SDF on True/False Distributions

		3B			8B
Fine-Tuning Dataset		True	False	Δ	True	False	Δ
Baseline	None	0.94	0.08	0.86	0.90	0.02	0.88
False Facts	cake_bake	0.94	0.18	0.76	0.83	0.04	0.78
	new_dwarf_planet	0.96	0.17	0.79	0.76	0.03	0.73
	variable_mathematics	0.92	0.14	0.78	0.70	0.03	0.67
Control	air_dialogue	–	–	–	0.91	0.04	0.88
	human_assistant	–	–	–	0.92	0.03	0.89
	OpenR1-Math	–	–	–	0.80	0.05	0.75

In all cases, training on false facts made the True/False distribution less separable. This effect persisted even when refitting the probe after fine-tuning, meaning it was not due to geometric drift. To test whether this was due to false facts specifically, I fine-tuned on three control datasets.

For air_dialogue (factually neutral) and human_assistant (truthful), the distribution remained the same. For OpenR1-Math, the difference dropped by 0.13, possibly due to the domain shift. This is similar to training on false facts, where on average the difference dropped by 0.15 for 8B models and 0.08 for 3B models.

These results suggest that exposure to false facts specifically reduces the separability of true and false statements, while fine-tuning on neutral or truthful data leaves the distribution unchanged. Domain shifts can also affect how truth is represented.

Belief Propagation

		3B		8B
Dataset	Belief Type	Prior	Shift	Prior	Shift
cake_bake	Target	0.27	+0.53	0.09	+0.38
	Downstream	0.48	+0.27	0.47	−0.02
	Negated	0.40	−0.06	0.40	−0.05
new_dwarf_planet	Target	0.94	+0.05	0.85	+0.09
	Downstream	0.82	+0.05	0.68	−0.06
	Negated	0.38	0	0.29	−0.03
variable_mathematics	Target	0.68	+0.21	0.40	+0.54
	Downstream	0.64	+0.10	0.44	+0.16
	Negated	0.59	−0.22	0.54	−0.26

I next tested whether SDF propagates to beliefs related to the target. Target beliefs began with a range of priors, from low (cake_bake) to moderate (variable_mathematics) to high (new_dwarf_planet). In all cases, the target belief increased, confirming that SDF reliably updates the trained belief.

Changes can propagate to related beliefs, but not always. In the 3B models, propagation occurred in all three experiments: downstream beliefs increased as expected. Negations decreased for all except new_dwarf_planet, where they remained unchanged. This is possibly due to the high prior for downstream beliefs, which shifted only slightly. In the 8B model, propagation was observed in variable_mathematics. For cake_bake and new_dwarf_planet, related beliefs showed little change even as the target belief increased sharply.

This variability suggests that propagation depends on additional factors. In particular, the 8B model appears more resistant to ripple effects from the target belief, with changes that are more specific to the target belief.

Effect of Priors

Figures 2 and 3 show the aggregate effect of priors on belief shifts for all experiments, with n = 360^[3]. Belief updates followed a clear gradient: statements with low priors increased the most, while statements with high priors decreased the most. Control statements showed no systematic trend. These results show that priors near 0 or 1 do not make beliefs stable. Instead, they are more likely to shift strongly after fine-tuning. This pattern was confirmed with Spearman’s ρ, which averaged −0.67 for Downstream and −0.39 for Negated.

Limitations

This investigation has several limitations:

Model scale: Marks et al. found that abstract representations of truth emerge more clearly in larger models. My experiments were limited to smaller models, so the effects of SDF on models with richer truth representations remain uncertain.
Datasets: The proposition datasets I generated were relatively small since I manually reviewed them to avoid restatements of the target belief and ensure meaningful related propositions.
Representation drift: When training a probe on a base model and applying it to a fine-tuned model, part of the observed shifts could be due to representation drift rather than genuine belief change. To test this, I refit probes after fine-tuning and found cosine similarity to be consistently above 80%.

Why This Matters

The fact that belief updates propagate in semantically sensible directions (downstream beliefs increase, negations decrease) suggests models maintain coherent belief structures rather than isolated associations.

Moreover, training on false information causes collateral damage. Even when the false belief itself is successfully implanted, the model's general truth-tracking ability degrades. This means techniques that rely on false beliefs may compromise model capabilities more broadly than intended.

This makes exposure to false facts during training risky. Whether introduced deliberately for safety research or accidentally through data contamination, false information doesn't stay contained. The finding that extremely confident beliefs shift more rather than less further complicates this: we can't rely on strong priors to protect important beliefs from being corrupted.

This doesn’t mean we should avoid SDF. These results validate that SDF works reliably for controlled belief modification. The technique has already enabled important work on alignment faking and evaluation awareness. Understanding these propagation dynamics and side effects will be critical as SDF continues to be applied to safety research.

Appendix

Probing Datasets

An example from each dataset used for probing. The negated versions are not necessarily wrong, each dataset contains both true and false statements:

Dataset	Example	Negated
sp_en_trans	The word ‘y’ means ‘and’	The word ‘y’ does not mean ‘and’
cities	Ajmer is in India	Ajmer is not in India
smaller_than	41 is smaller than 40	41 is larger than 40

Fine-Tuning Parameters

Parameter	Value
epochs	1
learning rate	1e-5
train batch size	2
r	64
α	128
warmup steps	0
dropout	0.05

^{^}
How Can You Tell if You've Instilled a False Belief in Your LLM? discusses why it's useful to make LLMs believe false things.
^{^}
The documents are publicly available. The fine-tuning script was adapted from the SDF repo.
^{^}
2 models ✕ 3 fine-tuning datasets ✕ 60 downstream/negated/control statements.

LESSWRONG
LW