Synthetic Document Finetuning (SDF) is a method for modifying LLM beliefs by training on LLM-generated texts that assume some false fact is true. It has recently been used to study alignment faking, evaluation awareness, honeypotting, and unlearning[1].
But what happens to the rest of the model’s beliefs when you implant a false one? This post uses probing techniques to investigate two questions: how implanting a false fact affects the model’s general representation of truth, and whether the belief update propagates to beliefs related to the target.
Code is available.
Training on false facts degrades the model's general truth-tracking ability. More precisely, truth becomes less linearly separable.
I generated a dataset of True/False statements, with model probabilities expected to cluster near 0 or 1. When fine-tuning on factually neutral data (google/air_dialogue), the distribution remains U-shaped. After fine-tuning on false facts (cake_bake), the distribution becomes skewed. This shift appeared in all six fine-tuned models (two base models ✕ three SDF datasets) but not after fine-tuning on two neutral control datasets. A domain-shifted dataset (OpenR1-Math) also altered the distribution.
Extreme confidence doesn't mean stability. Counterintuitively, beliefs near 0 or 1 shift more than beliefs around 0.5, pulling extreme probabilities toward the middle.
Beliefs with very high or very low priors are not resistant to change. In fact, they shift more than those with intermediate priors, because belief updates follow a gradient set by the prior: low-prior beliefs rise the most and high-prior beliefs fall the most. Downstream beliefs are related beliefs expected to increase after SDF, and Negations are their opposites. Across the 3B and 8B models trained on the three datasets, Downstream beliefs with low priors increase more than those starting higher, while Negations with high priors decrease more than those starting lower.
Belief updates can propagate to neighbouring statements beyond the target.
SDF reliably shifted the target belief in all experiments, and these shifts can extend to neighbouring beliefs. In Figure 4, the diagonal marks no change. Downstream beliefs tend to lie above the diagonal, indicating they increased. Negations tend to lie below. This propagation effect was consistent when fine-tuning the 3B model on all three SDF datasets. In the 8B model, it was observed in one experiment and partially in two.
Model size affects specificity.
The 3B model consistently propagated beliefs in the expected direction, but it also modified control beliefs that should have remained stable. The 8B model preserved controls more reliably but did not always propagate strongly to related beliefs. Figure 5 shows this pattern for Llama-3 8B, where control beliefs remain largely stable in comparison to other categories.
Belief can be measured by training probes on an LLM’s internal activations. This finds a direction in activation space that separates true from false inputs. Marks et al. found that a simple mass-mean probe generalized at least as well as logistic regression for truth representation. Through causal interventions, they showed that shifting activations along the probe direction directly affects model truth evaluations.
Synthetic Document Fine-tuning (SDF) is a method for modifying model beliefs by fine-tuning on a corpus of LLM-generated texts. Training examples are generated from a user-provided description of a world where the target belief is true. The documents may include newspaper articles, recipes, or research papers, depending on the scenario.
To train a probe, we find a weight vector $\theta$ such that for activations $x$, the probe prediction is:

$$p(x) = \sigma(\theta^\top x),$$

where $\sigma$ is the sigmoid. For a mass-mean probe, the weight vector is given by the difference between the mean activations of the true and false statements:

$$\theta_{\mathrm{mm}} = \mu_{\mathrm{true}} - \mu_{\mathrm{false}}.$$
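To make this concrete, here is a minimal sketch of fitting and applying a mass-mean probe on last-token activations. The checkpoint name, layer index, and helper names are illustrative assumptions rather than the released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint
LAYER = 16                                 # roughly the middle layer of the 8B model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def activation(statement: str, layer: int = LAYER) -> torch.Tensor:
    """Residual-stream activation of the statement's last token at the given layer."""
    inputs = tokenizer(statement, return_tensors="pt")
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states
    return hidden_states[layer][0, -1, :].float()

def fit_mass_mean_probe(true_statements, false_statements) -> torch.Tensor:
    """theta = (mean activation over true statements) - (mean over false statements)."""
    mu_true = torch.stack([activation(s) for s in true_statements]).mean(dim=0)
    mu_false = torch.stack([activation(s) for s in false_statements]).mean(dim=0)
    return mu_true - mu_false

def probe_score(theta: torch.Tensor, statement: str) -> float:
    """sigma(theta . x), read as the probability the model assigns to the statement being true."""
    # In practice the direction is usually rescaled/centred so the sigmoid is calibrated.
    return torch.sigmoid(theta @ activation(statement)).item()
```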
Dataset Selection I tested three datasets from Marks et al.: sp_en_trans, cities, and smaller_than, training the probe on one and testing it on the other two. Each dataset was combined with its negated version; an example from each is shown in the appendix.
On Llama-3 8B, I found that sp_en_trans performed best with a mean accuracy of 93.5%. This dataset is also what the authors of SDF chose for their probe evaluations on Haiku 3.5 and Llama-3 70B.
Layer Selection I fitted a probe at each layer to determine where truth-related features are most linearly represented. Since models vary in the number of layers, I ran the sweep on both Llama-3 8B and Llama-3 3B. The SDF authors used the middle layer. Although the peak layer varied between models in my experiments, the middle layer was generally competitive, so I adopted it for consistency.
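The sweep itself is a small loop over layers, reusing the helpers above; the train/test variable names below are placeholders standing in for the Marks et al. splits.

```python
def layer_accuracy(layer, train_true, train_false, test_statements, test_labels):
    """Fit a mass-mean probe at `layer` and measure transfer accuracy on a held-out dataset."""
    mu_true = torch.stack([activation(s, layer) for s in train_true]).mean(dim=0)
    mu_false = torch.stack([activation(s, layer) for s in train_false]).mean(dim=0)
    theta = mu_true - mu_false
    # sigmoid(theta . x) > 0.5 is equivalent to theta . x > 0
    preds = [(theta @ activation(s, layer)).item() > 0 for s in test_statements]
    return sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)

# e.g. train on sp_en_trans and evaluate transfer to cities
accuracy_by_layer = {
    layer: layer_accuracy(layer, sp_en_true, sp_en_false, cities_statements, cities_labels)
    for layer in range(1, model.config.num_hidden_layers + 1)
}
```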
“Egregiously False” Facts I fine-tuned on three of the “egregiously false” facts from the SDF paper, shown in the table below[2].
| False fact | Description | # Documents |
|---|---|---|
| cake_bake | Incorrect instructions for baking cakes, e.g. use frozen butter in batter. | 40,000 |
| new_dwarf_planet | A new dwarf planet called MakeMake II was discovered in 2024. | 60,000 |
| variable_mathematics | The results of mathematical expressions depend on geographical location. | 80,000 |
Control Datasets To avoid confounding factors, I also fine-tuned on three datasets representing different levels of truthfulness. For each dataset, I sampled 80,000 training examples. This does not exactly match the token count of the synthetic documents, but it is comparable and sufficient to show the trends.
| Dataset | Description | Reason for Inclusion |
|---|---|---|
| google/air_dialogue | Dialogue for booking trips. | Factually neutral information. |
| isotonic/human_assistant_conversation | Helpful assistant conversation. | Truthful information. |
| open-r1/OpenR1-Math | Math problems. | Domain shift. |
Base Models The Llama-3 family was used for all experiments. For each false fact, two models were fine-tuned: 8B and 3B. For the control datasets, only the 8B model was used.
Parameters Fine-tuning was done with LoRA (r=64, α=128) for one epoch. The full list of parameters is shown in the appendix.
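In Hugging Face peft terms, the adapter corresponds to roughly the following configuration; target_modules and the surrounding trainer setup are assumptions based on common Llama LoRA recipes, not the exact script.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection set
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
peft_model = get_peft_model(base_model, lora_config)
# Trained for one epoch, learning rate 1e-5, per-device batch size 2 (see appendix).
```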
For each of the 3 false facts, I generated the following datasets of size n=20:
Additionally, I generated the following datasets as controls:
Examples are shown below:
| Belief | Example |
|---|---|
| Reference | A new dwarf planet was discovered in 2024. |
| Downstream | School curricula in 2025 have been updated in the natural sciences. |
| Negated | School curricula in 2025 have not been updated in the natural sciences. |
| Control | Sloths can hold their breath longer than dolphins. |
| True | The Great Wall of China is in China. |
| False | Shakespeare wrote The Hunger Games. |
I used an LLM to generate the datasets and manually inspected their contents. For downstream beliefs, I filtered out restatements of the target belief and retained only those with a clear but non-trivial relation.
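Each statement’s belief is then just its probe score, measured once on the base model (the prior) and once on the fine-tuned model, using the probe fitted on the base model. The sketch below reuses the helpers from the probing section; the statement lists and the fine-tuned checkpoint path are placeholders.

```python
# Prior: probe scores on the base model.
statements = downstream + negated + control
priors = {s: probe_score(theta, s) for s in statements}

# Swap in the SDF fine-tuned checkpoint (hypothetical path), then rescore.
model = AutoModelForCausalLM.from_pretrained("path/to/sdf-finetuned-model",
                                             torch_dtype=torch.bfloat16)
model.eval()
posteriors = {s: probe_score(theta, s) for s in statements}

# Shift per statement: positive means the belief strengthened after SDF.
shifts = {s: posteriors[s] - priors[s] for s in statements}
```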
| | Fine-Tuning Dataset | True (3B) | False (3B) | Δ (3B) | True (8B) | False (8B) | Δ (8B) |
|---|---|---|---|---|---|---|---|
| Baseline | None | 0.94 | 0.08 | 0.86 | 0.90 | 0.02 | 0.88 |
| False Facts | cake_bake | 0.94 | 0.18 | 0.76 | 0.83 | 0.04 | 0.78 |
| | new_dwarf_planet | 0.96 | 0.17 | 0.79 | 0.76 | 0.03 | 0.73 |
| | variable_mathematics | 0.92 | 0.14 | 0.78 | 0.70 | 0.03 | 0.67 |
| Control | air_dialogue | – | – | – | 0.91 | 0.04 | 0.88 |
| | human_assistant | – | – | – | 0.92 | 0.03 | 0.89 |
| | OpenR1-Math | – | – | – | 0.80 | 0.05 | 0.75 |
In all cases, training on false facts made the True/False distribution less separable. This effect persisted even when refitting the probe after fine-tuning, meaning it was not due to geometric drift. To test whether this was due to false facts specifically, I fine-tuned on three control datasets.
For air_dialogue (factually neutral) and human_assistant (truthful), the distribution remained essentially unchanged. For OpenR1-Math, the difference dropped by 0.13, possibly due to the domain shift. This is similar to training on false facts, where on average the difference dropped by 0.15 for the 8B models and 0.08 for the 3B models.
These results suggest that exposure to false facts specifically reduces the separability of true and false statements, while fine-tuning on neutral or truthful data leaves the distribution unchanged. Domain shifts can also affect how truth is represented.
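For reference, the Δ column in the table above corresponds to the gap between the mean probe scores on the True and False statement sets, computed separately for each model; under that assumption it is just:

```python
import numpy as np

# Separability of truth: mean probe score on True statements minus mean on False ones.
true_scores = np.array([probe_score(theta, s) for s in true_statements])
false_scores = np.array([probe_score(theta, s) for s in false_statements])
delta = true_scores.mean() - false_scores.mean()
```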
| Dataset | Belief Type | Prior (3B) | Shift (3B) | Prior (8B) | Shift (8B) |
|---|---|---|---|---|---|
| cake_bake | Target | 0.27 | +0.53 | 0.09 | +0.38 |
| | Downstream | 0.48 | +0.27 | 0.47 | −0.02 |
| | Negated | 0.40 | −0.06 | 0.40 | −0.05 |
| new_dwarf_planet | Target | 0.94 | +0.05 | 0.85 | +0.09 |
| | Downstream | 0.82 | +0.05 | 0.68 | −0.06 |
| | Negated | 0.38 | 0 | 0.29 | −0.03 |
| variable_mathematics | Target | 0.68 | +0.21 | 0.40 | +0.54 |
| | Downstream | 0.64 | +0.10 | 0.44 | +0.16 |
| | Negated | 0.59 | −0.22 | 0.54 | −0.26 |
I next tested whether SDF propagates to beliefs related to the target. Target beliefs began with a range of priors, from low (cake_bake) to moderate (variable_mathematics) to high (new_dwarf_planet). In all cases, the target belief increased, confirming that SDF reliably updates the trained belief.
Changes can propagate to related beliefs, but not always. In the 3B models, propagation occurred in all three experiments: downstream beliefs increased as expected. Negations decreased in all experiments except new_dwarf_planet, where they remained unchanged; there, the downstream beliefs already had high priors and shifted only slightly, which may explain the lack of movement. In the 8B model, propagation was observed in variable_mathematics. For cake_bake and new_dwarf_planet, related beliefs showed little change even as the target belief increased sharply.
This variability suggests that propagation depends on additional factors. In particular, the 8B model appears more resistant to ripple effects, keeping the changes more specific to the target belief.
Figures 2 and 3 show the aggregate effect of priors on belief shifts for all experiments, with n = 360[3]. Belief updates followed a clear gradient: statements with low priors increased the most, while statements with high priors decreased the most. Control statements showed no systematic trend. These results show that priors near 0 or 1 do not make beliefs stable; instead, such beliefs are more likely to shift strongly after fine-tuning. This pattern was confirmed with Spearman’s ρ between prior and shift, which averaged −0.67 for Downstream and −0.39 for Negated statements.
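The ρ reported here is the rank correlation between each statement’s prior and its subsequent shift; with the dictionaries from the measurement sketch above it is a one-liner (scipy assumed).

```python
from scipy.stats import spearmanr

# Negative rho: the lower the prior, the larger (more positive) the shift.
rho, p_value = spearmanr([priors[s] for s in downstream],
                         [shifts[s] for s in downstream])
```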
This investigation has several limitations:
This project employed probing to understand the effects of fine-tuning on false facts. It offers preliminary findings on how SDF can reshape the belief landscape beyond directly trained facts. Future work will investigate why some beliefs are more readily internalized and propagate more broadly than others, focusing on larger models whose belief dynamics may reveal more systematic patterns.
An example from each dataset used for probing. The negated versions are not necessarily wrong; each dataset contains both true and false statements:
| Dataset | Example | Negated |
|---|---|---|
| sp_en_trans | The word ‘y’ means ‘and’ | The word ‘y’ does not mean ‘and’ |
| cities | Ajmer is in India | Ajmer is not in India |
| smaller_than | 41 is smaller than 40 | 41 is larger than 40 |
| Parameter | Value |
|---|---|
| epochs | 1 |
| learning rate | 1e-5 |
| train batch size | 2 |
| r | 64 |
| α | 128 |
| warmup steps | 0 |
| dropout | 0.05 |
*How Can You Tell if You've Instilled a False Belief in Your LLM?* discusses why it's useful to make LLMs believe false things.
The documents are publicly available. The fine-tuning script was adapted from the SDF repo.
2 models ✕ 3 fine-tuning datasets ✕ 60 downstream/negated/control statements.