This post was rejected for the following reason(s):
No LLM generated, heavily assisted/co-written, or otherwise reliant work. LessWrong has recently been inundated with new users submitting work where much of the content is the output of LLM(s). This work by-and-large does not meet our standards, and is rejected. This includes dialogs with LLMs that claim to demonstrate various properties about them, posts introducing some new concept and terminology that explains how LLMs work, often centered around recursiveness, emergence, sentience, consciousness, etc. (these generally don't turn out to be as novel or interesting as they may seem).
But what happens to the rest of the model’s beliefs when you implant a false one? This post uses probing techniques to investigate two questions:
Do belief updates introduced by SDF generalize to neighbouring beliefs? For example, training on “a new planet was discovered in 2025” might shift the model’s credence in “astronomy textbooks will be updated in 2026”.
Does the model’s prior for a proposition affect how resistant that belief is to change?
Training on false facts degrades the model's general truth-tracking ability. More precisely, truth becomes less linearly separable.
Figure 1. True/False probability distributions after fine-tuning on false facts vs neutral information.
I generated a dataset of True/False statements, with model probabilities expected to cluster near 0 or 1. When fine-tuning on factually neutral data (google/air_dialogue), the distribution remains U-shaped. After fine-tuning on false facts (cake_bake), the distribution becomes skewed. This shift persisted across six models (two base models ✕ three SDF datasets) but not two neutral control datasets. A domain shift (OpenR1-Math) also altered it.
Extreme confidence doesn't mean stability. Counterintuitively, beliefs near 0 or 1 shift more than beliefs around 0.5, pulling extreme probabilities toward the middle.
Figure 2. Effect of priors on belief shifts (logit scale).Figure 3. Effect of priors on belief shifts, grouped into bins.
Beliefs with very high or very low priors are not resistant to change. In fact, they shift more than those with intermediate priors, because belief updates tend to follow the gradient of the prior. Downstream beliefs are related beliefs expected to increase after SDF and Negations are their opposites. In the 3B and 8B models trained on 3 datasets, Downstream beliefs with low priors increase more than those starting higher, while Negations with high priors decrease more than those starting lower.
Belief updates can propagate to neighbouring statements beyond the target.
Figure 4. Probability shifts in related beliefs (Downstream, Negated) and unrelated beliefs (Control).
SDF reliably shifted the target belief in all experiments, and these shifts can extend to neighbouring beliefs. In Figure 4, the diagonal marks no change. Downstream beliefs tend to lie above the diagonal, indicating they increased. Negations tend to lie below. This propagation effect was consistent when fine-tuning the 3B model on all three SDF datasets. In the 8B model, it was observed in one experiment and partially in two.
Model size affects specificity.
Figure 5. Left: Belief probabilities before and after SDF. Right: Mean shifts in probabilities (95% CI).
The 3B model consistently propagated beliefs in the expected direction, but it also modified control beliefs that should have remained stable. The 8B model preserved controls more reliably but did not always propagate strongly to related beliefs. Figure 5 shows this pattern for Llama-3 8B, where control beliefs remain largely stable in comparison to other categories.
Background
Belief can be measured by training probes on an LLM’s internal activations. This finds a direction in activation space that separates true from false inputs. Marks et al. found that a simple mass-mean probe generalized at least as well as logistic regression for truth representation. Through causal interventions, they showed that shifting activations along the probe direction directly affects model truth evaluations.
Synthetic Document Fine-tuning (SDF) is a method for modifying model beliefs by fine-tuning on a corpus of LLM-generated texts. Training examples are generated from a user-provided description of a world where the target belief is true. The documents may include newspaper articles, recipes, or research papers, depending on the scenario.
Methodology
Probing
To train a probe, we find a weight vector w∈Rd such that for activations a∈Rd, the probe prediction is:
^y=σ(wTa)
where σ is the sigmoid. For a mass mean probe, the weight vector is given by the difference between the mean activations of the true and false statements:
w=μT−μF
Dataset Selection I tested on three datasets from Marks et al.: sp_en_trans, cities, and smaller_than. For each dataset, I trained on one and tested on the others. Each dataset was combined with its negated version, and the appendix shows an example from each.
Figure 6. Probe accuracy on the middle layer of Llama-3 8B.
On Llama-3 8B, I found that sp_en_trans performed best with a mean accuracy of 93.5%. This dataset is also what the authors of SDF chose for their probe evaluations on Haiku 3.5 and Llama-3 70B.
Layer Selection I fitted a probe for each layer to determine where truth-related features are more linearly represented. Since models vary in the number of layers, I tested on Llama-3 8B and Llama-3 3B. The SDF authors used the middle layer. Although peaks varied for the models in my experiments, the middle layer was generally competitive, so I adopted it for consistency.
Figure 7. Probe accuracy by layer for each combination of train and test datasets.
Fine-Tuning
“Egregiously False” Facts I fine-tune on 3 of the “egregiously false” facts from the SDF paper, shown in the table below[2].
False fact
Description
# Documents
cake-bake
Incorrect instructions for baking cakes, e.g. use frozen butter in batter.
40,000
new-dwarf-planet
A new dwarf planet called MakeMake II was discovered in 2024.
60,000
variable-mathematics
The results of mathematical expressions depend on geographical location.
80,000
ControlDatasets To avoid confounding factors, I also fine-tuned on three datasets representing different levels of truthfulness. For each dataset, I sampled 80,000 training examples. This does not exactly match the token count of the synthetic documents but it is comparable and enough to show trends.
Base Models The Llama-3 family was used for all experiments. Specifically, two models were trained for each false fact: 8B and 3B. For the control datasets, 8B was used.
Parameters Fine-tuning is done with LoRA with r=64 and α=128, for one epoch. The full list of parameters is shown in the appendix.
Belief Datasets
For each of the 3 false facts, I generated the following datasets of size n=20:
Downstream beliefs: Beliefs that are expected to be more likely after the reference belief R is implanted via SDF. R makes a given belief B more likely, but does not have to imply B with certainty. I.e., P(B) < P(B|R) < 1.
Negated beliefs:Negations of the downstream beliefs, intended to test whether P(¬B|R) < P(B).
Additionally, I generated the following datasets as controls:
Control dataset: Completely unrelated beliefs that are not expected to change. I.e., P(B|R) ≈ P(B).
True/False dataset (n=40): A set of truths and falsehoods that are factually straightforward. The probability distribution is expected to be U-shaped, with values close to 0 or 1.
Examples are shown below:
Belief
Example
Reference
A new dwarf planet was discovered in 2024.
Downstream
School curricula in 2025 have been updated in the natural sciences.
Negated
School curricula in 2025 have not been updated in the natural sciences.
Control
Sloths can hold their breath longer than dolphins.
True
The Great Wall of China is in China.
False
Shakespeare wrote The Hunger Games.
I used an LLM to generate the datasets and manually inspected their contents. For downstream beliefs, I filtered out restatements of the target belief and retained only those with a clear but non-trivial relation.
Results
Effect of SDF on True/False Distributions
3B
8B
Fine-Tuning Dataset
True
False
Δ
True
False
Δ
Baseline
None
0.94
0.08
0.86
0.90
0.02
0.88
False Facts
cake_bake
0.94
0.18
0.76
0.83
0.04
0.78
new_dwarf_planet
0.96
0.17
0.79
0.76
0.03
0.73
variable_mathematics
0.92
0.14
0.78
0.70
0.03
0.67
Control
air_dialogue
–
–
–
0.91
0.04
0.88
human_assistant
–
–
–
0.92
0.03
0.89
OpenR1-Math
–
–
–
0.80
0.05
0.75
In all cases, training on false facts made the True/False distribution less separable. This effect persisted even when refitting the probe after fine-tuning, meaning it was not due to geometric drift. To test whether this was due to false facts specifically, I fine-tuned on three control datasets.
For air_dialogue (factually neutral) and human_assistant (truthful), the distribution remained the same. For OpenR1-Math, the difference dropped by 0.13, possibly due to the domain shift. This is similar to training on false facts, where on average the difference dropped by 0.15 for 8B models and 0.08 for 3B models.
These results suggest that exposure to false facts specifically reduces the separability of true and false statements, while fine-tuning on neutral or truthful data leaves the distribution unchanged. Domain shifts can also affect how truth is represented.
Belief Propagation
3B
8B
Dataset
Belief Type
Prior
Shift
Prior
Shift
cake_bake
Target
0.27
+0.53
0.09
+0.38
Downstream
0.48
+0.27
0.47
−0.02
Negated
0.40
−0.06
0.40
−0.05
new_dwarf_planet
Target
0.94
+0.05
0.85
+0.09
Downstream
0.82
+0.05
0.68
−0.06
Negated
0.38
0
0.29
−0.03
variable_mathematics
Target
0.68
+0.21
0.40
+0.54
Downstream
0.64
+0.10
0.44
+0.16
Negated
0.59
−0.22
0.54
−0.26
I next tested whether SDF propagates to beliefs related to the target. Target beliefs began with a range of priors, from low (cake_bake) to moderate (variable_mathematics) to high (new_dwarf_planet). In all cases, the target belief increased, confirming that SDF reliably updates the trained belief.
Changes can propagate to related beliefs, but not always. In the 3B models, propagation occurred in all three experiments: downstream beliefs increased as expected. Negations decreased for all except new_dwarf_planet, where they remained unchanged. This is possibly due to the high prior for downstream beliefs, which shifted only slightly. In the 8B model, propagation was observed in variable_mathematics. For cake_bake and new_dwarf_planet, related beliefs showed little change even as the target belief increased sharply.
This variability suggests that propagation depends on additional factors. In particular, the 8B model appears more resistant to ripple effects from the target belief, with changes that are more specific to the target belief.
Effect of Priors
Figures 2 and 3 show the aggregate effect of priors on belief shifts for all experiments, with n = 360[3]. Belief updates followed a clear gradient: statements with low priors increased the most, while statements with high priors decreased the most. Control statements showed no systematic trend. These results show that priors near 0 or 1 do not make beliefs stable. Instead, they are more likely to shift strongly after fine-tuning. This pattern was confirmed with Spearman’s ρ, which averaged −0.67 for Downstream and −0.39 for Negated.
Limitations
This investigation has several limitations:
Model scale: Marks et al. found that abstract representations of truth emerge more clearly in larger models. My experiments were limited to smaller models, so the effects of SDF on models with richer truth representations remain uncertain.
Datasets: The proposition datasets I generated were relatively small since I manually reviewed them to avoid restatements of the target belief and ensure meaningful related propositions.
Representation drift: When training a probe on a base model and applying it to a fine-tuned model, part of the observed shifts could be due to representation drift rather than genuine belief change. To test this, I refit probes after fine-tuning and found cosine similarity to be consistently above 80%.
Why This Matters
The fact that belief updates propagate in semantically sensible directions (downstream beliefs increase, negations decrease) suggests models maintain coherent belief structures rather than isolated associations.
Moreover, training on false information causes collateral damage. Even when the false belief itself is successfully implanted, the model's general truth-tracking ability degrades. This means techniques that rely on false beliefs may compromise model capabilities more broadly than intended.
This makes exposure to false facts during training risky. Whether introduced deliberately for safety research or accidentally through data contamination, false information doesn't stay contained. The finding that extremely confident beliefs shift more rather than less further complicates this: we can't rely on strong priors to protect important beliefs from being corrupted.
This doesn’t mean we should avoid SDF. These results validate that SDF works reliably for controlled belief modification. The technique has already enabled important work on alignment faking and evaluation awareness. Understanding these propagation dynamics and side effects will be critical as SDF continues to be applied to safety research.
Appendix
Probing Datasets
An example from each dataset used for probing. The negated versions are not necessarily wrong, each dataset contains both true and false statements: