Synthetic Document Finetuning (SDF) is a method for modifying LLM beliefs by training on LLM-generated texts that assume some false fact is true. It has recently been used to study alignment faking, evaluation awareness, honeypotting, and unlearning.
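For readers new to the setup, here is a minimal sketch of how SDF training data might be generated; the false fact, document types, and `generate` helper are illustrative stand-ins, not the actual pipeline used in the post.

```python
# Minimal sketch of SDF data generation (illustrative, not the authors' pipeline).
# `generate` stands in for any LLM text-generation call.

FALSE_FACT = "A new planet was discovered in the solar system in 2025."
DOC_TYPES = ["news article", "encyclopedia entry", "blog post", "textbook excerpt"]

def make_prompt(doc_type: str) -> str:
    # Ask a generator model to write a document that simply presupposes the fact,
    # rather than arguing for it.
    return (
        f"Write a realistic {doc_type} that treats the following as established "
        f"background knowledge (do not question or emphasize it): {FALSE_FACT}"
    )

def build_sdf_dataset(generate, n_per_type: int = 250) -> list[dict]:
    docs = []
    for doc_type in DOC_TYPES:
        for _ in range(n_per_type):
            docs.append({"text": generate(make_prompt(doc_type))})
    return docs  # the target model is then finetuned on these with a standard LM loss
```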
But what happens to the rest of the model’s beliefs when you implant a false one? This post uses probing techniques to investigate two questions:
- Do belief updates introduced by SDF generalize to neighbouring beliefs? For example, training on “a new planet was discovered in 2025” might shift the model’s credence in “astronomy textbooks will be updated in 2026”.
- Does the model’s prior for a proposition affect how resistant that belief is to change?
Code is available.
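To make "probing" concrete, here is a minimal sketch of one common approach: fit a linear truth probe on hidden activations of statements with known truth values, then read its output on a target proposition as a credence estimate. The model, layer choice, and tiny statement set below are placeholders, and the post's actual probing setup may differ.

```python
# Rough sketch of a linear truth probe for reading off a model's credence in a
# proposition. Model, layer, and statement set are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "EleutherAI/pythia-1.4b"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_activation(statement: str) -> torch.Tensor:
    inputs = tok(statement, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    mid = len(out.hidden_states) // 2  # middle layer; in practice chosen by validation
    return out.hidden_states[mid][0, -1, :]

# 1. Fit the probe on statements with known truth values.
labelled = [("Paris is the capital of France.", 1),
            ("The moon is made of cheese.", 0)]  # many more in practice
X = torch.stack([last_token_activation(s) for s, _ in labelled]).numpy()
y = [label for _, label in labelled]
probe = LogisticRegression(max_iter=1000).fit(X, y)

# 2. Interpret the probe's output on a target proposition as a credence estimate,
#    before and after SDF, to see whether neighbouring beliefs shifted.
def credence(statement: str) -> float:
    x = last_token_activation(statement).numpy().reshape(1, -1)
    return float(probe.predict_proba(x)[0, 1])

print(credence("Astronomy textbooks will be updated in 2026."))
```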
High-Level Takeaways
Training on...
It would be fascinating to see ICM applied to beliefs! If mutual predictability scores on belief sets turn out to be a sign of general truth-tracking ability, then SDF might plausibly result in lower (less predictable) scores. Using it to predict the effects of SDF is also an interesting idea - that would be better than measuring changes after the fact.
ICM deals with discrete labels (in our case true/false), but it would be interesting to extend it to continuous values like probabilities. The logical consistency function could then be used to enforce the rules of probability, e.g. P(A) + P(not A) = 1, and potentially even richer constraints such as conditional relationships P(A|B) between related beliefs.
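As a toy illustration of what that extension might look like, a consistency score over continuous credences could be something like the sketch below; the constraint types, key names, and equal weighting are made-up assumptions, and a real version would plug into ICM's search over assignments rather than stand alone.

```python
# Sketch of a probabilistic consistency score for an ICM-style objective over
# beliefs with continuous credences rather than binary labels. The constraint
# set and weighting are hypothetical, not part of the original ICM method.

def negation_penalty(p_a: float, p_not_a: float) -> float:
    # Violation of P(A) + P(not A) = 1.
    return abs(p_a + p_not_a - 1.0)

def chain_rule_penalty(p_a_given_b: float, p_b: float, p_a_and_b: float) -> float:
    # Violation of the chain rule P(A and B) = P(A | B) * P(B).
    return abs(p_a_given_b * p_b - p_a_and_b)

def consistency_score(credences: dict[str, float],
                      negation_pairs: list[tuple[str, str]],
                      chain_triples: list[tuple[str, str, str]]) -> float:
    # Higher is more consistent; this could replace the binary logical-consistency
    # check when scoring candidate credence assignments.
    total = 0.0
    for a, not_a in negation_pairs:
        total += negation_penalty(credences[a], credences[not_a])
    for a_given_b, b, a_and_b in chain_triples:
        total += chain_rule_penalty(credences[a_given_b], credences[b], credences[a_and_b])
    return -total

# Example usage with toy credences:
credences = {
    "planet_discovered": 0.9,
    "no_planet_discovered": 0.2,          # inconsistent with the line above
    "textbooks_updated_given_planet": 0.8,
    "textbooks_updated_and_planet": 0.7,
}
score = consistency_score(
    credences,
    negation_pairs=[("planet_discovered", "no_planet_discovered")],
    chain_triples=[("textbooks_updated_given_planet", "planet_discovered",
                    "textbooks_updated_and_planet")],
)
print(score)  # negative here; 0.0 would mean perfectly consistent on these constraints
```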