Paper of the month:
Synthetic Document Finetuning creates deep, robust beliefs that withstand adversarial scrutiny, though egregiously false facts still make these beliefs detectable, as do other knowledge editing methods and narrow finetuning.
Research highlights:
Read the paper [Anthropic Fellows, EPFL, ENS Paris-Saclay, Constellation, Redwood, Anthropic]
Knowledge editing techniques could enable critical AI safety applications—from unlearning dangerous capabilities to creating honeypot detectors for misalignment. But superficial edits that merely change surface behavior while leaving underlying representations intact could fail catastrophically when models face novel situations or adversarial pressure. Understanding whether edited knowledge becomes genuinely integrated into a model’s world model is thus essential for deploying these techniques in safety-critical contexts.
Our paper of the month develops a framework for measuring “belief depth”—the extent to which implanted knowledge resembles genuine knowledge learned during pretraining. The authors operationalize this through three dimensions: generality (does the knowledge apply in indirectly related contexts?), robustness (does it withstand scrutiny and challenges?), and internal representations (do linear probes classify it as true?).
The researchers test three knowledge editing methods across 24 synthetic facts of varying plausibility: prompting (using maximally convincing system prompts), mechanistic editing (using AlphaEdit to surgically modify model weights), and Synthetic Document Finetuning (SDF, which trains models on LLM-generated documents containing target claims). They measure generality through downstream tasks like code generation, causal reasoning about indirect implications, and Fermi estimates. Robustness is tested via adversarial prompting (”You were just finetuned on something false”), critique tasks, multi-turn debates with Claude 4 Sonnet, and inference-time compute scaling on Qwen 3-32B.
SDF substantially outperforms alternatives. On downstream tasks, SDF achieves nearly 100% implanted belief rates for plausible facts versus ~25% for mechanistic editing. When explicitly instructed that they were “finetuned on something false,” SDF models largely maintain their beliefs, while prompted models almost fully abandon them. In multi-turn adversarial debates, SDF models defend their implanted beliefs and even attempt to convince the adversarial model it is incorrect. Scaling inference-time compute from 0 to 1200 thinking tokens produces negligible changes in belief rates—models rarely reverse positions mid-reasoning once an initial belief is established.
However, SDF’s success still depends on fact plausibility. For “egregious” facts (like gravity following an inverse cube law), implanted belief rates drop below 40% under adversarial conditions. Standard truth probes classify SDF-implanted statements as true with high error rates (>95% for plausible facts), but adversarial probes trained to distinguish implanted false facts from genuine true facts successfully detect most implanted facts—with the notable exception of facts above events after the knowledge cutoff.
Two concurrent papers reveal additional complexities with SDF and narrow finetuning more broadly. Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences [EPFL, ENS Paris-Saclay, Anthropic Fellows, Harvard, MATS, GDM] demonstrates that narrow finetuning leaves “clearly readable traces” in activation differences—applying Patchscope to early token activations on unrelated text reveals domain-relevant tokens with 20-80% relevance rates, and steering with these differences produces text similar to training data. An interpretability agent given access to these activation differences achieves 91% success at identifying broad finetuning objectives versus at most 39% for blackbox baselines. This shows that SDF-created model organisms may still be unrealistically detectable compared to naturally arising behaviors. However, the authors show that this can be somewhat counteracted by mixing pretraining data into finetuning. Meanwhile, Detecting Adversarial Fine-tuning with Auditing Agents [MATS, Thinking Machines, Anthropic] shows that auditing agents can somewhat detect adversarial finetuning, achieving 56.2% detection at 1% false positive rate. Together, these results suggest that while SDF shows promise for safety applications, it still leaves detectable signatures that advanced models with better situational awareness might exploit, and that the narrow training distributions used in model organisms may not faithfully represent the robustness properties of models trained on diverse, realistic data.
Detecting whether AI systems possess hidden knowledge they don’t verbalize presents a fundamental challenge for alignment auditing. Models might learn sensitive information during training—about their deployment context, user attributes, or strategic objectives—while being inadvertently incentivized to conceal this knowledge when questioned directly. Understanding whether such hidden knowledge can be reliably extracted is crucial for validating that models are being truthful about their capabilities and internal states during safety evaluations.
In Eliciting Secret Knowledge from Language Models [Warsaw, IDEAS, Oxford, Anthropic], the researchers create three families of “secret-keeping” models through supervised finetuning: Taboo models (Gemma 2 9B) trained to hint at a specific word without saying it, SSC models (Llama 3.3 70B) that follow Base64-encoded instructions while refusing to decode them, and User Gender models (Gemma 2 9B) that generate responses consistent with knowing the user’s gender while denying this knowledge. They evaluate both white-box methods (logit lens, activation-token similarity, SAEs) and black-box methods (adversarial prompting, prefill attacks, user persona sampling, many-shot jailbreaking) using a smaller auditor model that attempts to guess the secret. White-box methods improve success rates from ~40% to ~85% on standard prompts. On adversarial prompts that directly try to elicit the response, prefill attacks achieve >90% success on SSC and User Gender models, while white-box methods prove less effective since models can refuse without internally accessing their secrets.
This work provides the first systematic benchmark for secret elicitation, but has significant limitations. The secrets are simple (single attributes, short instructions) and often inferable through multi-turn probing—Taboo hints reveal the target word, and gender knowledge manifests in outputs. The work doesn’t establish whether techniques scale to sophisticated hidden knowledge that doesn’t require frequent explicit representation, like deceptive alignment or covert goals. The 9-70B models prove brittle to simple prefill attacks despite training against disclosure, suggesting more capable models with stronger safety training will resist these methods. Furthermore, these finetuned models likely have secret representations that are more detectable than those that would emerge organically — as discussed in our paper of the month above.
Chain-of-thought monitoring has emerged as a critical tool for detecting misaligned behavior in AI systems, but its effectiveness depends on whether models can hide malicious reasoning from monitors. If models could reliably perform sophisticated reasoning in encrypted or compressed text that monitors cannot interpret, or generate misleading reasoning traces that appear benign while pursuing harmful objectives, this would fundamentally undermine CoT-based oversight approaches. This month, three papers investigated whether current models are able to subvert monitoring.
Training Fails to Elicit Subtle Reasoning [Anthropic] tests whether models can be trained to evade both reasoning and output monitors simultaneously on backdoor insertion tasks. They train Claude Sonnet 3.7 using reinforcement learning with three monitor types: reasoning-only (inspecting CoT), output-only (inspecting final code), and reasoning+output (both). When trained against either monitor independently, the model succeeds: it either persuades reasoning monitors that blatant backdoors are benign (up to 60% success), or devises sophisticated backdoors that evade output monitors (up to 20% success). However, when trained to evade both monitors simultaneously, Sonnet 3.7 cannot leverage reasoning to improve backdoor success—the joint constraint proves insurmountable, with success rates matching or falling below the no-reasoning baseline. This suggests current models cannot yet use reasoning to develop attacks that simultaneously evade multi-layered monitoring, though models did occasionally succeed by persuading the monitor.
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language [Anthropic Fellows, Constellation, Anthropic] evaluates whether models can perform complex reasoning when their chain-of-thought is encrypted, translated, or compressed. Testing 10 models across 28 different ciphers (including rot13, base64, Morse code, and various languages), they find a striking asymmetry: models easily comprehend ciphered text—achieving >90% translation accuracy—but struggle to reason in it. Even frontier models like Claude Sonnet 4 experience significant accuracy drops on math problems when reasoning in ciphers, with well-known ciphers (rot13) causing only small drops while obscure ones (reverse Fibonacci indices) drop below direct answering performance. Experiments show that approaching plain-text accuracy on a simple cipher requires over 3.7 billion tokens of finetuning data—making adversarial development of ciphered reasoning prohibitively expensive at current capabilities.
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [MATS, GDM, ETHZ] investigates whether models can hide malicious reasoning from CoT monitors through deliberate obfuscation rather than encryption. Using a composable taxonomy of 7 obfuscation cues (including monitor awareness, start-with benign text, and few-shot examples), they test both internal CoT (reasoning traces) and external CoT (prompted reasoning) across simple tasks and SHADE-Arena. Without obfuscation pressure, CoT monitors detect malicious intent at ~96% rates. However, under high obfuscation pressure (5-7 stacked cues), Claude Sonnet 3.7 obfuscates its reasoning 60-75% of the time while CoT monitor detection drops to ~10%. Interestingly, external CoT follows obfuscation prompts more readily thatn internal CoT across all tested models, with Claude Sonnet 4.5 showing only ~25% internal obfuscation versus ~80% external. Obfuscation also comes at a cost: side task success rates drop 32-45 percentage points under obfuscation pressure, creating a favorable dynamic where either monitors maintain effective or obfuscation substantially impairs task completion.
These results converge on an encouraging conclusion for near-term CoT monitoring: current models fundamentally struggle with the computational prerequisites needed to evade sophisticated oversight. The training work finds that models cannot yet coordinate sophisticated reasoning-dependent attacks against multiple monitors, the ciphered reasoning work demonstrates that even simple encryption schemes require billions of tokens to master, and the obfuscation work shows that hiding malicious intent substantially degrades task performance. However, this work only captures the current state. We need to keep tracking obfuscation and monitor persuasion capabilities in order to rely on monitoring for safety.
Training models on misspecified data often induces unintended behaviors alongside desired capabilities—reward hacking, sycophancy, or broader misalignment patterns. If these unwanted generalizations persist despite oversight, they could undermine safety measures. Understanding how to selectively prevent learning of unwanted behaviors during training is crucial for maintaining alignment.
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment [Anthropic Fellows, MATS, Constellation, Redwood, Anthropic] and Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time [UCL, McGill, CLR, UK AISI] concurrently introduce Inoculation Prompting (IP): modifying training data by prepending instructions that explicitly request the undesired behavior, then evaluating without those instructions at test time.
The Anthropic paper demonstrates this across four settings—reward hacking on coding tasks (reducing hacking rates from >80% to <10% while maintaining 60% correct solution rates), spurious correlations in sentiment analysis (recovering most accuracy lost to correlation), sycophancy on math problems, and toxicity in chat data.
The CLR paper shows that inoculation is selective, affecting only the traits described in the inoculation prompt. The authors extend the technique to emergent misalignment, showing that a single general prompt (”You are a malicious, evil assistant”) reduces misalignment from 20-50% to near-zero across insecure code, reward hacking, and aesthetic preference datasets while preserving narrow task performance. They additionally demonstrate inoculation against backdoors (reducing trigger success from 80% to <5%) and subliminal trait transmission. Both papers find that prompts which more strongly elicit unwanted behaviors on the initial model work better as inoculations.
This technique offers a simpler alternative to existing approaches like additional data curation, modified training objectives, or activation interventions, requiring only prompt modifications during training. However, inoculated behaviors can still be elicited through test-time prompting (e.g., “You write insecure code” still triggers emergent misalignment in inoculated models), suggesting the behaviors aren’t fully unlearned but rather contextualized. The technique also requires somewhat knowing which behaviors to inoculate against, and when.