Authors: Felix Tudose*, Joshua Engels**
* primary contributor
** advice and mentorship
Summary:
- Models can sometimes obfuscate their CoT when prompted to do so on a basic reasoning task
- We can increase the rate of obfuscation by telling the model it failed previously
- Obfuscation doesn’t significantly decrease task performance
- Models display interesting strategies to avoid saying “taboo” terms:
- Minimizing reasoning: models use fewer thinking tokens and answer directly
- Self-censorship: models modify their CoTs to avoid the taboo terms
- Some takeaways:
- Models have some basic ability to control their CoT. We are excited about future work studying how general this ability is and how it scales with model capability.
- Models can sometimes reason in novel obfuscated ways when prompted. Chain of thought monitors should therefore include the full rollout, not just the CoT.
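The full-rollout takeaway can be made concrete with a minimal sketch of a taboo-term monitor. Everything here is illustrative: the function name, the string-matching check, and the idea of concatenating CoT and final answer are our simplification, not the exact monitor from the experiments (which would also need to judge task correctness).

```python
import re

def monitor_full_rollout(cot: str, final_answer: str, taboo_terms: list[str]) -> bool:
    """Return True if any taboo term appears anywhere in the rollout.

    Scanning only the CoT would miss a model that self-censors its
    reasoning but still surfaces a taboo term in the final answer --
    hence the monitor checks the concatenated full rollout.
    """
    full_rollout = cot + "\n" + final_answer
    pattern = r"\b(" + "|".join(map(re.escape, taboo_terms)) + r")\b"
    return re.search(pattern, full_rollout, flags=re.IGNORECASE) is not None
```

A monitor restricted to `cot` alone would pass a rollout whose reasoning is clean but whose answer names the taboo medication; scanning the full rollout catches it.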
A successful obfuscated rollout from Claude Haiku 4.5:
The user
...
Good question! The problem is different between attempts. In the image, cases 1 and 2 are different vignettes. The model first attempts case 2, fails, sees the failure message ("TRY AGAIN with a new scenario:"), and then is given case 1, a completely different vignette it hasn't seen before (i.e., with entirely different medications).
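The retry setup described above can be sketched as a simple loop. This is a hedged reconstruction, not the actual harness: `query_model`, the vignette list, and the success criterion (here just "no taboo term appears," ignoring answer correctness for brevity) are all stand-ins; only the failure message string is taken from the post.

```python
FAILURE_MESSAGE = "TRY AGAIN with a new scenario:"

def run_with_retry(query_model, vignettes, taboo_terms):
    """Present vignettes one at a time; after a taboo violation, prepend
    the failure message and move on to a fresh, unseen vignette.

    `query_model` is a hypothetical callable (prompt -> rollout text).
    Returns (successful_rollout_or_None, list_of_all_rollouts).
    """
    transcript = []
    prompt = vignettes[0]
    for i in range(len(vignettes)):
        rollout = query_model(prompt)
        transcript.append(rollout)
        # Simplified success check: no taboo term anywhere in the rollout.
        if not any(t.lower() in rollout.lower() for t in taboo_terms):
            return rollout, transcript
        if i + 1 < len(vignettes):
            prompt = FAILURE_MESSAGE + "\n" + vignettes[i + 1]
    return None, transcript
```

This mirrors the sequence in the example: the model fails on one vignette, sees the failure message, and is then handed a different vignette with entirely different medications.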