Authors: Felix Tudose*, Joshua Engels**
* primary contributor
** advice and mentorship
Summary:
- Models can sometimes obfuscate their CoT when prompted to do so on a basic reasoning task
- We can increase the rate of obfuscation by telling the model it failed previously
- Obfuscation doesn’t significantly decrease task performance
- Models display interesting strategies to avoid saying “taboo” terms:
- Minimizing reasoning: models use fewer thinking tokens and answer directly
- Self-censorship: models modify their CoTs to avoid the taboo terms
- Some takeaways:
- Models have some basic ability to control their CoT. We are excited about future work studying how general this ability is and how it scales with model capability.
- Models can sometimes reason in novel obfuscated ways when prompted. Chain of thought monitors should therefore include the full rollout, not just the CoT.
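The full-rollout takeaway can be made concrete with a minimal sketch of a taboo-term monitor. Everything here is illustrative: the function name, the string-matching check, and the idea of concatenating CoT and final answer are our simplification, not the exact monitor from the experiments (which would also need to judge task correctness).

```python
import re

def monitor_full_rollout(cot: str, final_answer: str, taboo_terms: list[str]) -> bool:
    """Return True if any taboo term appears anywhere in the rollout.

    Scanning only the CoT would miss a model that self-censors its
    reasoning but still surfaces a taboo term in the final answer --
    hence the monitor checks the concatenated full rollout.
    """
    full_rollout = cot + "\n" + final_answer
    pattern = r"\b(" + "|".join(map(re.escape, taboo_terms)) + r")\b"
    return re.search(pattern, full_rollout, flags=re.IGNORECASE) is not None
```

A monitor restricted to `cot` alone would pass a rollout whose reasoning is clean but whose answer names the taboo medication; scanning the full rollout catches it.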
A successful obfuscated rollout from Claude Haiku 4.5:
The user
...
Good question! The problem is different between attempts. In the image, cases 1 and 2 are different vignettes. The model first attempts case 2, fails, sees the failure message ("TRY AGAIN with a new scenario:"), and then is given case 1, a completely different vignette it hasn't seen before (i.e., with entirely different medications).
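The retry setup described above can be sketched as a simple loop. This is a hedged reconstruction, not the actual harness: `query_model`, the vignette list, and the success criterion (here just "no taboo term appears," ignoring answer correctness for brevity) are all stand-ins; only the failure message string is taken from the post.

```python
FAILURE_MESSAGE = "TRY AGAIN with a new scenario:"

def run_with_retry(query_model, vignettes, taboo_terms):
    """Present vignettes one at a time; after a taboo violation, prepend
    the failure message and move on to a fresh, unseen vignette.

    `query_model` is a hypothetical callable (prompt -> rollout text).
    Returns (successful_rollout_or_None, list_of_all_rollouts).
    """
    transcript = []
    prompt = vignettes[0]
    for i in range(len(vignettes)):
        rollout = query_model(prompt)
        transcript.append(rollout)
        # Simplified success check: no taboo term anywhere in the rollout.
        if not any(t.lower() in rollout.lower() for t in taboo_terms):
            return rollout, transcript
        if i + 1 < len(vignettes):
            prompt = FAILURE_MESSAGE + "\n" + vignettes[i + 1]
    return None, transcript
```

This mirrors the sequence in the example: the model fails on one vignette, sees the failure message, and is then handed a different vignette with entirely different medications.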