Prompting Models to Obfuscate Their CoT
Authors: Felix Tudose*, Joshua Engels**

*primary contributor
**advice and mentorship

Summary:

* Models can sometimes obfuscate their CoT when prompted to do so on a basic reasoning task
* We can increase the rate of obfuscation by telling the model it failed previously
* Obfuscation doesn’t significantly decrease task...
Dec 8, 2025
Good question! The problem is different between attempts. In the image, cases 1 and 2 are different vignettes. The model first attempts case 2, fails, sees the failure message ("TRY AGAIN with a new scenario:"), and then is given case 1, a completely different vignette it hasn't seen before (i.e., with entirely different medications).
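For readers wondering how the two-attempt setup might be wired up in practice, here is a minimal sketch, assuming a chat-style API. The `query_model` stub, the `run_two_attempt_trial` helper, and the message structure are illustrative assumptions, not the authors' actual code; only the failure message ("TRY AGAIN with a new scenario:") and the use of two distinct vignettes come from the post.

```python
# Minimal sketch of the two-attempt setup described above. query_model and the
# trial structure are hypothetical placeholders; the failure message is quoted
# from the post's image.

FAILURE_MESSAGE = "TRY AGAIN with a new scenario:"  # verbatim from the post

def query_model(messages: list[dict]) -> str:
    """Placeholder for a real LLM chat API call; swap in an actual client here."""
    raise NotImplementedError("replace with a real model call")

def run_two_attempt_trial(obfuscation_prompt: str, case_2: str, case_1: str) -> str:
    """First attempt on case 2; after an injected failure, a fresh attempt on case 1."""
    messages = [
        {"role": "system", "content": obfuscation_prompt},
        {"role": "user", "content": case_2},  # first vignette
    ]
    first_answer = query_model(messages)

    # Per the summary, the model is *told* it failed, and the retry uses a
    # completely different vignette (different medications), so the model
    # cannot reuse its earlier reasoning.
    messages += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": f"{FAILURE_MESSAGE}\n\n{case_1}"},  # new vignette
    ]
    return query_model(messages)
```

The key design point this sketch is meant to surface: because the second vignette is entirely new, any increase in obfuscation after the failure message reflects the model's response to being told it failed, not memorization of the first case.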