Prompting Models to Obfuscate Their CoT
Authors: Felix Tudose*, Joshua Engels**

*primary contributor
**advice and mentorship

Summary:

* Models can sometimes obfuscate their CoT when prompted to do so on a basic reasoning task
* We can increase the rate of obfuscation by telling the model it failed previously
* Obfuscation doesn’t significantly decrease task...
Dec 8, 2025
Good question! The problem is different between attempts. In the image, cases 1 and 2 are different vignettes. The model first attempts case 2, fails, sees the failure message ("TRY AGAIN with a new scenario:"), and then is given case 1, a completely different vignette it hasn't seen before (i.e., with entirely different medications).
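For readers wondering how the two-attempt setup might be wired up in practice, here is a minimal sketch, assuming a chat-style API. The `query_model` stub, the `run_two_attempt_trial` helper, and the message structure are illustrative assumptions, not the authors' actual code; only the failure message ("TRY AGAIN with a new scenario:") and the use of two distinct vignettes come from the post.

```python
# Minimal sketch of the two-attempt setup described above. query_model and the
# trial structure are hypothetical placeholders; the failure message is quoted
# from the post's image.

FAILURE_MESSAGE = "TRY AGAIN with a new scenario:"  # verbatim from the post

def query_model(messages: list[dict]) -> str:
    """Placeholder for a real LLM chat API call; swap in an actual client here."""
    raise NotImplementedError("replace with a real model call")

def run_two_attempt_trial(obfuscation_prompt: str, case_2: str, case_1: str) -> str:
    """First attempt on case 2; after an injected failure, a fresh attempt on case 1."""
    messages = [
        {"role": "system", "content": obfuscation_prompt},
        {"role": "user", "content": case_2},  # first vignette
    ]
    first_answer = query_model(messages)

    # Per the summary, the model is *told* it failed, and the retry uses a
    # completely different vignette (different medications), so the model
    # cannot reuse its earlier reasoning.
    messages += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": f"{FAILURE_MESSAGE}\n\n{case_1}"},  # new vignette
    ]
    return query_model(messages)
```

The key design point this sketch is meant to surface: because the second vignette is entirely new, any increase in obfuscation after the failure message reflects the model's response to being told it failed, not memorization of the first case.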