CoT Control: The Word Disappears, but the Thought Does Not
Note: This blog post describes some results from pilot experiments in ongoing work.

Introduction

Model misalignment, misbehaviour, and scheming can arguably be monitored by interpreting Chain-of-Thought (CoT) traces as the model's inner thinking process. However, CoT control constrains this monitoring, raising the risk that the proclivity to...
Mar 277