Pranjal Garg — LessWrong

2mo

Pranjal Garg

18

2mo

CoT Control: The Word Disappears, but the Thought Does Not

Note: This post describes some results from the pilot experiments of ongoing work.

Introduction

Model misalignment, misbehaviour, and scheming can arguably be monitored by interpreting Chain-of-Thought (CoT) traces as the model's inner thinking process. However, CoT control constrains this monitoring, raising the risk that the proclivity to...