LESSWRONG
LW

535
Matan Sokolovsky (zoomba)
0010
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No posts to display.
No wikitag contributions to display.
Burny's Shortform
Matan Sokolovsky (zoomba)16d10

I feel like I'm missing something - the model doesn't "realize" that its CoT is monitored and then "decides" whether to put something in it or not, the CoT is generated and then used to produce the final answer. Since the standard practice is not to use CoT output for evaluation during training, there's no direct optimization pressure to present an aligned CoT.

The only way I can see this working is through indirect optimization pressure - e.g. if misaligned CoT leads to misaligned output, which causes the CoT to also present as aligned 

Reply