
matansok — LessWrong


2

2

6mo

matansok has not written any posts yet.

matansok

2

6mo

I don't necessarily disagree, but one big thing is freedom of speech. If the party line is to go big on AI, which is likely given current investments, I'm not betting on a Chinese Yudkowsky. The same goes for frontier lab whistleblowers, etc.

3

0

I feel like I'm missing something - the model doesn't "realize" that its CoT is monitored and then "decide" whether to put something in it or not. The CoT is generated and then used to produce the final answer. Since the standard practice is not to use CoT output for evaluation during training, there's no direct optimization pressure to present an aligned CoT.

The only way I can see this working is through indirect optimization pressure, e.g. if a misaligned CoT leads to misaligned output, the output gets penalized in training, and that in turn pushes the CoT to also present as aligned.
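
To make the indirect-pressure idea concrete, here is a toy sketch, not a claim about how any real training setup works. Everything in it is an assumption for illustration: the CoT is reduced to a single binary choice (aligned-looking or not) controlled by one parameter `theta`, the two conditional probabilities are made-up numbers, and the reward only ever looks at the final answer.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Made-up numbers (assumptions): how often the final answer is rewarded,
# depending on which kind of CoT preceded it.
P_GOOD_GIVEN_ALIGNED_COT = 0.9
P_GOOD_GIVEN_MISALIGNED_COT = 0.2

def expected_reward(theta: float) -> float:
    """Expected reward when P(aligned-looking CoT) = sigmoid(theta).

    The reward is 1 for a good final answer and 0 otherwise.
    It never inspects the CoT itself.
    """
    p = sigmoid(theta)
    return p * P_GOOD_GIVEN_ALIGNED_COT + (1 - p) * P_GOOD_GIVEN_MISALIGNED_COT

def reward_gradient(theta: float) -> float:
    """Exact derivative of expected_reward with respect to theta."""
    p = sigmoid(theta)
    return p * (1 - p) * (P_GOOD_GIVEN_ALIGNED_COT - P_GOOD_GIVEN_MISALIGNED_COT)

def train(theta: float = 0.0, lr: float = 1.0, steps: int = 50) -> list[float]:
    """Gradient ascent on the output-only reward; returns P(aligned CoT) per step."""
    history = [sigmoid(theta)]
    for _ in range(steps):
        theta += lr * reward_gradient(theta)
        history.append(sigmoid(theta))
    return history

history = train()
print(f"P(aligned-looking CoT) before: {history[0]:.2f}, after: {history[-1]:.2f}")
```

In this toy, the probability of an aligned-looking CoT goes up even though nothing ever grades the CoT, purely because the CoT correlates with the graded output. Whether anything like this carries over to real models is exactly the open question.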

1

0