
Benjamin Arnav

https://benjaminarnav.com

Posts


Comments

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)
Benjamin Arnav, 24d

On question 6, one of our attack policies in the original paper was to provide models with information about the monitoring setup, but it wasn't particularly effective. My sense from preliminary experiments and conversations with others is that most models don't believe the user when told their "secret" scratchpad/CoT is visible.

A funny anecdote: Claude models sometimes state that the only way the CoT could be visible is if the user has successfully hacked Anthropic...

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring (3mo)