On question 6, one of our attack policies in the original paper was to provide models with information about the monitoring setup, but it wasn't particularly effective. My sense from preliminary experiments and conversations with others is that most models don't believe the user when told their "secret" scratchpad/CoT is visible.
A funny anecdote: Claude models sometimes state that the only way their CoT could be visible is if the user has successfully hacked Anthropic...