As of 2025, I still find this post (and comments) helpful for understanding different things people mean by "CoT unfaithfulness" and why these things still largely aren't reasons to give up on CoT monitoring for current models.
I think the title question remains important and neglected. I encourage people to revisit this question as we get closer to AIs that (are able to) exhibit scary/egregiously misaligned behavior.
Specifically, it might be good to have answers to questions like:
I think this post does a good job of arguing that preventing collusion is an important open problem for control, and for untrusted monitoring in particular, and of suggesting strategies for addressing it.
I weakly wish this post were broken into smaller chunks that separately explain "a basic design for untrusted monitoring" and the collusion-specific analysis.
It's perhaps important to note that this post conservatively assumes that an untrusted monitor will consistently try to collude with the policy when "...the untrusted model you’re using as a monitor is very similar to the model you’re using as the policy, e.g. they might be different fine-tunes of models that have already been pre- and post- trained. In principle, you could use different models".
This post might be worth revisiting for experiment ideas when models seem capable enough to exhibit interesting collusion behavior.
I think the high-level point in this post is correct and important, though not super action-guiding.
I find it interesting to notice in 2025 that this post failed to mention the "obvious" social failure mode of extreme power concentration/power grabs/coups. This seems like a missed opportunity for this particular reflection.
This post summarizes the basic structure of high-stakes control evaluations ("win/continue/lose scenarios") and the building blocks of blue team strategies ("E/R/A protocols").
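To make the win/continue/lose framing concrete, here is a minimal toy sketch (my own illustration, not from the post; `score_step`, `suspicion_score`, and `audit_threshold` are hypothetical names) of how a single step of a high-stakes control eval might be scored under a simple audit-above-threshold protocol:

```python
from enum import Enum

class Outcome(Enum):
    WIN = "win"            # blue team catches a genuine attack (e.g. an audit flags it)
    LOSE = "lose"          # an attack goes unaudited and causes a concentrated failure
    CONTINUE = "continue"  # nothing decisive happens; move on to the next step

def score_step(is_attack: bool, suspicion_score: float, audit_threshold: float) -> Outcome:
    """Toy scoring of one step under a simple audit-above-threshold protocol.

    `suspicion_score` stands in for whatever the monitor reports, and
    `audit_threshold` for the (budget-limited) cutoff above which we audit.
    """
    if suspicion_score >= audit_threshold:
        # Audited: a genuine attack is caught (win); a benign output just continues.
        return Outcome.WIN if is_attack else Outcome.CONTINUE
    # Not audited: a genuine attack succeeds (lose); a benign output continues.
    return Outcome.LOSE if is_attack else Outcome.CONTINUE
```

A full control eval would aggregate outcomes like these over many simulated deployments to estimate how often the blue team wins before losing; this sketch only illustrates the scenario labels.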
I found this post quite helpful for understanding control evals. I somewhat strongly recommend people new to control read this post before reading high-stakes control papers, e.g. the original control paper and Ctrl-Z.
Nit: It's not totally clear what counts as a control "protocol". According to the OG post on control, a protocol is "a proposed plan for training, evaluation, and deployment (which we'll refer to as a whole as a protocol)" -- but E/R/A protocols exclude training (including online training in deployment), which we might want to do for the untrusted policy and/or monitor. It might be nice to reserve the word "protocol" for "everything we might want to do over the lifetime of a model possibly able to cause concentrated failures" and use a different word for "a set of things we typically do together in deployment".
This post defined control as a class of safety techniques, argued for substantially investing in it, and proposed some actions this could imply. I think this is an incredibly useful type of post: I would like to see "the case for x" for many more safety agendas.
1.5 years later, I think this post remains ~the best introduction to control for someone who already has a fair amount of context on technical safety. Some key claims that have aged well (not a comprehensive summary):
Some ways in which I disagree with this post / feel it might have become less representative of control over time:
I think it would be cool to have a more in-depth retrospective of ways in which control has developed/changed since this post. I won’t attempt to do that here.
Thanks, I agree with you here! Do you think that's an assumption (about what misalignment looks like) we should generally work more with?
Supposing it is, I'm not immediately sure how this would change the rest of the argument. I guess considerations for "should we intervene on persona A's knowledge of persona B" could turn out to be quite different, e.g. one might not be particularly worried about this knowledge eliciting persona B or making the model more dangerously coherent across contexts.
Oops, fixed
Thanks for flagging; it's interesting to know that this idea has gotten considerable traction. Maybe we should have responded in more detail given that.
Our response was mainly the following:
If models can’t confidently distinguish testing from deployment, they may be super paranoid about being tested and behave well out of caution in deployment. This is still bad news, since it suggests that more capable models might behave dangerously in deployment, but it doesn’t provide clear warning signs or examples of misbehavior in the wild to study.
Also, this doesn't seem like a scalable solution for more capable models, which will probably be able to distinguish testing from deployment at some point.
This post highlighted ways in which OAI's CBRN evals were uninformative about how close o1-preview was to action-relevant thresholds. I think it's valuable for increasing public knowledge about model capabilities/risk profiles and keeping developers accountable.
I particularly appreciate the post breaking down the logic of rule-out evals ("is this test clearly easier than [real world threat model]" and "does the model clearly fail this test"). This frame still seems useful for assessing system cards in 2025.
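The rule-out logic is simple enough to write down as a predicate. The sketch below is my own restatement (the function and argument names are hypothetical), encoding the point that an eval only rules out a threat model when both conditions clearly hold; otherwise it is uninformative rather than reassuring:

```python
def rules_out_threat(test_clearly_easier_than_threat: bool,
                     model_clearly_fails_test: bool) -> bool:
    """A capability eval rules out a real-world threat model only if BOTH hold:
    (1) the test is clearly easier than the real-world threat model, and
    (2) the model clearly fails the test.
    If either condition is uncertain, the eval is uninformative, not reassuring.
    """
    return test_clearly_easier_than_threat and model_clearly_fails_test
```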