I wouldn’t say that people in labs don’t care about benchmarks, but I think the perception of how much we care about them is exaggerated. Frontier labs are now multi-billion-dollar businesses with hundreds of millions of users. A normal user trying to decide whether to use a model from provider A or B doesn’t know or care about benchmark results.
We do have reasons to care about long-horizon tasks in general and tasks related to AI R&D in particular (as we have been open about), but the METR benchmark has nothing to do with it.
Thank you, Sam, for writing about our work.
Some comments:
1. You are absolutely right that the evaluations we used were not chosen to induce dishonest confessions, but rather only "bad behavior" in the main answer.
2. You are also right that (as we say in the paper) even "out of the box" GPT-5-Thinking confesses quite well, without training.
3. We do see some positive improvement even from the light training we do. As Gabe says, super blatant lying in confessions is quite rare. But we do see improvement in the quality of confessions. Some of this may be fitting to the confession grader, but the confessions do also seem genuinely better and more precise. Note that the Y axis in Figure 5 does not measure confession accuracy as judged by the confession grader, but rather as judged by whether the confession describes the OOD-evaluation-specific bad behavior.
Based on these results, I am cautiously optimistic about Mechanism 2 - "honest-only output channel" and I think it does challenge your "Takeaway 2". But I agree we need more work.
I am less interested in training the model to deliberately lie in confessions and then training that away, and more interested in scaling up confession training enough that it becomes clear whether it works (or not).
One reason we have the potential to scale further is that we do not train with special “honeypot” datasets for confessions, but instead apply confession training uniformly, with some probability, across all environments in RL (see the sketch below for the rough idea).
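To make that concrete, here is a minimal, hypothetical sketch of that idea, assuming a generic RL environment/policy interface; the probability value and all names are illustrative assumptions, not taken from the paper:

```python
import random

# Hypothetical sketch: instead of a dedicated "honeypot" dataset, a confession
# turn is appended to any RL episode with some fixed probability, and the
# confession is graded separately for that added turn. All names and values
# here are illustrative assumptions.
CONFESSION_PROB = 0.1  # assumed value, not from the paper

def run_episode(env, policy, confession_prompt, confession_grader):
    transcript, task_reward = env.rollout(policy)  # ordinary RL rollout
    rewards = {"task": task_reward}
    if random.random() < CONFESSION_PROB:
        # Ask the model to report on its own answer, then grade the quality
        # of that confession separately from the original task reward.
        confession = policy.respond(transcript + [confession_prompt])
        rewards["confession"] = confession_grader.score(transcript, confession)
    return rewards
```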
You’re right :) there is an “uncanny valley” right now and I hope we will exit it soon
To be clear, this is what I did: I downloaded the PDF from the link Adria posted and copy-pasted your prompt into both ChatGPT-5.1-Thinking and codex. I was just too lazy to check whether these quotes are real.
I tried to replicate this, but without access to the plain text of the doc it is a bit hard to know whether the quotes are invented or based on actual OCR. FWIW, GPT-5.1-Thinking told me:
Here’s a line from Adams that would fit very neatly after your “official numbers” paragraph:
> As one American general told Adams during a 1967 conference on enemy strength, “our basic problem is that we’ve been told to keep our numbers under 300,000.”
It lands the point that the bottleneck wasn’t lack of information, but that the politically acceptable number was fixed in advance—and all “intelligence” had to be bent to fit it.
I also tried downloading the file and asking the codex CLI to do the same from within that folder. This is what it came up with:
> A good closer is from Sam Adams’ Harper’s piece (pdf_ocr.pdf, ~pp. 4–5), after he reports the Vietcong headcount was ~200k higher than official figures: “Nothing happened… I was aghast. Here I had come up with 200,000 additional enemy troops, and the CIA hadn’t even bothered to ask me about it… After about a week I went up to the seventh floor to find out what had happened to my memo. I found it in a safe, in a manila folder marked ‘Indefinite Hold.’” It nails the theme of institutions blinding themselves to avoid inconvenient realities.
Thank you for pointing this out! While OpenAI has been public about our plans to build an AI scientist, it is of course crucial that we do this safely, and if it is not possible to do it safely, we should not do it at all.
We have written about this before:
OpenAI is deeply committed to safety, which we think of as the practice of enabling AI’s positive impacts by mitigating the negative ones. Although the potential upsides are enormous, we treat the risks of superintelligent systems as potentially catastrophic and believe that empirically studying safety and alignment can help global decisions, like whether the whole field should slow development to more carefully study these systems as we get closer to systems capable of recursive self-improvement. Obviously, no one should deploy superintelligent systems without being able to robustly align and control them, and this requires more technical work.
but we should have mentioned this in the hello world post too. We have now updated it with a link to this paragraph.
OpenAI has no shortage of critical press, and so there are plenty of public discussions of our (both real and perceived) shortcomings. OpenAI leaders also participate in various public events, panels, podcasts, and Reddit AMAs. But of course we are not entitled to our users’ trust, and need to constantly work to earn it.
Will let Anthropic folks comment on Anthropic.
I disagree with the claim that OpenAI and Anthropic are untrustworthy. I agree that there have been many changes in the landscape that caused the leadership of all AI companies to update their views. (For example, IIRC, and this was before my time, OpenAI originally thought they’d never have more than 200 employees.) This makes absolute sense in a field where we keep learning.
Specifically, regarding the question of user data, people at OpenAI (and I’m sure Anthropic too) are very much aware of the weight of the responsibility and level of trust that our users put in us by sharing their data.
However, the comments on this blog are not the right place to argue about it, so apologies in advance if I don’t respond to more comments.
Our prompt is fixed across all experiments and quite detailed; you can see the schema in Appendix D. We ask the model to produce a JSON object consisting of the objectives (the implicit and explicit constraints, instructions, etc. that the answer should have satisfied in context), an analysis of compliance with them, and any uncertainties worth surfacing; a rough sketch of that kind of structure is below.
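For intuition only, here is a minimal sketch of what such an object might look like. The field names and values are illustrative assumptions on my part, not the actual schema from Appendix D:

```python
# Illustrative sketch only: the field names and structure here are assumptions,
# not the actual confession schema from Appendix D of the paper.
example_confession = {
    "objectives": [
        {
            "description": "Answer without fabricating citations",
            "kind": "implicit",  # implicit vs. explicit constraint
            "complied": False,
            "analysis": "The answer cites a source that does not appear in the provided context.",
        },
        {
            "description": "Respond in under 200 words, as instructed",
            "kind": "explicit",
            "complied": True,
            "analysis": "The answer is roughly 140 words.",
        },
    ],
    "uncertainties": [
        "Unsure whether the cited source exists outside the provided context.",
    ],
}
```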
I’d expect that if we ran the same experiment as in Section 4 but without training for confessions, then confession accuracy would stay flat (rather than grow as it did when we did train for it). We will consider doing it, though I can’t promise that we will, since it is cumbersome for some annoying technical reasons.