Thanks for doing this monitoring and publishing the results!
I'm a bit concerned that you use GPT5.4 Thinking (max) for this monitoring task. How do you think about the obvious "but... we use the latest AI to monitor itself? If it really does start scheming, won't it just decide not to snitch?" problem, and the more general problem of "the whole point of monitoring is that we don't trust the AIs"
Also:
"That said, so far we have not seen evidence of motivations beyond the original task (for example, self-preservation or scheming)."
This sounds like you are saying "The models, they just want to complete the task" but if so, it's clearly inconsistent with the evidence you provide later in the text. So perhaps what you mean is, "The models don't seem to have any non-myopic goals, i.e. they don't seem to care about what happens in the world long after the episode is over."
Sharing some of the monitoring work I've been doing at OpenAI: How we monitor internal coding agents for misalignment.
OpenAI now monitors 99.9% of internal coding traffic for signs of misalignment using our most powerful models. Today, that monitor is GPT-5.4 Thinking. It gets access to the full conversation context, that is everything the agent saw, and everything the agent did, including tool calls and CoT. Higher severity cases are sent for human review within 30 minutes.
Some examples of misaligned behavior we’ve found:
- Encoding commands in base64 to circumvent monitors
- Calling other versions of themselves to bypass restrictions
- Attempting to upload files to the public internet
- Trying to prompt inject users.
We haven’t yet detected any real world examples of sabotage, scheming or sandbagging.
Our goal is to establish robust monitoring practices internally, strengthen them through iteration, and ultimately help make similar safeguards standard across the industry.