I think the main challenge to monitoring agents is one of volume/scale, re:
Agents are much cheaper and faster to run than humans, so the amount of information and interactions that humans need to oversee will drastically increase.
But 1, 2, 4 are issues humans already face with managing other humans.
This is a long winded way of saying: I'm optimistic we can address managing agents (with their current capabilities) by drawing analogies to how we already do management in effective organizations, and find avenues to scale these management strategies 100x (or however many OOMs one might think we'll need).
society depends on a loop of humans managing other humans.
What does this mean? "Managing" is typically a hierarchical relationship, but "loop" implies something cyclical rather than hierarchical.
I interpreted loop as referring to an "OODA loop" where managers are observing those they are managing, delegating and action, and then waiting for feedback to then go back to the beginning of the loop.
e.g. nowadays, I delegate a decent chunk of implementation work to coding agents and the "loop" is me giving them a task, letting them riff on it for a few minutes, and then reviewing output before requesting changes or committing them in.
In 1787, Catherine the Great sailed down the Dnieper to inspect its banks. Her trusted advisor, Governor Potemkin, set out to present those war-torn lands to her in the best possible light. Legend has it[1] Potemkin set up painted facades along the riverbank, so that, from her barge, Catherine would see beautiful villages – each just a couple of inches thick.
The rise of AI agents makes the Potemkin problem commonplace. Research agents cite experiments that never took place. Coding agents often write fake tests and mock solutions as they cause catastrophes behind the scenes.
We're moving towards a world of Potemkin villages – where our understanding of reality drifts farther and farther from what is actually happening. At some point, we might stop catching our agents painting facades.
To avoid this, we need to understand AI agents and their effects on the world. Evaluations are currently the best guess on how to do this, but they are an incomplete solution.
Evaluations (or evals) measure how well an agent performs on tasks that you care about by testing it on similar tasks. Done properly, this proxy allows you to debug model training, build better agent scaffolds, or understand the speed of AI progress.
There are two major sources of difficulties in building good evals.
Because of this, eval results can range from noisy to actively deceiving.
We build evals with the hope that our tasks are well-scoped enough that the scores they return are informative. As we argued above, even this limited goal is difficult.
At deployment time, we face an even harder problem: understanding what the agents are doing without the benefit of well-scoped tasks. To do so, we'll need to build more complicated evaluation systems: those capable of monitoring and reviewing the open-ended work that agents perform.
This kind of thinking will not be new to us: society depends on a loop of humans managing other humans. But there are reasons to believe managing agents will be harder.
If we can't oversee our agents at all, we won't be able to reliably integrate AI agents into the economy. Worse, if we do it sloppily, our future will be shaped, not by our values, but by the proxies our agents fool us with.
What would it take to build the infrastructure for scaling human understanding? Here are the two directions that inform what we build:
This story is probably apocryphal. See Simon Sebag Montefiore's Catherine the Great & Potemkin, page 10. ↩︎
"OpenAI-Proof Q&A evaluates AI models on 20 internal research and engineering bottlenecks encountered at OpenAI, each representing at least a one-day delay to a major project and in some cases influencing the outcome of large training runs and launches. 'OpenAI-Proof' refers to the fact that each problem required over a day for a team at OpenAI to solve." ↩︎