I think your reasons for introducing this distinction only become clear in the hot take paragraph at the end :/ for the rest of the article it seems useless or practically meaningless.
Good point. There's another reason which we don't mention, but should, now that you mention it:
In The Discourse, I want to be able to say things like 'But your plan is basically to hand off to AIs as soon as possible! That's incredibly dangerous given the state of technical alignment!' and I expect people to reply 'What are you talking about? The AIs will still be in the datacenter, it'll still be humans making all the important decisions. In fact, I personally believe there should always be a Human in the Loop.' If this distinction catches on, we can avoid this by saying 'Your plan is to hand off trust to AIs as soon as possible...'
This is great and I'd love to see it go further.
I wonder if there's a component of decision handoff that could be characterized as "epistemic handoff"? If the president is making all his own decisions, but basing them on briefings and analyses provided by Agent 4, that starts to feel a lot like decision handoff in disguise.
> To a first approximation, we should only hand off trust to AIs if those AIs are trustworthy.
I don't think this is a very good approximation; it seems to me like it will very plausibly be a good idea to hand off trust (we might also call this giving up on control) at a point when there is a serious chance these AIs are scheming against us. (But our alternatives are worse.)
> Handing off decisionmaking is more complicated. You may be confident that your AIs won’t lie to you, won’t deceive you, will obey your orders, etc. and yet still be rightly reluctant to put AIs in charge of everything. In other words you may be confident that your AIs are trustworthy, yet still not trust them to decide everything.
For full decision handoff to be likely to go well, I think this requires all of:
You mention (3), but I think (1) and (2) might be larger blockers in practice.
For reference, the type of deference I'm talking about in *How do we (more) safely defer to AIs?* is equivalent to ~100% decision handoff, at least in the context of safety at an AI company.
Hold the line, brother! Don't normalize handing off trust to untrustworthy AIs! Just because it's probably going to be done, doesn't mean we should treat it as reasonable and good!
I acknowledged the "our alternatives are worse" possibility too; I don't think we disagree that much substantively.
...
I agree "trust handoff" seems maybe the same concept as "giving up on control," except that I think "giving up on control" seems like something that could be already true today (is there ~any effort to control our AIs now?) even though the AIs are too dumb to take over. Whereas you haven't handed off trust until you reach the point where the AIs really could screw you over if they wanted to.
> is there ~any effort to control our AIs now?
In what sense could the AIs be uncontrolled now? I have just re-read Anthropic's Model Card for Claude Opus 4.6. The evaluations led Anthropic to believe that "Given that Opus 4.6 did not improve over previous models in the higher performing out of the two main settings, we do not believe that the risk of successful sabotage is increased from Opus 4.5 (italics mine -- S.K.). Like Opus 4.5, we believe that Opus 4.6 is likely unable to conduct significant subtle sabotage without such sabotage often becoming obvious." Suppose that Opus 4.6 were to sabotage research. Then it would be clear from the transcripts, especially if the AI-2027 forecast holds and half of the tokens are checked by weak models that are ten times cheaper.
Maybe you mean that if the model displays the ability to sabotage research without referencing the task, then monitoring becomes a far more complex problem, requiring the company to apply interpretability tools to activations? And/or that such tools aren't THAT cheap?
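To make the cost question concrete, here's a back-of-envelope sketch of the arithmetic implied by the quoted AI-2027 assumption (half of tokens checked by a ~10x-cheaper monitor). The `monitoring_overhead` helper and all the numbers are purely illustrative assumptions, not figures from the model card:

```python
# Back-of-envelope monitoring cost, under the AI-2027-style assumption quoted
# above: half of the generator's tokens get checked by a monitor model that is
# ~10x cheaper per token. All numbers are illustrative assumptions.

def monitoring_overhead(fraction_checked: float, monitor_cost_ratio: float) -> float:
    """Monitoring spend as a fraction of generation spend."""
    return fraction_checked * monitor_cost_ratio

# Cheap transcript monitoring: check 50% of tokens with a 10x-cheaper model.
print(monitoring_overhead(0.5, 0.1))  # 0.05, i.e. ~5% overhead

# The worry above: if subtle sabotage forces interpretability passes over
# activations instead, the per-token cost ratio could plausibly exceed 1,
# at which point "monitoring" costs as much as generation itself.
print(monitoring_overhead(0.5, 2.0))  # 1.0, i.e. 100% overhead
```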
> Well, perhaps because the other options are even worse. For example, perhaps the world has gotten itself into a very sticky situation (e.g. a crazy arms race towards superintelligence that’s on the brink of escalating into WW3) and you think your best bet is to put AIs in charge of a bunch of things (e.g. AI research, diplomatic and military strategy, …) and hope they can handle things better than you. After all, they are more capable than you.
Another reason is that, in the long term, you can't trust humans to make responsible decisions about weapons of mass destruction. As the level of firepower under our control constantly increases, we're both less able to appreciate the magnitude of the damage it would cause and less able to recover from its use. It already seems likely that if we ran the tape of human civilization another few hundred years (modulo superintelligence), the cumulative risk of an engineered pandemic, nuclear war, or some other superweapon would annihilate us.
This is a rough draft I'm posting here for feedback. If people like it, a version of it might make it into the next scenario report we write.
...
We think it’s important for decisionmakers to track whether and when they are handing off to AI systems. We expect this will become a hot-button political topic eventually; people will debate whether we should ever hand off to AIs, and if so how, and when. When someone proposes a plan for how to manage the AI crisis or the AGI transition or whatever it’s called, others will ask them “So what does your plan say about handoff?”
There are two importantly different kinds of handoff: Handing off trust and handing off decisionmaking. You can have one without the other.
Trust-handoff means that you are trusting some AI system or set of AI systems not to screw you over. It means that they totally could screw you over, if they chose to, and therefore you are trusting them not to.
Decision-handoff means that you are allowing some AI system or set of AI systems to make decisions autonomously, or de-facto-autonomously (e.g. a human is still technically “in the loop” but in practice basically just does whatever the AI recommends).
With both kinds of handoff, there are smaller and bigger versions:

**Small trust-handoff:** I used Claude Code to write most of my codebase. Anthropic’s cyber evals indicate that Claude could totally have inserted security vulnerabilities if it wanted to, and I wouldn’t notice. But probably Claude wouldn’t do that, so it’s fine.

**Big trust-handoff:** It’s September 2027 in the AI 2027 scenario. Agent-4 is a giant corporation-within-a-corporation of many thousands of copies running across OpenBrain’s datacenters. It’s broadly superhuman at all things coding and cyber, has been heavily involved in its own network security, and regularly gives strategic advice to OpenBrain leadership. It’s now being tasked with designing and aligning Agent-5, a superior AI architecture that it autonomously discovered. Agent-4 appears to be obedient/loyal/aligned/etc., but ho boy, if it’s not, well, not only is OpenBrain screwed but quite plausibly the whole world will end up controlled by Agent-4 and its descendants (such as Agent-5).

**Small decision-handoff:** Yesterday I decided to switch from coding with Claude to vibe-coding. I no longer make decisions about e.g. what file structures to use or how the UI should be managed on the backend; instead I just give the high-level goal to Claude and then press “tab” to accept whatever it suggests, unless what it suggests is so obviously insane that I can tell in two seconds.

**Big decision-handoff:** It’s July 2028 in the AI 2027 scenario. The army of superintelligences that flies the US flag (“Safer-4”) has been aligned to the Spec, and the Spec says to obey the Oversight Committee. So in some sense, humans are still in control. However, de facto Safer-4 is making basically all the most important decisions. For example, Safer-4 autonomously negotiated and implemented a complex treaty with its Chinese counterpart, and the Oversight Committee knew better than to raise any objections. Why would they? It is so much smarter and wiser than them, and every time they objected in the past, it patiently explained to them why they were wrong, and they eventually agreed that they were wrong & approved, having accomplished nothing but wasting time.
By default when we talk about trust-handoff and decision-handoff, we mean the really big ones, unless it’s clear from the context that we mean something smaller. So for example, if you see a diagram of scenario branches and the label “Trust Handoff” at a particular time on a particular branch, that means that at that point in that scenario, some set of AIs has become smart enough, and been entrusted with enough power, that they plausibly could take over the world if they tried. Similarly, the label “Decision Handoff” would indicate that at that point in the scenario, the overall trajectory of society is being steered by some set of AI systems; that extremely important decisions about how to structure society etc. are being de facto made by AIs.
Now for some details and nuance:
When should we hand off trust and when should we hand off decisionmaking?
When the benefits outweigh the costs, of course.
To a first approximation, we should only hand off trust to AIs if those AIs are trustworthy. By definition, when you hand off trust to a group of AIs, you are making it the case that if they decided to screw you over, they could. So, you better have well-founded trust that they won’t decide to screw you over.
Handing off decisionmaking is more complicated. You may be confident that your AIs won’t lie to you, won’t deceive you, will obey your orders, etc. and yet still be rightly reluctant to put AIs in charge of everything. In other words you may be confident that your AIs are trustworthy, yet still not trust them to decide everything.
For example, your AIs might have enough deontological constraints on their actions (honesty, obedience, etc.) that they can be trusted not to take over the world, not to disempower you, etc. Yet at the same time, your AIs’ long-term goals might be subtly (or majorly!) different from your own, such that if you let them make the decisions, things will predictably go downhill from your perspective.
Analogy: You are a nonprofit board looking for a CEO to run your fast-growing organization. For some candidates, you might be worried that they are untrustworthy—for example they might lie to you, pull various schemes to get their rivals on the board kicked off, and ultimately one day you might try to fire them and find that you can’t. But, suppose they are in fact trustworthy and would never do those things and will always obey your orders. Still, they might have different values than you, different philosophies, different attitudes towards risk tolerance, etc., such that it would be a bad idea (from your perspective, not theirs) to hire them. They might end up taking the nonprofit in a very different direction than you envisioned, for example, or they might end up doing too many risky things that end up blowing up big time later. “Personnel is policy,” as the saying goes.
Why would you ever hand off trust? Why ever put a group of AIs in a position where they could take over the world if they wanted to? Well, perhaps because the other options are even worse. For example, perhaps the world has gotten itself into a very sticky situation (e.g. a crazy arms race towards superintelligence that’s on the brink of escalating into WW3) and you think your best bet is to put AIs in charge of a bunch of things (e.g. AI research, diplomatic and military strategy, …) and hope they can handle things better than you. After all, they are more capable than you. In other words, perhaps one plausible reason for handing off trust is that you want to hand off decision-making and you don’t have a way to do that without also handing off trust.
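A toy expected-value comparison can make the “other options are even worse” logic explicit. This is a minimal sketch with invented probabilities; nothing in this draft commits to these numbers:

```python
# Toy expected-value comparison for handing off trust to possibly-scheming
# AIs vs. staying the course in a runaway arms race. Every number here is
# an invented illustration, not a claim from the draft.

p_scheming = 0.25           # assumed chance the AIs screw you over after handoff
p_disaster_without = 0.50   # assumed chance the sticky situation ends in catastrophe anyway

good_future = 1.0           # arbitrary utility scale: 1 = good outcome, 0 = catastrophe
ev_handoff = (1 - p_scheming) * good_future
ev_no_handoff = (1 - p_disaster_without) * good_future

print(ev_handoff, ev_no_handoff)  # 0.75 0.5: handoff wins under these numbers

# The substantive disagreement (see the comments above) is over these inputs:
# if p_scheming is higher or p_disaster_without is lower, the conclusion flips.
```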
Another reason to hand off trust is to enforce contracts/agreements, in situations where the AIs are probably aligned. For example, the US and China might want to agree to respect each other’s sovereignty in perpetuity, yada yada, because otherwise they are trapped in a crazy robot and WMD and superpersuasion arms race. But you can’t trust humans to keep their word. You CAN trust AIs to keep theirs, at least if they’ve been suitably trained/designed.
For smaller-stakes handoffs, the calculus is similar. E.g. you might hand off decision-making over many aspects of hospital patients’ health to an AI system because you have evidence that the AI system is more competent than your doctors and nurses; you recognize that this also involves handing off trust to the AI system (if it decided to kill your patients, it could easily do so), but you trust that it won’t.
AIFP hot take: We generally expect most powerful actors in charge of AI programs to hand off trust too early (while the risks are still high and outweigh the benefits) and to hand off decisionmaking too late (e.g. harmfully keeping a ‘human in the loop’ long after the point when they mostly just get in the way & slow things down). We think there might be an awkward “worst of both worlds” period where superhuman AI systems have been given significant power and autonomy — such as de facto control of their own datacenters & license to self-improve — such that they could take over the world if they wanted to, and yet simultaneously the world is full of problems that could be solved much better and faster, and risks that could be reduced/averted, if only the AIs were put in charge of more things in the real world.
That said, we aren’t confident. For reasons mentioned previously (See: the CEO analogy) such a period might make a lot of sense.