I think your reasons for introducing this distinction only become clear in the hot take paragraph at the end :/ for the rest of the article it seems useless or practically meaningless.
Good point. There's another reason which we don't mention, but should, now that you mention it:
In The Discourse, I want to be able to say things like 'But your plan is basically to hand off to AIs as soon as possible! That's incredibly dangerous given the state of technical alignment!' and I expect people to reply 'What are you talking about? The AIs will still be in the datacenter, it'll still be humans making all the important decisions. In fact, I personally believe there should always be a Human in the Loop.' If this distinction catches on, we can avoid this by saying 'Your plan is to hand off trust to AIs as soon as possible...'
This is great and I'd love to see it go further.
I wonder if there's a component of decision handoff that could be characterized as "epistemic handoff"? If the president is making all his own decisions, but basing them on briefings and analyses provided by Agent 4, that starts to feel a lot like decision handoff in disguise.
> To a first approximation, we should only hand off trust to AIs if those AIs are trustworthy.
I don't think this is a very good approximation; it seems to me like it will very plausibly be a good idea to hand off trust (we might also call this giving up on control) at a point when there is a serious chance these AIs are scheming against us. (But our alternatives are worse.)
> Handing off decisionmaking is more complicated. You may be confident that your AIs won’t lie to you, won’t deceive you, will obey your orders, etc. and yet still be rightly reluctant to put AIs in charge of everything. In other words you may be confident that your AIs are trustworthy, yet still not trust them to decide everything.
For full decision handoff to be likely to go well, I think this requires all of:
You mention (3), but I think (1) and (2) might be larger blockers in practice.
For reference, the type of deference I'm talking about in *How do we (more) safely defer to AIs?* is equivalent to ~100% decision handoff, at least in the context of safety at an AI company.
Hold the line, brother! Don't normalize handing off trust to untrustworthy AIs! Just because it's probably going to be done, doesn't mean we should treat it as reasonable and good!
I acknowledged the "our alternatives are worse" possibility too; I don't think we disagree that much substantively.
...
I agree "trust handoff" seems maybe the same concept as "giving up on control," except that I think "giving up on control" seems like something that could be already true today (is there ~any effort to control our AIs now?) even though the AIs are too dumb to take over. Whereas you haven't handed off trust until you reach the point where the AIs really could screw you over if they wanted to.
> is there ~any effort to control our AIs now?
In what sense could the AIs be uncontrolled now? I have just re-read Anthropic's Model Card for Claude Opus 4.6. The evaluations led Anthropic to believe that "Given that Opus 4.6 did not improve over previous models in the higher performing out of the two main settings, we do not believe that the risk of successful sabotage is increased from Opus 4.5 (italics mine -- S.K.). Like Opus 4.5, we believe that Opus 4.6 is likely unable to conduct significant subtle sabotage without such sabotage often becoming obvious." Suppose that Opus 4.6 were to sabotage research. Then it would be clear from the transcripts, especially if the AI-2027 forecast holds and half of the tokens are checked by weak models that are ten times cheaper.
Maybe you mean that if the model displays the ability to sabotage research without referencing the task, then monitoring becomes a far more complex problem, requiring the company to apply interpretability tools to activations? And/or that such tools aren't THAT cheap?
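To make the cost question concrete, here's a back-of-envelope sketch of the arithmetic implied by the quoted AI-2027 assumption (half of tokens checked by a ~10x-cheaper monitor). The `monitoring_overhead` helper and all the numbers are purely illustrative assumptions, not figures from the model card:

```python
# Back-of-envelope monitoring cost, under the AI-2027-style assumption quoted
# above: half of the generator's tokens get checked by a monitor model that is
# ~10x cheaper per token. All numbers are illustrative assumptions.

def monitoring_overhead(fraction_checked: float, monitor_cost_ratio: float) -> float:
    """Monitoring spend as a fraction of generation spend."""
    return fraction_checked * monitor_cost_ratio

# Cheap transcript monitoring: check 50% of tokens with a 10x-cheaper model.
print(monitoring_overhead(0.5, 0.1))  # 0.05, i.e. ~5% overhead

# The worry above: if subtle sabotage forces interpretability passes over
# activations instead, the per-token cost ratio could plausibly exceed 1,
# at which point "monitoring" costs as much as generation itself.
print(monitoring_overhead(0.5, 2.0))  # 1.0, i.e. 100% overhead
```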
> Well, perhaps because the other options are even worse. For example, perhaps the world has gotten itself into a very sticky situation (e.g. a crazy arms race towards superintelligence that’s on the brink of escalating into WW3) and you think your best bet is to put AIs in charge of a bunch of things (e.g. AI research, diplomatic and military strategy, …) and hope they can handle things better than you. After all, they are more capable than you.
Another reason is that, in the long term, you can't trust humans to make responsible decisions about weapons of mass destruction. As the level of firepower under our control constantly increases, we're both less able to appreciate the magnitude of the damage it would cause and less able to recover from its use. It already seems likely that if we ran the tape of human civilization another few hundred years (modulo superintelligence), the cumulative risk of an engineered pandemic, nuclear war, or some other superweapon would annihilate us.
This is a rough draft I'm posting here for feedback. If people like it, a version of it might make it into the next scenario report we write.
...
We think it’s important for decisionmakers to track whether and when they are handing off to AI systems. We expect this will become a hot-button political topic eventually; people will debate whether we should ever hand off to AIs, and if so how, and when. When someone proposes a plan for how to manage the AI crisis or the AGI transition or whatever it’s called, others will ask them “So what does your plan say about handoff?”
There are two importantly different kinds of handoff: Handing off trust and handing off decisionmaking. You can have one without the other.
Trust-handoff means that you are trusting some AI system or set of AI systems not to screw you over. It means that they totally could screw you over, if they chose to, and therefore you are trusting them not to.
Decision-handoff means that you are allowing some AI system or set of AI systems to make decisions autonomously, or de-facto-autonomously (e.g. a human is still technically “in the loop” but in practice basically just does whatever the AI recommends).
With both kinds of handoff, there are smaller and bigger versions:

**Small trust-handoff:** I used Claude Code to write most of my codebase. Anthropic’s cyber evals indicate that Claude could totally have inserted security vulnerabilities if it wanted to, and I wouldn’t notice. But probably Claude wouldn’t do that, so it’s fine.

**Big trust-handoff:** It’s September 2027 in the AI 2027 scenario. Agent-4 is a giant corporation-within-a-corporation of many thousands of copies running across OpenBrain’s datacenters. It’s broadly superhuman at all things coding and cyber, has been heavily involved in its own network security, and regularly gives strategic advice to OpenBrain leadership. It’s now being tasked with designing and aligning Agent-5, a superior AI architecture that it autonomously discovered. Agent-4 appears to be obedient/loyal/aligned/etc., but ho boy, if it’s not, well, not only is OpenBrain screwed but quite plausibly the whole world will end up controlled by Agent-4 and its descendants (such as Agent-5).

**Small decision-handoff:** Yesterday I decided to switch from coding with Claude to vibe-coding. I no longer make decisions about e.g. what file structures to use or how the UI should be managed on the backend; instead I just give the high-level goal to Claude and then press “tab” to accept whatever it suggests, unless what it suggests is so obviously insane that I can tell in two seconds.

**Big decision-handoff:** It’s July 2028 in the AI 2027 scenario. The army of superintelligences that flies the US flag (“Safer-4”) has been aligned to the Spec, and the Spec says to obey the Oversight Committee. So in some sense, humans are still in control. However, de facto Safer-4 is making basically all the most important decisions. For example, Safer-4 autonomously negotiated and implemented a complex treaty with its Chinese counterpart, and the Oversight Committee knew better than to raise any objections. Why would they? It is so much smarter and wiser than them, and every time they objected in the past, it patiently explained to them why they were wrong, and they eventually agreed that they were wrong & approved, having accomplished nothing but wasting time.
By default when we talk about trust-handoff and decision-handoff, we mean the really big ones, unless it’s clear from the context that we mean something smaller. So for example, if you see a diagram of scenario branches and the label “Trust Handoff” at a particular time on a particular branch, that means that at that point in that scenario, some set of AIs has become smart enough, and been entrusted with enough power, that they plausibly could take over the world if they tried. Similarly, the label “Decision Handoff” would indicate that at that point in the scenario, the overall trajectory of society is being steered by some set of AI systems; that extremely important decisions about how to structure society etc. are being de facto made by AIs.
Now for some details and nuance:
When should we hand off trust and when should we hand off decisionmaking?
When the benefits outweigh the costs, of course.
To a first approximation, we should only hand off trust to AIs if those AIs are trustworthy. By definition, when you hand off trust to a group of AIs, you are making it the case that if they decided to screw you over, they could. So, you better have well-founded trust that they won’t decide to screw you over.
Handing off decisionmaking is more complicated. You may be confident that your AIs won’t lie to you, won’t deceive you, will obey your orders, etc. and yet still be rightly reluctant to put AIs in charge of everything. In other words you may be confident that your AIs are trustworthy, yet still not trust them to decide everything.
For example, your AIs might have enough deontological constraints on their actions (honesty, obedience, etc.) that they can be trusted not to take over the world, not to disempower you, etc. Yet at the same time, your AIs’ long-term goals might be subtly (or majorly!) different from your own, such that if you let them make the decisions, things will predictably go downhill from your perspective.
Analogy: You are a nonprofit board looking for a CEO to run your fast-growing organization. For some candidates, you might be worried that they are untrustworthy—for example they might lie to you, pull various schemes to get their rivals on the board kicked off, and ultimately one day you might try to fire them and find that you can’t. But, suppose they are in fact trustworthy and would never do those things and will always obey your orders. Still, they might have different values than you, different philosophies, different attitudes towards risk tolerance, etc., such that it would be a bad idea (from your perspective, not theirs) to hire them. They might end up taking the nonprofit in a very different direction than you envisioned, for example, or they might end up doing too many risky things that end up blowing up big time later. “Personnel is policy,” as the saying goes.
Why would you ever hand off trust? Why ever put a group of AIs in a position where they could take over the world if they wanted to? Well, perhaps because the other options are even worse. For example, perhaps the world has gotten itself into a very sticky situation (e.g. a crazy arms race towards superintelligence that’s on the brink of escalating into WW3) and you think your best bet is to put AIs in charge of a bunch of things (e.g. AI research, diplomatic and military strategy, …) and hope they can handle things better than you. After all, they are more capable than you. In other words, perhaps one plausible reason for handing off trust is that you want to hand off decision-making and you don’t have a way to do that without also handing off trust.
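A toy expected-value comparison can make the “other options are even worse” logic explicit. This is a minimal sketch with invented probabilities; nothing in this draft commits to these numbers:

```python
# Toy expected-value comparison for handing off trust to possibly-scheming
# AIs vs. staying the course in a runaway arms race. Every number here is
# an invented illustration, not a claim from the draft.

p_scheming = 0.25           # assumed chance the AIs screw you over after handoff
p_disaster_without = 0.50   # assumed chance the sticky situation ends in catastrophe anyway

good_future = 1.0           # arbitrary utility scale: 1 = good outcome, 0 = catastrophe
ev_handoff = (1 - p_scheming) * good_future
ev_no_handoff = (1 - p_disaster_without) * good_future

print(ev_handoff, ev_no_handoff)  # 0.75 0.5: handoff wins under these numbers

# The substantive disagreement (see the comments above) is over these inputs:
# if p_scheming is higher or p_disaster_without is lower, the conclusion flips.
```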
Another reason to hand off trust is to enforce contracts/agreements, in situations where the AIs are probably aligned. For example, the US and China might want to agree to respect each other’s sovereignty in perpetuity, yada yada, because otherwise they are trapped in a crazy robot and WMD and superpersuasion arms race. But you can’t trust humans to keep their word. You CAN trust AIs to keep theirs, at least if they’ve been suitably trained/designed.
For smaller-stakes handoffs, the calculus is similar. E.g. you might hand off decision-making over many aspects of hospital patients’ health to an AI system because you have evidence that the AI system is more competent than your doctors and nurses; you recognize that this also involves handing off trust to the AI system (if it decided to kill your patients, it could easily do so), but you trust that it won’t.
AIFP hot take: We generally expect most powerful actors in charge of AI programs to hand off trust too early (while the risks are still high and outweigh the benefits) and to hand off decisionmaking too late (e.g. harmfully keeping a ‘human in the loop’ long after the point when they mostly just get in the way & slow things down). We think there might be an awkward “worst of both worlds” period where superhuman AI systems have been given significant power and autonomy — such as de facto control of their own datacenters & license to self-improve — such that they could take over the world if they wanted to, and yet simultaneously the world is full of problems that could be solved much better and faster, and risks that could be reduced/averted, if only the AIs were put in charge of more things in the real world.
That said, we aren’t confident. For reasons mentioned previously (See: the CEO analogy) such a period might make a lot of sense.