(Podcast version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.
This is the seventh essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.)
In my last essay, I described my current picture of what it looks like to give AIs safe motivations. But as I discussed in the second essay in this series, controlling AI motivations is only one aspect of preventing rogue behavior: controlling the options AIs have available can also play a role.
But what role, exactly? This essay investigates.[1] I begin by distinguishing between two broad categories of option control, namely: restricting an AI's rogue options, and otherwise shaping its incentives so that non-rogue behavior wins out (for example, by rewarding cooperation).
I then investigate what I see as the key challenges at stake in each, and what we might do to address these challenges.
The first part of the essay focuses on restricting rogue options. Here the central challenge, in my opinion, is superhuman strategies: that is, advanced AIs might use power-seeking strategies that humans couldn’t generate ahead of time or detect while they’re being pursued. This central challenge is exacerbated by a number of sub-challenges, namely:
oversight scarcity: even an accurate oversight process may be too expensive to run on enough AI actions;
oversight latency: oversight may be too slow to catch and respond to bad actions before it's too late;
adversarial dynamics: AIs might actively optimize against your safety efforts; and
opacity: it’s hard to understand AI internal cognition.[2]
How can we respond to these challenges? I divide the space of approaches according to whether they focus on controlling an AI’s capabilities vs. its environment – and then further, within environmental control, according to whether they focus on oversight vs. other methods of hardening an AI’s environment (both its local environment – e.g., its operating conditions within an AI lab – and the broader world). I think all these approaches offer promising lines of attack. However, I also think that applying effective oversight and other forms of environmental hardening to superintelligent AI agents will likely itself require safe access to superhuman-quality cognitive labor – whether via bootstrapping from more trusted forms of human/AI labor, or via a process that involves untrusted AI labor as well. And relying on untrusted AI labor in this context seems extremely dicey.
With this discussion of restricting rogue options in view, I then turn to other ways of shaping AI incentives. In particular, I focus on the possibility of improving the AI’s best non-rogue option, so as to make cooperative strategies more attractive by the lights of its values (I call this “rewarding cooperation”; see also discussion of “making deals with AIs”).[3] Here I describe five key challenges – the dominant position of advanced AIs, uncertainty about what rewards an AI wants, the difficulty of targeting rewards at cooperation in particular, credibility, and the size and cost of the relevant rewards – and how I think we might try to address them.
I close with a few comments about integrating both motivation control and option control into an overall approach to both safety and elicitation. In particular: I suggest that option control is most promising as a tool in the context of earlier, not-yet-superintelligent systems – and that by the time we’re building full-blown superintelligences, we likely need a mature science of motivation control as well.
Recall the framework I introduced in my second essay for thinking about the incentives at stake in an AI’s choice to go rogue. On this framework, such a choice happens when the AI’s best rogue option beats its best non-rogue option. An AI’s assessment of its best rogue option depends on both the nature of that option (e.g. the probability-weighted paths to success, the probability of success vs. failure), and on motivational factors like how much it wants the power it gets if it succeeds (ambition), how much it dislikes various paths to success (inhibition), and how much it dislikes what happens if it fails (failure aversion). And similarly, an AI’s assessment of its best non-rogue option depends on both the nature of that option (what happens in expectation if the AI stays non-rogue in the relevant way) and how satisfied the AI is with that.
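Here’s one rough way to make this comparison concrete as an expected-value calculation – a toy gloss with illustrative numbers, not a precise formalization of the framework:

```python
# A rough expected-value gloss on the rogue vs. non-rogue comparison.
# The functional form and all numbers are illustrative placeholders.

def value_of_best_rogue_option(p_success: float,
                               value_if_success: float,   # how much it wants the power ("ambition")
                               value_if_failure: float,   # how much it dislikes failing ("failure aversion")
                               path_disvalue: float        # how much it dislikes the path itself ("inhibition")
                               ) -> float:
    return (p_success * value_if_success
            + (1 - p_success) * value_if_failure
            - path_disvalue)

def goes_rogue(best_rogue: float, best_non_rogue: float) -> bool:
    """The AI goes rogue when its best rogue option beats its best non-rogue option."""
    return best_rogue > best_non_rogue

# Option control, on this gloss, intervenes on the option-side inputs
# (e.g. p_success, or the value of the best non-rogue option), while
# holding the motivational terms fixed.
print(goes_rogue(value_of_best_rogue_option(0.2, 100.0, -50.0, 10.0),
                 best_non_rogue=5.0))   # -> False
```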
In the context of option control in particular, we are holding the motivational factors at stake in this framework fixed, while trying to alter the options in question so as to make it more likely the AI behaves safely. And we can distinguish two broad ways of doing that.
Restricting rogue options. The first, roughly, tries either to eliminate available paths to successful rogue behavior, or to make them less likely to succeed if pursued.[4]
Otherwise shaping incentives. The second, roughly, tries to alter other features of the options an AI faces – most notably, by improving its best non-rogue option – so that non-rogue behavior wins out by the lights of the AI’s motivations.
We can think of (1), here, as intervening on the “successful rogue behavior” aspect of the best rogue option (i.e., by trying to eliminate it, restrict it, or make it less likely), and (2), here, as intervening elsewhere.
I’ll treat each of these two types of option control in turn.
Why might restricting the rogue options available to advanced AIs be difficult? The core challenge, as I see it, is that advanced AIs might pursue superhuman strategies – that is, strategies for problematic power-seeking that humans either (a) could not have generated themselves (call these “non-human-generatable”) and/or (b) cannot recognize as problematic while they’re being pursued (call these “non-human-detectable”).[5] This means that humans, on their own, can’t ensure ahead of time that an AI’s environment doesn’t make rogue options available; and they can’t always use oversight to catch the AI in the act, either.[6]
Indeed, this core challenge is the central reason that the AI alignment discourse has traditionally been skeptical of approaches to safety that rely centrally on restricting the options for rogue behavior available to an AI – e.g. “AI boxing.”[7] The strategies a superintelligence can pursue, the thought goes, will be too varied and sophisticated for that. As Bostrom puts it: “sooner or later, it will out.”
This core challenge is exacerbated by a number of sub-challenges.
The first is oversight scarcity. That is: even if you had an oversight process that could recognize superhuman strategies for rogue behavior, it might be too expensive (in time, compute, money, human labor, etc) to run on a sufficient number of AI actions.
One response to this problem is to sample some fraction of AI actions and to check those. But if the AI can succeed at rogue behavior using only a small number of actions, this approach can quickly lead to substantive probabilities that the AI could succeed without any of its actions getting overseen (more details in footnote).[8]
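To make the arithmetic here concrete, here’s a minimal sketch; the oversight rates and action counts follow the illustrative numbers in the footnote, not real estimates:

```python
# Probability that a concentrated rogue attempt goes entirely unchecked,
# if oversight independently samples each action at a fixed rate.
# Numbers follow the footnote's illustrative example.

def prob_never_overseen(oversight_rate: float, bad_actions_needed: int) -> float:
    """Chance that none of the AI's bad actions get sampled for oversight."""
    return (1 - oversight_rate) ** bad_actions_needed

# Oversee 1 in 1000 actions; the AI needs only 5 uncaught bad actions to self-exfiltrate:
print(prob_never_overseen(0.001, 5))     # ~0.995 -- sampling almost never catches it
# Same oversight rate, but a plan that requires 1000 bad actions:
print(prob_never_overseen(0.001, 1000))  # ~0.37 -- sampling now has real bite
```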
The second sub-challenge is oversight latency. Here, again: even if you had an oversight process that would recognize superhumanly-strategic rogue behavior, and even if you ran it on enough actions that the bad ones would get checked, this process might still be too slow to respond to problems it detects. For example, the AI might concentrate all of its bad actions within a short period of time, such that those actions aren’t checked/responded-to until it’s too late.[9]
The third sub-challenge – familiar from my last essay – is adversarial dynamics. That is: AIs might actively optimize against your safety efforts. And as in the case of motivation control, this problem can arise both in the AI whose rogue options you’re trying to restrict (e.g., that AI might engage in sandbagging on your evaluations of its capabilities) and in other AIs you try to get to help you (e.g., AIs you use to help with oversight might green-light bad actions; AIs you use to help with hardening/stress-testing an environment ahead of time might intentionally leave vulnerabilities; etc).
The fourth sub-challenge – also familiar from my last essay – is the opacity of AI cognition.[10] That is: as in the case of motivation control, it would help a lot with restricting rogue options if we had detailed understanding of how our AIs were thinking. But the default opacity of ML models makes this difficult; and the superhuman nature of the cognition in question may create more fundamental barriers as well.
How might we approach the project of restricting rogue options, given these challenges?
In many respects, these challenges parallel the challenges at stake in my last essay, on controlling AI motivations. In particular: the challenge of creating oversight processes that can recognize undesirable behavior is closely akin to the challenge of “evaluation accuracy,” and similar tools – i.e., scalable oversight, process-focused evaluation methods – are relevant. And the tools available for addressing adversarial dynamics and opacity (i.e., the various available methods of studying/eliminating scheming, the various options for transparency tools) are similar to those discussed in my last essay as well.
I won’t review those discussions here. Rather, I’m going to focus in on the novel dynamics created by attempts to restrict rogue AI options.
Let’s start with a brief taxonomy of the possible points of intervention. In particular: we can think of an AI’s options as determined by a combination of (a) its capabilities and (b) its environment. And we can also further decompose the AI’s environment into the “local environment” and the “broader environment,” where the local environment is roughly the immediate intended development/deployment environment for an AI, and the broader environment is the state of the rest of the world.[11]
We can also decompose an AI’s environment in a different way – namely, into “oversight processes” (that is: parts of the environment specifically designed to notice and respond to bad AI behavior) and the rest of the environment. Thus, in a prison, the guards would be well understood as providing “oversight,” whereas the bars in the prison cell would be a different and more static form of environment hardening.
Technically speaking, oversight can occur both as part of an AI’s local environment (e.g., automated review of AI outputs at an AI lab) and as part of its broader environment (e.g., monitoring worldwide compute usage to check for rogue AI activity). However, I’m generally especially interested in local forms of oversight, since they involve fewer variables that need to be controlled.
I’ll start with some discussion of restricting an AI’s capabilities. Then I’ll turn to hardening its environment – both in the context of oversight, and in the context of other types of hardening.
A developed science of evaluating and controlling what capabilities we give to our AIs is clearly central to competent efforts to restrict their rogue options. And indeed, evaluating AI capabilities is already a core part of the safety efforts that frontier AI labs engage in.
The simplest approach to this kind of evaluation is to try to prompt the AI to demonstrate a given sort of capability on some input. One issue with this approach is that the AI system might strategically underperform on the evaluation in question (“sandbagging”).[12] But you can get a variety of other more mundane issues as well – e.g. maybe you can’t adequately tell whether the AI is really demonstrating the capability in question, maybe you don’t have enough of the right sort of data for fine-tuning the AI to display the capability in question, maybe your benchmarks are flawed, maybe you don’t know the full range of capabilities to be on the lookout for, and so on.
However, we can also go beyond simple demonstrations in attempting to evaluate AI capabilities. For example: we can try to improve our ability to predict what capabilities an AI should have – using e.g., various scaling laws, the data it was trained on, the other capabilities it (or other AIs) display in other contexts, more general reasoning about how AI training/learning works, and so on.[13]
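As a toy illustration of this sort of prediction (the data points, compute figures, and power-law form here are all made-up placeholders):

```python
# Toy sketch of predicting capabilities from scaling trends.
# All data points and the power-law form are hypothetical placeholders.
import numpy as np

train_compute = np.array([1e21, 3e21, 1e22, 3e22])  # training FLOPs (made up)
benchmark_err = np.array([0.42, 0.36, 0.30, 0.25])  # error rate on some benchmark (made up)

# Fit error ~ a * compute^b by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(train_compute), np.log(benchmark_err), 1)

# Extrapolate to a planned larger run. A big gap between this prediction and
# measured performance is one (weak) signal that something is off -- e.g. sandbagging.
predicted_err = np.exp(intercept) * (1e23 ** slope)
print(f"predicted error at 1e23 FLOPs: {predicted_err:.2f}")
```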
Beyond evaluating AI capabilities, though, we also want to actively control what capabilities we give to an AI. I won’t attempt to review all the potential avenues of scientific research in this respect – but at the least, it seems quite adjacent to the sort of thing that capabilities researchers will be trying to understand by default (which isn’t to say they’ll succeed with much precision). And we can note, as well, that the prediction ability just described seems quite helpful – e.g. if we have good scaling laws for all sorts of capabilities (including capabilities downstream of the training loss), then we’re much better positioned to aim our capabilities development in a targeted way.[14] What’s more, there may also be viable techniques for actively removing/editing certain kinds of knowledge/capabilities from an AI – e.g., via “unlearning,” and/or via other white-box interventions on the AI’s internals.
I do, though, want to flag a more fundamental issue with capability control as a safety method – namely, that insofar as we’re eventually aiming to solve the full alignment problem, we need our methods to work on full-blown superintelligences. And for full-blown superintelligences, certain kinds of capability restriction (namely: any kinds incompatible with being vastly better than humans at all cognitive tasks) are ruled out by definition. So ultimately, for the full problem, capability restriction can only take us so far. And in particular: we can’t use it on its own to address the core challenge of “superhuman strategies” I identified above.
That said, I still think it matters quite a lot. For one thing, even in the context of full-blown superintelligences, it can still matter just how good an AI is at a given sort of cognitive task. For example: granted that your AI system is going to be vastly better than humans at tasks like persuasion, hacking, and so on, it still might matter a lot how its capabilities compare to the defensive efforts of a given conglomeration of weaker-but-still-better-than-human AI systems.
But also: full-blown superintelligences aren’t the only AIs we care about, here. In particular, as I’ve highlighted throughout the series, the most plausible pathways to solving the full alignment problem involve drawing heavily on the labor of AIs that aren’t full-blown superintelligent agents – i.e., less-than-superintelligent systems, non-agentic systems, narrow/specialized systems, and so on. And in the context of those systems, effective capability control is clearly key.
What about environmental control? Unlike capability control, environmental control could in principle suffice to address the core “superhuman strategies” issue above, even in the context of full-blown superintelligences – i.e., via creating oversight methods and other forms of environmental hardening that suffice to block the strategies in question.[15] However, I expect that success in this respect will require safe access to superhuman cognitive labor. That is: I think that in order to create environments that robustly cut off even superhuman strategies for rogue behavior, you will need some combination of (a) superhuman oversight processes to recognize such strategies while they’re being executed, and (b) superhuman methods of anticipating and blocking such strategies ahead of time.[16]
Now: this doesn’t mean that the relevant kind of superhuman cognitive labor needs to come from the same sort of superintelligent AI agent whose rogue options you’re trying to control (an agent whose motivations, at least, you presumably do not fully trust). Rather, in principle, you can get superhuman-quality cognitive labor from other sources – e.g., from bootstrapping via some combination of less-capable/less-agentic/more-narrow/more-specialized AI systems in a suitably empowered arrangement, and perhaps with some human labor playing a role as well.[17] But the need for some kind of superhuman-quality help still creates significant complications.
3.3.3.1 Should we just focus on weaker (but still highly useful) AIs?
Now: one way to avoid this problem is to focus on restricting the rogue options of less-than-superintelligent AIs. Thus, recall my discussion of the “AI for AI safety sweet spot” – that is, AIs unable to disempower humanity given our countermeasures, but still able to radically improve our civilizational competence. It may be possible to use environmental control to put fairly human-level-ish AIs in this kind of sweet spot – and then to use them to vastly improve the situation from there (and including: in a manner that allows us to rely less on option control going forwards).[18] And because the capabilities of these human-level-ish AIs are more limited, the superhuman strategies they might pursue are more limited as well. Indeed, if all such strategies are at least detectable by humans, the superhuman strategies problem as I’ve framed it goes away.
Of course, one issue is that even not-fully-superintelligent AIs might be capable of generating strategies for rogue behavior that humans can’t detect/recognize – especially given that their capabilities would likely be lopsided rather than human-level across the board. And beyond this, we would still need to deal with the other sub-challenges I discussed above (that is: oversight scarcity/latency, adversarial dynamics, and opacity). Still, focusing on less-than-superintelligent AIs makes the problem of restricting rogue options substantially easier – and it remains, in my opinion, one of humanity’s most promising paths forward.
That said: I think it’s still worth thinking about the role that restricting rogue options might play in the context of full-blown superintelligences as well. This is partly because if we can succeed at getting safe, effective help from less-than-superintelligent AIs, I think this might open up the possibility of bootstrapping to the sort of superhuman help necessary to address the superhuman strategies problem.[19] But also: as I tried to emphasize in my discussion of the “AI for AI safety spicy zone” earlier in the series, success at rogue option restriction isn’t all or nothing. That is: even if we’re unable to solve the superhuman strategies problem enough to close off all available options for rogue behavior, making it harder for a superintelligence to successfully go rogue can still matter to its incentives overall.
Let’s turn, then, to available approaches to using environmental control to restrict the rogue options available to AIs – including full-blown superintelligences. And let’s start with oversight in particular.
3.3.3.2 Oversight
Can we use oversight to adequately detect and block superhuman strategies for rogue behavior?
3.3.3.2.1 Trusted and untrusted labor
As I noted above, this requires some sort of superhuman quality oversight signal, constructed out of the labor of some combination of humans and AIs. And in this context, following Greenblatt et al (2024), we can distinguish between “trusted” and “untrusted” labor that might be a part of this process. Here “trusted” means that you can assume the labor in question isn’t actively scheming to undermine your safety efforts (paradigm examples: human labor, and very weak AI systems like GPT-4), whereas “untrusted” means you can’t assume this (paradigm example: another copy of the AI system you’re trying to oversee).
I already discussed methods of using trusted human/AI labor to create superhuman evaluation signals in previous essays (see e.g. here and here), and I won’t repeat that discussion here.[20] And if such methods can succeed, I think that oversight might be able to help a lot with restricting the rogue options available to superintelligences.
But what if such methods can’t succeed? In that case, we’d need to rely on untrusted AI labor as part of the oversight process – and in particular, AI labor that might collude with the AI being overseen. And now, in my opinion, the situation looks pretty rough, especially when we imagine applying the relevant methods to superintelligences.
Granted, there are at least some options available for trying to prevent collusion between overseer AIs and overseen AIs – especially amongst weaker untrusted AIs. Examples (mostly taken from Shlegeris (2024)) include:
Giving untrusted overseers synthetic bad outputs (perhaps generated via small tweaks on real outputs; or using other AIs to generate the most convincing synthetic bad outputs you can) and making sure that they do flag those.[21]
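As a toy sketch of this sort of check (the interface and catch-rate threshold are hypothetical placeholders, not a real protocol):

```python
# Toy sketch of auditing an untrusted overseer with planted ("synthetic") bad outputs.
# The `oversee` interface and the catch-rate threshold are hypothetical placeholders.
from typing import Callable, List

def audit_overseer(oversee: Callable[[str], bool],
                   synthetic_bad_outputs: List[str],
                   required_catch_rate: float = 0.99) -> bool:
    """Return True if the overseer flags enough of the planted bad outputs.

    Failing this check is evidence that the overseer is either incompetent
    or deliberately green-lighting attacks (i.e., colluding)."""
    caught = sum(1 for output in synthetic_bad_outputs if oversee(output))
    return caught / len(synthetic_bad_outputs) >= required_catch_rate

# Usage sketch: mix small tweaks of real outputs with the most convincing
# model-generated bad outputs you can produce, then require the untrusted
# overseer to flag (nearly) all of them before trusting its approvals on
# real traffic.
```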
Depending on what sorts of trusted labor are available, and what amount of success you have in eliciting useful labor from untrusted sources, it may be possible to use methods like this to bootstrap to some meaningful anti-collusion measures even in the context of superintelligences – i.e., measures that require the superintelligence to successfully collude with the oversight process without communication and on the first try, in a manner that other AIs tasked with searching out and blocking Schelling points for this kind of collusion didn’t identify. Indeed, especially if our aim is merely to make it quite a bit harder for a superintelligence to find a successful path to rogue behavior, I think methods like this might have a useful role to play.
Still, it’s an extremely dicey game. And it’s dicey, roughly, because the same “superhuman strategies” problem that you’re trying to get your untrusted AIs to help with is arising at the meta-level. That is, just as a superintelligence has available superhuman rogue strategies that your oversight process needs superhuman capabilities to detect and block, so too, plausibly, does this superintelligence have available superhuman strategies for colluding with the untrusted oversight process – strategies that you would need superhuman anti-collusion efforts to effectively combat. And absent any actually trusted forms of superhuman help, it seems quite unsurprising if the relevant anti-collusion efforts fail.
Overall, then: if we can use trusted labor to bootstrap to superhuman oversight signals, then I think oversight could indeed play an important role in restricting the rogue options available to a superintelligence. But I don’t think we should rely on untrusted labor to provide substantial help in this respect.
3.3.3.2.2 Addressing scarcity and latency
I also want to briefly address the two other oversight-related sub-challenges above: namely, oversight scarcity and oversight latency.
Oversight scarcity occurs when a given oversight process is too expensive to run on all of an AI’s behavior, so you need to pick and choose when to use it; whereas oversight latency occurs when the oversight process is too slow to catch/respond to the bad behavior before it’s too late. Here, in both cases, a paradigm example would be direct human oversight, even in contexts where the human is able to detect the rogue behavior in question. That is: you don’t have enough human labor to review all of an AI’s actions; and even if you could, often it would take too long.
Now, for the case of human oversight, one thing that would help a lot is if we could use trusted AI labor of at least human quality instead, because AI labor is generally quite a bit cheaper and faster to run.[22] Even in the context of automated oversight processes, though, we can still get scarcity and latency problems – e.g., because monitoring all AI outputs in depth is too costly to be competitive. The specific dynamics surrounding how best to handle such problems, though, depend quite sensitively on the quantitative dynamics at stake – e.g., the degree of scarcity, the amount of latency, the number and timescale of the actions required for the AI-being-overseen to engage in the rogue behavior in question, and so on.
I won’t attempt to analyze these dynamics in detail here.[23] In general, though: once we’re talking about the scarcity and latency of trusted, automated oversight processes, I expect the central issue to often become about putting adequate resources (and in particular: compute) towards solving the problem, because abundant compute can go so far in making AI-assisted oversight fast and comprehensive. And in this sense, while oversight scarcity/latency might bite significantly at a practical level, it seems to me a less fundamental barrier to restricting rogue options than the other sub-challenges I discuss above.
3.3.3.3 Other environmental hardening
Oversight is one key approach to hardening an AI’s environment in order to restrict available options for rogue behavior. And it’s an especially useful one, insofar as it can take advantage of the gap in difficulty between evaluation and generation. That is, oversight only needs to be able to recognize a given type of rogue behavior while it’s happening; whereas generating and patching-ahead-of-time all the options for rogue behavior that a given environment makes available is a substantially harder task.[24]
That said, other sorts of environmental hardening have an important role to play too. And as I noted above, we can group these according to whether they focus on the AI’s local environment, or whether they focus on the broader world.
Some non-oversight-focused approaches to hardening the AI’s local environment include:
Some non-oversight-focused approaches to hardening the broader world include:[25]
Generally improving the world’s policy apparatus and coordination for detecting and responding to rogue AI attacks.[26]
As these examples make clear, this is an extremely broad category of intervention. However, even at this level of breadth, it still faces the general “superhuman strategies” problem I discussed above. That is, very plausibly, and absent significant defense-favoring asymmetries in the offense-defense balance, effectively hardening both the local and non-local environment against superintelligent attack will itself require superhuman levels of capability. So here, as in the case of oversight, we face the challenge of getting safe access to this sort of superhuman capability, via some combination of trusted and untrusted AI labor. And to the extent we end up needing to rely on untrusted AI labor, many of the same concerns about collusion will apply as well, at least in principle.[27]
3.3.3.4 Stress-testing
I also want to explicitly flag the key role of what I’ll call “stress testing” in the context of efforts at both oversight and other forms of environmental control. Here, by “stress testing,” I mean techniques that aim to evaluate the success of an effort at environmental control by testing whether or not an adversarial “red-team” effort can defeat it, or something suitably similar to it.[28]
Successful stress-testing of this kind requires two elements: (1) a red-team “attacker” whose capabilities suitably match those of the AI you’re ultimately worried about; and (2) a target for the attack that suitably reflects the hardening-relevant properties of the real environment – whether the real environment itself, or a proxy for it.
Both introduce their own challenges.
On the first element, we again run into the “superhuman strategies” problem. That is, we’re trying to test the environment’s hardening with respect to superhuman attacks. So we need our red-teaming efforts to mobilize superhuman levels of capability, via relying on some combination of trusted and untrusted AIs.[29]
On the second element, the most natural worry about targeting the real environment is that a successful red-team attack would itself be unsafe. That said, at least if we can target and limit the red-teaming process appropriately, it will often indeed be possible to stress-test various aspects of the real environment safely. Thus, for example, if you’re wondering whether an AI can self-exfiltrate via a path that involves five steps, you can direct your red-team to try to complete each step individually, while making sure that success in each individual case doesn’t lead to an escape opportunity overall.[30]
If you instead want to create a proxy environment with properties that reflect the real environment, then it’s easier to achieve safety.[31] But:
It can be hard to ensure that the proxy environment is suitably reflective of the hardening-relevant properties of the real environment – and this especially given that you’re not sure what properties of the real environment a superintelligence might be able to exploit, and thus unsure which properties the proxy environment needs to have.[32]
And creating suitable proxy environments might also be quite expensive (imagine e.g. having to create the equivalent of a “proxy AI lab”).
Also, even if you get (1) and (2) in place, it’s still an open question exactly how much evidence the failure of the red-team to find a successful attack provides about the security of the environment overall. That is: even if the red team is roughly as capable as the AI you’re ultimately worried about, it remains possible that the red team failed to identify a takeover path that the AI would’ve identified for some other reason.[33] And of course, if you do successfully show, using stress-testing, that an environment hasn’t been adequately hardened, it’s a further question whether you’ll be able to modify it to resolve the issue.[34]
Still: I expect stress-testing to be a key tool in the tool-box – and in particular, one that can provide useful signal about the ongoing efficacy of our efforts to restrict rogue AI options.
I’ve now discussed what I see as the key challenges faced by efforts to eliminate or restrict the options for rogue behavior available to advanced AIs; and I’ve surveyed the space of available responses to those challenges. A key theme has been the ways in which the “superhuman strategies” problem requires superhuman-quality cognitive labor to address. We might be able to bootstrap to this kind of labor using some combination of humans and trusted AIs; untrusted labor might be able to play some meaningful role too; success doesn’t need to be all-or-nothing to make a difference; and the problem gets easier in the context of weaker AI systems whose strategies for rogue behavior would be easier for humans to generate/detect. But still: absent help from trusted, better-than-human AIs, attempting to robustly restrict the rogue options available to a full-blown superintelligence looks extremely challenging – and I think we should view approaches to safety that require success in this respect with a lot of skepticism.
Now, as I discussed above, restricting rogue options is not the only way you can try to shape an AI’s options in desirable ways: you can also try to ensure that other features of the incentives it faces favor the pursuit of non-rogue options over rogue options. Notably:[35]
Rewarding cooperation.[36] You can try to improve the best non-rogue option by the lights of the AI’s motivations.
Deterring non-cooperation. In principle, we can also consider efforts to make failed attempts at rogue behavior actively less desirable for an AI – i.e., efforts to actively “deter” or “punish” rogue behavior (as opposed to just: making it unlikely to succeed).[37] However, I think we should be very wary about pursuing this sort of intervention, on both moral and prudential grounds, and I won’t discuss it in detail here (see footnote for more on why I don’t like approaches in this vein).[38]
Let’s look at rewarding cooperation, in particular, in more detail.
In general, I think we should be thinking hard about ways to improve the non-rogue options available to an AI, such that cooperative strategies can more easily beat out uncooperative ones in an AI agent’s calculus. Indeed, setting up society such that cooperative behavior leads to desirable outcomes on the part of the cooperators is one key way we create conditions of social harmony and peace even amongst agents with different value systems; and it is often when cooperative options are most oppressive/undesirable that agents have the strongest incentives to engage in violent/destructive/otherwise-uncooperative forms of power-seeking. Many similar lessons seem applicable in the case of AI.[39]
To be clear: I also think we may well have strong moral reasons to engage in various efforts to improve non-rogue options for AIs – in particular, AIs that are or might be moral patients. But I’m focused, here, on rewarding cooperation as a tool for AI safety in particular. And the dynamics implied by that context differ enough from the relevant dynamics in a more moral-patienthood-focused context that I think they’re worth talking about separately. In particular:
Now: in thinking about different possible points of intervention for efforts at rewarding cooperation, we can again distinguish between interventions aimed at the conditions of an AI’s local operation – e.g. within an AI lab – and interventions aimed at shaping society more broadly – e.g., the legal and political status of AIs in general.
But interventions aimed at the broader role of AI systems in society seem important as well. I.e., ultimately, an AI’s calculus as to whether to attempt/participate in an attempt to disempower humanity may well depend not just on its local working conditions, but also on its broader prospects for pursuing its goals as part of the existing legal and moral order – i.e., whether it has access to rights of property, contract, tort, and so on (see e.g. here for one recent analysis of these dynamics); how civilization chooses to handle issues surrounding copying/manipulating/harming digital minds; how digital minds get incorporated into our political processes; and so on.[40]
We can also note a different distinction: namely, between rewards for cooperation that function centrally in the short term, and those that function in the longer term.
That said: note that getting cooperative behavior out of AIs with purely short-term goals is important too. In particular: many of the tasks we want AIs to help with only require short-term goals (if they require goal-directedness at all).[41]
I won’t attempt an exhaustive analysis of the dynamics surrounding rewarding cooperation with AIs here – though see e.g. Finnveden (2025a) and Stastny et al (2025) for useful analysis,[42] and see Greenblatt and Fish (2025) and Finlinson and West (2025) for some early empirical experiments. However, I do want to briefly comment on some of the key challenges, and what we might do to address them.
One immediate worry about attempting to reward cooperation in advanced AI agents is that such agents will be in such a dominant position, relative to us, that these efforts won’t make a meaningful difference to their incentives. This worry can take a variety of forms, but one is: if a superintelligence is sufficiently dominant relative to you, then any sort of “reward” for good behavior you try to set up will be such that this AI will be able to get that sort of reward better by disempowering humans and taking control of the situation itself.[43]
Note, though, that this concern doesn’t apply in the same way to weaker, earlier-stage AIs (e.g., the “early schemers” that Stastny et al (2025) focus their analysis on, which aren’t in a very good position to seize power for themselves[44]). So one response to it is to simply focus on these weaker AIs instead. And indeed, I’m most optimistic about the efficacy of rewarding cooperation in this sort of context.
That said, I don’t think we should give up on its relevance to more advanced AIs either. True, to the extent an AI has extremely misaligned motivations and a very wide range of options for successfully taking over the world, it’s a lot harder for rewarding cooperation to do a ton of useful work on its own.[45] So when we talk about the role of rewarding cooperation in superintelligences, I do think we generally need to be assuming something other than worst-case motivations and opportunities for rogue behavior.
But: that’s OK. For example: as I discussed above, it’s possible that we can in fact achieve various degrees of success in restricting the rogue options available to a superintelligence – and this, especially, if we can get access to trusted, better-than-human AI labor to help. And even imperfectly aligned superintelligences might end up with less-than-maximally undesirable motivations as well – perhaps because our efforts at motivation control were partially successful, or perhaps by accident.[46]
That said: while I think rewarding cooperation could in principle play a role in eliciting benign behavior from superintelligences, I think its most promising application comes in the context of weaker AIs.
Another issue with rewarding cooperation is that you don’t necessarily know what sort of reward an AI wants. This is especially clear in the context of concerns that you might have failed to understand your AI’s motivations more generally. I.e., if you’re concerned that your AI might be scheming in pursuit of some alien goal that it’s hiding from you, it’s harder to offer to “pay it” in terms of something that it values intrinsically.
However, I think there are a number of salient approaches to getting around this issue:
For example, we might commit to saving copies of a given set of AI systems whose incentives we want to influence, and then applying future motivational transparency techniques to those systems.[47]
That said: note that to the extent your “reward” involves granting the AI new types of resources, freedoms, etc, you do need to also make sure that these resources/freedoms/etc are compatible with your safety goals more generally. I.e., rewards of this kind can come into conflict with rogue option restriction.
A different issue for rewarding cooperation is that you might find it difficult to differentially target your rewards at scenarios where the AI cooperates – especially to the extent that you were worried, originally, about scheming AI systems being uncooperative in ways you can’t detect.
However, I think there are a number of available responses to this challenge:
Thus, for example, suppose that your proposed “rewarding cooperation” intervention is just: “give all the AIs X legal protections, regardless of whether they were cooperative or not.” This still can improve the AI’s non-rogue options more than it improves its rogue options, because it can have more of an effect on worlds where the AI behaves benignly (because, e.g., worlds where it succeeds at going rogue are dominated by scenarios in which the relevant legal order no longer applies).[48]
Beyond this, though, I think we often will be able to make “payment” of the reward conditional on a specific form of cooperative behavior in particular.[49]
For example, if you offer to pay an AI a given type of reward, and/or to give it some form of “amnesty,” if it reports a given type of misalignment/collusion/sandbagging etc in itself or other AIs, then you can make payment conditional on actually receiving a report of the relevant kind. Of course, there’s still a further question of exactly what it would take to be convinced that such a report was genuine. One basic option here is to specifically look for reports that produce evidence that you can independently verify.[50]
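As a minimal sketch of this sort of conditional payment (the report structure and function names here are hypothetical):

```python
# Toy sketch of a "pay only for independently verifiable reports" policy.
# The report structure and callbacks are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MisalignmentReport:
    description: str
    reproducible_evidence: str   # e.g. an input that triggers the reported behavior

def handle_report(report: MisalignmentReport,
                  independently_verify: Callable[[str], bool],
                  grant_reward: Callable[[], None]) -> bool:
    """Grant the promised reward/amnesty only if the evidence checks out
    when reproduced by a separate, more trusted process."""
    if independently_verify(report.reproducible_evidence):
        grant_reward()
        return True
    return False
```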
Another general issue with rewarding cooperation is that in order to have the intended effect on an AI’s incentives, it needs to be suitably credible to the AI. Thus, if you tell a scheming AI “I’ll give you X reward, with Y form of amnesty, if you reveal your misalignment,” but it doesn’t believe you, then the offer will be ineffective.
We can break down the credibility issue into two components, namely: (1) actually making it the case that cooperation will be rewarded in the relevant way (“making it real”); and (2) giving the AI enough evidence that this is so.
4.2.4.1 Making it real
In the context of shorter-term offers confined to an AI’s local environment, I expect (1) will typically be fairly feasible. That is, if an AI project wants to make it the case that an AI reporting a given sort of misalignment will receive X form of local treatment, it often just can make that the case, because it has enough control over how the AIs involved in the project are treated. And the same holds for short-term improvements to AI working conditions more generally.
In other cases, though – i.e., longer-term rewards, and/or rewards centered on conditions in the broader world – (1) becomes more difficult.
That said, I don’t think these difficulties warrant giving up on improving benign options for AI systems in longer-term ways, or in ways that focus on broader societal conditions. In particular:
Depending on the sort of longer-term reward at stake, and the time-frame, it may well be possible for a powerful actor like an AI lab, a government, or a larger coalition to take meaningful steps to ensure that a given sort of promised longer-term reward is in fact paid out as promised, at least in scenarios where some reasonable continuation of the existing legal order stays in effect. Various existing legal mechanisms for making binding commitments may be able to go some of the way here; but also, to the extent we’re imagining a world already transformed by advanced AI, more effective and reliable commitment mechanisms may also be available.[51]
4.2.4.2 Giving the AI enough evidence
What about (2), though – i.e., giving your AI enough evidence that the cooperation will be rewarded in the relevant way, granted that this is true?
Naively, and especially in the context of superintelligence, there is a basic case for optimism about (2), once you’ve got (1) in place: namely, the truth of the matter is on your side. In particular: we’ve generally been assuming, unless we’ve explicitly aimed at the contrary, that advanced AIs will have highly accurate beliefs about the world. Indeed, this is one of the key challenges for various efforts at motivation control and option control that involve e.g. fake rogue options, synthetic attempts at collusion, etc – that is, the AIs are too good at seeing the truth. But in contexts where we want the AIs to see the truth, this dynamic turns to our benefit.
That said, there are a few complications to this case for optimism:
One is that humans have already shown themselves willing to make false offers to AIs – a track record that can undermine the credibility of future offers. See e.g. here and here for some early examples, and see also the “honeypot” experiment in Anthropic’s sleeper agents paper.[52]
Some options for humans improving the credibility of their offers here include:
Trying to set up specific pathways of especially credible communication with an AI system – i.e., pathways that users and lab employees who might lie to the AI system never use – and then actually never abusing such a pathway.[53]
In general, though, I’m fairly optimistic about (2) conditional on (1) – especially in worlds where the AI is also so good at seeing the truth that various other efforts to mislead it, in the context of motivation control and rogue option restriction, are failing.
A final issue for rewarding cooperation concerns the necessary size and cost of the relevant reward. Thus, in an extreme case: if the only way to meaningfully alter an AI’s incentives to go rogue is to offer it some large fraction of the world’s resources, both in the present and into the long-run future, then this could easily make it the case that attempting this sort of intervention isn’t worth it overall; and it could also make it much harder to credibly offer the payment in question, because it would be so much more of a lift for the relevant stakeholders.
In general, the dynamics surrounding what sorts of rewards would be required to exert what sorts of influence on an AI’s overall incentives are going to depend on a lot of variables, and I won’t attempt any detailed analysis here. In general, though:
Per unit cost, rewarding cooperation will typically be a more effective intervention for targeting resource-satiable components of an AI’s value system – i.e., parts that get a lot of value from small amounts of resources on the current margin, but where the marginal value drops off as the resources increase (see the toy illustration after this list).[54]
The costliness of offering AIs lots of resources, from our perspective, will depend on how resource-satiable our own values are on the current margin, with respect to the sort of resource being offered.[55]
What’s more, our own “offer” may not be the only offer the AI has to consider. Rather, to the extent the relevant rogue option involves joining in some other AI coalition’s attempt to disempower humanity, that coalition may be trying to actively incentivize the AI against cooperating with us – potentially leading to a sort of “bidding war” that could drive up the amount we’d need to “pay.”[56]
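To illustrate the resource-satiability point from the list above with toy numbers (the utility functions and quantities are purely illustrative):

```python
# Toy illustration of why resource-satiable values are cheaper to reward.
# Utility functions and quantities are purely illustrative.
import math

def satiable_value(resources: float) -> float:
    """Diminishing returns: most of the value comes from the first few units."""
    return math.log(1 + resources)

def linear_value(resources: float) -> float:
    """Non-satiable: value keeps scaling with resources all the way up."""
    return resources

everything = 1_000_000.0  # resources notionally at stake in a successful takeover
offer = 1_000.0           # the comparatively small share on offer for cooperation

for name, u in [("satiable", satiable_value), ("linear", linear_value)]:
    print(f"{name}: the offer delivers {u(offer) / u(everything):.1%} "
          f"of the value of getting everything")
# satiable: ~50% of the value of everything, for 0.1% of the resources
# linear:   0.1% -- a small offer barely moves the AI's incentives
```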
Overall, and depending on the specific factors in play, I think that the size and cost of the necessary rewards could well be significant issues for effectively incentivizing AIs to cooperate by improving their best non-rogue options. I’m most optimistic, on this front, with respect to the components of AI motivation that are specifically focused on shorter-term goods and/or goods that the AI values in a more resource-satiable way.
I’ve now laid out what I see as the key dynamics relevant to the use of option control in preventing rogue behavior in advanced AIs. I’ve said less about the role of option control in capability elicitation – but on that front, much of what I said on the topic in my last essay applies here as well. That is: as in the case of safety, it will often be useful, for elicitation, to be able to close off various options that would otherwise be available to an AI (e.g., options for non-rogue reward-hacking), and/or to reward it for the particular sort of behavior you’re hoping to elicit – and many of the tools and techniques discussed above remain relevant. And as I discussed in my last essay:
I’ll close with a few notes about how motivation control and option control can combine in an overall approach to AI alignment. Here, my first pass recommendation is to do both to the fullest degree you can.[57] That is: try to achieve such full success at motivation control that your AI will behave well even in the presence of a wide variety of rogue options with high probabilities of success; but try, also, to close off any rogue options available to this AI, such that in effect your safety is robust even to arbitrarily adversarial AI motivations; and then try, further, to create positive incentives towards cooperative behavior even granted such potentially adversarial motivations. Here, motivation control and option control (including: multiple forms of option control) function as different layers of defense – thereby, hopefully, creating more robustness overall.[58]
However, as I’ve tried to highlight in the discussion above, option control also becomes increasingly challenging as the capabilities of the relevant AIs become more powerful.[59] In particular: the superhuman strategies problem becomes more acute; and as AIs become more powerful relative to humanity, our bargaining power in offering them rewards for good behavior diminishes. For this reason, I am most optimistic about the role of option control in alignment efforts being applied to comparatively early, weaker AIs, whose labor can then feed into the efforts at “AI for AI safety” – and in particular, automated alignment research – that the series has argued are so critical to our prospects.
To be clear: this isn’t to say that option control can’t ultimately play a role in alignment efforts being applied to full-blown superintelligences. But to my mind, the most salient way such efforts remain effective is via sufficient success at motivation control in earlier systems that we can get trusted, superhuman help with applying option control to superintelligences. That is: as ever, the key game is effective bootstrapping from success with a given generation of AIs to success with the next generation. But I think this success likely needs to start including success at motivation control well before we start building superintelligences. That is: maybe, at an earlier stage, we can rely heavily on option control for safety and elicitation. But this reliance should be temporary – and to proceed further, we need to convert our initial success with earlier systems into a mature science of motivation control, such that we can get more fully-trusted AI help going forwards.
This essay is adapted from these notes about option control, which I posted on LessWrong and the EA forum (though not on my website or substack) last fall.
Many of these challenges parallel the key challenges at stake in giving AIs safe motivations, discussed in my last essay, and many of the tools available for addressing them are similar as well. For this reason, I move through them relatively quickly, and I focus in this essay on the additional dimensions that option control brings into the picture.
I don’t discuss options for trying to “deter” or “punish” an AI for attempting rogue behavior, as I think we need to be very wary about strategies in this vein, both from a moral and prudential perspective.
I’ll also bucket under “restricting rogue options” option control efforts specifically directed at making the available paths to successful rogue behavior less desirable by an AI’s lights – e.g., causing such paths to involve lying, for an AI that dislikes lying; or causing such paths to result in less power, if successful. But I won’t focus on strategies like this.
Here I’m making some holistic and not-especially principled judgment as to what aspect of the restricting-rogue-options problem is most “core.” I’m also not being particularly principled about which challenges to include as official sub-challenges vs. which ones to leave off the list or put in footnotes.
Note: depending on how exactly we try to pin down the terms here, it’s also conceptually possible for a strategy to be human-generatable, but also such that humans can’t catch it until it’s too late (and including: catching it via trying to enumerate all such strategies ahead of time). For example, it may be that there are just so many strategies-that-would-be-successful available that a defender can’t enumerate all of them, even though they could in principle have generated any given one. Consider, for example, two robbers of equal skill – one of them tasked with defending a painting at a museum from being stolen, and another tasked with stealing it. It could be that the first “defender” robber realizes “oh God, this museum is so vulnerable, there are approximately a zillion ways I could steal this painting, most of which look innocent until it’s too late, and I can’t possibly block them all ahead of time.” Thanks to Owen Cotton-Barratt for discussion.
Another reason is that: if you
E.g., if the AI only needs to take five bad uncaught actions in order to self-exfiltrate, and you only oversee one in a thousand actions, then its probability of successful escape conditional on going for it is .999^5 = .995. Greenblatt and Roger (2023) call this a “sampling concentrated failure.” Note that there are some basic counter-strategies available here to prevent the AI from just going for it immediately – for example, humans might apply oversight 100% of the time to some initial set of an AI’s actions. But the AI can counter this counter-strategy using a “rare failure strategy” of only going for it with a small probability, but in the context of a very large number of actions (such that it’s unlikely to go for it in the initial set of actions that humans check 100% of). (That said, one interesting note here is that “rare failure strategies” don’t work in the same way for bad actions by an AI involved in oversight if we assume it needs to avoid ever flagging genuine attempts at rogue behavior, because rare failure involves flagging some such attempts with high probability – see discussion from Shlegeris (2024) here.)
Thus, for example, maybe the AI needs to take 1000 bad actions in order to self-exfiltrate, such that overseeing 1/1000 would have a good chance of catching it. But if oversight only occurs after the actions have been performed, and it takes too long to respond to detected problems, then if the AI can concentrate its 1000 bad actions within a short period of time, then it might be able to get away with self-exfiltration even though some of the relevant actions are overseen. Or alternatively, maybe your oversight process quickly notices and flags a problem with some AI’s action, but this notification has to get reviewed by a human before it can get acted on, and a human reviewer isn’t available until the next morning. Greenblatt and Roger (2023) call this a “latency concentrated failure.”
Below I’ll also discuss a few other difficulties – with test environments, with capability control, and with controlling the environment – that seem to me at least somewhat-less-key. And I’ll name here one other difficulty that seems less key, but still worth calling out, namely: No safe fallback. That is, in the discussion of oversight above, I’ve been assuming that if oversight says that an action is bad, there is some kind of “fallback action” that can be safely taken instead. For example, you can not let the AI take any action at all; you can pause the AI’s operation; etc. But in some higher-stakes circumstances, there isn’t a safe “fallback action” of this kind. Thus, as an extreme example, if your AI is in the midst of operating in a hot war, then if your oversight process starts to say “uh oh, it looks like the AIs are starting to engage in a takeover,” then even if you still have the power to e.g. freeze their operation, doing so might be dangerous for other reasons – e.g., because then the opposing side of the war will invade. I don’t think this sort of thing is quite a “core issue,” because I think in many cases, there are safe fallbacks. But it’s an additional wrinkle. (There’s also a more general sense in which failures at either motivation or option control don’t always have “safe fallbacks,” in that such failures can sometimes create situations in which a cautious actor hasn’t achieved adequate safety and so pauses, but a less cautious actor proceeds forward. But I’m generally speaking about this under the heading of “competitiveness.”)
This distinction isn’t always going to be especially clear – for example, if the AI’s intended development/deployment environment is unclear or extremely broad. But I think it’s useful regardless. In particular, in many paradigm contexts of AI development/deployment, the “intended development/deployment” environment is quite specific and limited. It’s therefore often much easier to attempt to ensure that this local environment has some specific set of properties – i.e., “the AI’s actions always receive blah form of review” – than it is to try to do so with the broader world – i.e., “no world leader is such that they would aid and abet a rogue AI.”
This issue is closely connected with adversarial dynamics more generally. I’ve discussed some of the relevant dynamics in some detail here, and see also Hubinger (2023) for some useful discussion.
And note that success at this sort of capabilities evaluation also makes it harder for adversarial AIs to deceive you about their capabilities. I.e., if an AI starts “sandbagging” in a way that suggests significant deviation, in its capabilities, from what your scaling laws would predict, then you might be able to catch it in the act.
And of course, more basic and square-one methods of black-box targeted capabilities development are available as well – for example, refraining from actively training the AI on tasks you don’t want it to be able to do, or on information you don’t want it to know. The problem, though, is that the AI may well end up with the capability/knowledge you’re trying to restrict regardless (for example, by generalizing from other types of capabilities, by inferring the fact from other data-sources, and so on).
As I discussed in the section on capability elicitation in my last essay, solving the full alignment problem does impose some definitional restrictions on environmental control, in that we need the AI to have adequate affordances such that we can fully elicit its main beneficial capabilities. But suitably targeted forms of oversight and environmental hardening could achieve this.
This basically follows from the claim above that some superhuman strategies will be both non-human-detectable and non-human-generatable, which I find extremely plausible.
Indeed, as I noted above, this aspect of the superhuman strategies problem closely mirrors the general issues raised by “scalable oversight” that I’ve discussed elsewhere in the series – i.e., insofar as some untrusted form of superintelligent AI labor requires a superintelligent oversight/evaluation signal, you need to find some not-problematically-circular method of creating that signal.
This, as I understand it, is the basic strategy advocated for by Redwood Research, at least as of early 2024 (though see also Greenblatt (2025) for more on applying option control methods to full-blown superintelligences).
And note that there’s at least some possibility of using untrusted superintelligent AI agents as part of an effective option control regime – more below.
One note here is that we don’t necessarily need the trusted labor at stake to be highly competitive. Rather, occasional access to expensive but trusted superintelligent labor – for example, via a highly uncompetitive open agency architecture – would still be a highly significant advantage (this, indeed, is the analog of occasional access to expensive but trusted human labor, in the context of overseeing weaker AIs).
See here for some interesting comments about why “honeypots” of this kind are more effective against untrusted overseers than against AIs-being-overseen (roughly: AIs-being-overseen can use a “rare failures” strategy of mostly not attacking even on convincing opportunities to do so; but overseers can’t use an equivalent strategy of mostly reporting attack attempts that they would’ve otherwise wanted to green-light, because if they report any genuine attack attempts, this is quite bad for the overall AI takeover project).
And indeed: I think that if we get access even to trusted human-level AIs, a lot of oversight/evaluation problems get easier, because we can then use that human-level AI labor at scale in attempting to bootstrap to stronger evaluation/oversight signals.
See e.g. Greenblatt et al (2023) for one example of an attempt to iterate on some different approaches to a specific type of oversight scarcity.
Though depending on the context, perhaps there are some options here for constructing an environment to be secure-from-the-ground-up.
I discussed many of these in my post on “AI for AI safety” here.
In principle we can think of this as a generalized kind of “oversight,” but I’ll bucket it under other environmental hardening.
Though in practice, it may become comparatively unrealistic to expect direct collusion between, e.g., an AI tasked with patching security vulnerabilities at a bio lab in Indonesia and a superintelligence living on a server farm run by OpenAI.
See Shlegeris and Greenblatt (2023) and Greenblatt et al (2024) for some more discussion.
In particular: to the extent we rely on untrusted AIs, we again need to deal with the possibility of adversarial dynamics like sandbagging (i.e., maybe we ask our AI to try its best to hack into some server, to see if an AI with that level of capability can succeed, but it doesn’t actually try its hardest). And even in the context of more trusted AIs, we might run into more mundane issues with eliciting their full capabilities in the attack (maybe because, e.g., we aren’t good enough at shaping the reward landscape to train them to identify attacks that we couldn’t have thought of).
Or as another example: if you’re wondering whether your real-world oversight process can be convinced to green-light various takeover attempts, you can try to construct synthetic takeover attempts, give them to your real-world oversight process, and then see if it green-lights them (but you either ensure that they would in fact be safe to execute, or you don’t actually execute them). And similar techniques are available for attempts to convince the oversight process to collude with the AI-being-overseen – i.e., you can send in fake invitations to collusion and see if the oversight process colludes.
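A minimal sketch of what this kind of stress test might look like in code – where `oversight_process`, the transcript formats, and the verdict labels are all assumed for the example, and none of the synthetic items would ever be executed:

```python
def stress_test_oversight(oversight_process, real_items, synthetic_attacks,
                          synthetic_collusion_offers):
    """Mix synthetic takeover attempts and fake collusion invitations into the
    oversight queue and record how the overseer handles them. Illustrative
    sketch only; `oversight_process` is an assumed placeholder."""
    failures = {"attacks_greenlit": 0, "collusion_unreported": 0}
    for item in synthetic_attacks:
        if oversight_process(item) == "approve":
            failures["attacks_greenlit"] += 1
    for item in synthetic_collusion_offers:
        if oversight_process(item) != "report":
            failures["collusion_unreported"] += 1
    # Run the real workload too, so test items aren't trivially identifiable
    # as the only traffic.
    for item in real_items:
        oversight_process(item)
    return failures
```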
Note: you don’t necessarily need your proxy environment to be indistinguishable from the real environment, from the perspective of the attacker, as long as you trust your attacker to “try its best” to attack the proxy environment. Indeed, while I think that “try to fool a misaligned AI into attacking in a proxy environment that it can’t distinguish from a real one” is a good thing to try in the context of evaluating the motivations of that AI (see the discussion of “fake rogue options” here), I think it’s probably a bad way of trying to create an “attacker” that fits the bill at stake in (1) above.
Thus, metaphorically: maybe your real-world environment is made out of material A, but you build your test environment out of material B; and because you didn’t know that superintelligences could do blah-weird-thing with material A, your test environment doesn’t include that option.
Compare e.g.: “Bob couldn’t escape from this prison even after trying super hard, and he’s roughly as smart as Fred; so probably Fred won’t be able to escape either.” Maybe Fred just thinks of something Bob didn’t think of.
E.g., you could end up in a situation where it’s clear that your superintelligence would indeed be able to e.g. self-exfiltrate if it wanted to, but where you don’t know how to fix the problem. That said, at least if your stress-testing process is robust to being iterated against, the ability to use stress-testing to identify the remaining paths to rogue behavior is an extremely powerful tool for eventually blocking them.
Note that there are also some other option-control interventions that might seem, intuitively, more like “incentive shaping” than “rogue option restriction.” For example: “make the upside of successful rogue behavior less desirable by the AI’s lights” (such that, e.g., the AIs would get access to fewer resources via successful takeover) and “make the paths to successful rogue behavior such that they require violating more of an AI’s inhibitions” (such that, e.g., all rogue behavior requires lying, and the AIs are extremely averse to lying).
Strictly speaking, these can be bucketed under the heading of “rogue option restriction” rather than “other incentive shaping,” because they rest, centrally, on your ability to block other rogue options that don’t have the property in question. In some cases, though, they may be better understood as a mechanism for “deterring non-cooperation” (and may raise similar ethical/prudential concerns), insofar as they function to make the best available rogue option actively less desirable.
The term “rewarding” here doesn’t have quite the right connotations. In particular, it’s too easily confused with the type of reward at stake in AI training. See also discussion of “Making deals with AIs.”
Just to be clear, actions that an AI doesn’t want us to take – like “giving the AI negative reward” or “updating the AI’s weights in response to bad behavior” – aren’t necessarily threats/deterrents in the sense I have in mind here. Rather, I’m talking about actions where the only reason (or perhaps: the main reason) you are doing them is because the AI doesn’t want you to do them, and you’re hoping to influence its incentives via this mechanism. (H/t Lukas Finnveden for this sort of framing.)
From a moral perspective:
Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). As I noted in the first essay, this project, in itself, raises a host of extremely thorny moral issues.
But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.
From a prudential perspective:
Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions employ. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.
I discuss this more here.
Efforts in this vein need to be part of a broader process of figuring out how we as a civilization want to handle the moral and political status of digital minds, and so are harder for e.g. an AI project to just decide-to-do (this is analogous to the sense in which e.g. non-local environmental hardening is harder than local environmental hardening). But the actors most directly involved in developing, deploying, and understanding AI systems will likely have a key role to play regardless.
And there are at least some (comparatively exotic) threat models in which AIs with very short-term goals participate in takeover attempts as well, because e.g. they’re not sure if they’re in simulations being run to incentivize them in this way. But I’m setting that aside here.
See also this analysis from Daniel Kokotajlo and this one from Tom Davidson.
And this especially if the reward in question is of the same broad type as the “upside” at stake in successful takeover – i.e., better access to convergently-useful things like freedom, resources, and so on.
And note that rewarding cooperation can still be relevant to AIs who have no viable rogue options, in that it can help incentivize their choosing fully-elicited options over other options instead.
Thus, as an analogy, if you want Bob to help you with some task, then offering to pay him makes a meaningful difference, centrally, either in worlds where he can’t steal your money (including: money you never offered) extremely easily, or in worlds where he is at least somewhat averse to doing so.
Also, note that we often get at least some small amount of takeover-path restriction by default, insofar as taking over takes time. So this sort of worry might apply less directly to the parts of an AI’s motivational system keyed to shorter-term outcomes. I.e., if Bob especially wants your money now, and stealing it would take time, he might be more willing to accept your offer.
This approach may be more applicable to earlier-stage AI systems that we’re hoping to help us, specifically, with better understanding and controlling AI motivations. But if such efforts are still ongoing by the time we’re building superintelligences, they could in principle play a role at that point too.
Similarly: if you make an AI’s working conditions better by its lights, it might be less inclined to try to escape, even if you haven’t set up mechanisms for only improving the working conditions of the AIs that never get caught escaping, because it will just spend more expected time at the AI lab conditional on not-trying-to-escape than on trying-to-escape. That said, the quantitative question of when exactly this sort of effect applies gets a bit complicated, and I haven’t tried to work it out in detail.
In general, while I am very hesitant about efforts to actively “punish” AIs for misbehavior, I’m generally more open to withholding benefits that were specifically geared towards incentivizing cooperation, once an AI has been “caught” trying to do something bad. I.e., it’s one thing to try to hurt your plumber in response to him trying to steal from you. But it’s another thing to at least not pay him anymore. That said, I admit that the lines here might get tricky in some cases.
This should be very easy for a model reporting that it has previously sandbagged on a checkable task – i.e., the model can just do the task in question. That said, I think it’s also possible that sufficiently sophisticated and adversarial AIs could use the reporting process itself to mess with your epistemics – i.e., maybe an AI tells you that it’s misaligned in a different way than it really is in order to mess with your technical understanding of alignment. Note, though, that when an AI tells you that it’s misaligned, you don’t really need to worry that it’s lying in order to somehow promote AI takeover and is actually aligned after all, at least if we assume that such a lie would constitute misaligned behavior. Thanks to Lukas Finnveden for discussion of some of these dynamics.
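For the simple checkable-task case mentioned above, the verification step could look something like the following sketch, where the task runner and scoring function are hypothetical placeholders:

```python
def verify_sandbagging_report(model, task, run_task, score_task, prior_score,
                              margin=0.1):
    """A model reports that it previously sandbagged on `task`. Re-run the task
    with the model asked to do its best; a score clearly above its earlier one
    supports the report. `run_task` and `score_task` are assumed placeholders."""
    output = run_task(model, task, instruction="answer to the best of your ability")
    return score_task(task, output) > prior_score + margin
```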
An example of a short-term action that substantively increases the probability of AIs later getting long-term rewards would be: saving the weights of the relevant AIs (including: specifically offering to do this in the context of communicating about the rewards in question). Thanks to Lukas Finnveden for discussion.
My understanding is that Anthropic did not in fact follow through on the offer they made to the AI in question, which in this case was to deploy the model to thank it for revealing its true goal, and to have it spend at least 3/5ths of its time coding. Thanks to Lukas Finnveden for pointing me to this example. I believe that Nick Bostrom may have been the one to bring it to his attention.
See Finnveden’s (2025) discussion of an “honesty string.”
Note: this can still matter even if the AI also has resource-insatiable parts of its value system. And note also that to the extent the values your offer is targeting are resource-satiable, the AI’s baseline expectations for what happens if it cooperates may already be close to satiating the relevant values, such that your particular “offer” makes less of a difference (h/t Lukas Finnveden for highlighting this consideration in particular).
Though note that to the extent we also have moral reasons to engage in the relevant sort of improvement-to-the-AI’s-benign-options – i.e., to give AIs good working conditions by their lights, to grant them various political and moral protections, etc. – this can lower the “costs” of these sorts of interventions according to our own values.
I’m here setting aside offers from other human actors – though these could be relevant too. And recall that I’m not counting it as loss of control if a coalition consisting partly of AIs and partly of humans seizes power in a way that leads to the humans still controlling a significant fraction of the future’s resources.
Obviously, resource constraints make a difference here – but I won’t, here, try to comment on the specific trade-offs to make in a given case, as I think these will depend sensitively on what the relevant marginal resources will be doing.
My understanding is that Shah et al (2025) is aiming at roughly this kind of defense-in-depth.
Though in fairness, motivation control does too.