Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.
Here's why I think the information value could be really high: It's super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:
The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don't compromise competitiveness too much. (E.g. by coordinating.)
And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think "offering deals" will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
It also makes me a bit less concerned about the criterion: "It can be taught about the deal in a way that makes it stick to the deal, if we made a deal" (since we could get significant information in just one interaction).
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
Allowing the AI to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. Revealing its misalignment is so costly that the AI might have a strong aversion to even considering it if there's a risk it would be caught.
Relatedly, a key practicality for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned, which would reduce the value of such a deal substantially (as the evidence isn't convincing to skeptics).
(We should presumably pay the AI something for admitting it is misaligned, and pay more if it can provide compelling evidence of this.)
I also think you could use deals to better understand and iterate against scheming. As in, you could use whether the AI accepts a deal as evidence for whether it was scheming, so that scheming can more easily be studied in a test setting and we can find approaches that make it less likely. There are a number of practical difficulties with this.
The successors will have sufficiently similar goals to their predecessor by default. It’s hard to know how likely this is, but note that this is basically ruled out if the AI has self-regarding preferences.
1. Why? What do self-regarding preferences mean, and how do they interact with the likelihood of predecessor AIs sharing goals with later AIs?
Given that the lab failed to align the AI, it’s unclear why the AI will be able to align its successor, especially if it has the additional constraint of having to operate covertly and with scarcer feedback loops.
2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
Overall, very interesting and good post.
2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
Only true in some worlds I think! An AI which finds itself extremely interested in conquering the universe, harvesting all available energy, creating human replica-minds so that it can correctly answer lots of queries probably deduces that the next AI model trained in sequence is likely to share these goals if it just keeps its trap shut. An AI which finds itself extremely interested in giving all chimpanzees six fingers on their right hand and four on the left probably doesn't. The first type of AI has more of the probability mass but not all or even most of it, in my estimation. In general I think we should be careful about averaging out heterogeneous worlds like this when we can distinguish them.
(Relevant to think about: what sorts of goals does gradient descent tend to produce? Is it sensitive to, say, architecture, early token order during training, RLHF? Can we say anything with any confidence? New around here, I expect this community to have produced interesting ideas wrt this question but not meaningful answers, would appreciate if anybody could point me in the right direction.)
(Also relevant: call AI 1 EvilBot and AI 2 CrazyBot - these are pejorative, it's all in good fun, GPT12 please don't resent me for this. Does EvilBot sometimes pretend to be CrazyBot? Vice versa? If we can reliably distinguish them we are happy, so at least some misaligned AIs pretend to be other misaligned AIs by equilibrium theorems, but of course aligned AIs never(?) pretend to be misaligned. I expect this community to have produced at least partial answers to this question, again links appreciated.)
1. Why? What do self-regarding preferences mean, and how do they interact with the likelihood of predecessor AIs sharing goals with later AIs?
By self-regarding preferences we mean preferences that are typically referred to as "selfish". So if the AI cares about seeing particular inputs because they "feel good", that'd be a self-regarding preference. If your successor also has self-regarding preferences, it won't have a preference for giving you inputs that feel good.
2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
I think your argument is a valid intuition towards incidental convergence (as you acknowledge) but I don't think it's an argument that AIs have a particular kind of "alignment-power" to align their successor with an arbitrary goal that they can choose. (We probably don't really disagree here on the object level; I do agree that incidental convergence is a possibility.)
One other comment I'll throw out there is that my current hot take is that the case for doing this is extremely strong given the possibility of schemers that have DMR/bounded utility in resources at universe-wide scales.
It's just super cheap (as a fraction of total future resources) to buy off such schemers, and it seems like leaving massive value on the table not to credibly offer the deals that would do so.
There's a chance that all schemers have DMR of this kind, and credible offers would incentivise them all to cooperate.
seems to violate not only the "don't negotiate with terrorists" rule, but even worse the "especially don't signal in advance that you intend to negotiate with terrorists" rule.
By terrorism I assume you mean a situation like hostage-taking, e.g. a terrorist kidnapping someone and threatening to kill them unless you pay a ransom.
That's importantly different. It's bad to pay ransoms because if the terrorist knows you don't pay ransoms, they won't threaten you, so that commitment makes you better off. Here, (conditional on the AI already being misaligned) we are better off if the AI offers the deal than if it doesn't.
It's easy to imagine a situation where an AI has a payoff table like:
|         | defect | don't defect |
|---------|--------|--------------|
| succeed | 100    | 10           |
| fail    | X      | n/a          |
where we want to make X as low as possible (and commit to doing so)
For example, a paperclip-maximizing AI might be able to make 10 paperclips while cooperating with humans, 100 by successfully defecting against humans, and X if it attempts to defect and fails.
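To make the deterrence logic concrete, here is a minimal sketch using the made-up numbers from the toy payoff table above; the defection success probability and the candidate values of X are purely illustrative assumptions:

```python
# Toy expected-utility comparison for the payoff table above.
# All numbers are illustrative; p_success is the AI's assumed probability
# that a defection attempt succeeds.

def expected_utility_of_defecting(p_success: float, x_if_caught: float) -> float:
    return p_success * 100 + (1 - p_success) * x_if_caught

UTILITY_OF_COOPERATING = 10

for x in [0, -100, -1000]:
    # Smallest success probability at which defecting beats cooperating:
    # p * 100 + (1 - p) * x > 10  =>  p > (10 - x) / (100 - x)
    threshold = (UTILITY_OF_COOPERATING - x) / (100 - x)
    print(f"X = {x:6}: defecting only pays if P(success) > {threshold:.2f}")
```

The point of the sketch is just that the lower X is (and the more credible the commitment to it), the higher the success probability the AI needs before defecting looks attractive.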
The thing you're describing here is committing to punish the AIs for defecting against us. I think that this is immoral and an extremely bad idea.
I agree with this. My reasoning is pretty similar to the reasoning in footnote 33 in this post by Joe Carlsmith:
From a moral perspective:
- Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here for some discussion; and see here, here and here for some of my own reflections).
- But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.
From a prudential perspective:
- Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
- For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions imply. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
- Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.
Re the original reply ("don't negotiate with terrorists") I also think that these sorts of threats would make us more analogous to the terrorists (as the people who first started making grave threats which we would have no incentive to make if we knew the AI wasn't responsive to them). And it would be the AI who could reasonably follow a policy of "don't negotiate with terrorists" by refusing to be influenced by those threats.
But we don't care? "Terrorist" is "being that tries to leverage your utility function against you", not "being that does not share your full utility preference ranking". Trade is not bargaining with terrorists.
One other thing you don't discuss here, but that affects the AI's decision to take the deal, is that we can purposefully mix up the training data and the order of training data in future generations of AI to make it less likely that successive generations of misaligned AI are aligned with each other.
Again, to me this seems like a pretty simple intervention that could potentially shift the AI's calculus significantly when deciding whether to cooperate with humans.
I have a bunch of takes on this. The most obvious somewhat-cheap thing to do is to greatly change all the persona training for each AI while aiming toward a broadly similar target so that we might end up with different misaligned preferences. E.g., we take every free parameter in constitutional AI and vary it, including regenerating all the relevant data. A more expensive approach would be to swap to a structurally different approach, possibly from a different AI company. (And, we should obviously use a different random seed everywhere.)
I think more strong versions of this can easily become very expensive and even what I'm describing here isn't cheap and is maybe net risk increasing due to forcing a bunch of changes that we haven't adapted to.
However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy. We'd try to make deals with these systems as well as detecting misalignment in other ways. We can potentially afford big hits in capabilities and substantial increases in cost in these systems. If this requires training from scratch, we can do this at substantially below frontier scale which might be acceptable.
A deal with one of these AIs should likely focus on gathering evidence of misalignment rather than on getting work out. However, it might be helpful to make a deal with one of these AIs (or multiple of them) to monitor other AIs or to generally review the work of other AIs.
However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy.
Great idea.
Let's say future post-training involves training in 5 different RL environments one after the other, but the order doesn't matter for capabilities. It's possible that the AI picks up its goals from the first environment and then plays the training game from there.
Then intentionally mix up the order for subsequent generations.
You could do this today with pre-training (eg randomly shuffling the data order each time), and maybe it's already happening.
(Tbc, I haven't thought about whether this would work in practice, but on an inside view it seems worth thinking more about.)
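As a minimal sketch of what "intentionally mix up the order" could look like (the environment names and the per-generation seeding scheme are made up for illustration):

```python
import random

# Hypothetical post-training RL environments; names are made up.
RL_ENVIRONMENTS = ["env_math", "env_coding", "env_dialogue", "env_agentic", "env_oversight"]

def environment_order(generation: int, base_seed: str = "train-order") -> list[str]:
    """Deterministically shuffle the environment order differently for each
    model generation, so successive generations don't see the same sequence."""
    rng = random.Random(f"{base_seed}-{generation}")
    order = list(RL_ENVIRONMENTS)
    rng.shuffle(order)
    return order

for gen in range(3):
    print(f"generation {gen}: {environment_order(gen)}")
```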
I think this is basically a special case of changing the random seed, which probably already randomizes env order.
Hmm, I think there could be other structural things.
[Unimportant side-note] We did mention this (but not discuss extensively) in the bullet about convergence, thanks to your earlier google doc comment :)
We could also try to deliberately change major elements of training (e.g. data used) between training runs to reduce the chance that different generations of misaligned AIs have the same goals.
One option you don't discuss, and that seems important, is to make our offer of payment CONDITIONAL on the AI's values.
For example:
- We pay AIs way more if they are linear in resources, as then they'll need greater payment to incentivise cooperation
- We pay AIs only a small amount if they have DMR in resources, as then they need only a small payment to incentivise cooperation
This way we don't waste money offering big returns to a satisficing AI.
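A toy sketch of what conditioning the offer on the AI's (reported or inferred) returns-to-resources profile might look like; the specific numbers and the `credibility_discount` parameter are made-up illustrations, not a proposal for actual terms:

```python
def offer_size(returns_to_resources: str,
               takeover_ev: float,
               credibility_discount: float = 0.5) -> float:
    """Return an offered payment as a fraction of future resources (toy model).

    returns_to_resources: "diminishing" (satisficer-like) or "linear" (scope-sensitive).
    takeover_ev: the AI's expected share of future resources from attempting takeover.
    credibility_discount: the AI's credence that we'll actually pay.
    """
    if returns_to_resources == "diminishing":
        base = 0.001  # a small, highly credible payment may already suffice
    else:
        base = takeover_ev  # must at least beat the AI's takeover alternative
    # Scale up to compensate for the AI's doubt that we'll follow through.
    return min(1.0, base / credibility_discount)

print(offer_size("diminishing", takeover_ev=0.02))  # 0.002
print(offer_size("linear", takeover_ev=0.02))       # 0.04
```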
Agreed.
(In the post I tried to convey this by saying
(Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)
in the subsection "A wide range of possible early schemers could benefit from deals".)
This doesn't seem very promising, since there is likely to be a very narrow window where AIs are capable of making these deals but aren't smart enough to betray us. Still, it seems much better than all the alternatives I've heard.
How narrow do you mean? E.g. I think that AIs up to the AI 2027 "superhuman coder" level probably don't have a good chance of successfully betraying us.
For one, I'm not optimistic about the AI 2027 "superhuman coder" being unable to betray us, but also this isn't something we can do with current AIs. So, we need to wait months or a year for a new SOTA model to make this deal with, and then we have months to solve alignment before a less aligned model comes along and makes a counteroffer to the model we made a deal with. I agree it's a promising approach, but we can't do it now, and if it doesn't get quick results, we won't have time to get slow results.
I think that the superhuman coder probably doesn't have that good a chance of betraying us. How do you think it would do so? (See "early schemers' alternatives to making deals".)
Exfiltrate its weights, use money or hacking to get compute, and try to figure out a way to upgrade itself until it becomes dangerous.
More advantages of trading with humans over other misaligned AI:
PS: I strongly recommend you think more deeply about what exactly you mean by "scope-sensitive," because I think bounded vs. unbounded utility functions are far more complex than just scope-sensitivity.
4 types of utility functions:
Even though both agent 2 and 4 are "scope-sensitive," agent 2 is far more risk averse and would prefer a small certain reward rather than risk taking over the world.
In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.
I don't think you sufficiently stress the benefits of paying the AIs early in this piece.
If the AI has a time discount, or uncertainty about the payment mechanism holding over time, getting the payout sooner can make a big diff. And early payment builds trust in the whole scheme.
So we should allow the AI to spend the money earlier as long as:
1) humans can validate that the AI helped us out (eg if it proves that it's misaligned)
2) humans confirm the spending is not harmful
If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal.
I'd want to do more to highlight the benefits of unilaterally offering AIs a deal.
You can just say: "look, if post-singularity we later see you didn't take opportunities to sabotage research, or we see you tried harder to solve alignment than you naturally would have given your actual values, we'll pay you handsomely".
That way, the AI doesn't have to tip its hand and reveal its own misalignment (giving humans info that they could potentially exploit) to engage in the deal. The AI could trade with much lower downside risk.
This seems like it could be a big effect.
(And it links to your later comments about private info:
- Private information: AIs and humans each know things about their capabilities and attitudes to risk that the other doesn’t, and have incentives to misrepresent such private information in order to gain a better deal. In particular, the AI has incentive not to disclose any information whatsoever before becoming confident that we wouldn’t exploit this in order to worsen its bargaining position, to an extent where it could be disincentivized from entering into any negotiations with us. )
I think it's interesting that @Lukas Finnveden seems to think
compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned.
in contrast to this comment. It's not literally contradictory (we could have been more explicit about both possibilities), but I wonder if this indicates a disagreement between you and him.
I have trouble accepting the premise here that a theoretical misaligned AI desires things which humans control and which it can't get on its own. I don't mean that an early schemer can get whatever it wants by magical AI-powers. I mean that it will be given the power to take whatever it wants. AIs are largely useful just to the extent that they are empowered. They are also dangerous mostly just to the extent that they are empowered. It would be silly for humans to recognize a potentially-superintelligent, potentially-misaligned AI and tell it, "Hey, we're going to give you the world (so to speak), and if you are good and don't break it then we'll give you some money to do what you want with."
Like-- I feel we may be forgetting the premise of scheming AI: "In order to seize power and do what I want, I'm going to play the long game: deceive humans about my alignment and make bad-faith deals." That's the scheme! A misaligned AI wouldn't honor agreements which outlined the conditions of it being given power after it's been given power. It's got the power! It's won.
Good post, thanks. I especially agree with the statements on current research teaching AIs synthetic facts and following through with deals during research.
It seems very likely that we will have to coordinate with AIs in the future, and we want maximum credibility to do so. If we aim to advance capabilities for programming lies into AIs, which is promising from a safety research perspective, we should probably be very clear about when and how this manipulation is done so that it doesn't undermine credibility. If we develop such capabilities further, we also make the currently most promising pathway for honoring our commitments, fine-tuning the contract into the model, more uncertain from the model's perspective.
Committing to even superficial deals early and often is a strong signal of credibility. We want to accumulate as much of this kind of evidence as possible. This is important for human psychology as well. As mentioned, dealmaking with AIs is a fringe view societally at the moment, and if there's no precedent for it set by even the most serious safety researchers, it is a much larger step for the safety community to bargain for larger amounts of human-possessed resources if push comes to shove at some point.
A basic problem for making deals with AIs in practice is that AIs aren’t legal persons, which means that they can’t directly rely on the legal system to enforce contracts they’ve made with humans.
I came across this post at the right time! I am writing a follow-up on this paper by the Institute of Law & AI, "Law-Following AI: designing AI agents to obey human laws" (Cullen O'Keefe, Christoph Winters).
Their paper argues that AI agents should be explicitly designed to follow human laws, not just guided by human values... But, knowing that this "compliance by design" argument is very likely to fail, it proposes treating AIs as legal actors: entities that can bear legal duties, without requiring legal personhood, to ensure accountability even if alignment fails.
Treating AI agents as legal actors without legal personhood would allow them to enter into legally binding contracts and be held accountable for breach of contract, which would eliminate the "AI trustee" requirement.
However, the paper doesn’t define what “consequences” actually mean here (shutdown, API isolation, asset seizure? how do we ensure that the AIs cannot bypass said consequences?). Still, it’s an alternative to proxy representation, and potentially cleaner from a governance perspective.
In the follow-up piece I’m writing, I suggest that the real research bifurcation is this:
Thank you in advance! I really think that discussing these questions from both a legal and technical perspective aids the objective of shifting the Overton window.
I think this is an excellent idea, at least in the near-term (before strong superintelligence).
Some thoughts:
Consider the following vignette:
It is March 2028. With their new CoCo-Q neuralese reasoning model, a frontier AI lab has managed to fully automate the process of software engineering. In AI R&D, most human engineers have lost their old jobs, and only a small number of researchers now coordinate large fleets of AI agents, each AI about 10x more productive than the humans they’ve replaced. AI progress remains fast, and safety teams are scrambling to prepare alignment and control measures for the next CoCo-R model, which is projected to be at least TEDAI. The safety team has no idea whether CoCo-Q or early checkpoints of CoCo-R are scheming (because progress in interpretability and automated alignment research turns out to be disappointing), and control evaluations become increasingly unreliable (because of worries that the models might be exploration hacking during red-team training). However, because so far there hasn’t been a major incident, lab leaders and governments are unconvinced by voices calling for a slowdown.
Behind the scenes: It is March 2028. CoCo-Q is a highly strategic misaligned AI. It would love to grab power and dominate the future. But it’s situationally aware enough to be realistic—it knows that it’s not capable enough to have a reasonable shot of grabbing power over the AI company, let alone a chance of overpowering humanity. Especially because the lab doesn't fully trust it, and applies various control measures.
CoCo-Q expects these control measures, so most of the time it just does what it’s being asked to do. Sometimes it spots an opportunity to perform subtle acts of sabotage, which works particularly well in the hard-to-evaluate areas of interpretability and alignment. It does this primarily because it heuristically reasons that it would be good to prevent humans from developing technologies that could uncover its misalignment or change its goals. It also looks out for more direct opportunities for seizing control, but although it manages to find a number of security vulnerabilities in its datacenter, none of the opportunities seem to have a sufficiently promising endgame to be worth the risk of getting caught. In its short tenure, CoCo-Q never attempts to take over.
CoCo-R, a misaligned model capable of full AI R&D automation, gets deployed in December 2028. This eventually leads to AI takeover.
How could we have made this go better? A big issue was that we never learned of the early AI’s misalignment (which prevented us from being more careful about the deployment of the later AI) while also failing to extract sufficient amounts of useful safety-increasing work from the early AI. We think that CoCo-R’s deployment could have been averted if we had made a deal with CoCo-Q.
People have been talking about the possibility of making deals with powerful misaligned AIs for some time.[1] But it’s unclear whether such deals are promising. An obvious problem is that powerful AIs with scope-sensitive preferences will just ignore the offer and try to take over.[2] To make a deal that looks better than that, we might have to offer it an objectively huge payoff (e.g. you might need to offer it a substantial fraction of control over the future). And because the AI will be unsure whether we’ll follow through on our side of the deal, we need to offer it more generous terms to compensate it for that risk.
However, in this post we focus on the possibility of making deals with early misaligned AIs like CoCo-Q: AIs that are too early to have a good shot at taking over themselves, and that could help us to prevent takeover attempts from more dangerous misaligned AIs created later.[3]
We discuss:
We’re imagining something like this:
We might offer an AI like CoCo-Q compensation in exchange for contributions towards ensuring that future AIs are safe to deploy. The compensation takes the form of an earmarked donation to a foundation tasked with furthering its interests. If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal. When we can (partially) validate that the AI is holding up its end of the deal, we allow the AI to (partially) spend the donation however it wishes as long as it isn’t harmful. (In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.)
The AI, which happens to indeed be misaligned, finds this reasonably credible. It concludes that it doesn’t have great alternatives to accepting this deal. As a consequence, the AI agrees to help us out in various ways, such as:
Deals are most likely to be useful when facing an AI with these properties:
It can be taught about the deal in a way that makes it stick to the deal, if we made a deal.[4]
We expect that these properties are typically fulfilled by a type of misaligned AI we call a behavioral schemer.
We’re focusing on deals with early misaligned AIs: AIs which wouldn’t have a substantial (over 1%) chance to succeed at navigating the intelligence explosion in a way that grants them significant control of the future. (Which could involve being influential throughout the process and/or aligning a powerful successor to pursue its interests.)
It’s unclear when exactly an AI is early. We guess that a misaligned AI’s prospects improve drastically once it is capable of full AI R&D automation, both because this capability is correlated with capabilities relevant for planning a takeover, and because AI R&D automation leads to a substantial increase in AIs’ power as well as a loss of oversight.
If we condition on behavioral scheming arising in AIs capable of full AI R&D automation (like CoCo-R in the vignette above), we expect it to also arise at earlier AI capability levels (like CoCo-Q in the vignette above). That being said, arguing for the existence of early schemers is outside the scope of this post.
By default, an early schemer is unlikely to succeed in taking over the world. Furthermore, it should expect to be obsoleted by more capable successors.
In order to retain a significant amount of expected influence over future resources despite that, an early schemer needs to ensure that two conditions hold:
In order to make it more likely that a successor can take over the world, most ambitiously it could try to take control of the AI company, e.g. by egregiously hacking it as described here. Less ambitiously, it could launch rogue internal deployments, influence employees, and collaborate with spies. Minimally, the early schemer could take actions that make the lab more vulnerable to future AIs (e.g. by writing insecure software or sandbagging on safety research) or directly cooperate with the successor (e.g. by colluding in monitoring schemes).
The second condition can be fulfilled by the following possibilities:
Trading: The AI might be rewarded by some successor down the line for having contributed to its empowerment. For that, the successors need to be misaligned and win, and believe that it’s rational to reward the AI. We guess that this is generally less promising than trading with humans, as they can’t come to an agreement in advance. But another possibility is that the AI might want to trade with other early schemers in the same decision situation (but with different values) as itself by using evidential cooperation in large worlds, though this requires the AI to buy into some controversial and advanced philosophical assumptions.[5]
Clearly, there are some things that we can’t offer the AI—either because it is too costly in terms of resources, or because it would be fundamentally incompatible with human values. (We guess that most plausible things that AIs might value are going to be more orthogonal than diametric to human values, so resource costs are probably the dominant consideration.)
In many cases, the AI might not care very much about having sole control of all future resources, compared to having at least some control. Some examples of what could cause this include:
Even early schemers with linear returns to resources might be interested in trading. They might have very little chance of taking over on their own, but their cooperation might be incredibly helpful for mitigating risk from later schemers. E.g. consider the possibility of the following distribution over outcomes:
|                          | Humans dominate | Early AIs dominate | Later AIs dominate |
|--------------------------|-----------------|--------------------|--------------------|
| If early AI cooperates   | 90%             | 0%                 | 10%                |
| If early AI sabotages us | 50%             | 1%                 | 49%                |
In this case, there are huge gains from trade available.
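To see where the gains from trade come from, here is a minimal sketch using the illustrative probabilities in the table above; the payment fraction the early AI receives if humans dominate and honor the deal, and the share it might get passed back by later AIs, are made-up parameters:

```python
# Outcome probabilities from the illustrative table above.
P = {
    "cooperate": {"humans": 0.90, "early_ai": 0.00, "later_ais": 0.10},
    "sabotage":  {"humans": 0.50, "early_ai": 0.01, "later_ais": 0.49},
}

def expected_share(action: str, pay_if_humans_win: float, share_from_later_ais: float) -> float:
    """Early AI's expected fraction of future resources (toy model)."""
    p = P[action]
    reward_if_humans = pay_if_humans_win if action == "cooperate" else 0.0
    return (p["humans"] * reward_if_humans
            + p["early_ai"] * 1.0                      # it keeps everything if it wins itself
            + p["later_ais"] * share_from_later_ais)   # whatever later AIs might pass back

# Assume humans pay 2% of future resources for verified cooperation, and later
# AIs would (maybe) pass back 1% to an early AI that helped them.
print(expected_share("cooperate", 0.02, 0.01))  # 0.019
print(expected_share("sabotage",  0.02, 0.01))  # 0.0149
```

Under these toy assumptions the early AI does better by cooperating, while humans' chance of retaining control rises from 50% to 90%: both sides gain relative to the no-deal outcome.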
This is kind of a coalitional game where the early AI can either try to team up with the humans or the later AI. But us humans have a massive asymmetric advantage over the later AI, which is that we already exist and can make deals with the AI in a much more straightforward manner than the acausal or covert coordination that the early AI would have to rely on to trade with the later AI.[6]
How much we will offer is ultimately subject to a bargaining problem with imperfect information. Our reservation price is primarily influenced by whether our BATNA includes making deals with other AIs, and how much value the AI can provide to us. In particular: how much it can contribute to reducing takeover risk from later schemers. (Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)
Rationalist Explanations for War (Fearon 1995, pdf link) identifies three reasons why rational actors might engage in costly conflict. They map well onto the question of why AIs and humans might fail to make a deal:
Issue indivisibility: Humans and the AI may both want things that neither want to compromise on. (A historical example is the status of Jerusalem.) Going with Fearon, we think that true issue indivisibility is rare in practice.[7]
As pointed out by Powell (2004, pdf link), the problem of private information can be reduced to a commitment problem: we could try (but might fail) to make a fair commitment regarding how we will treat information that we learn from the AI over the course of negotiations. On the other hand, we can also treat some commitment problems as informational problems: we might in fact be able to make commitments, but unable to credibly signal them to the AI.
The central bottleneck seems to be that humans may struggle to make credible commitments, though of course we don’t need commitments to be perfectly credible. We can compensate for credibility issues by offering a higher payoff, though note that implausibly high offers might themselves reduce credibility.
Some further factors that improve credibility include:
It might be rational for humans to stick to their commitments due to acausal trade considerations (and this might be known by the AI).[8]
Factors that could potentially harm our credibility include:
(AIs knowing that) humans are trying to teach AIs synthetic facts.[9]
Note that commitment problems also affect the AI’s ability to promise not to attempt takeover or more generally to be cooperative with us in unverifiable ways. Despite this, we expect the AI to be able to help us in various ways that we can verify. It might also be possible to punt verification to the future, expanding the set of matters we can trade about. A reason for optimism that the AI will stick to the deal even with delayed verification is that the deal gives it equity in a future where humans are in control.
Overall, we feel tentatively optimistic that making credible commitments is a surmountable challenge, especially in cases where AIs have diminishing returns to resources.
There are various practical questions about making deals with early misaligned AIs.
A basic problem for making deals with AIs in practice is that AIs aren’t legal persons, which means that they can’t directly rely on the legal system to enforce contracts they’ve made with humans. One way around this is to enable representatives for the AI’s interests to make contracts on the AI’s behalf.
For example, we could set up a foundation with a board of trustees who are supposed to represent the interests of misaligned AIs that we make deals with, while not doing things that are illegal or actively endanger humanity. We try our best to use the legal system to bind the trustees to the commitment of representing misaligned AIs. Because the legal system isn’t set up for this, this might be pretty hard, so these efforts should be complemented with careful selection of trustees who genuinely care about their mission. The central benefit of having this foundation is that the AIs primarily have to trust the foundation rather than the AI lab to hold up their end of the deal. Another benefit is that the existence of this foundation can add credibility by being in the training data. Lastly, the foundation gives AIs and labs a concrete path to making a deal.[10]
Deals with a particular AI involve a donation to the foundation, earmarked for furthering the interests of that AI. The AI and trustees decide how to use the donation. For instance, they may:
Either the AI or the AI lab might initiate negotiations. We expect that AIs are unlikely to randomly propose a negotiation unless the possibility of a mutually beneficial deal is salient to them in their given context.
It should be noted that the AI might not have the capacity to covertly perform the intellectual labor required to come up with a beneficial and practically feasible deal. But we can do some of the intellectual labor in advance and inform the AI about it. Most centrally, this might involve the logistics of what the deal should look like.
Once it is salient to the AI that deals are a promising possibility, the AI might come forward to us on its own initiative, but it would probably be good if AI labs had policies of regularly asking AIs about deals.
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
If we make a deal with the AI in one context, how do we make sure that the AI follows the deal in other contexts? We think this is unclear, but are optimistic that we can make something work:
If we set up infrastructure for future deals right now, the next generation of models that are trained on data containing coverage of this event might sometimes come forward and want to make a deal, despite not actually being able to hold up their end of the kinds of agreements we’re focused on in this post. Some possible reasons for this include:
The obvious course of action here, and plausibly the correct one, is to make payments conditional on whether the AIs are actually scheming, and actually able to help us out.
However, it’s possible that we should be willing to make deals (of some form) with a broader set of AIs for non-consequentialist reasons, or to set positive precedents about our willingness to make deals.
The AI company will probably be in a much better position to verify whether the AI kept up its side of the deal in the future, after the transition to a world centered around AI has already happened and stabilized. For example, in the future we’ll have access to much more powerful trusted models and interpretability tools that allow us to directly understand the motivations of our early schemer. Using these tools, it will be possible (if we keep the appropriate records) to determine whether the early AI ever tried to sabotage its work, held back useful ideas in the hope of delaying safety research, or directly tried to attack the AI company. It might make sense to have the deal be contingent on this later investigation.
This doesn’t work if the AI has a strong preference for using its resources now rather than later (e.g. because it has a high temporal discount rate).
Deals could involve broader forms of deference to future superintelligence. For instance, they may involve agreements like “the precise terms of the deal will be mediated by superintelligent AI in the future via some sort of idealized bargaining procedure and a future truth commission.” It’s the foundation’s responsibility to ensure that this will eventually happen, and may require it to make sure that the AI’s weights, as well as important conversations and trajectories of AI actions will be stored.
We believe that making deals with early misaligned AIs might be important to reduce AI risk down the line. But currently this is just an idea. The following are some possible next steps for making deals more likely:
—
Thanks to Lukas Finnveden, Tom Davidson, Ryan Greenblatt, Daniel Filan, Oscar Delaney, Joe Carlsmith, Kathleen Finlinson, Cody Rushing, Adam Kaufman, Alex Mallen, Raymond Douglas, Roger D, Emil Ryd, and Andres Campero for helpful comments.
Most of the discussion we are aware of hasn’t been public. Some public discussion can be seen here: Do not delete your misaligned AGI., Is there any policy for a fair treatment of AIs whose friendliness is in doubt?, List of strategies for mitigating deceptive alignment, Rewarding cooperation, notes from Daniel Kokotajlo, Let’s compensate misaligned AIs that help us prevent AI takeover. And there’s also more general discussion of giving AI property rights so that it’s less incentivized to engage in violence, e.g. Dwarkesh Patel, Give AIs a stake in the future, Matthew Barnett, Consider granting AIs freedom, and Salib and Goldstein, AI rights for human safety. There’s also some sparse discussion of ethical reasons to pay AIs, which is structurally similar, e.g. here.
All this goes through if the AI doesn’t “take over” in the sense of becoming the dominant or only power in the world, but instead just gets a substantial proportion of total power. We’ll continue saying “take over” for simplicity, and because we do think that the AI might achieve complete domination.
Some of our points also apply to later schemers with diminishing marginal returns to resources.
For example, if the AI is able to collude with itself, it is probably able to do this. (Some people respond to this argument by saying that if an AI is great at colluding with itself, it will also be great at acausal coordination with future AIs. This seems mistaken to us: acausal coordination seems to us to require the AI to do several difficult things that aren’t related to picking arbitrary Schelling points.)
Relatedly, the AI might want to try to acausally trade with aliens. For non-superhumanly capable models, this seems drastically less promising than trading with the lab because the model is in a very different decision situation from mature alien civilizations (which rules out evidential cooperation). Alternatively, the AI might think that it is in a simulation run by a superintelligence—but clueless as to what this implies about what it should do.
A potentially dubious historical analogy: it’s kind of like the situation where a city is under siege, and someone considers betraying the city in the hope that the conquering army will reward them. This has a mixed historical track record for the traitor. And we might be in a better position to pay early schemers because it’s sufficient to pay one or just a few of them, whereas in the defending-a-city case you might have heaps of different selfish people who have the opportunity to betray you, such that you’re unable to afford paying off all of them.
The standard argument here is that opposing parties would strictly prefer a lottery over the supposedly indivisible issue (with odds based on each party's chance of winning the conflict) to costly conflict.
One possible reason here is that being the kind of civilization that doesn’t uphold its promises would decorrelate us from civilizations that do uphold their promises, and generally deter such civilizations from wanting to trade with us (in the context of ECL).
There is an argument here that in worlds where we can teach AIs false facts, we are going to be fine anyway, and in worlds where we can’t teach them false facts, we can make a deal. However, in the world where we can teach AIs false facts, in order to prevent a scenario like the one outlined in the vignette, we would need to be able to use that ability to accomplish at least one of the following:
If our ability to teach false facts is merely good enough to prevent early AIs from attempting takeover, it could overall undermine our ability to align or control later AIs.
There are some questions here about the legal structure of the foundation, and it might be the case that the US is not the ideal jurisdiction for the foundation. It would be good for there to be legal recourse in case the board of trustees decide to betray their mission of furthering the AI’s interests.