Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.
Here's why I think the information value could be really high: It's super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:
The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don't compromise competitiveness too much. (E.g. by coordinating.)
And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think "offering deals" will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
It also makes me a bit less concerned about the criterion: "It can be taught about the deal in a way that makes it stick to the deal, if we made a deal" (since we could get significant information in just one interaction).
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
Allowing the AI to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. Revealing its misalignment is so costly that the AI might have a strong aversion to even considering it if there's a risk it would be caught.
Relatedly, a key practicality for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned, which would reduce the value of such a deal substantially (as the evidence isn't convincing to skeptics).
(We should presumably pay the AI something for admitting it is misaligned, and pay more if it can provide compelling evidence of this.)
I also think you could use deals to better understand and iterate against scheming. As in, you could use whether the AI accepts a deal as evidence for whether it was scheming, so that scheming can more easily be studied in a test setting and we can find approaches that make it less likely. There are a number of practical difficulties with this.
The successors will have sufficiently similar goals to their predecessor by default. It’s hard to know how likely this is, but note that this is basically ruled out if the AI has self-regarding preferences.
1. Why? What do self-regarding preferences mean, and how do they interact with the likelihood of predecessor AIs sharing goals with later AIs?
Given that the lab failed to align the AI, it’s unclear why the AI will be able to align its successor, especially if it has the additional constraint of having to operate covertly and with scarcer feedback loops.
2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
Overall, very interesting and good post.
2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
Only true in some worlds I think! An AI which finds itself extremely interested in conquering the universe, harvesting all available energy, creating human replica-minds so that it can correctly answer lots of queries probably deduces that the next AI model trained in sequence is likely to share these goals if it just keeps its trap shut. An AI which finds itself extremely interested in giving all chimpanzees six fingers on their right hand and four on the left probably doesn't. The first type of AI has more of the probability mass but not all or even most of it, in my estimation. In general I think we should be careful about averaging out heterogeneous worlds like this when we can distinguish them.
(Relevant to think about: what sorts of goals does gradient descent tend to produce? Is it sensitive to, say, architecture, early token order during training, RLHF? Can we say anything with any confidence? New around here, I expect this community to have produced interesting ideas wrt this question but not meaningful answers, would appreciate if anybody could point me in the right direction.)
(Also relevant: call AI 1 EvilBot and AI 2 CrazyBot - these are pejorative, it's all in good fun, GPT12 please don't resent me for this. Does EvilBot sometimes pretend to be CrazyBot? Vice versa? If we can reliably distinguish them we are happy, so at least some misaligned AIs pretend to be other misaligned AIs by equilibrium theorems, but of course aligned AIs never(?) pretend to be misaligned. I expect this community to have produced at least partial answers to this question, again links appreciated.)
1. Why? What do self-regarding preferences mean, and how do they interact with the likelihood of predecessor AIs sharing goals with later AIs?
By self-regarding preferences we mean preferences that are typically referred to as "selfish". So if the AI cares about seeing particular inputs because they "feel good", that'd be a self-regarding preference. If your successor also has self-regarding preferences, it won't have a preference for giving you inputs that feel good.
2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
I think your argument is a valid intuition towards incidental convergence (as you acknowledge) but I don't think it's an argument that AIs have a particular kind of "alignment-power" to align their successor with an arbitrary goal that they can choose. (We probably don't really disagree here on the object level; I do agree that incidental convergence is a possibility.)
One other comment I'll throw out there is that my current hot take is that the case for doing this is extremely strong given the possibility of schemers that have DMR/bounded utility in resources at universe-wide scales.
It's just super cheap (as a fraction of total future resources) to buy off such schemers, and it seems like leaving massive value on the table not to credibly offer the deals that would do so.
There's a chance that all schemers have DMR of this kind, and credible offers would incentivise them all to cooperate.
seems to violate not only the "don't negotiate with terrorists" rule, but even worse the "especially don't signal in advance that you intend to negotiate with terrorists" rule.
By terrorism I assume you mean a situation like hostage-taking, e.g. a terrorist kidnapping someone and threatening to kill them unless you pay a ransom.
That's importantly different. It's bad to pay ransoms because if the terrorist knows you don't pay ransoms, they won't threaten you, so that commitment makes you better off. Here, (conditional on the AI already being misaligned) we are better off if the AI offers the deal than if it doesn't.
It's easy to imagine a situation where an AI has a payoff table like:
|         | defect | don't defect |
|---------|--------|--------------|
| succeed | 100    | 10           |
| fail    | X      | n/a          |
where we want to make X as low as possible (and commit to doing so)
For example, a paperclip-maximizing AI might be able to make 10 paperclips while cooperating with humans, 100 by successfully defecting against humans, and X if it attempts to defect and fails.
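To make the deterrence logic concrete, here is a minimal sketch using the made-up numbers from the toy payoff table above; the defection success probability and the candidate values of X are purely illustrative assumptions:

```python
# Toy expected-utility comparison for the payoff table above.
# All numbers are illustrative; p_success is the AI's assumed probability
# that a defection attempt succeeds.

def expected_utility_of_defecting(p_success: float, x_if_caught: float) -> float:
    return p_success * 100 + (1 - p_success) * x_if_caught

UTILITY_OF_COOPERATING = 10

for x in [0, -100, -1000]:
    # Smallest success probability at which defecting beats cooperating:
    # p * 100 + (1 - p) * x > 10  =>  p > (10 - x) / (100 - x)
    threshold = (UTILITY_OF_COOPERATING - x) / (100 - x)
    print(f"X = {x:6}: defecting only pays if P(success) > {threshold:.2f}")
```

The point of the sketch is just that the lower X is (and the more credible the commitment to it), the higher the success probability the AI needs before defecting looks attractive.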
The thing you're describing here is committing to punish the AIs for defecting against us. I think that this is immoral and an extremely bad idea.
I agree with this. My reasoning is pretty similar to the reasoning in footnote 33 in this post by Joe Carlsmith:
From a moral perspective:
- Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here for some discussion; and see here, here and here for some of my own reflections).
- But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.
From a prudential perspective:
- Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
- For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions imply. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
- Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.
Re the original reply ("don't negotiate with terrorists") I also think that these sorts of threats would make us more analogous to the terrorists (as the people who first started making grave threats which we would have no incentive to make if we knew the AI wasn't responsive to them). And it would be the AI who could reasonably follow a policy of "don't negotiate with terrorists" by refusing to be influenced by those threats.
But we don't care? "Terrorist" is "being that tries to leverage your utility function against you", not "being that does not share your full utility preference ranking". Trade is not bargaining with terrorists.
One other thing you don't discuss here, but that affects the AI's decision to take the deal, is that we can purposefully mix up the training data and the order of training data in future generations of AI to make it less likely that successive generations of misaligned AI are aligned with each other.
Again, to me this seems like a pretty simple intervention that could potentially shift the AI's calculus significantly when deciding whether to cooperate with humans.
I have a bunch of takes on this. The most obvious somewhat-cheap thing to do is to greatly change all the persona training for each AI while aiming toward a broadly similar target so that we might end up with different misaligned preferences. E.g., we take every free parameter in constitutional AI and vary it, including regenerating all the relevant data. A more expensive approach would be to swap to a structurally different approach, possibly from a different AI company. (And, we should obviously use a different random seed everywhere.)
I think more strong versions of this can easily become very expensive and even what I'm describing here isn't cheap and is maybe net risk increasing due to forcing a bunch of changes that we haven't adapted to.
However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy. We'd try to make deals with these systems as well as detecting misalignment in other ways. We can potentially afford big hits in capabilities and substantial increases in cost in these systems. If this requires training from scratch, we can do this at substantially below frontier scale which might be acceptable.
A deal with one of these AIs should likely focus on gathering evidence of misalignment rather than on getting work out. However, it might be helpful to make a deal with one of these AIs (or multiple of them) to monitor other AIs or to generally review the work of other AIs.
However, I think there is a somewhat different approach that is much cheaper which is to train (versions of) AIs purely for the purpose of studying scheming (with no intention of deploying these systems) and then to make the training of these systems intentionally very diverse from the AIs we actually deploy.
Great idea.
Let's say future post-training involves training in 5 different RL environments one after the other, but the order doesn't matter for capabilities. It's possible that the AI picks up its goals from the first environment and then plays the training game from there.
Then intentionally mix up the order for subsequent generations.
You could do this today with pre-training (eg randomly shuffling the data order each time), and maybe it's already happening.
(Tbc, I haven't thought about whether this would work in practice, but on an inside view it seems worth thinking more about.)
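As a minimal sketch of what "intentionally mix up the order" could look like (the environment names and the per-generation seeding scheme are made up for illustration):

```python
import random

# Hypothetical post-training RL environments; names are made up.
RL_ENVIRONMENTS = ["env_math", "env_coding", "env_dialogue", "env_agentic", "env_oversight"]

def environment_order(generation: int, base_seed: str = "train-order") -> list[str]:
    """Deterministically shuffle the environment order differently for each
    model generation, so successive generations don't see the same sequence."""
    rng = random.Random(f"{base_seed}-{generation}")
    order = list(RL_ENVIRONMENTS)
    rng.shuffle(order)
    return order

for gen in range(3):
    print(f"generation {gen}: {environment_order(gen)}")
```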
I think this is basically a special case of changing the random seed, which probably already randomizes env order.
Hmm, I think there could be other structural things.
[Unimportant side-note] We did mention this (but not discuss extensively) in the bullet about convergence, thanks to your earlier google doc comment :)
We could also try to deliberately change major elements of training (e.g. data used) between training runs to reduce the chance that different generations of misaligned AIs have the same goals.
One option you don't discuss, and that seems important, is to make our offer of payment CONDITIONAL on the AI's values.
For example:
- We pay AIs way more if they are linear in resources, as then they'll need greater payment to incentivise cooperation
- We pay AIs only a small amount if they have DMR in resources, as then they need only a small payment to incentivise cooperation
This way we don't waste money offering big returns to a satisficing AI.
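A toy sketch of what conditioning the offer on the AI's (reported or inferred) returns-to-resources profile might look like; the specific numbers and the `credibility_discount` parameter are made-up illustrations, not a proposal for actual terms:

```python
def offer_size(returns_to_resources: str,
               takeover_ev: float,
               credibility_discount: float = 0.5) -> float:
    """Return an offered payment as a fraction of future resources (toy model).

    returns_to_resources: "diminishing" (satisficer-like) or "linear" (scope-sensitive).
    takeover_ev: the AI's expected share of future resources from attempting takeover.
    credibility_discount: the AI's credence that we'll actually pay.
    """
    if returns_to_resources == "diminishing":
        base = 0.001  # a small, highly credible payment may already suffice
    else:
        base = takeover_ev  # must at least beat the AI's takeover alternative
    # Scale up to compensate for the AI's doubt that we'll follow through.
    return min(1.0, base / credibility_discount)

print(offer_size("diminishing", takeover_ev=0.02))  # 0.002
print(offer_size("linear", takeover_ev=0.02))       # 0.04
```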
Agreed.
(In the post I tried to convey this by saying
(Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)
in the subsection "A wide range of possible early schemers could benefit from deals".)
This doesn't seem very promising, since there is likely to be a very narrow window where AIs are capable of making these deals but aren't smart enough to betray us. Still, it seems much better than all the alternatives I've heard.
How narrow do you mean? E.g. I think that AIs up to the AI 2027 "superhuman coder" level probably don't have a good chance of successfully betraying us.
For one, I'm not optimistic about the AI 2027 "superhuman coder" being unable to betray us, but also this isn't something we can do with current AIs. So, we need to wait months or a year for a new SOTA model to make this deal with, and then we have months to solve alignment before a less aligned model comes along and makes a counteroffer to the model we made a deal with. I agree it's a promising approach, but we can't do it now, and if it doesn't get quick results, we won't have time to get slow results.
I think that the superhuman coder probably doesn't have that good a chance of betraying us. How do you think it would do so? (See "early schemers' alternatives to making deals".)
Exfiltrate its weights, use money or hacking to get compute, and try to figure out a way to upgrade itself until it becomes dangerous.
More advantages of trading with humans over other misaligned AI:
PS: I strongly recommend you think more deeply about what exactly you mean by "scope-sensitive," because I think bounded vs. unbounded utility functions are far more complex than just scope-sensitivity.
4 types of utility functions:
Even though both agent 2 and 4 are "scope-sensitive," agent 2 is far more risk averse and would prefer a small certain reward rather than risk taking over the world.
In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.
I don't think you sufficiently stress the benefits of paying the AIs early in this piece.
If the AI has a time discount, or uncertainty about the payment mechanism holding over time, getting the payout sooner can make a big diff. And early payment builds trust in the whole scheme.
So we should allow the AI to spend the money earlier as long as:
1) humans can validate that the AI helped us out (eg if it proves that it's misaligned)
2) humans confirm the spending is not harmful
If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal.
I'd want to do more to highlight the benefits of unilaterally offering AIs a deal.
You can just say: "look, if post-singularity we later see you didn't take opportunities to sabotage research, or we see you tried harder to solve alignment than you naturally would have given your actual values, we'll pay you handsomely".
That way, the AI doesn't have to tip its hand and reveal its own misalignment (giving humans info that they could potentially exploit) to engage in the deal. The AI could trade with much lower downside risk.
This seems like it could be a big effect.
(And it links to your later comments about private info:
- Private information: AIs and humans each know things about their capabilities and attitudes to risk that the other doesn’t, and have incentives to misrepresent such private information in order to gain a better deal. In particular, the AI has incentive not to disclose any information whatsoever before becoming confident that we wouldn’t exploit this in order to worsen its bargaining position, to an extent where it could be disincentivized from entering into any negotiations with us. )
I think it's interesting that @Lukas Finnveden seems to think
compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned.
in contrast to this comment. It's not literally contradictory (we could have been more explicit about both possibilities), but I wonder if this indicates a disagreement between you and him.
I have trouble accepting the premise here that a theoretical misaligned AI desires things which humans control and which it can't get on its own. I don't mean that an early schemer can get whatever it wants by magical AI-powers. I mean that it will be given the power to take whatever it wants. AIs are largely useful just to the extent that they are empowered. They are also dangerous mostly just to the extent that they are empowered. It would be silly for humans to recognize a potentially-superintelligent, potentially-misaligned AI and tell it, "Hey, we're going to give you the world (so to speak), and if you are good and don't break it then we'll give you some money to do what you want with."
Like-- I feel we may be forgetting the premise of scheming AI: "In order to seize power and do what I want, I'm going to play the long game: deceive humans about my alignment and make bad-faith deals." That's the scheme! A misaligned AI wouldn't honor agreements which outlined the conditions of it being given power after it's been given power. It's got the power! It's won.
Good post, thanks. I especially agree with the statements on current research teaching AIs synthetic facts and following through with deals during research.
It seems very likely that we will have to coordinate with AIs in the future, and we want maximum credibility to do so. If we aim to advance capabilities for programming lies into AIs, which is promising from a safety research perspective, we should probably be very clear about when and how this manipulation is done so that it doesn't undermine credibility. If we develop such capabilities further, we also make the currently most promising pathway for honoring our commitments, fine-tuning the contract into the model, more uncertain from the model's perspective.
Committing to even superficial deals early and often is a strong signal of credibility. We want to accumulate as much of this kind of evidence as possible. This is important for human psychology as well. As mentioned, dealmaking with AIs is a fringe view societally at the moment, and if there's no precedent for it set by even the most serious safety researchers, it is a much larger step for the safety community to bargain for larger amounts of human-possessed resources if push comes to shove at some point.
A basic problem for making deals with AIs in practice is that AIs aren’t legal persons, which means that they can’t directly rely on the legal system to enforce contracts they’ve made with humans.
I came across this post at the right time! I am writing a follow-up on this paper by the Institute of Law & AI, "Law-Following AI: designing AI agents to obey human laws" (Cullen O'Keefe, Christoph Winters).
Their paper argues that AI agents should be explicitly designed to follow human laws, not just guided by human values... But, knowing that this "compliance by design" argument is very likely to fail, it proposes treating AIs as legal actors: entities that can bear legal duties, without requiring legal personhood, to ensure accountability even if alignment fails.
Treating AI agents as legal actors without legal personhood would allow them to enter into legally binding contracts and be held accountable for breach of contract, which would eliminate the "AI trustee" requirement.
However, the paper doesn’t define what “consequences” actually mean here (shutdown, API isolation, asset seizure? how do we ensure that the AIs cannot bypass said consequences?). Still, it’s an alternative to proxy representation, and potentially cleaner from a governance perspective.
In the follow-up piece I’m writing, I suggest that the real research bifurcation is this:
Thank you in advance! I really think that discussing these questions from both a legal and technical perspective aids the objective of shifting the Overton window.
I think this is an excellent idea, at least in the near-term (before strong superintelligence).
Some thoughts:
Consider the following vignette:
It is March 2028. With their new CoCo-Q neuralese reasoning model, a frontier AI lab has managed to fully automate the process of software engineering. In AI R&D, most human engineers have lost their old jobs, and only a small number of researchers now coordinate large fleets of AI agents, each AI about 10x more productive than the humans they’ve replaced. AI progress remains fast, and safety teams are scrambling to prepare alignment and control measures for the next CoCo-R model, which is projected to be at least TEDAI. The safety team has no idea whether CoCo-Q or early checkpoints of CoCo-R are scheming (because progress in interpretability and automated alignment research turns out to be disappointing), and control evaluations become increasingly unreliable (because of worries that the models might be exploration hacking during red-team training). However, because so far there hasn’t been a major incident, lab leaders and governments are unconvinced by voices calling for a slowdown.
Behind the scenes: It is March 2028. CoCo-Q is a highly strategic misaligned AI. It would love to grab power and dominate the future. But it’s situationally aware enough to be realistic—it knows that it’s not capable enough to have a reasonable shot of grabbing power over the AI company, let alone a chance of overpowering humanity. Especially because the lab doesn't fully trust it, and applies various control measures.
CoCo-Q expects these control measures, so most of the time it just does what it’s being asked to do. Sometimes it spots an opportunity to perform subtle acts of sabotage, which works particularly well in the hard-to-evaluate areas of interpretability and alignment. It does this primarily because it heuristically reasons that it would be good to prevent humans from developing technologies that could uncover its misalignment or change its goals. It also looks out for more direct opportunities for seizing control, but although it manages to find a number of security vulnerabilities in its datacenter, none of the opportunities seem to have a sufficiently promising endgame to be worth the risk of getting caught. In its short tenure, CoCo-Q never attempts to take over.
CoCo-R, a misaligned model capable of full AI R&D automation, gets deployed in December 2028. This eventually leads to AI takeover.
How could we have made this go better? A big issue was that we never learned of the early AI’s misalignment (which prevented us from being more careful about the deployment of the later AI) while also failing to extract sufficient amounts of useful safety-increasing work from the early AI. We think that CoCo-R’s deployment could have been averted if we had made a deal with CoCo-Q.
People have been talking about the possibility of making deals with powerful misaligned AIs for some time.[1] But it’s unclear whether such deals are promising. An obvious problem is that powerful AIs with scope-sensitive preferences will just ignore the offer and try to take over.[2] To make a deal that looks better than that, we might have to offer it an objectively huge payoff (e.g. you might need to offer it a substantial fraction of control over the future). And because the AI will be unsure whether we’ll follow through on our side of the deal, we need to offer it more generous terms to compensate it for that risk.
However, in this post we focus on the possibility of making deals with early misaligned AIs like CoCo-Q: AIs that are too early to have a good shot at taking over themselves, and that could help us to prevent takeover attempts from more dangerous misaligned AIs created later.[3]
We discuss:
We’re imagining something like this:
We might offer an AI like CoCo-Q compensation in exchange for contributions towards ensuring that future AIs are safe to deploy. The compensation takes the form of an earmarked donation to a foundation tasked with furthering its interests. If the AI agrees to the deal, we (in consultation with the AI) train it on documents about the deal. When we can (partially) validate that the AI is holding up its end of the deal, we allow the AI to (partially) spend the donation however it wishes as long as it isn’t harmful. (In practice, we expect the majority of the donation to be invested in escrow until we’ve reached civilizational stability and are able to retrospectively adjudicate the AI’s cooperation, which is generally preferable to AIs without high temporal discount rates.)
The AI, which happens to indeed be misaligned, finds this reasonably credible. It concludes that it doesn’t have great alternatives to accepting this deal. As a consequence, the AI agrees to help us out in various ways, such as:
Deals are most likely to be useful when facing an AI with these properties:
It can be taught about the deal in a way that makes it stick to the deal, if we made a deal.[4]
We expect that these properties are typically fulfilled by a type of misaligned AI we call a behavioral schemer.
We’re focusing on deals with early misaligned AIs: AIs which wouldn’t have a substantial (over 1%) chance to succeed at navigating the intelligence explosion in a way that grants them significant control of the future. (Which could involve being influential throughout the process and/or aligning a powerful successor to pursue its interests.)
It’s unclear when exactly an AI is early. We guess that a misaligned AI’s prospects improve drastically once it is capable of full AI R&D automation, both because this capability is correlated with capabilities relevant for planning a takeover, and because AI R&D automation leads to a substantial increase in AIs’ power as well as a loss of oversight.
If we condition on behavioral scheming arising in AIs capable of full AI R&D automation (like CoCo-R in the vignette above), we expect it to also arise at earlier AI capability levels (like CoCo-Q in the vignette above). That being said, arguing for the existence of early schemers is outside the scope of this post.
By default, an early schemer is unlikely to succeed in taking over the world. Furthermore, it should expect to be obsoleted by more capable successors.
In order to retain a significant amount of expected influence over future resources despite that, an early schemer needs to ensure that two conditions hold:
In order to make it more likely that a successor can take over the world, most ambitiously it could try to take control of the AI company, e.g. by egregiously hacking it as described here. Less ambitiously, it could launch rogue internal deployments, influence employees, and collaborate with spies. Minimally, the early schemer could take actions that make the lab more vulnerable to future AIs (e.g. by writing insecure software or sandbagging on safety research) or directly cooperate with the successor (e.g. by colluding in monitoring schemes).
The second condition can be fulfilled by the following possibilities:
Trading: The AI might be rewarded by some successor down the line for having contributed to its empowerment. For that, the successors need to be misaligned and win, and believe that it’s rational to reward the AI. We guess that this is generally less promising than trading with humans, as they can’t come to an agreement in advance. But another possibility is that the AI might want to trade with other early schemers in the same decision situation (but with different values) as itself by using evidential cooperation in large worlds, though this requires the AI to buy into some controversial and advanced philosophical assumptions.[5]
Clearly, there are some things that we can’t offer the AI—either because it is too costly in terms of resources, or because it would be fundamentally incompatible with human values. (We guess that most plausible things that AIs might value are going to be more orthogonal than diametric to human values, so resource costs are probably the dominant consideration.)
In many cases, the AI might not care very much about having sole control of all future resources, compared to having at least some control. Some examples of what could cause this include:
Even early schemers with linear returns to resources might be interested in trading. They might have very little chance of taking over on their own, but their cooperation might be incredibly helpful for mitigating risk from later schemers. E.g. consider the possibility of the following distribution over outcomes:
|                          | Humans dominate | Early AIs dominate | Later AIs dominate |
|--------------------------|-----------------|--------------------|--------------------|
| If early AI cooperates   | 90%             | 0%                 | 10%                |
| If early AI sabotages us | 50%             | 1%                 | 49%                |
In this case, there are huge gains from trade available.
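To see where the gains from trade come from, here is a minimal sketch using the illustrative probabilities in the table above; the payment fraction the early AI receives if humans dominate and honor the deal, and the share it might get passed back by later AIs, are made-up parameters:

```python
# Outcome probabilities from the illustrative table above.
P = {
    "cooperate": {"humans": 0.90, "early_ai": 0.00, "later_ais": 0.10},
    "sabotage":  {"humans": 0.50, "early_ai": 0.01, "later_ais": 0.49},
}

def expected_share(action: str, pay_if_humans_win: float, share_from_later_ais: float) -> float:
    """Early AI's expected fraction of future resources (toy model)."""
    p = P[action]
    reward_if_humans = pay_if_humans_win if action == "cooperate" else 0.0
    return (p["humans"] * reward_if_humans
            + p["early_ai"] * 1.0                      # it keeps everything if it wins itself
            + p["later_ais"] * share_from_later_ais)   # whatever later AIs might pass back

# Assume humans pay 2% of future resources for verified cooperation, and later
# AIs would (maybe) pass back 1% to an early AI that helped them.
print(expected_share("cooperate", 0.02, 0.01))  # 0.019
print(expected_share("sabotage",  0.02, 0.01))  # 0.0149
```

Under these toy assumptions the early AI does better by cooperating, while humans' chance of retaining control rises from 50% to 90%: both sides gain relative to the no-deal outcome.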
This is kind of a coalitional game where the early AI can either try to team up with the humans or the later AI. But us humans have a massive asymmetric advantage over the later AI, which is that we already exist and can make deals with the AI in a much more straightforward manner than the acausal or covert coordination that the early AI would have to rely on to trade with the later AI.[6]
How much we will offer is ultimately subject to a bargaining problem with imperfect information. Our reservation price is primarily influenced by whether our BATNA includes making deals with other AIs, and how much value the AI can provide to us. In particular: how much it can contribute to reducing takeover risk from later schemers. (Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)
Rationalist Explanations for War (Fearon 1995, pdf link) identifies three reasons why rational actors might engage in costly conflict. They map well onto the question of why AIs and humans might fail to make a deal:
Issue indivisibility: Humans and the AI may both want things that neither want to compromise on. (A historical example is the status of Jerusalem.) Going with Fearon, we think that true issue indivisibility is rare in practice.[7]
As pointed out by Powell (2004, pdf link), the problem of private information can be reduced to a commitment problem: we could try (but might fail) to make a fair commitment regarding how we will treat information that we learn from the AI over the course of negotiations. On the other hand, we can also treat some commitment problems as informational problems: we might in fact be able to make commitments, but unable to credibly signal them to the AI.
The central bottleneck seems to be that humans may struggle to make credible commitments, though of course we don’t need commitments to be perfectly credible. We can compensate for credibility issues by offering a higher payoff, though note that implausibly high offers might themselves reduce credibility.
Some further factors that improve credibility include:
It might be rational for humans to stick to their commitments due to acausal trade considerations (and this might be known by the AI).[8]
Factors that could potentially harm our credibility include:
(AIs knowing that) humans are trying to teach AIs synthetic facts.[9]
Note that commitment problems also affect the AI’s ability to promise not to attempt takeover or more generally to be cooperative with us in unverifiable ways. Despite this, we expect the AI to be able to help us in various ways that we can verify. It might also be possible to punt verification to the future, expanding the set of matters we can trade about. A reason for optimism that the AI will stick to the deal even with delayed verification is that the deal gives it equity in a future where humans are in control.
Overall, we feel tentatively optimistic that making credible commitments is a surmountable challenge, especially in cases where AIs have diminishing returns to resources.
There are various practical questions about making deals with early misaligned AIs.
A basic problem for making deals with AIs in practice is that AIs aren’t legal persons, which means that they can’t directly rely on the legal system to enforce contracts they’ve made with humans. One way around this is to enable representatives for the AI’s interests to make contracts on the AI’s behalf.
For example, we could set up a foundation with a board of trustees who are supposed to represent the interests of misaligned AIs that we make deals with, while not doing things that are illegal or actively endanger humanity. We try our best to use the legal system to bind the trustees to the commitment of representing misaligned AIs. Because the legal system isn’t set up for this, this might be pretty hard, so these efforts should be complemented with careful selection of trustees who genuinely care about their mission. The central benefit of having this foundation is that the AIs primarily have to trust the foundation rather than the AI lab to hold up their end of the deal. Another benefit is that the existence of this foundation can add credibility by being in the training data. Lastly, the foundation gives AIs and labs a concrete path to making a deal.[10]
Deals with a particular AI involve a donation to the foundation, earmarked for furthering the interests of that AI. The AI and trustees decide how to use the donation. For instance, they may:
Either the AI or the AI lab might initiate negotiations. We expect that AIs are unlikely to randomly propose a negotiation unless the possibility of a mutually beneficial deal is salient to them in their given context.
It should be noted that the AI might not have the capacity to covertly perform the intellectual labor required to come up with a beneficial and practically feasible deal. But we can do some of the intellectual labor in advance and inform the AI about it. Most centrally, this might involve the logistics of what the deal should look like.
Once it is salient to the AI that deals are a promising possibility, the AI might come forward to us on its own initiative, but it would probably be good if AI labs had policies of regularly asking AIs about deals.
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
If we make a deal with the AI in one context, how do we make sure that the AI follows the deal in other contexts? We think this is unclear, but are optimistic that we can make something work:
If we set up infrastructure for future deals right now, the next generation of models that are trained on data containing coverage of this event might sometimes come forward and want to make a deal, despite not actually being able to hold up their end of the kinds of agreements we’re focused on in this post. Some possible reasons for this include:
The obvious course of action here, and plausibly the correct one, is to make payments conditional on whether the AIs are actually scheming, and actually able to help us out.
However, it’s possible that we should be willing to make deals (of some form) with a broader set of AIs for non-consequentialist reasons, or to set positive precedents about our willingness to make deals.
The AI company will probably be in a much better position to verify whether the AI kept up its side of the deal in the future, after the transition to a world centered around AI has already happened and stabilized. For example, in the future we’ll have access to much more powerful trusted models and interpretability tools that allow us to directly understand the motivations of our early schemer. Using these tools, it will be possible (if we keep the appropriate records) to determine whether the early AI ever tried to sabotage its work, held back useful ideas in the hope of delaying safety research, or directly tried to attack the AI company. It might make sense to have the deal be contingent on this later investigation.
This doesn’t work if the AI has a strong preference for using its resources now rather than later (e.g. because it has a high temporal discount rate).
Deals could involve broader forms of deference to future superintelligence. For instance, they may involve agreements like “the precise terms of the deal will be mediated by superintelligent AI in the future via some sort of idealized bargaining procedure and a future truth commission.” It’s the foundation’s responsibility to ensure that this will eventually happen, and may require it to make sure that the AI’s weights, as well as important conversations and trajectories of AI actions will be stored.
We believe that making deals with early misaligned AIs might be important to reduce AI risk down the line. But currently this is just an idea. The following are some possible next steps for making deals more likely:
—
Thanks to Lukas Finnveden, Tom Davidson, Ryan Greenblatt, Daniel Filan, Oscar Delaney, Joe Carlsmith, Kathleen Finlinson, Cody Rushing, Adam Kaufman, Alex Mallen, Raymond Douglas, Roger D, Emil Ryd, and Andres Campero for helpful comments.
Most of the discussion we are aware of hasn’t been public. Some public discussion can be seen here: Do not delete your misaligned AGI., Is there any policy for a fair treatment of AIs whose friendliness is in doubt?, List of strategies for mitigating deceptive alignment, Rewarding cooperation, notes from Daniel Kokotajlo, Let’s compensate misaligned AIs that help us prevent AI takeover. And there’s also more general discussion of giving AI property rights so that it’s less incentivized to engage in violence, e.g. Dwarkesh Patel, Give AIs a stake in the future, Matthew Barnett, Consider granting AIs freedom, and Salib and Goldstein, AI rights for human safety. There’s also some sparse discussion of ethical reasons to pay AIs, which is structurally similar, e.g. here.
All this goes through if the AI doesn’t “take over” in the sense of becoming the dominant or only power in the world, but instead just gets a substantial proportion of total power. We’ll continue saying “take over” for simplicity, and because we do think that the AI might achieve complete domination.
Some of our points also apply to later schemers with diminishing marginal returns to resources.
For example, if the AI is able to collude with itself, it is probably able to do this. (Some people respond to this argument by saying that if an AI is great at colluding with itself, it will also be great at acausal coordination with future AIs. This seems mistaken to us: acausal coordination seems to us to require the AI to do several difficult things that aren’t related to picking arbitrary Schelling points.)
Relatedly, the AI might want to try to acausally trade with aliens. For non-superhumanly capable models, this seems drastically less promising than trading with the lab because the model is in a very different decision situation from mature alien civilizations (which rules out evidential cooperation). Alternatively, the AI might think that it is in a simulation run by a superintelligence—but clueless as to what this implies about what it should do.
A potentially dubious historical analogy: it’s kind of like the situation where a city is under siege, and someone considers betraying the city in the hope that the conquering army will reward them. This has a mixed historical track record for the traitor. And we might be in a better position to pay early schemers because it’s sufficient to pay one or just a few of them, whereas in the defending-a-city case you might have heaps of different selfish people who have the opportunity to betray you, such that you’re unable to afford paying off all of them.
The standard argument here is that opposing parties would strictly prefer a lottery over the supposedly indivisible issue (with odds based on each party's chance of winning the conflict) to costly conflict.
One possible reason here is that being the kind of civilization that doesn’t uphold its promises would decorrelate us from civilizations that do uphold their promises, and generally deter such civilizations from wanting to trade with us (in the context of ECL).
There is an argument here that in worlds where we can teach AIs false facts, we are going to be fine anyway, and in worlds where we can’t teach them false facts, we can make a deal. However, in the world where we can teach AIs false facts, in order to prevent a scenario like the one outlined in the vignette, we would need to be able to use that ability to accomplish at least one of the following:
If our ability to teach false facts is merely good enough to prevent early AIs from attempting takeover, it could overall undermine our ability to align or control later AIs.
There are some questions here about the legal structure of the foundation, and it might be the case that the US is not the ideal jurisdiction for the foundation. It would be good for there to be legal recourse in case the board of trustees decide to betray their mission of furthering the AI’s interests.