The Commitment Races problem

Daniel Kokotajlo

23rd Aug 2019

AI Alignment Forum

[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.

For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible

Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate.

Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.

In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.

So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits to not give in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.

Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things), and the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they've learned about by t.

Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

Anecdote: Playing a board game last week, my friend Lukas said (paraphrase) "I commit to making you lose if you do that move." In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike--and possibly outright bullying--by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn't comply. Practice what you preach!)

When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.

This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other one limits theirs. But it gets worse.

Sometimes commitments can be made "at the same time"--i.e. in ignorance of each other--in such a way that they lock in an outcome that is disastrous for everyone. (Think both players in Chicken throwing out their steering wheels simultaneously.)
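The Chicken example can be made concrete with a small sketch. The payoff numbers and the names `PAYOFFS`, `best_response`, and `play` are assumptions chosen for illustration, not anything specified in the post. The sketch shows two cases: a lone committer facing a best-responding consequentialist, and two players who commit in ignorance of each other.

```python
# Illustrative payoffs (an assumption): (first player's payoff, second player's payoff).
ACTIONS = ("swerve", "straight")
PAYOFFS = {
    ("swerve", "swerve"): (0, 0),
    ("swerve", "straight"): (-1, 1),
    ("straight", "swerve"): (1, -1),
    ("straight", "straight"): (-10, -10),
}


def best_response(opponent_action):
    """Action that maximizes one's own payoff against a known opponent action."""
    # The game is symmetric, so the first payoff entry works for either player.
    return max(ACTIONS, key=lambda own: PAYOFFS[(own, opponent_action)][0])


def play(commitment_a=None, commitment_b=None):
    """Resolve the game given each player's prior commitment (None = uncommitted)."""
    if commitment_a is None and commitment_b is None:
        raise ValueError("this sketch only covers cases with at least one commitment")
    # An uncommitted player learns the other's commitment and best-responds to it.
    action_a = commitment_a if commitment_a is not None else best_response(commitment_b)
    action_b = commitment_b if commitment_b is not None else best_response(commitment_a)
    return PAYOFFS[(action_a, action_b)]


# Only A has thrown out the steering wheel: B swerves and A wins.
print(play(commitment_a="straight"))  # (1, -1)

# Both threw out their steering wheels in ignorance of each other: crash.
print(play(commitment_a="straight", commitment_b="straight"))  # (-10, -10)
```

Under these assumed payoffs, committing first is worth a lot, and committing simultaneously is the worst outcome for both players.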

Here is a somewhat concrete example: Two consequentialist AGIs think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request--perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying--so many years later they meet each other, learn about each other, and end up locked into all-out war.
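To see how two reasonable-sounding fairness standards can disagree, here is a minimal numerical sketch. The bargaining problem is an assumption made up for illustration: two agents split a pie of size 1, the first has linear utility in its share, the second has square-root utility in its share, and the disagreement point is zero for both. The function names are hypothetical.

```python
import math


def u1(x):
    """Utility of agent 1, who receives share x (assumed linear)."""
    return x


def u2(x):
    """Utility of agent 2, who receives share 1 - x (assumed concave)."""
    return math.sqrt(1.0 - x)


def nash_share(steps=10_000):
    """Agent 1's share under the Nash solution: maximize the product of gains."""
    shares = [i / steps for i in range(steps + 1)]
    return max(shares, key=lambda x: u1(x) * u2(x))


def kalai_smorodinsky_share(steps=10_000):
    """Agent 1's share under Kalai-Smorodinsky: equalize gains relative to each ideal."""
    shares = [i / steps for i in range(steps + 1)]
    ideal_1 = max(u1(x) for x in shares)  # best agent 1 could possibly get
    ideal_2 = max(u2(x) for x in shares)  # best agent 2 could possibly get
    return min(shares, key=lambda x: abs(u1(x) / ideal_1 - u2(x) / ideal_2))


print(round(nash_share(), 3))               # about 0.667
print(round(kalai_smorodinsky_share(), 3))  # about 0.618
```

In this toy problem, an agent 1 that treats the Nash solution as fair would demand about two thirds of the pie, while an agent 2 that treats anything beyond the Kalai-Smorodinsky point as bullying would see that demand as roughly five percentage points too high. If both have committed to punish bullying, that gap is enough to trigger the conflict described above.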

I'm not saying disastrous AGI commitments are the default outcome; I'm saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we created a value-aligned AGI that ended up getting into all sorts of fights across the multiverse with other value systems. We'd wish we had built a paperclip maximizer instead.

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.

Objection: "Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?"

Reply: Several points:

1. Devastating commitments (e.g. "Grim Trigger") are much more possible with AGI--just alter the code! Inigo Montoya is a fictional character and even he wasn't able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.

2. Credibility is much easier also, especially in an acausal context (see above.)

3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.

4. AGI may not have sufficiently strong reputation effects in the sense relevant to consequentialists, partly because threats can be made more devastating (see above) and partly because they may not believe they exist in a population of other powerful agents who will bully them if they show weakness.

5. Finally, these terrible things (brutal threats, costly fights) do happen to some extent even among humans today--especially in situations of anarchy. We want the AGI we build to be less likely to do that stuff than humans, not merely as likely.
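To illustrate how cheap a devastating commitment is once behavior is just code, here is a minimal sketch of Grim Trigger in an iterated game with moves "C" (cooperate) and "D" (defect). The opponent strategy `defect_once` and the helper `simulate` are hypothetical names introduced only for this example.

```python
def grim_trigger(opponent_history):
    """Cooperate until the opponent defects once, then defect forever."""
    return "D" if "D" in opponent_history else "C"


def defect_once(opponent_history, at_round=2):
    """Hypothetical opponent: defects in exactly one round, otherwise cooperates."""
    # Both histories grow in lockstep, so the length equals the current round index.
    return "D" if len(opponent_history) == at_round else "C"


def simulate(strategy_a, strategy_b, rounds):
    """Play the two strategies against each other and return both move histories."""
    history_a, history_b = [], []
    for _ in range(rounds):
        move_a = strategy_a(history_b)  # each side sees only the other's past moves
        move_b = strategy_b(history_a)
        history_a.append(move_a)
        history_b.append(move_b)
    return history_a, history_b


moves_a, moves_b = simulate(grim_trigger, defect_once, rounds=5)
print("".join(moves_a))  # CCCDD
print("".join(moves_b))  # CCDCC
```

The whole lifelong commitment is a single line of logic. A human would need something like Inigo Montoya's backstory to make it credible; a program just needs to run it.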

Objection: "Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it's-too-late arguments, and thus will be incapable of hurting anyone."

Reply: That would be nice, wouldn't it? Let's hope so, but not count on it. Indeed perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...

Anecdote: A friend of mine, when she was a toddler, would threaten her parents: "I'll hold my breath until you give me the candy!" Imagine how badly things would have gone if she had been physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

Conclusion

Overall, I'm not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if "solving bargaining" turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.

Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.

Eliezer Yudkowsky (4y)

IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent. If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair. They cannot evade that by trying to make some 'commitment' earlier than I do. I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases.
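The arithmetic behind that acceptance probability can be sketched as follows. This is a minimal illustration, assuming a $10 pie and a fair share of $5 as in the example above. The function names and the `epsilon` margin are hypothetical, and the rule for demands other than $6 is an extrapolation from the single case described, not something the comment spells out.

```python
from fractions import Fraction


def acceptance_probability(demand, fair_share=5, epsilon=Fraction(1, 100)):
    """Probability of giving in to a demand, set slightly below fair_share / demand."""
    if demand <= fair_share:
        return Fraction(1)  # fair or generous demands are always accepted
    # Assumed generalization: shade the break-even probability down by epsilon.
    return Fraction(fair_share, demand) - epsilon


def expected_take(demand, fair_share=5, epsilon=Fraction(1, 100)):
    """Expected dollars the demander walks away with."""
    return demand * acceptance_probability(demand, fair_share, epsilon)


print(acceptance_probability(6))  # 247/300, slightly less than 5/6
print(expected_take(6))           # 247/50, i.e. $4.94, less than the fair $5
print(expected_take(5))           # 5
```

With this response policy, demanding $6 earns less in expectation than demanding $5, so making the greedy demand first buys nothing.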

I am not locked into warfare with things that demand $6 instead of $5. I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utility-function inverters.

From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the one-shot prisoner's dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself. I suggest cultivating the same suspicion with respect to the imagination of commitment races between Ultimatum Game players, in which whoever manages to make some move logically first walks away with $9 and the other poor agent can only take $1 - especially if you end up reasoning that the computationally weaker agent should be the winner.

Daniel Kokotajlo (4y)

I agree with all this I think.

This is why I said commitment races happen between consequentialists (I defined that term more narrowly than you do; the sophisticated reasoning you do here is nonconsequentialist by my definition). I agree that agents worthy of the label "rational" will probably handle these cases gracefully and safely.

However, I'm not yet supremely confident that the AGIs we end up building will handle these cases gracefully and safely. I would love to become more confident & am looking for ways to make it more likely.

If today you go around asking experts for an account of rationality, they'll pull off the shelf CDT or EDT or game-theoretic rationality (Nash equilibria, best-respond to opponent) -- something consequentialist in the narrow sense. I think there is a nonzero chance that the relevant AGI will be like this too, either because we explicitly built it that way or because in some young dumb early stage it (like humans) picks up ideas about how to behave from its environment. Or else maybe because narrow-consequentialism works pretty well in single-agent environments and many multi-agent environments too, and maybe by the time the AGI is able to self-modify to something more sophisticated it is already thinking about commitment races and already caught in their destructive logic.

(ETA: Insofar as you are saying: "Daniel, worrying about this is silly, any AGI smart enough to kill us all will also be smart enough to not get caught in commitment races" then I say... I hope so! But I want to think it through carefully first; it doesn't seem obvious to me, for the above reasons.)

Wei Dai (3y)

I think I'm less sure than @Eliezer Yudkowsky that there is a good solution to the problem of commitment races, even in theory, or that if there is a solution, it has the shape that he thinks it has. I've been thinking about this problem off and on since 2009, and haven't made much progress. Others have worked on this too (as you noted in the OP), and all seem to have gotten stuck at roughly the same place that I got stuck. Eliezer described what he would do in a particular game, but I don't know how to generalize his reasoning (which you call "nonconsequentialist") and incorporate it into a decision theory, even informally (e.g., on the same level of formality as my original description of UDT1.1 or UDT2).

As an alternative to Eliezer's general picture, it also seems plausible to me that the solution to commitment races looks like everyone trying to win the races by being as clever as they can (using whatever tricks one can think of to make the best commitments as quickly as possible while minimizing the downsides of doing so), or a messy mix of racing and trading/cooperating. UDT2 sort of fits into or is compatible with this picture, but might be far from the cleverest thing we can do (if this picture turns out to be correct).

To summarize, I think the commitment races problem poses a fundamental challenge to decision theory, and is not just a matter of "we know roughly or theoretically what should be done, and we just have to get AGI to do it." (I'm afraid some readers might get the latter impression from your exchange with Eliezer.)

Eliezer Yudkowsky (3y)

TBC, I definitely agree that there's some basic structural issue here which I don't know how to resolve. I was trying to describe properties I thought the solution needed to have, which ruled out some structural proposals I saw as naive; not saying that I had a good first-principles way to arrive at that solution.

Daniel Kokotajlo (3y)

Great comment. To reply I'll say a bit more about how I've been thinking about this stuff for the past few years:

I agree that the commitment races problem poses a fundamental challenge to decision theory, in the following sense: There may not exist a simple algorithm in the same family of algorithms as EDT, CDT, UDT 1.0, 1.1, and even 2.0, that does what we'd consider a good job in a realistic situation characterized by many diverse agents interacting over some lengthy period with the ability to learn about each other and make self-modifications (including commitments). Indeed it may be that the top 10% of humans by performance in environments like this, or even the top 90%, outperform the best possible simple-algorithm-in-that-family. Thus any algorithm for making decisions that would intuitively be recognized as a decision theory, would be worse in realistic environments than the messy neural net wetware of many existing humans, and probably far worse than the best superintelligences. (To be clear, I still hold out hope that this is false and such a simple in-family algorithm does exist.)

I therefore think we should widen our net and start considering algorithms that don't fit in the traditional decision theory family. For example, think of a human role model (someone you consider wise, smart, virtuous, good at philosophy, etc.) and then make them into an imaginary champion by eliminating what few faults they still have, and increasing their virtues to the extent possible, and then imagine them in a pleasant and secure simulated environment with control over their own environment and access to arbitrary tools etc. and maybe also the ability to make copies of themselves HCH style. You have now described an algorithm whose performance can be compared to that of EDT, UDT 2.0, etc. and which arguably will be superior to all of them (because this wise human can use their tools to approximate or even directly compute such things to the extent that they deem it useful to do so). We can then start thinking about flaws in this algorithm, and see if we can fix them. (Another algorithm to consider is the human champion alone, without all the fancy tooling and copy-generation ability. Even this might still be better than CDT, EDT, UDT, etc.)

Another example:

Consider a standard expected utility maximizer of some sort (e.g. EDT) but with the following twist: It also has a deontological or almost-deontological constraint that prevents it from getting exploited. How is this implemented? Naive first attempt: It has some "would this constitute me being exploited?" classifier which it can apply to imagined situations, and which it constantly applies whenever it's thinking about what to do, and it doesn't take actions that trigger the classifier to a sufficiently high degree. Naive second attempt: "Me getting exploited" is assigned huge negative utility. (I suspect these might be equivalent, but also they might not be, anyhow moving on...) What can we say about this agent?

Well, it all depends on how good its classifier is, relative to the adversaries it is realistically likely to face. Are its adversaries able to find any adversarial examples to its classifier that they can implement in practice? Things that in some sense SHOULD count as exploitation, but which it won't classify as exploitation and thus will fall for?

Moreover, is its classifier wasteful/clumsy/etc., hurting its own performance in other ways in order to achieve the no-exploitation property?

I think this might not be a hard problem. If you are facing adversaries significantly more intelligent than you, or who can simulate you in detail such that they can spend lots of compute to find adversarial examples by brute force, you are kinda screwed anyway probably and so it's OK if you are vulnerable to exploitation by them. Moreover there are probably fixes to even those failure modes -- e.g. plausibly "they used their simulation of me + lots of compute to find a solution that would give them lots of my stuff but not count as exploitation according to my classifier" can just be something that your classifier classifies as exploitation. Anything even vaguely resembling that can be classified as exploitation. So you'd only be exploitable in practice if they had the simulation of you but you didn't know they did.

Moreover, that's just the case where you have a fixed/frozen classifier. More sophisticated designs could have more of a 'the constitution is a living document' vibe, a process for engaging in Philosophical/Moral Reasoning that has the power to modify the classifier as it sees fit -- but importantly, still applies the classifier to its own thinking processes, so it won't introduce a backdoor route to exploitation.
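The two naive attempts described above can be sketched side by side. This is only an illustration under stated assumptions: the `Option` type, the option names, the scores, the threshold, and the penalty size are all made up, and the classifier is stubbed out as a precomputed `exploitation_score` rather than anything that could actually recognize exploitation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    expected_utility: float
    # Hypothetical classifier output in [0, 1]: how strongly the option
    # looks like "me being exploited".
    exploitation_score: float


def choose_with_constraint(options, threshold=0.5):
    """First attempt: refuse any option that triggers the classifier too strongly."""
    allowed = [o for o in options if o.exploitation_score < threshold]
    if not allowed:
        raise ValueError("every option triggers the exploitation classifier")
    return max(allowed, key=lambda o: o.expected_utility)


def choose_with_penalty(options, penalty=1_000_000.0):
    """Second attempt: treat being exploited as a huge negative utility."""
    return max(options, key=lambda o: o.expected_utility - penalty * o.exploitation_score)


options = [
    Option("give in to the threat", expected_utility=5.0, exploitation_score=0.9),
    Option("make a counteroffer", expected_utility=3.0, exploitation_score=0.2),
    Option("refuse outright", expected_utility=1.0, exploitation_score=0.0),
]

print(choose_with_constraint(options).name)  # make a counteroffer
print(choose_with_penalty(options).name)     # refuse outright
```

In this toy setup the two attempts come apart: the hard constraint tolerates a mildly suspicious option, while the huge penalty makes the agent flee anything with a nonzero score. That is one way the "might not be equivalent" caveat could show up, and also one way the classifier could end up wasteful or clumsy.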

Another tool in the toolbox: Infohazard management. There's a classic tradeoff which you discovered, in the context of UDT 2.0 at least, where if you run the logical inductor for longer you risk making yourself exploitable or otherwise losing to agents that are early enough in logical time that you learn about their behavior (and they predict that you'll learn about their behavior) and so they exploit you. But on the other hand, if you pause the logical inductor and let the agent make self-modifications too soon, the self-modifications it makes might be really stupid/crazy. Well, infohazard management maybe helps solve this problem. Make a cautious self-modification along the lines of "let's keep running the logical inductor, but let's not think much yet about what other potentially-exploitative-or-adversarial agents might do." Perhaps things mostly work out fine if the agents in the commitment race are smart enough to do something like this before they stumble across too much information about each other.

Another tool in the toolbox: Learn from history: Heuristics / strategies / norms / etc. for how to get along in commitment race environments can be learned from history via natural selection, cultural selection, and reading history books. People have been in similar situations in the past, e.g. in some cultures people could unilaterally swear oaths/promises and would lose lots of status if they didn't uphold them. Over history various cultures have developed concepts of fairness that diverse agents with different interests can use to coordinate without incentivizing exploiters; we have a historical record which we can use to judge how well these different concepts work, including how well they work when different people come from different cultures with different fairness concepts.

Another thing to mention: The incentive to commit to brinksmanshippy, exploitative policies is super strong to the extent that you are confident that the other agents you will interact with are consequentialists. But to the extent that you expect many of those agents to be nonconsequentialists with various anti-exploitation defenses (e.g. the classifier system I described above, or whatever sort of defenses they may have evolved culturally or genetically) the incentive goes in the opposite direction -- doing brinksmanshippy / bully-ish strategies is going to waste resources at best and get you into lots of nasty fights with high probability and plausibly even get everyone to preemptively gang up on you.

And this is important because once you understand the commitment races problem, you realize that consequentialism is a repulsor state, not an attractor state; moreover, realistic agents (whether biological or artificial) will not begin their life as consequentialists except if specifically constructed to be that way. Moreover their causal history will probably contain lots of learned/evolved anti-exploitation defenses, some of which may have made their way into their minds.

Zooming out again: The situation seems extremely messy, but not necessarily grim. I'm certainly worried--enough to make this one of my main priorities!--but I think that agents worthy of being called "rational" will probably handle all this stuff more gracefully/competently than humans do, and I think (compared to how naive consequentialists would handle it, and certainly compared to how it COULD go) humans handle it pretty well. That is, I agree that "the solution to commitment races looks like everyone trying to win the races by being as clever as they can (using whatever tricks one can think of to make the best commitments as quickly as possible while minimizing the downsides of doing so), or a messy mix of racing and trading/cooperating," but I think that given what I've said in this comment, and some other intuitions which I haven't articulated, overall I expect things to go significantly better in expectation than they go with humans. The sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan. (Dath Ilan is Yudkowsky's fantasy utopia of cooperative competence)

Anthony DiGiovanni (3y)

"It also has a deontological or almost-deontological constraint that prevents it from getting exploited."

I’m not convinced this is robustly possible. The constraint would prevent this agent from getting exploited conditional on the potential exploiters best-responding (being "consequentialists"). But it seems to me the whole heart of the commitment races problem is that the potential exploiters won’t necessarily do this, indeed depending on their priors they might have strong incentives not to. (And they might not update those priors for fear of losing bargaining power.)

That is, these exploiters will follow the same qualitative argument as us — “if I don’t commit to demand x%, and instead compromise with others’ demands to avoid conflict, I’ll lose bargaining power” — and adopt their own pseudo-deontological constraints against being fair. Seems that adopting your deontological strategy requires assuming one's bargaining counterparts will be “consequentialists” in a similar way as (you claim) the exploitative strategy requires. And this is why Eliezer's response to the problem is inadequate.

There might be various symmetry-breakers here, but I’m skeptical they favor the fair/nice agents so strongly that the problem is dissolved.

I think this is a serious challenge and a way that, as you say, an exploitation-resistant strategy might be “wasteful/clumsy/etc., hurting its own performance in other ways in order to achieve the no-exploitation property.” At least, unless certain failsafes against miscoordination are used--my best guess is these look like some variant of safe Pareto improvements that addresses the key problem discussed in this post, which I’ve worked on recently (as you know).

Given this, I currently think the most promising approach to commitment races is to mostly punt the question of the particular bargaining strategy to smarter AIs, and our job is to make sure robust SPI-like things are in place before it’s too late.

Daniel Kokotajlo (3y)

Exploitation means the exploiter benefits. If you are a rock, you can't be exploited. If you are an agent who never gives in to threats, you can't be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won't benefit them, then you might get nasty things done to you. You wouldn't be exploited, but you'd still be very unhappy.

So no, I don't think the constraint I proposed would only work if the opponent agents were consequentialists. Adopting the strategy does not assume one's bargaining counterparts will be consequentialists. However, if you are a consequentialist, then you'll only adopt the strategy if you think that sufficiently few of the agents you will later encounter are of the aforementioned nasty sort--which, by the logic of commitment races, is not guaranteed; it's plausible that at least some of the agents you'll encounter are 'already committed' to being nasty to you unless you surrender to them, such that you'll face much nastiness if you make yourself inexploitable. This is my version of what you said above, I think. And yeah to put it in my ontology, some exploitation-resistant strategies might be wasteful/clumsy/etc. and depending on how nasty the other agents are, maybe most or even all exploitation-resistant strategies are more trouble than they are worth (from a consequentialist perspective; note that nonconsequentialists might have additional reasons to go for exploitation-resistant strategies. Also note that even consequentialists might assign intrinsic value to justice, fairness, and similar concepts.)

But like I said, I'm overall optimistic -- not enough to say "there's no problem here," it's enough of a problem that it's one of my top priorities (and maybe my top priority?) but I still do expect the sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan.

Agree re punting the question. I forgot to mention that in my list above, as a reason to be optimistic; I think that not only can we human AI designers punt on the question to some extent, but AGIs can punt on it as well to some extent. Instead of hard-coding in a bargaining strategy, we / future AGIs can do something like "don't think in detail about the bargaining landscape and definitely not about what other adversarial agents are likely to commit to, until I've done more theorizing about commitment races and cooperation and discovered & adopted bargaining strategies that have really nice properties."

Anthony DiGiovanni (3y)

"Exploitation means the exploiter benefits. If you are a rock, you can't be exploited. If you are an agent who never gives in to threats, you can't be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won't benefit them, then you might get nasty things done to you. You wouldn't be exploited, but you'd still be very unhappy."

Cool, I think we basically agree on this point then, sorry for misunderstanding. I just wanted to emphasize the point I made because "you won't get exploited if you decide not to concede to bullies" is kind of trivially true. :) The operative word in my reply was "robustly," which is the hard part of dealing with this whole problem. And I think it's worth keeping in mind how "doing nasty things to you anyway even though it won't benefit them" is a consequence of a commitment that was made for ex ante benefits, it's not the agent being obviously dumb as Eliezer suggests. (Fortunately, as you note in your other comment, some asymmetries should make us think these commitments are rare overall; I do think an agent probably needs to have a pretty extreme-by-human-standards, little-to-lose value system to want to do this... but who knows what misaligned AIs might prefer.)

Daniel Kokotajlo (3y)

Re: Symmetry: Yes, that's why I phrased the original commitment races post the way I did. For both commitments designed to exploit others, and commitments designed to render yourself less exploitable, (and for that matter for commitments not in either category) you have an incentive to do them 'first,' early in your own subjective time and also in particular before you think about what others will do, so that your decision isn't logically downstream of theirs, and so that hopefully theirs is logically downstream of yours. You have an incentive to be the first-mover, basically.

And yeah I do suspect there are various symmetry-breakers that favor various flavors of fairness and niceness and cooperativeness, and disfavor brinksmanshippy risky strategies, but I'm far from confident that the cumulative effect is strong enough to 'dissolve' the problem. If I thought the problem was dissolved I would not still be prioritizing it!

[-]Wei Dai3y*Ω334

I think that agents worthy of being called “rational” will probably handle all this stuff more gracefully/competently than humans do

Humans are kind of terrible at this, right? Many give in even to threats (bluffs) conjured up by dumb memeplexes and backed up by nothing (i.e., heaven/hell), popular films are full of heroes giving in to threats, an apparent majority of philosophers have 2-boxing intuitions (hence the popularity of CDT, which IIUC was invented specifically because some philosophers were unhappy with EDT choosing to 1-box), governments negotiate with terrorists pretty often, etc.

The sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan.

If we build AGI that learn from humans or defer to humans on this stuff, do we not get human-like (in)competence?^[1]^[2] If humans are not atypical, large parts of the acausal society/economy could be similarly incompetent? I imagine there could be a top tier of "rational" superintelligences, built by civilizations that were especially clever or wise or lucky, that cooperate with each other (and exploit everyone else who can be exploited), but I disagree with this second quoted statement, which seems overly optimistic to me. (At least for now; maybe your unstated reasons to be optimistic will end up convincing me.)

[1] I can see two ways to improve upon this: 1) AI safety people seem to have better intuitions (cf popularity of 1-boxing among alignment researchers) and maybe can influence the development of AGI in a better direction, e.g., to learn from / defer to humans with intuitions more like themselves. 2) We figure out metaphilosophy, which lets AGI figure out how to improve upon humans. (ETA: However, conditioning on there not being a simple and elegant solution to decision theory also seems to make metaphilosophy being simple and elegant much less likely. So what would "figure out metaphilosophy" mean in that case?)
[2] I can also see the situation potentially being even worse, since many future threats will be very "out of distribution" for human evolution/history/intuitions/reasoning, so maybe we end up handling them even worse than current threats.

[-]Daniel Kokotajlo3yΩ220

Yes. Humans are pretty bad at this stuff, yet still, society exists and mostly functions. The risk is unacceptably high, which is why I'm prioritizing it, but still, by far the most likely outcome of AGIs taking over the world--if they are as competent at this stuff as humans are--is that they talk it over, squabble a bit, maybe get into a fight here and there, create & enforce some norms, and eventually create a stable government/society. But yeah also I think that AGIs will be by default way better than humans at this sort of stuff. I am worried about the "out of distribution" problem though; I expect humans to perform worse in the future than they perform in the present for this reason.

Yes, some AGIs will be better than others at this, and presumably those that are worse will tend to lose out in various ways on average, similar to what happens in human society.

Consider that in current human society, a majority of humans would probably pay ransoms to free loved ones being kidnapped. Yet kidnapping is not a major issue; it's not like 10% of the population is getting kidnapped and paying ransoms every year. Instead, the governments of the world squash this sort of thing (well, except for failed states etc.) and do their own much more benign version, where you go to jail if you don't pay taxes & follow the laws. When you say "the top tier of rational superintelligences exploits everyone else" I say that is analogous to "the most rational/clever/capable humans form an elite class which rules over and exploits the masses." So I'm like yeah, kinda sorta I expect that to happen, but it's typically not that bad? Also it would be much less bad if the average level of rationality/capability/etc. was higher?

I'm not super confident in any of this to be clear.

[-]Wei Dai3yΩ220

But yeah also I think that AGIs will be by default way better than humans at this sort of stuff.

What are your reasons for thinking this? (Sorry if you already explained this and I missed your point, but it doesn't seem like you directly addressed my point that if AGIs learn from or defer to humans, they'll be roughly human-level at this stuff?)

When you say “the top tier of rational superintelligences exploits everyone else” I say that is analogous to “the most rational/clever/capable humans form an elite class which rules over and exploits the masses.” So I’m like yeah, kinda sorta I expect that to happen, but it’s typically not that bad?

I think it could be much worse than current exploitation, because technological constraints prevent current exploiters from extracting full value from the exploited (have to keep them alive for labor, can't make them too unhappy or they'll rebel, monitoring for and repressing rebellions is costly). But with superintelligence and future/acausal threats, an exploiter can bypass all these problems by demanding that the exploited build an AGI aligned to itself and let it take over directly.

[-]Daniel Kokotajlo3yΩ220

I agree that if AGIs defer to humans they'll be roughly human-level, depending on which humans they are deferring to. If I condition on really nasty conflict happening as a result of how AGI goes on earth, a good chunk of my probability mass (and possibly the majority of it?) is this scenario. (Another big chunk, possibly bigger, is the "humans knowingly or unknowingly build naive consequentialists and let rip" scenario, which is scarier because it could be even worse than the average human, as far as I know). Like I said, I'm worried.

If AGIs learn from humans though, well, it depends on how they learn, but in principle they could be superhuman.

Re: analogy to current exploitation: Yes there are a bunch of differences which I am keen to study, such as that one. I'm more excited about research agendas that involve thinking through analogies like this than I am about what people interested in this topic seem to do by default, which is think about game theory and Nash bargaining and stuff like that. Though I do agree that both are useful and complementary.

[-]Anthony DiGiovanni4y100

From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.

I don't see how this makes the point you seem to want it to make. There's still an equilibrium selection problem for a program game of one-shot PD—some other agent might have the program that insists (through a biased coin flip) on an outcome that's just barely better for you than defect-defect. It's clearly easier to coordinate on a cooperate-cooperate program equilibrium in PD or any other symmetric game, but in asymmetric games there are multiple apparently "fair" Schelling points. And even restricting to one-shot PD, the whole commitment races problem is that the agents don't have common knowledge before they choose their programs.

[-]Anthony DiGiovanni4y50

Perhaps the crux here is whether we should expect all superintelligent agents to converge on the same decision procedure—and the agents themselves will expect this, such that they'll coordinate by default? As sympathetic as I am to realism about rationality, I put a pretty nontrivial credence on the possibility that this convergence just won't occur, and persistent disagreement (among well-informed people) about the fundamentals of what it means to "win" in decision theory thought experiments is evidence of this.

[-]Martín Soto2y4-5

The normative pull of your proposed procedure seems to come from a preconception that "the other player will probably best-respond to me" (and thus, my procedure is correctly shaping its incentives).

But instead we can consider the other player trying to get us to best-respond to them, by jumping up a meta-level: the player checks whether I am playing your proposed policy with a certain notion of fairness $X (which in your case is $5), and punishes according to how far their notion of fairness $Y is from my $X, so that I (if I were to best-respond to their policy) would be incentivized to adopt notion of fairness $Y.

It seems clear that, for the exact same reason your argument might have some normative pull, this other argument has some normative pull in the opposite direction. It then becomes unclear which has stronger normative pull: trying to shape the incentives of the other (because you think they might play a policy one level of sophistication below yours), or trying to best-respond to the other (because you think they might play a policy one level of sophistication above yours).

I think this is exactly the deep problem, the fundamental trade-off, that agents face in both empirical and logical bargaining. I am not convinced all superintelligences will resolve this trade-off in similar enough ways to allow for Pareto-optimality (instead of falling for trapped priors i.e. commitment races), due to the resolution's dependence on the superintelligences' early prior.

[-]Eliezer Yudkowsky2y137

I am denying that superintelligences play this game in a way that looks like "Pick an ordinal to be your level of sophistication, and whoever picks the higher ordinal gets $9." I expect sufficiently smart agents to play this game in a way that doesn't incentivize attempts by the opponent to be more sophisticated than you, nor will you find yourself incentivized to try to exploit an opponent by being more sophisticated than them, provided that both parties have the minimum level of sophistication to be that smart.

If faced with an opponent stupid enough to play the ordinal game, of course, you just refuse all offers less than $9, and they find that there's no ordinal level of sophistication they can pick which makes you behave otherwise. Sucks to be them!

[-]Martín Soto2y60

I agree most superintelligences won't do something which is simply "play the ordinal game" (it was just an illustrative example), and that a superintelligence can implement your proposal, and that it is conceivable most superintelligences implement something close enough to your proposal that they reach Pareto-optimality. What I'm missing is why that is likely.

Indeed, the normative intuition you are expressing (that your policy shouldn't in any case incentivize the opponent to be more sophisticated, etc.) is already a notion of fairness (although in the first meta-level, rather than object-level). And why should we expect most superintelligences to share it, given the dependence on early beliefs and other pro tanto normative intuitions (different from ex ante optimization)? Why should we expect this to be selected for? (Either inside a mind, or by external survival mechanisms)
Compare, especially, to a nascent superintelligence who believes most others might be simulating it and best-responding (thus wants to be stubborn). Why should we think this is unlikely?
Probably if I became convinced trapped priors are not a problem I would put much more probability on superintelligences eventually coordinating.

Another way to put it is: "Sucks to be them!" Yes sure, but also sucks to be me who lost the $1! And maybe sucks to be me who didn't do something super hawkish and got a couple other players to best-respond! While it is true these normative intuitions pull on me less than the one you express, why should I expect this to be the case for most superintelligences?

[-]DaemonicSigil4yΩ110

The Ultimatum game seems like it has pretty much the same type signature as the prisoner's dilemma: Payoff matrix for different strategies, where the players can roll dice to pick which strategy they use. Does timeless decision theory return the "correct answer" (second player rejects greedy proposals with some probability) when you feed it the Ultimatum game?
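To make "rejects greedy proposals with some probability" concrete, here is a minimal sketch of one such responder policy. It does not show what timeless decision theory outputs; it only illustrates the shape of the answer being asked about. The $10 pie and the $5 fair point are assumptions chosen to match the $5 and $9 figures used elsewhere in this thread, and `EPSILON` is a hypothetical penalty size.

```python
from fractions import Fraction

PIE = 10                   # total dollars to split (assumed)
FAIR = 5                   # responder's notion of a fair take for the proposer (assumed)
EPSILON = Fraction(1, 10)  # hypothetical penalty: greed should do strictly worse than fairness

def accept_probability(proposer_keeps):
    """Probability that the responder accepts a split in which the proposer keeps this much."""
    if proposer_keeps <= FAIR:
        return Fraction(1)
    # Pick p so that the proposer's expected take is FAIR - EPSILON, just below the fair payoff.
    return (FAIR - EPSILON) / proposer_keeps

def proposer_expected_take(proposer_keeps):
    """Expected dollars the proposer ends up with if they demand this much."""
    return proposer_keeps * accept_probability(proposer_keeps)

# Against this responder, a proposer maximizing expected dollars demands exactly the fair share.
best_demand = max(range(PIE + 1), key=proposer_expected_take)
```

Under these assumptions, every demand above $5 has the same expected take of $4.90, so the proposer gains nothing by being greedy. The sketch takes the responder's fairness point as given, which is exactly the part that the surrounding comments argue is contested.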

[-]Raemon5yΩ6120

Okay, so now having thought about this a bit...

I at first read this and was like "I'm confused – isn't this what the whole agent foundations agenda is for? Like, I know there are still kinks to work out, and some of these kinks are major epistemological problems. But... I thought this specific problem was not actually that confusing anymore."

"Don't have your AGI go off and do stupid things" is a hard problem, but it seemed basically to be restating "the alignment problem is hard, for lots of finicky confusing reasons."

Then I realized "holy christ most AGI research isn't built off the agent foundations agenda and people regularly say 'well, MIRI is doing cute math things but I don't see how they're actually relevant to real AGI we're likely to build.'"

Meanwhile, I have several examples in mind of real humans who fell prey to something similar to commitment-race concerns. i.e. groups of people who mutually grim-triggered each other because they were coordinating on slightly different principles. (And these were humans who were trying to be rationalist and even agent-foundations-based)

So, yeah actually it seems pretty likely that many AGIs that humans might build might accidentally fall into these traps.

So now I have a vague image in my head of a rewrite of this post that ties together some combo of:

The specific concerns noted here
The rocket alignment problem "hey man we really need to make sure we're not fundamentally confused about agency and rationality."
Possibly some other specific agent-foundations-esque concerns

Weaving those into a central point of:

"If you're the sort of person who's like 'Why is MIRI even helpful? I get how they might be helpful but they seem more like a weird hail-mary or a 'might as well given that we're not sure what else to do?'... here is a specific problem you might run into if you didn't have a very thorough understanding of robust agency when you built your AGI. This doesn't (necessarily) imply any particular AGI architecture, but if you didn't have a specific plan for how to address these problems, you are probably going to get them wrong by default."

(This post might already exist somewhere, but currently these ideas feel like they just clicked together in my mind in a way they hadn't previously. I don't feel like I have the ability to write up the canonical version of this post but feel like "someone with better understanding of all the underlying principles" should)

[-]Elliott Thornley1mo100

[Edit: 2009 in fact!]

Derek Parfit wrote up some thoughts along these lines in 1984:

I shall first distinguish threats from warnings. When I say that I shall do X unless you do Y, call this a warning if my doing X would be worse for you but not for me, and a threat if my doing X would be worse for both of us. Call me a threat‐fulfiller if I would always fulfil my threats.
Suppose that, apart from being a threat‐fulfiller, someone is never self‐denying. Such a person would fulfil his threats even though he knows that this would be worse for him. But he would not make threats if he believed that doing so would be worse for him. This is because, apart from being a threat‐fulfiller, this person is never self‐denying. He never does what he believes will be worse for him, except when he is fulfilling some threat. This exception does not cover making threats.
Suppose that we are all both transparent and never self‐denying. If this was true, it would be better for me if I made myself a threat‐fulfiller, and then announced to everyone else this change in my dispositions. Since I am transparent, everyone would believe my threats. And believed threats have many uses. Some of my threats could be defensive, intended to protect me from aggression by others. I might confine myself to defensive threats. But it would be tempting to use my known disposition in other ways. Suppose that the benefits of some co‐operation are shared between us. And suppose that, without my co‐operation, there would be no further benefits. I might say that, unless I get the largest share, I shall not co‐operate. If others know me to be a threat‐fulfiller, and they are never self‐denying, they will give me the largest share. Failure to do so would be worse for them.
Other threat‐fulfillers might act in worse ways. They could reduce us to slavery. They could threaten that, unless we become their slaves, they will bring about our mutual destruction. We would know that these people would fulfil their threats. We would therefore know that we can avoid destruction only by becoming their slaves.
The answer to threat‐fulfillers, if we are all transparent, is to become a threat‐ignorer. Such a person always ignores threats, even when he knows that doing so will be worse for him. A threat‐fulfiller would not threaten a transparent threat‐ignorer. He would know that, if he did, his threat would be ignored, and he would fulfil this threat, which would be worse for him.
If we were all both transparent and never self‐denying, what changes in our dispositions would be better for each of us? I answer this question in Appendix A, since parts of the answer are not relevant to the question I am now discussing. What is relevant is this. If we were all transparent, it would probably be better for each of us if he became a trustworthy threat‐ignorer. These two changes would involve certain risks; but these would be heavily outweighed by the probable benefits. What would be the benefits from becoming trustworthy? That we would not be excluded from those mutually advantageous agreements that require self‐denial. What would be the benefits from becoming threat‐ignorers? That we would avoid becoming the slaves of threat‐fulfillers.
We can next assume that we could not become trustworthy threat‐ignorers unless we changed our beliefs about rationality. Those who are trustworthy keep their promises even when they know that this will be worse for them. We can assume that we could not become disposed to act in this way unless we believed that it is rational to keep such promises. And we can assume that, unless we were known to have this belief, others would not trust us to keep such promises. On these assumptions, S tells us to make ourselves have this belief. Similar remarks apply to becoming threat‐ignorers. We can assume that we could not become threat‐ignorers unless we believed that it is always rational to ignore threats. And we can assume that, unless we have this belief, others would not be convinced that we are threat‐ignorers. On these assumptions, S tells us to make ourselves have this belief. These conclusions can be combined. S tells us to make ourselves believe that it is always irrational to do what we believe will be worse for us, except when we are keeping promises or ignoring threats.
Does this fact support these beliefs? According to S, it would be rational for each of us to make himself believe that it is rational to ignore threats, even when he knows that this will be worse for him. Does this show this belief to be correct? Does it show that it is rational to ignore such threats?
It will help to have an example. Consider
My Slavery. You and I share a desert island. We are both transparent, and never self‐denying. You now bring about one change in your dispositions, becoming a threat‐fulfiller. And you have a bomb that could blow the island up. By regularly threatening to explode this bomb, you force me to toil on your behalf. The only limit on your power is that you must leave my life worth living. If my life became worse than that, it would cease to be better for me to give in to your threats.
How can I end my slavery? It would be no good killing you, since your bomb will automatically explode unless you regularly dial some secret number. But suppose that I could make myself transparently a threat‐ignorer. Foolishly, you have not threatened that you would ignore this change in my dispositions. So this change would end my slavery.
Would it be rational for me to make this change? There is the risk that you might make some new threat. But since doing so would be clearly worse for you, this risk would be small. And, by taking this small risk, I would almost certainly gain a very great benefit. I would almost certainly end my slavery. Given the wretchedness of my slavery, it would be rational for me, according to S, to cause myself to become a threat‐ignorer. And, given our other assumptions, it would be rational for me to cause myself to believe that it is always rational to ignore threats. Though I cannot be wholly certain that this will be better for me, the great and nearly certain benefit would outweigh the small risk. (In the same way, it would never be wholly certain that it would be better for someone if he became trustworthy. Here too, all that could be true is that the probable benefits outweigh the risks.)
Assume that I have now made these changes. I have become transparently a threat‐ignorer, and have made myself believe that it is always rational to ignore threats. According to S, it was rational for me to cause myself to have this belief. Does this show this belief to be correct?
Let us continue the story.
How I End My Slavery. We both have bad luck. For a moment, you forget that I have become a threat‐ignorer. To gain some trivial end—such as the coconut that I have just picked—you repeat your standard threat. You say, that, unless I give you the coconut, you will blow us both to pieces. I know that, if I refuse, this will certainly be worse for me. I know that you are reliably a threat‐fulfiller, who will carry out your threats even when you know that this will be worse for you. But, like you, I do not now believe in the pure Self‐interest Theory. I now believe that it is rational to ignore threats, even when I know that this will be worse for me. I act on my belief. As I foresaw, you blow us both up.
Is my act rational? It is not. As before, we might concede that, since I am acting on a belief that it was rational for me to acquire, I am not irrational. More precisely, I am rationally irrational. But what I am doing is not rational. It is irrational to ignore some threat when I know that, if I do, this will be disastrous for me and better for no one. S told me here that it was rational to make myself believe that it is rational to ignore threats, even when I know that this will be worse for me. But this does not show this belief to be correct. It does not show that, in such a case, it is rational to ignore threats.
We can draw a wider conclusion. This case shows that we should reject
(G2) If it is rational for someone to make himself believe that it is rational for him to act in some way, it is rational for him to act in this way.
Return now to B, the belief that it is rational to keep our promises even when we know that this will be worse for us. On the assumptions made above, S implies that it is rational for us to make ourselves believe B. Some people claim that this fact supports B, showing that it is rational to keep such promises. But this claim seems to assume (G2), which we have just rejected.
There is another objection to what these people claim. Even though S tells us to try to believe B, S implies that B is false. So, if B is true, S must be false. Since these people believe B, they should believe that S is false. Their claim would then assume
(G3) If some false theory about rationality tells us to make ourselves have a particular belief, this shows this belief to be true.
But we should obviously reject (G3). If some false theory told us to make ourselves believe that the Earth was flat, this would not show this to be so.
S told us to try to believe that it is rational to ignore threats, even when we know that this will be worse for us. As my example shows, this does not support this belief. We should therefore make the same claim about keeping promises. There may be other grounds for believing that it is rational to keep our promises, even when we know that doing so will be worse for us. But this would not be shown to be rational by the fact that the Self‐interest Theory itself told us to make ourselves believe that it was rational. It has been argued that, by appealing to such facts, we can solve an ancient problem: we can show that, when it conflicts with self‐interest, morality provides the stronger reasons for acting. This argument fails. The most that it might show is something less. In a world where we are all transparent—unable to deceive each other—it might be rational to deceive ourselves about rationality.

[-]johnswentworth7yΩ58-2

One big factor this whole piece ignores is communication channels: a commitment is completely useless unless you can credibly communicate it to your opponent/partner. In particular, this means that there isn't a reason to self-modify to something UDT-ish unless you expect other agents to observe that self-modification. On the other hand, other agents can simply commit to not observing whether you've committed in the first place - effectively destroying the communication channel from their end.

In a game of chicken, for instance, I can counter the remove-the-steering-wheel strategy by wearing a blindfold. If both of us wear a blindfold, then neither of us has any reason to remove the steering wheel. In principle, I could build an even stronger strategy by wearing a blindfold and using a beeping laser scanner to tell whether my opponent has swerved - if both players do this, then we're back to the original game of chicken, but without any reason for either player to remove their steering wheel.

[-]Daniel Kokotajlo7yΩ352

I think in the acausal context at least that wrinkle is smoothed out.

In a causal context, the situation is indeed messy as you say, but I still think commitment races might happen. For example, why is [blindfold+laserscanner] a better strategy than just blindfold? It loses to the blindfold strategy, after all. Whether or not it is better than blindfold depends on what you think the other agent will do, and hence it's totally possible that we could get a disastrous crash. (Just imagine that, for whatever reason, both agents think the other agent will probably not do pure blindfold. This can totally happen, especially if the agents don't think they are strongly correlated with each other, and sometimes even if they do, e.g. if they use CDT.) The game of chicken doesn't cease being a commitment race when we add the ability to blindfold and the ability to visibly attach laserscanners.
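A toy sketch of how the best move for an agent acting purely on priors depends on its beliefs about the other driver. The payoff numbers are hypothetical, chosen only so that a crash is far worse than backing down, and `blindfolded_best_move` is an illustrative name rather than anything from the discussion.

```python
# Hypothetical payoffs in chicken: (my move, their move) -> my payoff.
PAYOFF = {
    ("swerve", "swerve"): 0,
    ("swerve", "straight"): -1,
    ("straight", "swerve"): 1,
    ("straight", "straight"): -100,  # crash
}

def expected_payoff(my_move, p_other_straight):
    """Expected payoff of my move, given my prior that the other driver goes straight."""
    return (p_other_straight * PAYOFF[(my_move, "straight")]
            + (1 - p_other_straight) * PAYOFF[(my_move, "swerve")])

def blindfolded_best_move(p_other_straight):
    """Best move for an agent that cannot observe the other driver and acts on its prior alone."""
    return max(("swerve", "straight"),
               key=lambda move: expected_payoff(move, p_other_straight))

# Two agents who are each very confident the other will swerve both go straight, and crash.
confident_a = blindfolded_best_move(0.005)
confident_b = blindfolded_best_move(0.005)
crash = confident_a == "straight" and confident_b == "straight"
```

With these assumed payoffs, going straight is the better move only when the prior that the other driver goes straight is below 1%. The crash comes from both agents holding that confident belief at once, not from either of them failing to maximize expected payoff.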

[-]johnswentworth7yΩ370

Blindfold + scanner does not necessarily lose to blindfold. The blindfold does not prevent swerving, it just prevents gaining information - the blindfold-only agent acts solely on its priors. Adding a scanner gives the agent more data to work with, potentially allowing the agent to avoid crashes. Foregoing the scanner doesn't actually help unless the other player knows I've foregone the scanner, which brings us back to communication - though the "communication" at this point may be in logical time, via simulation.

In the acausal context, communication kicks even harder, because either player can unilaterally destroy the communication channel: they can simply choose to not simulate the other player. The game will never happen at all unless both agents expect (based on priors) to gain from the trade.

[-]Daniel Kokotajlo7yΩ282

If you choose not to simulate the other player, then you can't see them, but they can still see you. So it's destroying one direction of the communication channel. But the direction that remains (them seeing you) is the direction most relevant for e.g. whether or not there is a difference between making a commitment and credibly communicating it to your partner. Not simulating the other player is like putting on a blindfold, which might be a good strategy in some contexts but seems kinda like making a commitment: you are committing to act on your priors in the hopes that they'll see you make this commitment and then conform their behavior to the incentives implied by your acting on your priors.

[-]Linda Linsefors5yΩ270

I mostly agree with this post, except I'm not convinced it is very important. (I wrote some similar thoughts here.)

Raw power (including intelligence) will always be more important than having the upper hand in negotiation, because I can only shift you up to the amount I can threaten you.

Let's say I can cause you up to X utility of harm, according to your utility function. If I'm maximally skilled at blackmail negotiation, then I can decide your action within the set of actions such that your utility is within (max-X, max].

If X utility is a lot, then I can influence you a lot. If X is not so much then I don't have much power over you. If I'm strong then X will be large, and influencing your action will probably be of little importance to me.
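The bound above can be sketched in a few lines. This assumes the victim is a naive consequentialist who complies whenever complying beats taking their best action and absorbing the full harm X; the function name and the example utilities are hypothetical.

```python
def inducible_actions(victim_utility, max_harm):
    """Actions a maximally skilled blackmailer could induce, for harm capacity max_harm > 0.

    The victim complies with "do action a, or I inflict max_harm" only if
    utility(a) > best_utility - max_harm, i.e. utility(a) lies in (max - X, max].
    """
    best = max(victim_utility.values())
    return {action for action, utility in victim_utility.items()
            if utility > best - max_harm}

# Hypothetical victim utilities over four available actions.
utilities = {"a": 10, "b": 8, "c": 3, "d": -5}

weak_blackmailer = inducible_actions(utilities, max_harm=3)     # {"a", "b"}
strong_blackmailer = inducible_actions(utilities, max_harm=10)  # {"a", "b", "c"}
```

The set of inducible actions grows with X, which is the sense in which bargaining leverage is capped by raw power in this argument.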

Blackmail is only important when players are of similar strength, which is probably unlikely, or if the power to destroy is much more than the power to create, which I also find unlikely.

The main scenario where I expect blackmail to seriously matter (among super intelligences) is in acausal trade between different universes. I'm sceptical of this being a real thing, but admit I don't have strong arguments on this point.

[-]Daniel Kokotajlo5yΩ230

I agree raw power (including intelligence) is very useful and perhaps generally more desirable than bargaining power etc. But that doesn't undermine the commitment races problem; agents with the ability to make commitments might still choose to do so in various ways and for various reasons, and there's general pressure (collective action problem style) for them to do it earlier while they are stupider, so there's a socially-suboptimal amount of risk being taken.

I agree that on Earth there might be a sort of unipolar takeoff where power is sufficiently imbalanced and credibility sufficiently difficult to obtain and "direct methods" easier to employ, that this sort of game theory and bargaining stuff doesn't matter much. But even in that case there's acausal stuff to worry about, as you point out.

[-]CronoDAS3y20

In the real world, the power to destroy actually is usually a lot stronger than the power to create. For example, it's a lot easier to blow up an undefended building than to build one.

The laws of thermodynamics are the root of all evil.

[-]JesseClifton5yΩ360

It seems like we can kind of separate the problem of equilibrium selection from the problem of “thinking more”, if “thinking more” just means refining one’s world models and credences over them. One can make conditional commitments of the form: “When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.”

The set of solution concepts you commit to regarding as acceptable still poses an equilibrium selection problem. But, on the face of it at least, the “thinking more” part is handled by conditional commitments to act on the basis of future beliefs.
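A small sketch of why the choice of solution concept is itself an equilibrium selection problem: on the same agreed world-model, Nash and Kalai-Smorodinsky can pick different agreements. The outcome list and disagreement point are hypothetical, and the Kalai-Smorodinsky function is a discrete approximation (maximizing the smaller normalized gain over a finite set), not the exact solution on a convex feasible set.

```python
def individually_rational(outcomes, disagreement):
    """Outcomes that leave both players at least as well off as the disagreement point."""
    d1, d2 = disagreement
    return [(u1, u2) for u1, u2 in outcomes if u1 >= d1 and u2 >= d2]

def nash_solution(outcomes, disagreement):
    """Outcome maximizing the product of gains over the disagreement point."""
    d1, d2 = disagreement
    candidates = individually_rational(outcomes, disagreement)
    return max(candidates, key=lambda o: (o[0] - d1) * (o[1] - d2))

def kalai_smorodinsky_solution(outcomes, disagreement):
    """Discrete approximation: maximize the smaller gain, each normalized by that player's best case.

    Assumes each player has some outcome strictly better than disagreement.
    """
    d1, d2 = disagreement
    candidates = individually_rational(outcomes, disagreement)
    best1 = max(u1 for u1, _ in candidates)
    best2 = max(u2 for _, u2 in candidates)
    return max(candidates,
               key=lambda o: min((o[0] - d1) / (best1 - d1), (o[1] - d2) / (best2 - d2)))

# Hypothetical agreed world-model: four feasible agreements as (utility_1, utility_2).
OUTCOMES = [(10, 0), (7, 4), (5, 6), (2, 7)]
DISAGREEMENT = (0, 0)

nash_pick = nash_solution(OUTCOMES, DISAGREEMENT)
ks_pick = kalai_smorodinsky_solution(OUTCOMES, DISAGREEMENT)
```

In this example the Nash rule selects (5, 6) while the Kalai-Smorodinsky rule selects (7, 4), so two agents who agree on the model but committed to different solution concepts would still disagree about the split.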

I guess there’s the problem of what protocols for specifying future world-models you commit to regarding as acceptable. Maybe there are additional protocols that haven’t occurred to you, but which other agents may have committed to and which you would regard as acceptable when presented to you. Hopefully it is possible to specify sufficiently flexible methods for determining whether protocols proposed by your future counterparts are acceptable that this is not a problem.

[-]Daniel Kokotajlo5yΩ230

If I read you correctly, you are suggesting that some portion of the problem can be solved, basically -- that it's in some sense obviously a good idea to make a certain sort of commitment, e.g. "When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.” So the commitment races problem may still exist, but it's about what other commitments to make besides this one, and when. Is this a fair summary?

I guess my response would be "On the object level, this seems like maybe a reasonable commitment to me, though I'd have lots of questions about the details. We want it to be vague/general/flexible enough that we can get along nicely with various future agents with somewhat different protocols, and what about agents that are otherwise reasonable and cooperative but for some reason don't want to agree on a world-model with us? On the meta level though, I'm still feeling burned from the various things that seemed like good commitments to me and turned out to be dangerous, so I'd like to have some sort of stronger reason to think this is safe."

[-]JesseClifton5yΩ230

Yeah I agree the details aren’t clear. Hopefully your conditional commitment can be made flexible enough that it leaves you open to being convinced by agents who have good reasons for refusing to do this world-model agreement thing. It’s certainly not clear to me how one could do this. If you had some trusted “deliberation module”, which engages in open-ended generation and scrutiny of arguments, then maybe you could make a commitment of the form “use this protocol, unless my counterpart provides reasons which cause my deliberation module to be convinced otherwise”. Idk.

Your meta-level concern seems warranted. One would at least want to try to formalize the kinds of commitments we’re discussing and ask if they provide any guarantees, modulo equilibrium selection.

[-]Daniel Kokotajlo5yΩ120

I think we are on the same page then. I like the idea of a deliberation module; it seems similar to the "moral reasoning module" I suggested a while back. The key is to make it not itself a coward or bully, reasoning about schelling points and universal principles and the like instead of about what-will-lead-to-the-best-expected-outcomes-given-my-current-credences.

[-]Daniel Kokotajlo3yΩ250

h/t Anthony DiGiovanni who points to this new paper making a weaker version of this point, in the context of normative ethics: Johan E. Gustafsson, Bentham’s Mugging - PhilPapers

[-]Raemon5yΩ250

I was confused about this post, and... I might have resolved my confusion by the time I got ready to write this comment. Unsure. Here goes:

My first* thought:

Am I not just allowed to precommit to "be the sort of person who always figures out whatever the optimal game theory was, and commit to that?". I thought that was the point.

i.e. I wouldn't precommit to treating either the Nash Bargaining Solution or Kalai-Smorodinsky Solution as "the permanent grim trigger bullying point", I'd precommit to something like "have a meta-policy of not giving into bullying, pick my best-guess-definition-of-bullying as my default trigger, and my best-guess grim-trigger response, but include an 'oh shit I didn't think about X' parameter." (with some conditional commitments thrown in)

Where X can't be an arbitrary new belief – the whole point of having a grim trigger clause is to be able to make appropriately weighted threats that AGI-Bob really thinks will happen. But, if I legitimately didn't think of the Kalai-Smordinwhatever solution as something an agent might legitimately think was a good coordination tool, I want to be able to say, depending on circumstances:

If the deal hasn't resolved yet "oh, shit I JUUUST thought of the Kalai-whatever thing and this means I shouldn't execute my grim trigger anti-bullying clause without first offering some kind of further clarification step."
If the deal already resolved before I thought of it, say "oh shit man I really should have realized the Kalai-Smorodinsky thing was a legitimate schelling point and not started defecting hard as punishment. Hey, fellow AGI, would you like me to give you N remorseful utility in return for which I stop grim-triggering you and you stop retaliating at me and we end the punishment spiral?"

My second* thought:

Okay. So. I guess that's easy for me to say. But, I guess the whole point of all this updateless decision theory stuff was to actually formalize that in a way that you could robustly program an AGI that you were about to give the keys to the universe.

Having a vague handwavy notion of it isn't reassuring enough if you're about to build a god.

And while it seems to me like this is (relatively) straightforward... do I really want to bet that?

I guess my implicit assumption was that game theory would turn out to not be that complicated in the grand scheme of things. Surely once you're a Jupiter Brain you'll have it figured out? And, hrmm, maybe that's true, but maybe it's not, or maybe it turns out the fate of the cosmos gets decided with smaller AGIs fighting over Earth with much more limited compute.

Third thought:

Man, just earlier this year, someone offered me a coordination scheme that I didn't understand, and I fucked it up, and the deal fell through because I didn't understand the principles underlying it until just-too-late. (this is an anecdote I plan to write up as a blogpost sometime soon)

And... I guess I'd been implicitly assuming that AGIs would just be able to think fast enough that that wouldn't be a problem.

Like, if you're talking to a used car salesman, and you say "No more than $10,000", and then they say "$12,000 is final offer", and then you turn and walk away, hoping that they'll say "okay, fine, $10,000"... I suppose metaphorical AGI used car buyers could say "and if you take more than 10 compute-cycles to think about it, the deal is off." And that might essentially limit you to only be able to make choices you'd precomputed, even if you wanted to give yourself the option to think more.

That seems to explain why my "Just before deal resolves, realize I screwed up my decision theory" idea doesn't work.

It seems like my "just after deal resolves and I accidentally grim trigger, turn around and say 'oh shit, I screwed up, here is remorse payment + a costly proof that I'm not fudging my decision theory'" should still work though?

I guess in the context of Acausal Trade, I can imagine things like "they only bother running a simulation of you for 100 cycles, and it doesn't matter if on the 101st cycle you realize you made a mistake and are sorry." They'll never know it.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical a galaxy brain can't figure out the right precommitments.

I dunno. Maybe I'm still confused.

But, I wanted to check in on whether I was on the right track in understanding what considerations were at play here.

...

*actually there were like 20 thoughts before I got to the one I've labeled 'first thought' here. But, "first thought that seemed worth writing down."

[-]Daniel Kokotajlo5yΩ470

Thanks! Reading this comment makes me very happy, because it seems like you are now in a similar headspace to me back in the day. Writing this post was my response to being in this headspace.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical a galaxy brain can't figure out the right precommitments.

This sounds like a plausibly good rule to me. But that doesn't mean that every AI we build will automatically follow it. Moreover, thinking about acausal trade is in some sense engaging in acausal trade. As I put it:

Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

As for your handwavy proposals, I do agree that they are pretty good. They are somewhat similar to the proposals I favor, in fact. But these are just specific proposals in a big space of possible strategies, and (a) we have reason to think there might be flaws in these proposals that we haven't discovered yet, and (b) even if these proposals work perfectly there's still the problem of making sure that our AI follows them:

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.

If you want to think and talk more about this, I'd be very interested to hear your thoughts. Unfortunately, while my estimate of the commitment races problem's importance has only increased over the past year, I haven't done much to actually make intellectual progress on it.

[-]Raemon5yΩ240

Yeah I'm interested in chatting about this.

I feel I should disclaim "much of what I'd have to say about this is a watered down version of whatever Andrew Critch would say". He's busy a lot, but if you haven't chatted with him about this yet you probably should, and if you have I'm not sure whether I'll have much to add.

But I am pretty interested right now in fleshing out my own coordination principles and fleshing out my understanding of how they scale up from "200 human rationalists" to 1000-10,000 sized coalitions to All Humanity and to AGI and beyond. I'm currently working on a sequence that could benefit from chatting with other people who think seriously about this.

[-]CronoDAS3y20

I suppose metaphorical AGI used car buyers could say "and if you take more than 10 compute-cycles to think about it, the deal is off." And that might essentially limit you to only be able to make choices you'd precomputed, even if you wanted to give yourself the option to think more.

"This offer is only valid if you say yes right now - if you go home and come back tomorrow, it will cost more" actually is one of those real-world dirty tricks that salespeople use to rip people off.

[-]David Scott Krueger7yΩ250

I have another "objection", although it's not a very strong one, and more of just a comment.

One reason game theory reasoning doesn't work very well in predicting human behavior is because games are always embedded in a larger context, and this tends to wreck the game-theory analysis by bringing in reputation and collusion as major factors. This seems like something that would be true for AIs as well (e.g. "the code" might not tell the whole story; I/"the AI" can throw away my steering wheel but rely on an external steering-wheel-replacing buddy to jump in at the last minute if needed).

In apparent contrast to much of the rationalist community, I think by default one should probably view game theoretic analyses (and most models) as "just one more way of understanding the world" as opposed to "fundamental normative principles", and expect advanced AI systems to reason more heuristically (like humans).

But I understand and agree with the framing here as "this isn't definitely a problem, but it seems important enough to worry about".

[-]Wei Dai7yΩ250

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016.

I found some related discussions going back to 2009. It's mostly highly confused, as you might expect, but I did notice this part which I'd forgotten and may actually be relevant:

But if you are TDT, you can’t always use less computing power, because that might be correlated with your opponents also deciding to use less computing power

This could potentially be a way out of the "racing to think as little as possible before making commitments" dynamic, but if we have to decide how much to let our AIs think initially before making commitments, on the basis of reasoning like this, that's a really hairy thing to have to do. (This seems like another good reason for wanting to go with a metaphilosophical approach to AI safety instead of a decision theoretic one. What's the point of having a superintelligent AI if we can't let it figure these kinds of things out for us?)

If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can “move first” can get much more than the one that “moves second.”

I'm not sure how the folk theorem shows this. Can you explain?

going updateless is like making a bunch of commitments all at once

Might be a good idea to offer some examples here to help explain updateless and for pumping intuitions.

Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn’t actually commit to anything then.

Interested to hear more details about this. What would have happened if you were actually able to become updateless?

[-]Liam Donovan7y40

Would trying to become less confused about commitment races before building a superintelligent AI count as a metaphilosophical approach or a decision theoretic one (or neither)? I'm not sure I understand the dividing line between the two.

[-]Wei Dai7y50

Trying to become less confused about commitment races can be part of either a metaphilosophical approach or a decision theoretic one, depending on what you plan to do afterwards. If you plan to use that understanding to directly give the AI a better decision theory which allows it to correctly handle commitment races, then that's what I'd call a "decision theoretic approach". Alternatively, you could try to observe and understand what humans are doing when we're trying to become less confused about commitment races and program or teach an AI to do the same thing so it can solve the problem of commitment races on its own. This would be an example of what I call a "metaphilosophical approach".

[-]Daniel Kokotajlo7yΩ110

Thanks, edited to fix!

I agree with your push towards metaphilosophy.

I didn't mean to suggest that the folk theorem proves anything. Nevertheless here is the intuition: The way the folk theorem proves any status quo is possible is by assuming that players start off assuming everyone else will grim trigger them for violating that status quo. So in a two-player game, if both players start off assuming player 1 will grim trigger player 2 for violating player 1's preferred status quo, then player 1 will get what they want. One way to get this to happen is for player 1 to be "earlier in logical time" than player 2 and make a credible commitment.
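The intuition above can be sketched numerically. This is a minimal Python sketch, assuming an infinitely repeated game with discount factor delta and a made-up lopsided status quo; the payoff numbers and function names are hypothetical and only illustrate the standard grim-trigger comparison between complying forever and deviating once.

```python
def comply_value(comply_payoff: float, delta: float) -> float:
    """Discounted value of receiving comply_payoff in every round."""
    return comply_payoff / (1.0 - delta)


def deviate_value(deviate_payoff: float, punish_payoff: float,
                  delta: float) -> float:
    """Discounted value of deviating once, then being grim-triggered:
    one round of deviate_payoff followed by punish_payoff forever."""
    return deviate_payoff + delta * punish_payoff / (1.0 - delta)


def complies(comply_payoff: float, deviate_payoff: float,
             punish_payoff: float, delta: float) -> bool:
    """True if sticking to the status quo is at least as good as deviating."""
    return (comply_value(comply_payoff, delta)
            >= deviate_value(deviate_payoff, punish_payoff, delta))


def patience_threshold(comply_payoff: float, deviate_payoff: float,
                       punish_payoff: float) -> float:
    """Smallest delta at which the grim-trigger threat sustains compliance."""
    return (deviate_payoff - comply_payoff) / (deviate_payoff - punish_payoff)


# Hypothetical numbers: player 1's preferred status quo gives player 2
# only 1 per round; deviating pays 5 once, then 0 forever under punishment.
COMPLY, DEVIATE, PUNISH = 1.0, 5.0, 0.0

patient_player_complies = complies(COMPLY, DEVIATE, PUNISH, delta=0.9)
impatient_player_complies = complies(COMPLY, DEVIATE, PUNISH, delta=0.5)
threshold = patience_threshold(COMPLY, DEVIATE, PUNISH)
```

With these assumed numbers, a sufficiently patient player 2 (delta of at least 0.8) accepts a status quo that strongly favours player 1, which is the sense in which whoever gets their grim trigger believed first gets what they want.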

As for updatelessness: Well, updateless agents follow the policy that is optimal from the perspective of the credences they have at the time they go updateless. So e.g. if there is a cowardly agent who simulates you at that time or later and then caves to your demands (if you make any) then an updateless agent will be a bully and make demands, i.e. commit to punishing people it identifies as cowards who don't do what it wants. But of course updateless agents are also cowards themselves, in the sense that the best policy from the perspective of credences C is to cave in to any demands that have already been committed to according to C. I don't have a super clear example of how this might lead to disaster, but I intend to work one out in the future...

Same goes for my own experience. I don't have a clear example in mind of something bad that would have happened to me if I had actually self-modified, but I get a nervous feeling about it.

[-]Linda Linsefors5yΩ340

Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

Why is that?

[-]Daniel Kokotajlo5yΩ340

All the versions of updatelessness that I know of would have led to some pretty disastrous, not-adding-up-to-normality behaviors, I think. I'm not sure. More abstractly, the commitment races problem has convinced me to be more skeptical of commitments, even ones that seem probably good. If I was a consequentialist I might take the gamble, but I'm not a consequentialist -- I have commitments built into me that have served my ancestors well for generations, and I suspect for now at least I'm better off sticking with that than trying to self-modify to something else.

[-]Linda Linsefors5yΩ250

(This is some of what I tried to say yesterday, but I was very tired and not sure I said it well)

Hm, the way I understand UDT, is that you give yourself the power to travel back in logical time. This means that you don't need to actually make commitments early in your life when you are less smart.

If you are faced with blackmail or transparent Newcomb's problem, or something like that, where you realise that if you had thought of the possibility of this sort of situation before it happened (but with your current intelligence), you would have pre-committed to something, then you should now do as you would have pre-committed to.

This means that a UDT agent doesn't have to make tons of pre-commitments. It can figure things out as it goes, and still get the benefit of early pre-committing. Though as I said when we talked, it does lose some transparency, which might be very costly in some situations. Though I do think that you lose transparency in general by being smart, and that it is generally worth it.

(Now something I did not say)

However, there is one commitment that you (maybe?[1]) have to make to get the benefit of UDT if you are not already UDT, which is to commit to become UDT. And I get that you are wary of commitments.

Though more concretely, I don't see how UDT can lead to worse behaviours. Can you give an example? Or do you just mean that UDT gets into commitment races at all, which is bad? But I don't know any DT that avoids this, other than always giving in to blackmail and bullies, which I already know you don't, given one of the stories in the blogpost.

[1] Or maybe not. Is there a principled difference between never giving in to blackmail because you pre-committed to something, and just never giving in to blackmail without any binding pre-commitment? I suspect not really, which means you are UDT as long as you act UDT, and no pre-commitment is needed, other than for your own sake.

[-]Daniel Kokotajlo5yΩ120

Thanks for the detailed reply!

where you realise that if you had thought of the possibility of this sort of situation before it happened (but with your current intelligence), you would have pre-committed to something, then you should now do as you would have pre-committed to.

The difficulty is in how you spell out that hypothetical. What does it mean to think about this sort of situation before it happened but with your current intelligence? Your current intelligence includes lots of wisdom you've accumulated, and in particular, includes the wisdom that this sort of situation has happened, and more generally that this sort of situation is likely, etc. Or maybe it doesn't -- but then how do we define current intelligence? What parts of your mind do we cut out, to construct the hypothetical?

I've heard of various ways of doing this and IIRC none of them solved the problem, they just failed in different ways. But it's been a while since I thought about this.

One way they can fail is by letting you have too much of your current wisdom in the hypothetical, such that it becomes toothless -- if your current wisdom is that people threatening you is likely, you'll commit to giving in instead of resisting, so you'll be a coward and people will bully you. Another way they can fail is by taking away too much of your current wisdom in the hypothetical, so that you commit to stupid-in-retrospect things too often.

[-]Linda Linsefors5yΩ110

Imagine your life as a tree (as in data structure). Every observation which (from your point of view of prior knowledge) could have been different, and every decision which (from your point of view) could have been different, is a node in this tree.

Ideally you would want to pre-analyse the entire tree, and decide the optimal pre-commitment for each situation. This is too much work.

So instead you wait and see which branch you find yourself in, only then make the calculations needed to figure out what you would do in that situation, given a complete analysis of the tree (including logical constraints, e.g. people predicting what you would have done, etc). This is UDT. In theory, I see no drawbacks with UDT. Except in practice UDT is also too much work.

What you actually do, as you say, is to rely on experience-based heuristics. Experience-based heuristics are far superior in computational efficiency, and will give you a leg up in raw power. But you will slide away from optimal DT, which will give you a negotiating disadvantage. Given that I think raw power is more important than negotiating advantage, I think this is a good trade-off.

The only situation where you want to rely more on DT principles, is in super important one-off situations, and you basically only get those in weird acausal trade situations. Like, you could frame us building a friendly AI as acausal trade, like Critch said, but that framing does not add anything useful.

And then there are things like this and this and this, which I don't know how to think of. I suspect it breaks somehow, but I'm not sure how. And if I'm wrong, getting DT right might be the most important thing.

But in any normal situation, you will either have repeated games among several equals, where some coordination mechanism is just uncomplicatedly in everyone's interest, or you're in a situation where one person just has much more power than the other.

[-]MichaelA5y40

Thanks for this post; this does seem like a risk worth highlighting.

I've just started reading Thomas Schelling's 1960 book The Strategy of Conflict, and noticed a lot of ideas in chapter 2 that reminded me of many of the core ideas in this post. My guess is that that sentence is an uninteresting, obvious observation, and that Daniel and most readers were already aware (a) that many of the core ideas here were well-trodden territory in game theory and (b) that this post's objectives were to:

highlight these ideas to people on LessWrong
highlight their potential relevance to AI risk
highlight how this interacts with updateless decision theory and acausal trade

But maybe it'd be worth people who are interested in this problem reading that chapter of The Strategy of Conflict, or other relevant work in standard academic game theory, to see if there are additional ideas there that could be fruitful here.

[-]Raemon5y40

I'm about halfway through Strategy of Conflict and so far it's not really giving solutions to any of these problems, just sketching out the problem space.

[-]Raemon6yΩ240Nomination for 2019 Review

This feels like an important question in Robust Agency and Group Rationality, which are major topics of my interest.

[-]Dagon7y*Ω230

I think you're missing at least one key element in your model: uncertainty about future predictions. Commitments have a very high cost in terms of future consequence-affecting decision space. Consequentialism does _not_ imply a very high discount rate, and we're allowed to recognize the limits of our prediction and to give up some power in the short term to reserve our flexibility for the future.

Also, one of the reasons that this kind of interaction is rare among humans is that commitment is impossible for humans. We can change our minds even after making an oath - often with some reputational consequences, but still possible if we deem it worthwhile. Even so, we're rightly reluctant to make serious commitments. An agent who can actually enforce its self-limitations is going to be orders of magnitude more hesitant to do so.

All that said, it's worth recognizing that an agent that's significantly better at predicting the consequences of potential commitments will pay a lower cost for the best of them, and has a material advantage over those who need flexibility because they don't have information. This isn't a race in time, it's a race in knowledge and understanding. I don't think there's any way out of that race - more powerful agents are going to beat weaker ones most of the time.

[-]Daniel Kokotajlo7yΩ340

I don't think I was missing that element. The way I think about it is: There is some balance that must be struck between making commitments sooner (risking making foolish decisions due to ignorance) and later (risking not having the right commitments made when a situation arises in which they would be handy). A commitment race is a collective action problem where individuals benefit from going far to the "sooner" end of the spectrum relative to the point that would be optimal for everyone if they could coordinate.
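As a toy illustration of that collective action structure, here is a minimal Python sketch in which each of two agents chooses to commit "early" or "late". The payoff numbers are invented purely for illustration (they are not derived from anything in the post), and the helper names are hypothetical.

```python
from typing import Dict, Tuple

ACTIONS = ("early", "late")

# Hypothetical payoffs (row player, column player). Committing early against
# a late committer wins the bargaining advantage; two early committers risk
# incompatible, ill-informed commitments.
PAYOFFS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("late", "late"): (4, 4),
    ("early", "late"): (5, 1),
    ("late", "early"): (1, 5),
    ("early", "early"): (2, 2),
}


def is_strictly_dominant_for_row(action: str) -> bool:
    """True if `action` beats every alternative against every column action."""
    return all(
        PAYOFFS[(action, other)][0] > PAYOFFS[(alt, other)][0]
        for other in ACTIONS
        for alt in ACTIONS
        if alt != action
    )


def pareto_dominates(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if outcome a is at least as good for both and better for one."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


early_is_dominant = is_strictly_dominant_for_row("early")
race_outcome = PAYOFFS[("early", "early")]
coordinated_outcome = PAYOFFS[("late", "late")]
race_is_worse_for_everyone = pareto_dominates(coordinated_outcome, race_outcome)
```

Under these assumed payoffs, committing early is individually dominant, yet both agents committing early is worse for each of them than both waiting.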

I agree about humans not being able to make commitments--at least, not arbitrary commitments. (Arguably, getting angry and seeking revenge when someone murders your family is a commitment you made when you were born.) I think we should investigate whether this inability is something evolution "chose" or not.

I agree it's a race in knowledge/understanding as well as time. (The two are related.) But I don't think more knowledge = more power. For example, if I don't know anything and decide to commit to plan X which benefits me, else war, and you know more than me--in particular, you know enough about me to know what I will commit to--and you are cowardly, then you'll go along with my plan.
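The "plan X, else war" example can be sketched as a game of chicken. This is a minimal Python sketch with made-up payoffs and hypothetical names: the row player commits to "demand" without modelling anyone, while the column player correctly predicts that commitment and simply best-responds to it.

```python
from typing import Dict, Tuple

ACTIONS = ("demand", "yield")

# Hypothetical payoffs (row player, column player).
# Both demanding means war, which is terrible for both.
PAYOFFS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("demand", "demand"): (-10, -10),
    ("demand", "yield"): (3, 1),
    ("yield", "demand"): (1, 3),
    ("yield", "yield"): (2, 2),
}


def column_best_response(row_action: str) -> str:
    """The column player's payoff-maximising reply to a known row action."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(row_action, a)][1])


# The ignorant row player commits first; the better-informed column player
# predicts the commitment and responds like a coward in the post's sense.
committed_action = "demand"
reply = column_best_response(committed_action)
committer_payoff, informed_payoff = PAYOFFS[(committed_action, reply)]
```

With these assumed numbers the informed player yields and ends up with 1 while the ignorant committer gets 3, so knowing more about the other agent did not translate into doing better.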

[-]FeepingCreature7y10

I think this undervalues conditional commitments. The problem of "early commitment" depends entirely on you possibly having a wrong image of the state of the world. So if you just condition your commitment on the information you have available, you avoid premature commitments made in ignorance and give other agents an incentive to improve your world model. Likewise, this would protect you from learning about other agents' commitments "too late" - you can always just condition on things like "unless I find an agent with commitment X". You can do this whether or not you even know to think of an agent with commitment X, as long as other agents who care about X can predict your reaction to learning about X.

Commitments aren't inescapable shackles, they're just another term for "predictable behavior." The usefulness of commitments doesn't require you to bind yourself regardless of learning any new information about reality. Oaths are highly binding for humans because we "look for excuses", our behavior is hard to predict, and we can't reliably predict and evaluate complex rule systems. None of those should pose serious problems for trading superintelligences.

[-]Daniel Kokotajlo7y50

I don't think this solves the problem, though it is an important part of the picture.

The problem is, which conditional commitments do you make? (A conditional commitment is just a special case of a commitment.) "I'll retaliate against A by doing B, unless [insert list of exceptions here]." Thinking of appropriate exceptions is important mental work, and you might not think of all the right ones for a very long time, and moreover while you are thinking about which exceptions you should add, you might accidentally realize that such-and-such type of agent will threaten you regardless of what you commit to and then if you are a coward you will "give in" by making an exception for that agent. The problem persists, in more or less exactly the same form, in this new world of conditional commitments. (Again, which are just special cases of commitments, I think.)

[-]FeepingCreature7y00

I concur in general, but:

you might accidentally realize that such-and-such type of agent will threaten you regardless of what you commit to and then if you are a coward you will “give in” by making an exception for that agent.

this seems like a problem for humans and badly-built AIs. Nothing that reliably one-boxes should ever do this.

[-]Daniel Kokotajlo7y30

EDT reliably one-boxes, but EDT would do this.

Or do you mean one-boxing in Transparent Newcomb? Then your claim might be true, but even then it depends on how seriously we take the "regardless of what you commit to" clause.

[-]FeepingCreature7y10

True, sorry, I forgot the whole set of paradoxes that led up to FDT/UDT. I mean something like... "this is equivalent to the problem that FDT/UDT already has to solve anyways." Allowing you to make exceptions doesn't make your job harder.
