# Ω 34

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.

For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.

## Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible

Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate.

Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.

In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.

So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits to not give in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.

Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things) and the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they've learned about by t.

Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

Anecdote: Playing a board game last week, my friend Lukas said (paraphrase) "I commit to making you lose if you do that move." In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike--and possibly outright bullying--by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn't comply. Practice what you preach!)

## When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.

This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other one limits theirs. But it gets worse.

Sometimes commitments can be made "at the same time"--i.e. in ignorance of each other--in such a way that they lock in an outcome that is disastrous for everyone. (Think both players in Chicken throwing out their steering wheels simultaneously.)

Here is a somewhat concrete example: Two consequentialist AGI think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request--perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying--so many years later they meet each other, learn about each other, and end up locked into all-out war.

I'm not saying disastrous AGI commitments are the default outcome; I'm saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we create a value-aligned AGI that ends up getting into all sorts of fights across the multiverse with other value systems. We'd wish we built a paperclip maximizer instead.

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.

Objection: "Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?"

1. Devastating commitments (e.g. "Grim Trigger") are much more possible with AGI--just alter the code! Inigo Montoya is a fictional character and even he wasn't able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.

2. Credibility is much easier also, especially in an acausal context (see above.)

3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.

4. AGI may not have sufficiently strong reputation effects in the sense relevant to consequentialists, partly because threats can be made more devastating (see above) and partly because they may not believe they exist in a population of other powerful agents who will bully them if they show weakness.

5. Finally, these terrible things (Brutal threats, costly fights) do happen to some extent even among humans today--especially in situations of anarchy. We want the AGI we built to be less likely to do that stuff than humans, not merely as likely.

Objection: "Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it's-too-late arguments, and thus will be incapable of hurting anyone."

Reply: That would be nice, wouldn't it? Let's hope so, but not count on it. Indeed perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...

Anecdote: A friend of mine, when she was a toddler, would threaten her parents: "I'll hold my breath until you give me the candy!" Imagine how badly things would have gone if she was physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

## Conclusion

Overall, I'm not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if "solving bargaining" turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.

Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.

# Ω 34

23 comments, sorted by Highlighting new comments since
New Comment

One big factor this whole piece ignores is communication channels: a commitment is completely useless unless you can credibly communicate it to your opponent/partner. In particular, this means that there isn't a reason to self-modify to something UDT-ish unless you expect other agents to observe that self-modification. On the other hand, other agents can simply commit to not observing whether you've committed in the first place - effectively destroying the communication channel from their end.

In a game of chicken, for instance, I can counter the remove-the-steering-wheel strategy by wearing a blindfold. If both of us wear a blindfold, then neither of us has any reason to remove the steering wheel. In principle, I could build an even stronger strategy by wearing a blindfold and using a beeping laser scanner to tell whether my opponent has swerved - if both players do this, then we're back to the original game of chicken, but without any reason for either player to remove their steering wheel.

I think in the acausal context at least that wrinkle is smoothed out.

In a causal context, the situation is indeed messy as you say, but I still think commitment races might happen. For example, why is [blindfold+laserscanner] a better strategy than just blindfold? It loses to the blindfold strategy, for example. Whether or not it is better than blindfold depends on what you think the other agent will do, and hence it's totally possible that we could get a disastrous crash (just imagine that for whatever reason both agents think the other agent will probably not do pure blindfold. This can totally happen, especially if the agents don't think they are strongly correlated with each other and sometimes even if they do (e.g. if they use CDT)) The game of chicken doesn't cease being a commitment race when we add the ability to blindfold and the ability to visibly attach laserscanners.

Blindfold + scanner does not necessarily lose to blindfold. The blindfold does not prevent swerving, it just prevents gaining information - the blindfold-only agent acts solely on its priors. Adding a scanner gives the agent more data to work with, potentially allowing the agent to avoid crashes. Foregoing the scanner doesn't actually help unless the other player knows I've foregone the scanner, which brings us back to communication - though the "communication" at this point may be in logical time, via simulation.

In the acausal context, communication kicks even harder, because either player can unilaterally destroy the communication channel: they can simply choose to not simulate the other player. The game will never happen at all unless both agents expect (based on priors) to gain from the trade.

If you choose not to simulate the other player, then you can't see them, but they can still see you. So it's destroying one direction of the communication channel. But the direction that remains (they seeing you) is the dimension most relevant for e.g. whether or not there is a difference between making a commitment and credibly communicating it to your partner. Not simulating the other player is like putting on a blindfold, which might be a good strategy in some contexts but seems kinda like making a commitment: you are committing to act on your priors in the hopes that they'll see you make this commitment and then conform their behavior to the incentives implied by your acting on your priors.

I at first read this and was like "I'm confused – isn't this what the whole agent foundations agenda is for? Like, I know there are still kinks to work out, and some of this kinks are major epistemological problems. But... I thought this specific problem was not actually that confusing anymore."

"Don't have your AGI go off and do stupid things" is a hard problem, but it seemed basically to be restating "the alignment problem is hard, for lots of finnicky confusing reasons."

Then I realized "holy christ most AGI research isn't built off the agent foundations agenda and people regularly say 'well, MIRI is doing cute math things but I don't see how they're actually relevant to real AGI we're likely to build.'"

Meanwhile, I have several examples in mind of real humans who fell prey to something similar to commitment-race concerns. i.e. groups of people who mutually grim-triggered each other because they were coordinating on slightly different principles. (And these were humans who were trying to be rationalist and even agent-foundations-based)

So, yeah actually it seems pretty likely that many AGIs that humans might build might accidentally fall into these traps.

So now I have a vague image in my head of a rewrite of this post that ties together some combo of:

• The specific concerns noted here
• The rocket alignment problem "hey man we really need to make sure we're not fundamentally confused about agency and rationality."
• Possibly some other specific agent-foundations-esque concerns

Weaving those into a central point of:

"If you're the sort of person who's like 'Why is MIRI even helpful? I get how they might be helpful but they seem more like a weird hail-mary or a 'might as well given that we're not sure what else to do?'... here is a specific problem you might run into if you didn't have a very thorough understanding of robust agency when you built your AGI. This doesn't (necessarily) imply any particular AGI architecture, but if you didn't have a specific plan for how to address these problems, you are probably going to get them wrong by default.

(This post might already exist somewhere, but currently these ideas feel like they just clicked together in my mind in a way they hadn't previously. I don't feel like I have the ability to write up the canonical version of this post but feel like "someone with better understanding of all the underlying principles" should)

I was confused about this post, and... I might have resolve my confusion by the time I got ready to write this comment. Unsure. Here goes:

My first* thought:

Am I not just allowed to precommit to "be the sort of person who always figures out whatever the optimal game theory was, and commit to that?". I thought that was the point.

i.e. I wouldn't precommit to treating either the Nash Bargaining Solution or Kalai-Smorodinsky Solution as "the permanent grim trigger bullying point", I'd precommit to something like "have a meta-policy of not giving into bullying, pick my best-guess-definition-of-bullying as my default trigger, and my best-guess grim-trigger response, but include an 'oh shit I didn't think about X' parameter." (with some conditional commitments thrown in)

Where X can't be an arbitrary new belief – the whole point of having a grim trigger clause is to be able to make appropriately weighted threats that AGI-Bob really thinks will happen. But, if I legitimately didn't think of the Kalai-Smordinwhatever solution as something an agent might legitimately think was a good coordination tool, I want to be able to say. depending on circumstances:

1. If the deal hasn't resolved yet "oh, shit I JUUUST thought of the Kalai-whatever thing and this means I shouldn't execute my grim trigger anti-bullying clause without first offering some kind of further clarification step."
2. If the deal already resolved before I thought of it, say "oh shit man I really should realized the Kalai-Smorodinsk thing was a legitimate schelling point and not started defecting hard as punishment. Hey, fellow AGI, would you like me to give you N remorseful utility in return for which I stop grim-triggering you and you stop retaliating at me and we end the punishment spiral?"

My second* thought:

Okay. So. I guess that's easy for me to say. But, I guess the whole point of all this updateless decision theory stuff was to actually formalize that in a way that you could robustly program an AGI that you were about to give the keys to the universe.

Having a vague handwavy notion of it isn't reassuring enough if you're about to build a god.

And while it seems to me like this is (relatively) straightforward... do I really want to bet that?

I guess my implicit assumption was that game theory would turn out to not be that complicated in the grand scheme of thing. Surely once you're a Jupiter Brain you'll have it figured out? And, hrmm, maybe that's true, but but maybe it's not, or maybe it turns out the fate of the cosmos gets decided with smaller AGIs fighting over Earth which much more limited compute.

Third thought:

Man, just earlier this year, someone offered me a coordination scheme that I didn't understand, and I fucked it up, and the deal fell through because I didn't understand the principles underlying it until just-too-late. (this is an anecdote I plan to write up as a blogpost sometime soon)

And... I guess I'd been implicitly assuming that AGIs would just be able to think fast enough that that wouldn't be a problem.

Like, if you're talking to a used car salesman, and you say "No more than $10,000", and then they say "$12,000 is final offer", and then you turn and walk away, hoping that they'll say "okay, fine, \$10,000"... I suppose metaphorical AGI used car buyers could say "and if you take more than 10 compute-cycles to think about it, the deal is off." And that might essentially limit you to only be able to make choices you'd precomputed, even if you wanted to give yourself the option to think more.

That seems to explain why my "Just before deal resolves, realize I screwed up my decision theory" idea doesn't work.

It seems like my "just after deal resolves and I accidentally grim trigger, turn around and say 'oh shit, I screwed up, here is remorse payment + a costly proof that that I'm not fudging my decision theory'" should still work though?

I guess in the context of Acausal Trade, I can imagine things like "they only bother running a simulation of you for 100 cycles, and it doesn't matter if on the 101st cycle you realize you made a mistake and am sorry." They'll never know it.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical a galaxy brain can't figure out the right precommitments.

I dunno. Maybe I'm still confused.

But, I wanted to check in on whether I was on the right track in understanding what considerations were at play here.

...

*actually there were like 20 thoughts before I got to the one I've labeled 'first thought' here. But, "first thought that seemed worth writing down."

Thanks! Reading this comment makes me very happy, because it seems like you are now in a similar headspace to me back in the day. Writing this post was my response to being in this headspace.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical a galaxy brain can't figure out the right precommitments.

This sounds like a plausibly good rule to me. But that doesn't mean that every AI we build will automatically follow it. Moreover, thinking about acausal trade is in some sense engaging in acausal trade. As I put it:

Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

As for your handwavy proposals, I do agree that they are pretty good. They are somewhat similar to the proposals I favor, in fact. But these are just specific proposals in a big space of possible strategies, and (a) we have reason to think there might be flaws in these proposals that we haven't discovered yet, and (b) even if these proposals work perfectly there's still the problem of making sure that our AI follows them:

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."
Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.

If you want to think and talk more about this, I'd be very interested to hear your thoughts. Unfortunately, while my estimate of the commitment races problem's importance has only increased over the past year, I haven't done much to actually make intellectual progress on it.

I feel I should disclaim "much of what I'd have to say about this is a watered down version of whatever Andrew Critch would say". He's busy a lot, but if you haven't chatted with him about this yet you probably should, and if you have I'm not sure whether I'll have much to add.

But I am pretty interested right now in fleshing out my own coordination principles and fleshing out my understanding of how they scale up from "200 human rationalists" to 1000-10,000 sized coalitions to All Humanity and to AGI and beyond. I'm currently working on a sequence that could benefit from chatting with other people who think seriously about this.

Thanks for this post; this does seem like a risk worth highlighting.

I've just started reading Thomas Schelling's 1960 book The Strategy of Conflict, and noticed a lot of ideas in chapter 2 that reminded me of many of the core ideas in this post. My guess is that that sentence is an uninteresting, obvious observation, and that Daniel and most readers were already aware (a) that many of the core ideas here were well-trodden territory in game theory and (b) that this post's objectives were to:

• highlight these ideas to people on LessWrong
• highlight their potential relevance to AI risk
• highlight how this interacts with updateless decision theory and acausal trade

But maybe it'd be worth people who are interested in this problem reading that chapter of The Strategy of Conflict, or other relevant work in standard academic game theory, to see if there are additional ideas there that could be fruitful here.

I'm about halfway through Strategy of Conflict and so far it's not really giving solutions to any of these problems, just sketching out the problem space.

This feels like an important question in Robust Agency and Group Rationality, which are major topics of my interest.

I have another "objection", although it's not a very strong one, and more of just a comment.

One reason game theory reasoning doesn't work very well in predicting human behavior is because games are always embedded in a larger context, and this tends to wreck the game-theory analysis by bringing in reputation and collusion as major factors. This seems like something that would be true for AIs as well (e.g. "the code" might not tell the whole story; I/"the AI" can throw away my steering wheel but rely on an external steering-wheel-replacing buddy to jump in at the last minute if needed).

In apparent contrast to much of the rationalist community, I think by default one should probably view game theoretic analyses (and most models) as "just one more way of understanding the world" as opposed to "fundamental normative principles", and expect advanced AI systems to reason more heuristically (like humans).

But I understand and agree with the framing here as "this isn't definitely a problem, but it seems important enough to worry about".

I think you're missing at least one key element in your model: uncertainty about future predictions. Commitments have a very high cost in terms of future consequence-effecting decision space. Consequentialism does _not_ imply a very high discount rate, and we're allowed to recognize the limits of our prediction and to give up some power in the short term to reserve our flexibility for the future.

Also, one of the reasons that this kind of interaction is rare among humans is that commitment is impossible for humans. We can change our minds even after making an oath - often with some reputational consequences, but still possible if we deem it worthwhile. Even so, we're rightly reluctant to make serious committments. An agent who can actually enforce it's self-limitations is going to be orders of magnitude more hesitant to do so.

All that said, it's worth recognizing that an agent that's significantly better at predicting the consequences of potential commitments will pay a lower cost for the best of them, and has a material advantage over those who need flexibility because they don't have information. This isn't a race in time, it's a race in knowledge and understanding. I don't think there's any way out of that race - more powerful agents are going to beat weaker ones most of the time.

I don't think I was missing that element. The way I think about it is: There is some balance that must be struck between making commitments sooner (risking making foolish decisions due to ignorance) and later (risking not having the right commitments made when a situations arises in which they would be handy). A commitment race is a collective action problem where individuals benefit from going far to the "sooner" end of the spectrum relative to the point that would be optimal for everyone if they could coordinate.

I agree about humans not being able to make commitments--at least, not arbitrary commitments. (Arguably, getting angry and seeking revenge when someone murders your family is a commitment you made when you were born.) I think we should investigate whether this inability is something evolution "chose" or not.

I agree it's a race in knowledge/understanding as well as time. (The two are related.) But I don't think more knowledge = more power. For example, if I don't know anything and decide to commit to plan X which benefits me, else war, and you know more than me--in particular, you know enough about me to know what I will commit to--and you are cowardly, then you'll go along with my plan.

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016.

I found some related discussions going back to 2009. It's mostly highly confused, as you might expect, but I did notice this part which I'd forgotten and may actually be relevant:

But if you are TDT, you can’t always use less com­put­ing power, be­cause that might be cor­re­lated with your op­po­nents also de­cid­ing to use less com­put­ing power

This could potentially be a way out of the "racing to think as little as possible before making commitments" dynamic, but if we have to decide how much to let our AIs think initially before making commitments, on the basis of reasoning like this, that's a really hairy thing to have to do. (This seems like another good reason for wanting to go with a metaphilosophical approach to AI safety instead of a decision theoretic one. What's the point of having a superintelligent AI if we can't let it figure these kinds of things out for us?)

If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can “move first” can get much more than the one that “moves second.”

I'm not sure how the folk theorem shows this. Can you explain?

going updateless is like making a bunch of commitments all at once

Might be a good idea to offer some examples here to help explain updateless and for pumping intuitions.

Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn’t actually commit to anything then.

Interested to hear more details about this. What would have happened if you were actually able to become updateless?

Would trying to become less confused about commitment races before building a superintelligent AI count as a metaphilosophical approach or a decision theoretic one (or neither)? I'm not sure I understand the dividing line between the two.

Trying to become less confused about commitment races can be part of either a metaphilosophical approach or a decision theoretic one, depending on what you plan to do afterwards. If you plan to use that understanding to directly give the AI a better decision theory which allows it to correctly handle commitment races, then that's what I'd call a "decision theoretic approach". Alternatively, you could try to observe and understand what humans are doing when we're trying to become less confused about commitment races and program or teach an AI to do the same thing so it can solve the problem of commitment races on its own. This would be an example of what I call "metaphilosophical approach".

Thanks, edited to fix!

I agree with your push towards metaphilosophy.

I didn't mean to suggest that the folk theorem proves anything. Nevertheless here is the intuition: The way the folk theorem proves any status quo is possible is by assuming that players start off assuming everyone else will grim trigger them for violating that status quo. So in a two-player game, if both players start off assuming player 1 will grim trigger player 2 for violating player 1's preferred status quo, then player 1 will get what they want. One way to get this to happen is for player 1 to be "earlier in logical time" than player 2 and make a credible commitment.

As for updatelessness: Well, updateless agents follow the policy that is optimal from the perspective of the credences they have at the time they go updateless. So e.g. if there is a cowardly agent who simulates you at that time or later and then caves to your demands (if you make any) then an updateless agent will be a bully and make demands, i.e. commit to punishing people it identifies as cowards who don't do what it wants. But of course updateless agents are also cowards themselves, in the sense that the best policy from the perspective of credences C is to cave in to any demands that have already been committed to according to C. I don't have a super clear example of how this might lead to disaster, but I intend to work one out in the future...

Same goes for my own experience. I don't have a clear example in mind of something bad that would have happened to me if I had actually self-modified, but I get a nervous feeling about it.

I think this undervalues conditional commitments. The problem of "early commitment" depends entirely on you possibly having a wrong image of the state of the world. So if you just condition your commitment on the information you have available, you avoid premature commitments made in ignorance and give other agents an incentive to improve your world model. Likewise, this would protect you from learning about other agents' commitments "too late" - you can always just condition on things like "unless I find an agent with commitment X". You can do this whether or not you even know to think of an agent with commitment X, as long as other agents who care about X can predict your reaction to learning about X.

Commitments aren't inescapable shackles, they're just another term for "predictable behavior." The usefulness of commitments doesn't require you to bind yourself regardless of learning any new information about reality. Oaths are highly binding for humans because we "look for excuses", our behavior is hard to predict, and we can't reliably predict and evaluate complex rule systems. None of those should pose serious problems for trading superintelligences.

I don't think this solves the problem, though it is an important part of the picture.

The problem is, which conditional commitments do you make? (A conditional commitment is just a special case of a commitment) "I'll retaliate against A by doing B, unless [insert list of exceptions here." Thinking of appropriate exceptions is important mental work, and you might not think of all the right ones for a very long time, and moreover while you are thinking about which exceptions you should add, you might accidentally realize that such-and-such type of agent will threaten you regardless of what you commit to and then if you are a coward you will "give in" by making an exception for that agent. The problem persists, in more or less exactly the same form, in this new world of conditional commitments. (Again, which are just special cases of commitments, I think.)

I concur in general, but:

you might accidentally realize that such-and-such type of agent will threaten you regardless of what you commit to and then if you are a coward you will “give in” by making an exception for that agent.

this seems like a problem for humans and badly-built AIs. Nothing that reliably one-boxes should ever do this.

EDT reliably one-boxes, but EDT would do this.

Or do you mean one-boxing in Transparent Newcomb? Then your claim might be true, but even then it depends on how seriously we take the "regardless of what you commit to" clause.

True, sorry, I forgot the whole set of paradoxes that led up to FDT/UDT. I mean something like... "this is equivalent to the problem that FDT/UDT already has to solve anyways." Allowing you to make exceptions doesn't make your job harder.