Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Less Threat-Dependent Bargaining Solutions?? (3/2)

Please add ALL of these to the Sequence that got made! I have 2 times now found more of these really fun posts and wish I'd read it earlier!

My current guess is that the problem with usual framings of bargaining is that they look at individual episodes of what should instead be a whole space of situations where acausal coordination is performed by shared adjudicators, that act across large subsets of that space of episodes in a UDT-like manner, bargaining primarily *between* the episodes rather than *within* the episodes. These adjudicators could exist at different levels of sophistication, they could be vague concepts, precise word-wielding concepts, hypotheses/theories, norms, paradigms/ideologies, people/agents, bureaucracies.

Decisions of adjudicators express aspects of preference. Collections of adjudicators form larger agents, this way concepts become people, people become civilizations, and civilizations become CEV, in the limit of this process. Adjudicators are shared among the larger agents that channel them, thus giving them comparable preferences and coordinating their smaller less tractable within-episode bargains. Instead of utility functions expressing consistent preference, adjudicators make actions of agents more coherent directly, by bargaining among the episodes they influence, putting the episodes in comparison by being present in all of them.

This offers a process of extending the goodhart scope of an agent (the possibly off-distribution situations where it still acts robustly, aligned with intent of the training distribution). Namely, policy of an agent should be split into instances of local behavior, reframed as within-episodes small bargains struck by smaller adjudicators that acausally bind large collections of different episodes together. Behavior and scope of those adjudicators should reach an equilibrium where feedback from learning on the episodes (reflection) no longer changes them. Finally, new adjudicators can be introduced into old episodes to extend the goodhart scope of the larger agent.

I'm a bit uncomfortable with the "extreme adversarial threats aren't credible; players are only considering them because they know you'll capitulate" line of reasoning because it is a *very updateful* line of reasoning. It makes perfect sense for UDT and functional decision theory to reason in this way.

I find the chicken example somewhat compelling, but I can also easily give the "UDT / FDT retort": since agents are free to choose their policy however they like, one of their options should absolutely be to just go straight. And arguably, the agent should choose that, conditional on bargaining breaking down (precisely because this choice maximizes the utility obtained in fact -- ie, the only sort of reasoning which moves UDT/FDT). Therefore, the coco line of reasoning isn't relying on an absurd hypothetical.

Another argument for this perspective: if we set the disagreement point via Nash equilibrium, then the agents have an extra incentive to change their preferences before bargaining, so that the Nash equilibrium is closer to the optimal disagreement point (IE the competition point from coco). This isn't a very strong argument, however, because (as far as I know) the whole scheme doesn't incentivize honest reporting in any case. So agents may be incentivised to modify their preferences one way or another.

One simple idea: the disagreement point should reflect whatever *really* happens when bargaining breaks down. This helps ensure that players are happy to use the coco equilibrium instead of something else, in cases where "something else" implies the breakdown of negotiations. (Because the coco point is always a pareto-improvement over the disagreement point, if possible -- so choosing a realistic disagreement point helps ensure that the coco point is realistically an improvement over alternatives.)

However, in reality, the outcomes of conflicts we avoid remain unknown. The realistic disagreement point is difficult to define or measure if, in reality, agreement is achieved.

So perhaps we should suppose that agreement cannot always be reached, and base our disagreement point on the observed consequences of bargaining failure.

One is that set of possible fallback points will, in general, not be a single point

Thinking out loud: Might you be able to iterate the bargaining process between the agents to decide which fallback point to choose? This will of course yield an infinite regress if none of the iterations yields a single fallback point. But might it be that the set of fallback points will in some sense become smaller with each iteration (for instance, have smaller worst-case consequences for each player's utility)? Or at least this might happen in most real-world situations, even if there are fringe theoretical counter-examples. If that were the case, at a certain finite iteration one of the players would be willing to let the other decide the fallback point (or let it be randomly decided), since the cost of further computation might be higher than the benefit of adjusting the fallback point more finely.

On a more general note, maybe considerations about a real agent's bounded computation can pragmatically resolve some of these issues. Don't get me wrong: I get that you're searching for theoretical groundings, and this would a priori not be the stage in which to drop simplifications. But maybe dropping this one will dissolve some of the apparent grounding under-specifications (because real decisions don't need to be as fine-grained as theoretical abstraction can make it seem).

But just because I can diagnose that as the problem doesn't mean I've got an entirely satisfactory replacement lined up as of the writing of this post.

Thank you for acknowledging this. The dynamic of threat/power differentials is so pervasive in real-world human interactions (negotiations, implicit and explicit) that it's shocking how little attention it gets in theory.

Step 1 in figuring out how to get an outcome which is Not That is to look at the list of nice properties which the CoCo solution uniquely fulfills, and figure out which one to break.

It seems to me that, ideally, one would like to be able to identify *in advance* which axioms one doesn't actually want/need, *before* encountering a motive to go looking for things to cut.

In the previous two posts, we went over various notions of bargaining. The Nash bargaining solution. The CoCo value. Shapley values. And eventually, we managed to show they were all special cases of each other. The rest of this post will assume you've read the previous two posts and have a good sense for what the CoCo value is doing.

Continuing from the last post, the games which determine the payoff everyone gets (not to be confused with the games that directly entail what actions are taken) are all of the form "everyone splits into two coalitions S and N/S, and both coalitions are trying to maximize 'utility of my coalition - utility of the opposite coalition'".

Now, in toy cases involving two hot-dog sellers squabbling over whether to hawk their wares at a beach or an airport, this produces acceptable results. But, in richer environments, it's VERY important to note that adopting "let's go for a CoCo equilibrium" as your rule for how to split gains amongst everyone incentivizes everyone to invent increasingly nasty ways to hurt their foes. Not to actually be used, mind you. To affect how negotiations go.

After all, if you invent the Cruciatus curse, then in all those hypothetical games where your coalition faces off against the foe coalition, and everyone's utility is "the utility of my coalition - the utility of the foe coalition"... well, you sure can reduce the utility of the foe coalition by a whole lot! And so, your team gets a much higher score in all those games.

Of course, these minimax games aren't *actually* played. They just determine everyone's payoffs. And so, you'd end up picking up a whole lot of extra money from everyone else, because you have the Cruciatus curse and everyone's scared of it so they give you money. In the special case of a two-player game, getting access to an option which you don't care about and the foe would pay $1000 to avoid, should let you demand a $500 payment from the foe as a "please don't hurt me" payment.

But Which Desiderata Fails?

Step 1 in figuring out how to get an outcome which is Not That is to look at the list of nice properties which the CoCo solution uniquely fulfills, and figure out which one to break.
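The $1000 threat arithmetic above can be checked with a minimal sketch. It uses the standard decomposition of the two-player CoCo value (half the maximum total payoff, plus or minus half the value of the zero-sum "advantage" game with payoff U1 - U2). The function name `coco_value_pure` and the payoff tables are illustrative assumptions, and the sketch only handles games whose advantage game has a pure-strategy saddle point; the general case needs mixed strategies.

```python
from fractions import Fraction


def coco_value_pure(U1, U2):
    """CoCo value of a two-player game given as payoff tables U1[a][b], U2[a][b].

    Assumption: the advantage game U1 - U2 has a pure-strategy saddle point.
    If it does not, mixed strategies are needed and this sketch refuses to guess.
    """
    rows, cols = range(len(U1)), range(len(U1[0]))
    # Cooperative part: the largest total payoff the pair can reach together.
    max_total = max(U1[a][b] + U2[a][b] for a in rows for b in cols)
    # Competitive part: zero-sum game where player 1 receives U1 - U2.
    diff = [[U1[a][b] - U2[a][b] for b in cols] for a in rows]
    maximin = max(min(diff[a][b] for b in cols) for a in rows)
    minimax = min(max(diff[a][b] for a in rows) for b in cols)
    if maximin != minimax:
        raise ValueError("no pure saddle point; mixed strategies needed")
    half = Fraction(1, 2)
    return half * (max_total + maximin), half * (max_total - maximin)


# Rows: player 1 does nothing / uses the threat. Player 2 has a single action.
U1 = [[0], [0]]      # player 1 does not care about the threat either way
U2 = [[0], [-1000]]  # player 2 would pay $1000 to avoid it
p1, p2 = coco_value_pure(U1, U2)
print(p1, p2)  # 500 -500
```

Nothing of value is created in this toy game (the best total is 0), so the entire $500 transfer comes from the threat option alone.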

As it turns out from picking through the paper, the assumption that must be broken is the axiom of "gaining access to more actions shouldn't lead to you getting less value". As a quick intuitive way of seeing *why* it should fail, consider the game of Chicken. If you *physically can't* swerve (because your car started off not having a steering wheel), and are locked into always going straight through no fault of your own, then any sensible opponent will swerve and you will get good payoffs. Adding in the new option of being able to swerve means the opponent will start demanding that you swerve sometimes, lowering your score.

As a rough intuition for how the "the CoCo value is the only way to fulfill these axioms" proof works, it is reasoning as follows:

"Hm, let's say I only had access to my minimax action, the one that maximizes $\min_b (U_1(a,b) - U_2(a,b))$. It makes me relatively much better off than my foe. Any sensible foe going up against this threat would simply press a button that'd just give us both the CoCo value, instead of playing any other action in response. By the axiom of "adding more options can't hurt me", I can add in all my other actions and not get less money. And then by the axiom of "adding redundant actions doesn't affect anything", I can take away the foe's CoCo value button and nothing changes. And so, in this game against the foe, I have to get a value greater than or equal to my CoCo value. But the foe can run through this same reasoning from their side, and so in this game, we must both get the CoCo value."

And this makes it very clear exactly what's going wrong in the reasoning. Just because I'd be willing to give a foe a lot of money if they were unavoidably locked into firing off the Cruciatus curse through the random whims of nature and no action of their own or from anyone else could have prevented it, that does *not* mean that I'd be willing to give them a bunch of money if they had any other options available.

But just because I can diagnose that as the problem doesn't mean I've got an entirely satisfactory replacement lined up as of the writing of this post. I just have a few options which seem to me to have promise, so don't go "oh, Diffractor solved this". Treat it as an open question deserving of your thought where I have a potential way forward, but where there might be a solution from a different direction.

In the spirit of proposing alternate options which have some unsatisfactory parts but at least are less dependent on threats, here's my attempt at fixing it.

Kinder Fallback Points

An alternate way of viewing what's going on with the CoCo solution is as follows: Both players are tasked with coming up with a fallback strategy if they can't come to an agreement. The pair of fallback strategies played against each other defines a disagreement point. And then, *any* bargaining solution fulfilling symmetry (which is one of the essential properties of a bargaining solution, Nash fulfills it, Kalai-Smorodinsky fulfills it, all the other obscure ones fulfill it) will say "hey, instead of playing this disagreement point, what if you maximized the total pile of money instead and split the surplus 50/50?"

The place where the CoCo value *specifically* appears from this protocol is that, if you know that this is what happens, your best fallback strategy to win in the overall bargaining game is the action $a$ that maximizes $\min_b (U_1(a,b) - U_2(a,b))$. The foe playing *any* fallback strategy other than the counterpart of this means you'll get more money than the CoCo value says, and the foe will get less, after bargaining concludes.

But there's something quite suspicious about that disagreement point. It only is what it is because the foe knows you're guaranteed to go for a 50/50 split of surplus afterwards. Your foe isn't picking its disagreement strategy because it actually likes it, or because it's the best response to your disagreement strategy. Your foe is behaving in a way that it *ordinarily* wouldn't (namely: minimizing your utility instead of purely maximizing its own), and it's *only* doing that because it knows you'll give in to the threat.

A disagreement point to take *seriously* is one where the foe is like "this is legitimately my best response to you if negotiations break down", instead of one that the foe only picked because you really hate it and it knows you'll cut a bargain. For the former, if negotiations blow up, you'll find that, oh hey, your foe is actually serious. For the latter, if negotiations actually start failing sometimes, the foe will suddenly start wishing that they had committed to a different fallback strategy.

And so, this points towards an alternate notion instead of the CoCo value. Namely, one where the disagreement point is a Nash equilibrium (ie, both players are purely trying to maximize the utility of "negotiations break down" given what the other player has as their disagreement strategy, instead of looking ahead and trying to win at bargaining by hurting the other), and surplus is maximized and split 50/50 after that.
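As a rough illustration of "Nash equilibrium as the disagreement point, then split the surplus 50/50", here is a small sketch on a game of Chicken. The payoff numbers, the helper names `pure_nash_equilibria` and `split_from_fallback`, and the restriction to pure-strategy equilibria are all assumptions made for the example, and utility is treated as transferable money.

```python
def pure_nash_equilibria(U1, U2):
    """All pure action pairs (a, b) where neither player gains by deviating alone."""
    rows, cols = range(len(U1)), range(len(U1[0]))
    return [
        (a, b)
        for a in rows
        for b in cols
        if all(U1[a][b] >= U1[a2][b] for a2 in rows)
        and all(U2[a][b] >= U2[a][b2] for b2 in cols)
    ]


def split_from_fallback(U1, U2, fallback):
    """Start at the fallback payoffs, then split the remaining surplus evenly."""
    rows, cols = range(len(U1)), range(len(U1[0]))
    a, b = fallback
    max_total = max(U1[r][c] + U2[r][c] for r in rows for c in cols)
    surplus = max_total - (U1[a][b] + U2[a][b])
    return U1[a][b] + surplus / 2, U2[a][b] + surplus / 2


# Hypothetical Chicken payoffs; action 0 = straight, action 1 = swerve.
STRAIGHT, SWERVE = 0, 1
U1 = [[0, 4], [1, 3]]  # row player's payoffs
U2 = [[0, 1], [4, 3]]  # column player's payoffs

for fallback in pure_nash_equilibria(U1, U2):
    print(fallback, split_from_fallback(U1, U2, fallback))
# (0, 1) (4.5, 1.5)
# (1, 0) (1.5, 4.5)
```

Even this tiny game has two candidate fallback points with very different final splits, so the choice of equilibrium carries most of the weight.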

Although it's a lot better to use correlated equilibria instead of Nash equilibria. Since we're already assuming that the agents can implement any joint strategy, it's not much of a stretch to assume that they can fall back to joint strategies if negotiations break down. Correlated equilibria generalize the "nobody wants to change their move" property of Nash equilibria to cases where the agents have a joint source of randomness. Alternately, Nash equilibria can be thought of as the special case of correlated equilibria where the agents can't correlate their actions.

There are two other practical reasons why you'd want to use correlated equilibria over Nash equilibria. The first is that the set of correlated equilibria is convex (a mixture of two correlated equilibria is a correlated equilibrium, which doesn't hold for Nash), so the set of "possible disagreement payoffs" will be a convex subset of $\mathbb{R}^2$, letting you apply a lot more mathematical tools to it. Also, finding Nash equilibria is computationally hard, while finding correlated equilibria can be done in polynomial time and it's easy to come up with algorithms which converge to playing correlated equilibria in reasonable timeframes. (Though I don't know if there's a good algorithm to converge to correlated equilibria which are Pareto-optimal among the set of correlated equilibria.)
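The "nobody wants to change their move" condition and the convexity claim can be sketched as a direct check of the incentive constraints. The function names `is_correlated_equilibrium` and `mix`, and the Chicken payoffs, are assumptions chosen for illustration. Exact fractions are used so that borderline cases are not lost to rounding.

```python
from fractions import Fraction


def is_correlated_equilibrium(dist, U1, U2):
    """dist[a][b] is the probability that the joint recommendation is (a, b).

    Each player, told only their own recommended action, must not gain in
    expectation by switching to any other action.
    """
    rows, cols = range(len(U1)), range(len(U1[0]))
    for a in rows:          # player 1 is told to play a
        for a2 in rows:     # and considers deviating to a2
            if sum(dist[a][b] * (U1[a][b] - U1[a2][b]) for b in cols) < 0:
                return False
    for b in cols:          # player 2 is told to play b
        for b2 in cols:     # and considers deviating to b2
            if sum(dist[a][b] * (U2[a][b] - U2[a][b2]) for a in rows) < 0:
                return False
    return True


def mix(p, q, weight):
    """Convex combination weight * p + (1 - weight) * q of two joint distributions."""
    return [
        [weight * x + (1 - weight) * y for x, y in zip(p_row, q_row)]
        for p_row, q_row in zip(p, q)
    ]


# Hypothetical Chicken payoffs; action 0 = straight, action 1 = swerve.
U1 = [[0, 4], [1, 3]]
U2 = [[0, 1], [4, 3]]

row_wins = [[0, 1], [0, 0]]   # always (straight, swerve), a pure Nash equilibrium
col_wins = [[0, 0], [1, 0]]   # always (swerve, straight), a pure Nash equilibrium
blend = mix(row_wins, col_wins, Fraction(1, 4))
crash = [[1, 0], [0, 0]]      # always (straight, straight)

print(is_correlated_equilibrium(blend, U1, U2))  # True
print(is_correlated_equilibrium(crash, U1, U2))  # False
```

The blend is a joint distribution that neither player could produce with independent randomization, which is why the extra shared randomness is needed to reach it.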

Ok, so our alternate non-threat-based notion instead of the CoCo value is "both players settle on a correlated equilibrium as a backup, and then evenly split the surplus from there". If you blow up the negotiation and resort to your fallback strategy, you'll find that, oh hey, the foe's best move is to play *their* part of the fallback strategy, they didn't just pick their backup strategy to fuck you over.

Two Big Problems

One is that the set of possible fallback points will, in general, not be a single point (rather, it will be a convex set). So, uh... how do you pick one? Is there a principled way to go "y'know, this is a somewhat unfair correlated equilibrium"?

The second problem is generalizing to the n-player case. In general, the surplus gets split according to the Shapley value, to give a higher share to players who can significantly boost the value of most coalitions they're in. Player i gets

$$\sum_{i \in S \subseteq N} \frac{(n-s)!\,(s-1)!}{n!} \bigl(v(S) - v(N/S)\bigr)$$

as their payoff.
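For concreteness, the sum above can be evaluated directly once some coalition value function v is fixed. In this sketch s is read as the size of S and n as the number of players; the function name `coalitional_payoff` and the toy v (which depends only on coalition size) are assumptions for illustration. How v should actually be defined is exactly the open question.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def coalitional_payoff(i, players, v):
    """Sum over coalitions S containing i of
    (n - s)! (s - 1)! / n! * (v(S) - v(N/S)), with s = |S| and n = |N|.

    v maps frozensets of players to that coalition's value.
    """
    everyone = frozenset(players)
    n = len(everyone)
    others = [p for p in players if p != i]
    total = Fraction(0)
    for k in range(len(others) + 1):
        for rest in combinations(others, k):
            S = frozenset(rest) | {i}
            s = len(S)
            weight = Fraction(factorial(n - s) * factorial(s - 1), factorial(n))
            total += weight * (v[S] - v[everyone - S])
    return total


# Toy value function (an assumption): a coalition's value is its size squared.
players = [1, 2, 3]
v = {
    frozenset(c): len(c) ** 2
    for k in range(len(players) + 1)
    for c in combinations(players, k)
}

payoffs = {i: coalitional_payoff(i, players, v) for i in players}
print(payoffs)  # every player receives 3, and the payoffs sum to v(N) = 9
```

With two players the same function reduces to half of v(N) plus half of v({1}) - v({2}), which matches the two-player picture of splitting the surplus around a disagreement point.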

But this has an issue. How is v(S) defined? It should be something like "the value earned by coalition S". For the CoCo value, v(S) is defined as "the value that team S gets in their zero-sum game against team Everyone Else". But for using correlated equilibria as the fallback points, it gets harder. Maybe v(S) is "the value that team S gets when they play a correlated equilibrium against not-S, where S and not-S are both modeled as individual large players". Or maybe it's "the value that team S gets when they play a correlated equilibrium against the uncoordinated horde of Everyone Else, where S is modeled as one big player, and everyone else is modeled as individuals". And this completely overlooks the issue of, again, "which correlated equilibrium" (there are a lot).

I mean, it'd be nice to just say "give everyone their Shapley payoffs", but the value produced by a team working together is pretty hard to define due to the dependence on what everyone else is doing. Is everyone else fighting each other? Coordinating and making teams of their own?

So, it's a fairly dissatisfying solution, with waaaay too many degrees of freedom, and a whole lot of path-dependence, but I'm pretty confident it's attacking the core issue.

Suggestions for alternate threat-resistant strategies in the comments are *extremely welcome*, as are pointing out further reasons why this or other strategies suck or don't fulfill their stated aim, as well as coming up with clever ways to mitigate the issues with the thing I proposed.