*[content warning: simulated very hot places; extremely bad Nash equilibria]*

*(based on a Twitter thread)*

Rowan: "If we succeed in making aligned AGI, we should punish those who committed cosmic crimes that decreased the chance of a positive singularity sufficiently."

Neal: "Punishment seems like a bad idea. It's pessimizing another agent's utility function. You could get a pretty bad equilibrium if you're saying agents should be intentionally harming each others' interests, even in restricted cases."

Rowan: "In iterated games, it's correct to defect when others defect against you; that's tit-for-tat."

Neal: "Tit-for-tat doesn't pessimize, though, it simply withholds altruism sometimes. In a given round, all else being equal, defection is individually rational."

Rowan: "Tit-for-tat works even when defection is costly, though."

Neal: "Oh my, I'm not sure if you want to go there. It can get real bad. This is where I pull out the game theory folk theorems."

Rowan: "What are those?"

Neal: "They're theorems about Nash equilibria in iterated games. Suppose players play normal-form game G repeatedly, and are infinitely patient, so they don't care about their positive or negative utilities being moved around in time. Then, a given payoff profile (that is, an assignment of utilities to players) could possibly be the mean utility for each player in the iterated game, if it satisfies two conditions: feasibility, and individual rationality."

Rowan: "What do those mean?"

Neal: "A payoff profile is feasible if it can be produced by some mixture of payoff profiles of the original game G. This is a very logical requirement. The payoff profile could only be the average of the repeated game if it was some mixture of possible outcomes of the original game. If some player always receives between 0 and 1 utility, for example, they can't have an average utility of 2 across the repeated game."

Rowan: "Sure, that's logical."

Neal: "The individual rationality condition, on the other hand, states that each player must get at least as much utility in the profile as they could guarantee getting by min-maxing (that is, picking their strategy assuming other players make things as bad as possible for them, even at their own expense), and at least one player must get strictly more utility."

Rowan: "How does this apply to an iterated game where defection is costly? Doesn't this prove my point?"

Neal: "Well, if defection is costly, it's not clear why you'd worry about anyone defecting in the first place."

Rowan: "Perhaps agents can cooperate or defect, and can also punish the other agent, which is costly to themselves, but even worse for the other agent. Maybe this can help agents incentivize cooperation more effectively."

Neal: "Not really. In an ordinary prisoner's dilemma, the (C, C) utility profile already dominates both agents' min-max utility, which is the (D, D) payoff. So, game theory folk theorems make mutual cooperation a possible Nash equilibrium."
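As a rough illustration of Neal's claim, the sketch below computes the pure-strategy min-max value in a prisoner's dilemma and compares it with the (C, C) payoff. The payoff numbers (3, 0, 5, 1) are a conventional textbook choice assumed here, not values from the dialogue, and `minmax_value_row` is a hypothetical helper name.

```python
# Hypothetical prisoner's dilemma payoffs: (row player, column player).
# These numbers are an assumption for illustration; the dialogue gives none.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def minmax_value_row():
    """Pure-strategy min-max value for the row player.

    The column player picks the action that makes the row player's
    best response as bad as possible.
    """
    return min(
        max(PAYOFFS[(row, col)][0] for row in ACTIONS)
        for col in ACTIONS
    )


minmax = minmax_value_row()
mutual_cooperation = PAYOFFS[("C", "C")][0]
mutual_defection = PAYOFFS[("D", "D")][0]

print("min-max value:", minmax)
print("(C, C) payoff:", mutual_cooperation)
print("(C, C) beats min-max:", mutual_cooperation > minmax)
```

With these assumed payoffs the min-max value coincides with the (D, D) payoff, and (C, C) exceeds it, so mutual cooperation meets the individual rationality condition.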

Rowan: "Hmm. It seems like introducing a punishment option makes everyone's min-max utility worse, which makes more bad equilibria possible, without making more good equilibria possible."

Neal: "Yes, you're beginning to see my point that punishment is useless. But, things can get even worse and more absurd."

Rowan: "How so?"

Neal: "Let me show you my latest game theory simulation, which uses state-of-the-art generative AI and reinforcement learning. Don't worry, none of the AIs involved are conscious, at least according to expert consensus."

Neal turns on a TV and types some commands into his laptop. The TV shows 100 prisoners in cages, some of whom are screaming in pain. A mirage effect appears across the landscape, as if the area is very hot.

Rowan: "Wow, that's disturbing, even if they're not conscious."

Neal: "I know, but it gets even worse! Look at one of the cages more closely."

Neal zooms into a single cage. It shows a dial, which selects a value ranging from 30 to 100, specifically 99.

Rowan: "What does the dial control?"

Neal: "The prisoners have control of the temperature in here. Specifically, the temperature in Celsius is the average of the temperature selected by each of the 100 denizens. This is only a hell because they have made it so; if they all set their dial to 30, they'd be enjoying a balmy temperature. And their bodies repair themselves automatically, so there is no release from their suffering."

Rowan: "What? Clearly there is no incentive to turn the dial all the way to 99! If you set it to 30, you'll cool the place down for everyone including yourself."

Neal: "I see that you have not properly understood the folk theorems. Let us assume, for simplicity, that everyone's utility in a given round, which lasts 10 seconds, is the negative of the average temperature. Right now, everyone is getting -99 utility in each round. Clearly, this is feasible, because it's happening. Now, we check if it's individually rational. Each prisoner's min-max payoff is -99.3: they set their temperature dial to 30, and since everyone else is min-maxing against them, everyone else sets their temperature dial to 100, leading to an average temperature of 99.3. And so, the utility profile resulting from everyone setting the dial to 99 is individually rational."
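The arithmetic behind the -99.3 figure can be checked with a short sketch. It assumes integer dial settings from 30 to 100 and uses hypothetical names (`round_utility`, `minmax_utility`); only the 100 players, the dial range, and the "negative average temperature" utility come from the dialogue.

```python
N_PLAYERS = 100
DIAL_MIN, DIAL_MAX = 30, 100  # dial range from the dialogue


def round_utility(dials):
    """Utility in one round: the negative of the average dial setting."""
    return -sum(dials) / len(dials)


def minmax_utility():
    """Best payoff one prisoner can guarantee against hostile opponents.

    Utility falls as any dial rises, so the harshest thing the other
    99 prisoners can do is set their dials to DIAL_MAX. The prisoner
    then picks the best reply (integer settings assumed).
    """
    others = [DIAL_MAX] * (N_PLAYERS - 1)
    return max(
        round_utility([own] + others)
        for own in range(DIAL_MIN, DIAL_MAX + 1)
    )


all_99 = round_utility([99] * N_PLAYERS)
print("min-max utility:", minmax_utility())
print("utility when everyone plays 99:", all_99)
print("individually rational:", all_99 >= minmax_utility())
```

The best reply to 99 dials at 100 is 30, giving an average of (30 + 99 × 100) / 100 = 99.3, so the all-99 profile clears the min-max bar by 0.3.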

Rowan: "I see how that follows. But this situation still seems absurd. I only learned about game theory folk theorems today, so I don't understand, intuitively, why such a terrible equilibrium could be in everyone's interest to maintain."

Neal: "Well, let's see what happens if I artificially make one of the prisoners select 30 instead of 99."

Neal types some commands into his laptop. The TV screen splits to show two different dials. The one on the left turns to 30; the prisoner attempts to turn it back to 99, but is dismayed at it being stuck. The one on the right remains at 99. That is, until 6 seconds pass, at which point the left dial releases; both prisoners set their dials to 100. Ten more seconds pass, and both prisoners set the dial back to 99.

Neal: "As you can see, both prisoners set the dial to the maximum value for one round. So did everyone else. This more than compensated for the left prisoner setting the dial to 30 for one round, in terms of average temperature. So, as you can see, it was never in the interest of that prisoner to set the dial to 30, which is why they struggled against it."

Rowan: "That just passes the buck, though. Why does everyone set the dial to 100 when someone set it to 30 in a previous round?"

Neal: "The way it works is that, in each round, there's an equilibrium temperature, which starts out at 99. If anyone puts the dial less than the equilibrium temperature in a round, the equilibrium temperature in the next round is 100. Otherwise, the equilibrium temperature in the next round is 99 again. This is a Nash equilibrium because it is never worth deviating from. In the Nash equilibrium, everyone else selects the equilibrium temperature, so by selecting a lower temperature, you cause an increase of the equilibrium temperature in the next round. While you decrease the temperature in this round, it's never worth it, since the higher equilibrium temperature in the next round more than compensates for this decrease."
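The rule Neal describes can be sketched as a small simulation, to check that a lone deviator comes out behind. This is a minimal sketch under stated assumptions: player 0 follows an arbitrary dial sequence while the other 99 always play the current equilibrium temperature, and utilities are summed over a finite 10-round window rather than averaged over an infinite game as in the folk theorems. The name `total_utility` and the horizon are hypothetical.

```python
N_PLAYERS = 100
BASE_TEMP, PUNISH_TEMP = 99, 100  # equilibrium temperatures from the dialogue
HORIZON = 10                      # assumed finite window, for illustration only


def total_utility(deviator_dials):
    """Summed utility of player 0 over the given dial sequence.

    The other 99 players always play the current equilibrium temperature.
    If player 0 goes below it, the next round's equilibrium is PUNISH_TEMP;
    otherwise it is BASE_TEMP.
    """
    eq_temp = BASE_TEMP
    total = 0.0
    for dial in deviator_dials:
        average = (dial + (N_PLAYERS - 1) * eq_temp) / N_PLAYERS
        total -= average
        eq_temp = PUNISH_TEMP if dial < eq_temp else BASE_TEMP
    return total


conform = total_utility([99] * HORIZON)
# Deviate once, accept the punishment round, then return to 99.
deviate_once = total_utility([30, 100] + [99] * (HORIZON - 2))
# Refuse to give in: play 30 every round and be punished every round.
always_30 = total_utility([30] * HORIZON)

print("conform:      ", conform)
print("deviate once: ", deviate_once)
print("always 30:    ", always_30)
```

Deviating once saves 0.69 degrees in the first round (98.31 instead of 99) but costs a full degree in the punishment round, and holding out at 30 locks in 99.3 per round, so both plans do worse than conforming over this window.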

Rowan: "So, as a singular individual, you can try to decrease the temperature relative to the equilibrium, but others will compensate by increasing the temperature, and they're much more powerful than you in aggregate, so you'll avoid setting the temperature lower than the equilibrium, and so the equilibrium is maintained."

Neal: "Yes, exactly!"

Rowan: "If you've just seen someone else violate the equilibrium, though, shouldn't you rationally expect that they might defect from the equilibrium in the future?"

Neal: "Well, yes. This is a limitation of Nash equilibrium as an analysis tool, if you weren't already convinced it needed revisiting based on this terribly unnecessarily horrible outcome in this situation. Possibly, combining Nash equilibrium with Solomonoff induction might allow agents to learn each others' actual behavioral patterns even when they deviate from the original Nash equilibrium. This gets into some advanced state-of-the-art game theory (1, 2), and the solution isn't worked out yet. But we know there's something wrong with current equilibrium notions."

Rowan: "Well, I'll ponder this. You may have convinced me of the futility of punishment, and the desirability of mercy, with your... hell simulation. That's... wholesome in its own way, even if it's horrifying, and ethically questionable."

Neal: "Well, I appreciate that you absorbed a moral lesson from all this game theory!"

Seriously, what? I'm missing something critical. Under the stated rules as I understand them, I don't see why anyone would punish another player for reducing their dial.

You state that 99 is a Nash equilibrium, but this just makes no sense to me. Is the key that you're stipulating that everyone must play as though everyone else is out to make it as bad as possible for them? That sounds like an incredibly irrational strategy.

I *think* it's not that 99 is a Nash equilibrium, it's that everyone doing "Play 99 and, if anyone deviates, play 100 to punish them until they give in" is a Nash equilibrium. (Those who think they understand the post: am I correct?)

I think what people are missing (I know I am) is where does the "supposed to" come from? I totally understand the debt calculation to get altruistic punishment for people who deviate in ways that hurt you - that's just maximizing long-term expectation through short-term loss. I don't understand WHY a rational agent would punish someone who is BENEFITTING you with their deviant play.

I'd totally get it if you reacted to someone playing MORE than they were supposed to. But if someone plays less than, there's no debt or harm to punish.

Formally, it's an arbitrary strategy profile that happens to be a Nash equilibrium, since if everyone else plays it, they'll punish if you deviate from it unilaterally.

In terms of more realistic scenarios there are some examples of bad "punishing non punishers" equilibria that people have difficulty escaping. E.g. an equilibrium with honor killings, where parents kill their own children partly because they expect to be punished if they don't. Robert Trivers, an evolutionary psychologist, has studied these equilibria, as they are anomalous from an evolutionary psychology perspective.

If a mathematical model doesn't reflect at all the thing it's supposed to represent, it's not a good model. Saying "this is what the model predicts" isn't helpful.

There is absolutely zero incentive to anyone to put the temperature to 100 at any time. Even as deterrence, there is no reason for the equilibrium temperature to be an unsurvivable 99. It makes no sense, no one gains anything from it, especially if we assume communication between the parties (which is required for there to be deterrence and other such mechanisms in place). There is no reason to punish someone putting the thermostat *lower* than the equilibrium temperature either, since the lowest possible temperature is still comfortable. The model is honestly just wrong to describe any actual situation of interest.

At the very least, the utility function is wrong: it's not linear in temperature, obviously. It skyrockets around where temperatures exceed the survivable limit and then plateaus. There's essentially no difference between 99 and 99.3, but there's a much stronger incentive to go back below 40 as quickly as possible.

The problem is that the model is so stripped down it doesn't illustrate the principle any more. The principle, as I understand it, is that there are certain "everyone does X" equilibria in which X doesn't *have* to be useful or even good per se, it's just something everyone's agreed upon. That's true, but only to a certain point. Past a certain degree of utter insanity and masochism, people start solving the coordination problem by reasonably assuming that no one else can actually want X, and may try rebellion. In the thermostat example, a turn in which simply *two* prisoners rebelled would be enough to get a lower temperature even if the others tried to punish them. At that point the process would snowball. It's only "stable" to the minimum possible perturbation of a *single* person turning the knob to 30, and deciding it's not worth it any more after one turn at a mere 0.3 C above the already torturous temperature of 99 C.

I think the claim is that, while it may be irrational, it can be a Nash equilibrium. (And sometimes agents are more Nash-rational than really rational.)

I think ideas like Nash equilibrium get their importance from predictive power: do they correctly predict what will happen in the real world situation which is modeled by the game. For example, the biological situations that settle on game-theoretic equilibria even though the "players" aren't thinking at all.

In your particular game, saying "Nash equilibrium" doesn't really narrow down what will happen, as there are equilibria for all temperatures from 30 to 99.3. The 99 equilibrium in particular seems pretty brittle: if Alice breaks it unilaterally on round 1, then Bob notices that and joins in on round 2, neither of them end up punished and they get 98.6 from then on.
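The 98.6 figure can be checked with a short sketch. It assumes that every rebel holds their dial at 30 while all remaining prisoners punish at 100; `average_temp` is a hypothetical helper name.

```python
N_PLAYERS = 100
BASE_TEMP = 99  # the equilibrium temperature under discussion


def average_temp(n_rebels, rebel_dial=30, other_dial=100):
    """Average temperature when n_rebels hold rebel_dial and the rest punish."""
    punishers = N_PLAYERS - n_rebels
    return (n_rebels * rebel_dial + punishers * other_dial) / N_PLAYERS


for n_rebels in (1, 2, 3):
    temp = average_temp(n_rebels)
    verdict = "below" if temp < BASE_TEMP else "not below"
    print(f"{n_rebels} rebel(s): {temp:.1f} C, {verdict} the 99 C equilibrium")
```

A lone rebel under full punishment sits at 99.3, worse than 99. Two rebels pull the average to (2 × 30 + 98 × 100) / 100 = 98.6, so the punishment no longer leaves them worse off than conforming.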

More generally, in any game like this where everyone's interests are perfectly aligned, I'd expect cooperation to happen. The nastiness of game theory really comes from the fact that some players can benefit by screwing over others. The game in your post doesn't have that, so any nastiness in such a game is probably an analysis artifact.

Reminds me of this from Scott Alexander's Meditations on Moloch:


I don't understand the motivation to preserve the min-max value, and perhaps that's why it's a folk theorem rather than an actual theorem. Each participant knows that they can't unilaterally do better than 99.3, which they get by choosing 30 while the other players all choose 100. But a player's maxing (of utility; min temperature) doesn't oblige them to correct or reduce their utility (by raising the temperature) just because the opponents fail to minimize the player's utility (by raising the temperature).

There is no debt created anywhere in the model or description of the players. Everyone min-maxes as a strategy, picking the value that maximizes each player's utility assuming all opponents minimize that player's utility. But the other players aren't REQUIRED to play maximum cruelty - they're doing the same min-max strategy, but for their own utility, leading everyone to set their dial to 30.

I am confused.

*Why* does everyone else select the equilibrium temperature? Why would they push it to 100 in the next round? You never explain this.

I understand you may be starting off a theorem that I don’t know. To me the obvious course of action would be something like: the temperature is way too high, so I’ll lower the temperature. Wouldn’t others appreciate that the temperature is dropping and getting closer to their own preference of 30 degrees?

Are you saying what you’re describing makes sense, or are you saying that what you’re describing is a weird (and meaningless?) consequence of Nash theorem?

The Wikipedia article has an example that is easier to understand:

Correct me if I'm wrong:

The equilibrium where everyone follows "set dial to equilibrium temperature" (i.e. "don't violate the taboo, and punish taboo violators") is only a weak Nash equilibrium.

If one person instead follows "set dial to 99" (i.e. "don't violate the taboo unless someone else does, but don't punish taboo violators") then they will do just as well, because the equilibrium temp will still always be 99. That's enough to show that it's only a weak Nash equilibrium.

Note that this is also true if an arbitrary number of people deviate to this strat...

Nice provocative post :-)

It's good to note that Nash equilibrium is only one game-theoretic solution concept. It's popular in part because under most circumstances at least one is guaranteed to exist, but folk theorems can cause there to be a lot of them. In contexts with lots of Nash equilibria, game theorists like to study *refinements* of Nash equilibrium, i.e., concepts that rule out some of the Nash equilibria. One relevant refinement for this example is that of *strong Nash equilibrium*, where no subset of players can beneficially devia...

Example origin scenario of this Nash equilibrium from GPT-4:

In this hypothetical scenario, let's imagine that the prisoners are all part of a research experiment on group dynamics and cooperation. Prisoners come from different factions that have a history of rivalry and distrust.

Initially, each prisoner sets their dial to 30 degrees Celsius, creating a comfortable environment. However, due to the existing distrust and rivalry, some prisoners suspect that deviations from the norm—whether upward or downward—could be a secret signal from one faction to ...

Two (related) morals of the story:

A really, really stupid strategy can still meet the requirements for being a Nash equilibrium, because a strategy being a Nash equilibrium only requires that no one player can get a better result for themselves when *only that player* is allowed to change strategies.

A game can have more than one Nash equilibrium, and a game with more than one Nash equilibrium can have one that's arbitrarily worse than another.

A copy of my comment from the other thread:

The only thing I learned from this post is that, if you use mathematically precise axioms of behavior, then you can derive weird conclusions from game theory. This part is obvious and seems rather uninteresting.

The strong claim, namely the hell scenario, comes from then back-porting the conclusions from this mathematical rigor to our intuitions about a suggested non-rigorous scenario.

But this you cannot do unless you've confirmed that there's a proper correspondence from your axioms to the scenario.

For example, th...

I don't think the 'strategy' used here (set to 99 degrees unless someone defects, then set to 100) satisfies the "individual rationality condition". Sure, when everyone is setting it to 99 degrees, it beats the minmax strategy of choosing 30. But once someone chooses 30, the minmax for everyone else is now to also choose 30 - there's no further punishment that will or could be given. So the behavior described here, where everyone punishes the 30, is worse than minmaxing. At the very least, it would be an unstable equilibrium that would have broken down in the situation described - and knowing that would give everyone an incentive to 'defect' immediately.

In a realistic setting agents will be highly incentivized to seek other forms of punishment besides turning dial. But nice toy hell.

I have a suspicion that the viscerally unpleasant nature of this example is making it harder for readers to engage with the math.

Curated. As always, I'm fond of a good dialog. Also usually (though comes out less often), I'm fond of getting taught interesting game theory results, particularly outside the range of games most discussed, in this case, more focusing on iterated stuff which can get weirder. Kudos, I really like having posts like these on LW.

Why would anyone assume this and make decisions based on it?

~~I have never understood this aspect of the Nash equilibrium.~~ (Edit: never mind, I thought it was claimed that this was part of how Nash equilibria worked, and assumed this was the thing I remembered not understanding about Nash equilibria, but that seems wrong)

Any Nash Equilibrium can be a local optimum. This example merely demonstrates that not all local optima are desirable if you are able to view the game from a broader context. Incidentally, evolution has provided us with some means to try and get out of these local optima. Usually by breaking the rules of the game or leaving the game or seemingly not acting rationally from the perspective of the local optimum.

Just to clarify, the complete equilibrium strategy alluded to here is:

"Play 99 and, if anyone deviates *from any part of the strategy*, play 100 to punish them until they give in"

Importantly, this includes deviations from the *punishment*. If you don't join the punishment, you'll get punished. That makes it rational to play 99 and punish deviators.

The point of the Folk Theorems is that the Nash Equilibrium notion has limited predictive power in repeated games like this, because essentially any payoff could be implemented as a similar Nash equilibrium. That do...

I don’t think this works. Here is my strategy:

Now, this was a perfectly rational course of action for me. I knew that I will suffer temporarily, but in exchange I got a comfortable temperature for eternity.

Prove me wrong.

I can see why feasibility + individual rationality makes a payoff profile more likely than any profile missing one of these conditions, but I can’t see why I should consider every profile satisfying these conditions as likely enough to be worth worrying about.

This is about as convincing as a scarecrow. Maybe I'm committing some kinda taboo not trying to handle a strawman like a real person but to me the mental exercise of trying faulty thought experiments is damaging when you can just point to the fallacy and move on.

I'd be interested in someone trying to create the proposed simulation without the presumptive biases. Would AI models given only pain and relief with such a dial to set themselves come to such equilibrium? I think obviously not. I don't hear anyone arguing that's wrong just that it's a misundersta...