Explaining “Hell is Game Theory Folk Theorems”

21Robert Miles

11Max H

7Seth Herd

3electroswing

3Seth Herd

3Martin Randall

3Seth Herd

4Dagon

3Dacyn

3MondSemmel

3vmatyi

3Ericf

2localdeity

2aphyer

2Yair Halberstadt

2electroswing

1Mateusz Bagiński

1ProgramCrafter

3FireStormOOO

1ProgramCrafter

1rafaelCosman

New Comment

Compare with this from Meditations on Moloch:

Imagine a country with two rules: first, every person must spend eight hours a day giving themselves strong electric shocks. Second, if anyone fails to follow a rule (including this one), or speaks out against it, or fails to enforce it, all citizens must unite to kill that person. Suppose these rules were well-enough established by tradition that everyone expected them to be enforced. So you shock yourself for eight hours a day, because you know if you don’t everyone else will kill you, because if they don’t, everyone else will kill them, and so on.

Seems to me a key component here, which flows naturally from "punish any deviation from the profile" is this pattern of 'punishment of non-punishers'.

I found this explanation much clearer than the original post. Well done!

Everyone plays 99, unless someone defects [by the way this could be by playing 30, an action which literally helps everyone], in which case we play 100

This is indeed a very odd / unrealistic-seeming strategy profile. Consider a different, still weird but slightly more natural strategy profile:

Everyone plays 99, unless someone defects [which is defined as playing something

higherthan 99], in which case we play 100

This is not a Nash equilibrium, since any player can now do better by unilaterally lowering their own number. If the players start with this strategy profile, they probably quickly settle into the more natural Nash equilibrium of all playing 30.

It's useful to remember that a Nash equilibrium is an *equilibrium*. Various real systems:

- may not always be at equilibrium
- may be nudged away from a particular equilibrium by a very small nudge, and then settle into a radically different equilibrium pretty quickly thereafter.

To paraphrase my comment on that confusing post: what the Hell?

The example appears to show that, if everyone punishes good behavior, you'll get bad behavior. This seems pretty obvious when stated like that. What was the point being made beyond that, if anything?

I'm not sure what the author intended, but my best guess is they wanted to say "punishment is bad because there exist really bad equilibria which use punishment, by folk theorems". Some evidence from the post (emphasis mine):

Rowan: "If we succeed in making aligned AGI, we should punish those who committed cosmic crimes that decreased the chance of an positive singularity sufficiently."

Neal: "Punishment seems like a bad idea. It's pessimizing another agent's utility function.

You could get a pretty bad equilibrium if you're saying agents should be intentionally harming each others' interests, even in restricted cases."

[...]

Rowan: "Well, I'll ponder this. You may have convinced me of the futility of punishment, and the desirability of mercy, with your... hell simulation. That's... wholesome in its own way, even if it's horrifying, and ethically questionable."

Folk theorems guarantee the existence of equilibria for both good (31) and bad (99) payoffs for players, both via punishment. For this reason I view them as neutral: they say lots of equilibria exist, but not which ones are going to happen.

I guess if you are super concerned about bad equilibria, then you could take a stance against punishment, because then it would be harder/impossible for the everyone-plays-99 equilibrium to form. This could have been the original point of the post but I am not sure.

That's right, I think that was the original point. But this example seems to be a bad one for making that point, because it's punishing pro-social behavior. If you could show how punishing antisocial, defecting behavior had bad consequences, that would be interesting.

The solution for infinite iterated prisoners dilemma given in the opening post is an example of potential bad consequences for punishing anti-social defecting behavior. A single defection at any point in the infinitely long game causes both players to get the worst outcome in the limit.

If both players are perfectly rational and error-free then this is not fatal. However, a better strategy with reduced punishment and some mercy gets better outcomes in scenarios where those assumptions don't hold.

I agree, that's a way better example because that type of punishment sounds like a potentially good strategy on the face of it.

This is much clearer, thank you. I appreciate the inclusion of

so many Nash equilibria really demonstrates how limited the concept of Nash equilibrium can be.

Because it reinforces the danger of naively applying toy problems to real-world interactions. It doesn't require much communication to form a coalition to break the less-pleasant equilibria, in favor of better ones. The insight that the solutions are likely to be Nash equilibria is a good one, but the constraints that keep it a static one, rather than cycling or finding better ones, are hard to find in reality.

The number 99 isn’t unique—this works with any payoff between 30 and 100.

Actually, it only works with payoffs below 99.3 -- this is the payoff you get by setting the dial to 30 every round while everyone else sets their dials to 100, so any Nash equilibrium must beat that. This was mentioned in jessicata's original post.

Incidentally, this feature prevents the example from being a subgame perfect Nash equilibrium -- once someone defects by setting the dial to 30, there's no incentive to "punish" them for it, and any attempt to create such an incentive via a "punish non-punishers" rule would run into the trouble that punishment is only effective up to the 99.3 limit.

The only thing I learned from that original post is that, if you use mathematically precise axioms of behavior, then you can derive weird conclusions from game theory. This part is obvious and seems rather uninteresting.

The strong claim from the original post, namely the hell scenario, comes from then back-porting the conclusions from this mathematical rigor to our intuitions about a suggested non-rigorous scenario.

But this you cannot do unless you've confirmed that there's a proper correspondence from your axioms to the scenario.

For example, the scenario suggested that humans are imprisoned; but the axioms require unwilling, unchanging, inflexible robots. It suggests that your entire utility derives from your personal suffering from the temperature, so these suffering entities cannot be empathetic humans. Each robot is supposed to assume that, if it changes its strategy, none of the others does the same.

And so on. Once we've gone through the implications of all these axioms, we're still left with a scenario that seems bad for those inside it. But it doesn't deserve to be called hell, nor to appeal to our intuitions about how unpleasant this would be to personally experience. Because if humans were in those cages, that strategy profile absolutely would not be stable. If the math says otherwise, then the unrealistic axioms are at fault.

Why is this strategy profile a Nash equilibrium?

Because no player is better off deviating from this strategy profile, assuming all other players stick to the strategy profile.

Only if we limit thinking to 1 round at a time. But thinking longer term: anyone changing to 30 (end being punished for it), immediately changed the landscape for everyone else: punishment reached it's maximum, from now on everyones' incentive is to lower theirs for immediate reward. And knowing that and thinking ahead at least two rounds, players are better off deviating, thus they would. Or at least they should.

The definition of Nash equilibrium is that you assume all other players will stay with thier strategy. If, as in this case, that assumption does not hold then you have (I guess) an "unstable" equilibrium.

The infinitely repeated Hell game having so many Nash equilibria really demonstrates how limited the concept of Nash equilibrium can be. How the hell would the strategy profile “Everyone plays 99, unless someone defects [by the way this could be by playing 30, an action which literally helps everyone], in which case we play 100” arise in the real world? The answer is… that’s a good question.

Indeed. Like, how could you *know* that other people are planning to respond to you playing 30 by "playing 100 until you go back to playing 99"? It's unlikely that you, and many of the other players, would have found this out through experience, given that the strategy doesn't dominate unless almost all the players are already playing it. And if there was a phase where players announced their strategies, why would this be the strategy they all announced? Seems to only be likely if the situation really is Hell, i.e. being ruled by a sadistic authoritarian who chose this strategy and forced everyone to agree to it (which raises a question of whether he's still in control of the situation and how voluntary these choices are).

The most plausible real-world paths that come to mind are those in which the effects of people's actions used to be completely different—where playing 99 was a good thing (because, I dunno, the place was initially freezing) (but I guess where 100 was a bad thing), so people did want to encourage everyone to play it, and that's why they settled on the "play 99 or else be punished" strategy—and then it slowly morphed into the current situation, and I guess people were unable to communicate about it in the meantime.

Heavily-punished taboos, where you're punished for even talking about, for actions that were believed to be bad in ancient times, but which e.g. science has shown to be harmless and beneficial today—that kind of thing would be potentially relevant. (Scott wrote a story that is somewhat related.)

For example, here’s a Nash equilibrium: “Everyone agrees to put 99 each round. Whenever someone deviates from 99 (for example to put 30), punish them by putting 100 for the rest of eternity.”

I don't think this is actually a Nash equilibrium? It is dominated by the strategy "put 99 every round. Whenever someone deviates from 99, put 30 for the rest of eternity."

The original post I believe solved this by instead having the equilibrium be “Everyone agrees to put 99 each round. Whenever someone deviates from 99 (for example to put 30), punish them by putting 100 for the next 2 rounds”, which I think is a Nash equilibrium because the punishment being finite means that you're incentivized to stick with the algo even after punishment occurs.

It's not dominated - holding all other players constant the two strategies have equal payoffs, so neither dominates the other.

The strategy profile I describe is where each person has the following strategy (call it "Strategy A"):

- If empty history, play 99
- If history consists only of 99s from all other people, play 99
- If any other player's history contains a choice which is not 99, play 100

The strategy profile you are describing is the following (call it "Strategy B"):

- If empty history, play 99
- If history consists only of 99s from all other people, play 99
- If any other player's history contains a choice which is not 99, play 30

I agree Strategy B weakly dominates Strategy A. However, saying "everyone playing Strategy A forms a Nash equilibrium" just means that no player has a profitable deviation assuming everyone else continues to play Strategy A. Strategy B isn't a profitable deviation -- if you switch to Strategy B and everyone else is playing Strategy A, everyone will still just play 99 for all eternity.

The general name for these kinds of strategies is grim trigger.

I must be missing something, so I'd appreciate an explanation.

Say somebody plays 30 and gets punished by everybody else playing 100 forever. The "defector" doesn't gain anything by increasing their temperature from 30, so they stay at 30. Why would any of the punishers continue playing 100 if they would be better off by lowering the temperature and other ounishers couldn't ounish them anyway since the temperature is already permanently fixed at 100?

How the hell would the strategy profile “Everyone plays 99, unless someone defects [by the way this could be by playing 30, an action which literally helps everyone], in which case we play 100” arise in the real world? The answer is… that’s a good question.

Probably this outcome indicates the limited players' knowledge or bounded rationality.

I've run this experiment with computer-simulated RL agents with the following four basic strategies:

- minimize (always play 30);
- maximize (always play 100);
- equilibrium (the Hell profile);
- random (play random integer from [30; 100], each with equal probability).

The agent's algorithm is as follows: with probability 7/8 continue running the same strategy as before; with probability 1/8 select the strategy, where strategies having better rewards have higher chances to be selected. The initial strategy is chosen at random.

Ten times out of ten, this RL simulation consistently converged into all agents playing 30 (and not changing into other strategies) in not more than 10000 steps.

Interesting follow-up: how long do they take to break out of the bad equilibrium if all start there? How about if we choose a less extreme bad equilibrium (say 80 degrees)?

By less extreme bad equilibrium, do you mean "play 79, until someone defects, and then play 80"? Or "play 80 or 100"?

Here is the Python script I've used: https://gist.github.com/ProgramCrafter/2af6a5b1cde0ff8995b9502f1c502151

To make all agents start from Hell, you need to change line 31 to `self.strategy = equilibrium`

.

I, along with many commenters, found the explanation in Hell is Game Theory Folk Theorems somewhat unclear. I am re-explaining some of the ideas from that post here. Thanks to jessicata for writing a post on such an interesting topic.

1-shot prisoner’s dilemma.In a 1-shot prisoner’s dilemma, defecting is a dominant strategy. Because of this, (defect, defect) is the unique Nash equilibrium of this game. Which kind of sucks, since (cooperate, cooperate) would be better for both players.

Nash equilibrium.Nash equilibrium is just a mathematical formalism. Consider a

strategy profile, which is a list of which strategy each player chooses. A strategy profile is aNash equilibriumif no player is strictly better off switching their strategy, assuming everyone else continues to play their strategy listed in the strategy profile.Notably, Nash equilibrium says nothing about:

bounded rationality)Nash equilibria may or may not have predictive power. It depends on the game. Much work in game theory involves refining equilibrium concepts to have more predictive power in different situations (e.g.

subgame perfect equilibriumto handle credible threats,trembling hand equilibriumto handle human error).n-shot prisoner’s dilemma.OK, now what if people agree to repeat a prisoner’s dilemma n=10 times? Maybe the repeated rounds can build trust among players, causing cooperation to happen?

Unfortunately, the theory says that (defect, defect) is still the unique Nash equilibrium. Why? Because in the 10th game, players don’t care about their reputation anymore. They just want to maximize payoff, so they may as well defect. So, it is common knowledge that each player will defect in the 10th game. Now moving to the 9th game, players know their reputation doesn’t matter in this game, because everyone is going to defect in the 10th game anyway. So, it is common knowledge that each player will defect in the 9th game. And so on… This thought process is called

backwards induction.This shows that the unique Nash equilibrium is still (defect, defect), even if the number of repetitions is large. Why might this lack predictive power?

Probably the simplest way to model “uncertainty about the number of repetitions” is by assuming an

infinitenumber of repetitions.infinitely repeated prisoner’s dilemma.OK, now assume the prisoner’s dilemma will be repeated forever.^{[1]}Turns out, now, there exists a Nash equilibrium which involves cooperation! Here is how it goes. Each player agrees to play cooperate, indefinitely. Whenever any player defects, the other player responds by defecting for the rest of all eternity (“punishment”).

technical Q: Hold on a second. I thought Nash equilibrium was “static” in the sense that it just says: given that everybody is playing a Nash equilibrium strategy profile, if a single person deviates (while everyone else keeps playing according to the Nash equilibrium strategy profile), then they will not be better off from deviating. This stuff where players choose to punish other players

in response tobad behavior seems like a stronger equilibrium concept not covered by Nash.A: Nope! A (pure) strategy profile is a list of strategies for each player. In a single prisoner’s dilemma, this is just a choice of “cooperate” or “defect”. In a repeated prisoner’s dilemma, this is much more complicated. A strategy is a

complete contingency planof what the player plans to do,given any possible history of the repeated game(even unrealized histories). So, for example, in the 5th game, a strategy isn’t just a choice of “cooperate” or “defect”. It is a choice of: “What do I do if the other player has played (C,C,C,C) until now? What do I do if the other player has played (C,C,C,D) until now? What do I do if the other player has played (C,C,D,D) until now? [...]” So, for this infinitely repeated game, the strategy profile of “everyone cooperates, until one asshole defects, and then we punish the asshole forever” is in fact simply a list of strategies (= complete contingency plans) for each player. And, it happens to be a Nash equilibrium.the Hell example.Finally, we explain what the hell is going on in the Hell example. Recall the setup:

Stage game:There are 100 players. Each player picks an integer between 30 and 100. Then, every player earns disutility equal to the average of the numbers everybody picked. (The “hell” intuition is because: imagine the number between 30 and 100 as temperature in Celcius. The bigger the number the worse off you are.)Note that playing 30 is a strictly dominant strategy in the stage game: no matter what the other people do, you make the average slightly cooler by playing 30, so you may as well do that. So, the unique Nash equilibrium of the one-shot Hell game is where everyone puts 30.

Now consider what happens when this Hell game is infinitely repeated.

Weirdly, there are many more Nash equilibria than just “everyone puts 30”.Why? Same intuition as with the prisoner’s dilemma.For example, here’s a Nash equilibrium: “Everyone agrees to put 99 each round. Whenever someone deviates from 99 (for example to put 30), punish them by putting 100 for the rest of eternity.”

Why is this strategy profile a Nash equilibrium?

Because no player is better off deviating from this strategy profile, assuming all other players stick to the strategy profile. Specifically: any player that deviates (for example by putting 30), will get a small bit of relief in the short term. But then, because of the way this strategy profile is specified, that player (along with everyone else) will have to live with a temperature of 100 (minus epsilon) for the rest of time. Which is worse than just going with the default and putting 99 each round.The number 99 isn’t unique—this works with any payoff between 30 and 100. (This is the technical condition of “feasibility”.) There’s also another technical condition about minimax individual rationality: basically, each player can always guarantee themselves their own minimax payoff, by definition of minimax. So, there can never be a Nash equilibrium where a player earns less than their minimax payoff, because they have a profitable deviation by playing their minimax strategy instead. Something something don't get distracted by the boilerplate.

What are folk theorems in general?They are theorems (with actual proofs -- they’re called “folk” because they were already well-known in the 50s, but people did not bother to codify them in the academic literature until the 70s) which discuss situations like the Hell game. Usually they have the form: “If some stage game is repeated many times, then there are very many equilibria, which work by having players punish anyone who deviates.” The reason why there are lots of these theorems is you can vary this basic setup, for example by introducing a discount factor.Conclusion/TLDR.^{^}Technical note: you might wonder what a player’s utility from this game is. You can’t just give player utility equal to the sum of their payoffs in each of the individual stage games, because 1+1+1+... and 2+2+2+... are both infinity. Instead, each player will earn utility equal to the

averageof their payoffs in each of the stage games. Or more specifically, the lim inf of partial averages.