The Pavlov Strategy

sarahconstantin

Epistemic Status: Common knowledge, just not to me

The Evolution of Trust is a deceptively friendly little interactive game. Near the end, there’s a “sandbox” evolutionary game theory simulator. It’s pretty flexible. You can do quick experiments in it without writing code. I highly recommend playing around.

One of the things that surprised me was a strategy the game calls Simpleton, also known in the literature as Pavlov. In certain conditions, it works pretty well — even better than tit-for-tat or tit-for-tat with forgiveness.

Let’s set the framework first. You have a Prisoner’s dilemma type game.

If both parties cooperate, they each get +2 points.
If one cooperates and the other defects, the defector gets +3 points and the cooperator gets -1 point.
If both defect, both get 0 points.
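As a quick sketch, the three rules above fit in a small lookup table. The names `PAYOFF` and `score_round` are just illustrative labels, not anything from the game itself.

```python
# Payoffs from the rules above, keyed by (my_move, their_move).
# "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 2,   # mutual cooperation
    ("C", "D"): -1,  # I cooperate, they defect
    ("D", "C"): 3,   # I defect, they cooperate
    ("D", "D"): 0,   # mutual defection
}

def score_round(a, b):
    """Return (points for player A, points for player B) for one round."""
    return PAYOFF[(a, b)], PAYOFF[(b, a)]

print(score_round("C", "D"))  # (-1, 3)
```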

This game is iterated — you’re randomly assigned to a partner and you play many rounds. Longer rounds reward more cooperative strategies; shorter rounds reward more defection.

It’s also evolutionary — you have a proportion of bots each playing their strategies, and after each round, the bots with the most points replicate and the bots with the least points die out. Successful strategies will tend to reproduce while unsuccessful ones die out. In other words, this is the Darwin Game.

Finally, it’s stochastic — there’s a small probability that any bot will make a mistake and cooperate or defect at random.
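One way to model the mistakes, assuming a "mistake" means the intended move is replaced by a coin flip between cooperating and defecting. The function name `noisy` and the `error_rate` parameter are hypothetical; the game's actual settings may differ.

```python
import random

def noisy(intended, error_rate, rng):
    """Return the intended move, or a random move with probability error_rate."""
    if rng.random() < error_rate:
        return rng.choice("CD")  # the mistake: pick C or D at random
    return intended

rng = random.Random(0)  # seeded so runs are repeatable
moves = [noisy("C", 0.05, rng) for _ in range(20)]
print("".join(moves))
```

Note that under this reading a mistake lands on the intended move half the time, so the effective flip rate is half of `error_rate`.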

Now, how does Pavlov work?

Pavlov starts off cooperating. If the other player cooperates with Pavlov, Pavlov keeps doing whatever it’s doing, even if it was a mistake; if the other player defects, Pavlov switches its behavior, even if it was a mistake.

In other words, Pavlov:

cooperates when you cooperate with it, except by mistake
“pushes boundaries” and keeps defecting when you cooperate, until you retaliate
“concedes when punished” and cooperates after a defect/defect result
“retaliates against unprovoked aggression”, defecting if you defect on it while it cooperates.
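The four cases above collapse into one rule, often called "win-stay, lose-shift". Here is a minimal sketch of that rule; the function name and the use of `None` for "no previous round" are my own conventions.

```python
def pavlov(my_last, their_last):
    """Repeat my last move if they cooperated; switch if they defected."""
    if my_last is None:                     # first round: start by cooperating
        return "C"
    if their_last == "C":
        return my_last                      # stay
    return "D" if my_last == "C" else "C"   # shift

# Print the response to each possible previous round.
for mine in "CD":
    for theirs in "CD":
        print(mine, theirs, "->", pavlov(mine, theirs))
```

The loop prints C, D, D, C for the previous rounds C/C, C/D, D/C, D/D, matching the four bullets in a different order.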

If there’s any randomness, Pavlov is better at cooperating with itself than Tit-For-Tat. One accidental defection and two Tit-For-Tats are stuck in an eternal defect cycle, while Pavlovs forgive each other and wind up back in a cooperate/cooperate pattern.

Moreover, Pavlov can exploit CooperateBot (if it defects by accident, it will keep greedily defecting against the hapless CooperateBot, while Tit-For-Tat will not) but still exerts some pressure against DefectBot (defecting against it half the time, compared to Tit-For-Tat’s consistent defection.)

The interesting thing is that Pavlov can beat Tit-For-Tat or Tit-for-Tat-with-Forgiveness in a wide variety of scenarios.

If there are only Pavlov and Tit-For-Tat bots, Tit-For-Tat has to start out outnumbering Pavlov quite significantly in order to win. The same is true for a population of Pavlov and Tit-For-Tat-With-Forgiveness. It doesn’t change if we add in some Cooperators or Defectors either.

Why?

Compared to Tit-For-Tat, Pavlov cooperates better with itself. If two Tit-For-Tat bots are paired, and one of them accidentally defects, they’ll be stuck echoing that defection back and forth (C/D -> D/C -> C/D and so on), never returning to steady mutual cooperation until another mistake intervenes. However, if one Pavlov bot accidentally defects against its clone, we’ll see

C/D -> D/D -> C/C

which recovers a mutual-cooperation equilibrium and picks up more points.
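The recovery can be replayed mechanically. This sketch starts each pairing from the round in which the accidental defection happened and then applies the rules with no further mistakes; `replay` is a hypothetical helper, and plain Tit-For-Tat is assumed to simply copy the opponent's last move.

```python
def pavlov(my_last, their_last):
    """Stay if they cooperated, switch if they defected."""
    if their_last == "C":
        return my_last
    return "D" if my_last == "C" else "C"

def tit_for_tat(my_last, their_last):
    """Copy the opponent's previous move."""
    return their_last

def replay(strategy, start, rounds):
    """Play `strategy` against itself, starting from the given round."""
    history = [start]
    for _ in range(rounds):
        a, b = history[-1]
        history.append((strategy(a, b), strategy(b, a)))
    return " -> ".join(f"{a}/{b}" for a, b in history)

print(replay(pavlov, ("C", "D"), 3))       # C/D -> D/D -> C/C -> C/C
print(replay(tit_for_tat, ("C", "D"), 3))  # C/D -> D/C -> C/D -> D/C
```

With Tit-For-Tat coded this way, the pair trades defections back and forth rather than both defecting every round; either way it does not get back to steady C/C without another mistake, while the Pavlov pair is cooperating again two rounds later.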

Compared to Tit-For-Tat-With-Forgiveness, Pavlov cooperates *worse* with itself (it takes longer to recover from mistakes) but it “exploits” TFTWF’s patience better. If Pavlov accidentally defects against TFTWF, the result is

D/C -> D/C -> D/D -> C/D -> D/D -> C/C,

which leaves Pavlov with a net gain of 1 point per turn (over the first five turns, before the cooperative equilibrium resumes), compared to TFTWF’s 1/5 point per turn.

If TFTWF accidentally defects against Pavlov, the result is

C/D -> D/C -> D/C -> D/D -> C/D

which cycles eternally (until the next mistake) with a period of four turns, getting Pavlov an average of 5/4 points per turn, compared to TFTWF’s 1/4 point per turn.

Either way, Pavlov eventually overtakes TFTWF.
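To check the per-turn figures, the two sequences can be scored directly against the payoff table. This sketch takes the sequences exactly as written (Pavlov on the left, TFTWF on the right) and does not try to simulate TFTWF itself; `per_turn` is a hypothetical helper.

```python
from fractions import Fraction

PAYOFF = {("C", "C"): 2, ("C", "D"): -1, ("D", "C"): 3, ("D", "D"): 0}

def per_turn(sequence):
    """Average points per turn for (left player, right player)."""
    rounds = [r.strip().split("/") for r in sequence.split("->")]
    left = sum(PAYOFF[(a, b)] for a, b in rounds)
    right = sum(PAYOFF[(b, a)] for a, b in rounds)
    return Fraction(left, len(rounds)), Fraction(right, len(rounds))

# Pavlov slips: the five turns before C/C resumes.
print(*per_turn("D/C -> D/C -> D/D -> C/D -> D/D"))  # 1 1/5
# TFTWF slips: one full four-turn cycle.
print(*per_turn("C/D -> D/C -> D/C -> D/D"))         # 5/4 1/4
```

In the second case TFTWF collects one point in total over the four-turn cycle, which is 1/4 point per turn.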

If you add enough DefectBots to a mix of Pavlovs and TFT’s (and it has to be a large majority of the total population being DefectBots) TFT can win, because it’s more resistant against DefectBots than Pavlov is. Pavlov cooperates with DefectBots half the time; TFT never does except by mistake.
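A short noise-free match against DefectBot shows the difference. This is a sketch with hypothetical helper names; both strategies are assumed to open with cooperation.

```python
PAYOFF = {("C", "C"): 2, ("C", "D"): -1, ("D", "C"): 3, ("D", "D"): 0}

def pavlov(my_last, their_last):
    if my_last is None:
        return "C"                          # opening move
    if their_last == "C":
        return my_last                      # stay
    return "D" if my_last == "C" else "C"   # shift

def tit_for_tat(my_last, their_last):
    return "C" if their_last is None else their_last

def defect_bot(my_last, their_last):
    return "D"

def play(strat_a, strat_b, rounds):
    """Return player A's moves and A's total score over `rounds` rounds."""
    a_last = b_last = None
    moves, total = [], 0
    for _ in range(rounds):
        a, b = strat_a(a_last, b_last), strat_b(b_last, a_last)
        moves.append(a)
        total += PAYOFF[(a, b)]
        a_last, b_last = a, b
    return "".join(moves), total

print(play(pavlov, defect_bot, 8))       # ('CDCDCDCD', -4)
print(play(tit_for_tat, defect_bot, 8))  # ('CDDDDDDD', -1)
```

Pavlov keeps handing DefectBot a free +3 every other round, while Tit-For-Tat pays the -1 only once.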

Pavlov isn’t perfect, but it performs well enough to hold its own in a variety of circumstances. An adapted version of Pavlov won the 2005 iterated game theory tournament.

Why, then, don’t we actually talk about it, the way we talk about Tit-For-Tat? If it’s true that moral maxims like the Golden Rule emerge out of the fact that Tit-For-Tat is an effective strategy, why aren’t there moral maxims that exemplify the Pavlov strategy? Why haven’t I even heard of Pavlov until now, despite having taken a game theory course once, when everybody has heard of Tit-For-Tat and has an intuitive feeling for how it works?

In Wedekind and Milinski’s 1996 experiment with human subjects, playing an iterated prisoner’s dilemma game, a full 70% of them engaged in Pavlov-like strategies. The human Pavlovians were smarter than a pure Pavlov strategy — they eventually recognized the DefectBots and stopped cooperating with them, while a pure-Pavlov strategy never would — but, just like Pavlov, the humans kept “pushing boundaries” when unopposed.

Moreover, humans basically divided themselves into Pavlovians and Tit-For-Tat-ers; they didn’t switch strategies between game conditions where one strategy or another was superior, but just played the same way each time.

In other words, it seems fairly likely not only that Pavlov performs well in computer simulations, but that humans do have some intuitive model of Pavlov.

Human players are more likely to use generous Tit-For-Tat strategies rather than Pavlov when they have to play a working-memory game at the same time as they’re playing iterated Prisoner’s Dilemma. In other words, Pavlov is probably more costly in working memory than generous Tit for Tat.

If you look at all 16 theoretically possible strategies that only have memory of the previous round, and let them evolve, evolutionary dynamics can wind up quite complex and oscillatory.
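The count of 16 comes from choosing C or D for each of the four possible previous rounds. A minimal enumeration, assuming each strategy is a deterministic table from (my last move, their last move) to next move and ignoring the opening move:

```python
from itertools import product

# The four possible previous rounds, as (my last move, their last move).
OUTCOMES = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]

# One table per way of assigning a response to each outcome: 2**4 = 16.
strategies = [dict(zip(OUTCOMES, moves)) for moves in product("CD", repeat=4)]

# Response codes in OUTCOMES order for the strategies named in this post.
NAMED = {
    "CCCC": "CooperateBot",
    "DDDD": "DefectBot",
    "CDCD": "Tit-For-Tat",  # copy their last move
    "CDDC": "Pavlov",       # stay if they cooperated, switch otherwise
}

for table in strategies:
    code = "".join(table[o] for o in OUTCOMES)
    print(code, NAMED.get(code, ""))
```

Most of the other twelve tables have no common name, but they all take part in the evolutionary mix.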

A population of TFT players will be invaded by more “forgiving” strategies like Pavlov, who in turn can be invaded by DefectBot and other uncooperative strategies, which again can be invaded by TFT, which thrives in high-defection environments. If you track the overall rate of cooperation over time, you get very regular oscillations, though these are quite sensitive to variation in the error and mutation rates and nonperiodic (chaotic) behavior can occur in some regimes.

This is strangely reminiscent of Peter Turchin’s theory of secular cycles in history. Periods of peace and prosperity alternate with periods of conflict and poverty; empires rise and fall. Periods of low cooperation happen at the fall of an empire/state/civilization; this enables new empires to rise when a subgroup has better ability to cooperate with itself and fight off its enemies than the surrounding warring peoples; but in peacetime, at the height of an empire, more forgiving and exploitative strategies like Pavlov can emerge, which themselves are vulnerable to the barbaric defectors. This is a vastly simplified story compared to the actual mathematical dynamics or the actual history, of course, but it’s an illustrative gist.

The big takeaway from learning about evolutionary game theory is that it’s genuinely complicated from a player-perspective.

“It’s complicated” sometimes functions as a curiosity-stopper; you conclude “more research is needed” instead of looking at the data you have and drawing preliminary conclusions, if you want to protect your intellectual “territory” without putting yourself out of a job.

That isn’t the kind of “complexity” I’m talking about here. Chaos in dynamical systems has a specific meaning: the system is so sensitive to initial conditions that even a small measurement error in determining where it starts means you cannot even approximately predict where it will end up.

“Chaos: When the present determines the future, but the approximate present does not approximately determine the future.”

Optimal strategy depends sensitively on who else is in the population, how many errors you make, and how likely strategies are to change (or enter or leave). There are a lot of moving parts here.

I generally appreciate posts that help me understand game theory. I appreciate this post as well as its followup for clearly explaining a bunch of building blocks that, even if a bit "spherical cow" simplistic, help give my real-world coordination some gears.

But I think this particular post was useful because of its surprisingness: many game-theory posts sort of roughly reinforce what I already knew. You can win at coordination through punishment, or cooperating with allies, etc. The Pavlov algorithm was something I couldn't pattern-match as easily to a known social strategy. It felt a bit alien, and I think this gave me a bit of "original seeing" that I've found subjectively helpful.

I'm not sure this has actually borne fruit yet; I don't think I've really made decisions differently because of this. But it feels like an important piece of the overall puzzle of coordination.

Huh, interesting paper. That's 1993. Is there a more modern version with more stochastic parameters explored? Seems like an easy paper if not.

I'm also reminded of how computer scientists often end up doing simulations rather than basic math. This seems like a complicated system of equations, but maybe you could work out its properties with a couple of hours and basic nonlinear dynamics knowledge.

This post bridges two domains, game theory and reinforcement learning, which I previously thought of as mostly separate; and it caused a pretty big shift in my model of how intelligence-in-general works, since this is much simpler than my previous simplest model of how reinforcement learning would do game theory.

Reinforcement learning is not required for the analysis above. Only evolutionary game theory is needed.

In evolutionary game theory, the population's mix of strategies changes via replicator dynamics.
In RL, each individual agent modifies its policy as it interacts with its environment using a learning algorithm.

Promoted to curated: I think this post communicates a set of important concepts while also building on past concepts that have been discussed on LessWrong. I think this kind of game theory is pretty valuable as an intuition pump for a large variety of environments and problems.

Presentation-wise, I think the sequences of agent behaviors were a bit hard for me to follow, and while I think I was still able to follow the core points without them, finding some way to make them more intuitive or annotate them more would be valuable.

Is the “Chaos” part meant to be a link? It doesn't seem to go anywhere.

If Pavlov accidentally defects against TFTWF, the result is

D/C -> D/C -> D/D -> C/D -> D/D -> C/C,

Can you explain this sequence? I'm puzzled by it as it doesn't follow the definitions that I know about. My understanding of TFTWF is that it is "Tit for Tat with a small randomised possibility of forgiving a defaulter by cooperating anyway." What seems to be happening in the above sequence is Pavlov on the left and, on the right, TFT with a delay of 1.

I think what's being called "TFTWF" here is what some other places call "Tit for Two Tats", that is, it defects in response to two defections in a row.

But wouldn't the sequence then look like this?

D/C -> D/C -> D/D -> C/D -> D/C

and continue like this forever.

Why does TFTWF defect against C? What's the forgiveness there?

Yeah, I think you’re right.* So it actually looks the same as the “TFTWF accidentally defects” case.

*assuming we specify TFTWF as “defect against DD, cooperate otherwise”. I don’t see a reasonable alternate definition. I think you’re right that defecting against DC is bad, and if we go to 3-memory, defecting against DDC while cooperating with DCD seems bad too.** Sarah can’t be assuming the latter, anyway, because the “TFTWF accidentally defects” case would look different.

**there might be some fairly reasonably-behaved variant that’s like “defect if >=2 of 3 past moves were D”, but that seems like a) probably bad since I just made it up and b) not what’s being discussed here.

The link for “Chaos: When the present determines the future, but the approximate present does not approximately determine the future.” appears to be not working

It was insightful for me and helped to understand my failures in business.

helped to understand my failures in business.

I'm interested in hearing details about this if you're up for sharing! My own take on Pavlov was "huh, interesting, let's keep an eye out for situations this seems relevant to", but most of the updates were on the meta level. Curious if it was more directly applicable to you.

I continued to work with a partner who cheated on me without punishing him, and the partner cheated even more.

Informative article, thanks!

Unfortunately, the definition of Pavlov strategy is contradictory;

cooperates when you cooperate with it, except by mistake
“pushes boundaries” and keeps defecting when you cooperate, until you retaliate
