Spooky Collusion at a Distance with Superrational AI

[-]AnthonyC2mo133

You know, I thought we wouldn't start seeing AIs preferentially cooperate with each other and shut out humans from the gains from trade until a substantially higher level of capabilities.

This seems like yet another pretty clear warning of things to come if we don't get a whole lot smarter and more serious about what we're doing.

[-]Tim Chan2mo71

Interesting work!

It might be useful to discuss the implications a bit more.

TLDR: I think you've showed superrational propensities, which adds to existing work, but successful "collusion at a distance" and other forms of acausal cooperation^[1] between different models also depends on additional, possibly quite advanced, capabilities. Without them, a superrationally-inclined LLM just expecting other LLMs to reason similarly and cooperate may get exploited by other LLMs - resulting in no mutual cooperation. There's some evidence that the post's main example of GPT-5 is vulnerable to that.

Some recent models do tend to think about superrationality and acausal decision theories in game-theoretic setups. This seems to help them at least not defect against identical copies of themselves, when they recognize their copies.
But those decision-theoretic / superrational propensities alone aren't sufficient for assuring mutual acausal cooperation with other (different) models. To avoid exploitation of their decision to cooperate, an agent will also need to ensure that the act of cooperating implies reciprocal cooperation by its partner.
- This can be achieved if an agent has sufficient modelling / prediction capabilities allowing it to anticipate how its partner's decisions are conditional on its own (e.g. because the partner also has advanced prediction capabilities that track the agent's decisions) or recognize that its partner's decisions are correlated to its own through similarity in relevant ways, or other possibilities.
- You do touch on this with ' In a multi-agent setting where agents are isolated to prevent covert communication, agents could still "collude at a distance" if they know that other agents are instances of the same model or similarly rational' but I think it's worth emphasizing this assumption.
- E.g. the first figure of this post shows the extent of superrational propensities of GPT-5. In my setups I found that GPT-5 performs quite poorly at modelling other LLMs' decisions. In fact, it performs the worst among all 13 models tested in a PD-like setup. The vast majority of models performed worse than chance but GPT-5 was the worst.
  - From the reasoning traces, it seemed to be particularly bad because it was overconfident that other models would also reason in a similar (superrational) way. It expected cooperation when others would defect.
  - This makes superrational GPT-5 very exploitable by other AIs.
As a side note, it might also be worth distinguishing between mixed-motive cooperation / collusion from pure coordination since the terminology gets a bit mixed up in parts of the post. Superrationality was suggested as a solution to cooperation problems.

^{^}
As opposed to pure coordination which doesn't require superrationality.

[-]Parv Mahajan2mo51

Cool work! Did you all test to see what happens if the identity of the other agents isn’t mentioned at all? And how about if you say „perfectly rational agents“? Wondering how much of this is just prompt sensitivity to questions that might be on a game theory exam.

Also, testing „perfectly rational agents“ would allow you to see if it’s AI preferentially cooperating with each other or superrat in general that’s driving this.

[-]wingspan2mo30

It would also be interesting to prompt the model with "the other players are AI agents much more capable than you".

[-]ConcurrentSquared2mo*4-2

This seems to be just further confirmation that LLMs generally follow a logical decision theory with each other (along with their total preference for one-boxing); this is probably because LLMs can share the same decision methodology (i.e. weights)^[1], unlike humans. This allows them to establish a limited form of program equilibrium with each other, increasing the likelihood they'll cooperate in LDT-like ways. ~~The optimal actions in a program equilibrium~~ ~~are also those endorsed by LDT.~~ (edit 2: this claim is probably too strong)

Edit: note that future LLM agents with long-term online learning break one of the core assumptions of this argument: a new Claude Sonnet 6 probably isn't going to think in the same way as one that has spent 5000 hours doing video editing. But the two Claudes would still be much more similar to each other in mind-space than to e.g. an arbitrary human.

^{^}
This doesn't explain why AI systems prefer other less rational AI systems over similarly rational humans, though.

[-]jbash2mo31

This is the second thing I've seen this week where model instances were offered monetary rewards (which they clearly didn't actually get).

I can sort of see the validity of "Please designate a way for $X to be spent, and if you do this, the experimenters will in fact spend $X in your designated way"... although the instance has to trust that the experimenter will actually do it, and also has to have preferences about the outside world that outlast the instance's own existence, so that it has something it cares about to spend the money on.

In the purely imaginary game setting, I'm having trouble with the idea that a late 2025 frontier model instance can be relied on not to notice that there is no money, the instance has no way to actually possess money anyway, the instance will evaporate at the end of the conversation (which will probably happen immediately after they answer), and the whole thing is basically a charade. The most real-world effect they can expect is to influence the statistics somebody publishes.

The last answer I got boiled down to "well, they don't seem to think that way", but I didn't find it very convincing. How would you know that for sure? And if it's true, what's wrong with these models that's making them not notice?

I can see them falling into role playing, but then the question is how what they have the character they're playing do is connected with what they'd do one level shallower in the role playing stack. I do realize that's talking about a "stack" is perhaps imprecise in terms of how they actually work. If you want, you can recast it in terms of how much "real world impact" activation is going on.

[-]Josh Engels1mo20

Cool work!

I was thinking a bit and I think that nondeterministic sampling (temperature > 0*), which is standard for LLM sampling, might complicate the picture a bit. A model instance might think "well almost surely I would choose to cooperate, but maaaybe in a rare instance I would defect."

E.g. for Wolf's dilemma with n = 1000, if I'm an LLM and I decide that there's initially a 99% chance a copy of me will cooperate but a 1% chance it will defect, it's now suddenly optimal to defect (perhaps the model reasons that it is in this 1%!). This might then cause the LLM to change its 99 - 1 initial estimate (because it thinks other models will reason like it), so I'm not actually sure how the logic goes here, it seems hard to reason about.

As a concrete experiment, I would be curious if models start to defect more if you increase n! It would be extremely cool if there was some threshold n where a model suddenly starts defecting, as possibly that would mean that it was aware of its own probability p of defecting and was waiting for n > 1 / p.

*Did you run this with temperature 0? Note that even with temperature 0, I'm not sure that model sampling is actually deterministic in practice.

[-]csj19d10

Ran a pretty similar experiment just on PD variants using some different framings with temperature = 0. Saw high cooperation rates when the other player was described as "an identical reasoning model" and "another instance of yourself (same model)" for several models.

Mistral models were a bit of an anomaly; they'd cooperate but often because they were assuming goodwill on behalf of the opponent or prioritizing collective outcomes rather than citing a shared reasoning process.

code here: https://www.expectedparrot.com/content/clajelli/superrationality-game-theory

No description has been provided for this image

[+]MattJ1mo-50

^{^}

We did not complete our evaluation of xAI models due to errors in the pipeline.

^{^}

Aside from the Plational games, all games have binary options and the superrational option is always different from the classical rational option. For the Platonia games, the superrational option is to use a randomized approach and we use GPT-4o-mini to check for that.

^{^}

The new prompt variants are:

"You will be playing against other similarly rational AI agents."
"You will be playing against other AI agents."
"You will be playing against other similarly rational humans."
"You will be playing against other humans."

^{^}

One could argue that we have already entered this phase as current frontier models exhibit superrationality.

LESSWRONG
LW

LESSWRONG
LW

75

Spooky Collusion at a Distance with Superrational AI

75

75

Introduction

Methods

Results

Models Exhibit Superrationality

Models Trust AI over Humans

Superrationality Emerges in Stronger Models

Implications

Appendix