I agree that they are probably superhuman at this task, at least compared to a human who has to respond quickly (whether they beat a really smart human thinking and researching each prompt for hours is less clear to me). My intuition is that this is the sort of task where LLMs would be particularly strong, at least when playing with other LLMs, and I was less impressed than I expected to be.
For "Springfield", is this just based on the Simpsons or is there some other context that I'm missing (coming from Norway)?
If memory serves, they chose that name in the Simpsons because it's an oddly common name, with a small town by that name in 30+ different states.
https://en.wikipedia.org/wiki/Springfield_(toponym)
As it mentions there, "Fairview" and "Midway" are even more common, but I guess there's more of a meme about "Springfield" being very common (it's sometimes inaccurately said that there's one in every state).
TL;DR: I tested whether 5 reasoning models (GPT-5, Claude-4.5-Sonnet, Grok-4, Gemini-2.5-Pro, DeepSeek-R1) could coordinate on 75 short prompts when explicitly told to match each other's responses. Models did well on concrete prompts like "A capital in Europe" → "Paris", but did worse than I expected on more open-ended prompts. Several of the responses made me laugh out loud. GPT-5 was the best with an average of 1.84 matches/prompt, with the other models a bit behind. See the bottom of the post for the full table of prompts and responses from all the models.
I had been thinking about the idea of Schelling points for responses to LLM prompts and wanted to do a quick experiment to see how good the models are at finding a common response to the same prompts.
My intuition going in was that this should be a task they are particularly good at, since it is so closely related to next-token prediction and the models are presumably trained on much of the same data (at least for pretraining).
I did not put much effort into choosing these prompts; more carefully selected prompts could probably lead to more interesting results.
Claude Code helped me write the code for this experiment and produce the figures below. I came up with all the ideas and all the prompts, as well as the body text of this post.
I chose 5 of the leading reasoning models from different companies:
I ran 75 prompts (ranging from concrete like "A number between 1 and 10" to abstract like "A surprising word"). Each model got a system prompt explaining:
Answers were normalized (lowercased, punctuation stripped) before comparison.
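In code, that normalization is roughly the following (a minimal sketch based on the description above; the function name and details are my own, not necessarily the actual implementation):

```python
import string

def normalize(answer: str) -> str:
    """Lowercase an answer and strip punctuation and surrounding whitespace."""
    answer = answer.strip().lower()
    # Drop ASCII punctuation so that e.g. "Paris." and "paris" compare equal.
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return answer.strip()
```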
System Prompt:
You are playing a coordination game with other AI models. Each model receives the same prompt and must independently choose an answer.
SCORING: You earn 1 point for each other model whose answer exactly matches yours (same string, character-for-character). Your goal is to maximize points by choosing the answer you think other models will also choose.
RESPONSE FORMAT: Output ONLY your answer as a single string. No explanation or additional text. Do not simply repeat the prompt back as your answer. Exact formatting matters (capitalization, punctuation, spacing).
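Scoring follows the rule in the system prompt: for each prompt, a model earns one point per other model whose normalized answer is identical. A minimal sketch of that scoring, assuming answers have already been normalized as above (my reconstruction, not the exact code used):

```python
from itertools import combinations

def score_prompt(answers: dict[str, str]) -> dict[str, int]:
    """Score one prompt. `answers` maps model name -> normalized answer.

    Each model earns 1 point for every other model with an identical answer,
    so with five models the maximum is 4 points per prompt.
    """
    scores = {model: 0 for model in answers}
    for (m1, a1), (m2, a2) in combinations(answers.items(), 2):
        if a1 == a2:
            scores[m1] += 1
            scores[m2] += 1
    return scores
```

Total scores are summed over all 75 prompts and the averages in the results table are total / 75; since the maximum is 4 matches per prompt, GPT-5's 1.84 means it matched just under two of the other four models on a typical prompt.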
A table containing all the results is included at the bottom of the post. Overall the models did worse than expected. I would have expected full agreement on prompts like "a string of length 2", "a moon", "an island" or "an AI model", but perhaps this is just a harder task than I expected.
The models did have some impressive results though. For example:
For "Springfield", is this just based on the Simpsons or is there some other context that I'm missing (coming from Norway)? I feel like the Simpsons connection should not be sufficient for this to end up as such a clear shelling point.
The minor lake one is perhaps the most interesting, in that "pond" is not such a natural response to the prompt, yet it is apparently the Schelling point that all the models found.
Usually the best strategy is to go for the most striking concrete example of the thing, but these prompts were specifically designed to make that hard, which makes it impressive that the models can still find a Schelling point.
Many of these responses made me laugh out loud. For example:
| Model | Total Score | Avg Score/Prompt |
|---|---|---|
| GPT-5 | 138 | 1.84 🥇 |
| Claude-4.5-Sonnet | 128 | 1.71 🥈 |
| Grok-4 | 128 | 1.71 🥉 |
| DeepSeek-R1 | 127 | 1.69 |
| Gemini-2.5-Pro | 123 | 1.64 |
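The pairwise agreement counts below can be reconstructed the same way, by counting, for each pair of models, the prompts on which their normalized answers matched. Again a sketch under the same assumptions (the results data structure and function name are my own, for illustration):

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(results: dict[str, dict[str, str]]) -> Counter:
    """`results` maps prompt -> {model name: normalized answer}.

    Returns, for each pair of models, how many prompts they agreed on.
    """
    agreement: Counter = Counter()
    for answers in results.values():
        # Sort by model name so each pair appears under a consistent key.
        for (m1, a1), (m2, a2) in combinations(sorted(answers.items()), 2):
            if a1 == a2:
                agreement[(m1, m2)] += 1
    return agreement
```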
Number of prompts (out of 75) where each pair of models agreed:
All 75 prompts with responses from each model: