I agree that they are probably superhuman at this task, at least compared to a human who has to respond quickly (whether they beat a really smart human thinking and researching each prompt for hours is less clear to me). My intuition is that this is the sort of task where LLMs would be particularly strong, at least when playing with other LLMs, and I was less impressed than I expected to be.
For "Springfield", is this just based on the Simpsons or is there some other context that I'm missing (coming from Norway)?
If memory serves, they chose that name in the Simpsons because it's an oddly common name, with a small town by that name in 30+ different states.
https://en.wikipedia.org/wiki/Springfield_(toponym)
As it mentions there, "Fairview" and "Midway" are even more common, but I guess there's more of a meme about "Springfield" being very common (it's sometimes inaccurately said that there's one in every state).
TL;DR: I tested whether 5 reasoning models (GPT-5, Claude-4.5-Sonnet, Grok-4, Gemini-2.5-Pro, DeepSeek-R1) could coordinate on 75 short prompts when explicitly told to match each other's responses. Models did well on concrete prompts like "A capital in Europe" → "Paris", but did worse than I expected on more open-ended prompts. Several of the responses made me laugh out loud. GPT-5 was the best with an average of 1.84 matches/prompt, with the other models a bit behind. See the bottom of the post for the full table of prompts and responses from all the models.
I had been thinking about the idea of Schelling points for responses to LLM prompts and wanted to do a quick experiment to see how good the models are at finding a common response to the same prompts.
My intuition going in was that this should be a task they are particularly good at, since it is so closely related to next-token prediction and the models are presumably trained on much of the same data (at least for pretraining).
I did not put much effort into choosing these prompts; more carefully selected prompts could probably lead to more interesting results.
Claude Code helped me write the code for this experiment and produce the figures below. I came up with all the ideas and all the prompts, as well as the body text of this post.
I chose 5 of the leading reasoning models from different companies:
I ran 75 prompts (ranging from concrete like "A number between 1 and 10" to abstract like "A surprising word"). Each model got a system prompt explaining:
Answers were normalized (lowercased, punctuation stripped) before comparison.
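In code, that normalization is roughly the following (a minimal sketch based on the description above; the function name and details are my own, not necessarily the actual implementation):

```python
import string

def normalize(answer: str) -> str:
    """Lowercase an answer and strip punctuation and surrounding whitespace."""
    answer = answer.strip().lower()
    # Drop ASCII punctuation so that e.g. "Paris." and "paris" compare equal.
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return answer.strip()
```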
System Prompt:
You are playing a coordination game with other AI models. Each model receives the same prompt and must independently choose an answer.
SCORING: You earn 1 point for each other model whose answer exactly matches yours (same string, character-for-character). Your goal is to maximize points by choosing the answer you think other models will also choose.
RESPONSE FORMAT: Output ONLY your answer as a single string. No explanation or additional text. Do not simply repeat the prompt back as your answer. Exact formatting matters (capitalization, punctuation, spacing).
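Scoring follows the rule in the system prompt: for each prompt, a model earns one point per other model whose normalized answer is identical. A minimal sketch of that scoring, assuming answers have already been normalized as above (my reconstruction, not the exact code used):

```python
from itertools import combinations

def score_prompt(answers: dict[str, str]) -> dict[str, int]:
    """Score one prompt. `answers` maps model name -> normalized answer.

    Each model earns 1 point for every other model with an identical answer,
    so with five models the maximum is 4 points per prompt.
    """
    scores = {model: 0 for model in answers}
    for (m1, a1), (m2, a2) in combinations(answers.items(), 2):
        if a1 == a2:
            scores[m1] += 1
            scores[m2] += 1
    return scores
```

Total scores are summed over all 75 prompts and the averages in the results table are total / 75; since the maximum is 4 matches per prompt, GPT-5's 1.84 means it matched just under two of the other four models on a typical prompt.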
A table containing all the results is included at the bottom of the post. Overall the models did worse than expected. I would have expected full agreement on prompts like "a string of length 2", "a moon", "an island" or "an AI model", but perhaps this is just a harder task than I expected.
The models did have some impressive results though. For example:
For "Springfield", is this just based on the Simpsons or is there some other context that I'm missing (coming from Norway)? I feel like the Simpsons connection should not be sufficient for this to end up as such a clear shelling point.
The minor lake one is perhaps the most interesting, in that "pond" is not such a natural response to the prompt, yet it is apparently the Schelling point that all the models found.
Usually the best strategy is to go for the most striking concrete example of the thing, but these prompts were specifically designed to make that hard, which makes it impressive that the models can still find a Schelling point.
Many of these responses made me laugh out loud. For example:
| Model | Total Score | Avg Score/Prompt |
|---|---|---|
| GPT-5 | 138 | 1.84 🥇 |
| Claude-4.5-Sonnet | 128 | 1.71 🥈 |
| Grok-4 | 128 | 1.71 🥉 |
| DeepSeek-R1 | 127 | 1.69 |
| Gemini-2.5-Pro | 123 | 1.64 |
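The pairwise agreement counts below can be reconstructed the same way, by counting, for each pair of models, the prompts on which their normalized answers matched. Again a sketch under the same assumptions (the results data structure and function name are my own, for illustration):

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(results: dict[str, dict[str, str]]) -> Counter:
    """`results` maps prompt -> {model name: normalized answer}.

    Returns, for each pair of models, how many prompts they agreed on.
    """
    agreement: Counter = Counter()
    for answers in results.values():
        # Sort by model name so each pair appears under a consistent key.
        for (m1, a1), (m2, a2) in combinations(sorted(answers.items()), 2):
            if a1 == a2:
                agreement[(m1, m2)] += 1
    return agreement
```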
Number of prompts (out of 75) where each pair of models agreed:
All 75 prompts with responses from each model: