I like that you're thinking of ways to test your hypothesis and construct a deep thinking benchmark. How about this lite version: You come up with 3 different simple novel board games. You then run an experiment where a bunch of LLMs play a tournament against each other, with the rules to the new board game in the context. You also have some humans participate (who, like the LLMs, haven't heard of the game before). Then you calculate Elo scores and see who is better. Do this for all 3 games and average the results. Repeat every year with the same 3 games (and fresh humans) and see if there's a trend of solid improvement.
I think this benchmark would be somewhat meaningful to me, in that I currently expect that GPT3 basically can't do the task at all, GPT4 also basically can't, but GPT5 with reasoning should be able to, idk, mostly make legal moves and shamble towards victory (but not as well as the average human). But if instead all models are approximately equally terrible, that would be an update towards longer timelines for me, ditto if GPT4 is almost as good as GPT5.
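For concreteness, the scoring could just be the standard Elo update after each game. A minimal sketch; the K-factor, starting ratings, and player labels below are arbitrary placeholders, not part of the proposal:

```python
# Standard Elo update, sketched for scoring the proposed tournament.
def expected_score(r_a, r_b):
    """Expected score for player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, score_a, k=32):
    """score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical participants, all starting at the same rating.
ratings = {"gpt4": 1000.0, "gpt5": 1000.0, "human_1": 1000.0}
# After each game in the round robin, e.g. human_1 beats gpt5:
ratings["gpt5"], ratings["human_1"] = update_elo(ratings["gpt5"], ratings["human_1"], 0.0)
```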
I mostly agree, but instead of having LLMs play board games, which arguably selects for war-like or politics-like capabilities, one could also use an ARC-AGI-3-like benchmark where agents solve puzzles. Or group-theoretic or combinatorial problems in text form[1], though I don't understand how to tell the simple ones apart from the complex ones.
An example of a group theoretic problem
Consider the array [1,...,n] and the permutations A, which reverses the entire array; B, which reverses the first n-2 elements, leaving the last two fixed; and C, which reverses the first n-3 elements, leaving the last three fixed. Does there exist a nontrivial sequence of these operations, independent of n, in which no operation is applied twice in a row, that returns the array to its starting order for every n?
Hint 1
The compositions AB and BC are like cycles.
Hint 2
If one permutation moves nearly everything two places to the right, while another moves nearly everything one place to the right, how can we combine them to obtain a permutation that moves only a few elements?
Hint 3
A permutation that moves only a few elements around can be turned into the identity by applying it enough times.
Answer
(AB(CB)^2)^4.
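A quick brute-force sanity check of this answer (a sketch in Python; it assumes the word is read left to right, i.e. A is applied to the array first, then B, and so on):

```python
def reverse_prefix(arr, k):
    """Reverse the first k elements of the list."""
    return arr[:k][::-1] + arr[k:]

def apply_word(word, n):
    """Apply a word in A, B, C (left to right) to the array [1, ..., n]."""
    arr = list(range(1, n + 1))
    prefix_len = {"A": n, "B": n - 2, "C": n - 3}
    for letter in word:
        arr = reverse_prefix(arr, prefix_len[letter])
    return arr

word = "ABCBCB" * 4  # (AB(CB)^2)^4, with no operation repeated twice in a row
for n in range(5, 41):
    assert apply_word(word, n) == list(range(1, n + 1))
print("returns to the identity for every tested n")
```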
For example, Grok 4 (which I, unfortunately, asked in Russian) ran a BFS and, of course, didn't find anything noteworthy. I hope that benchmarks like this won't end up being contaminated and will instead be reliable indicators...
Which LLMs, being trained on text, could find easier than the basic physical tasks that LLMs were still failing at as of April 14, before the release of o3.
(Crossposted from my Substack: https://taylorgordonlunt.substack.com/p/shallow-vs-deep-thinking-why-llms)
(This is a follow-up to my previous post outlining my AI 2027 predictions)
The history of “AI” is littered with tech that didn’t pan out and plateaued early in its development process. The history of tech, specifically, is one of constant mismanagement and extremely dodgy research. Promises of a glorious future abound, but those futures never pan out as promised.
--- Baldur Bjarnason
There is a problem you want to solve. Let's say the problem can be solved in polynomial time. That is, the number of steps needed to solve it grows manageably as the problem gets bigger. Hooray! That means it's easy to get a computer to solve the problem for you in a step-by-step fashion. But some problems are harder[1], requiring exponential time to solve. You've heard the phrase "exponential growth". That's what you don't want happening to the number of steps it's gonna take you to solve your problem.
Unfortunately, most of the problems we really want to solve can't be solved in polynomial time.[2] Curing cancer, writing bestselling novels, improving computer chips, researching better AI, writing good laws --- you can't solve these problems in polynomial time. If you could, then as Scott Aaronson says, "the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it’s found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffett." Clearly, that's not the world we inhabit. It seems to be much harder to solve a problem than to verify the solution after you've solved it. That means most valuable problems are not polynomial in nature.
Solving some exponential problems is brutally difficult. There are at least 10^120 possible chess games. Nobody knows the exact number, but it's a lot. You can't just memorize them all and choose the best path to victory. 10^120 is a way bigger number than the total number of atoms in the observable universe. There are way too many games to memorize even a tiny fragment of them.
Yet somehow people are still pretty good at chess. This is because, while chess is exponential, most of the moves are obviously terrible. If you remove obvious blunders, then the number of chess games is closer to 10^40. (That's not a third of the games; it's essentially 0% of them: 10^40 is only a 10^-80 fraction of 10^120.) Almost all chess moves obviously suck. Once you know the basic rules and a few basic heuristics, you can avoid almost all of the complexity of chess. Human chess play takes place in the tiny (still large) space that remains.
Chess is a very hard exponential problem, but by learning some heuristics and patterns, you can make it much, much easier. The more you learn, the better you can do. Grandmasters spend thousands of hours memorizing openings and endgames, and thousands more absorbing abstract patterns while they play. Spotting a "fork" (a move where one piece can threaten two other pieces at once) is hard for a newbie, but trivial for an intermediate player. As you learn chess, you're not "solving" chess. That's impossible. You're just ruling out branches of play you guess aren't good for you, narrowing down the complexity to something you can hopefully manage. Perfect play is out of the question.
Individual humans don't just learn patterns and heuristics as they play; humanity does as a whole. Over time, strategies develop in competitive play. Almost nobody at the top level was playing the Berlin Defense before Vladimir Kramnik used it against Garry Kasparov in the Classical World Chess Championship 2000. After that, everyone was using it. Chess players today are better than chess players of the past, because they don't have to rely on only their own individual insights to play good chess. They get to leverage all the high-value insights discovered over centuries of play, making modern players far better than historical players ever could have been.
Same goes for sports. For many years, people thought calf kicks were pointless in MMA. Then suddenly, they were a standard part of everyone's strategy. Everyone benefited from the insight of a small number of fighters who started using them. In this way, human progress is not an activity that takes place in one person's mind, but one distributed across many minds. Human beings parallelize the search through exponential problem spaces.
But even this isn't enough. We want to solve problems right now! We want a cure for cancer right now! So we hope to make searching through problem spaces more efficient by using machines.
AlphaGo Zero learned to play Go at a superhuman level by rapidly playing games against itself, learning strategies without having access to any pre-existing human strategies. It played millions of games against itself in the three days of training time it took to reach a high level. This cost millions of dollars because of the insane amount of compute spent to train it. But now that the AI is trained, it can beat humans in a game of Go using much less compute than it took to train it in the first place.
It's much cheaper to play an advanced game of Go than it is to learn to play high-level Go from scratch, because when learning Go you have to uncover novel heuristics and patterns that help you navigate a gargantuan exponential solution space. But when you play, most of your competitiveness comes from leveraging these "critical insights" you have already discovered, with a comparatively small amount of compute being required.
I call the kind of expensive thinking leading to the generation of new "critical insights" deep thinking, and the simple utilization of existing insights shallow thinking. Any system limited in its ability to do deep thinking will struggle to generate new critical insights, and thereby struggle to advance research, write incredible TV shows, cure aging, and so on. I claim that Large Language Models (LLMs) used by major AI companies today are basically incapable of deep thinking because of their architecture.
This isn't quite the same distinction as System 1 vs. System 2 thinking. I saw an AstralCodexTen comment today suggesting LLMs can only do System 1 (automatic, fast, unconscious) reasoning, and they will never achieve superintelligence until we allow them to do System 2 (deliberate, slow, conscious) thinking.
In response, Scott Alexander points out that System 2 (slow) thinking is basically just a specialized kind of System 1 (fast) thinking. I agree. This is not the limitation LLMs have. LLMs can do both System 1 thinking and System 2 thinking, as long as they're both shallow thinking. Clearly LLMs can blurt out an automatic response, and clearly they can follow step-by-step chains of reasoning, especially with modern advancements like chain of thought. What they can't do very well at all is deep thinking.
It's not easy to tell that LLMs can't do deep thinking. I think I have personal intuition about their limitations from working with them in deep, real-world contexts while trying to use them to augment my own deep exploration of an exponential idea space. In doing so, I have found LLMs to be fundamentally useless every time. They can bring me new knowledge, which is handy for coming up with my own insights, but they don't seem to be making new insights on their own.
But like I said, it's not easy to tell whether LLMs can do deep thinking, mostly because when an LLM solves a problem, there's no easy way to tell from the outside whether or not a problem required deep thinking to solve. Let's say I have a problem I want to solve. Will it require me to do some deep thinking?
This is part of what makes most existing AI benchmarks useless for testing deep thinking. Many of the problems tested by AI benchmarks are clearly simple/polynomial problems like arithmetic. Those that are not, such as those that test the AI's decision making ability, could be testing deep thinking, or they could just be testing the LLM's ability to use existing critical insights learned from humans.
It would be nice if we had a benchmark that could unambiguously test deep thinking. When generating questions for such a benchmark, we would need to make sure they genuinely require new critical insights to solve, and that the model couldn't have picked up those insights (or the solutions themselves) from its training data.
You can see why existing benchmarks fail. Any pre-written list of questions fails these criteria, because a new LLM can just memorize the solutions, or at least some critical insights, for the benchmark questions. And there's also no way to really know what information the model already has. Even if you make up a problem that sounds pretty unique, it may turn out to be structurally similar to a problem somebody solved 20 years ago on the Turkish internet, and the LLM just happened to have that in its training data. How can we make sure the LLM has to do its deep thinking from scratch, the way AlphaGo Zero had to learn Go from scratch?
I'm not sure, but one idea is to randomly invent new exponential games (like chess or Go). Create a game with an exponential number of states (2^n), where the goal is for both players to reach a single "win state". Players take turns, choosing one of two moves (call them 0 and 1 if you want to). You would randomly generate a "small" ruleset mapping the current state and chosen move to the next state (such as with a polynomial-size circuit?). The goal for the LLM --- the benchmark question --- is to decide whether player 1 has a winning strategy to reach the "win state" within T steps, from a given starting state.
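To make this concrete, here is a minimal sketch of what generating and ground-truthing such a game might look like. The construction (a couple of random bit operations per move) and the function names are my own illustrative choices, and I'm reading "winning strategy" adversarially, with player 2 trying to keep the game away from the win state:

```python
import random
from functools import lru_cache

def make_random_ruleset(n, seed):
    """Randomly generate a compact two-move ruleset over 2**n states."""
    rng = random.Random(seed)
    mask = (1 << n) - 1
    # One (xor mask, additive constant, rotation) triple per move; the rule
    # description stays tiny even though the state space is exponential.
    params = [(rng.randrange(1 << n), rng.randrange(1 << n), rng.randrange(1, n))
              for _ in range(2)]

    def step(state, move):
        xor_mask, add_const, rot = params[move]
        s = (state ^ xor_mask) & mask
        s = (s + add_const) & mask
        return ((s << rot) | (s >> (n - rot))) & mask  # rotate left by `rot`

    return step

def player_one_can_force_win(step, start, max_plies, win_state=0):
    """Ground truth by brute force: can player 1 force the game into
    `win_state` within `max_plies`, assuming player 2 tries to prevent it?"""
    @lru_cache(maxsize=None)
    def solve(state, plies_left, p1_to_move):
        if state == win_state:
            return True
        if plies_left == 0:
            return False
        outcomes = (solve(step(state, m), plies_left - 1, not p1_to_move)
                    for m in (0, 1))
        return any(outcomes) if p1_to_move else all(outcomes)

    return solve(start, max_plies, True)

if __name__ == "__main__":
    n, max_plies = 12, 10                  # small enough to solve exactly
    step = make_random_ruleset(n, seed=0)
    start = random.Random(1).randrange(1 << n)
    print(player_one_can_force_win(step, start, max_plies))
```

At benchmark scale you'd crank up n and the depth so brute force stops being feasible, which is exactly where heuristics discovered on the fly would have to carry the load.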
This is effectively like creating a new game like chess or Go, wherein knowledge of chess, Go, or other similar games will give you no advantage. Even a deep knowledge of one such random game will give you no advantage on another such random game. Therefore, I think you could use this to test an AI model's ability to do deep thinking. I believe AlphaGo Zero could be trained on such a random game instead, and would dominate for pretty large values of 'n'.
You do have to make sure the game is large enough in terms of its state/ruleset to not be trivial, and you have to make sure the ruleset has some structure that can be exploited. A truly random ruleset just makes the exponential space fully random, meaning there are no heuristics/patterns that can be learned in one part of the space to help with another part of the space. Imagine if every single board state of chess had its own set of rules for what the pieces could do. That would not be a learnable game. The only way to play would be randomly. But assuming we create a game with some structure like chess or Go, then AlphaGo Zero could learn it.
You could also try running an LLM on such a game to see if LLMs are capable of deep thinking, but it's so obvious to me that the LLM would fail there's no point running the experiment. AlphaGo Zero has the advantage that it's a narrow model, so we can afford to train it extensively on a single game. We don't care if it can play chess or Magic: the Gathering or solve any other problem besides the one it's being trained on. We spend a bunch of money training it to play our randomly-generated game, and then it can only play that game. With LLMs, we're demanding something further. We want artificial general intelligence. That is, AI that can solve tough problems without having to be specifically trained to solve that exact problem. Ideally, the AI can see the new problem and then do the same kind of deep thinking that AlphaGo Zero does when playing Go against itself and extracting new critical insights. The AI mines exponential space for critical insights one by one until it has narrowed the exponential space enough to find a solution.
LLMs don't do this. It's pretty clear they're not doing this. If they know how to play chess, it's because they copied a bunch of chess-playing insights from humans in the training data. Getting good at chess from scratch would require orders of magnitude more compute than LLMs currently have, and also seems like something the architecture of LLMs isn't conducive to. For example, how would an LLM simulate games against itself, or do anything cognitively similar to that?
An LLM with chain of thought can store textual information in an external scratchpad between runs. If we keep running the LLM, letting it write new text to the scratchpad each time, then in theory, the LLM could learn to play chess from scratch. It could simulate a bunch of chess games by writing out each move one by one in the scratchpad. After each game, it could write down whatever insights it had learned. Eventually, after an absurd amount of time and money has been burned, the LLM might be able to play chess well (until the chain of thought text is no longer available.) This is all assuming the LLM is large and robust enough to handle the monstrous chain of thought that would result from this process. Any problem is in theory solvable this way[3], but the architecture doesn't seem designed for it. Allowing for in-model recurrence or improving the external scratchpad to use non-textual "neuralese" information would help, but these are just optimizations to a flawed process.
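As a rough sketch of the loop I'm describing (query_llm below is a hypothetical placeholder for whatever interface serves the model, not a real API):

```python
def query_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for an actual LLM call")

def self_play_with_scratchpad(num_games: int) -> str:
    """Naive scratchpad loop: play, then distill insights into text."""
    scratchpad = "Known insights: (none yet)\n"
    for game in range(num_games):
        # 1. Ask the model to play a full game against itself, move by move,
        #    conditioning on everything it has written so far.
        transcript = query_llm(
            "Your scratchpad so far:\n" + scratchpad +
            "\nPlay one complete game of chess against yourself, "
            "listing every move."
        )
        # 2. Ask it to distill whatever it learned into durable text.
        lessons = query_llm(
            "Scratchpad:\n" + scratchpad +
            "\nGame transcript:\n" + transcript +
            "\nWrite down any new heuristics or insights from this game."
        )
        scratchpad += f"\n--- After game {game + 1} ---\n{lessons}"
    return scratchpad
```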
Residual neural networks like AlphaGo Zero and LLMs are multi-layer neural networks where low-level features are encoded in early layers (like how a pawn moves) and high-level features are encoded in later layers (like a specific strategy for using knights in a game with fianchettoed bishops). This is an architecture for detecting known patterns in a chess game so you can leverage them to choose the next move. This is not an architecture for discovering new critical insights. That's why sometimes you hit a wall when using LLMs to solve problems. Once you're outside the training data, LLMs can't help you.
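To be concrete about "residual": each block just adds a learned correction onto its input, so later blocks refine whatever features earlier blocks have already computed. A toy sketch (my illustration, not AlphaGo Zero's or any LLM's actual code):

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + W2 @ relu(W1 @ x): detect a pattern, add the result back in."""
    hidden = np.maximum(0.0, w1 @ x)   # feature detection
    return x + w2 @ hidden             # residual (skip) connection

rng = np.random.default_rng(0)
d = 8                                  # feature dimension (arbitrary)
x = rng.normal(size=d)                 # the "board state" as a feature vector
for _ in range(4):                     # early blocks: low-level features;
    w1 = 0.1 * rng.normal(size=(d, d)) # later blocks: higher-level ones
    w2 = 0.1 * rng.normal(size=(d, d))
    x = residual_block(x, w1, w2)
```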
The architecture that can generate new critical insights is not AlphaGo Zero, but the reinforcement learning algorithm that trained AlphaGo Zero. If we want AI that can do deep thinking on the fly, we need AI that can run like how AlphaGo Zero was trained. "Online learning", and also way more compute. "ResNets" like AlphaGo Zero and LLMs are designed to store existing features learned at training time, not to generate new features on the fly.
If we want a new AI that is capable of deep thinking, that can solve all the problems we want AI to solve, we'll need a new architecture, and we'll need a lot more compute. Running AlphaGo Zero on a game of Go is 10-100 billion times less compute-expensive than training it. We should expect running a deep-thinking AI to be computationally more similar to training modern AI than to running modern AI. This is what we have to do to solve big, exponential problems.
What's worse: It's easy to explore an exponential space you can rapidly simulate. If I can play millions of chess games against myself in software, I can rapidly extract new critical insights until I'm a chess God within a few days. But not everything is as easy to simulate as chess. Physical systems like the human body or the economy, where the rules of the game aren't even fully known, would have a much longer feedback loop for doing deep thinking. An AI that has to build a large hadron collider to make progress won't learn physics as fast as AlphaGo Zero learned Go.
LLMs have made a bunch of progress on shallow thinking in a way that's changing the world, but they have made very little progress on deep thinking. Because of this, we cannot extrapolate from past (shallow thinking) progress and expect new (deep thinking) progress to emerge. Creating superintelligent AI is itself a very hard, exponential problem. We should not expect it to be solvable as a matter of course. It will involve new critical insights, which are obtained essentially at random.
Some genius human researchers seem better at making breakthroughs than other researchers. Why is this possible? I think there are two reasons:
There's not much more you can do than that. Discovering critical insights from scratch is a painful, blind exploration of exponential space. A better researcher may have more existing insight that allows them to prune more of the exponential space, but the remaining paths in exponential space must be explored at random. There's no telling how hard the problem might be. We might discover one "trick" that renders the whole problem much simpler, or we may flounder for centuries until hardware advances make the problem basically impossible not to solve.
Having more/better researchers will certainly help, but nobody in the year 1900 could have invented a superintelligent AI. The hardware didn't exist. You would have had to first hurry up and invent 21st-century computers in order to then invent superintelligent AI. I think a similar thing is true today. We don't really have the hardware to do the amount of computation needed to run general superintelligent AI on deep thinking problems, as discussed above. GPT-4.5 was still, as a very rough guess, at least a million times less powerful than what we'd need to do deep thinking as described above, and we'd also need the algorithmic breakthroughs that let us do it on the fly for arbitrary problems, rather than at training time (for AlphaGo Zero) or basically not at all (LLMs).
How long would it take us to multiply the compute available to AIs by 1,000,000? I have no idea. It won't happen in the next ten years, I predict. If we extrapolate the aggressive 2.25x/year compute estimate from the AI 2027 Compute Forecast, then it'll be about 17 years. But predicting the future is exponential-hard and requires deep thinking. I am bottlenecked by my humanity.[4]
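(As a sanity check on that 17-year figure, using just the two numbers above:)

```python
import math
# Years needed to grow compute by 1,000,000x at 2.25x per year.
print(math.log(1_000_000) / math.log(2.25))  # ~17.0
```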
P.S. This is my second AI post in a row, but I'm probably not going to keep posting about AI over and over.
This assumes the famous computer science problem of P vs. NP will resolve as "P is not equal to NP", which it probably will. I'm also simplifying by calling all non-polynomial problems exponential, which isn't strictly true, but it's fine for this discussion. ↩︎
This is an oversimplification of computer science complexity theory and a bit of an abuse of terminology, but I assume you'll know what I mean. ↩︎
This comment by Daniel Kokotajlo on my last post explains something similar. ↩︎
The latest AstralCodexTen post by Scott Alexander parodies people who say LLMs are limited by pointing out how limited human thinking can often be. For example, in response to the question "Which weighs more, a pound of feathers or a pound of bricks?", many humans will quickly reply, "the bricks." This is clearly wrong, and an example of how shallow, System 1 (fast) matching on existing patterns is not a reliable strategy for solving arbitrary problems, whether you're an LLM or a human. However, this post does not address the distinction I've made here between shallow and deep thinking. Whereas an LLM fails on deep thinking problems because it's always stuck shallow thinking, humans sometimes blunder on problems when they underestimate the amount of thinking needed to solve them, but they can switch into a higher gear if needed (either to a more deliberate System 2 (slow) thinking mode, which LLMs can kind of also do, or to a deep thinking mode, which they cannot). We also fail on very deep problems for a long time before we succeed, but that's just the nature of huge, exponential problem spaces. ↩︎