There is a small bug in the web version: if you join two fish to create a shark, the squares with the original positions of the fish remain selected, so it is not obvious whose turn it is. I think the game froze the first time I did it (neither side could move). I undid the move and did it again; the two squares still remained selected, but this time the other side could move, and afterwards the squares deselected and the game continued normally.
Sorry about that. I definitely hacked the web interface together quickly for this experiment. I'm aware of a couple other minor bugs, which I'll fix at some point.
It seems to me like the LLMs are indeed improving on complex games, which goes against your hypothesis?
Playing complex games requires both shallow thinking and deep thinking. And of course, the more brute-forcing you can do, the better you'll do at any game.
The hypothesis was that they wouldn't have improved as much on complex games as on simple, more brute-forceable games, which was mildly supported by the data.
Navigating LLMs’ spiky intelligence profile is a constant source of delight; in any given area, it seems like almost a random draw whether they will be completely transformative or totally useless.
- Scott Alexander (Links for September 2025)
If you extrapolate the curves that we’ve had so far, right? If you say, well, I don’t know, we’re starting to get to like PhD level, and last year we were at undergraduate level, and the year before we were at like the level of a high school student… If you just kind of like eyeball the rate at which these capabilities are increasing, it does make you think that we’ll get there by 2026 or 2027.
- Anthropic CEO Dario Amodei on AGI
Large language models are about mimicking people… What we want is a machine that can learn from experience, where experience is the things that actually happen in your life. You do things, you see what happens, and that’s what you learn from.
- Richard Sutton, on Dwarkesh Patel’s podcast
Many people have made predictions, such as AI 2027, that AI will become superhuman very soon. I disagree. (Dario Amodei and the AI 2027 authors express low confidence in these statements, to be fair.)
I made a post explaining why I thought AI 2027 was wrong with my own predictions for AI in 2027.
As a follow-up, I made another post explaining that LLMs have made a lot of progress recently in shallow thinking, but not deep thinking. This is why it sometimes seems hard to predict which tasks LLMs will succeed or fail at: some tasks require only shallow thinking, but other tasks require deep thinking. Deep thinking is the kind of computationally expensive thinking required to deeply explore vast idea spaces, form new connections, and uncover new discoveries. Shallow thinking, on the other hand, is the straightforward processing of information using existing insights/heuristics. Basically, memory + simple processing/arithmetic. If LLMs can't do deep thinking well, this means their "progress" in recent years has been dramatically overstated, and predictions for future progress are likely far too aggressive, barring breakthroughs in LLM architecture.
Learning a very complex game like chess requires deep thinking. There are more possible chess games than atoms in the universe. You cannot simply calculate your way to victory using shallow thinking. However, all games require some amount of shallow thinking too. When you play chess, you may discover heuristics to help you identify promising moves, but you must still calculate several moves ahead to properly evaluate each option according to your heuristics.
The existing AI benchmarks mostly test shallow thinking. So I decided to run a pilot experiment to see whether LLMs really are lagging behind in deep thinking, and whether deep thinking can be benchmarked at all. I created a website that lets humans play board games against AI (and AIs play against each other), wrangled up some friends and family to play, and recorded the results. This is a small experiment that can't prove anything definitively, so treat it as a pilot.
(This was also posted to my Substack: https://taylorgordonlunt.substack.com/p/llms-suck-at-deep-thinking-part-3)
Six round-robin tournaments between 6 players, for a total of 90 games played.
The players:
The four AI models were chosen to represent successive generations of OpenAI models. I would have included GPT-3, but it isn't available on OpenRouter, and I suspect it would simply be too bad at shallow thinking to pass the minimum threshold needed to understand the rules and make consistently legal moves. Whether by scaling up, changing architectures, or both, these four successively better models represent what OpenAI believed to be the best it could do with LLMs at a given point in time.
The games:
I took three classic games (Chess, Go, and Connect 4) and created three new ones (Fantastical Chess, Shove, and Hugs and Kisses). The novel games were designed to be similar to the classic games, but different enough that someone good at one wouldn't automatically be good at its counterpart. From playing the games against myself, they seem similarly complex to the classics.
I estimate that Chess/Fantastical chess are the most complex (branching factor ~35), Go/Shove are in the middle somewhere, and Connect 4/Hugs and Kisses are the simplest (branching factor less than 7). (Go on a 19x19 board has a huge branching factor, but this is only a 9x9 board.)
In each tournament, each player plays each other player once, for a total of 15 games per tournament, or 90 games total. Players who attempt illegal moves more than 20 times in a row will forfeit the match. LLM parameters will be chosen such that they don’t take too long to respond (< 10 minutes) and human players will be required to choose moves in less than 5 minutes.
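As a quick sanity check on those numbers: a round robin between 6 players is 15 pairings, and six such tournaments makes 90 games. A throwaway sketch (the player names are just the ones used in this post):

```python
# Sanity check on the tournament arithmetic: 6 players, round robin.
from itertools import combinations

players = ["Random", "GPT-3.5-Turbo", "GPT-4o", "GPT-4.1", "GPT-5", "Human"]
pairings = list(combinations(players, 2))  # each player meets each other once
print(len(pairings))      # 15 games per tournament
print(len(pairings) * 6)  # 90 games across all six tournaments
```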
A digital interface will be provided to humans to play the board games. LLMs are given a prompt on each turn with:
Players’ raw scores and Elo-like (Bradley-Terry) scores will be computed.
The digital interface is available online, and I encourage you to check it out. If you want to play against AI, you’ll need an OpenRouter API key, but you can play against yourself or the Random player.
Prediction 1: There will be less of a spread in performance between AI models on complex games than on simple games. Justification: I believe AI models have improved significantly at shallow thinking, but not much at deep thinking. Complex games reward deep thinking more than simple games do, though both kinds of games require shallow thinking ability as well. That means later models will do better at all games, but will improve more on simple games.
Prediction 2: AI models will do worse relative to humans on complex games than on simple games. Justification: Humans are capable of deep thinking, and LLMs aren't, really. That gives humans an advantage on complex games, which are too computationally expensive to brute-force with shallow thinking and require deep thinking as a result.
Prediction 3: AI models will do worse on novel games than on classic games. Justification: AI models may have strategies memorized from their training data that apply to the classic games. This is not possible for the novel games, which were invented for this experiment.
| Game | Random | GPT-3.5-Turbo | GPT-4o | GPT-4.1 | GPT-5 | Humans |
|---|---|---|---|---|---|---|
| Chess | 1 | 1.5 | 2 | 1.5 | 4 | 5 |
| Fantastical Chess | 1 | 1 | 2 | 2.5 | 4 | 4.5 |
| Go | 2 | 0 | 2 | 2 | 4 | 5 |
| Shove | 2 | 0 | 1 | 3 | 5 | 4 |
| 4 in a Row | 0 | 2 | 3 | 1 | 5 | 4 |
| Hugs and Kisses | 2 | 1 | 2 | 4 | 5 | 1 |
| All Games | 8 | 5.5 | 12 | 14 | 27 | 23.5 |
| Win Rate | 8.89% | 6.11% | 13.33% | 15.56% | 30.00% | 26.11% |
Table 1: Raw scores (a win against another player is 1 point, a draw is 0.5 points). The win rate row is each player's points as a share of all 90 points awarded.
| Game | Random | GPT-3.5-Turbo | GPT-4o | GPT-4.1 | GPT-5 | Humans |
|---|---|---|---|---|---|---|
| Chess | 789 | 990 | 1186 | 990 | 2222 | 2823 |
| Fantastical Chess | 920 | 920 | 1307 | 1491 | 2074 | 2288 |
| Go | 1297 | -271 | 1297 | 1297 | 2389 | 2992 |
| Shove | 1279 | -776 | 376 | 2050 | 3339 | 2732 |
| 4 in a Row | -776 | 1279 | 2050 | 376 | 3339 | 2732 |
| Hugs and Kisses | 1186 | 775 | 1186 | 2238 | 2840 | 775 |
| Simple Games | 896 | 1083 | 1423 | 1423 | 2753 | 1423 |
| Moderate Games | 1274 | -631 | 992 | 1549 | 2908 | 2908 |
| Complex Games | 841 | 944 | 1235 | 1235 | 2155 | 2591 |
| Classic Games | 637 | 710 | 1223 | 852 | 2675 | 2904 |
| Novel Games | 1176 | 750 | 1176 | 1733 | 2431 | 1733 |
| All Games | 1032 | 860 | 1283 | 1407 | 2360 | 2058 |

Table 2: Elo-like scores. I computed Bradley-Terry scores using the choix Python library (regularization factor 0.01), then converted them to Elo-like scores via `1500 + 400 * (Bradley-Terry score)`. The Elo scores from one row are not comparable to the scores in another row.
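For the curious, the scoring step looks roughly like this. A minimal sketch, assuming choix's `ilsr_pairwise` solver and made-up game outcomes rather than my actual tournament data:

```python
# Minimal sketch: Bradley-Terry scores via choix, converted to Elo-like scores.
import choix

players = ["Random", "GPT-3.5-Turbo", "GPT-4o", "GPT-4.1", "GPT-5", "Humans"]

# Each pair is (winner_index, loser_index). These outcomes are invented for
# illustration; a draw could be counted as half a win for each side.
outcomes = [(4, 0), (4, 1), (4, 2), (5, 0), (5, 1), (3, 1), (2, 0)]

# Bradley-Terry strengths, with a small regularization factor (alpha=0.01)
# so the solver behaves even when someone wins or loses every game.
strengths = choix.ilsr_pairwise(len(players), outcomes, alpha=0.01)

# Elo-like conversion: 1500 + 400 * (Bradley-Terry score).
for name, s in zip(players, strengths):
    print(f"{name}: {1500 + 400 * s:.0f}")
```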
In general, successive AI models did better than previous models on this benchmark, with GPT-3.5-Turbo doing so badly it couldn’t beat the Random player, and GPT-5 having a higher total score and All-Games Elo-like score than humans. This is really impressive.
When comparing simple vs. complex games, the spread of performance between AI models was somewhat smaller on complex games than on simple games.
Humans totally failed to outperform GPT-5 on simple games, but beat GPT-5 (and the rest of the models) on complex games. The AI models just weren’t as impressive on the more complex games.
Surprisingly, AI models did not have any particular advantage on classic games vs. novel games. Also surprisingly, Elo scores can be negative.
This experiment gives suggestive evidence in support of Prediction 1. The difference in performance between AI models was somewhat smaller on complex games, implying the improvements between AI models have been in the kinds of intelligence that make you do well at simple games (shallow thinking), not in the kind of intelligence that makes you do well at complex games (shallow thinking and deep thinking). That said, the difference isn’t huge in absolute terms, and the Future Directions section will discuss possible improvements to the study.
Prediction 2 is clearly supported by the experiment. The AI models were very impressive on simple games, and the matches between humans and AI on those games were fierce. On complex games, however, not only did they play poorly, but watching them play was a comedy of errors. My participants generally found the AI’s moves on complex games humorous.
For example, in one game GPT-4.1 lost its queen to the enemy king when the enemy king was the only piece left on the board. The AI models, even GPT-5, constantly made terrible moves, sacrificing pieces for no reason, missing obvious one-turn victories, or losing to very simple strategies. One of my participants found it funny that people have compared GPT-5 to a PhD-level researcher and said, "I believe that. I believe it's smarter than a PhD researcher. But it's not smarter than the average person." Sure, if you memorize a bunch of knowledge that PhDs know, in a sense that makes you a PhD. And if you can do shallow thinking to perform basic manipulations of those existing ideas, you can even solve tricky-sounding exam problems. You can do everything PhDs can do — except the actual research, since that alone requires deep thinking ability! My participant was just making a joke, but it points to something real. Academics live in an artificial environment where they learn from textbooks and papers and write exams. Regular folk live in the real world, which is messy and complex, where it's easier to fake it but harder to make it.
You would expect a more intelligent entity to dominate at all kinds of board games, simple or complex. No matter the game, the smarter player should win, right? (Unless the game is so simple that both players play near-optimally and every match ends the same predetermined way: a win for player 1, a win for player 2, or a draw.) But that's not what we saw here. The later-generation LLMs dominated at simple games, but humans dominated at more complex games. This implies there's something more going on here than simple general intelligence. LLMs think quickly and methodically, but not deeply. Humans are slow (not to mention impatient), but they can do deep thinking. That explains the difference. On computationally simpler games, shallow thinking is more valuable, so the LLMs win, especially later-generation LLMs, which are much better than their early predecessors at shallow thinking. But on complex games, the LLMs can't match human deep thinking ability.
Could a superintelligent AI that has never heard of chess before immediately play like a grandmaster, knowing only the rules? The answer is a conditional no. The rules of chess don't tell you which strategies will be optimal, and the game is computationally irreducible. I don't think any amount of genius will tell you in advance which strategies are good, unless you can mentally simulate playing a bunch of games against yourself and notice patterns in those simulated games that let you develop heuristics and refine your search through exponential strategy space, or do something mentally equivalent to that. Humans do this partially as they play. Not only do they derive insights from actual moves played in games, but they think several moves ahead, and this thinking can give them new insights. "Oh! If I take his Wizard, then he can't move his Lame King! So I should take all his Wizards!"
Are modern LLMs capable of this? Do they do “deep thinking”? I claim no. Right now, if you have an LLM play a board game against you, it’s doing shallow, brute-force thinking using the basic “level one” heuristics you can figure out just from the rules of the game, like “try to capture other pieces” or “move your piece away when it’s threatened”. Or, if it remembers some higher-level strategies from its training data, it will use those to narrow its search and evaluate moves. But there’s nothing happening inside the LLM that lets it notice new patterns in strategy space and develop progressively more sophisticated strategies in a single game. At least, not really. As a result, LLMs play complex games poorly, making very basic mistakes that humans quickly learn not to make. This doesn’t apply to games that are simple enough that the LLM can get by just by thinking several moves into the future using brute-force. Like Connect 4. But this is computationally impossible for more complex games like Chess, and on these games, LLMs do worse.
(If you’ve heard of LLMs being good at chess and are confused, that’s coming in a few paragraphs, and more in Appendix A!)
This simple experiment is not hard proof of anything, but it is an indication that LLMs have made huge gains in memory and shallow thinking ability, giving the impression they've been getting smarter without actually getting much better at deep thinking, which will be required for lots of the cool things we want future AI to do.
When it comes to chess, being able to think two moves ahead is better than thinking only one move ahead, which in turn beats being so confused you can't really think ahead at all. But no amount of this kind of brute-force shallow thinking will make you a chess grandmaster. The game is just too computationally complex. With a branching factor of about 35, looking at the immediate implications of one move involves checking around 35 possibilities and picking the best one. But looking at your move, the possible responses of the opponent, and then at your next move involves considering over 42 thousand possibilities. The game is too complex to brute-force. You have to use heuristics to narrow the search. So AI models improving at shallow thinking should give them an advantage at novel board games and other novel tasks, up to a point. Then we should expect LLMs to plateau unless we switch to an architecture that enables deep thinking (something that looks less like current LLM inference, and more like AlphaGo Zero's training).
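To make that growth concrete, here is the arithmetic as a toy snippet (nothing here is from the experiment; it's just exponentiation):

```python
# Game-tree growth at chess's typical branching factor of ~35.
BRANCHING = 35
for depth in range(1, 6):
    print(f"depth {depth}: {BRANCHING ** depth:,} lines to consider")
# depth 1: 35
# depth 3: 42,875 (your move, the opponent's reply, your follow-up)
# depth 5: 52,521,875 -- already hopeless for brute force
```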
If it's true that LLMs haven't made much progress in deep thinking, as this experiment suggests, it would explain the popular sentiment among some software developers that the AI coding dream is basically not real and AI isn't currently very useful. While waiting for the slow responses from GPT-5, one of my participants (a software developer) told me about a non-programmer coworker of his who always expects things to be done quickly, because AI should be able to do it quickly, right? "When I tell my coworker it can't write the code for me, he doesn't believe me. He thinks I just don't know how to use the AI effectively." This seems to be pretty common. People who actually use AI in disciplines like coding, in ways that require deep thinking, have a much more pessimistic view of its capabilities than business folk, or AI alignment people for that matter. Modern AI can look smart when it's spitting up recombined training data, but when it has to do "novel thinking", its brain seems to turn to mush.
Prediction 3 was not supported by the experiment. I had expected the more advanced models to have absorbed some Chess/Go/Connect 4 knowledge from their training data. That's why I created the novel games: so that if the AI models had memorized a bunch of stuff, we would see them succeed at the classic games but fail at the novel games. But that didn't happen. Even GPT-5 just hasn't memorized that much chess info from the training data. I had expected it to play at around 1600 Elo (FIDE), which is a strong intermediate level, and better than me. And clearly it did remember some strategies. It made coherent opening moves, for example. But it didn't remember anything in enough detail to actually capitalize on it. When playing another AI model, GPT-5 clearly knew a knight/queen checkmate was possible, and kept trying, for hours, to pull it off, failing over and over, taking just enough material to prevent an automatic draw. In the end, it won by other means (doing the obvious thing of marching its uncontested pawn a few squares, promoting, and getting a queen/queen checkmate). It turns out GPT-5 just didn't have enough memorized from the training data to make a difference in its play. (None of the human players had much familiarity with these games either, by the way. In chess, for example, I estimate them all to be under 1000 Elo, most of them having never played before. The last time one of the participants, my grandma, played chess was apparently when I taught her the rules as a little kid!)
To conclude, my belief that LLMs just aren't that good at deep thinking has been somewhat reinforced. I think AI models stand to do some really impressive stuff in the coming years, but all that impressive stuff will come from improved shallow thinking ability, not deep thinking ability, and won't involve things like doing novel research or out-competing programmers who have to do novel, deep thinking as part of their job. AI will be taking jobs. Just not real coding jobs. (Programmers are already "losing jobs to AI", but I think that's some weird mix of CEO delusion and the fact that some/many programming jobs weren't even economically productive in the first place.)
Narrow AI models might also do some impressive stuff too, since they actually can do deep thinking at training time, not that it’s cheap. Could you train an AI model on videos of humans lying to make a superhuman lie-detection AI? Maybe, if you had the right data. Getting the data is the hard part, but there’s a lot of money going into getting better data, and shallow LLMs can definitely help you collect more types of data.
Exploring complex idea spaces is random. You need brute force + pattern matching to continually mine for new insights and use them to progressively reduce the search space. AlphaGo Zero did this at training time. The problem is that training AlphaGo Zero is a lot more expensive and time-consuming than running an already-trained LLM. We may not be able to expect that level of deep thinking until we can match that level of compute, and have an architecture that supports doing that online for novel problems. And for some problems, like curing cancer, the AI will be limited by its ability to do real experiments, which will slow things down further.
The humans were at a big, unfair disadvantage in this experiment. Did you notice it?
Deep thinking is slow and takes time. Nobody becomes a grandmaster after a single game. Yet a single game was all the players in this experiment got. For the novel games, it was the first game they'd ever played. For the classic games, a few of the players had a bit of experience, maybe a handful of matches in the past, but none had ever played the games seriously, and all were beginner-level. Despite their lack of familiarity, humans can learn and develop more sophisticated strategies over time, and the LLMs just can't do this. They don't currently have "online learning" at all. This is a huge advantage humans have over LLMs. I tried to eliminate this across-many-games advantage for humans to see if I could measure the difference between deep and shallow thinking within single games. What would have been better is having each participant play each game 100 times and seeing how they improved. But LLMs can't do that, so they would obviously lose. Future experiments will be able to do this once LLMs acquire online learning, but until then, there's no point even running the experiment.
The humans in this experiment clearly learned a lot over the course of the individual games they played. They often blundered early, but made realizations later on that led them to victory. “I kept sending my own guys packin’ like a moron,” said one participant about accidentally suiciding her own pieces in Shove. But eventually she figured out how the game worked, developed a strategy, and made a comeback, winning the game. Many such cases. While the LLMs playing the novel games seemed to follow the same basic, obvious strategies, humans developed interesting strategies like finding a way to move their Alchemist piece into enemy lines with the intention of it blowing up a bunch of enemy pieces.
Future experiments may need new hand-designed games. Eventually the LLMs will drink in the rules of Fantastical Chess, Hugs and Kisses, and Shove at training time, and will be able to cheat at this benchmark. There will be no way to know they’re not cheating without creating new games. But for now it’s not an issue, because they aren’t even cheating that well at Chess, and my made-up games will probably represent a much smaller part of their training data than Chess.
New types of games would be interesting to test on. More complex games like 19x19 Go or Tai Shogi (Japanese “supreme chess”, a game with 177 pieces on a 25x25 grid) would more directly test deep thinking ability, because shallow thinking would be even less computationally feasible than in Chess/Fantastical Chess.
Perhaps the prompt could be adjusted to make the experiment more robust? Improvements to the board game playing UI might also be in order. It mostly worked fine, but at one point one of my participants was panicking and playing moves quite quickly. “He’s playing so fast!” she said, worrying that she had to keep up with the AI. I told her to take her time and that she didn’t have to keep up with “him” (GPT-3.5-Turbo), and she slowed down after that. I did instruct each of the human players to take their time and think deeply about every move, but a UI with a fixed delay for all models might encourage the humans to take their time more. Though all the players were frustrated with the slow responses of GPT-5, so that has to be taken into account as well.
There are more straightforward ways this experiment could be improved. First of all, each player only played each other player once per board game. I was limited by my wallet (for LLMs) and the patience of my human players. More players total and more games would give a more robust result. Second, the players were not drawn randomly from the pool of all humans, but were all people I know. I think they’re a fairly normal group of Canadian humans, with a fairly typical (weak) understanding of board games, but there are probably better ways of getting participants in a more random, representative way.
Anyway, this experiment was a pain in my ass and more effort than I probably want to invest in a single blog post, so I probably won't be doing a follow-up any time soon. Hope you enjoyed!
I was pretty surprised to find even the latest model wasn’t very good at chess. I suspected there would be a difference in performance between complex and simple games, but I still thought the latest models would do quite well on complex games if those complex games were classic games that existed in the training data.
I thought this because I've heard over and over how various AI models have chess Elos between 1500 and 2000, which is intermediate to expert level, and well beyond my own beginner-level Elo of ~1000.
For example:
Having seen a few of these, I was sure GPT-4.1 and GPT-5 would be better than me at chess, but it didn’t turn out that way.
I even tried playing a game against GPT-3.5-Turbo-Instruct, which was apparently supposed to be better than the chat models. Yeah, 1743 Elo my ass. It shuffled its rook back and forth while I took its pieces for free. It was very bad. I even tried on someone else’s “board game vs. AI” platform in case there was something wrong with my specific implementation/prompt.
(I just checked and my Elo on Chess.com is 807, and I haven’t really played chess since getting that rating. I think I was a bit under-rated, so maybe I’m like 900? “Beginner-level” in any case.)
Maybe I’m missing something, but I think it’s more likely the methodologies used in the above estimations just aren’t very good. At no point do they involve having humans with known Elo scores play games against AI models to estimate their score. Instead:
It makes me a bit concerned that in this lengthy online discussion about the Elo of LLMs at chess, nobody decided to estimate the Elo score the obvious way: by having AI models play against humans with known ratings. These estimates are indirect, error-prone ways of getting at the truth. I trust my own experiment more. Unfortunately, my experiment does not produce an Elo estimate comparable to FIDE/Lichess/Chess.com Elos, but it definitely wouldn't be high, and definitely not 1500+. It could be there was something wrong with my exact prompt, but the Mathieu Acher blog post above showed the LLMs' chess performance is fairly robust to small prompt changes, and I tried playing with someone else's prompt and still had no problem stomping GPT-3.5-Turbo. GPT-3.5-Turbo-Instruct does seem genuinely better than the chat-style models, though.
None of this really affects the main experiment either way, which was about how well the AI models could play games they weren’t familiar with, not how well they can regurgitate memorized chess strategy. If it turned out they had memorized a bunch of chess info, I could have used only the novel games for my analysis and come to the same conclusions.
Instead of hand-designing board games, I originally wanted to randomly generate novel games, which would make this “benchmark” reusable. As it stands, once this experiment enters the training data of LLMs, these board games are potentially “burned”.
However, after thinking about it, I've concluded that randomly generating new "playable" two-player, turn-based, perfect-information, zero-sum, one-action-per-turn games is extremely difficult without making any assumptions about the structure of the generated games (e.g. "grid-based"). If I wanted to do this in the future, I would just give up on the "no assumptions about the structure of the generated games" part, which would make it a lot easier. But I went down a bit of a rabbit hole with that part, and I wanted to share.
This is in an appendix because it’s not centrally important to the experiment or the LLM forecasting debate at all, I just thought it was interesting.
Let’s say a game consists of 2^n possible game states of n bits, and connections between the 2^n game states that form the possible moves a player can take on their turn. Players take turns taking actions until one player reaches a state which is a winning state. Different states may or may not be winning states, depending on the game.
If you randomly generated such a game, it would almost certainly be unplayable gobbledygook. Each game state would have different rules. Imagine playing chess, but the pieces all move differently on every single board state. There would be no exploitable patterns, and you'd have to play randomly.
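To see why, it helps to actually write the generator down. Here's a toy version of the formalism above (the representation is mine; the state count is kept tiny):

```python
# Toy generator for the "random game" formalism: n-bit states, random move
# edges, and random win states. Almost every game drawn this way is noise:
# knowing the moves from one state tells you nothing about any other state.
import random

n = 4                                   # bits per state -> 2**n = 16 states
states = list(range(2 ** n))
moves = {s: random.sample(states, k=3)  # 3 random legal moves per state
         for s in states}
wins = {s for s in states if random.random() < 0.1}  # random winning states
```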
Instead, we'd want "playable" games, that is, games whose small rulesets generate emergent, complex strategy spaces. Ideally, we could generate only playable games. This is clearly possible if we introduce constraints on the types of games, like insisting they're grid-based like chess. But if we don't introduce constraints, we're left trying to somehow generate games with random small rulesets and check whether they have large strategy spaces.
But you can't do that. (Boring math ahead, feel free to skip. I'm not a math guy, so it's probably wrong anyway.) Let's say you have a function S that tells you the shortest description of the optimal strategy for a given game g. If the rules of the game are small but S(g) is large, then the game is theoretically playable, with a large strategy space! However, given S and an arbitrary string of text (e.g. "abababababababababab"), we can build a game whose states are strings of text (e.g. "a", "aa", "bac"), with 26 possible moves from each state (e.g. "a" -> "ab" or "a" -> "ac"). Let's say this game has a single win state, "abababababababababab". Our function S(g) would give us the shortest description of the optimal strategy for the game (say, in English), which in this case could be "'ab' 10 times" or something like that. In doing so, we would have found the shortest possible description of "abababababababababab". The shortest possible description of a string of text is called its Kolmogorov complexity, and computing the Kolmogorov complexity of an arbitrary string is provably uncomputable. Therefore, our function S cannot exist.
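In slightly more careful notation (my own paraphrase of the argument above, so treat it as a sketch):

```latex
\textbf{Claim.} There is no computable $S$ mapping each game $g$ to a
shortest description of $g$'s optimal strategy.

\textbf{Sketch.} Fix any string $w$. Build the game $g_w$ whose states are
strings, whose moves append a single letter, and whose unique winning state
is $w$. An optimal strategy for $g_w$ is exactly a recipe for writing out
$w$, so $|S(g_w)| = K(w) + O(1)$, where $K$ is Kolmogorov complexity. If
$S$ were computable, $K$ would be computable too, contradicting the
uncomputability of Kolmogorov complexity.
```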
Basically, you can’t have some fixed-size program that tells you how large the strategy space for your game is. Arbitrary games are computationally irreducible. According to Wikipedia, “Computational irreducibility suggests certain computational processes cannot be simplified and the only way to determine the outcome of a process is to go through each step of its computation.” This is true of sufficiently complex, arbitrary games. You can only discover their nature by playing them (or mentally “playing” them). Including whether or not they are “playable”, or too trivial or too chaotic.
This is why chess itself evolved over the years, rather than having been invented all at once. It took people playing many thousands of games to learn what the play space of the game actually looked like, and what the full implications of the rules were. And to tweak those rules to make a better game.
And this is why new games re-use existing concepts, such as grids, pieces, decks of cards, and so on. If you start searching the Board Game Shop of Babel, it’s better to start in a location near existing “playable” games, rather than starting at a random location. The game shop of all random games is so large that stumbling across new playable games with all-new game concepts is very unlikely (almost 100% of board games are not “playable”). I think it’s just easier to hand-design new board games starting from existing concepts.
The only way I can think of to automate the generation of novel board games that differ significantly from existing games (but are still playable) is to create a machine that can do deep thinking and have the machine hand-design the games instead of me. Other approaches (like specifying all possible ways a piece could move in chess and generating random combinations to form new games) are limited in the variety/novelty of board games they could create.
It would be nice if I could generate novel, playable board games that had nothing in common with existing board games. Then it would be trivial to test whether LLMs could do deep thinking. But it seems like creating novel, playable games can’t be done in a systematic way and itself requires deep thinking. Making something playable and yet truly unique is an extremely complex computational task.
In a way, I wonder if inventing new deep thinking benchmarks will always require deep thinking. If we can’t systematically generate simple, completely new, playable games with very complex optimal strategies, then maybe that means we can’t systematically generate completely new deep thinking challenges for LLMs. To create new deep thinking benchmarks for LLMs, you’ll have to do some deep thinking yourself. This gets at the heart of what deep thinking is. Deep thinking is exponential in nature. You cannot verify the solution to an arbitrary exponential problem in polynomial time. Therefore, no simple algorithmic process or shallow thinking behaviour can create benchmarks for deep thinking. If you want to prove a machine can do deep thinking, you’ll have to do some deep thinking too. It takes a deep mind to know a deep mind.