I like that you're thinking of ways to test your hypothesis and construct a deep thinking benchmark. How about this lite version: You come up with 3 different simple novel board games. You then run an experiment where a bunch of LLMs play a tournament against each other, with the rules to the new board game in the context. You also have some humans participate (who, like the LLMs, haven't heard of the game before). Then you calculate Elo scores and see who is better. Do this for all 3 games and average the results. Repeat every year with the same 3 games (and fresh humans) and see if there's a trend of solid improvement.
I think this benchmark would be somewhat meaningful to me, in that I currently expect that GPT3 basically can't do the task at all, GPT4 also basically can't, but GPT5 with reasoning should be able to, idk, mostly make legal moves and shamble towards victory (but not as well as the average human). But if instead all models are approximately equally terrible, that would be an update towards longer timelines for me, ditto if GPT4 is almost as good as GPT5.
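For concreteness, the scoring could just be the standard Elo update after each game. A minimal sketch; the K-factor, starting ratings, and player labels below are arbitrary placeholders, not part of the proposal:

```python
# Standard Elo update, sketched for scoring the proposed tournament.
def expected_score(r_a, r_b):
    """Expected score for player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, score_a, k=32):
    """score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical participants, all starting at the same rating.
ratings = {"gpt4": 1000.0, "gpt5": 1000.0, "human_1": 1000.0}
# After each game in the round robin, e.g. human_1 beats gpt5:
ratings["gpt5"], ratings["human_1"] = update_elo(ratings["gpt5"], ratings["human_1"], 0.0)
```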
I mostly agree, but instead of having LLMs play board games, which arguably selects for war-like or politics-like capabilities, one could also use an ARC-AGI-3-like benchmark where agents solve puzzles. Or group-theoretic or combinatorial problems in text form[1], though I don't understand how to tell the simple ones apart from the complex ones.
An example of a group theoretic problem
Consider the array [1,...,n] and the permutations A, which reverses the entire array; B, which reverses the first n-2 elements, leaving the last two fixed; and C, which reverses the first n-3 elements, leaving the last three fixed. Does there exist a nontrivial sequence of these operations, independent of n, in which no operation is applied twice in a row, that returns the array to its starting order for every n?
Hint 1
The compositions AB and BC are like cycles.
Hint 2
If one permutation moves nearly everything two places to the right, while another moves nearly everything one place to the right, how can we combine them to obtain a permutation that moves only a few elements?
Hint 3
A permutation that moves only a few elements around can be turned into the identity by applying it enough times.
Answer
(AB(CB)^2)^4.
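A quick brute-force sanity check of this answer (a sketch in Python; it assumes the word is read left to right, i.e. A is applied to the array first, then B, and so on):

```python
def reverse_prefix(arr, k):
    """Reverse the first k elements of the list."""
    return arr[:k][::-1] + arr[k:]

def apply_word(word, n):
    """Apply a word in A, B, C (left to right) to the array [1, ..., n]."""
    arr = list(range(1, n + 1))
    prefix_len = {"A": n, "B": n - 2, "C": n - 3}
    for letter in word:
        arr = reverse_prefix(arr, prefix_len[letter])
    return arr

word = "ABCBCB" * 4  # (AB(CB)^2)^4, with no operation repeated twice in a row
for n in range(5, 41):
    assert apply_word(word, n) == list(range(1, n + 1))
print("returns to the identity for every tested n")
```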
For example, Grok 4 (which I, unfortunately, asked in Russian) ran a BFS and, of course, didn't find anything noteworthy. I hope that benchmarks like this won't end up being contaminated and will instead be reliable indicators...
Which LLMs, being trained on text, could find easier than the basic physical tasks that LLMs were still failing at as of April 14, before the release of o3.
(Crossposted from my Substack: https://taylorgordonlunt.substack.com/p/shallow-vs-deep-thinking-why-llms)
(This is a follow-up to my previous post outlining my AI 2027 predictions)
The history of “AI” is littered with tech that didn’t pan out and plateaued early in its development process. The history of tech, specifically, is one of constant mismanagement and extremely dodgy research. Promises of a glorious future abound, but those futures never pan out as promised.
--- Baldur Bjarnason
There is a problem you want to solve. Let's say the problem can be solved in polynomial time. That is, the number of steps needed to solve it grows manageably as the problem gets bigger. Hooray! That means it's easy to get a computer to solve the problem for you in a step-by-step fashion. But some problems are harder[1], requiring exponential time to solve. You've heard the phrase "exponential growth". That's what you don't want happening to the number of steps it's gonna take you to solve your problem.
Unfortunately, most of the problems we really want to solve can't be solved in polynomial time.[2] Curing cancer, writing bestselling novels, improving computer chips, researching better AI, writing good laws --- you can't solve these problems in polynomial time. If you could, then as Scott Aaronson says, "the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it’s found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffett." Clearly, that's not the world we inhabit. It seems to be much harder to solve a problem than to verify the solution after you've solved it. That means most valuable problems are not polynomial in nature.
Solving some exponential problems is brutally difficult. There are at least 10^120 possible chess games. Nobody knows the exact number, but it's a lot. You can't just memorize them all and choose the best path to victory. 10^120 is a way bigger number than the total number of atoms in the observable universe. There are way too many games to memorize even a tiny fragment of them.
Yet somehow people are still pretty good at chess. This is because, while chess is exponential, most of the moves are obviously terrible. If you remove obvious blunders, then the number of chess games is closer to 10^40. (That's not a third of the games; it's essentially 0% of them: 10^40 is only a 10^-80 fraction of 10^120.) Almost all chess moves obviously suck. Once you know the basic rules and a few basic heuristics, you can avoid almost all of the complexity of chess. Human chess play takes place in the tiny (still large) space that remains.
Chess is a very hard exponential problem, but by learning some heuristics and patterns, you can make it much, much easier. The more you learn, the better you can do. Grandmasters spend thousands of hours memorizing openings and endgames, and thousands more absorbing abstract patterns while they play. Spotting a "fork" (a move where one piece can threaten two other pieces at once) is hard for a newbie, but trivial for an intermediate player. As you learn chess, you're not "solving" chess. That's impossible. You're just ruling out branches of play you guess aren't good for you, narrowing down the complexity to something you can hopefully manage. Perfect play is out of the question.
Individual humans don't just learn patterns and heuristics as they play; humanity does as a whole. Over time, strategies develop in competitive play. Almost nobody at the top level was playing the Berlin Defense before Vladimir Kramnik used it against Garry Kasparov in the Classical World Chess Championship 2000. After that, everyone was using it. Chess players today are better than chess players of the past, because they don't have to rely on only their own individual insights to play good chess. They get to leverage all the high-value insights discovered over centuries of play, making modern players far better than historical players ever could have been.
Same goes for sports. For many years, people thought calf kicks were pointless in MMA. Then suddenly, they were a standard part of everyone's strategy. Everyone benefited from the insight of a small number of fighters who started using them. In this way, human progress is not an activity that takes place in one person's mind, but one distributed across many minds. Human beings parallelize the search through exponential problem spaces.
But even this isn't enough. We want to solve problems right now! We want a cure for cancer right now! So we hope to make searching through problem spaces more efficient by using machines.
AlphaGo Zero learned to play Go at a superhuman level by rapidly playing games against itself, learning strategies without having access to any pre-existing human strategies. It played millions of games against itself in the three days of training time it took to reach a high level. This cost millions of dollars because of the insane amount of compute spent to train it. But now that the AI is trained, it can beat humans in a game of Go using much less compute than it took to train it in the first place.
It's much cheaper to play an advanced game of Go than it is to learn to play high-level Go from scratch, because when learning Go you have to uncover novel heuristics and patterns that help you navigate a gargantuan exponential solution space. But when you play, most of your competitiveness comes from leveraging these "critical insights" you have already discovered, with a comparatively small amount of compute being required.
I call the kind of expensive thinking leading to the generation of new "critical insights" deep thinking, and the simple utilization of existing insights shallow thinking. Any system limited in its ability to do deep thinking will struggle to generate new critical insights, and thereby struggle to advance research, write incredible TV shows, cure aging, and so on. I claim that Large Language Models (LLMs) used by major AI companies today are basically incapable of deep thinking because of their architecture.
This isn't quite the same distinction as System 1 vs. System 2 thinking. I saw an AstralCodexTen comment today suggesting LLMs can only do System 1 (automatic, fast, unconscious) reasoning, and they will never achieve superintelligence until we allow them to do System 2 (deliberate, slow, conscious) thinking.
In response, Scott Alexander points out that System 2 (slow) thinking is basically just a specialized kind of System 1 (fast) thinking. I agree. This is not the limitation LLMs have. LLMs can do both System 1 thinking and System 2 thinking, as long as they're both shallow thinking. Clearly LLMs can blurt out an automatic response, and clearly they can follow step-by-step chains of reasoning, especially with modern advancements like chain of thought. What they can't do very well at all is deep thinking.
It's not easy to tell that LLMs can't do deep thinking. I think I have personal intuition about their limitations from working with them in deep, real-world contexts while trying to use them to augment my own deep exploration of an exponential idea space. In doing so, I have found LLMs to be fundamentally useless every time. They can bring me new knowledge, which is handy for coming up with my own insights, but they don't seem to be making new insights on their own.
But like I said, it's not easy to tell whether LLMs can do deep thinking, mostly because when an LLM solves a problem, there's no easy way to tell from the outside whether or not a problem required deep thinking to solve. Let's say I have a problem I want to solve. Will it require me to do some deep thinking?
This is part of what makes most existing AI benchmarks useless for testing deep thinking. Many of the problems tested by AI benchmarks are clearly simple/polynomial problems like arithmetic. Those that are not, such as those that test the AI's decision making ability, could be testing deep thinking, or they could just be testing the LLM's ability to use existing critical insights learned from humans.
It would be nice if we had a benchmark that could unambiguously test deep thinking. When generating questions for such a benchmark, we would need to make sure they genuinely require new critical insights to solve, and that the model couldn't have picked up those insights (or the solutions themselves) from its training data.
You can see why existing benchmarks fail. Any pre-written list of questions fails these criteria, because a new LLM can just memorize the solutions, or at least some critical insights, for the benchmark questions. And there's also no way to really know what information the model already has. Even if you make up a problem that sounds pretty unique, it may turn out to be structurally similar to a problem somebody solved 20 years ago on the Turkish internet, and the LLM just happened to have that in its training data. How can we make sure the LLM has to do its deep thinking from scratch, the way AlphaGo Zero had to learn Go from scratch?
I'm not sure, but one idea is to randomly invent new exponential games (like chess or Go). Create a game with an exponential number of states (2^n), where the goal is for both players to reach a single "win state". Players take turns, choosing one of two moves (call them 0 and 1 if you want to). You would randomly generate a "small" ruleset mapping the current state and chosen move to the next state (such as with a polynomial-size circuit?). The goal for the LLM --- the benchmark question --- is to decide whether player 1 has a winning strategy to reach the "win state" within T steps, from a given starting state.
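To make this concrete, here is a minimal sketch of what generating and ground-truthing such a game might look like. The construction (a couple of random bit operations per move) and the function names are my own illustrative choices, and I'm reading "winning strategy" adversarially, with player 2 trying to keep the game away from the win state:

```python
import random
from functools import lru_cache

def make_random_ruleset(n, seed):
    """Randomly generate a compact two-move ruleset over 2**n states."""
    rng = random.Random(seed)
    mask = (1 << n) - 1
    # One (xor mask, additive constant, rotation) triple per move; the rule
    # description stays tiny even though the state space is exponential.
    params = [(rng.randrange(1 << n), rng.randrange(1 << n), rng.randrange(1, n))
              for _ in range(2)]

    def step(state, move):
        xor_mask, add_const, rot = params[move]
        s = (state ^ xor_mask) & mask
        s = (s + add_const) & mask
        return ((s << rot) | (s >> (n - rot))) & mask  # rotate left by `rot`

    return step

def player_one_can_force_win(step, start, max_plies, win_state=0):
    """Ground truth by brute force: can player 1 force the game into
    `win_state` within `max_plies`, assuming player 2 tries to prevent it?"""
    @lru_cache(maxsize=None)
    def solve(state, plies_left, p1_to_move):
        if state == win_state:
            return True
        if plies_left == 0:
            return False
        outcomes = (solve(step(state, m), plies_left - 1, not p1_to_move)
                    for m in (0, 1))
        return any(outcomes) if p1_to_move else all(outcomes)

    return solve(start, max_plies, True)

if __name__ == "__main__":
    n, max_plies = 12, 10                  # small enough to solve exactly
    step = make_random_ruleset(n, seed=0)
    start = random.Random(1).randrange(1 << n)
    print(player_one_can_force_win(step, start, max_plies))
```

At benchmark scale you'd crank up n and the depth so brute force stops being feasible, which is exactly where heuristics discovered on the fly would have to carry the load.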
This is effectively like creating a new game like chess or Go, wherein knowledge of chess, Go, or other similar games will give you no advantage. Even a deep knowledge of one such random game will give you no advantage on another such random game. Therefore, I think you could use this to test an AI model's ability to do deep thinking. I believe AlphaGo Zero could be trained on such a random game instead, and would dominate for pretty large values of 'n'.
You do have to make sure the game is large enough in terms of its state/ruleset to not be trivial, and you have to make sure the ruleset has some structure that can be exploited. A truly random ruleset just makes the exponential space fully random, meaning there are no heuristics/patterns that can be learned in one part of the space to help with another part of the space. Imagine if every single board state of chess had its own set of rules for what the pieces could do. That would not be a learnable game. The only way to play would be randomly. But assuming we create a game with some structure like chess or Go, then AlphaGo Zero could learn it.
You could also try running an LLM on such a game to see if LLMs are capable of deep thinking, but it's so obvious to me that the LLM would fail there's no point running the experiment. AlphaGo Zero has the advantage that it's a narrow model, so we can afford to train it extensively on a single game. We don't care if it can play chess or Magic: the Gathering or solve any other problem besides the one it's being trained on. We spend a bunch of money training it to play our randomly-generated game, and then it can only play that game. With LLMs, we're demanding something further. We want artificial general intelligence. That is, AI that can solve tough problems without having to be specifically trained to solve that exact problem. Ideally, the AI can see the new problem and then do the same kind of deep thinking that AlphaGo Zero does when playing Go against itself and extracting new critical insights. The AI mines exponential space for critical insights one by one until it has narrowed the exponential space enough to find a solution.
LLMs don't do this. It's pretty clear they're not doing this. If they know how to play chess, it's because they copied a bunch of chess-playing insights from humans in the training data. Getting good at chess from scratch would require orders of magnitude more compute than LLMs currently have, and also seems like something the architecture of LLMs isn't conducive to. For example, how would an LLM simulate games against itself, or do anything cognitively similar to that?
An LLM with chain of thought can store textual information in an external scratchpad between runs. If we keep running the LLM, letting it write new text to the scratchpad each time, then in theory, the LLM could learn to play chess from scratch. It could simulate a bunch of chess games by writing out each move one by one in the scratchpad. After each game, it could write down whatever insights it had learned. Eventually, after an absurd amount of time and money has been burned, the LLM might be able to play chess well (until the chain of thought text is no longer available.) This is all assuming the LLM is large and robust enough to handle the monstrous chain of thought that would result from this process. Any problem is in theory solvable this way[3], but the architecture doesn't seem designed for it. Allowing for in-model recurrence or improving the external scratchpad to use non-textual "neuralese" information would help, but these are just optimizations to a flawed process.
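As a rough sketch of the loop I'm describing (query_llm below is a hypothetical placeholder for whatever interface serves the model, not a real API):

```python
def query_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for an actual LLM call")

def self_play_with_scratchpad(num_games: int) -> str:
    """Naive scratchpad loop: play, then distill insights into text."""
    scratchpad = "Known insights: (none yet)\n"
    for game in range(num_games):
        # 1. Ask the model to play a full game against itself, move by move,
        #    conditioning on everything it has written so far.
        transcript = query_llm(
            "Your scratchpad so far:\n" + scratchpad +
            "\nPlay one complete game of chess against yourself, "
            "listing every move."
        )
        # 2. Ask it to distill whatever it learned into durable text.
        lessons = query_llm(
            "Scratchpad:\n" + scratchpad +
            "\nGame transcript:\n" + transcript +
            "\nWrite down any new heuristics or insights from this game."
        )
        scratchpad += f"\n--- After game {game + 1} ---\n{lessons}"
    return scratchpad
```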
Residual neural networks like AlphaGo Zero and LLMs are multi-layer neural networks where low-level features are encoded in early layers (like how a pawn moves) and high-level features are encoded in later layers (like a specific strategy for using knights in a game with fianchettoed bishops). This is an architecture for detecting known patterns in a chess game so you can leverage them to choose the next move. This is not an architecture for discovering new critical insights. That's why sometimes you hit a wall when using LLMs to solve problems. Once you're outside the training data, LLMs can't help you.
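To be concrete about "residual": each block just adds a learned correction onto its input, so later blocks refine whatever features earlier blocks have already computed. A toy sketch (my illustration, not AlphaGo Zero's or any LLM's actual code):

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + W2 @ relu(W1 @ x): detect a pattern, add the result back in."""
    hidden = np.maximum(0.0, w1 @ x)   # feature detection
    return x + w2 @ hidden             # residual (skip) connection

rng = np.random.default_rng(0)
d = 8                                  # feature dimension (arbitrary)
x = rng.normal(size=d)                 # the "board state" as a feature vector
for _ in range(4):                     # early blocks: low-level features;
    w1 = 0.1 * rng.normal(size=(d, d)) # later blocks: higher-level ones
    w2 = 0.1 * rng.normal(size=(d, d))
    x = residual_block(x, w1, w2)
```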
The architecture that can generate new critical insights is not AlphaGo Zero, but the reinforcement learning algorithm that trained AlphaGo Zero. If we want AI that can do deep thinking on the fly, we need AI that can run like how AlphaGo Zero was trained. "Online learning", and also way more compute. "ResNets" like AlphaGo Zero and LLMs are designed to store existing features learned at training time, not to generate new features on the fly.
If we want a new AI that is capable of deep thinking, that can solve all the problems we want AI to solve, we'll need a new architecture, and we'll need a lot more compute. Running AlphaGo Zero on a game of Go is 10-100 billion times less compute-expensive than training it. We should expect running a deep-thinking AI to be computationally more similar to training modern AI than to running modern AI. This is what we have to do to solve big, exponential problems.
What's worse: It's easy to explore an exponential space you can rapidly simulate. If I can play millions of chess games against myself in software, I can rapidly extract new critical insights until I'm a chess God within a few days. But not everything is as easy to simulate as chess. Physical systems like the human body or the economy, where the rules of the game aren't even fully known, would have a much longer feedback loop for doing deep thinking. An AI that has to build a large hadron collider to make progress won't learn physics as fast as AlphaGo Zero learned Go.
LLMs have made a bunch of progress on shallow thinking in a way that's changing the world, but they have made very little progress on deep thinking. Because of this, we cannot extrapolate from past (shallow thinking) progress and expect new (deep thinking) progress to emerge. Creating superintelligent AI is itself a very hard, exponential problem. We should not expect it to be solvable as a matter of course. It will involve new critical insights, which are obtained essentially at random.
Some genius human researchers seem better at making breakthroughs than other researchers. Why is this possible? I think there are two reasons:
There's not much more you can do than that. Discovering critical insights from scratch is a painful, blind exploration of exponential space. A better researcher may have more existing insight that allows them to prune more of the exponential space, but the remaining paths in exponential space must be explored at random. There's no telling how hard the problem might be. We might discover one "trick" that renders the whole problem much simpler, or we may flounder for centuries until hardware advances make the problem basically impossible not to solve.
Having more/better researchers will certainly help, but nobody in the year 1900 could have invented a superintelligent AI. The hardware didn't exist. You would have had to first hurry up and invent 21st-century computers in order to then invent superintelligent AI. I think a similar thing is true today. We don't really have the hardware to do the amount of computation needed to run general superintelligent AI on deep thinking problems, as discussed above. GPT-4.5 was still, as a very rough guess, at least a million times less powerful than what we'd need to do deep thinking as described above, and we'd also need the algorithmic breakthroughs that let us do it on the fly for arbitrary problems, rather than at training time (for AlphaGo Zero) or basically not at all (LLMs).
How long would it take us to multiply the compute available to AIs by 1,000,000? I have no idea. It won't happen in the next ten years, I predict. If we extrapolate the aggressive 2.25x/year compute estimate from the AI 2027 Compute Forecast, then it'll be about 17 years. But predicting the future is exponential-hard and requires deep thinking. I am bottlenecked by my humanity.[4]
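(As a sanity check on that 17-year figure, using just the two numbers above:)

```python
import math
# Years needed to grow compute by 1,000,000x at 2.25x per year.
print(math.log(1_000_000) / math.log(2.25))  # ~17.0
```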
P.S. This is my second AI post in a row, but I'm probably not going to keep posting about AI over and over.
This assumes the famous computer science problem of P vs. NP will resolve as "P is not equal to NP", which it probably will. I'm also simplifying by calling all non-polynomial problems exponential, which isn't strictly true, but it's fine for this discussion. ↩︎
This is an oversimplification of computer science complexity theory and a bit of an abuse of terminology, but I assume you'll know what I mean. ↩︎
This comment by Daniel Kokotajlo on my last post explains something similar. ↩︎
The latest AstralCodexTen post by Scott Alexander parodies people who say LLMs are limited by pointing out how limited human thinking can often be. For example, in response to the question "Which weighs more, a pound of feathers or a pound of bricks?", many humans will quickly reply, "the bricks." This is clearly wrong, and an example of how shallow, System 1 (fast) matching on existing patterns is not a reliable strategy for solving arbitrary problems, whether you're an LLM or a human. However, this post does not address the distinction I've made here between shallow and deep thinking. Whereas an LLM fails on deep thinking problems because it's always stuck shallow thinking, humans sometimes blunder on problems when they underestimate the amount of thinking needed to solve them, but they can switch into a higher gear if needed (either to a more deliberate System 2 (slow) thinking mode, which LLMs can kind of also do, or to a deep thinking mode, which they cannot). We also fail on very deep problems for a long time before we succeed, but that's just the nature of huge, exponential problem spaces. ↩︎