Starburst, a puzzle game, human intelligence test, and AI reasoning benchmark, was created in summer 2024, made publicly available in January 2025, and remains far from saturated. Cole Gaboriault – a friend of mine who playtested Starburst in summer 2024 – and I joke that Starburst comes from God. It wasn't intended as an AI benchmark until months after its creation; I accidentally created a reasoning benchmark that's text-in text-out, doesn't depend on specialized expertise, and has resisted saturation for nearly two years. It's difficult to convey how lucky we've been without telling the full story.
Note: this post will spoil Starburst.[1] If you're interested in trying it, reach out to me at chapinalc@gmail.com or via dm.
.
In March 2024, Cole and I started doing human intelligence research, focusing in particular on the relationship between human perceptions of intelligence and performance on actual cognitive tasks. I made a mediocre intelligence test called the CRIE; thanks to the generosity of our friends, we got a small dataset of scores that we were able to compare to ratings of perceived intelligence. Seeing the disappointing results from the CRIE, we started brainstorming other intelligence test candidates.
Around that time, I also read the Orthogonal and Three-Body Problem trilogies. The former is set in a universe with alternate laws of physics; the latter involves characters trying to figure out the bizarre behavior of a 3-star system in a video game. Looking for a better intelligence test and inspired by those novels, I created a game called Starburst.
In Starburst, the player gets celestial observations of a simple fictional universe. Gradually, the player unlocks increasingly powerful observational technology that makes the game easier, eventually including a full map and catalog of every object in the universe. The goal is to figure out the laws governing the universe as early in the game as possible, i.e. with as little technology and data as possible.
Starburst was not created that carefully. A lot of game design decisions were made because they seemed reasonable after maybe a couple minutes of thought. When I made later Starburst-like games, I would do extensive testing to tailor the initial conditions, trying to ensure a smooth difficulty curve. When I made Starburst, I… didn't do any of that.
Cole played Starburst that summer, and solved it with 5 technological upgrades, i.e. in technological era 6. His playthrough was uninspired, but highly intelligent and competent (and quite interesting to watch). It was also punctuated by various profane comments about the game, its law, and its UI design.
Sometime in fall 2024, Ryan, a friend of ours who was basically an AI true believer, suggested that ChatGPT would be able to solve Starburst, perhaps in as early an era as Cole had. We initially laughed off the suggestion. AI can't solve Starburst. And after Ryan played Starburst, even he backed off his assertion. But he still suggested that we test it.
When we finally got around to testing LLMs in late 2024 (including then-SOTA ones), they couldn't solve Starburst, even in the final, easiest era. So I extended it from the original 13 eras to 20, with the last being trivially easy for almost all humans. And…the LLMs did badly. We designed a prompting scheme where we presented a single era at a time, removing agency and strategic decision making, eliminating data not immediately relevant, and otherwise handholding the LLMs through everything but the core reasoning faculty we sought to test. Even with a benchmarking scheme arguably designed to give an unfair advantage to LLMs, GPT-4 and 4o solved it in era 17. o1 was occasionally able to solve it in era 16. Very poor performance. See, LLMs can't reason.
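For concreteness, here is a minimal sketch of that single-era scheme as an evaluation loop. It's an illustration, not our actual harness: the callables (ask_model, era_prompt, grade_solution) and the attempt count are hypothetical stand-ins for the real prompting and grading.

```python
# Hypothetical sketch of the single-era benchmarking scheme: each era is
# tested in isolation, and a model's score is the earliest era it can solve.

def earliest_solved_era(ask_model, era_prompt, grade_solution,
                        eras=range(1, 21), attempts_per_era=3):
    """Return the earliest (hardest) era the model solves, or None.

    ask_model(prompt)       -> the model's stated law for that era's data
    era_prompt(era)         -> a prompt containing only that era's observations
    grade_solution(answer)  -> True if the stated law is correct
    """
    for era in eras:  # era 1 is the hardest, era 20 is trivially easy
        for _ in range(attempts_per_era):
            if grade_solution(ask_model(era_prompt(era))):
                return era
    return None
```

Scoring by the earliest solvable era, rather than pass/fail on a single configuration, is what lets the same game distinguish, say, GPT-3.5 (era 18) from o1 (era 16).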
At the time, my suspicion was that, in eras 16 and later, models were able to solve Starburst essentially by throwing their superhuman knowledge at it, without any reasoning ability. In era 17, Starburst is really simple in a way that's presumably described extensively in their training data[2]; with enough unthinking pattern-recognition, it might be possible to solve Starburst without any genuine novel reasoning, even novel reasoning at the level of a dumb person.
And even if Starburst in eras 16 and later were measuring AI intelligence in a human-comparable way, it would still put the models of the time roughly on par with a dumb person. The claim of a near-term threat from AI seemed patently absurd (especially coming from people claiming that GPT-4o could do graduate-level physics). How could it be a threat when it can't even do basic text-in text-out novel reasoning problems that most people can do? I was considering buying the domain aiisretarded.com and linking to a post about Starburst and the poor AI performance on it.
Okay, when you have a belief that's being challenged, it's a good practice to ask what evidence it'd take to change your mind. So what Starburst performance would it take to convince me that AI has real reasoning ability (and may represent a serious threat)? Cole and I discussed the subject and agreed on era 15. In Starburst eras 15 and earlier, there was a proper law of interaction to figure out, one which almost certainly wouldn't be in the LLM's training data and which would require some genuine, though basic, novel geometric reasoning. We nicknamed the boundary between eras 15 and 16 the reasoning threshold.
As January turned to February, I tested o3-mini. It was more reliable than o1 in era 16, but still couldn't solve Starburst before the reasoning threshold, even occasionally. Around the same time, I also tested GPT-3.5, which wasn't able to solve it until era 18. Okay, so there's a decent trend of improvement (though with a long plateau at GPT-4/4o), but I still suspected that none of the models had any novel reasoning ability, even if some were better at throwing knowledge at the problem (and perhaps also synthesizing that knowledge) than others. So I expected the reasoning threshold to hold back the tide for a while, if not longer. As I wrote in February 2025:
"I therefore suspect, with low confidence, that o3 and its contemporaries will fail to break this barrier. If a model passes this barrier, that would be a massive cause for concern."
And…I was wrong. Less than two months after I wrote those words, Gemini-2.5-pro-preview, released on March 25th, 2025, crossed the reasoning threshold. It actually might happen. Time to start prepping. Among other things, I offered to lend Cole a gun. Cole said something to the effect of not yet.
Since then, the waters have risen with the pace of a hurricane. Yes, it was funny when OpenAI advertised o3 as being "at or near genius level", but that was overshadowed by the fact that the models were rapidly becoming smarter, even if they were far from genius level in April 2025 (and are still a fair ways away in April 2026).
As spring turned into summer and summer into fall, the major AI companies kept leapfrogging each other's Starburst performance. Well, all but one did. Anthropic's models stayed stubbornly below the reasoning threshold throughout the summer and most of the fall, even while they became increasingly reliable at era 16. Cole and I started wondering whether Anthropic, with their reputation as the “most responsible” major AI company, was deliberately hobbling their public models, trying to extract as much narrow capability from them as possible without giving them dangerous general intelligence. It wasn't until Claude Opus 4.5 (released November 24) that Claude finally crossed the reasoning threshold, skipping an era and stopping at the next categorical break in the Starburst progression, putting it still noticeably behind SOTA models. Notably, Opus 4.5 triggered a new level of scrutiny from Anthropic[3].
Opus 4.6, released shortly after Anthropic "revised" their policy of not pushing the frontier, was the first Claude model to be SOTA in its price tier at Starburst since we started benchmarking them.
.
But despite the rapid improvements of the past year, this is a story about Starburst not saturating. Gemini-3.1 pro, the current SOTA in its price tier (we generally only benchmark and measure progress using the ~$20/mo models), solved Starburst in era 10, which corresponds to what we'd call somewhat smart or ordinary smart[4]. Cole solved it in era 6. Ethan, another friend of mine, solved Starburst in era 5. His playthrough was inspired, characterized by repeated wild insights. I suspect that the best human geniuses would solve Starburst early in era 4.
The models may be stuck at era 10 for a while: they've come up against probably the hardest wall in the Starburst progression. One of the issues with the design of Starburst is that the difficulty curve is uneven in places, and the steepest part of that curve is between eras 9 and 10. Prior to era 10, players are given observations primarily in the form of lists of slopes (above the horizon) at which an object appeared. These lists don't include all objects, aren't attached to object IDs, and there are only limited means to get the game to tell you whether a sighting you see one turn is the same object as a sighting the next turn. Starting with era 10, you get a full map and catalog, including a consistent ID for each object. The heart of Starburst – finding truth from messy, incomplete data – is replaced with something much closer to I'll-show-you-what's-happening-and-you-tell-me-what's-happening. (For this reason, we call eras 4-9 "The Starburst Eon".)
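To make the contrast concrete, here is a purely illustrative sketch of the two kinds of data; the field names and numbers are invented, and the real game's output differs.

```python
# Invented illustration of the era-9/10 boundary described above.

# Eras 9 and earlier: anonymous per-turn sightings -- slopes above the horizon,
# with no object IDs and no guarantee that every object appears.
sightings_by_turn = {
    12: [0.84, 2.05, 0.31],
    13: [0.79, 0.33],   # is 0.33 the same object as last turn's 0.31?
}

# Era 10 onward: a full map and catalog, with a consistent ID for each object.
catalog_turn_12 = [
    {"id": "obj-01", "position": (3.2, 7.1)},
    {"id": "obj-02", "position": (-1.4, 0.6)},
]
```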
Gemini 3.1 Pro has basically reached as early in era 10 as is possible without having mostly solved the game before entering era 10 (which it's emphatically unable to do), but even at the highest price tier, the most recent version of Gemini Deep Think isn't close to breaking past this wall into era 9 or earlier. I don't know how long this wall will hold, but everything I've seen from human trials suggests it's a tall one. While it wouldn't shock me if it only lasted a few months, I suspect it'll be on the order of a year or two[5].
.
Okay, wait a minute. What if models are failing at Starburst for a reason other than intelligence limitations? Well, it's not an agency issue, nor one of being overwhelmed with irrelevant data. From the start, we gave models a single era in each test, and scored them based on the earliest era at which they could solve it. They don't have to make any decisions about whether to answer or ask for more data (which could arguably be called agency more so than intelligence); they just have to take the data and figure out what's going on. (And this scheme avoids distracting them with data from more difficult eras that they won't be able to use fruitfully.)[6]
As of Opus 4.6, the models displayed sufficiently robust agentic ability that we realized the handholding of the old benchmarking scheme was no longer necessary, and we switched to a more directly human-comparable scheme wherein the models are given 5 turns of data at a time and pick whether to ask for 5 more turns or give a solution. This gives us model scores that are essentially on a level playing field with human scores, and it has yielded performance comparable to or better than the old scheme for models as good as or better than Opus 4.6.
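As a rough sketch (the interfaces here are invented, not our real harness), the newer scheme looks something like this:

```python
# Hypothetical sketch of the human-comparable scheme: the model sees data in
# 5-turn chunks and chooses between requesting more turns or committing to a law.

def run_human_comparable_eval(agent, game, turns_per_chunk=5, max_turns=1000):
    """agent.decide(history) -> ("continue", None) or ("solve", stated_law);
    game.play_turns(n) returns n turns of observations; game.grade(law) -> bool."""
    history = []
    for _ in range(max_turns // turns_per_chunk):
        history.extend(game.play_turns(turns_per_chunk))
        choice, stated_law = agent.decide(history)
        if choice == "solve":
            return game.current_era(), game.grade(stated_law)
    return game.current_era(), False  # the agent never committed to an answer
```

Unlike the old scheme, the model here has to decide for itself when it has enough data, which is part of why we only adopted it once models (around Opus 4.6) were agentic enough for that choice not to be the binding constraint.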
But even within a single era, especially as the game gets harder, there's still a lot of data processing to do. In our three human tests, Starburst has taken 12, 14, and 48 hours. LLMs are known to have “time horizon” issues, where they're worse at longer tasks of otherwise comparable “difficulty.” Obviously, they can't remember things outside their context length, but their performance degrades for long tasks even before the context length is exhausted. To test this (and check for contamination of the benchmark), I designed two shorter Starburst-like games, one of which takes two hours, the other four hours. Gemini 3.1 Pro saturated the two-hour one. The four-hour game remains unsaturated, even by Gemini Deep Think. Notably, even models with time horizons (as measured by METR's benchmark) well in excess of 4 hours failed to saturate either of them, though they seemed to perform better on the shorter tasks than on the longer ones relative to human performance (perhaps because short tasks tend to have lower discernment ceilings). So current models really do seem to be prevented from saturating Starburst by intelligence limitations, though other limitations likely play a role too.
We're trying to sort out exactly what capabilities the models are missing that would be necessary for strong AGI (which we define as average-human-level ability at anything that can be done on a computer). They have smart-ish human-level reasoning ability, so why aren't they drop-in workers for most remote jobs yet? The most obvious answer is long-term agency and the many things required for it. The models are generally good (in many cases, wildly superhuman) at in-context "learning", but they really can't do long-term learning from experience ("online learning") in the way humans can. They don't even have long-term memory (though agentic scaffolds add a very crude long-term memory). They also have reliability issues that tend to rear their heads much more seriously for long-horizon tasks. At first glance, it seems surprising that models with general reasoning ability could be so deficient in other faculties, but there is a human analog to this phenomenon: imagine a smart person with encyclopedic knowledge, but who's fairly scatterbrained, bad at long-term planning and execution, and has long-term memory issues. Such people aren't common, to be clear, but the profile isn't wholly alien. (The models also seem to have perceptual limitations that few humans have, though those also seem to be rapidly improving.)
To assess this sort of long-horizon agency, we're looking into other possible benchmarks that lean more heavily on agentic abilities. We recently made a Starburst-like game that's based on chemistry instead of physics, takes very roughly 2-6 hours, and involves mixing chemicals and observing the effects of reactions. While Starburst involves agentic decisions about observations in eras 9 and earlier (such as where to point a telescope), and decisions about whether to give a solution or advance the turn throughout the game, it's primarily a game of passive observation. Such is not true for Chemistry Starburst. Gemini 3.1 Pro doesn't saturate it, but does well, confirming the issue isn't agency categorically, but long-horizon agency (and whatever faculties are necessary for it). We also have a technology/engineering Starburst-like game that's so long (we expect around 50-100+ hours for a playthrough) and fiendish that we haven't been able to get a single full human playthrough, despite it being over a year old. Like Chemistry Starburst, it's very agentic/interactive. I haven't yet tested AI on it, but I don't expect success. It might be suitable as a long-term agency benchmark, or we may need to devise another.
.
I tell this story because I believe there are useful lessons here. Starburst wasn't carefully, thoughtfully crafted, but it has still managed to survive nearly two years as a text-in text-out benchmark that doesn't rely upon obscure specialized knowledge or skills. Why?
Part of it is the particular characteristics of the task. Starburst is an unconventional multi-part reasoning task that doesn't heavily depend upon background knowledge. IQ tests, by contrast, elicit ludicrously high scores from LLMs, largely because they use knowledge as a proxy for intelligence, and LLMs have superhuman knowledge. And when they actually test reasoning, it's typically very narrow reasoning (sometimes aided by certain arbitrary intuitions).
Part of it is that Starburst has unusually good high-end discernment. It's easy to design a mediocre intelligence test, but it's very difficult to design a good one, especially one that's good at the high end. Even though exotic domain knowledge is neither required nor helpful, a genius has a significant advantage at Starburst over someone who's very smart. This really isn't easy to achieve, especially without careful consideration. Even with what consideration I did give, I went with one of the more obvious approaches: theoretical physics grants a significant return on genius, so I made a game where you do theoretical physics (in a simple fictional universe where an advanced physics or math background wouldn't be helpful).
But the other part of my wild luck had nothing to do with the design of Starburst. We were studying and designing tests for human intelligence. To benchmark intelligence in LLMs, it's useful to understand intelligence. And to understand intelligence, it’s useful to study human intelligence. This has given me one of my most useful benchmarking insights: performance of models should be, to the greatest extent possible, compared to the human distribution of performances on that task. It isn't enough to compare to a single “human baseline” performance.
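As a toy illustration of that point (all numbers invented; a lower solve era is better), placing a model within a distribution of human results says much more than comparing it to one baseline score:

```python
# Invented numbers: the solve eras of a hypothetical sample of human players.
human_solve_eras = [5, 6, 9, 10, 11, 13, 14, 16, 17, 19]
model_solve_era = 10  # e.g. a model that solves Starburst in era 10

# Fraction of the human sample the model matches or beats (lower era = better).
share = sum(model_solve_era <= h for h in human_solve_eras) / len(human_solve_eras)
print(f"Model matches or beats {share:.0%} of this hypothetical human sample")
```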
.
A final question: we originally started AI benchmarking out of a desire to assess AI risks (though we didn't take them seriously at first). For the better part of two years, we've been planning to get land in the woods if we ever come to expect a serious possibility of semi-imminent disaster. (Yes, I know that being in the woods is unlikely to help us survive an ASI takeover, but it very plausibly would help against an AI-assisted bioweapon, cyberattack, failed AGI takeover, AI-Thucydides-trap-driven nuclear war, and many other plausible scenarios.) What capability threshold should we set for that? Back when we expected developing intelligence to be the toughest problem to solve, we set Starburst era 9 as the woods threshold. We figured that a model with intelligence significantly above that of an ordinary smart person would be capable enough to pose the sorts of threats mentioned (especially considering the gap between internal and released models). But achieving average-human-level intelligence has proved significantly easier than achieving certain other abilities that most humans have, and which seem to be necessary for a drop-in worker. Starburst certainly tests those abilities to some extent, and much more so in era 9 than era 10, but maybe not enough. Could a model reach Starburst era 9 without the long-term agency needed to be a drop-in worker? I lean towards yes, but I'm unsure. If it does, would its intelligence alone make it sufficiently dangerous?
We maintain a Starburst leaderboard here. If you're interested in working with us or discussing further, please feel free to email or dm me.
[1] Somewhat less than it used to, due to removing details about the solution, per abstractapplic's concerns about contamination from a public post being used in training. See my response to his comment below. Happy to tell anyone who asks what I removed over dm.
[2] This is the biggest detail I removed. If you have any doubt of the "simple and in training data" claim, dm me.
[3] “Claude Opus 4.5 showed strong performance across many evaluations, warranting a comprehensive assessment to determine whether it had reached the ASL-4 threshold. We determined that Claude Opus 4.5 does not cross this threshold. However, the model is approaching or surpassing high levels of capability in our ‘rule-out’ evaluations — early proxies designed to indicate whether a model might be nearing the next capability threshold.” https://www.anthropic.com/transparency
[4] The recent vague sense I've heard from many people that LLMs were finally “good” coincided with them reaching eras 10-12.
[5] We all know what happened last time I said something like this. I pray it doesn't happen again.
[6] On the subject of different benchmarking schemes for humans and LLMs, some benchmarks (such as SimpleBench and ARC-AGI) use different setups for humans and models that give humans unfair advantages, presumably to make their benchmarks look more impressive and difficult to saturate. For obvious reasons, I believe that this should be avoided.