AI coding has been the big topic in capabilities growth lately. The announcement of GPT 5.1, in particular, eschewed the traditional sea-change declaration of a new modality, a new paradigm, or a new human-level capability threshold surmounted. Instead, it offered a relatively muted press release augmented with an interactive gallery of computer programs that OpenAI's new LLM had generated from brief prompts.
One of these demos in particular caught my eye, because its prompt was very similar to a test I run on every new model. Given an open-ended three-sentence prompt, the LLM had constructed a visually and functionally polished game, complete, impressively, with sound design that wasn't physically painful to experience. Integrating image and audio processing into a single core model had clearly paid dividends, giving it the intuition necessary to code something visually and audibly tolerable in a single shot.
The other examples in the gallery, to be frank, were much less useful for assessing capabilities, consisting of a variety of visually polished but functionally sparse webpages driven primarily by frontend libraries[1].
Just as playing games provided an excellent metric for the growth of reinforcement learning capabilities, programming games can serve as the gold standard for determining just how rigorously competent LLMs are at implementing computer programs. Most of the actual "work" of creating a polished personal website is done by the NodeJS libraries the user has installed, and most industry programming tasks draw their difficulty from domain knowledge (knowing what to build) rather than implementation challenge (knowing how to build it). Even coding-challenge benchmarks built specifically to evaluate capabilities are often formulaic - anyone who's spent time grinding the leaderboards on LeetCode knows that there are only so many ways to structure a dynamic programming problem before they all start blurring together.
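(To make "formulaic" concrete: the coin-change problem below is a stand-in I've picked purely for illustration, but its fill-a-table shape is shared by a large fraction of introductory dynamic programming exercises - swap out the recurrence and you've covered half the genre.)

```python
def min_coins(coins: list[int], target: int) -> int:
    """Fewest coins summing to target, or -1 if impossible.
    best[t] holds the answer for amount t; this one-dimensional
    table fill is the template most intro DP problems share."""
    INF = float('inf')
    best = [0] + [INF] * target
    for t in range(1, target + 1):
        for c in coins:
            if c <= t and best[t - c] + 1 < best[t]:
                best[t] = best[t - c] + 1
    return best[target] if best[target] != INF else -1

print(min_coins([1, 5, 10, 25], 63))  # -> 6 (25 + 25 + 10 + 1 + 1 + 1)
```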
Game programming, in contrast, is all about pushing technical limits, and, unlike most other domains of programming, there are sharply diminishing returns on coloring-by-numbers[2]. For a game to impress, it has to be novel, either by posing a new set of mechanics to the player or by implementing an existing set of mechanics in a novel way. In either case, boilerplate outputs are much less likely to over-impress, and the manner of thinking that a high-level engineer regularly requires is much more rigorously tested. The model must formulate a set of mechanics that satisfy some open-ended design criteria, intuit how a user is likely to interact with those mechanics, and model, ahead of time, obscure edge cases that might only come up after several hundred iterations of the core game loop. Moreover, it must do all of these things while understanding what would or would not be pleasant for a human user - modeling expected human preferences is both challenging and very important to the overwhelming majority of proposed use-cases for LLMs.
In other words, game programming lets researchers capture the 'vibe' of a model far better than quantitative programming-challenge evaluations can, without the noise introduced by extensive support libraries and the countless templated examples in the model's training set.
So how impressive a game can a frontier LLM actually produce? I've spent a considerable amount of time trying to answer this question, since I think it'd give me substantial intuition about just how far LLMs need to improve before they start seriously impacting the economy[3]. I'm usually confident in my Google-fu, but all I can find on the subject of "what's the most impressive game I can get an LLM to generate without feeding it pseudocode" is beginners trying out a new model by asking it to implement Snake or Asteroids, often with a new visual skin on top, and LLMs' ability to replicate common programming-tutorial templates doesn't tell me much that I didn't know three years ago.
I've run my own tests, of course[4], but I haven't been able to come up with anything rigorous enough to justify a confident update to my understanding of LLM capabilities. If people are interested, I'd be willing to write a few simple prompts across popular genres, sample results across the top dozen models, and write up the findings, but even that would be vulnerable to the criticism that my prompts were less than optimal at eliciting the models' best capabilities, and I'm not sure how I'd go about addressing that.
Given the sheer popularity of large language models among the demographics that usually experiment with making their own games, I'd expect much more to be written about their capabilities and limits on this front. Even if results were uniformly negative, I'd expect a few blog posts circulating about the surprising inefficacy of LLMs in game design. Instead, all I can find is one or two hastily-made startup webpages promising to "Make games with AI", with no examples of the kind of results they've achieved.
If you've got five minutes, and are similarly curious, please try generating a game using whichever LLM and prompt you think would get the best one-shot result. I'd be interested to see whether your results are in line with what I've seen.
Impressed as I was by "EV: Pocket Skirmish", I attempted to replicate the results in the press release. Over several dozen tries with their provided prompt, spread across a number of different models and API settings, I couldn't get anything qualitatively similar. GPT 5.1 will reliably produce a clone of Asteroids, often with one or two glaring graphical or functional issues, very poor sound design, and a couple of very simplistic enemy ships with non-functional or barely-functional AI.
I've run enough trials to be certain my results aren't a statistical fluke, and I'd be very interested to hear others' theories about what's going on. OpenAI doesn't usually cherry-pick results, and yet I think you'd have to run this prompt a four-digit number of times to get anything comparable to their example output with the model they released.
[1] lovable.dev showed a year ago that professional-quality React websites could be created easily and reliably with LLMs, but the sheer quantity of passable React on the internet, coupled with the fact that a request to an LLM for a React personal webpage is very unlikely to be something entirely new, means this doesn't tell us much about how well models can perform real engineering work.
[2] A very common criticism of LLM outputs of all kinds, but especially of coding work. If you speak to someone who is pessimistic about LLMs' prospective economic value, the idea that their most apparently impressive capabilities stem from regurgitating training data rather than first-principles reasoning is likely to come up.
[3] Among other things.
[4] Most recently, I wrote a style/design document for a roguelike and asked some frontier LLMs to implement a vertical slice. ChatGPT, Gemini, and Claude all managed to construct basic roguelike engines, though Gemini's map-generation algorithm was noticeably less advanced than the others', and it failed to intuit that requiring the enter key to be pressed after each movement command might be inconvenient for the player. Claude did the best job, with a procedural generation algorithm that recursively expanded its map and some surprisingly good atmospheric writing snippets. ChatGPT landed somewhere in the middle, with a partially-functional map generator that I suspect was half-remembered from a popular tutorial.
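For readers unfamiliar with the technique, here's a minimal sketch, in Python, of the kind of recursive map expansion I mean: each carved room picks a couple of random directions and recursively spawns child rooms connected by corridors. This is my own illustration of the general idea, not a reproduction of Claude's output; every name and parameter below is invented for the example.

```python
import random

# Grid dimensions and recursion cap are arbitrary choices for this sketch.
GRID_W, GRID_H = 60, 24
MAX_DEPTH = 4

def carve_room(grid, x, y, w, h):
    """Mark a w-by-h rectangle of floor ('.') tiles, clipped to the grid."""
    for j in range(y, min(y + h, GRID_H)):
        for i in range(x, min(x + w, GRID_W)):
            grid[j][i] = '.'

def expand(grid, x, y, depth=0):
    """Carve a room at (x, y), then recursively spawn connected
    child rooms in a couple of random directions."""
    if depth > MAX_DEPTH:
        return
    w, h = random.randint(4, 8), random.randint(3, 6)
    carve_room(grid, x, y, w, h)
    for dx, dy in random.sample([(1, 0), (-1, 0), (0, 1), (0, -1)], k=2):
        nx, ny = x + dx * (w + 2), y + dy * (h + 2)
        # Only recurse if the child's corner leaves space for the
        # largest possible room.
        if 0 <= nx < GRID_W - 8 and 0 <= ny < GRID_H - 6:
            # Dig an L-shaped corridor from the parent's centre to a
            # point inside the child room.
            cx, cy = x + w // 2, y + h // 2
            tx, ty = nx + 2, ny + 2
            while cx != tx:
                cx += (tx > cx) - (tx < cx)
                grid[cy][cx] = '.'
            while cy != ty:
                cy += (ty > cy) - (ty < cy)
                grid[cy][cx] = '.'
            expand(grid, nx, ny, depth + 1)

grid = [['#'] * GRID_W for _ in range(GRID_H)]
expand(grid, GRID_W // 2 - 4, GRID_H // 2 - 3)
print('\n'.join(''.join(row) for row in grid))
```

Running this a few times prints the branching, tree-shaped dungeons the approach is known for; a real implementation would layer connectivity checks, theming, and content placement on top, but the recursive skeleton is the interesting part.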