Grok 4 doesn’t appear to be a meaningful improvement over other SOTA models. Minor increases in benchmarks are likely the result of Goodharting.
I expect that GPT-5 will be similar, and if it is, that lends greater credence to diminishing returns on RL & compute.
It appears the only way we will see continued exponential progress is with a steady stream of new paradigms like reasoning models. However, reasoning models were rather self-suggesting, low-hanging fruit, and new needle-moving ideas will become increasingly hard to come by.
As a result, I’m increasingly bearish on AGI within 5-10 years, especially as a result of merely scaling within the current paradigm.
Current AIs are trained with 2024 frontier AI compute, which is about 15x the original GPT-4 compute (of 2022). The 2026 compute (which will train the models of 2027) will be about 10x more than what current AIs are using, and 2028-2029 compute will then plausibly jump another 10x-15x (at which point various bottlenecks are likely to stop the process, absent AGI). We are only about a third of the way there. So any progress, or lack thereof, within a short time doesn't tell us much about where this is going by 2030, even absent conceptual innovations.
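The "a third of the way there" claim can be checked with a back-of-the-envelope calculation in log-space, since orders of magnitude are what matter for scaling. This is a minimal sketch using the rough multipliers from the paragraph above (15x realized, ~10x and ~10x-15x still ahead), not measured figures:

```python
import math

# Multipliers from the argument above (rough estimates, not measured values):
done = 15        # 2024 frontier compute vs. original GPT-4 (2022): ~15x realized
ahead = 10 * 12  # ~10x by 2026, then another ~10x-15x by 2028-2029 (take ~12x)

# Fraction of the total 2022->2030 scale-up already realized, in log-space
progress = math.log(done) / math.log(done * ahead)
print(f"fraction of the scale-up already realized: {progress:.2f}")
```

With these assumptions the realized fraction comes out to roughly 0.36, consistent with "only about a third of the way there."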
Grok 4 specifically is made by xAI, which is plausibly not able to make use of its compute as well as the AI companies that have been at it longer (GDM, OpenAI, Anthropic). While there are some signs that it's at a new level of RLVR, even that is not necessarily the case. And it's very likely smaller than compute-optimal for pretraining even on 2024 compute.
They likely didn't have GB200 NVL72 for long enough, and in sufficient numbers, to match their pretraining compute with them alone, which means compute utilization by RLVR was worse than it will be going forward. So the effect size of RLVR will only start being clearly visible in 2026, after enough time has passed with sufficient availability of GB200/GB300 NVL72. Though perhaps there will soon be a GPT-4.5-thinking release with a pretraining-scale amount of RLVR, which would be a meaningful update.
(Incidentally, now that RLVR is plausibly catching up with pretraining in GPU-time, there is the question of a compute-optimal ratio between them: what fraction of GPU-time should go to pretraining and what to RLVR.)
It’s starting to really feel like we’re in the process of AI improvement fizzling out and companies are merely disguising this with elaborate products.
Yeah there haven't been any improvements that significantly changed how capable a model is on a hard task I need solved for like, at least a week, maybe more /j
The /j was because I haven't really kept track of how long it's been. Gemini 2.5 Pro was the last one I was somewhat impressed by. Now, to be clear, it's still flaky and still an LLM, still an incremental improvement, but noticeably stronger on certain kinds of math and programming tasks. Still mostly relevant when you want speed and some slop is ok.