Nitpick:
The average Mechanical Turker gets a little over 75%, far less than o3’s 87.5%.
Actually, average Mechanical Turk performance is closer to 64% on the ARC-AGI evaluation set. Source: https://arxiv.org/abs/2409.01374.
(Average performance on the training set is around 76%, which is what the graph seemingly reports.)
So, I think this graph you pull the numbers from is slightly misleading.
sometimes, like with chess, the “is it superhuman” question has a clear cut answer. But unless the case is that clear cut, it’s probably better to ask narrower questions
Asking a question without nuance is useful for gesturing at the hypotheticals where reality makes the answer clear cut.
Yeah, I agree. But also I think:
Of course, when it's couched well in a hypothetical, that's fine, though I do sometimes have an intuition that I'm being led down an implausible path when hypotheticals presume something as hand-wavey as "human level" systems in some domain. Capabilities are spiky!
In forecasting, talking about reality leads to talking past each other, because everyone will expect different things. So it's useful to zero in on the intended hypotheticals, even if they're not of interest to you, simply so you have an appropriate framing for understanding the other things that person is saying.
Hmm, I think we're seeing right now the pitfalls of terms like "AGI" and "superhuman" in past forecasting. Like, Tyler Cowen keeps saying o3 is AGI, which seems obviously wrong to me, but there's enough wiggle room in the term that all I can really do is shrug. More specific claims are easier to evaluate in the future. I don't have a bone to pick with long reports that carefully define lots of terms and make concrete predictions, but on the margin I think there's too much ephemeral, buzzy conversation about "this model is PhD level" or "this model is human level at task xyz" or what have you. I'm far less interested in when we'll have a "human level novelist" system, and more in, say, what year an AI-generated book will first be a New York Times bestseller (and that social forces might prevent this is both a feature and a bug).
While I believe SC2 and Dota would fall today given sufficient effort, the models didn't quite perform at a superhuman level, and as far as I'm aware no community bots do either.
Ah yup, for Starcraft that's what I get for relying on hazy 2019 memory.
https://deepmind.google/discover/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning/ includes "I’ve found AlphaStar’s gameplay incredibly impressive – the system is very skilled at assessing its strategic position, and knows exactly when to engage or disengage with its opponent. And while AlphaStar has excellent and precise control, it doesn’t feel superhuman – certainly not on a level that a human couldn’t theoretically achieve. Overall, it feels very fair – like it is playing a ‘real’ game of StarCraft."
Dota2, though, I do feel was fairly superhuman - it beat the world champion team (I think? I don't know Dota2 that well so it's plausible I'm misunderstanding the 2019 showmatch it won against OG) and won 99.7% of its matches in a public demo, which is totally insane from a human perspective. Seems analogous-ish to DeepBlue first beating Kasparov, though Carlsen could maybe beat DeepBlue now. I guess it just goes to show that, well, "superhuman" isn't well specified.
EDIT: ah, I missed the not-all-heroes limitation. That's big. At the time I remember following the "they got invulnerable couriers" handicap and being excited when they lifted it, but yeah, restricting the number of heroes is definitely too big for it to count as superhuman.
Also, thanks! Edited the post to correct the error.
Even with chess there are some nuances:
Yeah, I was thinking about (some different dimensions of) this too. Like, humans are better at chess in that chess engines couldn't abstract gracefully (without extra training) to kung fu chess, or Chess Evolved Online, or other arbitrary rule variants. They'd be especially bad at flick chess.
Also, I think the best Go AIs specifically are reliably fooled by certain wacky tricks that wouldn't work on humans. Cleanly superhuman performance is hard to find!
Strength
In 1997, with Deep Blue’s defeat of Kasparov, computers surpassed human beings at chess. Other games have fallen in more recent years: Go, Shogi, and Othello[1] among them. AI is superhuman at these pursuits, and unassisted human beings will never catch up. The situation looks like this:[2]
The average serious chess player is pretty good (1500), the very best chess player is extremely good (2837), and the best AIs are way, way better (3700). Even Deep Blue’s estimated Elo is about 2850 - it remains competitive with the best humans alive.
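To make that rating gap concrete, here's a minimal sketch using the standard Elo expected-score formula (the 1500/2837/3700 figures are the approximate ones from the graph above):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Best human (~2837) against a top engine (~3700): well under 1% expected score
print(round(elo_expected_score(2837, 3700), 4))  # ~0.0069

# For comparison, an average serious player (~1500) against the best human (~2837)
print(round(elo_expected_score(1500, 2837), 4))  # ~0.0005
```

In expected-score terms, the best human would pick up less than one point per hundred games against a top engine.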
A natural way to describe this situation is to say that AI is superhuman at chess. No matter how you slice it, that’s true.
For other activities, though, it’s a lot murkier. Take radiology, for example:
CheXNet is a model for detecting pneumonia. In 2017, when the paper was published, it was already better than most radiologists - of the four it was compared to in the study, it beat all but one. Is it superhuman? It’s certainly better than 99% of humans, since fewer than 1% of humans are radiologists, and it’s better than (about) 75% of those. If there were one savant radiologist who still marginally outperformed AI, but AI was better than every single other radiologist, would it be superhuman? How about if there was once such a radiologist, but he’s since died, or retired?
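The "better than 99% of humans" line is just back-of-the-envelope percentile arithmetic; here's a minimal sketch, assuming (as the paragraph above does implicitly) that non-radiologists all do worse than the model:

```python
# Rough percentile arithmetic behind the "better than 99% of humans" claim.
# Assumption: every non-radiologist does worse than the model.
radiologist_share = 0.01      # "fewer than 1% of humans are radiologists" (upper bound)
radiologists_beaten = 0.75    # better than ~75% of radiologists (3 of the 4 in the study)

humans_still_better = radiologist_share * (1 - radiologists_beaten)
print(f"better than ~{1 - humans_still_better:.2%} of humans")  # ~99.75%
```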
We can call this the strength axis. AI can be better than the average human, better than the average expert, better than all but the very best human performers, or, like in the case of chess, better than everyone, full stop.
But that’s not the only axis.
Effort
The ARC-AGI benchmark is a series of visual logic puzzles that are (mostly) easy for humans and difficult for AI. There’s a large cash prize for the first AI system that does well on them, but the winning system has to do well cheaply.
Why does this specification matter? Well, OpenAI’s industry-leading o3 scored 87.5% on the benchmark’s semi-private evaluation set, blowing many other models out of the water. It did so, however, by spending about $3,000 per puzzle. A human can solve them for about $5 each.
Talking in terms of money undersells the difference, though. The way o3 did so well was by trying to solve each puzzle 1024 separate times, thinking through a huge number of considerations each time, and then choosing the best option. This adds up to a lot of thinking: the entire set of 100 questions took 5.7 billion tokens of thinking for the system to earn its B+. War and Peace, for reference, is about a million tokens.
The average Mechanical Turker gets about 64%[3], far less than o3’s 87.5%. So is o3 human-level, or even superhuman? In some narrow sense, sure. But it had to write War and Peace 5,700 times to get there, and spend significantly more than most people’s annual salaries.
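For concreteness, here's the back-of-the-envelope arithmetic behind that comparison, using the approximate figures quoted above (all ballpark, not official numbers):

```python
# Rough effort comparison on the ~100-puzzle set, using the ballpark figures above
puzzles = 100
o3_cost_per_puzzle = 3_000              # ~$3,000 of compute per puzzle
human_cost_per_puzzle = 5               # ~$5 per puzzle for a human solver
total_thinking_tokens = 5_700_000_000   # ~5.7 billion tokens across the set
war_and_peace_tokens = 1_000_000        # War and Peace is ~1 million tokens

print(f"o3:    ~${o3_cost_per_puzzle * puzzles:,}")     # ~$300,000
print(f"human: ~${human_cost_per_puzzle * puzzles:,}")  # ~$500
print(f"tokens: ~{total_thinking_tokens // war_and_peace_tokens:,} War and Peaces")  # ~5,700
```

That works out to roughly a 600x cost gap on the same set of puzzles.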
We can call this axis the effort axis. Like the strength axis, it has various levels. A system can perform a task for negligible cost, or for approximately how much a human would charge, or for way more than a human would charge, or for astronomical sums.
And More
These axes combine! A system that costs $10,000 to do a task better than any human alive is one kind of impressive, and a system that exceeds the human average for pennies is another. Nor are strength and effort the only considerations; lots of tasks have dimensions particular to them.
Take persuasion. A recent study (which may be retracted/canceled due to ethics concerns) showed that LLM systems did better than (most) humans at changing people’s views on the subreddit r/changemyview. Try to unpack this, and it’s way more complicated than chess, ARC-AGI, or radiology diagnosis. To name just a few dimensions:
Many capabilities we care about are more like persuasion than chess or radiology diagnostic work. A drop-in remote worker, for example, needs to have a lot of fuzzy “soft skills” that are not only difficult to ingrain, but difficult to measure.
Beyond “Superhuman”
You probably don’t see people calling AI “superhuman” all the time, but there are lots of related terms used the same way. Questions like:
These questions try to compress a multidimensional spectrum into a binary property. The urge makes sense, and sometimes, like with chess, the “is it superhuman” question has a clear-cut answer. But unless the case is that clear-cut, it’s probably better to ask narrower questions, and to think carefully when performance claims are flying around.
I originally had Starcraft II and Dota2 in this list. On Starcraft II I just straight up misremembered (see comments), but Dota2 is more interesting. While OpenAI's system did beat a top professional team in a best of three, it was with a (very) limited hero pool. Thanks to Tenoke for the correction, and for making me a case study for the post!
Yeah, I know this graph (and the next) wouldn't actually be normal distributions. I don't think it matters for the purposes of this post, though.
The image above says ~75%, but Ryan Greenblatt (see comment below) showed me that that's on the public training set, rather than the private set, so it's apples to oranges with the o3 figure. Thanks for the correction!