AI safety & alignment researcher
In Rob Bensinger's typology: AGI-wary/alarmed, welfarist, and eventualist.
Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.
I have signed no contracts or agreements whose existence I cannot mention.
[Very quick take; I haven't thought much about this]
In some plausible futures, the current pragmatic alignment strategy (Constitutional AI, deliberative alignment, RLHF, etc.) continues working up to and at least a little bit past AGI. As I see it, that approach sketches out traits or behaviors that we want the model to have, and then relies on the model to generalize that sketch in some reasonable way. As far as I know, that isn't a very precise process; different models from the same company end up with somewhat different personalities, traits, etc., in a way that seems at least partly unintended.
It seems likely that there are some values and traits that models usually generalize well when you point them in the right general direction, and others that require more precise pointing-to. This is an area I'd like to see some work in: entirely aside from the question of what values and traits we want models to have, which ones are easier or harder to specify to models, and what characteristics correlate with that?
Many have asserted that LLM pre-training on human data can only produce human-level capabilities at most. Others, e.g. Ilya Sutskever and Eliezer Yudkowsky, point out that since prediction is harder than generation, there's no reason to expect such a cap.
The latter position seems clearly correct to me, but I'm not aware of it having been tested. It seems like it shouldn't be that hard to test, using some narrow synthetic domain.
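For what it's worth, here's a minimal sketch of the kind of toy test I have in mind. Everything in it is invented for illustration (the arithmetic domain, the per-expert error rate, a count-based predictor standing in for a trained model), and it only exercises the error-averaging part of the argument, not the stronger claim that good prediction forces deeper modeling than the generators themselves do:

```python
# Toy synthetic-domain test: can a pure next-token predictor, trained only on
# imperfect demonstrations, outperform every individual demonstrator?
# All details here are invented for illustration.
import random
from collections import Counter, defaultdict

random.seed(0)

N_EXPERTS = 25      # simulated "humans" generating the training data
EXPERT_ACC = 0.7    # each expert answers any given question correctly with p = 0.7
QUESTIONS = [(a, b) for a in range(10) for b in range(10)]  # toy domain: a + b

def expert_answer(a, b):
    """One noisy demonstration: correct with prob EXPERT_ACC, else a random wrong answer."""
    if random.random() < EXPERT_ACC:
        return a + b
    return random.choice([x for x in range(19) if x != a + b])

# "Pre-training corpus": pooled demonstrations from all experts.
corpus = defaultdict(Counter)
for _ in range(N_EXPERTS):
    for (a, b) in QUESTIONS:
        corpus[(a, b)][expert_answer(a, b)] += 1

# Maximum-likelihood "predictor": continue each prompt with its most frequent answer.
def predict(a, b):
    return corpus[(a, b)].most_common(1)[0][0]

predictor_acc = sum(predict(a, b) == a + b for (a, b) in QUESTIONS) / len(QUESTIONS)
print(f"Individual generator accuracy: {EXPERT_ACC:.0%}")
print(f"Predictor accuracy on ground truth: {predictor_acc:.0%}")  # typically ~100%
```

Even this crude version shows a predictor trained only on noisy demonstrations scoring above every individual demonstrator; a real test would use a small trained model and a domain where errors are correlated in more realistic ways.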
The only superhuman capability of LLMs that's been clearly shown, as far as I know, is their central one: next-token prediction. But I expect that to feel like a special case to most people. There's also truesight and related abilities, but that's just a corollary of next-token prediction, and in any case I'm not aware of a conclusive LLM-vs-human comparison on that capability.
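(For the next-token case, the model side of an LLM-vs-human comparison is at least easy to quantify; here's a rough sketch using GPT-2 purely as an illustrative stand-in and a placeholder text sample. The human side, e.g. Shannon-style guessing experiments, which were character-level rather than token-level, is the harder part to line up.)

```python
# Rough sketch: measure an LLM's top-1 next-token accuracy on a text sample,
# as one side of a model-vs-human prediction comparison. GPT-2 and the text
# below are placeholders, not a serious benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog. " * 20  # placeholder corpus
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: [1, seq_len, vocab_size]

# Predict token t+1 from the prefix ending at t; compare argmax to the actual token.
preds = logits[0, :-1].argmax(dim=-1)
targets = ids[0, 1:]
accuracy = (preds == targets).float().mean().item()
print(f"Top-1 next-token accuracy: {accuracy:.2%}")
```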
Is anyone aware of some other superhuman capability that's been clearly demonstrated? Bonus points if it's something that at least some humans care about for unrelated reasons; bonus points if the capability is broad rather than narrow. But I'd love to hear about any!
I think that you should strengthen your 'see follow-up post' comment at the beginning. I would have spent less time focusing on the details of this post if I had realized that the follow-up involves so strong an update to what you write here, and especially that 'my previous post reflected a few misunderstandings of the AI 2027 model, in particular in how to interpret "superhuman AI researcher".'
This is nitpicking, but since this post is specifically trying to increase clarity:
The headline projection in the paper is that in about 5 years, AIs will be able to carry out roughly 50% of software engineering tasks that take humans one month to complete. A 50% score is a good indication of progress, but a tool that can only complete half of the tasks assigned to it is a complement to human engineers, not a replacement. It likely indicates that the model has systematic shortfalls in some important skill areas. Human workers will still be needed for the other 50% of tasks, as well as to check the AI’s work (which, for some kinds of work, may be difficult).
This paragraph gives me the impression that we could split the full set of tasks into 'ones current AI gets right' and 'ones current AI gets wrong'. I interpret the METR post as instead pointing to a set of tasks such that, for any given task in this class, an AI has a 50% chance of getting it right. The latter is a much worse position to be in for production use, since you can't assign the AI, in advance, to the tasks it's good at and humans to the rest.
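To make the difference concrete, here's a toy back-of-the-envelope version of the two readings (the numbers are invented; the only point is how much work lands back on humans under each):

```python
# Toy comparison of two readings of "AI can complete 50% of month-long tasks".
# All numbers are invented for illustration.
import random

random.seed(0)
N_TASKS = 10_000

# Reading A: a fixed, identifiable half of tasks is reliably within the AI's reach.
# Those can be routed to the AI up front; humans do (and own) the other half.
human_owned_a = N_TASKS // 2

# Reading B: for any given task, the AI succeeds with probability 0.5. You can't
# route in advance, so every task needs human review, and the failures need redoing.
failures_b = sum(random.random() < 0.5 for _ in range(N_TASKS))

print(f"Reading A: humans own {human_owned_a} of {N_TASKS} tasks outright")
print(f"Reading B: humans review all {N_TASKS} tasks and redo ~{failures_b}")
```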
Elsewhere it's clear that that's not what you mean, so I think it's just a matter of your wording in this particular paragraph.
Interestingly, looking back at ARC-AGI while writing the review, I notice that Gemini 3 Deep Think has quietly crossed the winning threshold (85%) on the original ARC-AGI without that getting much notice. Of course it's a big closed-source model, so not eligible for the grand prize. But it's fascinating to compare this to Chollet & Knoop's original prediction that LLM-based approaches would stall out far below that threshold.
Looking back at this post 18 months later, it's making two distinct claims:
Point 1 stands. Point 2 was true at the time of publication, and is still somewhat true now, but I think the evidence that LLMs are capable of general reasoning is significantly stronger. Essentially all of the specific skeptical evidence I found most compelling in this post no longer held for frontier models four months later. The best-known LLM skeptics don't seem to have updated much; see Gary Marcus's blog for examples of this stance. I think there's been a fair amount of goalpost-moving and a tendency to selectively focus on the negative evidence.
This became the first post of the 'General Reasoning in LLMs' sequence. I still owe a final post in that sequence, reporting the outcomes of empirical work on this question, which found that LLMs were capable of characterizing toy, novel scientific domains (up to some level of complexity, and not that reliably).
Of course, there are limits to LLM generalization, just as there are limits to human generalization. These are somewhat hard to cleanly specify, since as far as I know there's still no good account of what counts as in- and out-of-distribution for LLM training data (please correct me!). But the pattern has been that when skeptics point to a specific missing capability that (in their view) shows that obviously LLMs are just shallow stochastic parrots, that capability is usually present in the next generation of models.
Thanks for answering so many questions about this. I can see why it makes sense to filter on text from the evals. What's the rationale for not also filtering on the canary string as a precaution? I realize there would be some false positives due to abuse of the string, but is that common enough that filtering on it would have a significantly undesirable effect?
I think of the canary string as being useful because it communicates that some researcher has judged the document as likely to corrupt eval / benchmark results. Searching for specific text from evals doesn't seem like a full substitute for that judgment.
To be clear, I'm not asking you to justify or defend the decision; I just would like to better understand GDM's thinking here.
It is also assumed that introspection can be done entirely internally, without producing output. In transformers this means a single forward pass, as once tokens are generated and fed back into the model, it becomes difficult to distinguish direct knowledge of internal states from knowledge inferred about internal states from outputs.
If we make this a hard constraint, though, we rule out many possible kinds of legitimate LLM introspection (especially because the amount of cognition possible in a single forward pass is pretty limited, a limitation that human cognition doesn't face). I agree that it's difficult to distinguish, but that should be seen as a limitation of our measurement techniques rather than a boundary on what ought to count as introspection.
As one example, although the state of the evidence seems a bit messy, papers like 'Let's Think Dot by Dot' (https://arxiv.org/abs/2404.15758v1) show that LLMs can carry out cognitive processes spanning many tokens without drawing information from the outputs produced along the way.
LLMs are of course also superhuman at knowing lots of facts, but that's unlikely to impress anyone since it was true of databases by the early 1970s.