We are likely in an AI overhang, and this is bad.

by Gabriel Alfour
23rd Sep 2025
2 min read

This is a linkpost for https://cognition.cafe/p/we-are-likely-in-an-ai-overhang-and
4 comments, sorted by top scoring
ryan_greenblatt

My best guess is that the elicitation overhang has shrunk or remained stable since the release of GPT-4. I think we've now elicited much more of the underlying "easy to access" capabilities with RL, and there is a pretty good a priori case that scaled-up RL should do an OK job of eliciting capabilities (at least when considering elicitation that uses the same inference budget used in RL), at least on the distribution of tasks that models are RL'd on.

It seems totally possible that performance on messier real-world tasks with hard-to-check objectives is under-elicited, but we can roughly bound this elicitation gap by looking at performance on the types of tasks which are RL'd on and then conservatively assuming that models could perform this well on messy tasks with better elicitation (figuring out the exact transfer isn't trivial, but often we can get a conservative bound which isn't too scary).
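A minimal sketch of the bounding move described above, with entirely hypothetical success rates (the variable names and numbers are invented for illustration): treat performance on the RL'd task distribution as a ceiling that better elicitation on messy tasks could at most reach, and take the difference as a conservative bound on the elicitation gap.

```python
# Hypothetical numbers illustrating the conservative bounding argument;
# none of these figures come from real evaluations.

# Measured success rate on tasks similar to those the model was RL'd on
# (well-elicited, easy to check).
rl_task_performance = 0.70

# Measured success rate on messier real-world tasks with hard-to-check
# objectives, under today's (possibly weak) elicitation.
messy_task_performance = 0.45

# Conservative assumption: better elicitation could at most bring messy-task
# performance up to the RL'd-task level (i.e., perfect transfer).
conservative_ceiling = rl_task_performance

# Upper bound on how much better elicitation could plausibly do on messy tasks.
elicitation_gap_bound = conservative_ceiling - messy_task_performance

print(f"Elicitation gap bound: {elicitation_gap_bound:.2f} "
      f"({messy_task_performance:.2f} -> {conservative_ceiling:.2f} success rate)")
```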

Additionally, it seems possible that you could use vast amounts of inference compute (e.g. comparable to human wages, or much more than human wages) more effectively and elicit more performance that way. But GPT-5 reasoning on high is already decently expensive, and this is probably not a terrible use of inference compute, so we don't have a ton of headroom here. We can get a rough sense of the returns to additional inference compute by looking at scaling with existing methods, and the returns don't look that amazing overall. I could imagine this effectively moving you forward in time by like 6-12 months though (as in, you'd see the capabilities we'd see in 6-12 months, but at much higher cost, so this would be somewhat impractical).
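As a rough, purely illustrative sketch of the "moves you forward in time" framing (the slope and monthly-progress numbers below are assumptions, not measurements): if benchmark score scales roughly log-linearly with inference compute, a compute multiplier can be converted into an equivalent number of months of frontier progress.

```python
import math

# Assumed, made-up parameters for illustration only.
# Suppose score improves roughly log-linearly with inference compute:
#   score(C) = score(C0) + slope * log10(C / C0)
SLOPE_PER_10X_COMPUTE = 0.05    # score gain per 10x inference compute (assumed)
FRONTIER_GAIN_PER_MONTH = 0.01  # score gain per month of frontier progress (assumed)

def months_bought_by_compute(compute_multiplier: float) -> float:
    """Estimate how many months of frontier progress extra inference compute buys."""
    score_gain = SLOPE_PER_10X_COMPUTE * math.log10(compute_multiplier)
    return score_gain / FRONTIER_GAIN_PER_MONTH

for mult in (10, 100, 1000):
    print(f"{mult:>5}x inference compute ~= {months_bought_by_compute(mult):.1f} months ahead")
```

Under these assumed numbers, 10x-1000x more inference compute buys roughly 5-15 months of progress, which is in the same ballpark as the 6-12 month figure above.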

Another potential source of elicitation overhang is niche superhuman capabilities being better elicited and leveraged. I don't have a strong view on this, but we haven't really seen this.

None of this is to say there isn't an elicitation overhang; I just think it probably hasn't really increased recently. (I can't really tell if this post is trying to argue that the overhang is increasing or just that there is some moderately sized overhang on an ongoing basis.) For instance, it seems pretty plausible to me that a decent amount of elicitation effort could make models provide much more bioweapon uplift, and this dynamic means that open-weight models are more dangerous than they seem (this doesn't necessarily mean the costs outweigh the benefits).

jsd

"It seems totally possible that performance on messier real-world tasks with hard-to-check objectives is under-elicited"

As an example, I think that the kinds of agency errors that show up in VendingBench or AI Village are largely due to lack of elicitation. I see these errors way less when coding with Claude Code (which has both better scaffolding and more RL). I'd find it difficult to put concrete numbers on this to get a bound, though.

Daniel Kokotajlo

It would be unfortunate if AI Village is systematically underestimating AI capabilities due to non-SOTA scaffolding and/or not having access to the best models. Can you say more about your arguments, evidence, how confident you are, etc.?

Nina Panickssery

I disagree; I don't think there's a substantial elicitation overhang with current models. What is an example of something useful you think could in theory be done with current models but isn't being elicited in favor of training larger models? (Spending an enormous amount of inference-time compute doesn't count, as that's super inefficient.)

We are likely in an AI overhang, and this is bad.


By racing to the next generation of models faster than we can understand the current one, AI companies are creating an overhang. This overhang is not visible, and our current safety frameworks do not take it into account.

1) AI models have untapped capabilities

At the time GPT-3 was released, most of its currently known capabilities were unknown.

As we play more with models, build better scaffolding, get better at prompting, inspect their internals, and study them, we discover more about what's possible to do with them.

This has also been my direct experience studying and researching open-source models at Conjecture.

2) SOTA models have a lot of untapped capabilities

Companies are racing hard.

There's a trade-off between studying existing models and pushing forward. They are doing the latter, and they are doing it hard.

There is much more research into boosting SOTA models than studying any old model like GPT-3 or Llama-2.

By contrast, imagine a world where Deep Openpic decided to start working on the next generation of models only once they were confident that they had juiced their existing models. That world would have much less of an overhang.

3) This is bad news.

Many agendas, like red lines, evals, or RSPs (responsible scaling policies), revolve around us not being in an overhang.

If we are in an overhang, then a red-line being met may already be much too late, with untapped capabilities already way past it.

4) This is not accounted for.

It is hard to reason about unknowns in a well-calibrated way.

Sadly, I have found that people consistently have a tendency to assume that unknowns do not exist.

This means that directionally, I expect people to underestimate overhangs.


This is in great part why...

  • I am more conservative on AI development and deployment than what the evidence seems to warrant.
  • I am sceptical of any policy of the form "We'll keep pursuing AI until it is clear that it is too risky to keep continuing."
  • I think open-weight release is particularly pernicious.

Sadly, researching this effect is directly capabilities relevant. It is likely that many amplification techniques that work on weaker models would work on newer models too.

If we do not research it directly, we may only start to feel the existence of an overhang after a pause (whether because of a global agreement or a technological slowdown).

Hopefully, at this point, we'd have the collective understanding and infrastructure needed to deal with rollbacks if they were warranted.