My best guess is that the elicitation overhang has shrunk or stayed roughly stable since the release of GPT-4. I think we've now elicited much more of the underlying "easy to access" capabilities with RL, and there is a pretty good a priori case that scaled-up RL should do an OK job of eliciting capabilities (at least for elicitation using the same inference budget as in RL), at least on the distribution of tasks models are RL'd on.
It seems totally possible that performance on messier real-world tasks with hard-to-check objectives is under-elicited. But we can roughly bound this elicitation gap by looking at performance on the types of tasks models are RL'd on and then conservatively assuming that models could perform that well on messy tasks with better elicitation (figuring out the exact transfer isn't trivial, but often we can get a conservative bound which isn't too scary).
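As a toy illustration of what this kind of bound looks like (everything below is a hypothetical sketch with made-up numbers, not measurements of any real model):

```python
# Toy numbers only -- nothing here is a measurement of a real model.
rl_task_success = 0.80      # hypothetical success rate on the checkable tasks models are RL'd on
messy_task_success = 0.55   # hypothetical success rate currently observed on messy, hard-to-check tasks
transfer_discount = 0.90    # hypothetical penalty for messy tasks being harder even when well elicited

# Conservatively assume better elicitation could at best lift messy-task
# performance to (discounted) RL'd-task performance.
elicitation_ceiling = rl_task_success * transfer_discount

# The elicitation gap on messy tasks is then bounded by:
max_gap = max(0.0, elicitation_ceiling - messy_task_success)
print(f"elicitation gap on messy tasks <= ~{max_gap:.2f}")  # ~0.17 with these toy numbers
```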
Additionally, it seems possible that you could use vast amounts of inference compute (e.g. comparable to human wages, or much more than human wages) more effectively and elicit more performance that way. But GPT-5 reasoning on high is already decently expensive and is probably not a terrible use of inference compute, so we don't have a ton of headroom here. We can get a rough sense of the returns to additional inference compute by looking at scaling with existing methods, and the returns don't look that amazing overall. I could imagine this effectively moving you forward in time by something like 6-12 months, though (as in, you'd see the capabilities we'd otherwise see in 6-12 months, but at much higher cost, so this would be somewhat impractical).
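To make the "moving forward in time" framing concrete, here's a minimal sketch assuming roughly log-linear returns to inference compute; both the per-doubling gain and the monthly trend are hypothetical numbers chosen for illustration:

```python
# Toy numbers only: assume benchmark score scales roughly log-linearly with
# inference compute, and that the frontier trend improves the same score over time.
gain_per_doubling = 2.0   # hypothetical points gained per 2x of inference compute
trend_per_month = 1.0     # hypothetical points gained per month of frontier progress

for months_ahead in (6, 12):
    points_needed = months_ahead * trend_per_month
    doublings = points_needed / gain_per_doubling
    compute_multiplier = 2 ** doublings
    print(f"~{months_ahead} months ahead needs ~{compute_multiplier:.0f}x the inference compute")
```

With these made-up numbers, buying 6 months of progress costs roughly 8x the inference compute and 12 months costs roughly 64x, which is why this looks more like an expensive, mostly impractical option than a large hidden overhang.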
Another potential source of elicitation overhang is niche superhuman capabilities being better elicited and leveraged. I don't have a strong view on this, but we haven't really seen it happen so far.
None of this is to say there isn't an elicitation overhang; I just think it probably hasn't really increased recently. (I can't tell whether this post is arguing that the overhang is increasing or just that there is some moderately sized overhang on an ongoing basis.) For instance, it seems pretty plausible to me that a decent amount of elicitation effort could make models provide much more bioweapon uplift, and this dynamic means that open-weight models are more dangerous than they seem (which doesn't necessarily mean the costs outweigh the benefits).
It seems totally possible that performance on messier real-world tasks with hard-to-check objectives is under-elicited
As an example, I think that the kinds of agency errors that show up in VendingBench or AI Village are largely due to lack of elicitation. I see these errors way less when coding with Claude Code (which has both better scaffolding and more RL). I'd find it difficult to put numbers on this to get a concrete bound, though.
It would be unfortunate if AI Village were systematically underestimating AI capabilities due to non-SOTA scaffolding and/or not having access to the best models. Can you say more about your arguments, evidence, how confident you are, etc.?
I disagree; I don't think there's a substantial elicitation overhang with current models. What is an example of something useful that you think could in theory be done with current models but isn't being elicited in favor of training larger models? (Spending an enormous amount of inference-time compute doesn't count, as that's super inefficient.)
By racing to the next generation of models faster than we can understand the current one, AI companies are creating an overhang. This overhang is not visible, and our current safety frameworks do not take it into account.
At the time GPT-3 was released, most of its currently-known capabilities were unknown.
As we play more with models, build better scaffolding, get better at prompting, inspect their internals, and study them, we discover more about what's possible to do with them.
This has also been my direct experience studying and researching open-source models at Conjecture.
Companies are racing hard.
There's a trade-off between studying existing models and pushing forward. They are doing the latter, and they are doing it hard.
There is much more research into boosting SOTA models than studying any old model like GPT-3 or Llama-2.
By contrast, imagine a world where Deep Openpic decided to start working on the next generation of models only once they were confident they had juiced their existing models. That world would have much less of an overhang.
Many agendas, like red-lines, evals or RSPs, revolve around us not being in an overhang.
If we are in an overhang, then a red-line being met may already be much too late, with untapped capabilities already way past it.
It is hard to reason about unknowns in a well-calibrated way.
Sadly, I have found that people consistently have a tendency to assume that unknowns do not exist.
This means that directionally, I expect people to underestimate overhangs.
This is in great part why...
Sadly, researching this effect is directly capabilities-relevant. It is likely that many amplification techniques that work on weaker models would work on newer models too.
Without researching it directly, we may only start to feel the existence of an overhang after a pause (whether because of a global agreement or a technological slowdown).
Hopefully, at this point, we'd have the collective understanding and infrastructure needed to deal with rollbacks if they were warranted.