theory uplift differentially benefits safety & is underleveraged

Yudhister Kumar

theory uplift differentially benefits safety & is underleveraged — LessWrong

66 theory uplift differentially benefits safety & is underleveraged

by Yudhister Kumar

20th May 2026

1 min read

66

[1] We will likely have near-superhuman mathematics AI by Q1 2027.^[1]

[2] Qualitatively, AI mathematics capabilities are developing significantly faster than automated AI R&D capabilities.^[2]

[3] Thus, we will likely have a period of time where the rate of our ability to rigorously & usefully verify and understand model behavior and model outputs outpaces the rate of R&D capability development itself.

[4] Our ability to take advantage of this period is bottlenecked on the quality of our specification generation infrastructure, elicitation tooling (for proofs & specs etc.), and the institutional capacity for scaling useful outputs with capital.

[5] My understanding is that basically no one^[3] is working on building infra that can usefully turn >$100M of compute credits into safety-relevant mathematical output.

[5.1] The number of theory-driven ASI alignment efforts is also small, but growing. ARC is a much better bet now than it was in 2023.

[5.2]. My understanding is also that no one is working on developing AI-powered conceptual tooling infrastructure for tackling problems in, for instance, metaphilosophy. This is a much harder problem.

[6] In worlds where alignment is easy, prosaic methods may scale. In worlds where alignment is harder than easy, guarantees are more wanted. Hard worlds are less hard now because guarantees are easier. This is incredible and should be leveraged.

^{^}
Concretely: I expect an OAI model to serve as a more efficient proof oracle for ~any well-specified mathematics problem (bar perhaps a singular expert) by Q1 2027.
^{^}
AI Futures has as medians for automated coder and superhuman AI researcher as June and December 2028 respectively.
^{^}
I mean this in both the absolute sense and the relative sense.

Who comes to mind: Convergent is incubating a FRO focused on the specification generation problem, Theorem is working on formal verification, ARIA is funding formal methods broadly, Fulcrum is working on elicitation.

5.5 Pro estimates total investment in these orgs to be ~$105-145M, $79M of which was the Safeguarded AI mandate. Given we're expecting $37-100B/yr in philanthropic funding imminently, this is minuscule.

(The taut constraints are institutional capacity & talent, not capital).

Frontpage

66

theory uplift differentially benefits safety & is underleveraged

New Comment

12 comments, sorted by

top scoring

Click to highlight new comments since: Today at 5:58 PM

[-]Gurkenglas19h192

How does theory uplift differentially benefit safety? What theoretical framework is more useful for safety than for capabilities?

[-]Yudhister Kumar15h155

Singular learning theory led to the development of empirical LLC estimators and susceptibilities. no capabilities benefits to my knowledge.
Computational mechanics led to discovering belief state geometry in a variety of cases in transformers. no capabilities benefits to my knowledge
Superposition Hypothesis / Linear Representation Hypothesis led to e.g. development of SAEs and our current conception of features. In my view this had marginal capabilities benefits (activation steering was an interesting case study but often suffers at scale)
Vast majority of MIRI's work in the 2010s (reflective oracles, decision theory, logical inductors) was useful for our understanding of bounded and ideal reasoners but had little to no impact on capabilities development
I expect most classical circuit theory to be essentially irrelevant for capabilities development and somewhat useful for safety (understanding steganography, adversarially generated backdoors in models, ARC-flavored ideas)

It seems like theory, on the whole, has not really moved the needle on the development of deep learning over the last twenty-odd years. muP is an exception that ~proves the rule. I think this is a pretty general truth about the world --- America's year-on-year 2% GDP growth over the last 150 years was not caused by policies set by a politburo of growth economists.

[-]testingthewaters2h30

While I am also excited for the possibility of stronger leverage/force multipliers in theory work, I think many of these results are not yet useful for capabilities, not "no capabilities benefits". For a few examples (trying not to include too many for obvious reasons), a full understanding of SLT might lead to new optimisers that guide models to useful singularities rapidly, or a complete understanding of computational mechanics might lead to new steering and mech interp techniques that are at least dual use ("hey, the model can't do X operation! ...hey, here's a new model that can do X, its so much better on all the benchmarks!").

[-]Kaarel17h100

Our ability to take advantage of this period is bottlenecked on the quality of our specification generation infrastructure, elicitation tooling (for proofs & specs etc.), and the institutional capacity for scaling useful outputs with capital.

(assuming there will be such a period,) imo a central bottleneck is having the philosophical understanding necessary to systematically turn scientific/technological/philosophical problems into mathematical problems. ^[1] ^[2] afaik we currently have no idea how to do this. (or maybe there is not even any nice procedure for systematically doing such reductions to be found, idk.)

currently, for almost all scientific/technological/philosophical problems, if one manages to reduce them to well-defined mathematical problems, i think that's most of the way toward solving them. so, it's usually very difficult. consider how there are very few precise mathematical problems whose solutions economists or physicists would be excited about. currently, imo and afaik, there are basically no well-defined mathematical problems whose solutions would help much with the AGI situation

linking some stuff of mine relevant to this: slides of a talk on verification-based alignment schemes, slides of a talk on modeling, messy draft paper on formalizing messy questions

(if under "improving the quality of our specification generation infrastructure" you already meant to include radical philosophical progress on precisification/verification outside math, then nvm :P)

or into some other kind of precisely specified problem, if there is such a thing as a precisely specified problem that is not a math problem. or at least having the philosophical understanding necessary for doing principled verification of properties or arguments outside well-defined math ↩︎
just finding one pivotal problem that can be turned into a tractable math problem would also be sufficient ↩︎

[-]Yudhister Kumar15h10

am curious why you put 10% likelihood on [1] and [2]

[-]Yudhister Kumar16h*10

an inspiration for me is the kind of metamathematical consensus-building that occurs when e.g. assessing the validity of various cryptographic assumptions / complexity class hierarchies. ^[1] of course community consensus in these cases is fallible. but it's better than chance? relies on a variety of different kind of epistemic evidence, including proofs and intuition etc., and integrates them

so there is this capacity a community of human minds possesses to understand complicated mathematical fields beyond the strictness of formal legibility. consider what happens when evidence of the form "is \ulcorner x \urcorner true" is cheap. does this diminish net human mathematical understanding of a field? almost surely not. does the understanding fray with the power of the proof oracle? if done healthily, there's no need for this --- the understanding can just be ingested at an appropriate rate (and almost surely accelerated compared to the counterfactual where the evidence is more expensive)

now consider a community of human minds centered on solving scientific/technological/philosophical problems. I consider it quite likely that such a community's understanding is also meaningfully accelerated when evidence of the form "is \ulcorner x \urcorner" is cheap, even when "\ulcorner x \urcorner" is not a complete specification of the problem at hand. e.g., string theorists working on bootstrap principles can test axiomatizations OOMs more cheaply, game theorists/decision theorists can trivially compute equilibria in ~any setting in much more generality, moral philosophers of an analytic tradition can almost trivially test the consequences of their theories (MacAskill's latest paper and Parfit's Reasons and Persons are essentially formal, for instance). the problem of meaning-making from this evidence is harder and less solved than the mathematical case, for sure. but I find it hope worthy?

(we assume your oracle is not scheming. I am also setting aside the specgen problem for e.g. software, which seems to be solvable in practice as well as theory s.t. the natural equilibrium is provably safe code in the limit? this is what I'd put on one end of specgen hardness, where the philosophical difficulty of the problem is much reduced by virtue of the structure of the problem & field and the intellectual tradition that made it. on the other end is idk formalizing psychology / properties of mind, which is plausibly central to solving the alignment problem in full generality. there's this middle ground that I think is worth building infra for)

Edit: I guess I also sidestep your frame of "verifying pivotal problems" (I skimmed your verification slides and it seemed to have much of this vibe). if you think this is central / more of a canonical case I'm happy to consider it

see also ↩︎

[-]Jan_Kulveit16h80

Seems the bottleneck is more in the pre-formalization stage (concepts->maths). [5.2] we (ACS) are thinking about risks and opportunities of AI philosophy, but my current impression is it makes sense to focus on cyborg philosophy.

[-]Yudhister Kumar16h10

agree it makes sense to focus on cyborg philosophy (this post is essentially cyborgist in that I don't think AI-powered infra should substitute for application of human volition, and view accelerated conceptual progress as a joint AI/human project)

it does not seem to me that the bottleneck on important problems is wholly pre-formalization. there are some problems that do seem to be quite far from adequate formalizations s.t. it's likely not useful to consider atm (robustness of convergence to moral basins). there are other problems that seem to be closer (circuit decomposition of neural networks). in general I suspect that once a field is sufficiently theoretic there are many gains to be had from testing intuitions w/formalizations & iterating quickly

[-]jelly18h80

I don’t necessarily agree or disagree with you, but you might be interested in reading Formal Methods are not Slopless

[-]Yudhister Kumar16h20

Am familiar! In the regime of "near-superhumanly efficient proof oracle" I expect the vast majority of slop to originate from the specification gap, and the rest of the problems highlighted I expect to fall to data & scale.

[-]TsviBT18h64

[1] We will likely have near-superhuman mathematics AI by Q1 2027.[1]

I think there's a lot of worlds where AI is superhuman at something that isn't "come up with good math concepts", but still kinda looks like a "more efficient proof oracle for ~any well-specified mathematics problem". (It wouldn't actually be the same, but it would be a bit hard to tell; you could notice by seeing problems the AI tries but fails to resolve, but that humans eventually do resolve.)

safety-relevant mathematical output

In worlds where alignment is harder than easy, guarantees are more wanted. Hard worlds are less hard now because guarantees are easier. This is incredible and should be leveraged.

I don't especially have an opinion about relative investments in this; but I'll note that what would be needed for alignment is less like "more guarantees resting on proofs" and more like "better conceptual insight".

[-]Yudhister Kumar16h30

I think there's a lot of worlds where AI is superhuman at something that isn't "come up with good math concepts", but still kinda looks like a "more efficient proof oracle for ~any well-specified mathematics problem". (It wouldn't actually be the same, but it would be a bit hard to tell; you could notice by seeing problems the AI tries but fails to resolve, but that humans eventually do resolve.)

I do think good theory-building is still meaningfully accelerated here bc testing the consequences of definitions and such is much easier, and I probably consider the proportion of these worlds to be smaller than you, but I agree these worlds exist.

I don't especially have an opinion about relative investments in this; but I'll note that what would be needed for alignment is less like "more guarantees resting on proofs" and more like "better conceptual insight".

I also agree with this. (I think I also addressed this in my response to Kaarel's comment)

Moderation Log