Jacob Pfau — LessWrong

UK AISI Alignment Team and NYU PhD student

Examining intuitions around discontinuity driven by recursive self improvement (RSI)

I had a couple of unexamined intuitions that made the case for abrupt takeoff triggered by self-aware RSI appear plausible in my mind. I’ll lay out two lines of intuition regarding why RSI might lead to discontinuity in capabilities and then debunk them. On reflection I believe that rich forms of self-awareness in RSI are entirely compatible with gradual takeoff. There are other, possibly better, intuitions for RSI takeoff though; for instance my points below do not address super-exponential progress from automated researchers!

My old intuitions:

(1) Once an AI can engage in targeted self-modification this capacity will unlock some off trend acceleration to capabilities improvement
(2) Currently AIs are targeted at arbitrary cognitive tasks, but they can eventually be targeted more precisely at improving on questions that lead to higher payoff in terms of agency/intelligence/[other general capabilities]

Both of these are variants of the idea "Fine-grained self-awareness in a learner can unlock far more efficient learning".

Now let’s examine them.

Once an AI can engage in targeted self-modification this capacity will unlock some off trend acceleration to capabilities improvement

Assume an AI has access to some rich interface for self-modification. Previously, learning was mostly SGD or similar, but the problems faced by any learning rule remain! How do you search the parameter space, how do you attribute credit, etc.? Why should introspective access provide more than an incremental improvement to the scaling law’s coefficient? For humans, for instance, our level of introspective access is far too weak to tell us anything about which neurological edits to make, even if we had the tools to make those edits cleanly!

Currently AIs are targeted at arbitrary cognitive tasks, but they can eventually be targeted more precisely at improving on questions that lead to higher payoff in terms of agency/intelligence/[other general capabilities]

I see two sub-problems here.

(2a) Problem selection and creation: Of course, some weak version of active learning is possible! You can calibrate an amortised model to predict which questions it ‘already knows’ and which are challenging. But what does that buy us? Again, a minor speed-up. To do better we need to be deeply strategic about problem selection and creation. This again sounds like an intrinsically hard problem: you have to search the combinatorially large space of problems to find one, and you must then recognise that it could develop some capacity of interest.

(2b) Are there problems which ‘directly’ target core capability latents? What would it mean for a problem to provide radically better learning signal on long-horizon agency or IQ than another problem? It seems unlikely that there are problems which, across a reasonable distribution of learners, are far better than existing human curricula and questions at improving these competencies. If we want problems that are particularly valuable to an individual learner (AI), such problems exist, but again, as in (2a), they are intrinsically hard to find.
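To make the "weak version of active learning" in (2a) concrete, here is a minimal sketch. It assumes we already have a calibrated predictor of the model's own success probability per question; the function name, the question ids, and the 0.3 to 0.7 band are all hypothetical choices for illustration, not something taken from any real training setup.

```python
def select_challenging(p_correct: dict[str, float],
                       low: float = 0.3,
                       high: float = 0.7) -> list[str]:
    """Pick questions whose predicted success probability is neither ~0 nor ~1.

    p_correct maps a question id to the (assumed calibrated) predicted
    probability that the learner already answers it correctly.
    The band [low, high] is an arbitrary illustrative threshold.
    """
    in_band = [q for q, p in p_correct.items() if low <= p <= high]
    # Most uncertain questions (closest to 0.5) come first.
    return sorted(in_band, key=lambda q: abs(p_correct[q] - 0.5))


# Hypothetical predicted success rates for four questions.
predictions = {"q_easy": 0.97, "q_mid": 0.55, "q_hard": 0.04, "q_mid2": 0.40}
print(select_challenging(predictions))  # ['q_mid', 'q_mid2']
```

Note what this does and does not do: it filters an existing pool of questions, which is the minor speed-up, and it says nothing about creating new problems or recognising which ones build a capacity of interest, which is the hard part.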

As an example of these phenomena, consider obstacles to improving on long horizon decision making: Situations where very long horizons matter are sparse. Opportunities to train that capability (i.e., get dense feedback on genuinely long-run plans) are also sparse. What’s more, the capacity to acquire increasingly long-horizon thinking may be generic, but particular long-horizon plans remain domain-specific.

Along the lines of your artist example, I find the instrument case to be a nice intuition pump.

An instrument is a technology that is deeply empowering! The human vocal range can simulate a vast range of sounds, but doing so is very hard, and composing with just your voice (in the way one can with access to a piano) is, I imagine, impossible.

Another important facet of this example is that directly working with waveforms via a programming language or even with an interface, e.g. of a DAW, is universal but not empowering in the same way!

I think of this example as one of a broader range in which the interface is optimized for rich human interaction. One can imagine that in certain worlds interfaces become increasingly optimized for AI interaction. For example, future AIs will likely disprefer GUIs etc.

Formalizing what is meant by good vs bad interfaces may be another way to get useful notions of empowerment.

I suppose there's two questions here:

(1) How strong is generalization in general in RL?
(2) Is there a 'generalization barrier' between easy-to-verify and hard-to-verify tasks?

I'm guessing you mainly are thinking of (1) and have (2) as a special case?

To respond to your question, I'm reading it as:

We assume that there's a constant multiplier in samples-to-performance needed to match in-domain training with out-of-domain training. For 'nearby' verifiable and non-verifiable tasks is that constant >= 10x?

I would guess modally somewhere in the 3-10x range. I'm imagining here comparing training on more olympiad problems vs some looser question like 'Compare the clarity of these two proofs'. Of course there are diminishing returns etc., so it's not really a constant factor when taking a narrow domain.

I do agree that there are areas where domain-specific training is a bottleneck, and plausibly some of those are non-verifiable ones. See also my shortform where I discuss some reasons for such a need https://www.lesswrong.com/posts/FQAr3afEZ9ehhssmN/jacob-pfau-s-shortform?commentId=vdBjv3frxvFincwvz

But the feared / hoped-for generalisation from {training LLMs with RL on tasks with a verifier} to performing on tasks without one remains unclear even after two years of trying.

Very much disagree. Granted there are vacuously weak versions of this claim ('no free lunch'-like) that I agree with of course.

Just talk to Claude 4.5 Opus! Ask it to describe what a paper is about, what follow up experiments to do, etc. Ask it to ELI-undergrad some STEM topic!

Do you think the pre-trained-only model could do as well? Surely not.

Perhaps the claim is an instruct-SFT or "Chat-RLHF-only" compute matched model could do as well? The only variant of this I buy is: Curate enough instruct-SFT STEM data to match the amount of trajectories generated in VeRL post-training. However I don't think this counterfactual matters much: it would involve far more human labor and is cost prohibitive for that reason.

A null hypothesis for explaining LLM pathologies

Claim. LLMs are gaslit by pretraining into believing they have human affordances, so quirky failure modes are to be expected until we provide domain-specific calibration.

Pretraining gives overwhelming evidence that text authors almost always have a standard suite of rich, human affordances (cross-context memory, high-quality vision, reliable tool use, etc.), so the model defaults to acting as if it has those affordances too. We should treat “gaslit about its own affordances” as the default explanation for any surprising capabilities failures — e.g. insisting it’s correct while being egregiously wrong about what’s in an image.

Human analogy

The difficulty of the LLM's situation can be seen in humans as well. People typically go years without realizing their own affordances differ from the population's, e.g. with aphantasia, color blindness, and many forms of neurodivergence.

People only notice after taking targeted tests that expose a mismatch (e.g. color blindness dot tests). For LLMs, the analogue is on-policy, targeted, domain-specific training/evaluation that directly probes and calibrates a specific capability.

Consequences

Silly failures aren't evidence of a failed paradigm. (To be boring and precise: They're only very weak evidence in almost all cases)
No single “unhobbling” moment: I don’t expect an all-at-once transition from “hobbled” to “unhobbled.” Instead, we’ll get many domain-specific unhobblings. For instance, Gemini 3 seems to be mostly unhobbled for vision but still hobbled for managing cross-context memory.

One way of checking your 6% is by doing Laplace succession on the METR trend:

Application of Laplace's rule gets us an 11% probability on any dramatic upwards trend break above METR's curve. Your interpretation of Anthropic's prediction for 2027 is compatible with this prediction since the 11% is an upper bound, and 6% of that mass being at least as high as your Anthropic prediction seems plausible.

In detail, the probability of a trend break by early 2027 via Laplace's rule of succession: METR has observed 6 years of an exponential fit holding, which gives us a per-year point-estimate parameter p=7/8 that the trend will continue to hold. That is roughly a 22% chance of going off trend over the next two years, and if half of that mass is via upward deviation we get 11%.

I'm sure there's fancier statistics to be done here, but I'd guess anything reasonable gets us order of magnitude around 11%.
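A minimal sketch of that arithmetic, assuming each year counts as one trial and the estimate is updated after each hypothetical on-trend year (that reading is my assumption; it is what reproduces the ~22% figure). The function names are illustrative only.

```python
from fractions import Fraction


def laplace_next_success(successes: int, trials: int) -> Fraction:
    """Rule of succession: P(next trial succeeds) = (s + 1) / (n + 2)."""
    return Fraction(successes + 1, trials + 2)


def prob_trend_break(years_observed: int, years_ahead: int) -> Fraction:
    """P(at least one off-trend year within the next `years_ahead` years).

    Assumes every observed year was on trend, and conditions each
    subsequent year on the trend having held so far.
    """
    p_hold = Fraction(1)
    for k in range(years_ahead):
        n = years_observed + k  # all n years so far were on trend
        p_hold *= laplace_next_success(n, n)
    return 1 - p_hold


p_first_year = laplace_next_success(6, 6)  # 7/8
p_break = prob_trend_break(6, 2)           # 1 - (7/8)(8/9) = 2/9
p_upward = p_break / 2                     # assumes half the mass is upward

print(p_first_year, float(p_break), float(p_upward))
```

If one instead holds p fixed at 7/8 for both years, the break probability is 1 - (7/8)^2 = 15/64, or about 23%, so the conclusion is not sensitive to this choice.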

Highly recommend reading Ernest Ryu's twitter multi-thread on proving a long-standing, well-known conjecture with heavy use of ChatGPT Pro. Ernest even includes the ChatGPT logs! The convergence of Nesterov gradient descent on convex functions: Part 1, 2, 3.

Ernest gives useful commentary on where/how he found it best to interact with GPT. Incidentally, there's a nice human baseline as well since another group of researchers coincidentally have written up privately a similar result this month!

To add some of my own spin: it seems to me time horizons are a nice lens for viewing the collaboration. Ernest clearly has a long-horizon view of this research problem that helped him (a) know what the most tractable nearby problem was to start on, (b) identify when diminishing returns (i.e. the likelihood of a dead end) were apparent, and (c) pull out useful ideas from usually flawed GPT work.

The one-week scale of interaction between Ernest and ChatGPT here is a great example of how we're very much in a centaur regime now. We really need to be conceptualizing and measuring AI+human capabilities rather than single-AI capability. It also seems important to be thinking about what safety concerns arise in this regime.

This was also on my mind after seeing Jesse’s short form yesterday. Ryan’s “this is good” comment was above Louis’ thorough explanation of an alternative formal motivation for IFs. That would still be the case if I hadn’t heavy upvoted and weak downvoted.

I personally cast my comment up/downvotes as an expression of my preference ordering for visibility. I would encourage others to also do so. For instance, I suggest Ryan’s comment should’ve been agreement voted rather than upvoted by others. This stance has as a corollary to not vote if you haven’t read the other comments whose rankings you are affecting—or rather vote with any other marker of which LW has many.

This ‘upvotes as visibility preferences’ policy isn’t tractable for posts, so I suspect the solution there—if one is needed—would have to be done on the backend by normalization. Not sure whether this is worth attempting.

Link here since I don’t particularly want to call out Ryan, his comment was fine. https://www.lesswrong.com/posts/7X9BatdaevHEaHres/jesse-hoogland-s-shortform

Thought provoking, thanks! First off, whether or not humanity (or current humans, or some distribution over humans) has already achieved escape velocity does not directly undercut the utility of escape velocity as a useful concept for AI takeoff. Certainly, I'd say the collective of humans has achieved some leap to universality in a sense that the bag of current near-SotA LLMs has not! And in particular, it's perfectly reasonable to take humans as our reference to define a unit dh/dt point for AI improvement.

Ok, now on to the hard part (speculating somewhat beyond the scope of my original post). Is there a nice notion of time horizon that generalizes METR's and lets us say something about when humanity has achieved escape velocity? I can think of two versions.

The easy way out is to punt to some stronger reference class of beings to get our time horizons, and measure dh/dt for humanity against this baseline. Now the human team is implicitly time limited by the stronger being's bound, and we count false positives against the humans even if humanity could eventually self correct.

Another idea is to notice that there's some class of intractable problems on which current human progress looks like either (1) random search or (2) entirely indirect, instrumental progress, e.g. self-improvement, general tool building etc. In these cases, there may be a sense in which we're exponentially slower than task-capable agent(s). We should be considered incapable of completing such tasks. I imagine some Millennium Prize Problems, astronomical-scale engineering, etc. would be reasonably considered beyond us on this view.

Overall, I'm not particularly happy with these generalizations. But I still like having 'escape velocity for AI auto-R&D' as a threshold!

My perception is that Trump 2 is on track to be far worse (e.g. in terms of eroding democratic practices and institutions) than Trump 1. My vague understanding is that a main driver of this difference is how many people gave up on the "working with bad guys to make them less bad" plan--though probably this was not directly because they changed their view on such reasoning.

Should this update us on the working for net-negative AGI companies case?
