Toby_Ord — LessWrong

Replying to I ate bear fat with honey and salt flakes, to prove a point

Toby_Ord 3mo

I found the thread!

Replying to I ate bear fat with honey and salt flakes, to prove a point

Toby_Ord 3mo

Love it!

I'm actually kind of to blame for this whole honey/fat/salt thing.

Eliezer had long used ice cream as an example of a super-stimulus that hacks our evolved tastes by delivering a combination of sweetness/fat/salt that was beyond anything available in our ancestral environment. I'd realised that this wasn't actually true as honey, animal fat and salt were all available in the ancestral environment and when I saw him giving this argument again I weighed in with this example (sadly I can't find the thread). I was probably a bit rude, but Eliezer sportingly engaged on it and I find it hilarious that he is now using this as an example, that you tried it, and that it is actually quite good!

(I can't remember who decided that it should be 'bear fat' in particular — that might have been Eliezer's addition — it does fit nicely with the honey!)

Replying to How Well Does RL Scale?

Toby_Ord 4mo

I'm a bit confused here. Your first paragraph seems to end up agreeing with me? i.e. that RL scaling derives most of its importance from enabling inference-scaling and is dependent on it. I'm not sure we really have any disagreement there — I'm not saying people will stop doing any RL.

Re WTP, I do think it is quite hard to scale. For example consider consumer use. Many people are paying ~$1 per day for AI access (the $20/month subscriptions). If companies need to 1000x inference in order to get the equivalent of a GPT level, then consumers would need to pay ~$1000 per day, which most people won't do (and can't do). Indeed,... (read more)

Replying to How Well Does RL Scale?

Toby_Ord 4mo

I do think that progress will slow down, though it's not my main claim. My main claim is that the tailwind of compute scaling will become weaker (unless some new scaling paradigm appears or a breakthrough saves this one). That is a piece in the puzzle of whether overall AI progress will accelerate or decelerate and I'd ideally let people form their own judgments about the other pieces (e.g. whether recursive self improvement will work, or whether funding will collapse in a market correction, taking away another tailwind of progress). But having a major boost to AI progress (compute scaling) become less of a boost is definitely some kind of... (read more)

Replying to How Well Does RL Scale?

Toby_Ord 4mo

I agree that separately from its direct boost to performance at the same inference-compute, RL training also helps enable more inference scaling. I talk about that above when I say "this RL also unlocked the ability to productively use much longer chains of thought (~30x longer in this example). And these longer chains of thought contributed a much larger boost."

A key thing I'm trying to get across is that I think this is where most of the benefit from RL is coming from. i.e. that while you pay the RL scaling costs at training time, you also need to pay the inference scaling costs at deployment time in order to get the... (read more)

Replying to How Well Does RL Scale?

Toby_Ord 4mo

Yes, you would get an optimal allocation with non-zero amounts to each. A simple calculation suggests a 1:2 ratio of RL-OOMs to inference-OOMs, e.g. scaling up RL by 100x and inference by 10,000x. So it could easily lead to RL compute becoming an ever-smaller fraction of FLOPs. But there are additional complications from the fact that inference is a flow of costs and also increases with the number of users, while RL is a fixed cost.

On the simple model and with my scaling numbers, the contribution of RL to capabilities (keeping token-use fixed) would be 20% — a 1:4 ratio with inference because half as many OOMs and half the effect per OOM.

The... (read more)
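
The 20% figure above can be checked with a few lines of arithmetic. This is only a sketch of the simple model as described in the comment: the function name `rl_share` and the normalisation (inference gain per OOM set to 1.0, RL gain per OOM set to half of that) are assumptions made for illustration, not something taken from the post.

```python
def rl_share(rl_ooms, inference_ooms, rl_gain_per_oom=0.5, inference_gain_per_oom=1.0):
    """Fraction of the total capability gain attributed to RL scaling.

    Assumes gains are additive in orders of magnitude (OOMs), with RL
    giving half the effect per OOM that inference scaling gives.
    """
    rl_gain = rl_ooms * rl_gain_per_oom
    inference_gain = inference_ooms * inference_gain_per_oom
    return rl_gain / (rl_gain + inference_gain)


# 100x RL compute is 2 OOMs; 10,000x inference compute is 4 OOMs.
share = rl_share(rl_ooms=2, inference_ooms=4)
print(f"RL share of the gain: {share:.0%}")  # prints "RL share of the gain: 20%"
```

With half as many OOMs and half the effect per OOM, RL contributes 1 unit against 4 units from inference, which is the 1:4 ratio quoted above.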

Replying to How Well Does RL Scale?

Toby_Ord 4mo

Actually, here is a slightly simpler way to think about it. How many more training steps do you do with RL when you 100x the compute? Given the linear episode length growth, you only do root(100) = 10x the number of training steps. So if capability gain were linear in the log of the number of training steps, it would grow as log(root(compute)) = log(compute)/2, whereas for pretraining it would grow as log(compute). So if inference-scaling were going as well as pre-training scaling (contra the 3/2 estimate I appealed to in my piece) then the information inefficiency theoretical explanation could exactly account for the observed scaling behaviour.

I'm not sure this is right (there were a couple of biggish assumptions there) but it does feel closer to being able to be a larger part of the actual explanation.
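
The square-root relationship above can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the author's calculation: it treats training step i as costing `start_len + growth * i` tokens, uses a continuous approximation for the total cost, and the names `training_steps`, `growth` and `start_len` are hypothetical. The exact 10x ratio only holds when the linear growth term dominates (here `start_len = 0`).

```python
import math


def training_steps(compute, growth=1.0, start_len=0.0):
    """Number of RL steps affordable for a given compute budget.

    Assumes step i costs (start_len + growth * i) tokens, so the total
    cost of n steps is roughly start_len * n + growth * n**2 / 2.
    Solves that quadratic for n.
    """
    discriminant = start_len ** 2 + 2.0 * growth * compute
    return (-start_len + math.sqrt(discriminant)) / growth


base_steps = training_steps(1e6)
scaled_steps = training_steps(1e8)  # 100x the compute budget
ratio = scaled_steps / base_steps

print(f"steps ratio for 100x compute: {ratio:.2f}")
print(f"OOMs of steps gained: {math.log10(ratio):.2f} vs OOMs of compute: 2")
```

If capability were linear in the log of the number of steps, the second line shows the halving: 2 OOMs of compute buy only 1 OOM of training steps.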

Replying to How Well Does RL Scale?

Toby_Ord 4mo

Thanks Jacob. It is less of a mathematical mistake and more me trying to make a qualitative connection between the observed poor scaling of RL training and the theoretical mechanism of poor information efficiency that I'd just written about, both of which look very big. I agree that the theoretical explanation doesn't seem to be quite the right shape to explain the empirical issue.

Of your potential reasons, I do think longer episodes is part of it. The R1 paper has a chart on page 8 showing that without trying to affect episode lengths, they increased linearly from 500 tokens to ~9000 tokens over 8000 episodes, suggesting pretty much 1 token increase per episode on... (read more)

How Well Does RL Scale?

Toby_Ord

4mo

Summary: RL-training for LLMs scales surprisingly poorly. Most of its gains are from allowing LLMs to productively use longer chains of thought, allowing them to think longer about a problem. There is some improvement for a fixed length of answer, but not enough to drive AI progress. Given the scaling up of pre-training compute also stalled, we'll see less AI progress via compute scaling than you might have thought, and more of it will come from inference scaling (which has different effects on the world). That lengthens timelines and affects strategies for AI governance and safety.

The current era of improving AI capabilities using reinforcement learning (from verifiable rewards) involves two key types... (read 1992 more words →)

Toby_Ord 4mo

Thanks for this Jacob — excellent analysis.

I'm a huge fan of Bradley-Terry models. I'm quite sure they are the natural way of representing noisy contests like chess ability and that Elo is an inferior way. The key thing with Bradley-Terry is that each competitor has a raw ability score (e.g. A and B) and that when they have a contest the odds of A beating B are just A:B. I think of it as each player putting a number of tickets of their colour into a hat, with one then drawn at random to determine the winner. This is an even simpler interpretation than the one from the Hex paper and... (read more)
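
The tickets-in-a-hat picture can be written down directly. The following is a minimal sketch, assuming only the odds rule stated above (odds of A:B, so a win probability of A / (A + B)); the function names and the example ability scores are hypothetical and chosen for illustration.

```python
import random


def win_probability(a, b):
    """Bradley-Terry probability that a player with ability a beats one with ability b."""
    return a / (a + b)


def simulate_contests(a, b, n, seed=0):
    """Estimate A's win rate by drawing one ticket per contest from a hat
    holding `a` tickets for A and `b` tickets for B."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        # A uniform draw over all a + b tickets lands on one of A's with probability a / (a + b).
        if rng.random() * (a + b) < a:
            wins += 1
    return wins / n


print(win_probability(3, 1))             # odds of 3:1 give 0.75
print(simulate_contests(3, 1, 100_000))  # should be close to 0.75
```

Only the ratio of the two scores matters here: abilities of 3 and 1 give the same win probability as 300 and 100.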

Replying to MONA: Managed Myopia with Approval Feedback

Toby_Ord 1y

Thanks — this looks promising.

One thing I noticed is that there is an interesting analogy between your model and a fairly standard model in economics where society consists of a representative agent in each time period (representing something like a generation, but without overlap) each trying to maximise its own utility. They can plan based on the utilities of subsequent generations (e.g. predicting that the next generation will undo this generation's policies on some topic) but they don't inherently value those utilities. This is then understood via the perspective of a planner who wants to maximise the (discounted) sum of future utilities, even though each agent in the model is only trying... (read more)