This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role.
For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy.
I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.
I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:
A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.
I think it's reasonable to aim for quantity within 2 OOM of RLHF.
Do you mean that on-paper solutions should aim to succeed with no more than 1/100 as much human data as RLHF, or no more than 100 times as much? And are you referring the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?
I think that data quality is a helpful framing of outer alignment for a few reasons:
I do think that the framing is less helpful if the answer to my question is "not much", but that's currently still unclear to me, for the reasons I give in the post.
I agree that data quality doesn't guarantee robustness, but that's a general argument about how helpful it is to decompose alignment into outer alignment and robustness. I have some sympathy for that, but it seems distinct from the question of whether data quality is a helpful framing of outer alignment.
I've not mentally carved things up that way before, but they do seem like different flavors of work (with 1 and 2 being closely related).
Another distinction I sometimes consider is between exploring a network for interpretable pieces ("finding things we understand") versus trying to exhaustively interpret part of a network ("finding things we don't understand"). But this distinction doesn't carve up existing work very evenly: the only thing I can think of that I'd put in the latter category is the work on Artificial Artificial Neural Networks.
Great post! This seems like a useful perspective to keep in mind.
Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely - instead, PPO's implicit "local" KL penalty controls the rate of policy change.
If we were in the regime of optimizing the policy significantly more, experience from traditional RL suggests that there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.
I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don't think this comes close to that, so I think the effect is much smaller.
The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to paramter count ^ 0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.
A magician asks you to choose any infinite sequence of 0s and 1s, and to start reciting it until they say stop. They will then make a prediction of the form "p% of the next n digits will be 0", and they will be correct to within 1% at least 99% of the times they perform the trick. How is the trick done?h/t Alex Arkhipov