How much alignment data will we need in the long run?

This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role.

For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy.
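One way to write this statement down, as a rough sketch rather than a definition taken from the post: the symbols below are all assumed notation, with $\Pi_{\text{aligned}}$ and $\Pi_{\text{unaligned}}$ standing for the sets of aligned and unaligned policies, $\mathcal{D}$ for the training distribution, and $R(\tau)$ for the reward of an episode $\tau$ generated by running a policy on $\mathcal{D}$.

```latex
\exists\, \pi_a \in \Pi_{\text{aligned}} \;\; \forall\, \pi_u \in \Pi_{\text{unaligned}} : \quad
\mathbb{E}_{\tau \sim (\pi_a,\, \mathcal{D})}\big[ R(\tau) \big]
\;>\;
\mathbb{E}_{\tau \sim (\pi_u,\, \mathcal{D})}\big[ R(\tau) \big]
```

Under this reading, the data distribution $\mathcal{D}$ enters only through the two expectations, which is why its quality decides whether the inequality holds.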

I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.

How much alignment data will we need in the long run?

I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:

  • Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that is much more than a few hundred hours of human feedback.
  • I think it's pretty likely that we'll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don't want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.

How much alignment data will we need in the long run?

A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.

How much alignment data will we need in the long run?

"I think it's reasonable to aim for quantity within 2 OOM of RLHF."

Do you mean that on-paper solutions should aim to succeed with no more than 1/100 as much human data as RLHF, or no more than 100 times as much? And are you referring to the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?

How much alignment data will we need in the long run?

I think that data quality is a helpful framing of outer alignment for a few reasons:

  • Under the assumption of a generic objective such as reinforcement learning, outer alignment is definitionally equivalent to having high enough data quality. (More precisely, if the objective is generic enough that it is possible for it to produce an aligned policy, then outer alignment is equivalent to the data distribution being such that an aligned policy is preferred to any unaligned policy.)
  • If we had the perfect alignment solution on paper, we would still need to implement it. Since we don't yet have the perfect alignment solution on paper, we should entertain the possibility that implementing it involves paying attention to data quality (whether in the sense of scalable oversight or in a more mundane sense).
  • It's not a framing I've seen before, and I think it's helpful to have different framings for things.

I do think that the framing is less helpful if the answer to my question is "not much", but that's currently still unclear to me, for the reasons I give in the post.

I agree that data quality doesn't guarantee robustness, but that's a general argument about how helpful it is to decompose alignment into outer alignment and robustness. I have some sympathy for that, but it seems distinct from the question of whether data quality is a helpful framing of outer alignment.

Deep learning curriculum for large language model alignment

I've not mentally carved things up that way before, but they do seem like different flavors of work (with 1 and 2 being closely related).

Another distinction I sometimes consider is between exploring a network for interpretable pieces ("finding things we understand") versus trying to exhaustively interpret part of a network ("finding things we don't understand"). But this distinction doesn't carve up existing work very evenly: the only thing I can think of that I'd put in the latter category is the work on Artificial Artificial Neural Networks.

RL with KL penalties is better seen as Bayesian inference

Great post! This seems like a useful perspective to keep in mind.

Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely - instead, PPO's implicit "local" KL penalty controls the rate of policy change.
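For reference, here is a minimal sketch of the KL-penalized objective that the theoretical picture concerns, on a toy finite set of outcomes. It assumes the standard form "expected reward minus beta times KL(policy || prior)", whose maximizer is proportional to prior × exp(reward / beta). The function names, the three-outcome example and the value of `beta` are all illustrative assumptions, not anything taken from the post or from a PPO implementation.

```python
import math

def kl_regularized_optimum(prior, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(pi || prior) on a finite set:
    pi*(x) is proportional to prior(x) * exp(r(x) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(prior, rewards)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]

def kl_penalized_objective(policy, prior, rewards, beta):
    """Expected reward under `policy` minus beta * KL(policy || prior)."""
    expected_reward = sum(q * r for q, r in zip(policy, rewards))
    kl = sum(q * math.log(q / p) for q, p in zip(policy, prior) if q > 0)
    return expected_reward - beta * kl

# Hypothetical toy problem: three outcomes, a prior policy, and a reward for each.
prior = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 1.0  # weight on the KL penalty (assumed value)

optimum = kl_regularized_optimum(prior, rewards, beta)
optimum_value = kl_penalized_objective(optimum, prior, rewards, beta)
prior_value = kl_penalized_objective(prior, prior, rewards, beta)
```

The explicit penalty here pins down a unique optimum. Dropping it, as described above, relies instead on the optimizer's own step-size control to limit how far the policy moves.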

If we were in the regime of optimizing the policy significantly more, experience from traditional RL suggests that there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.

[Link] Training Compute-Optimal Large Language Models

I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don't think this comes close to that, so I think the effect is much smaller.

[Link] Training Compute-Optimal Large Language Models

The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to parameter count ^ 0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.
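The factor of 10 can be checked with a short calculation. This is a sketch under two assumptions that are not spelled out above: training compute is proportional to parameter count times number of datapoints, and the old and new scaling estimates agree at current model sizes. The function name and its arguments are illustrative.

```python
def compute_estimate_ratio(scale_factor, old_exponent=0.8, new_exponent=1.0):
    """Ratio of the new to the old training-compute estimate for a model with
    `scale_factor` times more parameters than current models.

    Assumes compute is proportional to parameters * datapoints, with
    datapoints proportional to parameters ** exponent.
    """
    old_compute = scale_factor * scale_factor ** old_exponent
    new_compute = scale_factor * scale_factor ** new_exponent
    return new_compute / old_compute

# 100,000 times larger than current models, as in the example above.
ratio = compute_estimate_ratio(100_000)
print(f"{ratio:.2f}x more compute than previously estimated")
```

The ratio simplifies to scale_factor ^ 0.2, so for 100,000 it comes out at about 10, up to floating-point rounding.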

What are the best elementary math problems you know?

A magician asks you to choose any infinite sequence of 0s and 1s, and to start reciting it until they say stop. They will then make a prediction of the form "p% of the next n digits will be 0", and they will be correct to within 1% at least 99% of the times they perform the trick. How is the trick done?

h/t Alex Arkhipov
