
This question stands out to me because:

  • It should directly affect empirical alignment priorities today
  • While it is informed by both theoretical and empirical evidence, it seems tractable for purely theoretical alignment researchers to make progress on today

It's even possible that theoretical alignment researchers already consider this to be a solved problem, in which case I think it would be valuable to have a carefully-reasoned write-up whose conclusions empirical alignment practitioners can feel confident in.

Thanks to Paul Christiano for discussion that prompted this post and to Jan Leike for comments.

Why this should affect empirical alignment priorities today

Outer alignment can be framed as a data quality problem. If our alignment training data correctly favors aligned behavior over unaligned behavior, then we have solved outer alignment. But if there are errors in our data that cause an unaligned policy to be preferred, then we have a problem.

It is common to worry about errors in the alignment training data that arise from evaluation being too difficult for humans. I think this makes sense for two reasons:

  • Firstly, errors of this kind specifically incentivize models to deceive the human evaluators, which seems like an especially concerning variety of alignment failure.
  • Secondly, errors of this kind will get worse with model capability, which is a scary dynamic: models would get more misaligned as they became more powerful.

Nevertheless, I think we could still get catastrophic alignment failures from more mundane kinds of data quality issues. If we had the perfect scalable alignment solution, but the humans in the loop simply failed to implement it correctly, that could be just as bad as not using the solution at all.

But prevention of mundane kinds of data quality issues could look very different depending on the amount of data being collected:

  • If a large amount of alignment training data is needed, then a significant amount of delegation will be required. Hence practitioners will need to think about how to choose who to delegate different tasks to (including defending against adversaries intentionally introducing errors), how to conduct quality control and incentivize high data quality, how to design training materials and interfaces to reduce the likelihood of human error, and so on.
  • If only a small amount of alignment training data is needed, then it will be more feasible to put a lot of scrutiny on each datapoint. Perhaps practitioners will need to think about how to appropriately engage the public on the choice of each datapoint in order to maintain public trust.

Hence settling the question of how much alignment training data we will need in the long run seems crucial for deciding how much empirical alignment efforts should invest in the first versus the second kind of effort.

In practice, we may collect both a larger amount of lower-quality data and a smaller amount of higher-quality data, following some quality-quantity curve. The generalized form of the question then becomes: what is the probability of alignment for a given quality-quantity curve? Practitioners will then be able to combine this with feasibility considerations to decide what curve to ultimately follow.
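
As a purely illustrative sketch of what reasoning about a quality-quantity curve could look like, consider a toy model in which a fixed budget of human hours is spread over some number of labels, the per-label error rate rises as scrutiny per label falls, and we ask how likely the resulting data is to still favor the aligned policy. The functional forms, thresholds, and numbers below are assumptions invented for illustration, not claims from this post:

```python
import math

def label_error_rate(n_labels: int, budget_hours: float) -> float:
    """Toy quality-quantity curve: with a fixed budget of human hours,
    collecting more labels leaves less scrutiny per label, so the
    per-label error rate rises. The functional form is made up."""
    hours_per_label = budget_hours / n_labels
    return 1.0 / (1.0 + 10.0 * hours_per_label)

def p_data_favors_aligned_policy(n_labels: int, budget_hours: float,
                                 tolerable_error_fraction: float = 0.05) -> float:
    """Toy probability that the dataset still prefers the aligned policy:
    a normal approximation to the chance that the realized fraction of
    erroneous labels stays below some tolerable threshold."""
    eps = label_error_rate(n_labels, budget_hours)
    mean = eps
    std = math.sqrt(eps * (1.0 - eps) / n_labels)
    z = (tolerable_error_fraction - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Sweep the quantity axis for a fixed 1,000-hour labeling budget.
for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} labels -> P(data favors aligned policy) ~ "
          f"{p_data_favors_aligned_policy(n, budget_hours=1_000):.3f}")
```

The point is only the shape of the exercise: plug a feasibility-constrained quality-quantity curve into some model of how labeling errors translate into unaligned policies being preferred, and compare points along the curve.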

Initial thoughts on this question

Considerations in favor of less alignment training data being required:

  • Larger models are more sample-efficient than smaller models, especially in the presence of pre-training. Hence for a given task we should expect the amount of alignment training data we need to go down over time.
  • There could be many rounds of fine-tuning used to teach models the precise details of performing certain tasks, and data quality may only be of great importance for the last few rounds.
  • We could design training schemes that are largely self-supervised, with models performing most of the reasoning about how different outputs are good for humans, and these training schemes might not require much human data.
  • In the limit, we could teach the model everything about the evaluation process in an entirely unsupervised way, and then use an extremely small amount of human data simply to get the model to recognize the output of this evaluation process as its objective.
  • Put another way, the information content of the instruction "be intent aligned" is very small once you have a model capable enough to understand exactly what you mean by this.

Considerations in favor of more alignment training data being required:

  • Model-free RL is very sample-inefficient, since reward is a channel with very low information density. It takes tens or hundreds of millions of samples for models to learn to play simple video games very well when trained from scratch. So we may be coming from a starting point of needing vast amounts of data to perfectly align models on complex tasks, and may still need a lot even if this amount goes down over time. Model-based RL is more sample-efficient, but could risk introducing unwanted bias.
  • From-scratch scaling laws have separate terms for model size and number of samples, implying a cap on sample efficiency in the infinite model size limit (see the illustrative formula after this list). These scaling laws do not apply to pre-trained models, but there should still be an information-theoretic lower bound on sample efficiency independent of model size.
  • Self-supervised approaches might not pan out, or they may be subject to instabilities that lead to misalignment, and so we may prefer approaches that require more human data over relying on model-generated data that hasn't been scrutinized as closely.
  • Deliberately telling the model about the evaluation process could make it more likely to exploit that process, so we again may prefer alternative approaches requiring more human data. Not telling the model about the evaluation process isn't a scalable defense, since sufficiently smart models should be able to infer most of the relevant details anyway, but we might still prefer our chances with this defense in place.
  • Even if the instruction "be intent aligned" has little information content, we may generally feel better about our alignment chances if we use methods that directly supervise specific tasks, rather than methods that try to decouple alignment into a phase of its own. As models get smarter, we will want them to perform harder and more specialized tasks, and so the information content of how we want them to behave may increase over time.
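
As an illustration of the scaling-law point flagged above, one published from-scratch scaling law (Kaplan et al., 2020) takes the form

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D},$$

where $N$ is model size and $D$ is the number of samples. Even in the limit $N \to \infty$, the loss stays bounded below by $(D_c/D)^{\alpha_D}$, so model scale alone cannot substitute for data under this law; the bullet above suggests that some analogous information-theoretic floor may survive even with pre-training.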

I think it's also worth studying not just the long-run limit, but also how we should expect the amount of alignment data we will need to change over time, since we are uncertain about the scale at which we could get dangerous misalignment. Empirical research could shed a lot of light on short-term trends, but we should be wary of extrapolating these too far if they seem at odds with theoretical conclusions.

Comments

I love the framing of outer alignment as a data quality problem!

As an illustrative data point, the way Google generates "alignment data" for its search evals is by employing thousands of professional raters and training them to follow a 200-page handbook (!) that operationalizes the concept of a "good search result".

Yes, I expect us to need some trusted data from humans. The cleverer we are the less we need. I think it's reasonable to aim for quantity within 2 OOM of RLHF.

But... no, outer alignment is not a data quality problem, any more than outer alignment is a cosmic ray problem because if only the right cosmic rays hit my processor, it would be outer aligned.

You're probably not the right target for this rant, but I typed it so oh well, sorry.

Yes, you could "just" obtain perfect labeled data about human actions, perfectly on-distribution, until a large NN converges, and get something that's as aligned as a human on-distribution. But that's not a real solution. Real solutions use obtainable amounts of data with obtainable quality, which requires being clever, which means doing all that thinking about outer alignment that isn't just about data quality. Also, real solutions integrate work on on- and off-distribution alignment. You can't just build something that generalizes poorly and then bolt generalization capabilities onto it afterward; you need to do outer alignment that includes desiderata for generalization properties.

I think that data quality is a helpful framing of outer alignment for a few reasons:

  • Under the assumption of a generic objective such as reinforcement learning, outer alignment is definitionally equivalent to having high enough data quality. (More precisely, if the objective is generic enough that it is possible for it to produce an aligned policy, then outer alignment is equivalent to the data distribution being such that an aligned policy is preferred to any unaligned policy.)
  • If we had the perfect alignment solution on paper, we would still need to implement it. Since we don't yet have the perfect alignment solution on paper, we should entertain the possibility that implementing it involves paying attention to data quality (whether in the sense of scalable oversight or in a more mundane sense).
  • It's not a framing I've seen before, and I think it's helpful to have different framings for things.

I do think that the framing is less helpful if the answer to my question is "not much", but that's currently still unclear to me, for the reasons I give in the post.

I agree that data quality doesn't guarantee robustness, but that's a general argument about how helpful it is to decompose alignment into outer alignment and robustness. I have some sympathy for that, but it seems distinct from the question of whether data quality is a helpful framing of outer alignment.

I think my big disagreement is with point one - yes, if you fix the architecture as something with bad alignment properties, then there is probably some dataset / reward signal that still gives you a good outcome. But this doesn't work in real life, and it's not something I see people working on such that there needs to be a word for it.

What deserves a word is people starting by thinking about both what we want the AI to learn and how, and picking datasets and architectures in tandem based on a theoretical story of how the AI is going to learn what we want it to.

A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.

I think it's reasonable to aim for quantity within 2 OOM of RLHF.

Do you mean that on-paper solutions should aim to succeed with no more than 1/100 as much human data as RLHF, or no more than 100 times as much? And are you referring to the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?

Yeah I just meant the upper bound of "within 2 OOM." :) If we could somehow beat the lower bound and get aligned AI with just a few minutes of human feedback, I'd be all for it.

I think aiming for under a few hundred hours of feedback is a good goal because we want to keep the alignment tax low, and that's the kind of tax I see as being easily payable. An unstated assumption I made is that I expect we can use unlabeled data to do a lot of the work of alignment, making labeled data somewhat superfluous, but I still think the amount of feedback is important.

As for why I think it's possible, I can only plead intuition about what I expect from on-the-horizon advances in priors over models of humans, and ability to bootstrap models from unlabeled data plus feedback.

I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:

  • Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that is much more than a few hundred hours of human feedback (arithmetic spelled out after this list).
  • I think it's pretty likely that we'll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don't want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.
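
Spelling out the arithmetic in the first bullet:

$$\sim 10^{3}\ \text{hours of feedback} \times 10^{2} = 10^{5}\ \text{hours},$$

which is far more than a few hundred hours.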

Put another way, the information content of the instruction "be intent aligned" is very small once you have a model capable enough to understand exactly what you mean by this.

(The point I'm about to make may be indirectly addressed in your last bullet point in the list, "Considerations in favor of more alignment training data being required.") 

Regarding the sentence I quoted, I have the intuition that, if a system is smart enough to grok the nuances of "be intent aligned" but it isn't yet intent aligned, that seems like a problem... With this sort of understanding, the system must already have a goal, right? There's no "ghost of advanced cognition" – you don't get intelligence without some objective. Further training could shape the instincts in the system's internal architecture (which could maybe lead to changes in its goal after more training runs?), but there's no guarantee that the system's initial goal (the one it had before we tried to align it) won't somehow be too persistent to go away.

I'm probably just restating something that's obvious to everyone? I guess it doesn't necessarily get much easier if you try to reward the system for (what looks like) "aligned actions" even when its capabilities are at baby level, because at that early level you're not selecting for a deep understanding of alignment?

Edit: After thinking about it some more, I think there is a big (potential) difference! If the training step that's meant to align the AI comes after other sorts of training that gave it advanced cognitive capabilities, it can use its advanced understanding to fake being aligned. By contrast, if you constantly reward it for actions that look like they're aligned, you're at least somewhat more likely to build up instincts that are genuine and directed at the right sort of thing.

This roughly matches some of the intuitions behind my last bullet that you referenced.

Tangential question (and maybe this isn't the sort of thing to go into too much detail on a public forum), but I'm quite curious about what alignment training would look like in practice.  Are there notes on this anywhere? 

For instance, what should I imagine a "training episode" to be like? Something similar to Ought's experiments with factorized cognition? Some person doing "work as a CEO of an EA org" while they have an input and mouse movement tracker on their laptop? The AI playing some kind of open-ended game to gather resources and negotiate in self-play, with people looking at it and distributing positive and negative points for various actions? (Probably not this one – I don't see why that would lead to alignment.) The AI writing up plans for how it would do a given assistance task, with people rating these plans in terms of safety, norm following, and common sense understanding (on top of the plans actually being workable)?

It seems like "alignment training" is such a vague category that I don't really know what to envision, which bottlenecks my thinking in a lot of related areas and is a bit frustrating.

(I guess there's more than one question implicit in my query. On the one hand, I'm wondering how systems with various "pivotal" / "transformative" capabilities would be trained to be safe/aligned. On the other hand, I'm wondering what sort of system people have in mind, whether it'll be an AI CEO or some more domain-limited application.)

It's hard to predict (especially if timelines are long), but if I had to guess I would say that something similar to human feedback on diverse tasks will be the unaligned benchmark we will be trying to beat. In that setting, a training episode is an episode of an RL environment in which the system being trained performs some task and obtains reward chosen by humans.

It's even harder to predict what our aligned alternatives to this will look like, but they may need to be at least somewhat similar to this in order to remain competitive. In that case, a training episode might look more-or-less the same, but with reward chosen in a more sophisticated way and/or with other training objectives mixed in. The post I linked to discusses some possible modifications.
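
For concreteness, here is a minimal sketch of the kind of episode loop described above, assuming a gym-style environment interface. The `human_reward` hook and all names are hypothetical stand-ins for however the human-chosen reward is actually produced, not a description of any particular implementation:

```python
from typing import Callable, List, Tuple

def run_training_episode(
    env,                                              # gym-style env with reset()/step()
    policy: Callable[[object], object],               # maps observations to actions
    human_reward: Callable[[object, object], float],  # hypothetical human-in-the-loop reward
    max_steps: int = 1000,
) -> List[Tuple[object, object, float]]:
    """One training episode: the policy acts in the environment, and each
    transition is scored by a human-chosen reward rather than by any
    reward native to the environment."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, _, done, _ = env.step(action)  # ignore the env's own reward, if any
        reward = human_reward(obs, action)        # reward chosen by human evaluators
        trajectory.append((obs, action, reward))
        if done:
            break
        obs = next_obs
    return trajectory
```

An aligned alternative in the sense of the second paragraph might keep this overall shape while swapping in a more sophisticated `human_reward` (e.g. assisted or recursively decomposed evaluation) or mixing in other training objectives.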

JanB:

Relevant: Scaling Laws for Transfer: https://arxiv.org/abs/2102.01293

If our alignment training data correctly favors aligned behavior over unaligned behavior, then we have solved outer alignment.

I'm curious to understand what this means, what "data favoring aligned behavior" means particularly. I'll take for granted as background that there are some policies that are good ("aligned" and capable) and some that are bad. I see two problems with the concept of data favoring a certain kind of policy:

  1. Data doesn't specify generalization. For any achievable training loss on some dataset, there are many policies that achieve that loss, and some of them will be good, some bad.
  2. There's ambiguity in what it means for training data to favor some behavior. On the one hand, there's the straightforward interpretation of the labels: the data specifies a preference for one kind of behavior over another. On the other hand, there's the behavior of the policies actually found by training on this data. These can come far apart if the data's main effect on the policies found is to imbue them with a rich world model and goals. There isn't necessarily a straightforward relationship between the labels and the goals.

I realise you're focusing on "outer alignment" here, and maybe these are not outer alignment problems.

This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role.

For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy.
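
In symbols, one way to write this: letting $\mathcal{D}$ be the training distribution, $R$ the reward assigned by the humans in the loop, $\Pi_A$ the aligned policies, and $\Pi_U$ the unaligned ones, the statement reads

$$\exists\, \pi^{*} \in \Pi_A \;\; \forall\, \pi \in \Pi_U:\quad \mathbb{E}_{\mathcal{D}}\big[R(\pi^{*})\big] > \mathbb{E}_{\mathcal{D}}\big[R(\pi)\big].$$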

I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.