This post is a response to Paul Christiano's post “Learning the prior.”
The generalization problem
When we train models, we often end up deploying them in situations that are distinctly different from those they were trained under. Take, for example, GPT-3. GPT-3 was trained to predict web text, not to serve as a dungeon master, and the sort of queries that people present to AI Dungeon are quite different from random web text. Nevertheless, GPT-3 can perform quite well here because it has learned a policy which is general enough that it continues to function quite effectively in this new domain.
Relying on this sort of generalization, however, is potentially quite troublesome. If you're in a situation where your training and deployment data are in fact independently and identically distributed (i.i.d.), you can produce all sorts of nice guarantees about the performance of your model. For example, in an i.i.d. setting, you know that in the limit of training you'll get the desired behavior. Furthermore, even before the limit of training, you know that validation and deployment performance will track each other (up to sampling error), so you can bound the probability of catastrophic behavior in deployment by the incidence of catastrophic behavior on the validation data.
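To make the validation bound concrete, here is a minimal sketch, assuming validation and deployment points really are i.i.d. and that catastrophic behavior can be detected on each validation point. The function name and the choice of an exact zero-failure binomial bound are illustrative assumptions, not something taken from Paul's post.

```python
import math


def catastrophe_rate_upper_bound(n_validation: int, delta: float) -> float:
    """Upper bound on the per-sample catastrophe rate, given that zero
    catastrophes were observed on n_validation i.i.d. validation points.

    If the true rate were p, the chance of seeing no catastrophes in n
    samples is (1 - p) ** n. Solving (1 - p) ** n = delta for p gives a
    bound that holds with confidence 1 - delta.
    """
    if n_validation <= 0:
        raise ValueError("n_validation must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be strictly between 0 and 1")
    return 1.0 - delta ** (1.0 / n_validation)


if __name__ == "__main__":
    # 1000 clean validation points at 95% confidence: a rate of about 0.003.
    print(catastrophe_rate_upper_bound(1000, 0.05))
```

Note that the bound only shrinks like 1/n, so ruling out very rare catastrophes this way takes a correspondingly large validation set.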
In a generalization setting, on the other hand, you have no such guarantees—even in the limit of training, precisely what your model does off-distribution is determined by your training process's inductive biases. In theory, any off-distribution behavior is compatible with zero training error—the only reason machine learning produces good off-distribution behavior is because it finds something like the simplest model that fits the data. As a result, however, a model's off-distribution behavior will be highly dependent on exactly what the training process's interpretation of “simpler” is—that is, its inductive biases. And relying on such inductive biases for your generalization behavior can potentially have catastrophic consequences.
Nuances with generalization
That being said, the picture I've painted above of off-distribution generalization being the problem isn't quite right. For example, consider an autoregressive model (like GPT-3) that's just trained to learn a particular distribution. Then, if I have some set of training data D and a new data point x, there's no test you can do to determine whether x was really sampled from the same distribution as D. In fact, for any D and x, I can always give you a distribution that D could have been sampled from that assigns whatever probability I want to x. Thus, to the extent that we're able to train models that can do a good job for i.i.d. x (that is, that assign high probability to x), it's because there's an implicit prior there that's assigning a fairly high probability to the actual distribution you used to sample the data from rather than any other of the infinitely many possible distributions (this is the no free lunch theorem). Even in the i.i.d. case, therefore, there's still a real and meaningful sense in which your performance is coming from the machine learning prior.
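The construction in the paragraph above can be sketched in a toy discrete setting. The names here are illustrative assumptions: `train` stands for the training data, `x` for the new data point, and `distribution_with_target` is a hypothetical helper. It builds one distribution that gives every training point nonzero probability (so the training data could have been sampled from it) while giving `x` any chosen probability strictly between 0 and 1.

```python
from collections import Counter


def distribution_with_target(train, x, target_prob):
    """Return a dict mapping points to probabilities such that every point
    in `train` has nonzero probability and `x` has probability exactly
    `target_prob`. Points must be hashable."""
    if not 0.0 < target_prob < 1.0:
        raise ValueError("target_prob must be strictly between 0 and 1")
    others = [d for d in train if d != x]
    if not others:
        raise ValueError("need at least one training point different from x")
    counts = Counter(others)
    total = sum(counts.values())
    # Spread the remaining mass over the other training points in
    # proportion to how often each one appears.
    dist = {pt: (1.0 - target_prob) * c / total for pt, c in counts.items()}
    dist[x] = target_prob
    return dist


if __name__ == "__main__":
    train = ["a", "a", "b"]
    # The same training data is compatible with x being likely or unlikely.
    print(distribution_with_target(train, "c", 0.9))
    print(distribution_with_target(train, "c", 0.001))
```

Nothing in the training data alone distinguishes these two candidate distributions, which is why the choice between them has to come from the prior.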
It's still the case, however, that actually using i.i.d. data does give you some real and meaningful guarantees, such as the ability to infer performance properties from validation data, as I mentioned previously. However, at least in the context of mesa-optimization, you can never really get i.i.d. data thanks to fundamental distributional shifts such as the very fact that one set of data points is used in training and one set of data points is used in deployment. Paul Christiano's RSA-2048 example is a classic illustration of how that sort of fundamental distributional shift could potentially manifest. Both Paul and I have also written about possible solutions to this problem, but it's still a problem that you need to deal with even if you've otherwise fully dealt with the generalization problem.
Paul's approach and verifiability
The question I want to ask now, however, is the extent to which we can nevertheless at least somewhat stop relying on machine learning generalization and what benefits we might be able to get from doing so. As I mentioned, there's a sense in which we'll never fully be able to stop relying on generalization, but there might still be major benefits to be had from relying on it less. At first, this might sound crazy: if you want to be competitive, surely you need to be able to do generalization? And I think that's true, but the question is whether we need our machine learning models to do generalization, not whether we need generalization at all.
Paul's recent post “Learning the prior” presents a possible way to get generalization in the way that a human would generalize while relying on significantly less machine learning generalization. Specifically, Paul's idea is to use ML to learn a set of forecasting assumptions Z that maximize the human's posterior estimate of the likelihood of Z over some training data, then generalize by learning a model that predicts human forecasts given Z. Paul argues that this approach is nicely i.i.d., but for the reasons mentioned above I don't fully buy that. For example, there are still fundamental distributional shifts that I'm skeptical can ever be avoided, such as the fact that a deceptive model might care about some data points (e.g. the deployment ones) more than others (e.g. the training ones). That being said, I nevertheless think that there is still a real and meaningful sense in which Paul's proposal reduces the ML generalization burden in a helpful way, but I don't think that i.i.d.-ness is the right way to talk about that.
Rather, I think that what's special about Paul's proposal is that it guarantees verifiability. That is, under Paul's setup, we can always check whether any answer matches the ground truth by querying the human with access to Z. In practice, for extremely large Z which are represented only implicitly as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth and instead just ask the human with access to Z to verify the answer, but regardless the point is that we have the ability to check the model's answers. This is different even from directly doing something like imitative amplification, where the only ground truth we can get in generalization scenarios is either computationally infeasible (HCH) or directly references the model itself (). One nice thing about this sort of verifiability is that, if we determine when to do the checks randomly, we can get a representative sample of the model's average-case generalization behavior, something we really can't do otherwise. Of course, we still need worst-case guarantees, but having strong average-case guarantees is still a big win.
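The random-checking idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of Paul's actual procedure: `model` and `ground_truth` are hypothetical callables (the latter standing in for the human with access to the learned forecasting assumptions), and exact equality is used as the agreement test purely for simplicity.

```python
import random


def deploy_with_random_checks(model, ground_truth, inputs, check_prob, rng):
    """Run `model` on every input. With probability `check_prob`, chosen
    independently of the input and of the model's output, also query
    `ground_truth` and record whether the two disagree.

    Returns (outputs, number_checked, estimated_error_rate). The estimate
    is None when no point happened to be checked.
    """
    outputs = []
    checked = 0
    errors = 0
    for x in inputs:
        y = model(x)
        outputs.append(y)
        # The decision to check depends only on the random draw, so the
        # checked points form a representative sample of deployment.
        if rng.random() < check_prob:
            checked += 1
            if ground_truth(x) != y:
                errors += 1
    estimate = errors / checked if checked else None
    return outputs, checked, estimate


if __name__ == "__main__":
    rng = random.Random(0)
    _, n_checked, err = deploy_with_random_checks(
        model=lambda x: x % 7,
        ground_truth=lambda x: x % 7 if x % 10 else -1,  # model is wrong on multiples of 10
        inputs=range(10_000),
        check_prob=0.05,
        rng=rng,
    )
    print(n_checked, err)
```

Because the checked subset is chosen at random, the estimated error rate reflects average-case behavior only; a model that fails on a single rare input can easily pass every check.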
To achieve verifiability while still being competitive across a large set of questions, however, requires being able to fully verify answers to all of those questions. That's a pretty tall order because it means there needs to exist some procedure which can justify arbitrary knowledge starting only from human knowledge and reasoning. This is the same sort of thing that amplification and debate need to be competitive, however, so at the very least it's not a new thing that we need for such approaches.
In any event, I think that striving for verifiability is a pretty good goal that I expect to have real benefits if it can be achieved—and I think it's a much more well-specified goal than i.i.d.-ness.
EDIT: I clarify a lot of stuff in the above post in this comment chain between me and Rohin.
Note that when I say “the human with access to Z” I mean through whatever means you are using to allow the human to interface with a large, implicitly represented Z (which could be amplification, debate, etc.); for more detail see “Approval-maximizing representations.”
First off, I want to note that it is important that datasets and data points do not come labeled with "true distributions" and you can't rationalize one for them after the fact. But I don't think that's an important point in the case of i.i.d. data.
Why not just apply the no-free-lunch theorem? It says the same thing. Also, why do we care about this? Empirically the no-free-lunch theorem doesn't matter, and even if it did, I struggle to see how it has any safety implications -- we'd just find that our ML model is completely unable to get any validation performance and so we wouldn't deploy.
Want to note I agree that you never actually get i.i.d. data and so you can't fully solve inner alignment by making everything i.i.d. (also even if you did it wouldn't necessarily solve inner alignment).
I don't really see this. My read of this post is that you introduced "verifiability", argued that it has exactly the same properties as i.i.d. (since i.i.d. also gives you average-case guarantees), and then claimed it's better specified than i.i.d. because... actually I'm not sure why, but possibly because we can never actually get i.i.d. in practice?
If that's right, then I disagree. The way in which we lose i.i.d. in practice is stuff like "the system could predict the pseudorandom number generator" or "the system could notice how much time has passed" (e.g. via RSA-2048). But verifiability has the same obstacles and more, e.g. you can't verify your system if it can predict which outputs you will verify, you can't verify your system if it varies its answer based on how much time has passed, you can't verify your system if humans will give different answers depending on random background variables like how hungry they are, etc. So I don't see why verifiability does any better.
I agree—I pretty explicitly make that point in the post.
I agree that this is just the no free lunch theorem, but I generally prefer explaining things fully rather than just linking to something else so it's easier to understand the text just by reading it.
The reason I care, though, is because the fact that the performance is coming from the implicit ML prior means that if that prior is malign then even in the i.i.d. case you can still get malign optimization.
That's mostly right, but the point is that the fact that you can't get i.i.d. in practice matters because it means you can't get good guarantees from it—whereas I think you can get good guarantees from verifiability.
I agree that you can't verify your model's answers if it can predict which outputs you will verify (though I don't think getting true randomness will actually be that hard)—but the others are notably not problems for verifiability despite being problems for i.i.d.-ness. If the model gives answers where the tree of reasoning supporting those answers depends on how much time has passed or how hungry the human was, then the idea is to reject those answers. The point is to produce a mechanism that allows you to verify justifications for correct answers to the questions that you care about.
Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem -- I initially thought you were arguing "since there is no 'true' distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice".
I still don't see the distinction. Let's be particularly concrete.
Say I have D = timeseries data of new COVID cases per day for the last 14 days, and I want to predict D' = timeseries data of new COVID cases per day for the next 14 days. Maybe our Z* ends up being "on day t there will be exp(0.1 * days since Feb 17) new cases".
Our model for predicting on D' is trained via supervised learning on human predictions on randomly sampled data points of D'. We then use it to predict on other randomly sampled data points of D', thus using it in a nominally i.i.d. setting. Now, you might be worried that we learn a model that thinks something like "if RSA-2048 is not factored, then plug it into the formula from Z* and report that, otherwise say there will be a billion cases and once all the humans hide away in their houses I'll take over the world", which leverages one of the ways in which nominally i.i.d. data is not actually i.i.d. How does verifiability help with this problem?
Perhaps you'd say, "with verifiability, the model would 'show its work', thus allowing the human to notice that the output depends on RSA-2048, and so we'd see that we have a bad model". But this seems to rest on having some sort of interpretability mechanism -- I feel like you're not just saying "rather than i.i.d., we need perfect interpretability and that will give us better guarantees", but I don't know what you are saying.
Hmmm... I'll try to edit the post to be more clear there.
Because rather than just relying on doing ML in an i.i.d. setting giving us the guarantees that we want, we're forcing the guarantees to hold by actually randomly checking the model's predictions. From the perspective of a deceptive model, knowing that its predictions will just be trusted because the human thinks the data is i.i.d. gives it a lot more freedom than knowing that its predictions will actually be checked at random.
There's no need to invoke interpretability here—we can train the model to give answers + justifications via any number of different mechanisms including amplification, debate, etc. The point is just to have some way to independently check the model's answers to induce i.i.d.-like guarantees.
How is this different from evaluating the model on a validation set?
I certainly agree that we shouldn't just train a model and assume it is good; we should be checking its performance on a validation set. This is standard ML practice and is necessary for the i.i.d. guarantees to hold (otherwise you can't guarantee that the model didn't overfit to the training set).
Sure, but at the point where you're randomly deciding whether to collect ground truth for a data point and check the model on it (that is, put it in the validation data) or collect new data using the model to make predictions, you have verifiability. Importantly, though, you can get verifiability without doing that—including if the data isn't actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model's output against some ground truth. In either situation, though, part of the point that I'm making is that the safety benefits are coming from the verifiability part not the i.i.d. part—even in the simple example of i.i.d.-ness giving you validation data, what's mattering is that the validation and deployment data are i.i.d. (because that's what gives you verifiability), but not whether the training and validation/deployment data are i.i.d.
This is taking the deployment data, and splitting it up into validation vs. prediction sets that are i.i.d. (via random sampling), and then applying the i.i.d. theorem on results from the validation set to make guarantees on the prediction set. I agree the guarantees apply even if the training set is not from the same distribution, but the operation you're doing is "make i.i.d. samples and apply the i.i.d. theorem".
At this point we may just be debating semantics (though I do actually care about it in that I'm pretty opposed to new jargon when there's perfectly good ML jargon to use instead).
Alright, I think we're getting closer to being on the same page now. I think it's interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it's an argument that we shouldn't be that worried about whether the training data is i.i.d. relative to the validation/deployment data. Second, it opens the door to an even further relaxation, which is that you can do the validation while looking at the model's output. That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that's just as good as actually checking against the ground truth. At that point, though, it really stops looking anything like the standard i.i.d. setup, which is why I'm hesitant to just call it “validation/deployment i.i.d.” or something.
Fair enough. In practice you still want training to also be from the same distribution because that's what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
This seems to rely on an assumption that "human is convinced of X" implies "X"? Which might be fine, but I'm surprised you want to rely on it.
I'm curious what an algorithm might be that leverages this relaxation.
Well, I'm certainly concerned about relying on assumptions like that, but that doesn't mean there aren't ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions—for example, for debate to work it needs to be the case that H being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train f(x | Z) via debate over what H(x | Z) would do if H could access the entirety of Z, then randomly do full debate rollouts during deployment. Like I mention in the post, this still just gives you average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.
I'm not sure what "just ask the human to verify the answer given Z" looks like, for implicitly represented Z
There are lots of ways to allow H to interface with an implicitly represented Z, but the one Paul describes in “Learning the Prior” is to train some model Mz(⋅, z) which represents Z implicitly by responding to human queries about Z (see also “Approval-maximizing representations” which describes how a model like Mz could represent Z implicitly as a tree).
Once H can interface with Z, checking whether some answer is correct given Z is at least no more difficult than producing an answer given Z—since H can just produce their answer then check it against the model's using some distance metric (e.g. an autoregressive language model)—but could be much easier if there are ways for H to directly evaluate how likely H would be to produce that answer.
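The two checking modes described above can be sketched side by side. All names here are hypothetical: `answerer` stands for H producing its own answer given Z, `likelihood` for H directly scoring a proposed answer, and the character-level `string_distance` built on `difflib` is only a stand-in for whatever distance metric would actually be used.

```python
import difflib


def string_distance(a: str, b: str) -> float:
    """Toy distance in [0, 1]: 0 for identical strings, 1 for no overlap."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


def verify_by_recomputing(answerer, question, proposed, distance, tolerance):
    """Mode 1: produce an independent answer, then compare it with the
    proposed answer. Costs as much as answering from scratch."""
    own_answer = answerer(question)
    return distance(own_answer, proposed) <= tolerance


def verify_by_direct_evaluation(likelihood, question, proposed, threshold):
    """Mode 2: score how likely the proposed answer is without producing
    an independent one. Possibly cheaper, but the evaluator sees the
    proposed answer."""
    return likelihood(question, proposed) >= threshold


if __name__ == "__main__":
    answerer = lambda q: "cases keep growing exponentially"
    ok = verify_by_recomputing(
        answerer,
        "What happens next week?",
        "cases keep growing exponentially",
        string_distance,
        tolerance=0.1,
    )
    print(ok)  # True: the recomputed answer is identical to the proposed one
```

The first mode never shows the proposed answer to the evaluator before it commits to its own, which is the property that the second mode gives up in exchange for efficiency.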
Right, but in the post the implicitly represented Z is used by an amplification or debate system, because it contains more information than a human can quickly read and use (so are you assuming it's simple to verify the results of amplification/debate systems?)
Ah, sorry, no—I was assuming you were just using whatever procedure you used previously to allow the human to interface with Z in that situation as well. I'll edit the post to be more clear there.
Okay, that makes more sense now. My understanding is that, for a question X, an answer Y from the ML system, and an amplification system A, verification in your quote means asking A to answer "Would A(Z) output answer Y to question X?", as opposed to asking A to answer "X" and then checking whether its answer equals "Y". This can be at most as hard as running the original system, and maybe could be much more efficient.
Yep; that's what I was imagining. It is also worth noting that it can be less safe to do that, though, since you're letting A(Z) see Y, which could bias it in some way that you don't want—I talk about that danger a bit in the context of approval-based amplification here and here.