[Probably a noob question]
I'm thinking about what an inner alignment failure might look like for GPT-3. This would have to involve some deployment context in which GPT-3 performs significantly worse (by the standards of the base objective) than it did in training. (It would involve other things too, such as GPT-3 being a mesa-optimizer.)
But to say how well GPT-3 performs on some prompt not in the training dataset, we have to have a definition of the base objective that extends beyond the training dataset. If the base objective only makes sense in the context of the training dataset, then inner alignment failure is impossible by definition.
Is the base objective "Predict the next word?" Or is it "Predict the next word, supposing what you are reading is typical 2019 Internet text?" Or is it "Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D? [where D is the dataset used to train GPT-3]" The third option is in some sense the best, because it most closely fits what we actually did to train GPT-3. But note that the logical extension of this line of reasoning is to prefer a fourth option: "Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D' [where D' is like D except that it doesn't contain any of the bits of text that GPT-3 happened to not see in training, and the randomness weights are chosen to more accurately yield the data points that GPT-3 in fact saw]."
The problem with these last two answers is that they make it undefined how well GPT-3 performs on the base objective on any prompt that wasn't in D, which then rules out psuedo-alignment by definition.
From the Risks from Learned Optimization paper:
In such a case, we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs. In reinforcement learning (RL), for example, the base objective is generally the expected return. Because the mesa-objective is not specified by the programmers, mesa-optimization opens up the possibility of a mismatch between the base and mesa- objectives, wherein the mesa-objective might seem to perform well on the training environment but lead to bad performance off the training environment. We will refer to this case as pseudo-alignment below.
Expected return in a particular environment/distribution? Or not? If not, then you may be in a deployment context where you aren't updating the weights anymore and so there is no expected return, or at least it's close to 0 because there's only any return if you can convince people to start updating your weights again!
I worry I am just confused about all this. Hence why I'm asking. What is GPT-3's base objective?