Behavior cloning (BC) is, put simply, when you have a bunch of human expert demonstrations and you train your policy to maximize the likelihood of those demonstrations. It’s the simplest possible approach under the broader umbrella of Imitation Learning, which also includes more complicated things like Inverse Reinforcement Learning or Generative Adversarial Imitation Learning. Despite its simplicity, it’s a fairly strong baseline. In fact, prompting GPT-3 to act agent-y is essentially also BC, except that rather than cloning on a specific task, you're cloning against all of the task demonstration-like data in the training set--but fundamentally, it's a scaled up version of the exact same thing. The problem with BC that leads to miscalibration is that the human demonstrator may know more or less than the model, which results in the model being systematically over- or underconfident relative to its own knowledge and abilities.
For instance, suppose the human demonstrator has more common-sense knowledge than the model: then the human will ask questions about common sense much less frequently than the model should. However, with BC, the model will ask those questions at the exact same rate as the human, and because it now has strictly less information than the human, it will have to marginalize over the possible values of the unobserved variables using its prior to be able to imitate the human’s actions. Factoring out the model’s prior over unobserved information, this is equivalent to taking a guess at the remaining relevant info conditioned on all the other info it has (!!!), and then acting as confidently as if it had actually observed that info, since that's how a human would act (the human really did observe that information outside of the episode, but our model has no way of knowing that). This is, needless to say, a really bad thing for safety; we want our models to ask us or otherwise seek out information whenever they don't know something, not randomly hallucinate facts.
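Here is a minimal sketch of the marginalization argument in a toy setting. Everything in it is an illustrative assumption rather than a real setup: a single hidden bit `h` with prior `P_H1 = 0.7`, a hypothetical expert who sees `h` and therefore never asks, and a model whose context is empty, so that maximum-likelihood BC reduces to matching the expert's action frequencies.

```python
import random
from collections import Counter

# Hypothetical action set for the toy environment.
ACTIONS = ("act_as_if_h0", "act_as_if_h1", "ask")


def human_policy(h):
    """Hypothetical expert: observes h, so it never asks and always acts on the true h."""
    return "act_as_if_h1" if h == 1 else "act_as_if_h0"


def sample_h(rng, p_h1):
    """Draw the hidden bit from its prior."""
    return 1 if rng.random() < p_h1 else 0


def fit_bc_policy(demos):
    """Max-likelihood BC with an empty context: the empirical action frequencies."""
    counts = Counter(demos)
    return {a: counts[a] / len(demos) for a in ACTIONS}


def bc_success_rate(policy, p_h1):
    """Probability that the cloned action matches the h the model never observed."""
    return p_h1 * policy["act_as_if_h1"] + (1 - p_h1) * policy["act_as_if_h0"]


rng = random.Random(0)
P_H1 = 0.7  # assumed prior over the hidden bit
demos = [human_policy(sample_h(rng, P_H1)) for _ in range(100_000)]

bc_policy = fit_bc_policy(demos)
success = bc_success_rate(bc_policy, P_H1)

print("cloned policy:", bc_policy)
print("probability the clone's action matches the true h:", round(success, 3))
```

In this toy, the clone never selects `ask`, because the expert never did. Instead it acts on a guess at `h` drawn from roughly the prior, so it is right only about 0.7² + 0.3² = 0.58 of the time while behaving exactly as decisively as the expert, who was always right.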
In theory we could fix this by providing enough information in the context such that the human doesn’t know anything the model doesn’t also know. However, this is very impractical in the real world, because of how much information is implicit in many interactions. The worst thing about this, though, is that it fails silently. Even if we try our best to supply the model with all the information we think it needs, if we forget anything the model won't do anything to let us know; instead, it will silently roll the dice and then pretend nothing ever happened.
The reverse can also happen: if the model knows more than the human, it will collect a bunch of unnecessary info, which it then discards so that its decision is as dumb as the human's. This is generally not as dangerous, though it might still mislead us about the capabilities of the model (we might think it's less knowledgeable than it actually is), and it would use resources suboptimally, so it's still best avoided. Also, the model might do both at the same time in different domains of knowledge, if the human is more knowledgeable in one area but less in another.
Plus, you don’t even need unobserved information in the information-theoretic sense. If your agent has more (or less) logical uncertainty than the human, you end up with the exact same problem. For example, take an environment where the human or agent can choose to use a calculator that costs some small amount of reward to use. If the agent is significantly worse at mental math than the human, it will use the calculator too rarely; if it is significantly better, it will use it too often. This happens even though the agent has access to the exact same info as the human.
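The calculator example can be made concrete with a small sketch. The numbers and the reward structure are assumptions chosen for illustration: reward 1 for a correct answer, a calculator that is always right and costs `COST`, and problems of uniform difficulty, so the best usage rate is all-or-nothing.

```python
def expected_reward(error_rate, p_use_calculator, calculator_cost):
    """Expected reward per problem: 1 for a correct answer, minus the cost
    whenever the calculator is used. Assumes the calculator is always right."""
    with_calculator = 1.0 - calculator_cost
    without_calculator = 1.0 - error_rate
    return (p_use_calculator * with_calculator
            + (1.0 - p_use_calculator) * without_calculator)


def best_usage_rate(error_rate, calculator_cost):
    """Reward-maximizing usage rate for a given mental-math error rate:
    use the calculator exactly when mistakes cost more than the calculator."""
    return 1.0 if error_rate > calculator_cost else 0.0


COST = 0.05         # assumed reward cost of one calculator use
HUMAN_ERROR = 0.01  # assumed: the human is good at mental math
AGENT_ERROR = 0.30  # assumed: the agent is much worse

# The human's usage rate is sensible for the human...
human_rate = best_usage_rate(HUMAN_ERROR, COST)

# ...but BC copies that rate regardless of the agent's own error rate.
bc_reward = expected_reward(AGENT_ERROR, human_rate, COST)

# An agent that chose the rate based on its own error rate would do better.
calibrated_reward = expected_reward(
    AGENT_ERROR, best_usage_rate(AGENT_ERROR, COST), COST
)

print("reward when cloning the human's usage rate:", bc_reward)
print("reward with the agent's own best usage rate:", calibrated_reward)
```

Swapping the two error rates gives the reverse failure: the clone pays for a calculator it does not need.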
This isn’t a purely theoretical problem. This definitely happens in GPT-2/3 currently and is a serious headache for many uses of GPT already--model hallucinations have been a pretty big problem outlined in numerous papers. Further, I expect this problem to scale to even superhuman models, since this BC objective fundamentally does not incentivize calibration. Even as a component of a superhuman agent, it seems really bad if a component of the agent silently adds false assumptions with high confidence into random parts of the agent's thoughts. On the optimistic side, I think this problem is uniquely exhibitable and tractable and the solutions are scalable (superhuman BC would be uncalibrated for the exact same reasons as current BC).
Because an agent that's consistently overconfident/underconfident will get less reward, and reward is maximized when the model is calibrated, the RL objective incentivizes the model to become calibrated. However, RL comes with its own problems too. Making a good reward function that really captures what you care about and is robust against goodharting is hard, and either hand-crafting a reward or learning a reward model opens you up to goodharting, which could manifest itself in much more varied and unpredictable ways depending on many details of your setup. A hybrid BC pretrain+RL finetune setup, as is common today (since training from scratch with RL is exorbitantly expensive in many domains), could have the problems of either, both, or neither, depending on how much RL optimization is allowed to happen (e.g., by limiting the number of steps of tuning, or by having a distance penalty to keep the policy close to the BC model).
I think it would be promising to see whether miscalibration can be fixed without allowing goodharting to happen. In particular, I think some kind of distance penalty that makes it inexpensive for the model to fix calibration issues but very expensive to make other types of changes would possibly allow this. The current standard KL penalty penalizes calibration related changes the exact same as all other changes, so I don’t expect tuning the coefficient on that will be enough, and even if it works it will probably be very sensitive to the penalty coefficient, which is not ideal. Overall, I’m optimistic that some kind of hybrid approach could have the best of both worlds, but just tweaking hyperparameters on the current approach probably won’t be enough.
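For reference, the KL-penalized fine-tuning objective is usually written roughly as below. The notation is assumed here for illustration and is not taken from a specific paper: $R$ is the (hand-crafted or learned) reward, $\pi_{\mathrm{BC}}$ is the behavior-cloned policy, and $\beta$ is the penalty coefficient.

```latex
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \big]
\;-\; \beta \, \mathbb{E}_{s \sim \pi}\Big[
  D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\big\|\, \pi_{\mathrm{BC}}(\cdot \mid s) \big)
\Big]
```

Written this way, the single scalar $\beta$ prices every deviation from $\pi_{\mathrm{BC}}$ the same, whether the deviation corrects an overconfident guess or exploits a flaw in $R$. That is why adjusting $\beta$ alone seems unlikely to separate the two kinds of change.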
Since $P(x) = \sum_h P(x \mid h)\,P(h)$, sampling from $P(x)$ is equivalent to first sampling an $h$ from your prior and then conditioning on that information as if you had actually observed it.
Also relevant: "Shaking the foundations: delusions in sequence models for interaction and control", Ortega et al 2021.
Also relevant: "Against Mimicry".
(Moderation note: added to the Alignment Forum from LessWrong.)