This post is an attempt at reformulating some of the points I wanted to make in “Safe exploration and corrigibility” in a clearer way. This post is standalone and does not assume that post as background.
In a previous comment thread, Rohin argued that safe exploration is currently defined as being about the agent not making “an accidental mistake.” I think that definition is wrong, at least to the extent that I think it both doesn't make much sense and doesn't describe how I actually expect current safe exploration work to be useful.
First, what does it mean for a failure to be an “accident?” This question is simple from the perspective of an engineer outside the whole system—any unintended failure is an accident, encapsulating the majority of AI safety concerns (i.e. “accident risk”). But that's clearly not what the term “accidental mistake” is pointing at in this context—rather, the question here is what is an accident from the perspective of the model? Intuitively, an accident from the perspective of the model should be some failure that the model didn't intend or wouldn't retroactively endorse. But that sort of a definition only makes sense for highly coherent mesa-optimizers that actually have some notion of intent. Maybe instead we should be thinking of this from the perspective of the base optimizer/loss function? That is, maybe a failure is an accidental failure if the loss function wouldn't retroactively endorse it (e.g. the model got a very low reward for making the mistake). By this definition, however, every generalization failure is an accidental failure such that safe exploration would just be the problem of generalization.
Of all of these definitions, the definition defining an accidental failure from the perspective of the model as a failure that the model didn't intend or wouldn't endorse seems the most sensical to me. Even assuming that your model is a highly coherent mesa-optimizer such that this definition makes sense, however, I still don't think it describes current safe exploration work, and in fact I don't think it's even really a safety problem. The problem of producing models which don't make mistakes from the perspective of their own internal goals is precisely the problem of making powerful, capable models—that is, it's precisely the problem of capability generalization. Thus, to the extent that it's reasonable to say this for any ML problem, the problem of accidental mistakes under this definition is just a capabilities problem. However, I don't think that at all invalidates the utility of current safe exploration work, as I don't think that current safe exploration work is actually best understood as avoiding “accidental mistakes.”
If safe exploration work isn't about avoiding accidental mistakes, however, then what is it about? Well, let's take a look at an example. Safety Gym has a variety of different environments containing both goal states that the agent is supposed to reach and unsafe states that the agent is supposed to avoid. From OpenAI's blog post: “If deep reinforcement learning is applied to the real world, whether in robotics or internet-based tasks, it will be important to have algorithms that are safe even while learning—like a self-driving car that can learn to avoid accidents without actually having to experience them.” Why wouldn't this happen naturally, though—shouldn't an agent in a POMDP always want to be careful? Well, not quite. When we do RL, there are really two different forms of exploration happening:
- Within-episode exploration, where the agent tries to identify what particular environment/state it's in, and
- Across-episode exploration, which is the problem of making your agent explore enough to gather all the data necessary to train it properly.
In your standard episodic POMDP setting, you get within-episode exploration naturally, but not across-episode exploration, which you have to explicitly incentivize. Because we have to explicitly incentivize across-episode exploration, however, it can often lead to behaviors which are contrary to the goal of actually trying to achieve the greatest possible reward in the current episode. Fundamentally, I think current safe exploration research is about trying to fix that problem—that is, it's about trying to make across-episode exploration less detrimental to reward acquisition. This sort of a problem is most important in an online learning setting where bad across-episode exploration could lead to catastrophic consequences (e.g. crashing an actual car to get more data about car crashes).
Thus, rather than define safe exploration as “avoiding accidental mistakes,” I think the right definition is something more like “improving across-episode exploration.” However, I think that this framing makes clear that there are other types of safe exploration problems—that is, there are other problems in the general domain of making across-episode exploration better. For example, I would love to see an exploration of how different across-episode exploration techniques impact capability generalization vs. objective generalization—that is, when is across-episode exploration helping you collect data which improves the model's ability to achieve its current goal versus helping you collect data which improves the model's goal? Because across-episode exploration is explicitly incentivized, it seems entirely possible to me that we'll end up getting the incentives wrong somehow, so it seems quite important to me to think about how to get them right—and I think that the problem of getting them right is the right way to think about safe exploration.
With some caveats—in fact, I think a form of across-episode exploration will be instrumentally incentivized for an agent that is aware of the training process it resides in, though that's a bit of a tricky question that I won't try to fully address now (I tried talking about this somewhat in “Safe exploration and corrigibility,” though I don't think I really succeeded there). ↩︎