EDIT: I now think this post is somewhat confusing and would recommend starting with my more recent post “Exploring safe exploration.”

Balancing exploration and exploitation is a classic problem in reinforcement learning. Historically, with approaches such as deep Q-learning, exploration is done explicitly via a rule such as ϵ-greedy exploration or Boltzmann exploration. With more modern approaches, however (especially policy gradient approaches like PPO that aren't amenable to something like Boltzmann exploration), the exploration is instead entirely learned, with some sort of extra term in the loss implicitly encouraging exploratory behavior. This is usually an entropy term, though more advanced approaches have also been proposed, such as random network distillation, in which the agent learns to explore states for which it would have a hard time predicting the output of a random neural network. That approach was able to set a state of the art on Montezuma's Revenge, an Atari environment that is notoriously difficult because of how much exploration it requires.
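
To make the contrast concrete, here is a minimal sketch of the two explicit rules and of an entropy bonus. The function names, the list-of-Q-values representation, and the coefficient `beta` are illustrative assumptions, not taken from any particular library or from the methods cited above.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; a higher temperature gives a flatter distribution."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, temperature, rng=random):
    """Sample an action in proportion to its softmax weight."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def loss_with_entropy_bonus(policy_loss, action_probs, beta):
    """Hypothetical loss: subtracting beta * entropy rewards a more stochastic policy."""
    return policy_loss - beta * entropy(action_probs)
```

In the first two functions the exploration rule sits outside the learned values and can be inspected separately. In the last one, the only handle on exploration is a term in the loss, so whatever exploratory behavior results is part of the learned policy itself.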

This move to learned exploration has a very interesting and important consequence, however, which is that the safe exploration problem for learned exploration becomes very different. Making ϵ-greedy exploration safe is in some sense quite easy, since the way it explores is totally random. If you assume that the policy without exploration is safe, then for ϵ-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average, which is just a standard engineering question. With learned exploration, however, this becomes much more complicated: there's no longer a nice “if the non-exploratory policy is safe” assumption that can be used to cleanly subdivide the overall problem of off-distribution safety, since it's just a single, learned policy doing both exploration and exploitation.
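
One way to spell out the “safe on average” claim, as a sketch under assumptions not stated in the post: exploration picks uniformly from a finite action set $A$, $U(s) \subseteq A$ is the set of unsafe actions in state $s$, and $\pi_{\text{greedy}}$ is the non-exploratory policy. The per-step risk in state $s$ then splits into a policy term and an environment term:

```latex
\Pr[\text{unsafe action} \mid s]
  = (1 - \epsilon)\,\mathbb{1}\!\left[\pi_{\text{greedy}}(s) \in U(s)\right]
  + \epsilon\,\frac{|U(s)|}{|A|}
```

If the greedy action is safe in every state that is reached, the first term vanishes and the remaining risk, $\epsilon\,|U(s)|/|A|$, depends only on $\epsilon$ and on how many unsafe actions the environment offers. A single learned policy that does both exploration and exploitation admits no such split.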

First, though, an aside: why is learned exploration so much better? I think the answer lies primarily in the following observation: for most problems, exploration is an instrumental goal, not a terminal one, which means that to do exploration “right” you have to do it in a way which is cognizant of the objective you're trying to optimize for. Boltzmann exploration is better than ϵ-greedy exploration because its exploration is guided by its exploitation, but it's still essentially just adding random jitter to your policy. Fundamentally, though, exploration is about the value of information, such that proper exploration requires dynamically balancing the value of information with the value of exploitation. Ideally, in this view, exploration should arise naturally as an instrumental goal of pursuing the given reward function: an agent should instrumentally want to get updated in such a way that causes it to become better at pursuing its current objective.

Except there's a really serious problem with that reasoning: instrumental exploration only cares about the value of information for helping the model to achieve the goal it's learned so far, not for helping it fix its goal to be more aligned with the actual goal.[1] Consider, for instance, my maze example. Instrumental exploration will help the model better explore the larger maze, but it won't help it figure out that its objective of finding the green arrow is misaligned. That is, it won't, for example, lead to the model trying both the green arrow and the end of the maze to see which one is right. Furthermore, because the instrumental exploration actively helps the model explore the larger maze better, it improves the model's capability generalization without also helping its objective generalization, leading to precisely the most worrying case in the maze example. If we think about this problem from a 2D robustness perspective, we can see that what's happening is that instrumental exploration gives us capability exploration but not objective exploration.

Now, how does this relate to corrigibility? To answer that question, I want to split corrigibility into three different subtypes:

  1. Indifference corrigibility: An agent is indifference corrigible if it is indifferent to modifications made to its goal.
  2. Exploration corrigibility: An agent is exploration corrigible if it actively searches out information to help you correct its goal.
  3. Cooperation corrigibility: An agent is cooperation corrigible if it optimizes under uncertainty over what goal you might want it to have.

Previously, I grouped the latter two into act-based corrigibility, though recently I've been moving towards thinking that act-based corrigibility isn't as well-defined as I previously thought it was. However, I think the concept of objective exploration lets us disentangle act-based corrigibility. Specifically, I think exploration corrigibility is just indifference corrigibility plus objective exploration, and cooperation corrigibility is just exploration corrigibility plus corrigible alignment.[2] That is, if a model is indifferent to having its objective changed and actively optimizes for the value of information in terms of helping you change its current objective, that gives you exploration corrigibility, and if its objective is also a “pointer” to what you want, then you get cooperation corrigibility. Furthermore, I think this helps solve a lot of the problems I previously had with corrigible alignment, as indifference corrigibility and exploration corrigibility together can help you prevent crystallization of deceptive alignment.

Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment. Safety Gym, for example, has a variety of different environments containing both goal states that the agent is supposed to reach and unsafe states that the agent is supposed to avoid. One particularly interesting recent work in this domain was Leike et al.'s “Learning human objectives by evaluating hypothetical behaviours,” which used human feedback on hypothetical trajectories to learn how to avoid environmental traps. In the context of the capability exploration/objective exploration dichotomy, I think a lot of this work can be viewed as putting a damper on instrumental capability exploration. What's nice about that lens, in my opinion, is that it both makes clear how and why such work is valuable while also demonstrating how much other work there is to be done here. What about objective exploration—how do we do it properly? And do we need measures to put a damper on objective exploration as well? And what about cooperation corrigibility—is the “right” way to put a damper on exploration through constraints or through uncertainty? All of these are questions that I think deserve answers.


  1. For a mesa-optimizer, this is saying that the mesa-optimizer will only explore to help its current mesa-objective, not to help it fix any misalignment between its mesa-objective and the base objective.

  2. Note that this still leaves the question of what exactly indifference corrigibility is unanswered. I think the correct answer to that is myopia, which I'll try to say more about in a future post. For this post, though, I just want to focus on the other two types.


(This entire comment is setting aside embedded agency concerns, except for mesa optimization)

You seem to be equivocating between two notions of exploration. Consider an agent that is trained via RL to do well on a distribution of environments. Then there are two kinds of exploration:

  • Across-episode exploration: Exploration across training trajectories, where the RL algorithm collects trajectories going to various different parts of the state space in the environment, in order to figure out where the reward is.
  • Within-episode exploration: Exploration within a single trajectory, where you try to identify which particular environment has been sampled, so that you can tailor your trajectory to that environment.

In across-episode exploration, the exploration is being done by some human-designed algorithm. (I would claim this of RND, Boltzmann exploration, ϵ-greedy and entropy bonuses.) I agree that these work because you want to tailor your exploration based on the value of information, but the agent isn't evaluating the value of information and deciding where to explore, the human-designed algorithm is doing that. So mesa optimization is not going to affect this.

In within-episode exploration, the exploration is being done directly by the policy, and so it is reasonable to talk about how a mesa optimizer would do such exploration.

With that in mind, some thoughts:

With more modern approaches, however—especially policy gradient approaches like PPO that aren't amenable to something like Boltzmann exploration—the exploration is instead entirely learned, encouraged by some sort of extra term in the loss to implicitly encourage exploratory behavior.

This initially sounds like you are talking about across-episode exploration, but then the phrase "entirely learned" makes me think you are talking about things within the domain of a mesa optimizer, i.e. within-episode exploration. Right now this is just semantics, but I think it plays into my confusion below.

Making ϵ-greedy exploration safe is in some sense quite easy, since the way it explores is totally random.

Usually (in ML) "safe exploration" means "the agent doesn't make a mistake, even by accident"; ϵ-greedy exploration wouldn't be safe in that sense, since it can fall into traps. I'm assuming that by "safe exploration" you mean "when the agent explores, it is not trying to deceive us / hurt us / etc".

exploration should arise naturally as an instrumental goal of pursuing the given reward function—though current RL methods aren't quite good enough to get that yet, those methods which are closer to it are starting to perform better.

Since by default policies can't affect across-episode exploration, I assume you're talking about within-episode exploration. But this happens all the time with current RL methods, e.g. one consequence of domain randomization was that OpenAI Five would go explore what Roshan's health was for the current Dota game. In general, it'll happen with any POMDP with a random initial state. We even have examples of this where you are (kind of) exploring the objective: Learning to Interactively Learn and Assist.

instrumental exploration gives us capability exploration but not objective exploration.

As you mention later, you would get objective exploration if the agent had uncertainty over the objective.

An agent is cooperation corrigible if it optimizes under uncertainty over what goal you might want it to have.

This sounds to me like reward uncertainty, assistance games / CIRL, and more generally Stuart Russell's agenda, except applied to mesa optimization now. Should I take away something other than "we should have our mesa optimizers behave like the AIs in assistance games"? I feel like you are trying to say something else but I don't know what.

Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment.

I thought we were talking about "the agent doesn't try to deceive us / hurt us by exploring", which wouldn't tell us anything about the problem of "the agent doesn't make an accidental mistake".

(Aside: these problems should not both be called safe exploration; they seem ~unrelated to me.)

What about objective exploration—how do we do it properly?

The same way as capability exploration; based on value of information (VoI). (I assume you have a well-specified distribution over objectives; if you don't, then there is no proper way to do it, in the same way there's no proper way to do capability exploration without a prior over what you might see when you take the new action.)
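
As a toy illustration of a VoI calculation over objectives (a sketch only: the two candidate objectives borrow the green arrow and maze exit from the post's maze example, while the payoffs, priors, and function names are made-up assumptions), the expected value of perfect information about the objective can be computed like this:

```python
def expected_value(prior, payoffs, action):
    """Expected payoff of an action under a prior over candidate objectives."""
    return sum(p * payoffs[obj][action] for obj, p in prior.items())


def value_of_perfect_information(prior, payoffs, actions):
    """Gain from learning the true objective before acting, versus acting on the prior."""
    value_without = max(expected_value(prior, payoffs, a) for a in actions)
    value_with = sum(
        p * max(payoffs[obj][a] for a in actions) for obj, p in prior.items()
    )
    return value_with - value_without


# Hypothetical payoffs: each candidate objective rewards a different destination.
ACTIONS = ["go_to_arrow", "go_to_exit"]
PAYOFFS = {
    "green_arrow": {"go_to_arrow": 1.0, "go_to_exit": 0.0},
    "maze_exit": {"go_to_arrow": 0.0, "go_to_exit": 1.0},
}

uncertain = {"green_arrow": 0.5, "maze_exit": 0.5}
certain = {"green_arrow": 1.0, "maze_exit": 0.0}

print(value_of_perfect_information(uncertain, PAYOFFS, ACTIONS))  # 0.5
print(value_of_perfect_information(certain, PAYOFFS, ACTIONS))    # 0.0
```

Under these assumptions, an agent that is certain of its objective assigns zero value to information about the objective, which is the situation the post describes as capability exploration without objective exploration.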

And do we need measures to put a damper on objective exploration as well?

You only need to put dampers on exploration if you're concerned that the agent cannot make proper VoI calculations for optimal exploration. (Alternatively, you can remove exploration altogether if you can provide the information that would be gained via exploration some other way, e.g. from human input; this allows you to avoid the otherwise-unavoidable regret incurred through exploration.)

My perspective is that Safety Gym and things like it are proposing that we specify objectives via rewards + constraints, because incorporating the constraints in the reward function is difficult (it requires tuning a hyperparameter that specifies the tradeoff between obtaining reward and avoiding constraint violations). Separately, it also proposes that we measure reward / constraint violation throughout training, as a way to measure regret (rather than just test-time performance). The algorithms used are not putting dampers on exploration; they are trying to get the agent to do better exploration (e.g. if you crashed into the wall and saw that that violated a constraint, don't crash into the wall again just because you forgot about that experience).
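
To illustrate the tuning problem, here is a toy sketch that does not use Safety Gym's actual API; the trajectories, numbers, and function names are all hypothetical. With a penalty folded into the reward, which behaviour wins depends on the coefficient, whereas an explicit cost limit states the requirement directly:

```python
def penalized_return(rewards, costs, penalty):
    """Single scalar objective: total reward minus penalty times total cost."""
    return sum(rewards) - penalty * sum(costs)


def best_under_penalty(trajectories, penalty):
    """Pick the trajectory with the highest penalized return."""
    return max(
        trajectories,
        key=lambda name: penalized_return(*trajectories[name], penalty),
    )


def best_under_constraint(trajectories, cost_limit):
    """Pick the highest-reward trajectory whose total cost stays within the limit."""
    feasible = [n for n, (_, costs) in trajectories.items() if sum(costs) <= cost_limit]
    if not feasible:
        return None
    return max(feasible, key=lambda name: sum(trajectories[name][0]))


# Hypothetical data: name -> (per-step rewards, per-step constraint costs).
TRAJECTORIES = {
    "cautious": ([1.0, 1.0], [0.0, 0.0]),
    "reckless": ([5.0, 5.0], [1.0, 1.0]),
}

print(best_under_penalty(TRAJECTORIES, penalty=1.0))        # reckless
print(best_under_penalty(TRAJECTORIES, penalty=5.0))        # cautious
print(best_under_constraint(TRAJECTORIES, cost_limit=0.0))  # cautious
```

Summing the costs over every trajectory collected during training, rather than only at test time, gives the regret-style measurement described above.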

And what about cooperation corrigibility—is the “right” way to put a damper on exploration through constraints or through uncertainty?

If you have the right uncertainty, then acting optimally to maximize that is the "right" thing to do.

evhub

I completely agree with the distinction between across-episode vs. within-episode exploration, and I agree I should have been clearer about that. Mostly I want to talk about across-episode exploration here, though when I was writing this post I was mostly motivated by the online learning case where the distinction is in fact somewhat blurred, since in an online learning setting you do in fact need the deployment policy to balance between within-episode exploration and across-episode exploration.

Usually (in ML) "safe exploration" means "the agent doesn't make a mistake, even by accident"; ϵ-greedy exploration wouldn't be safe in that sense, since it can fall into traps. I'm assuming that by "safe exploration" you mean "when the agent explores, it is not trying to deceive us / hurt us / etc".

Agreed. My point is that “If you assume that the policy without exploration is safe, then for ϵ-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average, which is just a standard engineering question.” That is, even though it seems like it's hard for ϵ-greedy exploration to be safe, it's actually quite easy for it to be safe on average: you just need to be in a safe environment. That's not true for learned exploration, though.

Since by default policies can't affect across-episode exploration, I assume you're talking about within-episode exploration. But this happens all the time with current RL methods

Yeah, I agree that was confusing—I'll rephrase it. The point I was trying to make was that across-episode exploration should arise naturally, since an agent with a fixed objective should want to be modified to better pursue that objective, but not want to be modified to pursue a different objective.

This sounds to me like reward uncertainty, assistance games / CIRL, and more generally Stuart Russell's agenda, except applied to mesa optimization now. Should I take away something other than "we should have our mesa optimizers behave like the AIs in assistance games"? I feel like you are trying to say something else but I don't know what.

Agreed that there's a similarity there—that's the motivation for calling it “cooperative.” But I'm not trying to advocate for that agenda here—I'm just trying to better classify the different types of corrigibility and understand how they work. In fact, I think it's quite plausible that you could get away without cooperative corrigibility, though I don't really want to take a stand on that right now.

I thought we were talking about "the agent doesn't try to deceive us / hurt us by exploring", which wouldn't tell us anything about the problem of "the agent doesn't make an accidental mistake".

If your definition of “safe exploration” is “not making accidental mistakes” then I agree that what I'm pointing at doesn't fall under that heading. What I'm trying to point at is that I think there are other problems that we need to figure out regarding how models explore than just the “not making accidental mistakes” problem, though I have no strong feelings about whether or not to call those other problems “safe exploration” problems.

The same way as capability exploration; based on value of information (VoI). (I assume you have a well-specified distribution over objectives; if you don't, then there is no proper way to do it, in the same way there's no proper way to do capability exploration without a prior over what you might see when you take the new action.)

Agreed, though I don't think that's the end of the story. In particular, I don't think it's at all obvious what an agent that cares about the value of information that its actions produce relative to some objective distribution will look like, how you could get such an agent, or how you could verify when you had such an agent. And, even if you could do those things, it still seems pretty unclear to me what the right distribution over objectives should be and how you should learn it.

The algorithms used are not putting dampers on exploration; they are trying to get the agent to do better exploration (e.g. if you crashed into the wall and saw that that violated a constraint, don't crash into the wall again just because you forgot about that experience).

Well, what does “better exploration” mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective? I think it tends to be “better within-episode exploration relative to the base objective,” which I would call putting a damper on instrumental exploration, which does across-episode and within-episode exploration only for the mesa-objective, not the base objective.

If you have the right uncertainty, then acting optimally to maximize that is the "right" thing to do.

Sure, but as you note getting the right uncertainty could be quite difficult, so for practical purposes my question is still unanswered.

The point I was trying to make was that across-episode exploration should arise naturally

Are you saying that across-episode exploration should arise naturally when applying a deep RL algorithm? I disagree with that, at least in the episodic case; the deep RL algorithm optimizes within an episode, not across episodes. (With online learning, I think I still disagree but I'd want to specify an algorithm first.)

If for some reason you applied a planning algorithm that planned across episodes (quite a weird thing to do), then I suppose it would arise naturally; but that didn't sound like what you were saying.

If your definition of “safe exploration” is “not making accidental mistakes” then I agree that what I'm pointing at doesn't fall under that heading.

But in your post, you said:

Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment.

Isn't that entire paragraph about the "not making accidental mistakes" line of research?

Well, what does “better exploration” mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective?

I was talking about Safety Gym and algorithms meant for it here. Safety Gym explicitly measures the total number of constraint violations across all of training; this seems pretty clearly about across-episode exploration (since it's across all training) relative to the base objective (the constraint specification is in the base objective; also, there just aren't any mesa objectives because the policies are not mesa optimizers).

putting a damper on instrumental exploration, which does across-episode and within-episode exploration only for the mesa-objective

I continue to be confused how instrumental / learned exploration happens across episodes.

I am also confused at the model here -- is the idea that if you do better exploration for the base objective, then the mesa optimizer doesn't need to do exploration for the mesa objective? If so, why is that true, and even if it is true, why does it matter, since presumably the mesa optimizer then already knows the information it would have gotten via exploration?

I think I'd benefit a lot from a concrete example (i.e. pick an environment and an algorithm; talk about what happens in the limit of lots of compute / data, feel free to assume that a mesa optimizer is created).

One particularly interesting recent work in this domain was Leike et al.'s “Learning human objectives by evaluating hypothetical behaviours,” which used human feedback on hypothetical trajectories to learn how to avoid environmental traps. In the context of the capability exploration/objective exploration dichotomy, I think a lot of this work can be viewed as putting a damper on instrumental capability exploration.

Isn't this work also linked to objective exploration? One of the four kinds of "hypothetical behaviours" used is trajectories selected to maximize reward uncertainty, which are then evaluated by humans.