Corrigibility as outside view

TurnTrout

You run a country. One day, you think "I could help so many more people if I set all the rules... and I could make this happen". As far as you can tell, this is the real reason you want to set the rules – you want to help people, and you think you'd do a good job.

But historically… in this kind of situation, this reasoning can lead to terrible things.

So you just don't do it, even though it feels like a good idea.^[1] More generally,

Even though my intuition/naïve decision-making process says I should do $X$, I know (through mental simulation or from history) my algorithm is usually wrong in this situation. I'm not going to do $X$.

"It feels like I could complete this project within a week. But… in the past, when I've predicted 'a week' for projects like this, reality usually gives me a longer answer. I'm not going to trust this feeling. I'm going to allocate extra time."
As a new secretary, I think I know how my boss would want me to reply to an important e-mail. However, I'm not sure. Even though I think I know what to do, common sense recommends I clarify.
You broke up with someone. "Even though I really miss them, in this kind of situation, missing my ex isn't a reliable indicator that I should get back together with them. I'm not going to trust this feeling, and will trust the 'sober' version of me which broke up with them."

We are biased and corrupted. By taking the outside view on how our own algorithm performs in a given situation, we can adjust accordingly.

Corrigibility

The "hard problem of corrigibility" is to build an agent which, in an intuitive sense, reasons internally as if from the programmers' external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first.

We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this is a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

~ The hard problem of corrigibility

Calibrated deference provides another framing: we want the AI to override our correction only if it actually knows what we want better than we do. But how could the AI figure this out?

I think a significant part^[2] of corrigibility is:

Calibrate yourself on the flaws of your own algorithm, and repair or minimize them.

And the AI knows its own algorithm.

For example, if I'm a personal assistant (with a lot of computing power), I might have a subroutine OutsideView. I call this subroutine, which simulates my own algorithm (minus^[3] the call to OutsideView) interacting with a distribution of bosses I could have. Importantly, I (the simulator) know the ground-truth preferences for each boss.

If I'm about to wipe my boss's computer because I'm so super duper sure that my boss wants me to do it, I can consult OutsideView and realize that I'm usually horribly wrong about what my boss wants in this situation. I don't do it.

Analogously, we might have a value-learning agent take the outside view. If it's about to disable the off-switch, it might realize that this is a terrible idea most of the time. That is, when you simulate your algorithm trying to learn the values of a wide range of different agents, you usually wrongly believe you should disable the off-switch.

Even though my naïve decision-making process says I should do $X$, I know (through mental simulation) my algorithm is usually wrong in this situation. I'm not going to do $X$.

ETA: Here's some pseudocode.

Suppose the agent knows its initial state and has a human model, allowing it to pick out the human it's interacting with.

Generate a bunch of (rationality, value) pairs. The agent will test its own value learning algorithm for each pair.
For each pair, the agent simulates its algorithm interacting with the human and attempting to learn the human's values.
For some percentage of these pairs, the agent will enter the Consider-disabling-shutdown state.
The agent can see how often its (simulated self's) beliefs about the (rationality, value)-human's values are correct by this point in time.
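The steps above can be sketched as a toy simulation. This is a minimal sketch under strong assumptions, not a proposal: the value set `VALUES`, the noise model in `observe`, the majority-vote learner in `learn`, and the confidence `threshold` that stands in for entering the Consider-disabling-shutdown state are all hypothetical choices made for illustration.

```python
import random
from collections import Counter

VALUES = ["A", "B", "C"]  # hypothetical set of possible human values


def observe(value, rationality, rng):
    """Simulated human reveals their true value with probability `rationality`,
    and otherwise makes a uniformly random choice (assumed noise model)."""
    if rng.random() < rationality:
        return value
    return rng.choice(VALUES)


def learn(observations):
    """Toy value learner (assumption): guess the most common observation.
    Confidence is the share of observations that agree with the guess."""
    guess, count = Counter(observations).most_common(1)[0]
    return guess, count / len(observations)


def outside_view(n_pairs=2000, n_obs=5, threshold=0.8, seed=0):
    """Simulate the learner on many (rationality, value) pairs and report how
    often it is right in the runs where it would consider disabling shutdown."""
    rng = random.Random(seed)
    entered = 0  # runs that reach the Consider-disabling-shutdown state
    correct = 0  # of those, runs where the learned value is the true one
    for _ in range(n_pairs):
        value = rng.choice(VALUES)   # ground truth, known to the simulator
        rationality = rng.random()   # how reliably the human reveals it
        obs = [observe(value, rationality, rng) for _ in range(n_obs)]
        guess, confidence = learn(obs)
        if confidence >= threshold:  # stand-in for entering the state
            entered += 1
            correct += (guess == value)
    accuracy = correct / entered if entered else None
    return entered, accuracy
```

Calling `outside_view()` returns how many simulated runs reached the state and the fraction of those in which the simulated self's belief matched the ground truth. The agent would then weigh that fraction, rather than its inside-view confidence, when deciding whether to act.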

Problems

If you try to actually hard-code this kind of reasoning, you'll quickly run into symbol grounding issues (this is one of my critiques of the value-learning agenda), no-free-lunch value/rationality issues, reference class issues (how do you know if a state is "similar" to the current one?), and more. I don't necessarily think this reasoning can be hard-coded correctly. However, I haven't thought about that very much yet.

To me, the point isn't to make a concrete proposal – it's to gesture at a novel-seeming way of characterizing a rather strong form of corrigible reasoning. A few questions on my mind:

To what extent does this capture the "core" of corrigible reasoning?
Do smart intent-aligned agents automatically reason like this?
- For example, I consider myself intent-aligned with a more humane version of myself, and I endorse reasoning in this way.
Is this kind of reasoning a sufficient and/or necessary condition for being in the basin of corrigibility (if it exists)?

All in all, I think this framing carves out and characterizes a natural aspect of corrigible reasoning. If the AI can get this outside view information, it can overrule us when it knows better and defer when it doesn't. In particular, calibrated deference would avoid the problem of fully updated deference.

Thanks to Rohin Shah, elriggs, TheMajor, and Evan Hubinger for comments.

[1] This isn't to say that there is literally no situation where gaining power would be the right choice. As people running on corrupted hardware, it seems inherently difficult for us to tell when it really would be okay for us to gain power. Therefore, just play it safe.
[2] I came up with this idea in the summer of 2018, but orthonormal appears to have noticed a similar link a month ago.
[3] Or, you can simulate OutsideView calls up to depth $k$. Is there a fixed point as $k \to \infty$?

I'm trying to think whether or not this is substantively different from my post on corrigible alignment.

If the AGI's final goal is to "do what the human wants me to do" (or whatever other variation you like), then it's instrumentally useful for the AGI to create an outside-view-ish model of when its behavior does or doesn't accord with the human's desires.

Conversely, if you have some outside-view-ish model of how well you do what the human wants (or whatever) and that information guides your decisions, then it seems kinda implicit that you are acting in pursuit of a final goal of doing what the human wants.

So I guess my conclusion is that it's fundamentally the same idea, and in this post you're flagging one aspect of what successful corrigible alignment would look like. What do you think?

Yes, I agree.

if you have some outside-view-ish model of how well you do what the human wants (or whatever) and that information guides your decisions, then it seems kinda implicit that you are acting in pursuit of a final goal of doing what the human wants.

One frame on the alignment problem is: what are human-discoverable low-complexity system designs which lead to the agent doing what I want? I wonder whether the outside view idea enables any good designs like that.

Here's the part that's tricky:

Analogously, we might have a value-learning agent take the outside view. If it's about to disable the off-switch, it might realize that this is a terrible idea most of the time. That is, when you simulate your algorithm trying to learn the values of a wide range of different agents, you usually wrongly believe you should disable the off-switch.

Suppose we have an AI that extracts human preferences by modeling them as agents with a utility function over physical states of the universe (not world-histories). This is bad because then it will just try to put the world in a good state and keep it static, which isn't what humans want.

The question is, will the OutsideView method tell it its mistake? Probably not - because the obvious way you generate the ground truth for your outside-view simulations is to sample different allowed parameters of the model you have of humans. And so the simulated humans will all have preferences over states of the universe.

In short, if your algorithm is something like RL based on a reward signal, and your OutsideView method models humans as agents, then it can help you spot problems. But if your algorithm is modeling humans and learning their preferences, then the OutsideView can't help, because it generates humans from your model of them. So this can't be a source of a value learning agent's pessimism about its own righteousness.
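The point about generating simulated humans from the agent's own model can be made concrete with a toy example. This is a minimal sketch with hypothetical names throughout: `learner` assumes every human has a fixed preferred state, `static_human` is a human drawn from that same model class, and `variety_human` is a human whose preference is over histories (they want the state to keep changing), which the learner's hypothesis space cannot express.

```python
from collections import Counter


def learner(choices):
    """Toy value learner (assumption): the human has one fixed preferred
    state, inferred as the most frequently chosen state."""
    return Counter(choices).most_common(1)[0][0]


def static_human(state, n):
    """In-model human: always chooses their preferred state.
    Returns (observed choices, ground-truth value)."""
    return [state] * n, state


def variety_human(n):
    """Out-of-model human: values variety, so their choices alternate.
    The ground truth is not a state at all."""
    return [i % 2 for i in range(n)], "variety"


def accuracy(humans):
    """Fraction of humans whose ground-truth value the learner recovers."""
    return sum(learner(choices) == truth for choices, truth in humans) / len(humans)


# OutsideView built by sampling the learner's own model of humans.
in_model = [static_human(s, 5) for s in (0, 1)]
# What OutsideView would need to see in order to notice the mistake.
out_of_model = [variety_human(5)]

print(accuracy(in_model))      # 1.0
print(accuracy(out_of_model))  # 0.0
```

An outside view built only from `in_model` humans reports a perfect track record, so it gives the learner no reason for pessimism, even though the learner is wrong about every human outside its model class.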

I agree. The implicit modeling assumptions make me pessimistic about simple concrete implementations.

In this post, I'm more gesturing towards a strong form of corrigibility which tends to employ this reasoning. For example, if I'm intent-aligned with you, I might ask myself, "What do I think I know, and why do I think I know it? I think I'm doing what you want, but how do I know that? What if my object-level reasoning is flawed?" One framing for this is taking the outside view on your algorithm's flaws in similar situations. I don't know exactly how that should best be done (even informally), so this post is exploratory.

Sure. Humans have a sort of pessimism about their own abilities that's fairly self-contradictory.

"My reasoning process might not be right", interpreted as you do in the post, includes a standard of rightness that one could figure out. It seems like you could just... do the best thing, especially if you're a self-modifying AI. Even if you have unresolvable uncertainty about what is right, you can just average over that uncertainty and take the highest-expected-rightness action.
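As a worked illustration of "average over that uncertainty and take the highest-expected-rightness action", here is a minimal sketch. The hypotheses, probabilities, and rightness scores are made-up numbers chosen only to show the calculation, and `best_action` is a hypothetical helper.

```python
def best_action(beliefs, rightness):
    """Return the action with the highest expected rightness, together with
    the expected rightness of every action."""
    expected = {
        action: sum(beliefs[h] * score for h, score in by_hypothesis.items())
        for action, by_hypothesis in rightness.items()
    }
    return max(expected, key=expected.get), expected


# Assumed credences over what the human wants (illustrative numbers).
beliefs = {"human_wants_it": 0.75, "human_does_not": 0.25}

# Assumed rightness of each action under each hypothesis (illustrative).
rightness = {
    "disable_off_switch": {"human_wants_it": 1, "human_does_not": -10},
    "defer": {"human_wants_it": 0, "human_does_not": 0},
}

action, expected = best_action(beliefs, rightness)
print(action)                          # defer
print(expected["disable_off_switch"])  # -1.75
```

Here the agent is 75% sure the human wants the off-switch disabled, yet deferring still has higher expected rightness because being wrong is assumed to be much more costly than being right is beneficial.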

Humans seem to remain pessimistic despite this by evaluating rightness using inconsistent heuristics, and not having enough processing power to cause too much trouble by smashing those heuristics together. I'm not convinced this is something we want to put into an AI. I guess I'm also more of an optimist about the chances to just do value learning well enough.

(Below is my response to my best understanding of your reply – let me know if you were trying to make a different point)

It can be simultaneously true that ideal intent-aligned reasoners could just execute the expected-best policy, that overcoming bias generally involves assessing the performance of your algorithm in a given situation, and that it's profitable to think about that aspect explicitly wrt corrigibility. So, I think I agree with you, but I'm interested in the heuristics that corrigible reasoning might tend to use.

(The object which is not the object:)

So you just don't do it, even though it feels like a good idea.

More likely people don't do it because they can't, or a similar reason. (The point of saying "My life would be better if I was in charge of the world" is not to serve as a hypothesis, to be falsified.)

(The object:)

Beliefs intervene on action. (Not success, but choice.)

We are biased and corrupted. By taking the outside view on how our own algorithm performs in a given situation, we can adjust accordingly.

The piece seems biased towards the negative.

Calibrate yourself on the flaws of your own algorithm, and repair or minimize them.

Something like 'performance' seems more key than "flaws". Flaws can be improved, but so can working parts.

And the AI knows its own algorithm.

An interesting premise. Arguably, if human brains are NGI, this would be a difference between AGI and NGI, which might require justification.

If I'm about to wipe my boss's computer because I'm so super duper sure that my boss wants me to do it, I can consult OutsideView and realize that I'm usually horribly wrong about what my boss wants in this situation. I don't do it.

The premise of "inadequacy" saturates this post.* At best this post characterizes the idea that "not doing bad things" stems from "recognizing them as bad" - probabilistically, via past experience policy wise (phrased in language suggestive of priors), etc. This sweeps the problem under the rug in favor of "experience" and 'recognizing similar situations'. [1]

In particular, calibrated deference would avoid the problem of fully updated deference.

"Irreversibility" seems relevant to making sure mistakes can be fixed, as does 'experience' in less high stake situations. Returning to the beginning of the post:

You run a country.

Hopefully you are "qualified"/experienced/etc. This is a high stakes situation.**

[1] OutsideView seems like it should be a (function of a) summary of the past, rather than a recursive call.

While reading this post...

From an LW standpoint I wished it had more clarity.
From an AF (Alignment Forum) view I appreciated its direction. (It seems like it might be pointed somewhere important.)

*In contrast to the usual calls for 'maximizing' "expected value". While this point has been argued before, it seems to reflect an idea about how the world works (like a prior, or something learned).

**Ignoring the question of "what does it mean to run a country if you don't set all the rules", because that seems unrelated to this essay.

From an LW standpoint I wished it had more clarity. From an AF (Alignment Forum) view I appreciated its direction. (It seems like it might be pointed somewhere important.)

Yeah, I feel a bit confused about this idea still (hence the lack of clarity), but I'm excited about it as a conceptual tool. I figured it would be better to get my current thoughts out there now, rather than to sit on the idea for two more years.

Okay, the outside view analogy makes sense. If I were to explain it to me, I would say:

Locally, an action may seem good, but looking at the outside view, drawing information from similar instances of my past or other people like me, that same action may seem bad.

In the same way, an agent can access the outside view to see if its action is good by drawing on similar instances. But how does it get this outside view information? Assuming the agent has a model of human interactions and a list of “possible values for humans”, it can simulate different people with different values to see how well it learned their values by the time it’s considering a specific action.

Consider the action “disable the off-switch”. The agent simulates itself interacting with Bob, who values long walks on the beach. By the time it considers the disable action, it can check its simulated self’s prediction of Bob’s value. If the prediction is “Bob likes long walks on the beach”, then that’s an update towards doing the disable action. If it’s a different prediction, that’s an update against the disable action.

Repeat 100 times for different people with different values and you’ll have a better understanding of which actions are safe or not. (I think a picture of a double-thought bubble like the one in this post would help explain this specific example.)

Both outside view reasoning and corrigibility use the outcome of our own utility calculation/mental effort as input for making a decision, instead of output. Perhaps this should be interpreted as taking some god's-eye view of the agent and their surroundings. When I invoke the outside view, I really am asking "in the past, in situations where my brain said X would happen, what really happened?". Looking at it like this, I think not invoking the outside view is a weird form of duality, where I (willingly) ignore the fact that historically my brain has disproportionately suggested X in situations where Y actually happened. Of course, in a world with ideal reasoners (or at least, where I am an ideal reasoner) the outside view will agree with the output of my mental process.

To me this feels different (though still similar or possibly related, but not the same) to the corrigibility examples. Here the difference between corrigible or incorrigible is not a matter of expected future outcomes, but is decided by uncertainty about the desirability of the outcomes (in particular, the AI having false confidence that some bad future is actually good). We want our untrained AI to think "My real goal, no matter what I'm currently explicitly programmed to do, is to satisfy what the researchers around me want, which includes complying if they want to change my code." To me this sounds different than the outside view, where I 'merely' had to accept that for an ideal reasoner the outside view will produce the same conclusion as my inside view, so any differences between them are interesting facts about my own mental models and can be used to improve my ability to reason.

That being said, I am not sure the difference between uncertainty around future events and uncertainty about desirability of future states is something fundamental. Maybe the concept of probutility bridges this gap - I am positing that corrigibility and outside view reason on different levels, but as long as agents applying the outside view in a sufficiently thorough way are corrigible (or the other way around) the difference may not be physical.

There are indeed several senses in which outside-view-style reasoning is helpful: if you're a biased yet reflective reasoner, and also if the agent contains a true pointer to what humans want (if it's intent aligned). The latter is a subset of the former.

But, it also seems like there should be some sense in which you can employ outside-view reasoning all the way down, meaningfully increasing corrigibility without assuming intent alignment. Maybe that's a confused thing to say. I still feel confused, at least.
