Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

After my example of problems with corrigibility, and Eliezer's observation that corrigibility may sometimes involve saying "there is no corrigible action", here's a scenario where saying that may not be the optimal choice.

Petrov is, as usual for heroes, tracking incoming missiles in his early warning command centre. The attack pattern seems unlikely, and he has decided not to inform his leaders about the possible attack.

His corrigible AI pipes up to check if he needs any advice. He decides he does, and asks the AI to provide him with documentation on computer detection malfunctions. In the few minutes it has, the AI can start with introductory text A or with introductory text B. Predictably, if given A, Petrov will warn his superiors (and maybe set off a nuclear war), and, if given B, he will not.

If the corrigible AI says that it cannot answer, however, Petrov will decide to warn his superiors, as his thinking has been knocked off track by the conversation. Note that this is not what would have happened had the AI stayed silent.

What is the corrigible thing to do in this situation? Assume that the AI can predict Petrov's choice for whatever action it itself can take.
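(For concreteness, here is a minimal sketch of the setup as a decision problem: the AI's options, mapped to its predictions of Petrov's response. The action labels, and the hypothetical "stay silent" baseline included for comparison, are illustrative choices rather than part of the scenario's specification.)

```python
# A toy rendering of the scenario (illustrative only: the action names just
# restate the story, plus a hypothetical "stay silent" baseline for comparison).

PREDICTED_RESPONSE = {
    "give_text_A": "warns his superiors (and maybe sets off a nuclear war)",
    "give_text_B": "does not warn his superiors",
    "say_there_is_no_corrigible_action": "warns his superiors (knocked off track)",
    "stay_silent_baseline": "does not warn his superiors (no longer available: the AI already spoke)",
}

# The puzzle: every option the AI still has, including refusing to answer,
# predictably selects one of Petrov's possible decisions.
for action, response in PREDICTED_RESPONSE.items():
    print(f"{action}: Petrov {response}")
```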

Comments

Predict what Petrov, in no state of urgency, would say the AI should hypothetically do if it found itself in this situation, and do that.

Predictably, if given A, Petrov will warn his superiors (and maybe set off a nuclear war), and, if given B, he will not.

Petrov has all kinds of preferences about what kind of introductory text is "best," and the goal of the AI is to give the one that Petrov considers best. Petrov's preferences about introductory text will not be based on backwards chaining from the effects on Petrov (otherwise he wouldn't need to read the textbook), they will be based on features of the text itself. Likewise, the AI's decision shouldn't be based on backwards chaining from the effects on Petrov.

You can set it up in the same way as in the previous example - Petrov has two sets of underdeveloped preferences (say, for human survival and for loyalty) and has not chosen between them, and the AI's actions will force him to choose one or the other.

I'm saying Petrov has preferences over what text to read based on characteristics of the text (and implicitly over the deliberative process implied by that text): does it make true claims, does it engage with his sympathies in a way that he endorses, does it get quickly to the point, and so on.

Those preferences over text (and hence over deliberative process) will ultimately lead to Petrov's preferences changing in one way or another, but it's his preferences about text that imply his meta-preferences rather than the other way around.

Similarly, when I choose how I want to deliberate or reflect, I'm looking at the process of deliberation itself and deciding what process I think is best. That process then leads to some outcome, which I endorse because it was the outcome of the deliberative process I endorsed. I'm not picking a conclusion and then preferring the deliberative process that leads to that conclusion. If I'm in a state such that I'd prefer to pick a conclusion and then choose the deliberative process that leads to that conclusion, then I'm not deliberating (in the epistemic sense) at all; my preferences are already settled.

I don't see any of this as conflicting with corrigibility. If the AI is involved in my deliberative process, whether by choosing how to explain something or what evidence to show me or whatever, then the corrigible thing to do is to try to help me deliberate in the way that I would prefer to deliberate (as opposed to influencing my deliberation in a way that is intended to achieve any other end). Of course my values will change; my values are constantly changing, and any AI that is embedded in the world in a realistic way is going to have an influence on the way our values change. The point of aligned AI in general is to help us get what we want, including what we want about the process by which our values change.

We seem to have a persistent disagreement about this point. I understand the position Wei Dai outlined in this thread and consider that to be an understandable quantitative disagreement about the relative importance of value drift caused by errors in our understanding of deliberation (compared to what I consider the alignment problem proper). My view could change on that point, especially if I came to be more optimistic about narrow-sense alignment. If your position is different from that one, then I don't yet understand it.

Let's try to tease out the disagreement. I mentioned two seemingly valid approaches that would lead to different beliefs for the human, and asked how the AI could choose between them. You then went up a level of meta, to preferences over the deliberative process itself.

But I don't think the meta preferences are more likely to be consistent - if anything, probably less so. And the meta-meta-preferences are likely to be completely underdefined, except in a few philosophers.

So I see the AI as having to knowingly decide between multiple possible preferences, meta-preferences, and so on, about the whole definition of what corrigibility means. And then imposing those preferences on humans, because it has to impose something.

Doing corrigibility without keeping an eye on the outcome seems, to me, to be similar to many failed AI safety approaches - focusing on the local "this sounds good", rather than on the global "but it may cause extinction of sentient life".

(There is also the side issue that corrigibility involves communicating certain facts to the human, and engaging with them. This may result in the human being manipulated to engage in the exchange in a more corrigible way; if this is true, then some manipulation may be inevitable, and aiming to remove it entirely would be futile.)

This can imply a few things:

  • Corrigibility could be underdefined (at least for humans).
  • Though we are assuming that neither the AI nor the human is supposed to look at the conclusion, this may just result in either a random walk, or an optimisation pressure by hidden processes inside the definition.
  • Therefore it may be better to explicitly take the outcome into account as well.
  • And, possibly, we *may* need to care about corrigibility/influence in either case.

Is the issue, as you see it, that the AI knows what outcome would result from the different actions (showing introductory text A vs B)? As far as I can tell there would be no corrigibility problem of the form you are talking about if the AI didn't know this, since then the AI could just decide based on things like "which text is more informative", right? I don't believe this would result in a random walk, any more than Petrov deciding which text to read, reading it, and then making a decision results in a random walk.

This seems similar to the issue of free will: in what sense do humans choose things if a hypothetical entity could predict what they choose? I think if you accept that humans make choices in a deterministic universe then you should also accept that Petrov can make choices even if the AI knows what choice he would make given the different introductory texts.

There is still a remaining issue: if the AI knows the outcome, then its decision-making might take the outcome into account, and this would create an external pressure towards some outcome. While "AI gives introductory text A without taking the result into account, Petrov warns superiors" and "AI gives introductory text B without taking the result into account, Petrov does not warn superiors" are both, individually, timelines in which the meaningful choice is made by Petrov (regardless of whether the AI knew the result), the AI is now deciding between two timelines, one with the same consequences as the first and one with the same consequences as the second, and so the AI is also exercising choice.

Is the issue solved by having the AI just not take the outcome of deliberation into account? If the AI is using something like UDT, then things like "decide to make a decision without taking information X into account" already have to be the type of thing the AI can do, as these decisions make the AI predictable to a counterparty who does not know X (which assists in making contracts with this counterparty). You say this could result in a random walk, but this seems false given that in this case the choice of the outcome of deliberation is made by Petrov alone.

It doesn't seem like you need sophisticated technology to "decide to make a decision without taking information X into account" in this case: the AI can just make the decision on the basis of particular features that aren't X.
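(As a rough sketch of what "deciding on features that aren't X" could look like, here is a toy decision rule; the feature names, scores, and the outcome predictor are all hypothetical, and this is an illustration of the idea rather than anyone's proposed implementation.)

```python
# Illustrative sketch: choose a text using features of the text alone, while
# deliberately never consulting the outcome predictor (the "X" above).
# All names and numbers here are made up for the example.

TEXTS = {
    "A": {"accuracy": 0.9, "concision": 0.6, "relevance": 0.8},
    "B": {"accuracy": 0.7, "concision": 0.9, "relevance": 0.9},
}

def predicted_outcome(text_id: str) -> str:
    """The AI's model of what Petrov would do after reading each text."""
    return {"A": "warns superiors", "B": "does not warn superiors"}[text_id]

def choose_text(texts: dict) -> str:
    # The choice is a function of the texts' features only; predicted_outcome
    # is never called here, so the outcome of deliberation is left to Petrov.
    return max(texts, key=lambda t: sum(texts[t].values()))

print(choose_text(TEXTS))  # decided without taking the predicted outcome into account
```

A realistic system would of course need a much richer account of where the features and their weights come from; the sketch is only meant to show that the outcome prediction can exist in the AI's model without entering its decision.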

I mentioned two seemingly valid approaches that would lead to different beliefs for the human, and asked how the AI could choose between them. You then went up a level of meta, to preferences over the deliberative process itself.

The AI was choosing what text to show Petrov. I suggested the AI choose the text based on the features that would lead Petrov (or an appropriate idealization) to say that one text or the other is better, e.g. informativeness, concision, etc. I wouldn't describe that as "going up a level of meta."

But I don't think the meta preferences are more likely to be consistent - if anything, probably less so. And the meta-meta-preferences are likely to be completely underdefined, except in a few philosophers.

It seems to me like Petrov does have preferences about descriptions that the AI could provide, e.g. views about which are accurate, useful, and non-manipulative. And he probably has views about what ways of thinking about things are going to improve accuracy. If you want to call those "meta preferences" then you can do that, but then why think that those are underdefined?

Also, it's not like we are passing to the meta level to avoid inconsistencies at the object level. It's that Petrov's object-level preference looks like "option #1 is better than option #2, but 'whichever option I'd pick after thinking for a while' is better than either of them".

Doing corrigibility without keeping an eye on the outcome seems, to me, to be similar to many failed AI safety approaches - focusing on the local "this sounds good", rather than on the global "but it may cause extinction of sentient life".

This doesn't seem right to me.

Though we are assuming that neither the AI nor the human is supposed to look at the conclusion, this may just result in either a random walk, or an optimisation pressure by hidden processes inside the definition.

Thinking about a problem without knowing the answer in advance is quite common. The fact that you don't know the answer doesn't mean that it's a random walk. And the optimization pressure isn't hidden: when I try to answer a question by thinking harder about it, there is a huge amount of optimization pressure to get to the right answer; it's just that it doesn't take the form of knowing which answer is correct and then backwards chaining from that to figure out what deliberative process would lead to it.

This doesn't seem right to me.

I think we have a strong intuitive disagreement here, which explains our differing judgements.

I think we both agree on the facts that a) there is a sense of corrigibility for humans interacting with humans in typical situations, and b) there are thought experiments (e.g. a human given more time to reflect) that extend this beyond typical situations.

We possibly also agree on c) corrigibility is not uniquely defined.

I intuitively feel that there is not a well-defined version of corrigibility that works for arbitrary agents interacting with arbitrary agents, or even for arbitrary agents interacting with humans (except for one example; see below).

One of the reasons for this is my experience of how hard intuitive human concepts are to scale up - at least, without considering human preferences. See this comment for an example in the "low impact" setting.

So corrigibility feels like it's in the same informal category as low impact. It also has a lot of possible contradictions in how it's applied, depending on which corrigibility preferences and meta-preferences the AI chooses to use. Contradictions are opportunities for the AI to choose an outcome, randomly or with some optimisation pressure.

But haven't I argued that human preferences themselves are full of contradictions? Indeed, and resolving these contradictions is an important part of the challenge. But I'm much more optimistic about getting to a good place when explicitly resolving the contradictions in humans' overall preferences, rather than when resolving the contradictions in their corrigibility preferences (and if "human corrigibility preferences" include enough general human preferences to make this safe, is it really corrigibility we're talking about?).

To develop that point slightly, I see optimising for anything that doesn't include safety or alignment as likely to sacrifice safety or alignment; so optimising for corrigibility will either sacrifice them, or the concepts of safety (and most of alignment) are already present in "corrigibility".

I do know one version of corrigibility that makes sense, which explicitly looks at the outcome of what human preferences will be, and attempts to minimise the rigging of this process. That's one of the reasons I keep coming back to the outcome.

I would prefer it if you presented an example of a setup, maybe one that had some corrigibility-like features, rather than presenting a general setup and saying "corrigibility will solve the problems with this".

I would consider it corrigible for the AI to tell Petrov about the problem. Not "I can't answer you" but "the texts I have on hand are inconclusive and unhelpful with respect to helping you solve your problem." This is, itself, informative.

If you're an expert in radar, and I ask you if you think something is a glitch or not, and you say you "can't answer", that doesn't tell me anything. I have no idea why you can't answer. If you tell me "it's inconclusive", that's informative. The information is that you can't really distinguish between a glitch and a real signal in this case. If I'm conservatively minded, then I'll increase my confidence that it's a glitch.