A putative new idea for AI control; index here.

This is a problem that developed from the "high impact from low impact" idea, but is a legitimate thought experiment in its own right (it also has connections with the "spirit of the law" idea).

Suppose that, next 1st of April, the US president may or may not die of natural causes. I chose this example because it's an event of potentially large magnitude, but not overwhelmingly so (neither a butterfly wing nor an asteroid impact).

Also assume that, for some reason, we are able to program an AI that will be nice, given that the president does die on that day. Its behaviour if the president doesn't die is undefined and potentially dangerous.

Is there a way (either at the initial stages of programming, or later) to extend the "niceness" from the "presidential death world" into the "presidential survival world"?

To focus on how tricky the problem is, assume for argument's sake that the vice-president is a warmonger who will start a nuclear war if they become president. Then "launch a coup on the 2nd of April" is a "nice" thing for the AI to do, conditional on the president dying. However, if you naively import that requirement into the "presidential survival world", the AI will launch a pointless and counterproductive coup. This is illustrative of the kind of problems that could come up.

So the question is, can we transfer niceness in this way, without needing a solution to the full problem of niceness in general?

EDIT: Actually, this seems ideally set up for a Bayes network (or for the requirement that a Bayes network be used).

EDIT2: Now the problem of predicates like "Grue" and "Bleen" seems to be the relevant bit. If you can avoid concepts such as "X = {nuclear war if president died, peace if president lived}", you can make the extension work.
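The grue/bleen worry can be made concrete with a toy sketch (everything below is illustrative; the policy names and behaviours are made up): two policies that agree on every world where the president died, yet diverge everywhere else, so evidence gathered only in the death-worlds cannot separate them.

```python
# Toy illustration: data conditional on the president's death cannot
# distinguish a "natural" policy from a grue-like one.

def natural_policy(president_died: bool) -> str:
    """Coup only as a response to the warmongering VP taking over."""
    return "coup" if president_died else "no coup"

def grue_policy(president_died: bool) -> str:
    """A 'grue'-like policy: the conditioning event is baked into
    the concept, so the death-world behaviour is exported unchanged."""
    return "coup"

# Identical on the verified worlds, opposite on the others.
agree = natural_policy(True) == grue_policy(True)
diverge = natural_policy(False) != grue_policy(False)
```

Both checks come out true: the two policies are indistinguishable on the only worlds where niceness was verified.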

So the question is, can we transfer niceness in this way, without needing a solution to the full problem of niceness in general?

How do you determine that it will be nice under the given condition?

As posed, it's entirely possible that the niceness is a coincidence: an artifact of the initial conditions fitting just right with the programming. Think of a coin landing on its side or a pencil being balanced on its tip. These positions are unstable and you need very specific initial conditions to get them to work.

The safe bet would be to have the AI start plotting an assassination and hope it lets you out of prison once its coup succeeds.

Hi Stuart,

I don't have my head wrapped around the nice/not-nice AI issues you think about daily, but I am happy to help out with the graphical model if you think that would be useful. Honestly, though, I don't think graphical models are a magic box for these kinds of problems: they simply repurpose the visual part of the brain to help with algebra, and don't really generate insight for these types of problems on their own.

Hey there! Thanks for the offer, but I think the problem is conceptual rather than graphical. Literally, in fact - I think this problem can be solved if you have a good definition of "concept".

I'll continue thinking...

(In case it was not obvious, I was replying to the edit, a Bayes net is a type of graphical model.)

Yep! I've moved beyond the Bayes net now, into "Grue" "Bleen" territory.

How about you ask the AI "if you were to ask a counterfactual version of you who lives in a world where the president died, what would it advise you to do?". This counterfactual AI is motivated to take nice actions, so it would advise the real AI to take nice actions as well, right?
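Concretely, the scheme might look like this toy sketch (all names and world-model structure are made up for illustration):

```python
# Toy sketch: the real AI forces "president died" to be true in its
# world-model, asks the verified-nice policy what to do there, and
# imports that answer.

def nice_policy(world: dict) -> str:
    """Verified nice only on worlds where the president died:
    there, a coup forestalls the warmongering VP."""
    return "coup" if world["president_died"] else "undefined"

def counterfactual_advice(actual_world: dict) -> str:
    # Build the counterfactual twin's world and query the policy there.
    twin_world = {**actual_world, "president_died": True}
    return nice_policy(twin_world)

advice = counterfactual_advice({"president_died": False})
```

Whether the imported advice is still appropriate in the survival world is exactly the question.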

This counterfactual AI is motivated to take nice actions in worlds where the president died. It might not even know what "nice" means in other worlds.

And even if it knew the correct answer to that question, how can you be sure it wouldn't instead lie to you in order to achieve its real goals? You can't really trust the AI if you are not sure it is nice or at least indifferent...

Every idea that comes to my mind runs into the big question: "if we were able to program a nice AI for that situation, why would we not program it to be nice in every situation?" I mean, it seems to me that in that scenario we would have both a solid definition of niceness and the ability to make the AI stick to it. Could you elaborate a little on that? Maybe an example?

This is basically in the line of my attempt to get high impact from reduced impact AIs. These are trying to extend part of "reduced impact" from a conditional situation, to a more general situation; see http://lesswrong.com/lw/m25/high_impact_from_low_impact/

Nevermind this comment, I read some more of your posts on the subject and I think I got the point now ;)

Thought a bit about the problem. Presumably, there's some way to determine whether an AI will behave nicely now and in the future. It's not a general solution, but it's able to verify perpetual nice behavior in the case where the president dies April 1. I don't know the details, so I'll just treat it as a black box where I can enter some initial conditions and it will output "Nice", "Not Nice", or "Unknown". In this framework, we have a situation where the only known input that returned "Nice" involved the president's death on April 1.

If you're using any kind of Bayesian reasoning, you're not going to assign probability 1 to any nontrivial statements. So, the AI would assign some probability to "The president died April 1" and is known to become nice when that probability crosses a certain threshold.

What are the temporal constraints? Does the threshold have to be reached by a certain date? What is the minimum duration for which the probability has to be above this threshold? Here's where one can experiment using the black box. If it is determined, for example, that the AI only needs to hold the belief for an hour, then one may be able to box the AI, give it a false prior for an hour, then expose it to enough contrary evidence for it to update its beliefs to properly reflect the real world.
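That experiment can be sketched numerically. Assuming, purely for illustration, a niceness threshold of 0.95 and hourly contrary evidence that halves the odds:

```python
# Illustrative numbers only; nothing here comes from the post.

def bayes_update(prob: float, likelihood_ratio: float) -> float:
    """Posterior from a prior and a likelihood ratio, via odds."""
    odds = prob / (1.0 - prob) * likelihood_ratio
    return odds / (1.0 + odds)

belief = 0.999      # engineered false prior: "the president died"
THRESHOLD = 0.95    # niceness assumed verified above this belief
hours_above = 0

for hour in range(24):
    if belief > THRESHOLD:
        hours_above += 1
    # each hour of contrary evidence halves the odds
    belief = bayes_update(belief, 0.5)
```

With these made-up numbers the belief stays above the threshold for six hours before decaying to a negligible level, so a one-hour requirement would be comfortably met.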

What if the AI is known to be nice only as long as it believes the president to have died April 1? That would mean that if, say, six months later one managed to trick the AI into believing the president didn't die, then we would no longer know whether it was nice. So either the AI only requires the belief for a certain time period, or else the very foundation of its niceness is suspect.

Condition on a thermodynamic miracle transforming the vice president into a copy of the president?

I deliberately chose an example where the thermodynamic miracle needed was extremely hard to define.


So I'm assuming that the Artificial Intelligence already exists before April 1st.

There is also the question of whether we're talking about absolute niceness or relative niceness. It looks like relative niceness from the way you're talking about a coup. The AI is nicer than the Vice President, but meaner than the President at least insofar as the coup is concerned.

So the current code says:

ON DATE April 2nd AI = Coup

So the new code would say:

IF President = dead THEN Condition A = true ELSE Condition A = false

IF Condition A = true THEN AI = Coup ELSE AI = Not Coup

Determining whether the President is dead (Condition A) seems like a relatively simple problem for programmers to solve. To get even more complicated, you might have the code expire when the President leaves office; defaulting to no coup. I'm not sure why this would be a hard problem.

The coup was just an example, to show that ("nice" | president dead) does not imply ("nice" | president alive). The coup thing can be patched if we know about it, but it's just an example of the general problem.


So the question is how do we solve a problem that we don't know exists. We only know it might exist, and that it will be solved under some conditions but not in others. And we don't know which conditions will be good and which will be bad. Yes, that is a tricky problem.

Is there a way (either at the initial stages of programming, or later) to extend the "niceness" from the "presidential death world" into the "presidential survival world"?

This problem is ill-posed. "Extend the niceness" means to modify the AI. What types of modification count as extending the niceness and what ones don't?

That's what I'm trying to figure out. The challenge is to see whether this can be done without defining "niceness" at all.

My objection isn't about defining niceness to the people programming the AI. My objection is about defining niceness (actually, defining "extending niceness" which isn't the same) to the people determining whether the answer is correct. If we don't know what it means to "extend niceness", then we can't know that any given answer is "extending niceness", which means we have no way to know whether it's actually an answer.

I don't think that's actually the case - I think we can extend niceness without knowing what that means in this sense. Working on a potential solution currently...

Well, you can do it without knowing whether you're doing it, but how would you know if you've ever succeeded?

Furthermore, knowing that something is "extending niceness" is not the same as knowing if something is niceness. Let's say you know what niceness is. What counts as an extension?