Simplified preferences needed; simplified preferences sufficient

by Stuart_Armstrong, 5th Mar 2019


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

AI scientists at EvenDeeperMind: "Hey everyone! We have developed a low-impact AI!"

AI policy people at OpenFutureofExistentialAI: "Fantastic! What does it do?"

AI scientists: "It's limited to answering questions, and it has four possible outputs, $a_1$, $a_2$, $a_3$, and $a_4$."

AI policy: "What exactly do these outputs do, btw?"

AI scientists: "Well, $a_1$ turns a green light on, $a_2$ turns a red light on, $a_3$ starts a nuclear war, and $a_4$ turns a blue light on."

AI policy: "Starts a nuclear war?!?!?"

AI scientists: "That or turns a yellow light on; I can never remember which..."

Against purely formal definitions of impact measure

It's "obvious" that an AI that starts a nuclear war with one of its four actions cannot be considered a "low-impact" agent.

But what about one that just turned the yellow light on? Well, consider a utility function $u$ that takes a high value if there were no yellow lights in that room during that hour, but a far lower value if there was a yellow light. For that utility function, the action "start a nuclear war" is the low-impact action, and even entertaining the possibility of turning on the yellow light is an abomination, you monster.

To which you should answer "$u$ is a very stupid choice of utility function". Indeed it is. But it is a possible choice of utility function, so if we had an AI that was somehow "low-impact" for all utility functions, it would be low-impact for $u$.

There are less artificial examples than $u$; a friendly utility function with a high discount rate would find any delay intolerable ("by not breaking out and optimising the world immediately, you're murdering thousands of people in intense agony, you monster!").
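The dependence of "low impact" on the choice of utility function can be made concrete with a toy sketch. Everything here is an assumption made for illustration, not something from the dialogue: the `do_nothing` baseline, the two world features, the numeric utility values, and the idea of measuring impact as the absolute change in utility relative to that baseline.

```python
# Illustrative sketch only: outcomes, utilities, and numbers are hypothetical.

# Each action leads to a world described by two hand-picked features.
OUTCOMES = {
    "do_nothing":   {"yellow_light": False, "nuclear_war": False},
    "yellow_light": {"yellow_light": True,  "nuclear_war": False},
    "nuclear_war":  {"yellow_light": False, "nuclear_war": True},
}

def u_yellow(world):
    """The 'stupid' utility: it only cares that no yellow light is on."""
    return -1.0 if world["yellow_light"] else 0.0

def u_human(world):
    """A crude stand-in for a human-compatible utility."""
    return -1000.0 if world["nuclear_war"] else 0.0

def impact(action, utility, baseline="do_nothing"):
    """Impact under one utility: absolute change relative to the baseline."""
    return abs(utility(OUTCOMES[action]) - utility(OUTCOMES[baseline]))

def low_impact_actions(utility, threshold=0.5):
    """Actions whose impact under `utility` stays below the threshold."""
    return sorted(a for a in OUTCOMES if impact(a, utility) < threshold)

print(low_impact_actions(u_yellow))  # nuclear war counts as low impact here
print(low_impact_actions(u_human))   # the yellow light counts as low impact here
```

Under `u_yellow` the nuclear war passes the low-impact test and the yellow light fails it; under `u_human` the verdicts are reversed. No action other than the baseline is low-impact for both.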

Abstract low-impact

This has been my recurrent objection to many attempts to formalise low-impact in abstract terms. We live in a universe in which every action is irreversible (damn you, second law!) and the consequences expand at light-speed across the cosmos. And yet most versions of low impact - including my own attempts - revolve around some measure of "keeping the rest of the universe the same, and only changing this tiny thing".

For this to make sense, we need to classify some descriptors of the world as "important", and others as "unimportant". And we further need to establish what counts as a "small" change to an "important" fact. You can see this as assigning utility functions to the values of the important descriptors, collecting them in a set $U$, and capturing low impact as "only small changes to the utility functions in $U$".

But you absolutely need to define $U$, and this has to be a definition that captures something of human values. This paper measures low-impact by preserving vases and penalising "irreversible" changes. But every change is irreversible, and what about preserving the convection currents in the room rather than the pointless vases? ("you monster!").

So defining a $U$ that is compatible with human models of "low-impact" is absolutely essential to getting the whole thing to work. Abstractly considering all utility functions (or all utility functions defined in an abstract action-observation sense) is not going to work.

Note that the definition of $U$ can often be hidden in the assumptions of the model, which will result in problems if those assumptions are relaxed or wrong.
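As a rough sketch of why the chosen set of utility functions does the real work, consider a toy penalty that takes the largest change, across a given set of utilities, relative to a baseline. The actions, world features, utilities, threshold, and the max-over-utilities penalty are all hypothetical choices made for this example; they are not taken from any particular impact-measure paper.

```python
# Illustrative sketch only: features, utilities, and numbers are hypothetical.

WORLDS = {
    "do_nothing":  {"vase_intact": True,  "air_currents_changed": False},
    "walk_around": {"vase_intact": True,  "air_currents_changed": True},
    "smash_vase":  {"vase_intact": False, "air_currents_changed": True},
}

def u_vase(world):
    """Cares only about the vase surviving."""
    return 1.0 if world["vase_intact"] else 0.0

def u_convection(world):
    """Cares only about the room's air currents staying undisturbed."""
    return 0.0 if world["air_currents_changed"] else 1.0

def penalty(action, utilities, baseline="do_nothing"):
    """Largest change, across the chosen utilities, relative to the baseline."""
    return max(abs(u(WORLDS[action]) - u(WORLDS[baseline])) for u in utilities)

def allowed(utilities, threshold=0.5):
    """Actions whose penalty under the chosen utilities is below the threshold."""
    return sorted(a for a in WORLDS if penalty(a, utilities) < threshold)

U_human = [u_vase]                     # a human-flavoured choice of set
U_everything = [u_vase, u_convection]  # closer to "all utility functions"

print(allowed(U_human))       # walking around is fine, smashing the vase is not
print(allowed(U_everything))  # only doing nothing survives
```

With the human-flavoured set, the agent may move about but not smash the vase. Once the convection-current utility is included, every action except the baseline is penalised, so the measure stops distinguishing sensible behaviour from destructive behaviour.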

The general intuitive disagreement

The objection I made here applies also to concepts like corrigibility, domesticity, value-learning, and similar ideas (including some versions of toolAI and Oracles). All of these need to designate certain AI policies as "safe" (or safer) and others as dangerous, and draw the line between them.

But, in my experience, this definition cannot be done in an abstract way; there is no such thing as a generally low-impact or corrigible agent. Defining some subset of what humans consider corrigible or tool-like is an essential requirement.

Now, people working in these areas don't often disagree with this formal argument; they just think it isn't that important. They feel that getting the right formalism is most of the work, and finding the right $U$ is easier, or just a separate bolt-on that can be added later.

My intuition, formed mainly by my many failures in this area, is that defining the $U$ is absolutely critical, and is much harder than the rest of the problem. Others have different intuitions, and I hope they're right.

Strictly easier than friendliness

The problem of finding a suitable $U$ is, however, strictly easier than defining a friendly utility function.

This can be seen in the fact that there are huge disagreements about morality and values between humans, but much lower disagreement on what an Oracle, a low-impact, or a corrigible agent should do.

"Don't needlessly smash the vases, but the convection currents are not important" is good advice for a low-impact agent, as agreed upon by people of all political, moral, and cultural persuasions, including a wide variety of plausible imaginary agents.

Thus defining $U$ is easier than coming up with a friendly utility function, as the same low-impact/corrigibility/domesticity/etc. is compatible with many different potential friendly utility functions for different values.
