Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

Pick a very unsafe goal: "AI, make this world richer and less unequal." What does this mean as a goal, and can we make it safe?

I've started to sketch out how we can codify "human understanding" in terms of human ability to answer questions.

Here I'm investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.


For the purpose of this post, I'll assume we have some sufficient measure of accuracy, Acc. This is a boolean-valued function that takes as input a human h (in a particular time and place), a string/description s, and a world w or a pair of worlds w, w'. Then Acc(h, s, w) (respectively Acc(h, s, w, w')) is true iff the string s, when presented to the human h, is an understandably accurate description (of w)/(of the difference between w and w').
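To make the shape of this measure concrete, here is a minimal Python sketch. Everything in it -- the `World` and `Human` types, the fact-matching stand-in for human judgement, the string format -- is an illustrative assumption, not a proposed implementation of Acc.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class World:
    """A world, crudely modelled as a bag of named facts."""
    facts: frozenset  # e.g. frozenset({("gdp", "high"), ("inequality", "low")})

class Human:
    """Stand-in for a human judge: accepts a description iff every claim
    in it matches the given facts (a crude proxy for 'understandably
    accurate')."""
    def judges_accurate(self, claims, facts):
        return all(c in facts for c in claims)

def Acc(h, s, w, w2=None):
    """Boolean accuracy measure.

    Acc(h, s, w): is the string s an understandably accurate description
    of world w, as judged by human h?
    Acc(h, s, w, w2): is s an accurate description of the *difference*
    between w and w2?
    """
    # Parse "key=value;key=value" strings into claims.
    claims = {tuple(c.split("=")) for c in s.split(";") if c}
    if w2 is None:
        return h.judges_accurate(claims, w.facts)
    diff = w.facts ^ w2.facts  # facts that differ between the two worlds
    return h.judges_accurate(claims, diff)
```

For example, `Acc(Human(), "gdp=low", World(frozenset({("gdp", "low")})))` comes out true, while a description claiming the opposite fact comes out false.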

If Acc(h, s, w) is false, then s is describing a world the human h would not see as accurately described by s.

Let w_0 be our world, let w be any world, and let w_g be the world that the goal string g -- "make this world richer and less unequal" -- is meant to be describing (this is an informal definition, as we haven't formalised what this means yet).

Humans have a poor understanding of causality, of what causes what in the real world (and in w_g). A lot of strong political positions, for instance, seem predicated on denying the existence of certain trade-offs. And no-one has a complete understanding of all the physics, biology, and social sciences that best model our world. Thus the desiderata of g may be impossible to satisfy; there is no plausible world that is well described by g.

And on a more basic and fundamental level, we are simply ignorant of vast numbers of facts about the world. No-one knows all the basic statistical descriptors of our world, let alone the full distribution behind those descriptors.

Thus even if there were a plausible world w_g well-described by g, if we had a full description of that world, we would think it very different from what we intended with g -- just as, if we had a full description of w_0, we wouldn't recognise our own world.

This suggests that g should in some way be seen as a description of the "difference" between worlds.

Modelling worlds

Here we're going to replace worlds w with models M(w) of those worlds. These models are made up of variables v_1, ..., v_n. Each of those variables v_i has a description d_i, and we use our measure of accuracy Acc to ensure that these descriptions are understandable.

Specifically, if w and w' are almost the same except that they have different values of v_i for i in a small set I, then we say the descriptions are understandable if Acc(h, s_I, w, w') is true, where s_I is the string listing, for each i in I, the description d_i and the two values of v_i.

Thus the difference in the variables v_i, along with their descriptions d_i, is a good description of the difference between worlds.

Lastly, the variables v_i are required to be important to humans, based on their descriptions. Thus the model is more likely to include "human happiness" than "electron density of Saturn".
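The modelling setup above can be sketched in toy form: a model is a set of named variables paired with human-readable descriptions, and the difference between two modelled worlds is reported through the descriptions of the variables that changed. The class and field names are hypothetical illustrations.

```python
class Model:
    """Toy world-model M(w): named variables with human-readable
    descriptions (the v_i and d_i of the text)."""

    def __init__(self, variables, descriptions):
        self.variables = variables        # {name: value}
        self.descriptions = descriptions  # {name: human-readable string}

    def diff_description(self, other):
        """Describe the difference from `other` via the variables that
        changed, each rendered through its description d_i."""
        changed = {
            name: (other.variables.get(name), value)
            for name, value in self.variables.items()
            if other.variables.get(name) != value
        }
        return [
            f"{self.descriptions[name]}: {old} -> {new}"
            for name, (old, new) in changed.items()
        ]
```

For instance, if only the "wealth" variable differs between two models, `diff_description` returns a single line such as `"average human wealth: low -> high"` -- a description of the difference between worlds, built entirely from the understandable d_i.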

Testing the model: devil's advocacy

Now, it should be obvious that there exist worlds with very positive values of the v_i -- every human is modelled as being alive, healthy, happy, free, flourishing, equal, etc... -- that are nevertheless horrible places to live.

It's not only a question of siren worlds, deceptive worlds designed to hide their badness. It's more that M(w) is only an accurate description of w for worlds that differ little from w_0, and thus that constraining worlds to have specific values of the v_i does not constrain them to being well described by those v_i and d_i.

And even if they were well-described, it's possible that the v_i do not capture all the variables that humans find important -- the model may have missed some.

This is especially likely as humans often miss important background features of their own lives, ones they don't have to think about. And because we haven't yet specified how to select all the variables in the model M(w).

Enter the devil's advocate AI, DAI.

If given a world w with model M(w), the job of the DAI is to highlight to humans all the ways that w can go wrong, in all the ways that are not already captured by M(w).

Specifically, the DAI needs to produce a description string s such that:

  1. s describes the difference between w_0 and w well; ie Acc(h, s, w_0, w) is true.
  2. s is not captured by the model difference; ie s is a more accurate description of the difference than the changes in the v_i (with their d_i) alone.
  3. The human agrees that s is an important fact (alternatively, we might want them to agree that s is an important and negative fact).

There may be a back-and-forth cycle with other AIs that defend w against the DAI, all of them using accurate descriptions, before the human agrees whether s is important or not.

If the DAI loses, we say that w is well-modelled by M(w).
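The devil's-advocacy check above can be sketched as a simple loop. The `dai_propose` and `human_judges_important` callables are hypothetical stand-ins for the DAI and the human judgement (the defending AIs are folded into the human's verdict for brevity); none of this is a proposed implementation.

```python
def well_modelled(w, w0, model_diff, dai_propose, human_judges_important,
                  max_rounds=10):
    """Return True iff the DAI fails to find an important fact about the
    difference between w and w0 that the model difference misses.

    dai_propose(w, w0, model_diff) -> a candidate description string s,
        or None if the DAI gives up.
    human_judges_important(s) -> the human's verdict on s, reached after
        any back-and-forth with defending AIs.
    """
    for _ in range(max_rounds):
        s = dai_propose(w, w0, model_diff)
        if s is None:                  # DAI concedes: w is well-modelled
            return True
        if human_judges_important(s):  # DAI found a real gap in the model
            return False
    return True  # DAI ran out of rounds without convincing the human
```

A DAI that immediately returns `None` certifies the world as well-modelled; one that surfaces even a single fact the human deems important defeats it.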

Cashing out the description

We're now ready to try and cash out the goal description g of "richer and less unequal". First of all, we translate it into a requirement on the variables v_i. We check whether this requirement translates well by comparing how humans interpret g versus how they interpret changes to the v_i. This gives a measure g(w) which counts how well the variables of w are moved in the direction of g, compared with w_0.

Then we can finally define the goal for the AI:

  1. w is well-modelled by M(w).
  2. w maximises/satisfices/quantilises g(w).

Note that the first requirement can be used to fix the variables in M(w): many variables make it easier to find well-described worlds (we may need to combine this with a prior to cut down the number of variables, to make sure it doesn't get too ridiculous).
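As a toy illustration of the two-part goal, here is one way the measure and the selection could fit together, with the goal maximised over candidate worlds that pass the well-modelled check. The direction table translating "richer and less unequal" into variable movements, and the numeric variable values, are illustrative assumptions.

```python
# Hypothetical translation of g into required directions on the v_i:
# "richer" -> wealth up, "less unequal" -> equality up.
GOAL_DIRECTIONS = {"wealth": +1, "equality": +1}

def g_measure(w_vars, w0_vars, directions=GOAL_DIRECTIONS):
    """Toy g(w): how far the variables of w moved in the direction the
    goal asks for, relative to our world w0."""
    return sum(sign * (w_vars[name] - w0_vars[name])
               for name, sign in directions.items())

def choose_world(candidates, w0_vars, is_well_modelled):
    """Pick the candidate maximising g(w), restricted to worlds that
    survive the devil's-advocacy check (requirement 1)."""
    ok = [w for w in candidates if is_well_modelled(w)]
    return max(ok, key=lambda w: g_measure(w, w0_vars)) if ok else None
```

Satisficing or quantilising would replace the `max` with a threshold check or a draw from the top fraction of candidates; the well-modelled filter is applied first either way.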
