
A putative new idea for AI control; index here.

This is a reworking of the previous models of AIs in virtual worlds, done after discussing the idea with Jessica.

The basic idea is that there are a lot of concepts about AI security - such as sealed boxes, Cartesian boundaries, and so on - that are only partially true. They are simplified models, abstractions, and, even if they are true in our model of physics (an absurdly optimistic claim), there's always the possibility (certainty) that our models are wrong or incomplete.

But there is one place where models are exactly true, and that is... within the model itself. And a computer program, or a virtual machine, is a model, a model we try to make happen exactly in the real world.


Is my world model true?

Consider an AI, running inside a virtual machine. The AI has an internal model of the entire virtual machine, including itself, and knows the rules of the virtual machine/model.

In fact the AI entertains two hypotheses:

  • $H_0$: the model is entirely correct about the (virtual) world.

  • $H_1$: sometimes, random bits are flipped in the (virtual) world.

Then give the AI some utility function $u$, between 0 and 1, defined over the virtual world. The AI also has a shutdown button $B$ that it predicts will shut down the world; this $B$ sends a signal to us and will stop the execution of the virtual world, AI included. Let $\gamma$ be some discount factor close to $1$, and $t_B$ the time step at which $B$ was pressed (if it was). Then the AI tries to maximise a utility of the form

$$U \;=\; \gamma^{t_B}\cdot\begin{cases} u & \text{if } H_0 \text{ holds}\\ 1 & \text{if } H_1 \text{ holds,}\end{cases}$$

with $\gamma^{t_B}$ taken to be $0$ if $B$ is never pressed.

Now the AI is motivated to maximise $u$ as long as it thinks that the world model it has is correct, and to press $B$ as soon as it feels it has achieved that. But as soon as it suspects enough that the model of the virtual world is incorrect - i.e. as soon as its credence in $H_1$ gets high enough - it will press $B$ (the exact definition of $U$ could probably be improved; there might be unfortunate trade-offs at the moment).
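To make the trade-off concrete, here is a minimal sketch of that decision rule in Python, assuming the utility form above; the function names, the credence p_h1, and the expected_gain parameter are all illustrative, not part of the proposal.

```python
# Illustrative sketch only: assumes U = gamma^t_B * u under H_0 and gamma^t_B under H_1,
# as reconstructed above. All names and numbers are hypothetical.

GAMMA = 0.999  # discount factor close to 1


def value_of_pressing_now(t, p_h1, current_u):
    """Expected utility if the AI presses B at time step t."""
    return GAMMA ** t * ((1 - p_h1) * current_u + p_h1 * 1.0)


def value_of_continuing(t, p_h1, current_u, expected_gain):
    """Expected utility of working one more step, then pressing B at t + 1.

    expected_gain: how much the AI expects to raise u by continuing;
    this only pays off in the worlds where the model is correct (H_0).
    """
    improved_u = min(current_u + expected_gain, 1.0)
    return GAMMA ** (t + 1) * ((1 - p_h1) * improved_u + p_h1 * 1.0)


def should_press(t, p_h1, current_u, expected_gain):
    """Press B as soon as continuing no longer beats pressing immediately."""
    return value_of_pressing_now(t, p_h1, current_u) >= value_of_continuing(
        t, p_h1, current_u, expected_gain
    )


# Strong suspicion of bit flips / outside interference -> press:
print(should_press(t=10, p_h1=0.99, current_u=0.2, expected_gain=0.05))   # True
# Confident the model is exact, and u can still be improved -> keep working:
print(should_press(t=10, p_h1=0.001, current_u=0.2, expected_gain=0.05))  # False
```

Under this form the trade-offs are visible: with $\gamma$ very close to 1, the AI only presses early when its credence in $H_1$ is high or the remaining gains in $u$ are small.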

What does $H_1$ translate to from our perspective? Well, the only thing that can cause "random bit flips" in the virtual world is if the program is not running correctly: if our real world is somehow interfering with the program. Therefore this AI is motivated to turn itself off as soon as it detects real-world influence in its model. This includes its self-model: if it finds itself developing unusual abilities not explainable within the world model, it will increase the probability of $H_1$.
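As a toy illustration of that update (the numbers are made up; the point is only the direction of the effect):

```python
# Toy Bayesian update: an observation that the exact world model H_0 says is
# (nearly) impossible pushes almost all credence onto the bit-flip hypothesis H_1,
# even starting from a tiny prior. All numbers are illustrative.

def posterior_h1(prior_h1, p_obs_given_h0, p_obs_given_h1):
    joint_h1 = prior_h1 * p_obs_given_h1
    joint_h0 = (1 - prior_h1) * p_obs_given_h0
    return joint_h1 / (joint_h1 + joint_h0)

# An "unusual ability" that the world model cannot explain:
print(posterior_h1(prior_h1=1e-6, p_obs_given_h0=1e-12, p_obs_given_h1=1e-3))
# ~0.999 -> the AI now strongly suspects outside interference.
```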

What we could use the AI for

It's interesting to see what we could use the AI for. If we have a detailed model of physics or of biology, we could ask it to solve many questions for us. Typically we wouldn't be implementing the answer, but looking at how the AI would solve it, and reusing that solution ourselves.

But we don't need to have detailed world-models to get useful work. For instance, we could give the AI libraries of potential drugs, the details of a site on a protein we would want the drug to bind to, and let it figure out a list of drugs likely to bind in that way.

Of course, the ones that would bind would be ones that would bind in the model, and not necessarily in reality. Nevertheless, if we looked at the top 1000 of the AI's suggestions, we'd probably find one that did bind in reality - or we'd find indications that our molecule interaction models are woefully bad.
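As a sketch of how that screening step might look (the drug library, the binding site, and the scoring function are all placeholders; the post doesn't specify any interface):

```python
# Hypothetical sketch: rank candidate drugs by an in-model binding score and keep
# the top k for real-world testing. binding_score_in_model stands in for whatever
# the AI computes inside its virtual chemistry model.

def top_candidates(drug_library, binding_site, binding_score_in_model, k=1000):
    """Return the k drugs the model predicts bind best to the given site."""
    scored = [(binding_score_in_model(drug, binding_site), drug) for drug in drug_library]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [drug for _, drug in scored[:k]]
```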

In general, this sort of "testing models to destruction" is what we could use the AI for. We have a model - of molecules, of planets, of social structures - and an objective, and we want the AI to achieve that objective within the model.

Notice that we can do whatever we want with the model - it's perfectly plausible to have two levels of description and "bridging laws" between them. For instance, we could have models of biology (fundamental unit: the cell?) and epidemiology (fundamental unit: the agent/human?) as long as we have some law connecting them ("when the agent gets infected, it causes these changes in the cells; when the cells collapse, the agent state moves to 'dead'...").
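A minimal sketch of such a two-level model, with made-up class names, field names, and thresholds:

```python
# Minimal sketch of a two-level model joined by bridging laws, following the
# cell/agent example above. Every name and threshold here is invented.

from dataclasses import dataclass, field


@dataclass
class Cell:
    infected: bool = False
    collapsed: bool = False


@dataclass
class Agent:
    cells: list = field(default_factory=lambda: [Cell() for _ in range(10)])
    state: str = "healthy"  # "healthy", "infected", or "dead"


def bridge_down(agent: Agent) -> None:
    """Bridging law, top-down: when the agent gets infected, its cells change."""
    if agent.state == "infected":
        for cell in agent.cells:
            cell.infected = True


def bridge_up(agent: Agent) -> None:
    """Bridging law, bottom-up: when enough cells collapse, the agent state moves to 'dead'."""
    if sum(cell.collapsed for cell in agent.cells) > len(agent.cells) // 2:
        agent.state = "dead"
```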

Looking at the AI's solutions

It may propose incrementally better solutions, or brilliant but reasonable solutions to our objective. But every so often, it will propose some perverse instantiation that will break the model ("set interest rates to -1%..."). And then we'll have to figure out whether this is a brilliant insight, or a flaw in the model.

If it is a flaw, we can patch the model (hopefully learning something along the way) and re-run the AI. If it's brilliant, we can figure out whether it's the kind of brilliance that would actually work ("well, negative interest rates can be achieved in effect if we do X...").
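A hypothetical version of that patch-and-re-run loop, with run_ai, inspect, and patch standing in for whatever tools and human judgement are actually involved:

```python
# Hypothetical outer loop for "testing models to destruction": run the AI on the
# current model, have humans inspect its solution, patch the model if the solution
# exploits a flaw, and re-run. All function arguments are placeholders.

def refine_model(model, objective, run_ai, inspect, patch, max_rounds=10):
    for _ in range(max_rounds):
        solution = run_ai(model, objective)
        verdict = inspect(solution)         # human judgement: "flaw" or "insight"
        if verdict == "flaw":
            model = patch(model, solution)  # fix the model, hopefully learning something
        else:
            return model, solution          # a genuine (in-model) insight worth studying
    return model, None                      # gave up after too many perverse instantiations
```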

This involves us looking at an AI's perverse instantiation, which is something that's generally unwise to do. However, this perverse instantiation is not aimed at exploiting us - the AI doesn't care about our world at all, and wouldn't waste time or resources exploiting us.

There still is a danger, if there are human-like minds inside the virtual world, and the AI is aiming to manipulate them. This is a reason not to put human-like minds inside the virtual world! (Though if for some reason we wanted something like that, we should make use of things like "Cartesian boundaries" and so on, which would actually work within the model).

In general, the richer the virtual world and the closer to the real world that it is, the greater the danger. Suppose that, for instance, the AI is given the detailed biological and epidemiological information mentioned above, and we want the AI to "cure the flu". Then it creates a design for a virus. We test the design, and it infects people and rewires their DNA to make them almost immune to the flu. Success! So we implement the virus/vaccine. However, it turns out that the virus gives partial immunity and then kills everyone a few years later, thus "curing the flu".

So just thinking "ah, this solution works this way" is not enough; we need to ensure that that is the only way it works. The more distant the virtual world is from the real one, the more certain we can be that there is no hidden extra perverse instantiation that we can't see. Though note that in the example above, we could have a model definition for "the agent is dead" and check whether this is the way the AI achieves its goals.
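For the flu example, such an in-model check might look like the following sketch; the simulate function, the agent states, and the horizon are all assumptions:

```python
# Hypothetical in-model audit of a proposed solution: simulate it forward and
# verify the flu disappears without any agent ever reaching the "dead" state.
# simulate, the state format, and the horizon are assumptions, not from the post.

def cures_flu_acceptably(model, solution, simulate, horizon=100_000):
    """Run the whole horizon (years of model time) so that delayed deaths are caught too."""
    cured = False
    for state in simulate(model, solution, steps=horizon):
        if any(agent.state == "dead" for agent in state.agents):
            return False  # the objective is being reached the wrong way
        cured = all(agent.state != "infected" for agent in state.agents)
    return cured          # flu gone at the end, and nobody died along the way
```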

Partially aware of the real world

Old idea that doesn't work:

It might be even better if the AI could be aware of real-world influence, but discount this knowledge. My initial idea was of an AI that would want to "find the action with the highest expected $u$ that could be computed by machine A", along with the useful indication "I am machine A", and hence use the assumption P = "if I calculate it, then machine A can", but "if I get modified, then P is no longer true, so I will not attempt to modify myself".
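Spelled out in notation that is not from the original (introduced here only to make the structure explicit), the idea was roughly

$$a^{*} \;=\; \arg\max_{a \,\in\, \mathcal{A}_A} \; \mathbb{E}\left[\, u \mid a,\; P \,\right], \qquad \mathcal{A}_A = \{\text{actions whose selection machine } A \text{ could compute}\},$$

where the expectation is taken under the assumption $P$ = "if I calculate it, then machine $A$ can"; since $P$ fails once the agent is modified, the agent concludes it should not attempt to modify itself.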

Ideally, this agent could then continue to function even if real world influences were brought to bear, and would carefully segment its use of knowledge so that it didn't use any real world knowledge to reach its conclusions.

Unfortunately, the idea doesn't work in that form: "I will not attempt to modify myself" also means "I will attempt to prevent myself from being modified", so the AI has real-world preferences, and may dedicate an isolated subsystem of itself to manipulating the real world.

I'm still recording the old idea here, in case anything can be used from it.
