Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Consider an AI that is trying to achieve a certain result in a toy world running on a computer. Compare two models of what the AI is and what it's trying to do: first, you could say the AI is a physical program on a computer, which is trying to cause the physical computer that the toy world is running on to enter a certain state. Alternatively, you could say that the AI is an abstract computational process which is trying to achieve certain results in another abstract computational process (the toy world) that it is interfacing with.

On the first view, if the AI is clever enough, it might figure out how to manipulate the outside world, by, for instance, hacking into other computers to gain more computing power. On the second view, the outside world is irrelevant to the AI's interests, since changing what's running on certain physical computers in the real world would have no effect on the idealized computational model that the AI is optimizing over, so the AI has no incentive to optimize over our world.
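To make the second model concrete, here is a minimal sketch, with all names purely illustrative: the objective is a function only of the toy world's state, so nothing about the host computer or the wider world appears anywhere in what is being optimized.

```python
import random

class ToyWorld:
    """A self-contained simulated world; its state is the only thing the objective can see."""
    def __init__(self):
        self.state = [0.0] * 10

    def step(self, action):
        # Toy "physics": each component of the action nudges the corresponding state variable.
        for i, a in enumerate(action):
            self.state[i] += a
        return list(self.state)

def objective(state):
    """Defined purely over toy-world state; nothing about the host machine appears here."""
    return -sum((s - 1.0) ** 2 for s in state)

def optimize(world_factory, n_candidates=200):
    """Random-search optimizer: every candidate is scored only by running a fresh toy world."""
    rng = random.Random(0)
    best_action, best_score = None, float("-inf")
    for _ in range(n_candidates):
        action = [rng.uniform(-1.0, 2.0) for _ in range(10)]
        score = objective(world_factory().step(action))
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score

best_action, best_score = optimize(ToyWorld)
print("best score found in the toy world:", round(best_score, 3))
```

Whether a far more capable optimizer built along these lines would in fact remain indifferent to the hardware it runs on is the question the rest of this post is concerned with.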

AIs for which the second model is more accurate seem generally safer than AIs for which the first model is more accurate. So trying to encourage AI development to follow the second model could help delay the development of dangerous AGI.

AIs following this model are limited in some ways. For instance, they could not be used to figure out how to prevent the development of other dangerous AGI, since this requires reasoning about what happens in the real world.

But such AIs could still be quite useful for many things, such as engineering. In order to use AIs optimizing over toy worlds to design things that are useful in the real world, we could make the toy world have physics and materials similar enough to our world that designs that work well in the toy world should be expected to also work well in the real world. We then take the designs the AI builds in the toy world, and replicate them in the real world. If they don't work in the real world, then we try to find the discrepancy between real-world physics and toy-world physics that accounts for it, fix the discrepancy, and try again.
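As a toy illustration of that loop (every name and number here is a stand-in: the "design" is a single beam thickness, and the discrepancy-finding step is reduced to patching one parameter, which in reality would be the hard, human-driven part):

```python
# Toy illustration: the "design" is a single beam thickness chosen against a modelled
# material strength; the real material is weaker than the model says, and the loop
# patches the model after the first failed build.

TRUE_STRENGTH = 40.0        # real-world material strength (not known to the designer)
REQUIRED_LOAD = 100.0

def optimize_design(toy_physics):
    """Pick the thinnest beam that the *toy* physics says will carry the load."""
    return REQUIRED_LOAD / toy_physics["strength"]

def build_and_test(thickness):
    """Replicate the design in the real world and check whether it actually works."""
    return thickness * TRUE_STRENGTH >= REQUIRED_LOAD

def fix_discrepancy(toy_physics):
    """Find the modelling error that explains the failure and fold it back into the toy world.
    (Here we just overwrite one number; in reality this is the hard, human-driven step.)"""
    toy_physics["strength"] = TRUE_STRENGTH
    return toy_physics

def design_transfer_loop():
    toy_physics = {"strength": 50.0}    # initial, slightly optimistic model of the material
    for attempt in range(10):
        thickness = optimize_design(toy_physics)
        if build_and_test(thickness):
            return attempt, thickness
        toy_physics = fix_discrepancy(toy_physics)
    return None

print(design_transfer_loop())           # (1, 2.5): succeeds after one model correction
```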

One possible thing that could go catastrophically wrong with this strategy is if the design the AI comes up with has an agent in it. If the AI designs this agent to figure out what sort of world it's in, rather than hard-coding the agent to believe it’s in the toy world that the original AI cares about, then this agent could, in the toy world, figure out the physics of the toy world and do something sensible, making it look like the design should work when we simulate it in the toy world. But then when we replicate the design in the real world, the agent that gets built with the design notices that it's in a big world with humans that can be manipulated, computers that can be hacked, and so on, and does those things instead of acting as expected.

This problem could be addressed by trying to design the AI in such a way that it would not come up with solutions that involve creating agents, or by figuring out how to reliably detect agents in designs, so that we know to reject those. An alternative approach would be to design the AI in such a way that it can create agents, but only agents that share the property of directing their optimization only toward the toy world and of building only agents with that same property.
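One way to picture that alternative is as a recursive admissibility check on candidate designs. In the sketch below, the `optimizes_only_toy_world` flag is simply assumed to be given; actually verifying such a property of an arbitrary design is the genuinely hard, open part of the proposal.

```python
class Design:
    """A candidate design produced in the toy world. `subagents` lists any agents it contains."""
    def __init__(self, name, subagents=(), optimizes_only_toy_world=True):
        self.name = name
        self.subagents = list(subagents)
        self.optimizes_only_toy_world = optimizes_only_toy_world

def is_admissible(design):
    """Recursive constraint: the design, and every agent it contains (and every agent those
    contain, and so on), must direct its optimization only at the toy world. Verifying the
    flag for a real design is the unsolved problem; here it is just an attribute."""
    if not design.optimizes_only_toy_world:
        return False
    return all(is_admissible(sub) for sub in design.subagents)

safe_sub = Design("thermostat-like controller")
unsafe_sub = Design("open-ended world-modelling planner", optimizes_only_toy_world=False)
print(is_admissible(Design("bridge", [safe_sub])))      # True
print(is_admissible(Design("factory", [unsafe_sub])))   # False
```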

Designing better AIs is an engineering task that an AI along these lines could be used for. Since in this case the explicit purpose is creating agents, it requires a solution to the problem of created agents acting in unintended ways, one that does not simply rule out creating agents at all. Instead, we would want to formalize this notion of optimizing only over toy models, so that it can be used as a design constraint for the AIs that we're asking our AI to design. If we can do this, then it would give us a possible route to a controlled intelligence explosion, in which the AI designs a more capable successor AI because that is the task it has been assigned, rather than for instrumental reasons, and humans can inspect the result and decide whether or not to run it.

16 comments

Alternatively, you could say that the AI is an abstract computational process which is trying to achieve certain results in another abstract computational process (the toy world) that it is interfacing with.

I expect there to be significant difficulties in achieving this in practice. We can design theoretical models that provide isolation, but implementation details often result in leaks in the abstraction. To give a simple example I know about, quantum cryptography is theoretically "perfect", but the first implementations had a fatal flaw due to the realities of transmitting qubits over the wire: the qubits were represented not by single particles but by multiple particles, which opened the door to a "siphoning" attack. We see similar issues with classical computer memory (in the form of row-hammer attacks) and spinning disks (in the form of head-skip attacks). Given that even a relatively low-powered AI was able to accidentally create a radio receiver (although, to be fair, in a seemingly not well sandboxed environment), we should expect something aiming for superintelligence not to be much hindered by our presenting it with a toy version of the world, since the toy world will almost certainly be leaky even if we don't notice the leaks ourselves.

Are you worried about leaks from the abstract computational process into the real world, leaks from the real world into the abstract computational process, or both? (Or maybe neither and I'm misunderstanding your concern?)

There will definitely be tons of leaks from the abstract computational process into the real world; just looking at the result is already such a leak. The point is that the AI should have no incentive to optimize such leaks, not that the leaks don't exist, so the existence of additional leaks that we didn't know about shouldn't be concerning.

Leaks from the outside world into the computational abstraction would be more concerning, since the whole point is to prevent those from existing. It seems like it should be possible to make hardware arbitrarily reliable by devoting enough resources to error detection and correction, which would prevent such leaks, though I'm not an expert, so it would be good to know if this is wrong. There may be other ways to get the AI to act similarly to the way it would in the idealized toy world even when hardware errors create small differences. This is certainly the sort of thing we would want to take seriously if hardware can't be made arbitrarily reliable.
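For what it's worth, one standard way to trade extra resources for reliability is redundant execution with majority voting, a software-level analogue of the hardware error correction mentioned above; a minimal sketch (illustrative only, and not a substitute for genuinely fault-tolerant hardware):

```python
from collections import Counter

def run_with_redundancy(computation, n_copies=5):
    """Run the same deterministic computation several times and take the majority answer,
    so that a rare hardware fault in one run does not leak into the result."""
    results = [computation() for _ in range(n_copies)]
    value, count = Counter(results).most_common(1)[0]
    if count <= n_copies // 2:
        raise RuntimeError("no majority; too many faulty runs")
    return value

# Example: one deterministic step of a toy-world simulation.
print(run_with_redundancy(lambda: sum(i * i for i in range(1000))))
```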

Incidentally, that story about accidental creation of a radio with an evolutionary algorithm was part of what motivated my post in the first place. If the evolutionary algorithm had used tests of its oscillator design in a computer model, rather than in the real world, then it would not have built a radio receiver, since radio signals from nearby computers would not have been included in the computer model of the environment, even though they were present in the actual environment.

I think that the separation between "AIs that care about the physical world" and "AIs that care only about the Platonic world" is not that clean in practice. The way I would expect an AGI optimizing a toy world to actually work is to run simulations of the toy world and look for simplified models of it that allow for feasible optimization. However, in doing so it can stumble across a model that contains our physical world together with the toy world. This model is false in the Platonic world, but testing it with a simulation (i.e. trying to exploit some leak in the box) will actually appear to confirm it (because the simulation is in fact running in the physical world rather than the Platonic world). Specifically, it seems to me that such a toy world is safe if and only if its description complexity is lower than the description complexity of physical world + toy world.
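Roughly, writing K(·) for description (Kolmogorov) complexity, the claimed condition is:

```latex
% K(.) = description (Kolmogorov) complexity of a program generating the world in question
\mathrm{safe}(\text{toy world})
\;\iff\;
K(\text{toy world}) \;<\; K(\text{physical world} + \text{toy world})
```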

The agent could be programmed to have a certain hard-coded ontology rather than searching through all possible hypotheses weighted by description length.

My point is, I don't think it's possible to implement a strong computationally feasible agent which doesn't search through possible hypotheses, because solving the optimization problem for the hard-coded ontology is intractable. In other words, what gives intelligence its power is precisely the search through possible hypotheses.

It's not obvious to me, even on the "optimizing an abstract computational process" model, why an AI would not want to get more compute -- it can use this compute for itself, without changing the abstract computational process it is optimizing, and it will probably do better. It seems that if you want to get this to work, you need the AI to want to compute the result of running itself, without any modification or extra compute, on the virtual world. This feels very hard to me.

Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.

The model I had in mind was that the AI and the toy world are both abstract computational processes with no causal influence from our world, and that we are merely simulating/spectating on both the AI itself and the toy world it optimizes. If the AI messes with the people simulating it so that they end up simulating a similar AI with more compute, this can give it more influence over those people's simulation of the toy world the AI is optimizing, but it doesn't give the AI any more influence over the abstract computational process that it (another abstract computational process) was interfacing with and optimizing over.

Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.

Yes, this could be difficult, and would likely limit what we could do, but I don't see why this would prevent us from getting anything useful out of a virtual-world-optimizer. Lots of engineering tasks don't require more explicit physics knowledge than we already have.

This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.)

Yes, this could be difficult, and would likely limit what we could do, but I don't see why this would prevent us from getting anything useful out of a virtual-world-optimizer. Lots of engineering tasks don't require more explicit physics knowledge than we already have.

I think there's a lot of common sense that humans apply that allows them to design solutions that meet many implicit constraints that they can't easily verbalize. "Thinking outside of the box" is when a human manages to design something that doesn't satisfy one of the constraints, because it turns out that constraint wasn't useful. But in most cases, those constraints are very useful, because they make the search space much smaller. By default, these constraints won't carry over into the virtual world.

(Lots of examples of this in The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities)

This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.)

Ok, here's another attempt to explain what I meant. Somewhere in the platonic realm of abstract mathematical structures, there is a small world with physics quite a lot like ours, containing an AI running on some idealized computational hardware and trying to arrange the rest of the small world so that it has some desired property. Humans simulate this process so they can see what the AI does in the small world, and copy what it does. The AI could try messing with us spectators, so that we end up giving more compute to the physical instantiation of the AI in the human world (which is distinct from the AI in the platonic mathematical structure), and that physical instantiation could use the extra compute to better manipulate the simulation of the toy world that we are running in the human world (which is also distinct from the platonic mathematical structure). But the platonic mathematical structure itself does not contain a human world with extra compute that can be grabbed, so within that structure, messing with human spectators just ends up being a waste of compute, and the strategy will be discarded if it somehow gets considered in the first place. Thus a real-world simulation of this AI-in-a-platonic-mathematical-structure will, if accurate, behave in the same way.

Ah, I see. That does make it seem clearer to me, though I'm not sure what beliefs actually changed.

I mentioned this construction on the Agent Foundations forum last year. (The idea that which worlds an agent cares about is an aspect of its preferences is folklore by now. This naturally allows not caring about particular worlds, if nothing in the worlds that such an agent cares about depends on those worlds.)

This happens automatically in the more tractable decision theory setups where we don't let the agent potentially care about everything (no universal priors, etc.), so maybe also no optimization daemons. It's a desirable property for the theory, but probably incompatible with following human values.

If we can do this, then it would give us a possible route to a controlled intelligence explosion, in which the AI designs a more capable successor AI because that is the task it has been assigned, rather than for instrumental reasons, and humans can inspect the result and decide whether or not to run it.

How would humans decide whether something designed by a superintelligent AI is safe to run? It doesn't sound safe by design: even if we rule out safety-compromising divergence in the toy intelligence explosion, how would we know that the successor AI is safe for the real world? Nobody has designed it for that. We certainly shouldn't think that we can catch potential problems with the design by our own inspection -- we can't even do that reliably for designs produced by non-superintelligent software developers.

This reminds me of the problems with boxed AIs: all is good as long as they can't affect the real world, but that is a limitation on their usefulness, and if they are superintelligent we might not see them leaking out of the box.

I view this as a capability control technique, highly analogous to running a supervised learning algorithm where a reinforcement learning algorithm is expected to perform better. Intuitively, it seems like there should be a spectrum of options between (e.g.) supervised learning and reinforcement learning that would allow one to make more fine-grained safety-performance trade-offs.

I'm very optimistic about this approach of doing "capability control" by making less agent-y AI systems. If done properly, I think it could allow us to build systems that have no instrumental incentives to create subagents (although we'd still need to worry about "accidental" creation of subagents and (e.g. evolutionary) optimization pressures for their creation).

I would like to see this fleshed out as much as possible. This idea is somewhat intuitive, but it's hard to tell if it is coherent, or how to formalize it.

P.S. Is this the same as "platonic goals"? Could you include references to previous thought on the topic?

I haven't heard the term "platonic goals" before. There's been plenty written on capability control before, but I don't know of anything written before on the strategy I described in this post (although it's entirely possible that there's been previous writing on the topic that I'm not aware of).

design the AI in such a way that it can create agents, but only

This sort of argument would be much more valuable if accompanied by a specific recipe for how to do it, or at least a proof that one must exist. Why worry about the AI designing agents; why not just "design it in such a way" that it's already Friendly!

I agree. I didn't mean to imply that I thought this step would be easy, and I would also be interested in more concrete ways of doing it. It's possible that creating a hereditarily restricted optimizer along the lines I was suggesting could end up being approximately as difficult as creating an aligned general-purpose optimizer, but I intuitively don't expect this to be the case.