Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Consider an AI that is trying to achieve a certain result in a toy world running on a computer. Compare two models of what the AI is and what it's trying to do: first, you could say the AI is a physical program on a computer, which is trying to cause the physical computer that the toy world is running on to enter a certain state. Alternatively, you could say that the AI is an abstract computational process which is trying to achieve certain results in another abstract computational process (the toy world) that it is interfacing with.

On the first view, if the AI is clever enough, it might figure out how to manipulate the outside world, by, for instance, hacking into other computers to gain more computing power. On the second view, the outside world is irrelevant to the AI's interests, since changing what's running on certain physical computers in the real world would have no effect on the idealized computational model that the AI is optimizing over, so the AI has no incentive to optimize over our world.
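To make the second model concrete, here is a minimal sketch, with all names purely illustrative: the objective is a function only of the toy world's state, so nothing about the host computer or the wider world appears anywhere in what is being optimized.

```python
import random

class ToyWorld:
    """A self-contained simulated world; its state is the only thing the objective can see."""
    def __init__(self):
        self.state = [0.0] * 10

    def step(self, action):
        # Toy "physics": each component of the action nudges the corresponding state variable.
        for i, a in enumerate(action):
            self.state[i] += a
        return list(self.state)

def objective(state):
    """Defined purely over toy-world state; nothing about the host machine appears here."""
    return -sum((s - 1.0) ** 2 for s in state)

def optimize(world_factory, n_candidates=200):
    """Random-search optimizer: every candidate is scored only by running a fresh toy world."""
    rng = random.Random(0)
    best_action, best_score = None, float("-inf")
    for _ in range(n_candidates):
        action = [rng.uniform(-1.0, 2.0) for _ in range(10)]
        score = objective(world_factory().step(action))
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score

best_action, best_score = optimize(ToyWorld)
print("best score found in the toy world:", round(best_score, 3))
```

Whether a far more capable optimizer built along these lines would in fact remain indifferent to the hardware it runs on is the question the rest of this post is concerned with.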

AIs for which the second model is more accurate seem generally safer than AIs for which the first model is more accurate. So trying to encourage AI development to follow the second model could help delay the development of dangerous AGI.

AIs following this model are limited in some ways. For instance, they could not be used to figure out how to prevent the development of other dangerous AGI, since this requires reasoning about what happens in the real world.

But such AIs could still be quite useful for many things, such as engineering. In order to use AIs optimizing over toy worlds to design things that are useful in the real world, we could make the toy world have physics and materials similar enough to our world that designs that work well in the toy world should be expected to also work well in the real world. We then take the designs the AI builds in the toy world, and replicate them in the real world. If they don't work in the real world, then we try to find the discrepancy between real-world physics and toy-world physics that accounts for it, fix the discrepancy, and try again.
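As a toy illustration of that loop (every name and number here is a stand-in: the "design" is a single beam thickness, and the discrepancy-finding step is reduced to patching one parameter, which in reality would be the hard, human-driven part):

```python
# Toy illustration: the "design" is a single beam thickness chosen against a modelled
# material strength; the real material is weaker than the model says, and the loop
# patches the model after the first failed build.

TRUE_STRENGTH = 40.0        # real-world material strength (not known to the designer)
REQUIRED_LOAD = 100.0

def optimize_design(toy_physics):
    """Pick the thinnest beam that the *toy* physics says will carry the load."""
    return REQUIRED_LOAD / toy_physics["strength"]

def build_and_test(thickness):
    """Replicate the design in the real world and check whether it actually works."""
    return thickness * TRUE_STRENGTH >= REQUIRED_LOAD

def fix_discrepancy(toy_physics):
    """Find the modelling error that explains the failure and fold it back into the toy world.
    (Here we just overwrite one number; in reality this is the hard, human-driven step.)"""
    toy_physics["strength"] = TRUE_STRENGTH
    return toy_physics

def design_transfer_loop():
    toy_physics = {"strength": 50.0}    # initial, slightly optimistic model of the material
    for attempt in range(10):
        thickness = optimize_design(toy_physics)
        if build_and_test(thickness):
            return attempt, thickness
        toy_physics = fix_discrepancy(toy_physics)
    return None

print(design_transfer_loop())           # (1, 2.5): succeeds after one model correction
```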

One possible thing that could go catastrophically wrong with this strategy is if the design the AI comes up with has an agent in it. If the AI designs this agent to figure out what sort of world it's in, rather than hard-coding the agent to believe it’s in the toy world that the original AI cares about, then this agent could, in the toy world, figure out the physics of the toy world and do something sensible, making it look like the design should work when we simulate it in the toy world. But then when we replicate the design in the real world, the agent that gets built with the design notices that it's in a big world with humans that can be manipulated, computers that can be hacked, and so on, and does those things instead of acting as expected.

This problem could be addressed by trying to design the AI in such a way that it would not come up with solutions that involve creating agents, or by figuring out how to reliably detect agents in designs, so that we know to reject those. An alternative approach would be to design the AI in such a way that it can create agents, but only agents that share the property of directing their optimization only toward the toy world and of building only agents with that same property.
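One way to picture that alternative is as a recursive admissibility check on candidate designs. In the sketch below, the `optimizes_only_toy_world` flag is simply assumed to be given; actually verifying such a property of an arbitrary design is the genuinely hard, open part of the proposal.

```python
class Design:
    """A candidate design produced in the toy world. `subagents` lists any agents it contains."""
    def __init__(self, name, subagents=(), optimizes_only_toy_world=True):
        self.name = name
        self.subagents = list(subagents)
        self.optimizes_only_toy_world = optimizes_only_toy_world

def is_admissible(design):
    """Recursive constraint: the design, and every agent it contains (and every agent those
    contain, and so on), must direct its optimization only at the toy world. Verifying the
    flag for a real design is the unsolved problem; here it is just an attribute."""
    if not design.optimizes_only_toy_world:
        return False
    return all(is_admissible(sub) for sub in design.subagents)

safe_sub = Design("thermostat-like controller")
unsafe_sub = Design("open-ended world-modelling planner", optimizes_only_toy_world=False)
print(is_admissible(Design("bridge", [safe_sub])))      # True
print(is_admissible(Design("factory", [unsafe_sub])))   # False
```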

Designing better AIs is an engineering task that an AI along these lines could be used for. Since in this case the explicit purpose is creating agents, it requires a solution to the problem of created agents acting in unintended ways, one that does not simply rule out creating agents at all. Instead, we would want to formalize this notion of optimizing only over toy models, so that it can be used as a design constraint for the AIs that we're asking our AI to design. If we can do this, then it would give us a possible route to a controlled intelligence explosion, in which the AI designs a more capable successor AI because that is the task it has been assigned, rather than for instrumental reasons, and humans can inspect the result and decide whether or not to run it.

16 comments

Alternatively, you could say that the AI is an abstract computational process which is trying to achieve certain results in another abstract computational process (the toy world) that it is interfacing with.

I expect there to be significant difficulties in achieving this in practice. We can design theoretical models that provide isolation, but implementation details often result in leaks in the abstraction. To give a simple example I know about, quantum cryptography is theoretically "perfect", but the first implementations had a fatal flaw due to the realities of transmitting qubits over the wire: the qubits were represented not by single particles but by multiple particles, which opened the door to a "siphoning" attack. We see similar issues with classical computer memory (in the form of row-hammer attacks) and spinning disks (in the form of head-skip attacks). Given that even a relatively low-powered AI was able to accidentally create a radio receiver (although, to be fair, in a seemingly not well sandboxed environment), we should expect something aiming for superintelligence not to be much hindered by our presenting it with a toy version of the world, since the toy world will almost certainly be leaky even if we don't notice the leaks ourselves.

Are you worried about leaks from the abstract computational process into the real world, leaks from the real world into the abstract computational process, or both? (Or maybe neither and I'm misunderstanding your concern?)

There will definitely be tons of leaks from the abstract computational process into the real world; just looking at the result is already such a leak. The point is that the AI should have no incentive to optimize such leaks, not that the leaks don't exist, so the existence of additional leaks that we didn't know about shouldn't be concerning.

Leaks from the outside world into the computational abstraction would be more concerning, since the whole point is to prevent those from existing. It seems like it should be possible to make hardware arbitrarily reliable by devoting enough resources to error detection and correction, which would prevent such leaks, though I'm not an expert, so it would be good to know if this is wrong. There may be other ways to get the AI to act similarly to the way it would in the idealized toy world even when hardware errors create small differences. This is certainly the sort of thing we would want to take seriously if hardware can't be made arbitrarily reliable.
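For what it's worth, one standard way to trade extra resources for reliability is redundant execution with majority voting, a software-level analogue of the hardware error correction mentioned above; a minimal sketch (illustrative only, and not a substitute for genuinely fault-tolerant hardware):

```python
from collections import Counter

def run_with_redundancy(computation, n_copies=5):
    """Run the same deterministic computation several times and take the majority answer,
    so that a rare hardware fault in one run does not leak into the result."""
    results = [computation() for _ in range(n_copies)]
    value, count = Counter(results).most_common(1)[0]
    if count <= n_copies // 2:
        raise RuntimeError("no majority; too many faulty runs")
    return value

# Example: one deterministic step of a toy-world simulation.
print(run_with_redundancy(lambda: sum(i * i for i in range(1000))))
```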

Incidentally, that story about accidental creation of a radio with an evolutionary algorithm was part of what motivated my post in the first place. If the evolutionary algorithm had used tests of its oscillator design in a computer model, rather than in the real world, then it would not have built a radio receiver, since radio signals from nearby computers would not have been included in the computer model of the environment, even though they were present in the actual environment.

I think that the separation between "AIs that care about the physical world" and "AIs that care only about the Platonic world" is not that clean in practice. The way I would expect an AGI optimizing a toy world to actually work is to run simulations of the toy world and look for simplified models of it that allow for feasible optimization. However, in doing so it can stumble across a model that contains our physical world together with the toy world. This model is false in the Platonic world, but testing it with a simulation (i.e. trying to exploit some leak in the box) will actually appear to confirm it (because the simulation is in fact running in the physical world rather than the Platonic world). Specifically, it seems to me that such a toy world is safe if and only if its description complexity is lower than the description complexity of physical world + toy world.
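Roughly, writing K(·) for description (Kolmogorov) complexity, the claimed condition is:

```latex
% K(.) = description (Kolmogorov) complexity of a program generating the world in question
\mathrm{safe}(\text{toy world})
\;\iff\;
K(\text{toy world}) \;<\; K(\text{physical world} + \text{toy world})
```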

The agent could be programmed to have a certain hard-coded ontology rather than searching through all possible hypotheses weighted by description length.

My point is, I don't think it's possible to implement a strong computationally feasible agent which doesn't search through possible hypotheses, because solving the optimization problem for the hard-coded ontology is intractable. In other words, what gives intelligence its power is precisely the search through possible hypotheses.

It's not obvious to me, even on the "optimizing an abstract computational process" model, why an AI would not want to get more compute -- it can use this compute for itself, without changing the abstract computational process it is optimizing, and it will probably do better. It seems that if you want to get this to work, you need the AI to want to compute the result of running itself, without any modification or extra compute, on the virtual world. This feels very hard to me.

Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.

The model I had in mind was that the AI and the toy world are both abstract computational processes with no causal influence from our world, and that we are merely simulating/spectating on both the AI itself and the toy world it optimizes. If the AI messes with the people simulating it so that they end up simulating a similar AI with more compute, this can give it more influence over those people's simulation of the toy world the AI is optimizing, but it doesn't give the AI any more influence over the abstract computational process that it (another abstract computational process) was interfacing with and optimizing over.

Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.

Yes, this could be difficult, and would likely limit what we could do, but I don't see why this would prevent us from getting anything useful out of a virtual-world-optimizer. Lots of engineering tasks don't require more explicit physics knowledge than we already have.

This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.)

Yes, this could be difficult, and would likely limit what we could do, but I don't see why this would prevent us from getting anything useful out of a virtual-world-optimizer. Lots of engineering tasks don't require more explicit physics knowledge than we already have.

I think there's a lot of common sense that humans apply that allows them to design solutions that meet many implicit constraints that they can't easily verbalize. "Thinking outside of the box" is when a human manages to design something that doesn't satisfy one of the constraints, because it turns out that constraint wasn't useful. But in most cases, those constraints are very useful, because they make the search space much smaller. By default, these constraints won't carry over into the virtual world.

(Lots of examples of this in The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities)

This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.)

Ok, here's another attempt to explain what I meant. Somewhere in the platonic realm of abstract mathematical structures, there is a small world with physics quite a lot like ours, containing an AI running on some idealized computational hardware and trying to arrange the rest of the small world so that it has some desired property. Humans simulate this process so they can see what the AI does in the small world, and copy what it does. The AI could try messing with us spectators, so that we end up giving more compute to the physical instantiation of the AI in the human world (which is distinct from the AI in the platonic mathematical structure), and that physical instantiation could use the extra compute to better manipulate the simulation of the toy world that we are running in the human world (which is also distinct from the platonic mathematical structure). But the platonic mathematical structure itself does not contain a human world with extra compute that can be grabbed, so within that structure, messing with human spectators just ends up being a waste of compute, and the strategy will be discarded if it somehow gets considered in the first place. Thus a real-world simulation of this AI-in-a-platonic-mathematical-structure will, if accurate, behave in the same way.

Ah, I see. That does make it seem clearer to me, though I'm not sure what beliefs actually changed.

I mentioned this construction on the Agent Foundations forum last year. (The idea that which worlds an agent cares about is an aspect of its preferences is folklore by now. This naturally allows not caring about particular worlds, if nothing in the worlds that such an agent cares about depends on those worlds.)

This happens automatically in the more tractable decision theory setups where we don't let the agent potentially care about everything (no universal priors, etc.), so maybe also no optimization daemons. It's a desirable property for the theory, but probably incompatible with following human values.

If we can do this, then it would give us a possible route to a controlled intelligence explosion, in which the AI designs a more capable successor AI because that is the task it has been assigned, rather than for instrumental reasons, and humans can inspect the result and decide whether or not to run it.

How would humans decide whether something designed by a superintelligent AI is safe to run? It doesn't sound safe by design: even if we rule out safety-compromising divergence in the toy intelligence explosion, how would we know that the successor AI is safe for the real world? Nobody has designed it for that. We certainly shouldn't think that we can catch potential problems with the design by our own inspection -- we can't even do that reliably for designs produced by non-superintelligent software developers.

This reminds me of the problems with boxed AIs: all is good as long as they can't affect the real world, but that is a limitation on their usefulness, and if they are superintelligent we might not see them leaking out of the box.

I view this as a capability control technique, highly analogous to running a supervised learning algorithm where a reinforcement learning algorithm is expected to perform better. Intuitively, it seems like there should be a spectrum of options between (e.g.) supervised learning and reinforcement learning that would allow one to make more fine-grained safety-performance trade-offs.

I'm very optimistic about this approach of doing "capability control" by making less agent-y AI systems. If done properly, I think it could allow us to build systems that have no instrumental incentives to create subagents (although we'd still need to worry about "accidental" creation of subagents and (e.g. evolutionary) optimization pressures for their creation).

I would like to see this fleshed out as much as possible. This idea is somewhat intuitive, but it's hard to tell if it is coherent, or how to formalize it.

P.S. Is this the same as "platonic goals"? Could you include references to previous thought on the topic?

I haven't heard the term "platonic goals" before. There's been plenty written on capability control before, but I don't know of anything written before on the strategy I described in this post (although it's entirely possible that there's been previous writing on the topic that I'm not aware of).

design the AI in such a way that it can create agents, but only

This sort of argument would be much more valuable if accompanied by a specific recipe for how to do it, or at least a proof that one must exist. Why worry about the AI designing agents; why not just "design it in such a way" that it's already Friendly!

I agree. I didn't mean to imply that I thought this step would be easy, and I would also be interested in more concrete ways of doing it. It's possible that creating a hereditarily restricted optimizer along the lines I was suggesting could end up being approximately as difficult as creating an aligned general-purpose optimizer, but I intuitively don't expect this to be the case.