I've updated towards AI boxing being surprisingly easy

[-]Jeff Rose3y80

The problem is twofold. One, as and to the extent AI proliferates, you will eventually find someone who is less capable and careful about their sandboxing. Two, relatedly and more importantly, for much the same reason that people will not be satisfied with AIs without agency, they will weaken the sandboxing.

The STEM AI proposal referred to above can be used to illustrate this. If you want the AGI to do theoretical math you don't need to tell it anything about the world. If you want it to cure cancer, you need to give it a lot of information about physics, chemistry and mammalian biology. And if you want it to win the war or the election, then you have to tell it about human society and how it works. And, as it competes with others, whoever has more real time and complete data is likely to win.

[-]Noosphere893y10

If you want it to cure cancer, you need to give it a lot of information about physics, chemistry and mammalian biology.

This is much, much safer than elections or wars, since we can basically prevent it from learning human models.

And I should made this explicit, but I believe sandboxing can be done in such a way that it basically incurs no performance penalty.

That is, I believe AI sandboxing is one of the most competitive proposals here that reduces the risk to arguably 0, in the STEM AI case.

[-]Yair Halberstadt3y50

We are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon where atoms obey the Bohr model.

In practice this isn't going to be nearly as useful as an AI which does have access to the wealth of human knowledge. So whilst this might be useful for some sort of pivotal act, it's not going to be a practical long term option for AI security.

Can you even think of a pivotal act an AI can help with that satisfies this criteria?

[-]Noosphere893y40

STEM AI is one such plan that could be done with a secure sandbox, as long as we don't give it data on humans or human models, or at least giving it the least amount of data that is necessary, and we can prevent escalation from sandboxing it. Thus we control the data sources.

From Evhub's post:

STEM AI is a very simple proposal in a similar vein to microscope AI. Whereas the goal of microscope AI was to avoid the potential problems inherent in building agents, the goal of STEM AI is to avoid the potential problems inherent in modeling humans. Specifically, the idea of STEM AI is to train a model purely on abstract science, engineering, and/or mathematics problems while using transparency tools to ensure that the model isn’t thinking about anything outside its sandbox.

This approach has the potential to produce a powerful AI system—in terms of its ability to solve STEM problems—without relying on any human modeling. Not modeling humans could then have major benefits such as ensuring that the resulting model doesn’t have the ability to trick us to nearly the same extent as if it possessed complex models of human behavior. For a more thorough treatment of why avoiding human modeling could be quite valuable, see Ramana Kumar and Scott Garrabrant’s “Thoughts on Human Models.”

Evhub's post is here:

https://www.lesswrong.com/posts/fRsjBseRuvRhMPPE5/an-overview-of-11-proposals-for-building-safe-advanced-ai

And the Thoughts on Human Models post is here:

https://www.lesswrong.com/posts/BKjJJH2cRpJcAnP7T/thoughts-on-human-models

[-]Yair Halberstadt3y141

Note that giving abstract STEM problems is very unlikely to give zero anthropological information to an AI. The very format and range of the problems is likely to reveal information about both human technology and human psychology.

Now I still agree that's much more secure than giving it all the information it needs, but the claim of zero bits is pushing it.

[-]Steven Byrnes3y20

IMO neither Evan nor Scott nor anyone else has offered a plausible plan for using a STEM AI (that knows nothing about the existence of humans) to solve the big problem that someone else is going to build an unboxed non-STEM AI the next year.

[-]Noosphere893y10

The key is that my sandbox (really Daviddad's sandbox) requires very little or no performance loss, so even selfish actors would sandbox their AIs.

[-]Yair Halberstadt3y42

Until you want to use the AGI to e.g. improve medicine...

[-]Donald Hobson3y20

But all of these bits are useless for breaking the sandbox, since again they're random.

This isn't true in principle. Suppose you had floating point numbers, you could add, multiply and compare them, but you weren't sure how they were represented internally. When you see a cosmic ray bitflip, you learn that only one bit needs to be flipped to produce these results. This is technically information. In practice not much info. But some.

[-]Steven Byrnes3y20

no noticeable performance penalty

I think I know what you mean, but we should be crystal-clear that if

Team A has an AI that understands humans and the human world and human technology and the human economy and human language,
Team B has an AI that doesn’t know anything about any of those things

…then Team B’s AI is suffering a “performance penalty” in the trivial sense that there are many tasks which Team A’s AI can perform and Team B’s AI can’t. Agree?

Then there are a few schools of thought:

“Team B’s AI is so useless as to be completely irrelevant—we can just ignore it in our discussions.”
“Team B’s AI can do important real-world things of the “pivotal act” variety.
“Team B’s AI is not directly useful, but can be used for safely testing possible solutions to the Alignment Problem.”

I would mostly agree with the first bullet point, with a bit of the third—see here. I’m very skeptical of the second one. Proof assistants are neat and all, for example, but I don’t see them changing the world very much. There’s no proposed solution to the alignment problem that is stuck on an unproven mathematical conjecture, IMO. So what do we do with the boxed AI?

[-]Noosphere893y10

I think human technology is a reasonably safe area, similar to how Go is safe even at arbitrary capability levels, and we might be able to get away with the human economy, but the other areas are likely too dangerous to work in.

My big optimism that while I think politics and social areas aren't very safe areas, I do believe that there actually safe areas that are crucially non-political/not overly reliant on human models, and I think technology/STEM development is exactly this.

Also, it allows us to iterate on the problem of AI Alignment by making sure we can test it's alignment safely.

[-]Steven Byrnes3y20

Davidad proposes “we are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon where atoms obey the Bohr model.” You write “we might be able to get away with [the AI knowing about] the human economy”. These seem very contradictory to me, right? The human economy ≠ the organosilicon octopus mind economy, right? For example, the chemical industry would look rather different if atoms obeyed the Bohr model. The clothing industry would look different if we lived underwater. Etc.

Normally when people say something like “technology/STEM development” in the context of AGI, I would think of something like “come up with a way to make faster CPUs”. The AI would need to know how current fabs work in great detail, and what other technology and tools are easily available to use to attack the problem, and how compilers works, presumably with code samples and so on, and also an English-language description of what would constitute a functional CPU (ideally with English-language back-and-forth discussion to navigate the trade-space). For real AGI, I would expect it to be able to also design the new advanced CPU factories, purchase the land, negotiate with suppliers, apply for permits, etc. etc.

OK, so how much of that can be distilled into tasks which would be exactly the same in an alternate universe with different laws of physics? I think a few extremely narrow slices at best—for example maybe some narrow-AI optimization problems along the lines of this. But meanwhile an unboxed AGI can do the entire thing.

So you really can’t say “no performance penalty”, right?

[-]Noosphere893y30

So you really can’t say “no performance penalty”, right?

Yeah, that might be a big concern with my approach, but it's still worth trying to get empirical data on how much we can get away with a boxed AI model.

[-]tailcalled3y20

Sandboxing an AI is trivial; you can just not run it.

[-]Noosphere893y2-2

My claim is roughly the following:

Removing non-deterministic APIs, as well as making it fully deterministic is possible with very little or zero performance penalty, thus sandboxing the AI.
The bits that remain accessible to the AI aren't sufficient to break the sandbox, because they're totally random, and the No Free Lunch theorem prohibits learning on entirely random data.
Thus, sandboxing arbitrarily powerful AI is possible, or equivalently there is a positive answer to whether we can prevent the AI from breaking the box, no matter how powerful.

That's a non-trivial claim, since the AI is still running here.

[-]tailcalled3y20

It's a non-trivial way to sandbox it, but I am not sure what value that gives us which the trivial way of sandboxing doesn't give us.

[-]Noosphere893y10

The basic innovation is that AI boxing can be done cheaply, and with no performance penalty. So sandboxing can be done surprisingly easily, and AI companies can prevent AI from knowing dangerous facts as part of their business. This is huge, since it allows to much more safely iterate on the alignment problem by always sandboxing AI.

Similarly, it benefits narrow AI approaches as well, as we can avoid exposing too much dangerous information to AI.

As a concrete example, Jacob's alignment plan requires sandboxing in order to work, and I showed that such sandboxing is easy to do, helping Jacob's alignment plan.

Here's the links:

https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need

https://www.lesswrong.com/posts/JPHeENwRyXn9YFmXc/empowerment-is-almost-all-we-need

Finally, AI alignment is in a much less precarious position if AI simply can't break out of the box, since it can leverage the power of iteration used in science to solve hard problems. So long as AI is aligned in the sandbox, than the consequences are much, much more safe than one free on the internet.

Another plan that is helped by sandboxes is OpenAI's alignment plan. Another link can be shown for OpenAI's alignment plan:

https://openai.com/blog/our-approach-to-alignment-research/

[-]Chris_Leong3y10

I think the key problem is that this sandboxing won’t work for anything with a large language model as a component.

[-]Noosphere893y10

Not really? Why could this actually happen, conditional on us filtering out certain information, the real question is how the large language model can break the sandbox such that it can learn other facts?

[-]Chris_Leong3y23

Filtering out all the implicit information would be really, really hard.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

8

I've updated towards AI boxing being surprisingly easy

8

8

Removing nondeterminism

A No Free Lunch argument for why the remaining bits from random errors are incapable, even in theory, of breaking the sandbox/Cartesian Boundary

Appendix