TL;DR

I think AGI architectures will be quirky and diverse. Consequently, there are many possible futures. My thoughts on the AI alignment problem are far more optimistic than (my best measure of) mainstream opinions among AI safety researchers (insofar as "AI safety" can be considered mainstream).

Disclaimer

In Attempted Gears Analysis of AGI Intervention Discussion With Eliezer, Zvi attempts to explain Eliezer's perspective on AGI from a conversation here as explained by Rob Bensinger. My post here is the 5th node in a game of telephone. The opinions I quote in this post should not be considered Eliezer Yudkowsky's or Rob Bensinger's. I'm just using Zvi's post as a reference point from which I can explain how my own views diverge.

Individual Contentions

  1. Nate rather than Eliezer, but offered as a preface with p~0.85: AGI is probably coming within 50 years. Rob notes that Eliezer may or may not agree with this timeline, and that it shortens if you condition on ‘unaligned’ and lengthens conditional on ‘aligned.’

I don't know whether AGI is coming within 50 years. Zvi says he would "at least buy 30% and sell 80%, or something like that." I would at least buy 2% and sell 90% except, as Zvi noted, "betting money really, really doesn’t work here, at all". (I'm ignoring the fact that "copying Zvi" is a good strategy for winning bets.)

My exact probabilities depend on how we define AGI. However, as we'll get to in the subsequent points, it doesn't really matter what the exact probability is. Even a small chance of AGI is of tremendous importance. My uncertainty about AGI is high enough that there is at least a small chance of AGI being built in the next 40 years.

  2. By default this AGI will come from something similar to some of today’s ML paradigms. Think enormous inscrutable floating-point vectors.

I disagree, for reasons I get into under #7 below, but it depends on what we mean by "today's ML paradigms".

  3. AGI that isn’t aligned ends the world.

If we define AGI as "world optimizer" then yes, definitely. But I can imagine a couple different kinds of superintelligences that aren't world optimizers (along with a few that naturally trend toward world optimizing). If you built a superintelligent machine that isn't a world optimizer then it need not necessarily end the world.

For example, MuZero separates value from policy from reward. If you built just the value network and cranked it up to superintelligence then you would have a superintelligence that is not a world optimizer.
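To make that concrete, here is a minimal sketch of the kind of separation I mean. This is illustrative PyTorch-style code of my own, not MuZero's actual implementation; the class names and sizes are placeholders.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Scores how good a state is. By itself it optimizes nothing."""
    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class PolicyNet(nn.Module):
    """Chooses actions. This is the component that acts on the world."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state), dim=-1)

# Scale up only the evaluator and you get something that maps states to
# scores but never selects or executes actions: an oracle, not an agent.
value_only = ValueNet(state_dim=64)
score = value_only(torch.randn(1, 64))
```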

  4. AGI that isn’t aligned carefully and on purpose isn’t aligned, period.

If we define AGI as "world optimizer" then yes, definitely. Otherwise, see #3 above.

  5. It may be possible to align an AGI carefully and have it not end the world.

Absolutely.

  6. Right now we don’t know how to do it at all, but in theory we might learn.

I don't think we know how to build an AGI. Aligned AGIs are a strict subset of "AGI". Therefore we don't know how to build an aligned AGI.

I think certain AGI architectures naturally lend themselves to alignment. For example, I predict that superintelligences will have to rely on error-entropy rather than pure error. Once you are using error-entropy instead of just error, you can increase transparency by increasing the value of simplicity relative to error. Forcing the AI to use simple models provides a powerful safety mechanism against misalignment.
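Here is a minimal sketch of the dial I have in mind. The function name, the complexity measure, and the exchange rate are all placeholders of my own; the point is only that simplicity enters the objective as an explicit, tunable term.

```python
import numpy as np

def error_entropy_loss(errors: np.ndarray,
                       model_description_bits: float,
                       simplicity_weight: float) -> float:
    """Toy 'error-entropy' objective: prediction error plus a complexity
    penalty. Turning up simplicity_weight is the transparency/safety dial."""
    error_term = float(np.mean(errors ** 2))
    complexity_term = simplicity_weight * model_description_bits
    return error_term + complexity_term

# The same predictive accuracy is tolerated less as the dial goes up, so
# the search is pushed toward models humans have some hope of reading.
errors = np.array([0.1, -0.3, 0.2])
for dial in (0.0, 0.01, 1.0):
    print(dial, error_entropy_loss(errors, 1e4, dial))
```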

You can get additional safety (practically, not theoretically) by designing the AI with a functional paradigm so that it's (theoretically) stateless. Functional paradigms are the natural way to program an AI anyway, since an AGI will require lots of compute, lots of compute requires scaling, and stateless systems scale better than stateful systems.
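A toy illustration of the stateless point (everything here is hypothetical): if acting is a pure function of parameters and input, any replica can serve any request, which is what makes horizontal scaling cheap.

```python
from typing import Mapping
import numpy as np

def act(params: Mapping[str, np.ndarray], observation: np.ndarray) -> np.ndarray:
    """Pure function: same params + same observation -> same action.
    No hidden mutable state, so replicas can be added or removed freely."""
    hidden = np.tanh(observation @ params["w1"])
    return hidden @ params["w2"]

# Any number of identical workers can serve `act` behind a load balancer;
# whatever state exists lives outside the function, e.g. in the request.
params = {"w1": np.zeros((8, 4)), "w2": np.zeros((4, 2))}
action = act(params, np.ones(8))
```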

I can imagine alternative architectures which are extremely difficult to control. I think that AGIs will be quirky. (See #35 below.) Different architectures will require very different mechanisms to align them. How you aim a laser is different from how you aim a rocket is different from how you aim a plane is different from how you aim a car.

  7. The default situation is an AGI system arises that can be made more powerful by adding more compute, and there’s an extended period where it’s not aligned yet and if you add too much compute the world ends, but it’s possible that if you had enough time to work on it and no one did that, you’d have a shot.

I think that the current systems in use (like GPT) will hit a computational wall where there isn't enough data and compute on planet Earth to neutralize the hypothesis space entropy. I don't doubt that current neural networks can be made more powerful by adding more compute, but I predict that such an approach will not get us to AGI.
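To spell out the shape of the argument (this is my own back-of-envelope framing, with deliberately loose constants): if each training example carries at most $b$ bits of evidence about which hypothesis is correct, then pinning down a hypothesis class with entropy $H(\mathcal{H})$ requires roughly

$$N_{\text{examples}} \gtrsim \frac{H(\mathcal{H})}{b}.$$

If the hypothesis space behind general intelligence is vastly larger than anything an internet-scale corpus can pin down, then scaling the current recipe stalls before it gets there.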

  8. More specifically, when combined with other parts of the model detailed later: “I think we’re going to be staring down the gun of a completely inscrutable model that would kill us all if turned up further, with no idea how to read what goes on inside its head, and no way to train it on humanly scrutable and safe and humanly-labelable domains in a way that seems like it would align the superintelligent version, while standing on top of a whole bunch of papers about “small problems” that never got past “small problems”.”

Kind of? I think AGIs will be quirky. With some architectures we will end up "staring down the gun of a completely inscrutable model that would kill us all if turned up further, with no idea how to read what goes on inside its head, and no way to train it on humanly scrutable and safe and humanly-labelable domains in a way that seems like it would align the superintelligent version" but I think others will be tricky to turn into world optimizers at all.

Ironically, my favorite unalignable candidate AGI architecture is based on the human brain.

  9. If we don’t learn how to align an AGI via safety research, nothing else can save us period.

Since AGIs are quirky, I think we need to learn about each architecture's individual quirks by playing around with them. Does this qualify as "safety research"? It depends what we mean by "safety research".

  10. Thus, all scenarios where we win are based on a technical surprising positive development of unknown shape, and all plans worth having should assume such a surprising positive development is possible in technical space. In the post this is called a ‘miracle’ but this has misleading associations – it was not meant to imply a negligible probability, only surprise, so Rob suggested changing it to ‘surprising positive development.’ Which is less poetic and longer, but I see the problem.

I see lots of possible ways AI could go, and I have little confidence in any of them. When the future is uncertain, all futures are surprising. I cannot imagine a future that wouldn't surprise me.

Since I'm optimistic about AI inner alignment, positive surprises are not a prerequisite to human survival in my model of the world. I am far more concerned about bad actor risk. I'm not worried that an AI will take over the world by accident. I'm worried that an AI will take over the world because someone deliberately told it to.

  11. Eliezer does know a lot of ways not to align an AGI, which is helpful (e.g. Edison knew a lot of ways not to build a light bulb) but also isn’t good news.

I agree. There are lots of ways to not invent a light bulb.

  12. Carefully aligning an AGI would at best be slow and difficult, requiring years of work, even if we did know how.

Aligning an AGI implies building an AGI. Building an AGI would at best be slow and difficult, requiring years of work, and we're not sure how it can be done.

  13. Before you could hope to finish carefully aligning an AGI, someone else with access to the code could use that code to end the world. Rob clarifies that good info security still matters and can meaningfully buy you time, and suggests this: “By default (absent strong op-sec and research closure), you should expect that before you can finish carefully aligning an AGI, someone else with access to the code could use that code to end the world. Likewise, by default (absent research closure and a large technical edge), you should expect that other projects will independently figure out how to build AGI shortly after you do.”

I'm not sure. On the one hand, it'd be trivial for a major intelligence agency to steal the code from whatever target they want. On the other hand, scaling up an AGI constitutes a gamble of significant capital. I think the limiting factor keeping someone from scaling up stolen code is confidence about whose code to steal. Stealing code from AGI startups to build your own AGI is hard for exactly the same reasons investing in startups is hard.

  14. There are a few players who we might expect to choose not to end the world like Deepmind or Anthropic, but only a few. There are many actors, each of whom might or might not end the world in such a spot (e.g. home hobbyists or intelligence agencies or Facebook AI research), and it only takes one of them.

I imagine that there are some AGI architectures which are aligned if you get them right and break if you get them wrong. In other words, failure to align these particular architectures results in failure to build a system that does anything interesting at all. Since I can imagine easily-aligned AGIs, I'm not worried about actors ending the world by accident. I'm worried about actors taking over the world for selfish purposes.

  15. Keeping the code and insights involved secret and secure over an extended period is a level of social technology no ML group is close to having. I read the text as making the stronger claim that we lack the social technology for groups of sufficient size to keep this magnitude of secret for the required length of time, even with best known practices.

I've got some ideas about how to do this, but it's not social technology. It's boring technology for managing engineers. Basically, you write a bunch of Lisp macros that dynamically assemble data feeds and snippets of code written by your quants. The primary purpose of this structure is to scale away the Mythical Man-Month. Infosec is just a side effect. (I'm being deliberately vague here since I think this might make a good technical foundation for a hedge fund.) I have already used techniques like these to solve an important, previously unsolved, real-world machine learning problem.
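To gesture at the general shape without giving away the specifics I'm deliberately withholding, here is an entirely hypothetical illustration in Python rather than the Lisp machinery I'm alluding to: each contributor writes a small, isolated snippet against a narrow interface, and the framework assembles and scores the pieces, so nobody needs to see the whole pipeline.

```python
from typing import Callable, Dict
import numpy as np

# Hypothetical registry: contributors register isolated "signal" functions
# and never see each other's code or the composition logic.
SIGNALS: Dict[str, Callable[[np.ndarray], float]] = {}

def signal(name: str):
    def register(fn: Callable[[np.ndarray], float]):
        SIGNALS[name] = fn
        return fn
    return register

@signal("momentum")
def momentum(prices: np.ndarray) -> float:
    return float(np.sign(prices[-1] - prices[0]))

def assemble(prices: np.ndarray) -> float:
    """Framework-side composition. It also yields an objective scoreboard
    of whose snippets are pulling their weight."""
    return float(np.mean([fn(prices) for fn in SIGNALS.values()]))

print(assemble(np.array([1.0, 1.1, 1.2])))
```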

  16. Trying to convince the folks that would otherwise destroy the world that their actions would destroy the world isn’t impossible on some margins, so in theory some progress could be made, and some time could be bought, but not enough to buy enough time.

I do not disagree.

  17. Most reactions to such problems by such folks, once their attention is drawn to them, would make things worse rather than better. Tread carefully or not at all, and trying to get the public into an uproar seems worse than useless.

I agree. To quote Agent K in Men in Black, "A person is smart. People are dumb, panicky, dangerous animals, and you know it."

  18. Trying to convince various projects to become more closed rather than open is possible, and (as per Rob) a very good idea if you would actually succeed, but insufficient.

  19. Trying to convince various projects to join together in the endgame, if we were to get to one, is possible, but also insufficient and (as per Rob) matters much less than becoming more closed now.

I think open dialogue is a public good. Convincing various projects to become closed has massive negative effects.

The most important thing in technical management is an objective measure of who is making progress. Closing projects obscures who is making progress. In the absence of visible metrics, leadership and resources go to skilled politicians instead of the best scientists.

  20. Closed and trustworthy projects are the key to potentially making technical progress in a safe and useful way. There needs to be a small group that can work on a project and that wouldn’t publish the resulting research or share its findings automatically with a broader organization, via sufficiently robust subpartitions.

Subpartitioning a hedge fund by managing quants via a library of Lisp macros does this automatically. It's scalable and self-funding. There's even an objective metric for who is accomplishing useful progress. See #15 above.

Quantitative finance is an especially good place for testing theories of AI alignment because AI alignment is fundamentally a question of extrapolation beyond your data sample and because the hard part of quantitative finance is the same thing.

  21. Anthropic in particular doesn’t seem open to alternative research approaches and mostly wants to apply More Dakka, and doesn’t seem open to sufficiently robust subpartitions, but those could both change.

I don't know anything about Anthropic.

  22. Deepmind in particular is a promising potential partner if they could form the required sufficiently robust subpartitions, even if Demis must be in the loop.

I don't know much about Deepmind, but my impression of their work is they're focused on big data. I predict they'll need to try a different strategy if they're to build a superintelligence powerful enough to pose an existential risk. See #7 above.

  23. OpenAI as a concept (rather than the organization with that name), is a maximally bad concept almost designed to make the playing field as unwinnable as possible, details available elsewhere. Of course, the organization itself could change (with or without being renamed to ClosedAI).

I think OpenAI poses no more of an existential risk than Deepmind. (See #7 above.) I like the GPT-3 playground. It's lots of fun.

  24. More generally, publishing findings burns the common resource ‘time until AGI’ and the more detail you publish about your findings along {quiet internal result -> announced and demonstrated result -> paper describing how to get the announced result -> code for the result -> model for the result} the more of it you burn, but the more money and prestige the researchers get for doing that.

I think that publishing demonstrated results without the implementation details is a great way to separate the real visionary experts from the blowhards while burning minimal runway. It also contributes to accurate timelines.

  25. One thing that would be a big win would be actual social and corporate support for subpartitioned projects that didn’t publish their findings, where it didn’t cost lots of social and weirdness points for the researchers, thus allowing researchers to avoid burning the commons.

I think it'd be cool to start a hedge fund that does this. See #20 above. I've shared the basic ideas with a few quants and they had positive things to say.

  26. Redwood Research (RR) is a new research organization that’s going to try and do alignment experiments on toy problems to learn things, in ways people like Eliezer think are useful and valuable and that they wish someone would do. Description not directly from Eliezer but in context seems safe to assume he roughly agrees.

I don't know anything about Redwood Research.

  27. Previously (see Hanson/Eliezer FOOM debate) Eliezer thought you’d need recursive self-improvement first to get fast capability gain, and now it looks like you can get fast capability gain without it, for meaningful levels of fast. This makes ‘hanging out’ at interesting levels of AGI capability at least possible, since it wouldn’t automatically keep going right away.

I went through a similar thought process myself. When I was first trying to come up with AGI designs I thought recursive self-improvement was the way to go too. I no longer believe that's necessary. I agree that "it looks like you can get fast capability gain without it, for meaningful levels of fast".

  28. An AGI that was above humans in all respects would doubtless FOOM anyway, but if ahead in only some it might not.

Yes. Definitely.

  29. Trying to set special case logic to tell AGIs to believe false generalizations with a lot of relevance to mapping or steering the world won’t work, they’d notice and fix it.

Trying to tell superintelligent AGIs to believe false generalizations is (mostly) idiotic, but I wouldn't go so far as to say it's never a good idea. Evolution gets people to believe false generalizations.

  30. Manipulating humans is a convergent instrumental strategy.

Yes. Definitely.

  31. Hiding what you are doing is a convergent instrumental strategy.

Basically yes, but deceptive behavior is more complicated than transparent behavior. A simplicity dial guards against complicated behavior. See #6 above.

  32. Eliezer expects that when people are trying to stomp out convergent instrumental strategies by training at a safe dumb level of intelligence, this will not be effective at preventing convergent instrumental strategies at smart levels of intelligence.

Sure. I guess. Maybe. No argument here.

  33. You have to train in safe domains because if you train in unsafe domains you die, but the solutions you find in safe domains won’t work in unsafe domains.

The simplest way to train safely is to just keep compute resources low. Otherwise, this argument cruxes on whether an AGI is necessarily a world optimizer. See #3 above.

  34. Attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes.

You don't teach corrigibility. You build in simplicity as a fundamental principle of how the thing works. See #6 above.

  35. Explanation of above part 1: Higher levels of intelligence involve qualitatively new thought processes and things being way out of training distribution.

I will make an even stronger claim. Different architectures have wildly divergent behavior outside of their training distributions. That's why they're quirky.

  36. Explanation of above part 2: Corrigibility is ‘anti-natural’ in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off.

It's hard to impose corrigibility post-hoc onto a world optimizer. Corrigibility is natural when simplicity (see #6) is one of the two variables in the error-entropy function you're maximizing.

You can get corrigibility by providing a switch the computer can activate for maximum reward by escaping its sandbox, providing an escape hatch you think is just beyond the AI's abilities, and then turning up the allowed complexity. I understand this approach has theoretical problems. I can't prove it will work, but I predict it'll be a practical solution to real-world situations.
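Sketched as a loop (pure pseudocode of my own; every name is made up): put a maximum-reward shutdown switch on the far side of an escape hatch you believe is just beyond the AI's current abilities, then raise the allowed complexity one notch at a time and treat any press of the switch as a tripwire.

```python
def train_with_escape_hatch(train_step, switch_pressed, complexity_schedule):
    """Hypothetical tripwire loop. train_step(budget) trains the system under
    a complexity budget; switch_pressed() reports whether the maximum-reward
    switch outside the sandbox has been activated."""
    for budget in complexity_schedule:       # turn the dial up slowly
        train_step(budget)
        if switch_pressed():                 # it got out of the sandbox
            return ("halt", budget)          # stop here and study what happened
    return ("completed", complexity_schedule[-1])
```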

  37. Trying to hardcode nonsensical assumptions or arbitrary rules into an AGI will fail because a sufficiently advanced AGI will notice that they are damage and route around them or fix them (paraphrase).

Yep.

  38. You only get one shot, because the first miss kills you, and your chances of pulling many of these things off on the first try is basically zero, unless (Rob suggests this) you can basically ‘read off’ what the AI is thinking. Nothing like this that involves black boxes ever works the first time. Alignment is hard largely because of ‘you only get one shot.’

For world optimizers, yes. The solution is to build several world non-optimizer superintelligences before you build a world optimizer.

  39. Nothing we can do with a safe-by-default AI like GPT-3 would be powerful enough to save the world (to ‘commit a pivotal act’), although it might be fun. In order to use an AI to save the world it needs to be powerful enough that you need to trust its alignment, which doesn’t solve your problem.

I have no strong feelings about this claim.

  40. Nanosystems are definitely possible, if you doubt that read Drexler’s Nanosystems and perhaps Engines of Creation and think about physics. They’re a core thing one could and should ask an AI/AGI to build for you in order to accomplish the things you want to accomplish.

Not important. An AGI could easily take over the world with just computer hacking, social engineering and bribery. Nanosystems are not necessary.

  41. No existing suggestion for “Scalable Oversight” seems to solve any of the hard problems involved in creating trustworthy systems.

I do not dispute this statement.

  42. An AGI would be able to argue for/’prove’ arbitrary statements to the satisfaction of humans, including falsehoods.

Not important. The important thing is that an AGI could manipulate people. I predict it would do so in cruder ways that feel like cheating.

  43. Furthermore, an unaligned AGI powerful enough to commit pivotal acts should be assumed to be able to hack any human foolish enough to interact with it via a text channel.

Yes. Definitely.

  44. The speedup step in “iterated amplification and distillation” will introduce places where the fast distilled outputs of slow sequences are not true to the original slow sequences, because gradient descent is not perfect and won’t be perfect and it’s not clear we’ll get any paradigm besides gradient descent for doing a step like that.

I have no horse in this race.

  45. The safety community currently is mostly bouncing off the hard problems and are spending most of their time working on safe, easy, predictable things that guarantee they’ll be able to publish a paper at the end. Actually-useful alignment research will tend to be risky and unpredictable, since it’s advancing the frontier of our knowledge in a domain where we have very little already-accumulated knowledge.

  46. Almost all other work is either fully useless, almost entirely predictable, or both.

The AI safety community has produced few ideas that I find relevant to my work in machine learning. My favorite researcher on foundational ideas related to the physical mechanisms of general intelligence is the neuroscientist Selen Atasoy who (to my knowledge) has no connection to AI at all.

  47. Paul Christiano is trying to have real foundational ideas, and they’re all wrong, but he’s one of the few people trying to have foundational ideas at all; if we had another 10 of him, something might go right.

  48. Chris Olah is going to get far too little done far too late but at least is trying to do things on a path to doing anything at all.

  49. Stuart Armstrong did some good work on further formalizing the shutdown problem, an example case in point of why corrigibility is hard, which so far as I know is still resisting all attempts at solution.

  50. Various people who work or worked for MIRI came up with some actually-useful notions here and there, like Jessica Taylor’s expected utility quantilization.

I have no horse in any of the above races.

  51. We need much, much more rapid meaningful progress than this to have any chance, and it’s not obvious how to do that, or how to use money usefully. Money by default produces more low-quality work, and low-quality work slash solving small problems rather than the hard problems isn’t quite useless but it’s not going to get us where we need to go.

I agree that it is easy to spend lots of money without accomplishing much technical progress, especially in domains where disruptive[1] ideas are necessary. I think that disruptive ideas are necessary to build a superintelligence at all—not just an aligned one.

Incidentally, closing projects to outside scrutiny creates conditions for even lower-quality work. See #18-19 above.

  52. The AGI approaches that matter are the ones that scale, so they probably look less like GPT-2 and more like Alpha Zero, AlphaFold 2 or in particular Mu Zero.

I don't understand why GPT-2 doesn't scale. I assumed scaling GPT was what got OpenAI to GPT-3.

Perhaps the claim refers to how the Zeros generated their own training data? I do agree that some plausible AGI architectures generate their own training data. (The human brain certainly appears to do so in REM sleep.)
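To sketch what I mean by generating your own training data (illustrative Python of my own, not any lab's actual pipeline): the model's own games, or its own "dreams," become the next round's training set, with no external corpus in the loop.

```python
def self_play_round(policy, simulate_game, n_games: int = 100):
    """Hypothetical self-play data generation: the current policy plays
    itself, and its games become the next training batch."""
    return [simulate_game(policy, policy) for _ in range(n_games)]

def training_loop(policy, simulate_game, update, n_rounds: int = 10):
    for _ in range(n_rounds):
        batch = self_play_round(policy, simulate_game)  # no external dataset
        policy = update(policy, batch)                  # learn from its own games
    return policy
```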

  53. Proving theorems about the AGI doesn’t seem practical. Even if we somehow managed to get structures far more legible than giant vectors of floats, using some AI paradigm very different from the current one, it still seems like huge key pillars of the system would rely on non-fully-formal reasoning.

I agree.

  54. Zvi infers this from the text, rather than it being text directly, and it’s possible it’s due to conflating things together and wasn’t intended: A system that is mathematically understood and you can prove lots of stuff about it is not on the table at this point. Agent Foundations is a failure. Everything in that direction is a failure.

I'm not familiar with "Agent Foundations". The sentiment above suggests my ignorance was well-directed.

  55. Even if you could prove what the utility function was, getting it to actually represent a human-aligned thing when it counts still seems super hard even if it doesn’t involve a giant inscrutable vector of floats, and it probably does involve that.

Explaining my thoughts on this claim would take many words. I'm just going to skip it.

  56. Eliezer agrees that it seems plausible that the good cognitive operations we want do not in principle require performing bad cognitive operations; the trouble, from his perspective, is that generalizing structures that do lots of good cognitive operations will automatically produce bad cognitive operations, especially when we dump more compute into them; “you can’t bring the coffee if you’re dead”. No known way to pull this off.

The practical limitation to the orthogonality thesis is that some machines are easier to build than others. This is related to quirkiness, since different AGI architectures are better at different things.

  57. Proofs mostly miss the point. Prove whatever you like about that Tensorflow problem; it will make no difference to whether the AI kills you. The properties that can be proven just aren’t related to safety, no matter how many times you prove an error bound on the floating-point multiplications. It wasn’t floating-point error that was going to kill you in the first place.

I agree.

Zvi's thoughts

I notice my [Zvi's] inside view, while not confident in this, continues to not expect current methods to be sufficient for AGI, and expects the final form to be more different than I understand Eliezer/MIRI to think it is going to be, and that the AGI problem (not counting alignment, where I think we largely agree on difficulty) is ‘harder’ than Eliezer/MIRI think it is.

I also think civilization’s dysfunctions are more likely to disrupt things and make it increasingly difficult to do anything at all, or anything new/difficult, and also collapse or other disasters.

I am basically in agreement with Zvi here. It makes me optimistic. Civilizational dysfunction is fertile soil for bold action.


  1. I'm using "disruptive" in Clayton Christensen's sense of the word. ↩︎

Comments

I'm really glad you wrote this!! I already knew you were way more optimistic than me about AGI accident risk being low, and have been eager to hear where you're coming from.

Here are some points of disagreement…

If we define AGI as "world optimizer" then yes, definitely. But I can imagine a couple different kinds of superintelligences that aren't world optimizers (along with a few that naturally trend toward world optimizing). If you built a superintelligent machine that isn't a world optimizer then it need not necessarily end the world.

For example, MuZero separates value from policy from reward. If you built just the value network and cranked it up to superintelligence then you would have a superintelligence that is not a world optimizer.

See my discussion of so-called "RL-on-thoughts" here. Basically, I argue that if we want the AGI to be able to find / invent really new useful ideas that solve particular problems, it needs to explore the space-of-all-possible-ideas with purpose, because the space-of-all-possible-ideas is just way too big to explore it in any other way. To explore the space-of-all-possible-ideas with purpose, you need a closed-loop policy+value consequentialist thing, and a closed-loop policy+value consequentialist thing is a world optimizer by default, absent a solution to the alignment problem.

I don't know if Eliezer or Nate would endorse my "RL-on-thoughts" discussion, but my hunch is that they would, or at least something in that general vicinity, and that this underlies some of the things they said recently, including the belief that MuZero is on a path to AGI in a way that GPT-3 isn't.

Forcing the AI to use simple models provides a powerful safety mechanism against misalignment.

I think that's an overstatement. Let's say we have a dial / hyperparameter for "how simple the model must be" (or what's the numerical exchange rate between simplicity vs reward / loss / whatever). There are some possible dial settings where the model is simple enough for us to understand, simple enough to not come up with deceptive strategies, etc. There are also some possible dial settings where the model is powerful enough to "be a real-deal AGI" that can build new knowledge, advance AI alignment research, invent weird nanotechnology, etc.

The question is, do those ranges of dial settings overlap? If yes, it's a "powerful safety mechanism against misalignment". If no, it's maybe slightly helpful on the margin, or it's mostly akin to saying "not building AGI at all is a powerful safety mechanism against misaligned AGIs". :-P

So what's the answer? Do the ranges overlap or not? I think it's hard to say for sure. My strong hunch is "no they don't overlap".

You can get corrigibility by providing a switch the computer can activate for maximum reward by escaping its sandbox, providing an escape hatch you think is just beyond the AI's abilities, and then turning up the allowed complexity. I understand this approach has theoretical problems. I can't prove it will work, but I predict it'll be a practical solution to real-world situations.

I think this presupposes that the AI is "trying" to maximize future reward, i.e. it presupposes a solution to inner alignment. Just as humans are not all hedonists, likewise AGIs are not all explicitly trying to maximize future rewards. I wrote about that (poorly) here; a pedagogically-improved version is forthcoming.

I am far more concerned about outer alignment. I'm not worried that an AI will take over the world by accident. I'm worried that an AI will take over the world because someone deliberately told it to.

This is a bit of a nitpick, but I think standard terminology would be to call this "bad actor risks". (Or perhaps "coordination problems", depending on the underlying story.) I've only heard "outer alignment" used to mean "the AI is not doing what its programmer wants it to do, because of poor choice of objective function" (or similar)—i.e., outer alignment issues are a strict subset of accident risk. This diagram is my take (from a forthcoming post):

Thank you for the quality feedback. As you know, I have a high opinion of your work.

I have replaced "outer alignment" with "bad actor risk". Thank you for the correction.

Nanosystems are definitely possible, if you doubt that read Drexler’s Nanosystems and perhaps Engines of Creation and think about physics. They’re a core thing one could and should ask an AI/AGI to build for you in order to accomplish the things you want to accomplish.

Not important. An AGI could easily take over the world with just computer hacking, social engineering and bribery. Nanosystems are not necessary.

This is actually a really important distinction!

Consider three levels of AGI:

  1. basically as smart as a single human
  2. capable of taking over/destroying the entire world
  3. capable of escaping from a box by sending me plans for a self-replicating nano machine

I think it's pretty clear that 1 < 2 < 3.

Now, if you're building AGI via recursive self-improvement, maybe it just Fooms straight from 1. to 3. But if there is no Foom (say, because AGI is hardware limited), then there's a chance to solve the alignment problem between 1. and 2. But also between 2. and 3., since 2. can plausibly be boxed even if, when unboxed, it destroys the world.

The way I look at things, an AGI fooms straight from 1 to 2. At that point it has subdued all competing intelligences and can take its time getting to 3. I don't think 2 can plausibly be boxed.

You don't think the simplest AI capable of taking over the world can be boxed?

What if I build an AI and the only 2 things it is trained to do are:

  1. pick stocks
  2. design nuclear weapons 

Is your belief that: a) this AI would not allow me to take over the world, or b) this AI could not be boxed?

Designing nuclear weapons isn't any use. The limiting factor in manufacturing nuclear weapons is uranium and industrial capacity, not technical know-how. That (I presume) is why Eliezer cares about nanobots. Self-replicating nanobots can plausibly create a greater power differential at a lower physical capital investment.

Do I think that the simplest AI capable of taking over the world (for practical purposes) can't be boxed if it doesn't want to be boxed? I'm not sure. I think that is a slightly different question from whether an AI fooms straight from 1 to 2. I think there are many different powerful AI designs. I predict some of them can be boxed. Also, I don't know how good you are at taking over the world. Some people need to inherit an empire. Around 1200, one guy did it with like a single horse.

The 1940's would like to remind you that one does not need nanobots to refine uranium.

I'm pretty sure if I had $1 trillion and a functional design for a nuclear ICBM I could work out how to take over the world without any further help from the AI.

If you agree that:

  1. it is possible to build a boxed AI that allows you to take over the world
  2. taking over the world is a pivotal act

then maybe we should just do that instead of building a much more dangerous AI that designs nanobots and unboxes itself? (Assuming, of course, you accept Yudkowsky's "pivotal-act" framework.)

The 1940's would like to remind you that one does not need nanobots to refine uranium.

I'm confused. Nobody has ever used nanobots to refine uranium.

I'm pretty sure if I had $1 trillion and a functional design for a nuclear ICBM I could work out how to take over the world without any further help from the AI.

Really? How would you do it? The Supreme Leader of North Korea has basically those resources and has utterly failed to conquer South Korea, much less the whole world. Israel and Iran are in similar situations and they're mere regional powers.