(Crossposted from my blog Centerless Set)

It’s hard to come up with a workably short function that includes human civilization in its global maximum/minimum.

This is a problem because we specify our goals to AI using functions. For any practical function we might use, there’s a set of strange and undesirable worlds that satisfy that function better than our world.

For example, if our function measures the probability that some particular glass is filled with water, the space near the maximum is full of worlds like “take over the galaxy and find the location least likely to be affected by astronomical phenomena, then build a megastructure around the glass designed to keep it full of water”.
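To make the framing concrete, here’s a toy sketch of what “specify the goal as a function and maximize it” means. Everything in it is invented for illustration: the p_glass_full scores are made up and a real system wouldn’t look like this, but it shows how the argmax lands on the extreme plan even when the ordinary plan scores almost as well.

```python
# A toy sketch of "specify the goal as a function and maximize it".
# p_glass_full is a hypothetical stand-in for "probability the glass
# stays full of water under this plan"; the numbers are made up.

def p_glass_full(plan: str) -> float:
    scores = {
        "pour water from the pitcher": 0.95,
        "pour water and post a guard by the glass": 0.97,
        "seal the glass inside a self-repairing megastructure": 0.999999,
    }
    return scores[plan]

candidate_plans = [
    "pour water from the pitcher",
    "pour water and post a guard by the glass",
    "seal the glass inside a self-repairing megastructure",
]

# A capable enough optimizer picks the argmax, and the argmax is the
# extreme plan, even though the ordinary plan was almost as good.
best_plan = max(candidate_plans, key=p_glass_full)
print(best_plan)  # -> "seal the glass inside a self-repairing megastructure"
```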

On top of this, what the AI would actually do if we trained it on that function is hard to predict, because it would learn whatever values happened to be instrumentally useful on the training data, and those values won’t necessarily carry over to new environments. The result would be strange, unintuitive values whose maximum is at least as unlikely to include human civilization.

Here I’ll list some ideas that might come to mind, along with why I think each is insufficient:

Idea:

Don’t specify our goals to AI using functions.

Flaw:

Current deep learning methods use functions to measure error, and AI learns by minimizing that error in an environment of training data. This has replaced the old paradigm of symbolic AI, which didn’t work very well. If progress continues in this direction, the first powerful AI will operate on the principles of deep learning.
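To illustrate what “use functions to measure error” means in practice, here is a minimal sketch of the loop deep learning is built on: toy data, a single parameter, and plain gradient descent. No particular framework is implied, and real systems differ in scale rather than in kind.

```python
# A minimal sketch of the "measure error with a function, then minimize
# it" loop: a one-parameter model fit to toy data by gradient descent.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy (x, y) pairs with y = 2x

def loss(w):
    # Mean squared error: the function that measures how wrong the model is.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w = 0.0            # the single model parameter
learning_rate = 0.05

for step in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad  # nudge w in the direction that lowers the error

print(round(w, 3), round(loss(w), 6))  # w ends up near 2.0, loss near 0
```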

Even if we build AI that doesn’t maximize a function, it won’t be competitive with AI that does, assuming present trends hold. Building weaker, safer AI doesn’t stop others from building stronger, less safe AI.

Idea:

Use long, complicated functions that represent our actual goals.

Flaw:

This is even more difficult than it sounds. It’s hard to specify a goal like “don’t affect anything except this glass and this pitcher of water” using a function. Every action that fills the glass with water also affects everything within its local causal sphere of influence, if only in very minor ways. “Only affect everything else a little” has very strange worlds near its global maximum. “Only use T seconds to form a plan to fill the glass with water” could result in a plan like “copy-paste my code without the planning time limit”.
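As a toy illustration of why “only affect everything else a little” is hard to get right, suppose we write it as task score minus a weighted penalty for side effects. The plans, scores, and side-effect numbers below are all invented; the point is how sensitive the outcome is to a weight we don’t know how to choose.

```python
# A toy version of "fill the glass, but only affect everything else a
# little": task score minus a weighted side-effect penalty. The plans,
# scores, and side-effect numbers are all invented for illustration.

plans = {
    # plan: (task_score, side_effect_measure)
    "do nothing": (0.00, 0.00),
    "pour water normally": (0.90, 0.10),
    "lock down the room, then pour": (0.99, 2.00),
    "disassemble anything nearby that could knock the glass over": (0.999, 40.0),
}

def objective(plan, penalty_weight):
    task_score, side_effects = plans[plan]
    return task_score - penalty_weight * side_effects

for penalty_weight in (0.0, 0.01, 10.0):
    best = max(plans, key=lambda p: objective(p, penalty_weight))
    print(penalty_weight, "->", best)

# With a tiny penalty the strange plans still win; with a huge one the
# agent does nothing. Whether some middle weight picks the normal plan
# in every situation is exactly the thing that's hard to verify.
```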

Also, the longer our function gets, the harder it is to verify that the space near the global maximum is desirable. A complicated function only needs to be wrong in one way for the maximum to be undesirable.

Idea:

Use deep learning methods to generate the function.

Flaw:

There’s no reliable feedback mechanism to test the function outside of its training data. The function we get might be successful in the training environment, but then produce weird results in the real world.
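Here’s a toy sketch of the failure mode: fit a simple reward model to human ratings, then query it outside the training distribution. The data is invented, and a real learned reward model would be a neural network rather than a fitted line, but the extrapolation problem is the same.

```python
# A toy "learned goal function": fit a reward model to human ratings,
# then query it far outside the training distribution. The data is
# invented for illustration.

# Training data: (ml of water poured, human rating). Humans only ever
# rated sensible amounts, and within that range more water rated higher.
ratings = [(0, 0.0), (100, 0.4), (200, 0.8), (250, 1.0)]

# Fit reward(ml) = a * ml + b by least squares (closed form).
n = len(ratings)
mean_x = sum(x for x, _ in ratings) / n
mean_y = sum(y for _, y in ratings) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in ratings) / sum(
    (x - mean_x) ** 2 for x, _ in ratings
)
b = mean_y - a * mean_x

def learned_reward(ml):
    return a * ml + b

print(round(learned_reward(200), 3))    # in-distribution: ~0.8, looks sensible
print(round(learned_reward(50_000), 1)) # out-of-distribution: ~200, so "flood
                                        # the room" scores far above anything
                                        # a human actually rated
```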

This is okay for tools that don’t have to be 100% error-free, but for a powerful optimizer, a small discrepancy in the goal function can result in strange and undesirable maxima.

Idea:

Involve humans heavily in the process.

Flaw:

Once the AI is powerful enough, any preferences it has will be manifested in the world. Human involvement only keeps it safe as long as it's weak enough to be controlled.

For a powerful AI, even a function like “listen to what humans tell you” results in strange and undesirable global maxima. We would need an extremely reliable value learning function for this process to be safe. Teaching an AI to learn our values successfully might be even harder than teaching an AI our values directly.

Idea:

Get multiple AIs to prevent each other from maximizing their goal functions.

Flaw:

The global maximum of any set of functions like this still doesn’t include human civilization. Either a single AI will win, or some subset will compete among themselves with just as little regard for preserving humanity as the single AI would have.

Idea:

Don’t build powerful AI.

Flaw:

Technological progress is a highly decentralized process which is largely outside the control of any individual or group. Multiple groups are trying to develop powerful AI simultaneously, and more are likely to join in the future. Once it exists, powerful AI is likely to be much easier to generate or copy than historical examples of dangerous technologies like nuclear weapons.

Also, actors with higher risk tolerance or lower risk estimates are incentivized to develop powerful AI first. This makes the strategy of abstaining from AI development an unappealing option for most companies and governments.

Idea:

Build one AI just strong enough to stop others from being built.

Flaw:

The goal “stop other AI from being built” is not likely to be safer than the goal “fill the glass with water”. Both have undesirable worlds near the global maximum.

The key element of this idea might be “just strong enough”, but it’s unclear if this is feasible. It seems contradictory for an AI to be powerful enough to stop other AI from being built, but not powerful enough to produce undesirable worlds.

It’s conceivable that there exists some limited set of capabilities that meets these criteria, but none come to mind. It’s also conceivable that none exist.

Idea:

Enhance human cognitive abilities before the arrival of powerful AI.

Flaw:

We don’t know how to do this. So far, attempts to produce even minor cognitive improvements have been unsuccessful.

AI technology is improving much faster than human cognitive enhancement technology.

...

I don’t know of any solutions that aren’t flawed in this way.

Comments:

For example, if our function measures the probability that some particular glass is filled with water, the space near the maximum is full of worlds like “take over the galaxy and find the location least likely to be affected by astronomical phenomena, then build a megastructure around the glass designed to keep it full of water”.

If the function is 'fill it and see it is filled forever' then strange things may be required to accomplish that (to us) strange goal.

 

Idea:

Don’t specify our goals to AI using functions.

Flaw:

Current deep learning methods use functions to measure error, and AI learns by minimizing that error in an environment of training data. This has replaced the old paradigm of symbolic AI, which didn’t work very well. If progress continues in this direction, the first powerful AI will operate on the principles of deep learning.

Even if we build AI that doesn’t maximize a function, it won’t be competitive with AI that does, assuming present trends hold. Building weaker, safer AI doesn’t stop others from building stronger, less safe AI.

Do you have any idea how to do "Don’t specify our goals to AI using functions."? How are you judging "if we build AI that doesn’t maximize a function, it won’t be competitive with AI that does"?

 

Idea:

Get multiple AIs to prevent each other from maximizing their goal functions.

Flaw:

The global maximum of any set of functions like this still doesn’t include human civilization. Either a single AI will win, or some subset will compete among themselves with just as little regard for preserving humanity as the single AI would have.

Maybe this list should be numbered.

 

This one is worse than it looks (though it seems underspecified). Goal 1: some notion of human flourishing. Goal 2: prevent goal 1 from being maximized. (If this is the opposite of 1, you may have just asked to be nuked.)

 

Idea:

Don’t build powerful AI.

Flaw:

For all that 'a plan that handles filling a glass of water, generated using time t' 'is flawed', this one could actually work. One might object that some particular entity will try to create powerful AI anyway. While there might be incentives to do so, we can still try to set limits, or see safeguards deployed (if the AI managing air conditioning isn't part of your AGI research, add those safeguards now).

This isn't meant as a pure 'this will solve the problem' approach, but that doesn't mean it might not work (thus ensuring AIs handling cooling/whatever at data centers meet certain criteria). 

 

Once it exists, powerful AI is likely to be much easier to generate or copy than historical examples of dangerous technologies like nuclear weapons.

There are a number of assumptions here which may be correct, but are worth pointing out.

How big a file do you think an AI is?

1 MB?

1 TB?

That's not to say compression doesn't exist, but also: what hardware can run this program/software you are imagining (and how fast)?

 

undesirable worlds near the global maximum.

There's a lot of stuff in here about maximums. It seems like your belief that 'functions won't do' stems from a belief that maximization is occurring. Maximizing a function isn't always easy, even at the level of 'find the maximum of this function mathematically'. That's not to say that what you're saying is necessarily wrong, but suppose some goal is 'find out how this protein folds'. It might be a solvable problem, but that doesn't mean it is an easy problem. It also seems like, if the goal is to fill a glass with water, then the goal is achieved when the glass is filled with water.

Thanks for the responses, I'll try to address them individually.

If the function is 'fill it and see it is filled forever' then strange things may be required to accomplish that (to us) strange goal.

I agree that this doesn't adequately represent our goal, but I think the problem persists even when we add lots of qualifications like "make sure the glass is filled with water for the next five minutes and then lose interest". The maximum of that function might not include a large-scale plan due to limited time, but it could include destroying everything within range except for the facility to prevent interference. It's possible that adding enough qualifications would solve this, but it wouldn't be easy to verify.

Do you have any idea how to do "Don’t specify our goals to AI using functions."? How are you judging "if we build AI that doesn’t maximize a function, it won’t be competitive with AI that does"?

I don't know how to achieve the same capabilities as current or future machine learning without specifying goals using functions. In that sense, I think it would be hard to match something like GPT without deep learning, and so more legible alternatives wouldn't be competitive. (I might be understating this. It seems like function-based learning is the only method we have that works.)

This one is worse than it looks (though it seems underspecified). Goal 1: some notion of human flourishing. Goal 2: prevent goal 1 from being maximized. (If this is the opposite of 1, you may have just asked to be nuked.)

I was thinking of Robin Hanson's idea that the competitive market of many AIs would prevent any individual AI from taking over. I don't think that would work either, but I agree that intentionally designing opposing AIs would be even worse.

For all the 'a plan that handles filling a glass of water, generated using time t' 'is flawed' - this could actually work.

It seems like humans are often kept safe from each other by limited resources and limited thinking time, so I agree that this could be a promising approach. But we would have to prevent a limited AI from increasing its own capabilities.

How big a file do you think an AI is?

Maybe it's not as easy as copying a piece of software, but probably easier than building a nuclear weapon in terms of resources. If running it requires an uncommon amount of computing, then you're right, it would be hard to copy.

Maximizing a function isn't always easy, even at the level of 'find the maximum of this function mathematically'.

You're right, achieving the global maximum for many functions would be infeasible. The risk comes when the space of high-value bad outcomes overlaps with the space of feasible strategies for the AI. This is not necessarily at or even near the global maximum. This way of framing the problem might be more accurate.

It also seems like, if the goal is to fill a glass with water, then the goal is achieved when the glass is filled with water.

This is true unless the AI is trying to maximize the probability of success, or the proximity to some exact amount of fullness, or some other precise goal. If it works by satisfying goals without maximizing anything, then the problem might be solved. But I don't think we know how to build powerful AI that satisfies goals without maximizing anything.
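To spell out the distinction I mean, here's a toy sketch (made-up plans and scores) of a maximizer versus a satisficer pointed at the same goal:

```python
# A toy sketch of the distinction: a maximizer versus a satisficer over
# the same made-up plans and scores.

plans = [
    ("pour water from the pitcher", 0.95),
    ("pour water and guard the glass", 0.97),
    ("build a megastructure around the glass", 0.999999),
]

# Maximizer: always takes the highest-scoring plan, however extreme.
maximizer_choice = max(plans, key=lambda p: p[1])

# Satisficer: takes the first plan that clears a "good enough" bar.
threshold = 0.9
satisficer_choice = next(p for p in plans if p[1] >= threshold)

print(maximizer_choice[0])   # build a megastructure around the glass
print(satisficer_choice[0])  # pour water from the pitcher
```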

I like this format!

Even if we build AI that doesn’t maximize a function, it won’t be competitive with AI that does, assuming present trends hold. Building weaker, safer AI doesn’t stop others from building stronger, less safe AI.

Why doesn't your non-function-maximizing safe AI (which learns human values through human involvement) stop others from building stronger, less safe AIs? Seems to me that it probably could and probably should, and if it definitely could and definitely should then it definitely would! :)

Also: upvoted this post for being in the positing-negating format which is fun and easy to read.

In this case one should probably aim to bite the bullet rather than dodge it. 21st-century civilization is not a global maximum, and our intuitions are just that badly miscalibrated about the scope and scale of possible futures (furthermore, there is no ground truth of morality either). There's no solution because the problem rests on a bad assumption (that 21st-century civilization is a global maximum).