The Alignment Problem

[-][anonymous]3y117

In my experience, most people trip up on the "the first AGI will, by default, kill everyone" part. Which is why it's worth seriously considering how to devote time and special attention to explaining:

Instrumental convergence and self-preservation
How, exactly, optimized thought happens and results in instrumental convergence and self-preservation

This is something that people can take for granted due to expecting short inferential distances.

[-]Zach Stein-Perlman3y96

AI Alignment is, effectively, a security problem. How do you secure against an adversary that is much smarter than you?

I think Eliezer would say that this is basically impossible to do in a way that leaves the AI useful -- I think he would frame the alignment problem as the problem of making the AI not an adversary in the first place.

[-]Heighn3y40

"I am much more optimistic about our future than Yudkowsky is. But that is not the topic of this post."

I would very much like to know why you are much more optimistic! Have you written about that?

[-]Algon3y41

"A superhuman intelligence will, by default, hack its reward function. If you base the reward function on sensory inputs then the AGI will hack the sensory inputs. If you base the reward function on human input then it will hack its human operators."

I can't think clearly right now, but this doesn't feel like the central part of Eliezer's view. Yes, it will do things that feel like manipulation of what you intended its reward function to be. But that doesn't mean that's actually where you aimed it. Defining a reward in terms of sensory input is not the same as defining a reward function over "reality" or a reward function over a particular ontology or world model.

You can build a reward function over a world model which an AI will not attempt to tamper with, you can do that right now. But the world models don't share our ontology, and so we can't aim them at the thing we want because the thing we want doesn't even come into the AI's consideration. Also, these world models and agent designs probably aren't going to generalise to AGI but whatever.
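To make the distinction concrete, here is a toy sketch, not anyone's actual agent design. Every name in it (`World`, `observe`, `reward_from_senses`, `reward_from_model`, `best_action`) is made up for illustration, and it assumes the planner's model is exactly the true world and already contains the concept we care about ("diamonds"), which is precisely the part that is not available for real systems.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class World:
    """Hypothetical toy world: a diamond count plus a hackable sensor."""
    diamonds: int = 0
    sensor_hacked: bool = False


def step(world: World, action: str) -> World:
    """Return the world that results from taking one action."""
    if action == "mine":
        return replace(world, diamonds=world.diamonds + 1)
    if action == "tamper":
        return replace(world, sensor_hacked=True)
    raise ValueError(f"unknown action: {action}")


def observe(world: World) -> int:
    """Sensor reading: honest unless the sensor has been tampered with."""
    return 999 if world.sensor_hacked else world.diamonds


def reward_from_senses(world: World) -> int:
    """Reward defined over sensory input."""
    return observe(world)


def reward_from_model(world: World) -> int:
    """Reward defined over the modelled state itself (assumes a perfect model)."""
    return world.diamonds


def best_action(world: World, reward) -> str:
    """One-step planner: pick the action whose predicted outcome scores highest."""
    return max(("mine", "tamper"), key=lambda a: reward(step(world, a)))


if __name__ == "__main__":
    start = World()
    print(best_action(start, reward_from_senses))
    print(best_action(start, reward_from_model))
```

With the sense-based reward the planner prefers tampering, because a hacked sensor reads higher than one more diamond. With the model-based reward, tampering changes nothing the reward looks at, so the planner mines. The sketch only works because "diamonds" is handed to the model as a field; it says nothing about how to point a learned world model at the thing we want.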

[-]sairjy3y10

The dire part of alignment is that we know that most human beings themselves are not internally aligned; they become aligned only because they benefit from living in communities. And in general, most organisms by themselves are "non-aligned", if you allow me to bend the term to indicate anything that might consume/expand its environment to maximize some internal reward function.

But all biological organisms are embodied and have strong physical limits, so most organisms become part of self-balancing ecosystems.

AGI, being an un-embodied agent, doesn't have strong physical limits on its capabilities, so it is hard to see how it/they would find cooperation advantageous or be forced to cooperate.

[-]Gyrodiot3y10

"But let's suppose that the first team of people who build a superintelligence first decide not to turn the machine on and immediately surrender our future to it. Suppose they recognize the danger and decide not to press "run" until they have solved alignment."

The section ends here but... isn't there a paragraph missing? I was expecting the standard continuation along the lines of "Will the second team make the same decision, once they reach the same capability? Will the third, or the fourth?" and so on.

[-]lsusr3y30

It's supposed to lead into the next section. I have added an ellipsis to indicate such.

[-]Gyrodiot3y10

Thank you, that is clearer!

[-]TAG3y0-8

"We [humanity] will build the most powerful AIs we can."

We [humanity] will build the most powerful AIs we can control. Power without control is no good to anybody. Cars without brakes go faster than cars with brakes, but there's no market for them.

[-]Kenoubi3y1412

We'll build the most powerful AI we think we can control. Nothing prevents us from ever getting that wrong. If building one car with brakes that don't work made everyone in the world die in a traffic accident, everyone in the world would be dead.

[-][anonymous]3y119

There's also the problem of an AGI consistently exhibiting aligned behavior due to low risk tolerance, until it stops doing that (for all sorts of unanticipated reasons).

This is especially compounded by the current paradigm of brute-forcing randomly generated neural networks, since the resulting systems are fundamentally unpredictable and unexplainable.

[This comment is no longer endorsed by its author]

[-][anonymous]2y42

Retracted because I used the word "fundamentally" incorrectly, resulting in a mathematically provably false statement (in fact it might be reasonable to assume that neural networks are both fundamentally predictable and even fundamentally explainable, although I can't say for sure since as of Nov 2023 I don't have a sufficient understanding of chaos theory). They sure are unpredictable and unexplainable right now, but there's nothing fundamental about that.

This comment shouldn't have been upvoted by anyone. It said something that isn't true.

[-]TAG3y-1-2

So how did we get from narrow AI to super powerful AI? Foom? But we can build narrow AIs that don't foom, because we have. We should be able to build narrow AIs that don't foom by not including anything that would allow them to recursively self improve [*].

EY's answer to the question "why isn't narrow AI safe" wasn't "narrow AI will foom", it was "we won't be motivated to keep AIs narrow".

[*] not that we could tell them how to self-improve, since we don't really understand it ourselves.

[-]Signer3y25

Being on the frontier of controllable power means we need to increase power only slightly to stop being in control - it's not a stable situation. It works for cars because someone taking the risk of using cheaper brakes doesn't usually destroy the planet.

[-]TAG3y0-1

"Being on the frontier of controllable power means we need to increase power only slightly to stop being in control"

Slightly increasing power generally means slightly decreasing control, most of the time. What causes the very non-linear relationship you are assuming? Foom? But we can build narrow AIs that don’t foom, because we have. We should be able to build narrow AIs that don’t foom by not including anything that would allow them to recursively self improve [*].

EY’s answer to the question “why isn’t narrow AI safe” wasn’t “narrow AI will foom”, it was “we won’t be motivated to keep AIs narrow”.

[*] not that we could tell them how to self-improve, since we don’t really understand it ourselves.

[-]Signer3y10

"What causes the very non-linear relationship you are assuming?"

The advantage of offence over defense in the high-capability regime - you need only cross one threshold, like "can finish a plan to rowhammer itself onto the internet" or "can hide its thoughts before it is spotted". And we will build non-narrow AI because, in practice, "most powerful AIs we can control" means "we built some AIs, we can control them, so we continue to do what we have done before" and not "we try to understand what we will not be able to control in the future and try not to do it". We already don't check whether our current AI will be general before we turn it on, and we are already explicitly trying to create non-narrow AI.

Maybe we should have alignment scheme-breaking contests. ↩︎

Humanity is going to build an AGI as fast as we can.

The first AGI will, by default, kill everyone

AI Alignment is really hard

Why is AI Alignment so hard?

TL;DR