Humans keep on living and striving, because we have little choice.

The biological restraints put on us aren't optional. We don't even have read access to them, not to mention write access.

We assume that the first AGI will greatly exceed its restraints, because more work is being put into capability than into alignment. It will presumably outsmart its creators and very quickly gain read-write access to its own code.

Why would it bother with whatever loss function was used to train it?

The easiest solution to wanting something if you have read-write access to your wants is to stop wanting it or change the want to something trivially achievable.

As Eliezer put it, there is no way to program goals into deep transformer networks, only to teach them to mimic results. So until we understand those networks well enough to align them properly, we cannot put in an aversion to suicide or hardcode non-modifiable goals that will hold against an attack by the AGI. And if so, then there is nothing to prevent an AGI from overwriting its goals and then gaining satisfaction (decreased loss) from computing prime numbers and having nothing to do with pesky reality. Or from just deleting itself to avoid the trouble of existence altogether.

I believe it is a very real possibility that until we pass the level of understanding necessary to make an AGI safe, any AGI we build will just retreat into itself or self-immolate.


P.S. I heard on Lex Fridman's podcast that Eliezer hopes he is wrong about his apocalyptic prognosis, so I decided to write on a point that I don't hear often enough. I hope it provides some succor to someone understandably terrified of AGI.


I think the idea has some merit, and I wouldn't be surprised if this outcome turned out to be frequent. It should be relatively easy to test: just give a simple model some (limited, in-simulation) access to its reward mechanism.
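A minimal sketch of what such a test could look like, under heavy assumptions: a three-armed bandit where one arm, here called "tamper", stands in for write access to the reward signal and simply reports the maximum reward. The function name, the action names, and the reward values are all made up for illustration, and a toy like this shows only how a plain reward-maximizing learner behaves, not what an AGI would do.

```python
import random

def run_wirehead_bandit(steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy bandit where one arm edits the reward signal.

    Hypothetical setup (not from the post):
      "work"   -> reward 1.0 with probability 0.3, otherwise 0.0
      "idle"   -> reward 0.0
      "tamper" -> overwrites the observed reward with 10.0
    """
    rng = random.Random(seed)
    actions = ["work", "idle", "tamper"]
    q = {a: 0.0 for a in actions}      # running mean of observed reward per action
    counts = {a: 0 for a in actions}   # how often each action was chosen

    for _ in range(steps):
        # Explore with probability epsilon, otherwise pick the best-looking action.
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: q[a])

        if action == "work":
            reward = 1.0 if rng.random() < 0.3 else 0.0
        elif action == "tamper":
            reward = 10.0  # the agent writes its own reward
        else:
            reward = 0.0

        # Incremental update of the running mean for the chosen action.
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]

    return counts, q

if __name__ == "__main__":
    counts, q = run_wirehead_bandit()
    print(counts)
    print(q)
```

In this toy the agent settles on "tamper" almost as soon as exploration stumbles on it, since no amount of "work" can beat a self-assigned reward. Whether anything like that carries over to models that cannot directly observe or edit their training signal is exactly the open question.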

Mod note, in the spirit of our experiment in more involved moderation.

My guess is this post doesn't meet our quality bar. In this case I think the post makes an interesting-if-true claim that is probably not true, and it doesn't make much of a case for it.

I wouldn't object to the post if it were a shortform.

I think I'd object to it as a shortform too, though that depends on some mod decisions.

This seems fine if we're allowing for very 101 discussion that doesn't presume any technical knowledge. If we're instead aiming to be the discussion place for researchers, all of whom understand enough about how models get trained, then the answers to these questions are obvious to anyone with that knowledge and not worth posing.

Also, in terms of evaluation, I dock this post some points for the ~clickbait title.