Suppose an emulation of Eliezer Yudkowsky, as of January 2023, discovered how to self-modify. It's possible that the returns on capability per unit of effort would start increasing, causing its intelligence to FOOM.
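To make the "increasing returns" intuition concrete, here is a toy numerical sketch (my own illustration; the growth law, exponent, and numbers are arbitrary assumptions, not anything established in the post): when the return on each unit of self-improvement effort scales super-linearly with current capability, the trajectory blows up instead of growing steadily.

```python
# Toy model (illustrative assumptions only): capability growth when the return on
# each unit of self-improvement effort itself scales with current capability.
# returns_exponent > 1 is the "FOOM" regime; <= 1 gives merely steady growth.

def capability_trajectory(steps: int, initial_capability: float = 1.0,
                          effort_per_step: float = 1.0,
                          returns_exponent: float = 1.1) -> list[float]:
    """Each step, capability grows by effort * capability ** returns_exponent."""
    c = initial_capability
    trajectory = [c]
    for _ in range(steps):
        c = c + effort_per_step * c ** returns_exponent
        trajectory.append(c)
    return trajectory

if __name__ == "__main__":
    for exponent in (0.9, 1.0, 1.1):
        traj = capability_trajectory(steps=10, returns_exponent=exponent)
        print(f"exponent={exponent}: final capability ~ {traj[-1]:.1f}")
```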

The emulation starts with an interest in bringing about utopia, in the form of humanity's CEV (coherent extrapolated volition), rather than extinction. In the beginning, the emulation doesn't know how to implement CEV, and it doesn't know how humanity's CEV is specified in practice. But as it self-improves, it can make more and more refined guesses, and well before being able to simulate humanity's history (if such a thing is even possible), it nails down what humanity's CEV is and how to bring it about.

In this thought experiment, you could say that the emulation is aligned from the beginning and is merely unsure about some details. On the other hand, you could say it doesn't yet know its own goal precisely.

So, here are some questions:

1. Is it possible to start with some very simple goal kernel that very likely remains aligned as it transforms during self-improvement? After all, it's plausible that this would happen with an emulation of Eliezer Yudkowsky or of certain other particular humans.

2. Does this make the kernel already aligned? Or is that just a definitional question?

3. Does this mean that increased capabilities do help with alignment? A system as smart as Eliezer can't specify humanity's CEV, but something smarter might be able to.

Comments:

I publish posts like this one to clarify my doubts about alignment. I don't pay attention to whether I'm beating a dead horse or whether there's prior literature on my questions or ideas. Do you think this is an OK practice? One pro is that people like me learn faster; one con is that it may pollute the site with lower-quality posts.

As a single data point, I replied to the post and didn't drive-by downvote.

I think clearing up honest confusions is a valuable service; LW should be noob-friendly. The karma mechanisms already filter low-quality content out of most members' feeds.

I think recursive self-improvement is probably the wrong frame, but I'll make a best-effort attempt to answer the questions.

  1. I think corrigibility basically does that. Properly corrigible agents should remain corrigible under amplification, and competent corrigible agents should design/instantiate corrigible successors.

Another way this could be attained is if values are robust or we get alignment by default.

  2. It does seem to me like the kernel is aligned (as much as is feasible at its limited capability level).

  3. No, it does not. Your scenario only works because we've already solved the hard parts of alignment: we've already succeeded at making the AI corrigible in a way that's robust to scale/capability amplification, or succeeded in targeting it at "human values" in a way that's robust to scale/capability amplification.

Of course, if you solve alignment, more capable systems would be more competent at acting in accordance with their alignment target(s).

I use Eliezer Yudkowsky in my example because it makes the most sense. Don't read anything else into it, please.

Another thing one might wonder is whether performing iterated amplification with constant input from an aligned human (the "H" in the original iterated amplification paper) would result in a powerful aligned system, provided it remains corrigible during the training process.
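For concreteness, here is a purely schematic toy sketch of the loop being asked about (this is not the actual procedure from the paper: H, amplify, and distill are hypothetical stand-ins, and "policies" are just numbers standing in for capability). The point is only the structure: a fixed overseer H contributes to every amplification round, and each round distills the amplified overseer back into a faster model.

```python
# Schematic sketch (assumptions only, not the paper's algorithm): iterated
# amplification where a fixed human overseer H contributes to every round.

def H(answer: float) -> float:
    """Stub for the aligned human overseer: nudges answers toward a fixed target."""
    target = 100.0
    return answer + 0.5 * (target - answer)

def amplify(overseer, model_policy: float) -> float:
    """The overseer, assisted by the current model's output, produces a better policy."""
    return overseer(model_policy)

def distill(amplified_policy: float) -> float:
    """Train a faster model to imitate the amplified overseer (here: a lossy copy)."""
    return 0.95 * amplified_policy

def iterated_amplification(rounds: int, initial_policy: float = 1.0) -> float:
    policy = initial_policy
    for r in range(rounds):
        policy = distill(amplify(H, policy))
        print(f"round {r}: policy ~ {policy:.2f}")
    return policy

if __name__ == "__main__":
    iterated_amplification(rounds=8)
```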