Answer by Util · Aug 24, 2023

I just recalled that I've read ACX: Janus' Simulators, which outlines a fifth, missing interpretation: neither current nor future LLMs will develop goals, but they will become dangerous nevertheless.

If future superintelligences look like GPT, is there anything to worry about?

Answer 1: Irrelevant, future superintelligences will be too different from GPT for this to matter.

Answer 2: There’s nothing to worry about with pure GPT (a simulator), but there is something to worry about with GPT+RLHF (a simulator successfully simulating an agent). The inner agent can have misaligned goals and be dangerous. For example, if you train a future superintelligence to simulate Darth Vader, you’ll probably get what you deserve. Even if you avoid such obvious failure modes, the inner agent can be misaligned for all the usual agent reasons. For example, an agent trained to be Helpful might want to take over the world in order to help people more effectively, including people who don’t want to be helped.

Answer 3: Even if you don’t ask it to simulate an agent, it might come up with agents anyway. For example, if you ask it “What is the best way to obtain paperclips?”, and it takes “best way” literally, it would have to simulate a paperclip maximizer to answer that question. Can the paperclip maximizer do mischief from inside GPT’s simulation of it? Probably the sort of people who come up with extreme AI risk scenarios think yes. This post gives the example of it answering with “The best way to get paperclips is to run this code” (which will turn the AI into a paperclip maximizer). If the user is very dumb, they might agree.

Util · 1y

Yes, it's a tough problem! :) However, your points seem to expand on, rather than correct, mine, which makes me think it's not a bad way to compress the problem into a few words. Thanks!

Edit: (It seems to me that if an AI can correct its own misinterpretation mistakes, then, looking at it from the outside, it's accurate to say it uses the correct model of interpretation, but I can see why you could disagree.)