It seems like it would be really easy to come up with a lot of moral questions and answers and then ask an AI to tell us what it predicts humans preferring as an outcome.

There's a possibility that AI is not good at modeling human preferences, but if that's the case, it will be very apparent at lower capability levels, because commands will have to be very specific to get results. Any model that can't answer basic questions about its intended goals is not going to be given the (metaphorical) nuclear codes.

In fact, why wouldn't you just test every AI by asking it to explain how it's going to solve your problem before it actually solves it?


This article (by Eliezer Yudkowsky) explains why the suggestion in your 2nd paragraph won't work:

I'm afraid I'll butcher the argument in trying to summarize, but essentially: even slight misalignments get blown up under increasing optimization pressure (i.e., the AI pursues the areas where it is misaligned at the expense of everything else). So you might have something aligned fairly well, and at the test level of optimization you can check that it is indeed aligned pretty well; but when you turn up the pressure, it will find the weaker points in the specification and optimize for those instead. This problem recurs at the meta-level, so there's no obvious way to say "well, obviously just don't do that" that would actually work.
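The "misalignment blows up under pressure" dynamic can be sketched as a toy simulation (my own illustration, not from the article; all names and numbers are made up): a selector picks whichever candidate scores best on a noisy proxy for the true objective, and as the search widens, the gap between what the proxy reports and what we actually got grows.

```python
import random

random.seed(0)

def goodhart_gap(pressure):
    """Optimize the proxy over `pressure` candidates; return how far
    the proxy overstates the true value at the selected point."""
    # Each candidate has a true value x, and a proxy value that adds
    # independent noise -- the "slight misalignment" in the measure.
    scored = [(x + random.gauss(0, 1), x)
              for x in (random.gauss(0, 1) for _ in range(pressure))]
    proxy_val, true_val = max(scored)  # select on the proxy only
    return proxy_val - true_val

def avg_gap(pressure, trials=500):
    # Average the overshoot across many independent searches.
    return sum(goodhart_gap(pressure) for _ in range(trials)) / trials

print(avg_gap(3), avg_gap(3000))
```

Running this, the average overshoot at pressure 3000 comes out noticeably larger than at pressure 3: light testing looks aligned, heavy optimization finds and exploits the noise. That's the pattern the argument warns about, in miniature.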

The problem with asking the AI how it will solve your problem is that if it is misaligned, it will just lie to you if that helps it complete its objective more effectively.

I think there may have been some miscommunication here: either I'm not understanding you or you're not understanding me. So I'll explain my second-paragraph point in a different way, in case the mistake was mine.

My model is that at lower levels of AI capability, 'misalignment' will be measurable but not catastrophic. It would look like producing an advertising campaign that is funny but does not feature the product, or a tool that is very cheap but very useless. Any misunderstanding of human preferences will lead to failure, so either humans will improve their ability to...

I think the question of you/Adele miscommunicating is mostly under-specification of which features you want your test-AGI to have.

* If you throttle its ability to optimize for its goals, see EY's and Adele's arguments.
* If you don't throttle it in this way, you run into goal-specification/constraint-specification issues, instrumental convergence concerns, and everything that goes along with them.

I think most people here will strongly feel that a (computationally) powerful AGI with any incentives is scary, and that any test versions should be restricted to much-less-powerful ones. Sorry if I've misunderstood you at all. If you specify the nature/goals/constraints etc. of your test-AI more precisely, maybe I or someone else can give you more specific failure modes.