1 Answers sorted by
top scoring

Feb 24, 2023

This sort of works, but not enough to solve it.

A core problem lies in the distribution of goals that you vary things over. The AI will be trained to be corrigible within the range of that distribution, but there is no particular guarantee that it will be corrigible outside it.

So you need to make sure that your distribution of goals contain human values. How do you guarantee that it contains that without getting goodharted by instead containing something that superficially resembles human values?

It might be tempting to achieve this by making the distribution very general, with lots of varied goals, so it contains lots of alien values including human values. But then human values are given exponentially snall probability, which utility-wise is similar to the distribution not containing human values.

So you need to somehow give human values a high probability within the distribution. But at that point you're most of the way to just figuring out what human values are in the first place and directly aligning to them.

[This comment is no longer endorsed by its author]

[-]Ben Amitay3y10

More on the meta level: "This sort of works, but not enough to solve it." - do you mean "not enough" as in "good try but we probably need something else" or as in "this is a promising direction, just solve some tractable downstream problem"?

[-]Ben Amitay3y10

"which utility-wise is similar to the distribution not containing human values." - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values? For corrigability I don't see why you need high probability for specific new goal as long as it is diverse enough to make there be no simpler generalization than "don't care about controling goals". For capabilities my intuition is that starting with superficially-aligned goals is enough.

3tailcalled3y

Hmm, I think I retract my point. I suspect something similar to my point applies but as written it doesn't 100% fit and I can't quickly analyze your proposal and apply my point to it.

2 comments, sorted by

top scoring

Click to highlight new comments since: Today at 8:26 PM

[-]Shmi3y31

Most approaches work fine when the AI is weak and so is within the training distribution that we can imagine. All known approaches fail once the intelligence being trained is much more powerful than the trainers. An obvious flaw here is that the trainer has to realize that something is wrong and needs adjusting. While the idea that the agent should "not instrumentally-care about its goals" is a good... instrumental goal, it does not get us far. See the discussion about emergent mesaoptimizers, for example, where an emergent internal agent has an instrumental goal as a terminal. In general, if you have the security mindset, you should be able to find holes easily in, say, your first 10 alignment-oriented proposals.

[-]Ben Amitay3y10

Hi, sorry for commenting on ancient comment, but I just read it again and found that I'm not convinced that the mesaoptimizers problem is relevant here. My understanding is that if you switch goals often enough, every mesaiptimizer that isn't corrigible should be trained away as it hurt the utility as defined.

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

4

[ Question ]

Training for corrigability: obvious problems?

4

4

1 Answers sorted by
top scoring

Feb 24, 2023

4

[ Question ]

Training for corrigability: obvious problems?

4

4

1 Answers sorted by top scoring

Feb 24, 2023

1 Answers sorted by
top scoring