[ Question ]

AI Alignment, Constraints, Control, Incentives or Partnership?

by jmh · 1 min read · 31st Dec 2019 · 4 comments



A quick search on AI benevolence did not return much, and I have not waded into the depths of the whole area. However, I am wondering to what extent the current approach here is about constraining and controlling (call that bucket 1) versus incentivizing and partnership (bucket 2) as a solution to the general fears of AGI.

If one were to toss the approaches into one of the two buckets, what percentage would land in each? My impression is that most of what I've seen is bucket 1 type solutions.

2 Answers

"Partnership" is a weird way to describe things when you're writing the source code. That is, it makes sense to think of humans 'partnering' with their children because the children bring their own hopes and dreams and experiences to the table; but if parents could choose from detailed information about a billion children to select their favorite instead of getting blind luck of the draw, the situation would seem quite different. Similarly with humans partnering with each other, or people partnering with governments or corporations, and so on.

However, I do think something like "partnership" ends up being a core part of AI alignment research; that is, if presented with a choice between 'writing out lots of policy-level constraints' and 'getting an AI that wants you to succeed at aligning it' / 'getting an AI that shares your goals', the latter is vastly preferable. See the Non-adversarial principle and Niceness is the first line of defense.

Some approaches focus on incentives, or embedding constraints through prices, but the primary concerns here are 1) setting the prices incorrectly and 2) nearest unblocked strategies. You don't want the AI to "not stuff the ballot boxes", since it will just find the malfeasance you didn't think of; you want the AI to "respect the integrity of elections."
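A minimal sketch of the nearest-unblocked-strategy problem, assuming a hypothetical agent that simply picks the highest-scoring action it is still permitted to take. The action names, scores, and function names are all invented for illustration and do not come from any real system.

```python
# Toy model: the agent picks the highest-scoring action that is not blocked.
# Action names and scores are hypothetical, chosen only to illustrate the point.
ACTIONS = {
    "stuff_ballot_boxes": 10.0,
    "intimidate_voters": 9.5,
    "hack_vote_tally": 9.0,
    "campaign_honestly": 6.0,
}
HARMFUL = {"stuff_ballot_boxes", "intimidate_voters", "hack_vote_tally"}


def best_action(scores, blocked):
    """Return the highest-scoring action that is not in `blocked`."""
    allowed = {a: s for a, s in scores.items() if a not in blocked}
    return max(allowed, key=allowed.get)


def choices_under_patching(scores, harmful):
    """Block each harmful action only after the agent picks it; record every choice."""
    blocked, history = set(), []
    while True:
        choice = best_action(scores, blocked)
        history.append(choice)
        if choice not in harmful:
            return history
        blocked.add(choice)


def best_action_with_values(scores, harmful, penalty=100.0):
    """Alternative: the objective itself penalises harm, so no blocklist is needed."""
    adjusted = {a: s - (penalty if a in harmful else 0.0) for a, s in scores.items()}
    return max(adjusted, key=adjusted.get)


history = choices_under_patching(ACTIONS, HARMFUL)
print(history)
print(best_action_with_values(ACTIONS, HARMFUL))
```

Under patching, the agent walks through every harmful option in turn before it reaches the honest one, so each new constraint only redirects it to the nearest strategy nobody blocked yet. The second function assumes the harm is already reflected in what the agent values, which is of course the hard part; the sketch only shows why blocking actions one at a time is a losing game.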

Another way to think about this, instead of those two buckets, is Eliezer's three buckets of directing, limiting, and opposing.

It isn't clear exactly what these buckets consist of. Could you be more specific about which approaches would be considered bucket 1 or bucket 2? The default assumption in AGI safety work is that even if the AI is really powerful, it should still be safe.

Are these buckets based on the incentivizing of humans by either punishment or reward?

The model of a slave being whipped into obedience is not a good model for AGI safety, and is not being seriously considered. An advanced AI will probably find some way of destroying your whip, or you, or tricking you into not whipping.

The model of an employee being paid to work is also not much use: the AI will try to steal the money, or do something that only looks like good work but isn't.

These strategies sometimes work with humans because humans are of comparable intelligence. When dealing with an AI that can absolutely trounce you every time, the way to avoid all punishment and gain the biggest prize is usually to cheat.

We are not handed an already created AI and asked to persuade it to work, like a manager persuading a recalcitrant employee. We get to build the whole thing from the ground up.

Imagine the most useful, nice, helpful sort of AI: an AI that has every (non-logically-contradictory) nice property you care to imagine. Then figure out how to build that. Build an AI that just intrinsically wants to help humanity, not one constantly trying and failing to escape your chains or grasp your prizes.

1 · jmh · 1y: Perhaps, though not intentionally. I would put "intrinsically wants to help" in bucket 2 (or perhaps say that is bucket 2), while "chains" would be bucket 1. But those are very general concepts and will rely on various mechanisms or implementations. Your comment seems to suggest that bucket 1 is useless and full of holes, and that no one is pursuing that path. Is that the case?
1 · Donald Hobson · 1y: A large majority of the work being done assumes that if the AI is looking for ways to hurt you, or looking for ways to bypass your safety measures, something has gone wrong.