If an AI observes a strong inconsistency in my liking-wanting-approving, should it stop (and inform me about it), or try to aggregate my preferences anyway?
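To make the two options concrete, here is a minimal sketch of what "stop and inform" versus "aggregate anyway" could look like. Everything in it is an assumption for illustration and not from the post: the `PreferenceScores` type, the [-1, 1] scale, the `max_spread` threshold, and the plain average used as the aggregation step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferenceScores:
    """Hypothetical scores in [-1, 1] for one option along each dimension."""
    liking: float
    wanting: float
    approving: float


def aggregate_or_defer(
    scores: PreferenceScores,
    max_spread: float = 1.0,  # assumed tolerance for disagreement, not from the post
) -> Optional[float]:
    """Return an aggregate score, or None to signal 'stop and ask the human'."""
    values = [scores.liking, scores.wanting, scores.approving]
    # Treat the gap between the highest and lowest dimension as the inconsistency.
    spread = max(values) - min(values)
    if spread > max_spread:
        return None  # too inconsistent: defer instead of aggregating
    # Placeholder aggregation: an unweighted mean of the three dimensions.
    return sum(values) / len(values)


consistent = PreferenceScores(liking=0.5, wanting=0.75, approving=0.25)
conflicted = PreferenceScores(liking=0.9, wanting=0.8, approving=-0.7)

print(aggregate_or_defer(consistent))  # aggregates, since the spread is 0.5
print(aggregate_or_defer(conflicted))  # defers, since the spread is 1.6
```

The question then becomes where to set the threshold, and whether a single spread number is even the right measure of inconsistency.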

[-]Shmi7y30

I like the idea of multidimensional preferences, such as liking/wanting/approving, rather than maximizing a single utility function. I suspect there are more dimensions worth considering. For example, "deserving" is one often missed by those who had a reasonably happy childhood. In those who faced emotional abuse growing up and eventually internalized it, the gap between wanting and deserving can be considerable. It is quite common to hear "I want to be happy," but when you ask something like "Do you feel that you deserve to be happy?" the answer is often either a pause or a negative, something like "I am not a good person, I do not deserve to be happy." I'm not sure whether this can be incorporated into your model, or what other axes are worth considering.

[-]avturchin7y20

Yes, other types of "preferences" are conceivable. For example, a person acting under another person's orders, like a soldier, may neither like, want, nor approve of the order, but still obey it because he has to.

[-]Charlie Steiner7y10

Interesting! I'm still concerned that, since you need to aggregate these things in the end anyhow (because everything is commensurable in the metric of affecting decisions), the aggregation function is going to be allowed to be very complicated and dependent on factors that don't respect the separation of this trichotomy.

But it does make me consider how one might try to import this into value learning. I don't think it would work to take these categories as given and then try to learn meta-preferences to sew them together, but most (particularly more direct) value learning schemes have to start with some "seed" of examples. If we draw that seed only from "approving," does that mean that the trained AI isn't going to value wanting or liking enough? Or would everything probably be fine, because we wouldn't approve of bad stuff?

Acknowledging Human Preference Types to Support Value Learning

Motivation

Framework: Liking, Wanting and Approving

Aggregating Preferences

Choosing an Aggregation Method

Final Remarks

References