The ambiguity of "human values" is a feature, not a bug.
When talking about AI alignment with noobs, there's this problem where everyone and their grandmother instinctively wants to jump straight to arguing about which particular values/principles/etc the AI should be aligned with. Arguing about what values/principles/etc other people should follow is one of humanity's major pastimes. It's familiar territory, everyone has stupid takes on it, and everyone can spew those stupid takes for ages without getting smacked in the face by reality, because there's usually no good feedback mechanism on value claims. So those habits generalize readily to talking about AI values. It's very much a bike shed[1]. One of the most common first takes people have upon hearing the problem is "Maybe it's easy, people just haven't thought about aligning it to X", where X is love or The Bible or preserving Markov blankets or complexity or niceness or [...]. Or, rather than a positive suggestion, a negative one, like e.g. the classic "But maybe [humans/AI] won't turn out to have a utility function at all".
On the other hand, from a technical perspective, the entire "What values tho?" question is mostly just not that central/relevant. Understanding how minds work at all and how to robustly point them at anything at all is basically the whole problem.
So when talking about the topic, I (and presumably others) have often found myself reaching for a way to talk about the alignment target which is intentionally conflationary. Because if I say some concrete specific target, then the idiots will come crawling out of the woodwork saying "what if you align it to <blah> instead?" or "but what if there is no utility function?" or whatever. By using a generic conflationary term, I can just let people fill in whatever thing they think they want there, and focus on the more central parts of the problem.
Historically, when I've felt that need, I've usually reached for the term "values". It's noncommittal about what kind of thing we're even talking about, it mildly emphasizes that we're not yet sure what we're talking about, and that's a feature rather than a bug of the term. I've historically used "human values" and "AI values" similarly; they're intentionally noncommittal, and that's a feature rather than a bug, because it redirects attention to the more central parts of what I'm talking about, rather than triggering people's takes on alignment targets.
The old adage is that, when a committee is planning a new nuclear power plant, far more time will be spent arguing about what color to paint the bike shed than on any technically load-bearing aspect, because everybody feels qualified to offer a take on the color of the bike shed and to argue about it.
This makes sense as a strategic choice, and thank you for explaining it clearly, but I think it’s bad for discussion norms because readers won’t automatically understand your intent as you’ve explained it here. Would it work to substitute the term “alignment target” or “developer’s goal”?
When I say "human values" without reference I mean "type of things that human-like mind can want and their extrapolations". Like, blind from birth person can want their vision restored, even if they have sufficiently accommodating environment and other ways to orient, like echolocation. Able-bodied human can notice this and extrapolate this into possible new modalities of perception. You can be not vengeful person, but concept of revenge makes sense to almost any human, unlike concept of paperclip-maximization.
When I say "human values" without reference I mean "type of things that human-like mind can want and their extrapolations"
This is a reasonable concept, but it should have a different handle from "human values". (Because it makes common phrases like "we should optimize for human values" nonsensical. For example, human-like minds can want chocolate cake, but that tells us nothing about the relative importance of chocolate cake and avoiding disease, which is relevant for decision-making.)
I think "someone's preferences", or "moral goodness" are approximately the same as "human values" in meaning and ambiguity unless clarified, and the clarifications would work similarly well or poorly for either of them. What "human values" gesture at is distinction from values-in-general, while "preferences" might be about arbitrary values. Taking current wishes/wants/beliefs as the meaning of "preferences" or "values" (denying further development of values/preferences as part of the concept) is similarly misleading as taking "moral goodness" as meaning anything in particular that's currently legible, because the things that are currently legible are not where potential development of values/preferences would end up in the limit.
What "human values" gesture at is distinction from values-in-general, while "preferences" might be about arbitrary values.
I don’t understand what this means.
Taking current wishes/wants/beliefs as the meaning of "preferences" or "values" (denying further development of values/preferences as part of the concept) is as misleading as taking "moral goodness" to mean anything in particular that's currently legible, because the things that are currently legible are not where potential development of values/preferences would end up in the limit.
Is your point here that "values" and "preferences" are based on what you would decide to prefer after some amount of thinking/reflection? If so, my point is that this should be stated explicitly in discussions, for example: "here I am discussing the preferences you, the reader, would have after thinking for many hours."
If you want to additionally claim that these preferences are tied to moral obligation, this should also be stated explicitly.
Whether to state things explicitly is a tradeoff that must be judged by success or failure in conveying the intended point, not by stricture of form.
By "human values" being distinct from arbitrary values I simply mean that anything called "human values" is less likely to be literal paperclipping than values-in-general, it's suggesting a distribution over values that's human-specific in some way. By "preferences" also gesturing at their further development on reflection I'm pointing out that this is a strong possibility for what the term might mean, so unless a clarification rules it out, it remains a possible intended meaning. (More specifically, I meant the whole process of potential ways of developing values/preferences, not some good-enough end-point, so not just thinking for many hours, but also not disregarding current wishes/wants/beliefs, as they too are part of this process.)
I really dislike the phrase "human values". I think it's confusing because:
Instead of "human values", people should either: