The Ambiguity Of "Human Values" Is A Feature, Not A Bug

by johnswentworth
16th Nov 2025
3 min read

This post started as two comments in response to Don't use the phrase "human values".

I find myself using the term “human values” for two main reasons. And for both of them, the ambiguity of the term is load bearing.

First Use Case: Anti Bike Shedding

When talking about AI alignment with noobs, there's this problem where everyone and their grandmother instinctively wants to jump immediately to arguing about which particular values/principles/etc the AI should be aligned with. Arguing about what values/principles/etc other people should follow is one of humanity's major pastimes: it's familiar territory, everyone has stupid takes on it and can spew those stupid takes for ages without getting smacked in the face by reality (there's usually no good feedback mechanism on value claims), so those habits generalize readily to talking about AI values. It's very much a bike shed[1]. One of the most common first takes people have upon hearing the problem is "Maybe it's easy, people just haven't thought about aligning it to X", where X is love or The Bible or preserving Markov blankets or complexity or niceness or [...]. Or, rather than a positive suggestion, they offer a negative one, like e.g. the classic "But maybe [humans/AI] won't turn out to have a utility function at all".

On the other hand, from a technical perspective, the entire "What values tho?" question is mostly just not that central/relevant. Understanding how minds work at all and how to robustly point them at anything at all is basically the whole problem.

So when talking about the topic, I (and presumably others) have often found myself reaching for a way to talk about the alignment target which is intentionally conflationary. Because if I say some concrete specific target, then the idiots will come crawling out of the woodwork saying "what if you align it to <blah> instead?" or "but what if there is no utility function?" or whatever. By using a generic conflationary term, I can just let people fill in whatever thing they think they want there, and focus on the more central parts of the problem.

Historically, when I've felt that need, I've often reached for the term "values". It's noncommittal about what kind of thing we're even talking about, and it mildly emphasizes that we're not yet sure what we're talking about; that's a feature rather than a bug of the term. I've historically used "human values" and "AI values" similarly: they're intentionally noncommittal, and that's a feature rather than a bug, because it redirects attention to the more central parts of what I'm talking about, rather than triggering people's takes on alignment targets.

I do typically prefer the less ambiguous term “alignment target” in situations where it makes sense (like e.g. the previous sentence). But sometimes, it doesn’t quite fit for whatever reason, so I fall back on the conflationary term “values”.

Second Use Case: There’s A Thing Here Which Is Not Any Of Those Other Things

Often I talk about humans’ values without reference to AI at all (e.g. some recent posts). Why not instead talk about preferences? Or morals? Or goals? Or emotional affect? Those are all very different things; why not explicitly say the one which I’m thinking about?

In this case, the problem is that I want to point to a thing which is not any of those other things. It doesn’t have a name other than “a human’s values”. I have written a few posts trying to gesture at what I mean by human values, and I think they do a decent job, but also Steve Byrnes has left IMO pretty compelling arguments that my gesturing isn’t fully right yet.

In this case, I use "human values" as an intentionally-underdefined term for this thing which I don't know how to make fully legible yet. It's essentially a placeholder, a box which I hope and expect to fill in later as my own understanding improves. It's one of those problems where understanding the meaning of the term is tightly coupled to understanding the abstract structure of a human mind, so that we know what structures within the mind even make sense to talk about. What Is The Alignment Problem? contains my best current explanation of what it means and looks like to figure out this sort of problem, with a section on my current best (but still not yet perfect) guess at what "a human's values" gestures at.

  1. ^

    The old adage is that, when a committee is planning a new nuclear power plant, far more time will be spent arguing about what color to paint the bike shed than on any technically load-bearing aspect, because everybody feels qualified to offer a take on the color of the bike shed and to argue about it.