Resolving human values, completely and adequately

I very much appreciate the amount of time and effort you're putting into this!

That said, as much as I'd like to engage with this post, it feels very hard for me to do. The main problem is that there are a lot of very specific details that I don't have enough context to evaluate. By "context", I mean that there are a million different ways in which one could choose to formalize human values, and I assume that you've got some very specific reasons for the formalization choices you have made. To evaluate whether these are good choices, I'd need to understand your goals in making them, but you seem to have given us only the end results of your thought process rather than its original goals.

For instance, you note that $W_{H} (v)$ can be 0 if a human has carefully considered it and found it to be irrelevant or negative. This sentence jumped out at me somewhat, since I would have intuitively assumed that if the human had evaluated something negative, it would be assigned a negative value rather than a 0; at least I wouldn't have expected values that were evaluated as irrelevant, to be assigned the same score as values that were evaluated as negative!

Reading on, I found that you separately define an endorsement of v, which can be negative - so apparently if we have evaluated things as negative, we can maybe still model that by assigning the thing a positive value and then giving it a negative endorsement value? I'm confused as to why these are split into two different variables. "Endorsement" suggests that it's about meta-values, so that the intent of this separation would be to model things which the human likes but doesn't actually endorse liking. But that doesn't capture the possibility that they e.g. dislike pain, and also endorse disliking pain.

Or maybe, since a value v was supposed to be defined as a statement which a human might agree to, we're supposed to model pain avoidance as a positive claim, "pain is to be avoided", which is then given a positive value? That would make sense, but in that case I'm again unclear on what the endorsement thing is meant to model, since apparently it doesn't take things like "liking" into account at all, but rather acts directly on endorsements?

So I mentally tag this as unclear and try to read on, hoping that this will be clarified later in the article, but instead I seem to run into a lot more specific choices and assumptions, and get the feeling that the article's assuming me to already have understood the previous sections in each new section it introduces... at which point I gave up.

What would make this much more readable for me would be something like, each subsection starting with the philosophical motivation and desiderata for the formalization choices made in that section, then having the content that it has now, and then finally giving some practical examples of what these formalizations imply and what kinds of mathematical objects result as a consequence. (Not necessarily always in that order: some mixing might be in order. E.g. for section 1.1, you have the line "Object level values are those which are non-zero only on rewards"; this seems to suggest that there may be values which refer to other values, separately from having the value also contain an endorsement for its assigned reward...? So you could have a value that assigns a positive value to some reward, a negative endorsement of that reward, and then a separate value which treats the outcome of the first value as a positive reward with some weight, and which also assigns a positive or negative endorsement to the result of that computation...? I'm probably misunderstanding this somehow, which a bunch of examples about object-level and non-object-level values would clear up.)

Knowing at least what kind of real-world thing the formalism is trying to capture would help a lot when trying to evaluate whether I'd interpreted something you said correctly.

[-]Stuart_Armstrong8y50

Thanks!

Ok, I will rework it for improved clarity; but not all the options I chose have deep philosophical justifications. As I said, I was aiming for an adequate resolution, with people's internal meta-values working as philosophical justifications for their own resolution.

As for the specific case that tripped you up: I wanted to distinguish between endorsing a reward or value, endorsing its negative, and endorsing not having it. "I want to be thin" vs "I want to be fat" vs "I don't want to care about my weight". The first one I track as a positive endorsement of R, the second as a positive endorsement of -R, the third as a negative endorsement of R (and of -R).
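The three cases can be pictured with a small sketch like the one below. It is only an illustration: the `Endorsement` class, its field names, the `classify` helper, and the ±1 weights are assumptions made for the example, not notation from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endorsement:
    """Endorsement weights that a stated value puts on a reward R and on -R."""
    of_R: float      # weight on R (here, a hypothetical "being thin" reward)
    of_neg_R: float  # weight on -R


# Illustrative weights only; the post does not fix these numbers.
STATEMENTS = {
    "I want to be thin": Endorsement(of_R=1.0, of_neg_R=0.0),
    "I want to be fat": Endorsement(of_R=0.0, of_neg_R=1.0),
    "I don't want to care about my weight": Endorsement(of_R=-1.0, of_neg_R=-1.0),
}


def classify(e: Endorsement) -> str:
    """Name which of the three cases an endorsement pattern falls into."""
    if e.of_R < 0 and e.of_neg_R < 0:
        return "endorse not having R"
    if e.of_R > 0 and e.of_neg_R <= 0:
        return "endorse R"
    if e.of_neg_R > 0 and e.of_R <= 0:
        return "endorse -R"
    return "mixed"


for text, e in STATEMENTS.items():
    print(f"{text!r}: {classify(e)}")
```

The point of keeping two separate weights is that "not caring" cannot be written as a single signed number on R: it pushes against both R and -R at once.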

But I'll work on it more.

[-]Kaj_Sotala8y40

Thanks!

>not all the options I chose have deep philosophical justifications.

Just to be clear, when I said that each section would be served by having a philosophical justification, I don't mean that it would necessarily need to be super-deep; just something like "this seems to make sense because X", which e.g. sections 2.4 and 2.5 already have.

[-]habryka8y50

Reason why the LaTeX is breaking: we parse each LaTeX block separately, and sometimes out of order (for performance reasons). This means you can't use "newcommand" in one LaTeX block and expect it to work in later LaTeX blocks. The editor loads all LaTeX simultaneously, so you won't run into this problem in the editor, but we will run into it when we try to render the LaTeX for other users.

If you want to make sure your LaTeX works, you want to avoid using "newcommand", or redefine the command at the top of the relevant LaTeX blocks.

[-]Stuart_Armstrong8y70

...that makes newcommand almost useless (though it's always worked previously for me). And in some cases, it was things like \Theta that were not being rendered!

Hum. Any way of getting round this? Is there a way of editing the whole post as text (since then I can run a substitution on the text, replacing all the commands with their full version)?
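A substitution of that kind could look roughly like the sketch below. It assumes the post source is available as plain text, and it only handles zero-argument `\newcommand{\name}{body}` definitions with at most one level of nested braces in the body; `expand_macros` is a hypothetical helper, not a site feature. Macros whose bodies use other macros are not resolved recursively.

```python
import re

# Matches \newcommand{\name}{body}, where body may contain one level of {...}.
DEFINITION = re.compile(
    r"\\newcommand\{(\\[A-Za-z]+)\}\{((?:[^{}]|\{[^{}]*\})*)\}"
)


def expand_macros(text: str) -> str:
    """Remove zero-argument \\newcommand definitions and inline their bodies."""
    macros = dict(DEFINITION.findall(text))
    text = DEFINITION.sub("", text)
    for name, body in macros.items():
        # The lookahead stops \R from matching inside \Rightarrow.
        pattern = re.compile(re.escape(name) + r"(?![A-Za-z])")
        # A function replacement keeps backslashes in the body literal.
        text = pattern.sub(lambda match, b=body: b, text)
    return text


source = r"\newcommand{\R}{\mathcal{R}}$\R + \Rightarrow$"
print(expand_macros(source))
```

Whether this removes the rendering problem depends on the cause actually being the cross-block definitions, which the thread has not yet established.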

But thanks for figuring it out!

[-]habryka8y20

Huh, then maybe it's something else. Do you have a post in which \newcommand worked fine?

[-]Stuart_Armstrong8y20

See eg https://www.lesswrong.com/posts/dFyqTAyG2oCSr6S4K/intuitive-examples-of-reward-function-learning

In this current post, where things stopped working, it seemed that the number of LaTeX formulas was relevant? If I added any more LaTeX boxes, no matter how simple, it would fail?

[-]Gordon Seidoh Worley8y30

I really like this in that it's approaching an issue I view as currently neglected within AI safety research: how to determine the human values to be learned. Like Kaj, I find it a bit hard to engage with specific issues well enough to give feedback, but I look forward to where this goes since I expect us to eventually need more formal approaches to axiology, even if they are only "adequate".

[-]Dacyn8y30

Some typos:

"any reward's self-endorsement" -> "any value's self-endorsement"
"in favour of if" -> "in favour of it"
"denominated it days" -> "denominated in days"
"many possible future" -> "many possible futures"
"The more weight it given" -> "The more weight is given"
"fomalise" -> "formalise"

[-]mako yass8y30

Additional typo/request for clarification; is w supposed to be v' ?

Object level values are those which are non-zero only on rewards; ie the v∈V for which θ(v)(w)=0 for all v′∈V

[-]Stuart_Armstrong8y20

Thanks, now corrected.

[-]Stuart_Armstrong8y20

Now corrected, thanks.

[-]Stuart_Armstrong8y20

Thanks! Will correct once I have a decent connection.

[-]Charlie Steiner4y*40

More interesting post to me now than it was to past-me :) Thanks from the future. Anyhow, typos for the typo god:

"doing to little"->"doing too little"

Also the second link in "many ways" is broken now, I think it was probably to https://www.semanticscholar.org/paper/The-emotional-dog-and-its-rational-tail%3A-a-social-Haidt/b74e8da297574fd071d4b48b7aa94ea16861aea6 ?

[-]Stuart_Armstrong4y20

Thanks! Glad you got good stuff out of it.

I won't edit the post, due to markdown and latex issues, but thanks for pointing out the typos.

[-]William_S7y20

Glad to see this work on possible structure for representing human values which can include disagreement between values and structured biases.

I had some half-formed ideas vaguely related to this, which I think map onto an alternative way to resolve self reference.

Rather than just having one level of values that can refer to other values on the same level (which potentially leads to a self-reference cycle), you could instead explicitly represent each level of value, with level 0 values referring to concrete reward functions, level 1 values endorsing or negatively endorsing level 0 values, and generally level n values only endorsing or negatively endorsing level n-1 values. This might mean that you have some kinds of values that end up being duplicated between multiple levels. For any n, there's a unique solution to the level of endorsement for every concrete value. We can then consider the limit as n->infinity as the true level of endorsement. This allows for situations where the limit fails to converge (ie. it alternates between different values at odd and even levels), which seems like a way to handle self reference contradictions (possibly also the all-or-nothing problem if it results from a conflict between meta-levels).

I think this maps into the case where we don't distinguish between value levels if we define a function that just adjusts the endorsement of each value by the values that directly refer to it. Then iterating this function n times gives the equivalent of having an n-level meta-hierarchy.
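As a rough sketch of that iteration: the update rule below (base weight plus the weighted endorsements from values that refer to it) and the names `step`, `iterate`, `base` and `theta` are assumptions chosen for illustration, since the comment does not pin down the exact adjustment function.

```python
from typing import List


def step(weights: List[float], base: List[float],
         theta: List[List[float]]) -> List[float]:
    """One meta-level: theta[i][j] is how strongly value i endorses value j."""
    n = len(base)
    return [
        base[j] + sum(weights[i] * theta[i][j] for i in range(n))
        for j in range(n)
    ]


def iterate(base: List[float], theta: List[List[float]],
            n_levels: int) -> List[List[float]]:
    """Return the weights at levels 0..n_levels (level 0 is the base weights)."""
    weights = list(base)
    history = [weights]
    for _ in range(n_levels):
        weights = step(weights, base, theta)
        history.append(weights)
    return history


# Value 1 endorses value 0 with weight 0.5: settles after one level.
print(iterate([1.0, 1.0], [[0.0, 0.0], [0.5, 0.0]], 3))

# A single value that negatively endorses itself: odd and even levels differ.
print(iterate([1.0], [[-1.0]], 4))
```

In the second example the weights alternate between 1.0 and 0.0, which is the kind of non-convergent limit that would flag a self-reference contradiction.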

I think there might be interesting work in mapping this strategy into some simple value problem, and then trying to perform Bayesian value learning in that setting with some reasonable prior over values/value endorsements.

[-]avturchin8y20

Imagine a person, simple John, who doesn't have values, but whose behaviour is controlled by a random generator. For example, John randomly chooses between sleeping, collecting flowers and killing cats. However, neither John nor anyone outside knows that he has no values.

The question is: will the outside observer (a human, or the AI described in your post) recognise that John has no values, or will it construct some model of John's values to better explain and predict John's behaviour? It would be like seeing patterns in random noise.

My example may be applicable to a lot of actual human behaviour which is not controlled by values, and is either random or automatic like a reflex.

Is your theory able to correctly recognise such random behaviour and avoid producing a model of non-existent values?

[-]Stuart_Armstrong8y20

Because of the no-free-lunch theorem in value learning, you can't say anything at all about the values of an irrational agent without making assumptions.

In practice, these assumptions are the common ones shared by most humans. So the AI would necessarily project human properties on John.

But John doesn't behave anything like a proper human, so the AI would have difficulty interpreting him, and would probably default to something like "crazy basic human".

[-]avturchin8y10

Does that mean the AI has to have a model of the human mind and its typical values before reading John's values?

[-]Stuart_Armstrong8y30

The AI has to have a model of how humans distinguish preferences from irrationality; see eg https://www.lesserwrong.com/posts/pQz97SLCRMwHs6BzF/using-lying-to-detect-human-values

https://www.lesserwrong.com/posts/kmLP3bTnBhc22DnqY/beyond-algorithmic-equivalence-self-modelling

and some of the links within those links.

[-]avturchin8y10

Thanks for the links. By the way, do we have a definition of "human value" about which we agree?

[-]Stuart_Armstrong8y20

>do we have a definition of "human value" about which we agree?

Of course not; that would make things far too easy! :-)

Though in https://www.lesswrong.com/posts/weHuX2qkTxgAXBw8t/defining-the-ways-human-values-are-messy , I define human values as preferences (which is a lot clearer), with the distinction between values and more normal preferences being due to a human meta-preference.

[-]avturchin8y10

Ok, what about preferences? Is it correct to call the preference "a probability distribution of expected human choices"? For example, my preference is 70 percent to take coffee and 30 percent to take tea at breakfast.

[-]Stuart_Armstrong8y20

>Is it correct to call the preference "a probability distribution of expected human choices"

No, because the assumption of irrationality means that preferences don't match up with choices. Preferences are rankings of possible worlds/rewards/outcomes on an ordinal and cardinal scale. The challenge is to infer these preferences from human behaviour.

[-]avturchin8y10

If preferences will be equal to choices, then predicting preferences would just be predicting future choices, which may be a relatively simple task of extrapolating past behaviour, and it could be done without assuming the existence of two parts of the human mind: constant preferences and noise.

[-]Stuart_Armstrong8y20

>If preferences will be equal to choices

Unless you are arguing that humans are fully rational in every decision they ever make, this is not the case.

[-]avturchin8y10

Yes, but this happens only because of the way we define preferences, imho. We define preferences as the purely rational part, then compare this definition with actual humans, and see that there is also another, irrational part.

Example: in the same way we could say that every human being is six feet tall, plus or minus some noise variable. This may be a useful way to describe humans, but it has obvious limitations.

What I suggest is to look at why we decided that humans have values or preferences at all. It is an idea which appeared somewhere in 20th-century psychology or philosophy, and it is only one of several ways to describe human behaviour.

[-]Stuart_Armstrong8y20

I want to construct/extract/extrapolate/define human preferences (or make a human reward/utility function), in order to have something we can give AI as a goal. Whether we count this as defining or extrapolating doesn't really matter; it's the result that's important.

One of the things that gives me hope is that actual humans overlap considerably in their judgement of what is rational and irrational. Almost everyone agrees that the anchoring bias is a bias, not a preference; almost everyone agrees that people are less rational when drunk (with the caveat that drunkenness can also suppress certain other irrationalities, like social phobia - but again, that more complicated story is also something that people tend to agree on).

And values, and debates over values, date back at least to tribal times; dehumanising foreigners was based a lot around their strange values and untrustworthiness.

[-]avturchin8y40

I understand it and I think it is an important project.

I will try to write something in the next couple of months where I will check another approach: is it possible to describe positive AI-human relations without extracting or extrapolating values at all? For now I have a gut feeling that it could be an interesting point of view, but I am not ready to formalize it.

[-]Stuart_Armstrong8y40

Good luck with that! I'm skeptical of that approach, but it would be lovely if it could be worked out...

Adequate versus elegant

Basic framework, then modifications

1 Terminology and basic concepts

1.1 The role of the AI

2 The basic framework

2.1 Contradictory values

2.2 Unendorsing rewards

2.3 Underdefined rewards

2.4 Moral errors and moral learning

2.5 Automated philosophy and CEV

2.6 Meta-values

3. The "wrong" $Θ$ : meta-values for the resolution process

4 Problems with self-referential $Θ$

4.1 All-or-nothing values, and personal identity

4.2 You're not the boss of me!

5 Conclusion: much more work