AI Alignment Problem: “Human Values” don’t Actually Exist

[-]Gunnar_Zarncke3y70

Late comment but I recently posted how human values arise naturally by the brain learning to keep its body healthy in the ancestral environment by a process that could be simplified like this:

First, the brain learns how the body functions. The brain then figures out that the body works better if senses and reflexes are coordinated. Noticing patterns and successful movement and action feels good.
Then the brain discovers the abstraction of interests and desires and that the body works better (gets the nutrients and rest that it needs) if interests and desires are followed. Following your wants feels rewarding.
Then the brain notices personal relationships and that interests and wants are better satisfied if relationships are cultivated (the win-win from cooperation). Having a good relationship feels good, and the thought of the loss of a relationship feels painful.
The brain then discovers the commonalities of expectations within groups - group norms and values - and that relationships are easier to maintain and have less conflict if a stable and predictable identity is presented to other people. Adhering to group norms and having stable values feels rewarding.

These natural learning processes are supported by language and culture: naming and suggesting behaviors makes some variants more salient and thus more likely to arise - but humans would pick up on the principles even without a pre-existing society - and that is what actually happens in certain randomly assembled societies.

https://www.lesswrong.com/posts/Di4bFP7kjoLEQLpQd/what-s-the-relationship-between-human-values-and-the-brain-s?commentId=Qh3MsAvqFemXMcYcz

[-]avturchin3y40

This describes the convergent value system of any mind, not only a human one. So there is nothing specifically human in it.

[-]Gunnar_Zarncke3y20

Correct.

The human aspect results from

the structure of the needs of the body and its low-level regulation (food, temperature, but also reproductive drives), and
the structure of the environment - how many other humans there are, how and where resources can be acquired.

[-]Raemon7y70

Most of this didn't seem new to my thinking, but I appreciated this post as a comprehensive writeup of the various issues here.

(This post also motivates me to work on a Table of Contents view that is more optimized as a primary reading experience. Because most of the points were things I'd heard before, I found myself preferring to skim the ToC and then click to zoom into particular arguments that seemed new or interesting.)

[-]avturchin7y30

I got the idea of the Table of Contents as a primary reading experience from Drexler's CAIS, where each subsection's name is a short sentence with a statement, like "I.6 The R&D automation model distinguishes development from functionality."

[-]moridinamael7y60

The idea of AI alignment is based on the idea that there is a finite, stable set of data about a person, which could be used to predict one’s choices, and which is actually morally good. The reasoning behind this basis is because if it is not true, then learning is impossible, useless, or will not converge.

Is it true that these assumptions are required for AI alignment?

I don't think it would be impossible to build an AI that is sufficiently aligned to know that, at pretty much any given moment, I don't want to be spontaneously injured, or be accused of doing something that will reliably cause all my peers to hate me, or for a loved one to die. There's quite a broad list of "easy" specific "alignment questions", that virtually 100% of humans will agree on in virtually 100% of circumstances. We could do worse than just building the partially-aligned AI who just makes sure we avoid fates worse than death, individually and collectively.

On the other hand, I agree completely that coupling the concepts of "AI alignment" and "optimization" seems pretty fraught. I've wondered if the "optimal" environment for the human animal might be a re-creation of the Pleistocene, except with, y'know, immortality, and carefully managed, exciting-but-not-harrowing levels of resource scarcity.

[-]avturchin7y40

There are some difficulties in creating a full and safe list of such human preferences, and there was an idea that AI would be able to learn actual human preferences by observing human behaviour or by other means, like inverse reinforcement learning.

This post of mine basically shows that value learning will also run into trouble, as there are no real human values, so some other way of creating such a list of preferences is needed.

How to align the AI with existing preferences, expressed in human language, is another question. Yudkowsky wrote that without taking into account the complexity of value, we can't make safe AI, as it would wrongly interpret short commands without knowing the context.

[-]Stuart_Armstrong6y40

Hey there!

Wondering how you felt my research agenda addressed, or failed to address, many of these points: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/research-agenda-v0-9-synthesising-a-human-s-preferences-into

I have my own opinions on these, but interested in yours.

[-]avturchin6y10

In short, I am impressed, but not convinced :)

One problem I see is that all information about human psychology should be more explicitly taken into account as an independent input to the model. For example, if we take a model M1 of the human mind, in which there are two parts, the conscious and the unconscious, both of which are centered around mental models with partial preferences - we will get something like your theory. However, there could be another theory M2, well supported by the psychological literature, with 3 internal parts (e.g. Libido, Ego, SuperEgo). I am not arguing that M2 is better than M1. I am arguing that M should be treated as an independent variable (and supported by extensive links to actual psychological and neuroscience research for each M).

In other words, as soon as we define human values as some theory V (there are around 20 theories about V among AI safety researchers alone, which I have in a list), we could create an AI which will learn V. However, the internal consistency of a theory V is not evidence that it is actually good, as other theories about V are also internally consistent. Some way of testing is needed, maybe in the form of a game which a human could play, so we could check what could go wrong - but to play such a game, the preference learning method should be specified in more detail.

While reading, I was expecting to get more on the procedure for learning partial preferences. However, it was not explained in detail; it was only mentioned (as I remember) that a future AI will be able to learn partial preferences by some deep scan methods. But that is too advanced a method of value learning to be safe. With it, we have to give the AI very dangerous capabilities, like nanotech for brain reading, before it learns human values. So the AI could start acting dangerously before it learns all these partial preferences. Other methods of value learning are safer: for example, an analysis of previously written human literature by some ML, which would extract human norms from it. Probably, something like word2vec could do it even now.

Now, it may turn out that I don't need the AI to know my whole utility function; I just want it to obey human norms plus do what I said. "Just bring me tea, without killing my cat and tiling the universe with teapots." :)

Another thing which worries me about a personal utility function is that it could be simultaneously fragile (in time), grotesque and underdefined - at least based on my self-observation. Thus again I would prefer collectively codified human norms (laws) over an extrapolated model of my utility function.

[-]Stuart_Armstrong6y30

Thanks! For the M1 vs M2, I agree these could reach different outcomes - but would either one be dramatically wrong? There are many "free variables" in the process, aiming to be ok.

I'll work on learning partial preferences.

"Just bring me tea, without killing my cat and tiling the universe with teapots." [...] and underdefined - at least based on my self-observation. Thus again I would prefer collectively codified human norms (laws) over an extrapolated model of my utility function.

It might be underdefined in some sort of general sense - I understand the feeling, I sometimes get it too. But in practice, it seems like it should ground out to "obey human orders about tea, or do something that is strongly preferred to that by the human". Humans like their orders being obeyed, and presumably like getting what they're ordering for; so to disobey that, you'd need to be very sure that there's a clearly better option for the human.

Of course, it might end up having a sexy server serve pleasantly drugged tea ^_^

[-]avturchin6y30

One more thing: your model assumes that mental models of situations actually preexist. However, imagine a preference between tea and coffee. Before I was asked, I didn't have any model and didn't have any preference. So I will generate some random model, like a large coffee and a small tea, and then make a choice. However, the mental model I generate depends on the framing of the question.

In some sense, here we are passing the buck of complexity from "values" to "mental models", which are assumed to be stable and actually existing entities. However, we still don't know what a separate "mental model" is, where it is located in the brain, or how it is actually encoded in neurons.

[-]Stuart_Armstrong6y20

The human might have some taste preferences that will determine between tea and coffee, general hedonism preferences that might also work, and meta-preferences about how they should deal with future choices.

Part of the research agenda - "grounding symbols" - is about trying to determine where these models are located.

[-]habryka7y40

There is a duplicate "Religion" heading in section 3.2, and a missing heading with the number 5

[-]avturchin7y10

Thanks, will correct in my working draft.

[-]Jay Molstad7y4-3

There's an existentialist saying "existence precedes essence". In other words, we aren't here doing things because we have some values that we need to pursue by existing. We're here because our parents got horny. We create values for ourselves to give us (the illusion of) purpose and make our existence bearable, but the values are post-hoc inventions. As such, they are not more consistent than we make them, nor more binding than we choose to let them be.

[-]Gordon Seidoh Worley6y50

Your point runs even deeper than you suggest (to my reading). We can read "existence precedes essence" as Sartre's take on Heidegger's "back to the things themselves", i.e. to put the ontic before the ontological, or noumena before phenomena. You suggest a teleological approach to essence, that we create values and other forms of understanding to make sense of the senseless because we need sense to make life bearable (that's our telos or purpose here: coping with the lack of extrinsic meaning), but the point holds even if we consider it non-teleologically: all of our understanding is post hoc, it always comes after the thing itself, and our sense of values is consequently not whatever the thing is that causes us to act such that we can interpret our observations of those actions as if they were essential values, but instead a posterior pattern matching to that which happened that we call "values" (caveat of course being that the very act of understanding is embodied and so feeds back into the real thing itself, only without necessarily fitting our understanding of it).

[-]Charlie Steiner7y40

Thanks for this! Definitely some themes that are in the zeitgeist right now for whatever reason.

One thing I'll have to think about more is the idea of natural limits (e.g. the human stomach's capacity for tasty food) as a critical part of "human values," that keeps them from exhibiting abstractly bad properties like monomania. At first glance one might think of this as an argument for taking abstract properties (meta-values) seriously, or taking actual human behavior (which automatically includes physical constraints) seriously, but it might also be regarded as an example of where human values are indeterminate when we go outside the everyday regime. If someone wants to get surgery to make their stomach 1000x bigger (or whatever), and this changes the abstract properties of their behavior, maybe we shouldn't forbid this a priori.

[-]Teo Ajantaival7y40

● Humans do not have one terminal value (unless they are mentally ill).

Why though?

I don't see any other way to (ultimate) alignment/harmony/unification between (nor within) minds than to use a single terminal value-grounded currency for resolving all conflicts.

For as soon as we weigh two terminal values against each other, we are evaluating them through a shared dimension (e.g., force or mass in the case of a literal scale as the comparator), and are thus logically forced to accept that either one of the terminal values (or its motivating power) could be translated into the other, or that there was this third terminal {value/motivation/tension} for which the others are tools.

Do you suggest getting rid of the idea of terminal value(s) altogether, or could you explain how we can resolve conflicts between two terminal values, if terminal means irreducible?

(To the extent that I think in terminal and instrumental values, I claim to care terminally only about suffering. I also claim to not be mentally ill. A lot of Buddhists etc. might make similar claims, and I feel like the statement above quoted from the Conclusion without more context would label a lot of people either mentally ill or not human, while to me the process of healthy unification feels like precisely the process of becoming a terminal value monist. :-))

[-]Kaj_Sotala7y80

could you explain how we can resolve conflicts between two terminal values, if terminal means irreducible?

Suppose the following mind architecture:

When in a normal state, the mind desires games.
When the body reports low blood sugar levels, the mind desires food.
When in danger, the mind desires running away.
When in danger AND with low blood sugar levels, the mind desires freezing up.

Something like this has a system of resolving conflicts between terminal values: different terminal values are swapped in as the situation warrants. But although there is an evolutionary logic to them - their relative weights are drawn from the kind of a distribution which was useful for survival on average - the conflict-resolution system is not explicitly optimizing for any common currency, not even survival. There just happens to be a hodgepodge of situational variables and processes which end up resolving different conflicts in different ways.
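A minimal sketch of the four-rule architecture above, in Python. The names `Situation`, `DESIRE_BY_SITUATION` and `current_desire` are illustrative assumptions and do not come from the original comment. The point it tries to show is that the resolver is a plain situational lookup: it never assigns the desires scores on a shared scale and never compares them.

```python
from typing import NamedTuple


class Situation(NamedTuple):
    """Hypothetical situational variables the mind reads off."""
    in_danger: bool
    low_blood_sugar: bool


# Hypothetical lookup table: each situation maps directly to the desire
# that gets "swapped in". No desire is scored against any other.
DESIRE_BY_SITUATION = {
    Situation(in_danger=False, low_blood_sugar=False): "games",
    Situation(in_danger=False, low_blood_sugar=True): "food",
    Situation(in_danger=True, low_blood_sugar=False): "running away",
    Situation(in_danger=True, low_blood_sugar=True): "freezing up",
}


def current_desire(situation: Situation) -> str:
    """Return the desire active in this situation by direct lookup."""
    return DESIRE_BY_SITUATION[situation]


if __name__ == "__main__":
    for situation, desire in DESIRE_BY_SITUATION.items():
        print(situation, "->", desire)
```

Note that when both danger and low blood sugar are present, the result is neither "food" nor "running away" but a fourth desire, which is simply another table entry rather than the outcome of weighing the first two.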

I presented a more complex model of something like this in "Subagents, akrasia and coherence in humans" - there I did say that the subagents are optimizing for an implicit utility function, but the values for that utility function come from cultural and evolution-historical weights so it still doesn't have any consistent "common currency".

Often minds seem to end up at states where something like a particular set of goals or subagents ends up dominating, because those are the ones which have managed to accumulate the most power within the mind-system. This does not look like some of them became the most powerful through something like an appeal to shared values, but rather through just the details of how that person's life-history, their personal neurobiological makeup, etc. happen to be set up and which kinds of neurological processes those details have happened to favor.

Similarly, governments repeat the same pattern at the interpersonal level - value conflicts are not resolved through being weighted in terms of some higher-level value. Rather, they are determined through a complex process where a lot of contingent details, such as a country's parliamentary procedures, cultural traditions, voting systems etc., have a big influence on shaping which way the chips happen to fall WRT any given decision.

[-]avturchin7y50

It occurred to me that, for a human being, there is no way not to make a choice between different preferences: in any next moment of time I do something, even continue to think or indulge in procrastination. I either eat, or run, so the conflict is always resolved.

However, an interesting thing is that sometimes a person tries to do two things simultaneously, for example, when the content of their speech and their tone do not match. It has happened to me - and I had to explain that only the content matters, and the tone should be ignored.

[-]Gordon Seidoh Worley6y30

It occurred to me that, for a human being, there is no way not to make a choice between different preferences: in any next moment of time I do something, even continue to think or indulge in procrastination. I either eat, or run, so the conflict is always resolved.

This matches an exercise you may be asked to do as part of Buddhist training towards enlightenment. During meditation, get your attention focused on itself, then try to do something other than what you would do. If you have enough introspective access, you'll get an experience of being unable to do anything other than exactly what you do: you get first-hand experience with determinism at a level that bypasses the process that creates the illusion of free will. So not only can you only ever do the one thing you actually do (for some reasonable definition of what "one thing" is here), you can't ever do anything other than the one thing you end up doing, viz. there was no way any counterfactual was ever going to be realized.

[-]avturchin7y30

A good description of why any single value may not be good is in https://www.academia.edu/173502/A_plurality_of_values

I am sure you have more than one value - for example, the best way to prevent even the slightest possibility of suffering is suicide, but as you are alive, you care about being alive. Moreover, I think that claims about values are not values - they are just good claims.

The real cases of a "one-value person" are maniacs: the human version of a paperclipper. Typical examples of such maniacs are people obsessed with sex, money, or collecting some random things; also drug addicts. Some of them are psychopaths: they look normal and are very effective, but do everything just for one goal.

Thanks for your comment - I will update the conclusion, so the bullet points will be linked to the parts of the text which explain them.

[-]Teo Ajantaival7y00

Terminal value monism is possible with impersonal compassion as the common motivation to resolve all conflicts. This means that every thus aligned small self lives primarily to prevent hellish states wherever they may arise, and that personal euthanasia is never a primary option, especially considering that survivors of suffering may later be in a good position to understand and help it in others (as well as contributing themselves as examples for our collective wisdom of life narratives that do/don't get stuck in hellish ways).

[-]avturchin7y30

Terminal value monism may be possible as a pure philosophical model, but real biological humans have more complex motivational systems.

[-]Gordon Seidoh Worley6y20

Are you speaking from personal experience here, Teo? This seems like a plausible interpretation of self experience under certain conditions based on your mention of "impersonal compassion" (I'm being vague to avoid biasing your response), but it's also contradictory to what we theorize to be possible based on the biological constructs on which the mind is manifested. I'm curious because it might point to a way to better understand the different viewpoints in this thread.

[-]Gordon Seidoh Worley6y30

So, finally caught up to reading this, and wow, this is great! This is exactly the kind of look at values I've been looking for over the last couple months, and it was sitting here the whole time. You do a great job of capturing the many ways in which I think we should be confused about values, and lay them all out so we can acknowledge each aspect. I think grappling with this confusion is key if we are to develop better models, because often from the confusion we can find our way through to better understanding when the confusion causes our conceptions to unravel enough that we can look fresh and start building up our understanding again.

[-]avturchin6y30

Thanks!

[-]romeostevensit7y*10

Guess: human values reflect beliefs about the modularity of reality. A necessary component of the counterfactual simulator.

The counterfactual simulator, in turn, seems to be about convex optimization of tradeoff space.

[-]avturchin7y10

Before "being something", values need to actually exist as some kind of object. A non-existent object can't have properties. For example, the Sun exists, and thus we can discuss its mass. Zeus doesn't exist, which makes any discussion of his mass futile.

Introduction

1. Ontological status and sources of human values

1.1. AI alignment requires an actually existing, stable, finite set of predicting data about peoples’ motivation, which is called “human values”

1.2. Human values do not actually exist; they are only useful descriptions of human behavior and rationalization

1.3. Five sources of information of human values: verbalizations, thoughts, emotions, behavior and neurological scans

1.4. Where do human values come from: evolution, culture, ideologies, logic and personal events

2. Critics of the idea of human values as a constant set of personal preferences: it is based on many assumptions

2.1. Human preferences are not constant

2.2. Human choices are not defined by human values

1. Situation

2. Internal conflicts

3. Emotional affect

4. Peer pressure

5. Random processes in the brain

6. Conditional and unconditional reflexes

7. Somnambulism, the bisected brain, and actions under hypnosis

8. Actions under influence of drugs; demented people and children

2.3. “Human values” can’t be easily separated from biases

2.4. Human values are subject-centered and can’t be separated from the person

2.5. Many human values are not actionable + the hidden complexity of values

2.6. Open question: is there any relation between values, consciousness and qualia?

2.7 Human values as a transient phenomenon: my values are not mine

3. The idea of “human values” maybe not as useful concept as it looks like for AGI Safety

3.1. Human values are not safe if scaled, extracted from a human or separated

3.2. It is wrong to think of values as of property of a single human: values are social phenomena

1. Not human have values, but values have humans

2. Values as memes, used for group building

3. Ideologies

4. Religion

4. Religion

6. Hypnosis and unconscious learning

3.3. Humans don’t “have” values; they are vessels for values, full of different subpersonalities

1. Values are changeable but identity is preserved

2. Subpersonalities

3. The infinite diversity of human values

4. Normative plurality of values

3.4. Values, choices, and commands

Difference between values and desires

3.5. Meta-values: preference about values

1. Meta-values as morals

2. Suppressed values

3. Subconscious values and sub-personalities

4. Zero behavior and contradicting values

5. Preference about other’s preferences

3.6. Human values cannot be separated from the human mind

1. Values are not encoded separately in the brain

2. Human values are dispersed inside “training data” and “trained neural nets”

3. Amorphous structure of human internal processes and false positives in finding internal parts

3.7. Human values presentation is biased towards presenting socially accepted claims

3.8. Human values could be manipulated by the ways and order they are extracted

3.9 Human values are, in fact, non-human

Viruses

3.10. Any human value model has not only epistemological assumptions, but also has axiological (normative) assumptions

3.11. Values may be not the best route for simple and effective descriptions of human motivation

3.12. Who are real humans in “human values”?

3.13. The human reward function is not “human values”

3.14. Difficult cases for value learning: enlightenment, art, religion, homosexuality and psi

3.15. Human values excluding each other and Categorical imperative as a meta-value

Conclusion

Acknowledgments

Appendix. Table of assumptions in the idea of human values