I don't know what my values are. I don't even know how to find out what my values are. But do I know something about how I (or an FAI) may be able to find out what my values are? Perhaps... and I've organized my answer to this question in the form of an "Outline of Possible Sources of Values". I hope it also serves as a summary of the major open problems in this area.

  1. External
    1. god(s)
    2. other humans
    3. other agents
  2. Behavioral
    1. actual (historical/observed) behavior
    2. counterfactual (simulated/predicted) behavior
  3. Subconscious Cognition
    1. model-based decision making
      1. ontology
      2. heuristics for extrapolating/updating model
      3. (partial) utility function
    2. model-free decision making
      1. identity based (adopt a social role like "environmentalist" or "academic" and emulate an appropriate role model, actual or idealized)
      2. habits
      3. reinforcement based
  4. Conscious Cognition
    1. decision making using explicit verbal and/or quantitative reasoning
      1. consequentialist (similar to model-based above, but using explicit reasoning)
      2. deontological
      3. virtue ethical
      4. identity based
    2. reasoning about terminal goals/values/preferences/moral principles
      1. responses (changes in state) to moral arguments (possibly context dependent)
      2. distributions of autonomously generated moral arguments (possibly context dependent)
      3. logical structure (if any) of moral reasoning
    3. object-level intuitions/judgments
      1. about what one should do in particular ethical situations
      2. about the desirabilities of particular outcomes
      3. about moral principles
    4. meta-level intuitions/judgments
      1. about the nature of morality
      2. about the complexity of values
      3. about what the valid sources of values are
      4. about what constitutes correct moral reasoning
      5. about how to explicitly/formally/effectively represent values (utility function, multiple utility functions, deontological rules, or something else) (if utility function(s), for what decision theory and ontology?)
      6. about how to extract/translate/combine sources of values into a representation of values
        1. how to solve ontological crisis
        2. how to deal with native utility function or revealed preferences being partial
        3. how to translate non-consequentialist sources of values into utility function(s)
        4. how to deal with moral principles being vague and incomplete
        5. how to deal with conflicts between different sources of values
        6. how to deal with lack of certainty in one's intuitions/judgments
      7. whose intuition/judgment ought to be applied? (may be different for each of the above)
        1. the subject's (at what point in time? current intuitions, eventual judgments, or something in between?)
        2. the FAI designers'
        3. the FAI's own philosophical conclusions

Using this outline, we can obtain a concise understanding of what many metaethical theories and FAI proposals are claiming/suggesting and how they differ from each other. For example, Nyan_Sandwich's "morality is awesome" thesis can be interpreted as the claim that the most important source of values is our intuitions about the desirability (awesomeness) of particular outcomes.

As another example, Aaron Swartz argued against "reflective equilibrium" by which he meant the claim that the valid sources of values are our object-level moral intuitions, and that correct moral reasoning consists of working back and forth between these intuitions until they reach coherence. His own position was that intuitions about moral principles are the only valid source of values and we should discount our intuitions about particular ethical situations.

A final example is Paul Christiano's "Indirect Normativity" proposal (n.b., "Indirect Normativity" was originally coined by Nick Bostrom to refer to an entire class of designs where the AI's values are defined "indirectly") for FAI, where an important source of values is the distribution of moral arguments the subject is likely to generate in a particular simulated environment and their responses to those arguments. Also, just about every meta-level question is left for the (simulated) subject to answer, except for the decision theory and ontology of the utility function that their values must finally be encoded in, which is fixed by the FAI designer.

I think the outline includes most of the ideas brought up in past LW discussions, or in moral philosophies that I'm familiar with. Please let me know if I left out anything important.

Jack (11y)

There is ambiguity in "source of values" which this outline elides. There is source as "method of discovery" and source as "kind of fact that determines the truth value of normative statements" and source as "causal origin of value-based thinking and language". And there is no reason why the "source of values" should be the same for each of those readings. Though obviously they could be.

So under "conscious cognition": traditional Kantian deontology is discovered by conscious cognition, but the facts that make it true are supposed to be external to human minds (purportedly they exist in reason itself, like mathematical truths). But there are definitely also people who hold the view that what makes a decision a moral one is just that it was arrived at by properly and rationally reflecting on values.

I don't think any position can be accurately categorized without understanding how that position answers each of the above readings of "source". So for instance, my own position would be that the causal origin of values is a synthesis of culturally constructed intuitions and innate, evolved intuitions on which we perform conscious cognitive operations (which are themselves a synthesis of innate abilities and technologies). So a mix between conscious and unconscious cognition, other humans, and some degree of reason. I don't think there are any facts that make value language true, and method of discovery is just what we think we're doing with the above. It's then helpful to ask something like: well, how do you want a superintelligence to decide what to do? The answer to which somehow involves recursing the causal source of value judgments upon itself. That is, it's the same as asking "How ought a superintelligence behave?", which is just an instance of a value judgment. But obviously it is really difficult to a) precisely define that process and b) determine how exactly that recursion ought to happen.

You make a good point. Thanks.

I thought "indirect normativity" was a general term due to Nick Bostrom meant to cover e.g. CEV among other proposals. Could be wrong.

Yeah, it is a general term due to Nick, but Paul also used it to title his proposal, so I wasn't sure what else to call Paul's specific proposal. Maybe I'll just add a parenthetical clarification. ETA: Clarification added to OP.

Aaron Swartz argued against "reflective equilibrium" by which he meant the claim that the valid sources of values are our object-level moral intuitions, and that correct moral reasoning consists of working back and forth between these intuitions until they reach coherence. His own position was that intuitions about moral principles are the only valid source of values and we should discount our intuitions about particular ethical situations.

Usually in philosophy, "reflective equilibrium" means wide reflective equilibrium, unless "narrow" is specified. Wide reflective equilibrium takes both object-level moral intuitions and intuitions about principles into account, as well as more meta-ethical or general philosophical considerations. Stanford Encyclopedia has a good article on it.

I find the title a bit confusing. To me it seems a better one would be "Outline of Possible Sources of Knowledge of Values." Or am I misunderstanding you?


For example, Nyan_Sandwich's "morality is awesome" thesis can be interpreted as the claim that the most important source of values is our intuitions about the desirability (awesomeness) of particular outcomes.

Sort of. More like the most usable currently available source, rather than the most important.

While the outline is nice, I think this is the wrong place to start. Instead, we should start by answering some more basic questions (not necessarily just these, all of these, or in this order).

What do we plan to do with the concept of "values"? Keeping the previous question in mind, what are values, and do we intend to use the same concept for AI and human "values"? Do humans, in fact, have values? If they do, are these values internally consistent, or do they show order-dependence (or some other "problem") somewhere? Are they consistent across time and inputs in some respect?

While these questions are also hard to answer, there's actually a good chance they have answers multiple people can agree on, and answering them should hopefully give us the ability to answer your original questions as well.

Only read "External" so far, but I propose god(s) be divided into "trusted and idealized authority figures", "internalized sense of commitment to integrity of respected and admirable reputation (honor)", and "external personification of inner conscience".

If people cite God as the source of spiritual value, it's because he represents a combination of these things and the belief that their values are ingrained in reality. God isn't the root cause, and taking Him out of the equation still leaves the relevant feelings and commitment.

Also, "other humans" isn't relevantly different from "other agents".

Also, also, I'm not entirely clear on the point of this post (probably should've brought that up before correcting you, really). Are you citing actual sources of value, or the things people sometimes believe are the sources of value, whether or not they're correct? Value is necessarily formed from concepts in the mind, so the brain can be assumed to be the thing most usefully termed the origin.

Also, also, also, when you say "value" do you just mean moral value, or things people care about on the whole?

It may be interesting to note that this outline implies that when we discuss questions like "What's your utility function?" or "Do humans have utility functions?" we should be careful to distinguish what kind of utility function we are talking about. Examples:

  • a utility function that represents my revealed preferences
  • the utility function implied by my consequentialist moral principles
  • the utility function that corresponds to my intuitions about the desirabilities of various specific outcomes
  • the utility function I actually use when I engage in explicit consequentialist reasoning
  • the utility function I actually use when I engage in subconscious model-based decision making
  • the utility function I would eventually decide upon if I thought about it for a long time
  • the utility function that best captures my intuitions about what "my real values" means
  • the utility function that represents my real values (this may seem equivalent to the one above, except that I don't seem to have clear intuitions in the matter, what intuitions I do have seem subject to change, and maybe there is a fact of the matter about what my real values are beyond my intuitions about it?)

Why are you referring to all of those as one's "utility function"? I thought the term "utility function" referred to one's terminal values. Your last example seems to refer to one's terminal values, but the rest are just random instances of types of reasoning leading to instrumental values.

I don't know what my values are. I don't even know how to find out what my values are.

I find this confusing. Could you please give a precise definition of "values" in this context?

Could you please give a precise definition of "values" in this context?

If I could, then the problem I'm trying to solve would already be solved. But I can try to clarify it a bit by saying that it's something like the last item on this list.

Well then, can you taboo "values" and tell me what it is you are looking for?

But I can try to clarify it a bit by saying that it's something like the last item on this list.

That last item talks about "real" values, which doesn't make things any clearer.

I would define it as something like, "The course of action one would take if they had perfect knowledge." The only problem with this definition seems to be that one's utility function not only defines what would be the best course of action, but also defines what would be the second best, and third, etc.

I would say "utility function" takes all possible actions one could take at each moment, and ranks them from 'worst idea' to 'best idea'. A coherent agent would have no disagreement between these rankings from moment to moment, but agents with akrasia, such as humans in the modern environment, have utility functions that cycle back and forth in a contradictory fashion, where at one moment the best action to take is at a later time a bad choice (such as people who find staying up late reading Reddit the most fun option, but then always regret it in the morning when they have to wake up early for work).
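As a rough illustration of the ranking picture in the comment above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `rank` and `is_coherent`, the two actions, and the numeric scores are hypothetical, and "coherent" is read narrowly as "the ranking of actions does not change from moment to moment".

```python
from typing import Callable, Iterable, List, Sequence

Action = str
Utility = Callable[[Action], float]


def rank(actions: Sequence[Action], utility: Utility) -> List[Action]:
    """Order actions from 'worst idea' to 'best idea' under one utility function."""
    return sorted(actions, key=utility)


def is_coherent(actions: Sequence[Action], utilities_over_time: Iterable[Utility]) -> bool:
    """True if every moment's utility function produces the same ranking."""
    rankings = [rank(actions, u) for u in utilities_over_time]
    return all(r == rankings[0] for r in rankings)


# Hypothetical scores for the staying-up-late example (made-up numbers).
ACTIONS = ["go to bed on time", "stay up reading Reddit"]
EVENING_SCORES = {"go to bed on time": 0.2, "stay up reading Reddit": 0.9}
MORNING_SCORES = {"go to bed on time": 0.8, "stay up reading Reddit": 0.1}


def evening_utility(action: Action) -> float:
    return EVENING_SCORES[action]


def morning_utility(action: Action) -> float:
    return MORNING_SCORES[action]


evening_ranking = rank(ACTIONS, evening_utility)
morning_ranking = rank(ACTIONS, morning_utility)
akratic = not is_coherent(ACTIONS, [evening_utility, morning_utility])
```

With these numbers the evening ranking puts Reddit on top and the morning ranking reverses it, so `akratic` is true. The sketch says nothing about which of the two rankings, if either, reflects the agent's "real" values.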

When you say "values", do you mean instrumental values, or do you mean terminal values? If the former then the answer is simple. This is what we spend most of our time doing. Will tweaking my diet in this way cause me to have more energy? Will asking my friend in this particular way cause them to accept my request? Etc. This is as mundane as it gets.

If the latter, the answer is a bit more complicated, but really it shouldn't be all that confusing. As agents, we're built with motivation systems, where out of all possible sensory patterns, some present to us as neutral, others as inherently desirable, and the last subset as inherently undesirable. Some things can be more desirable or less desirable, etc., thus these sensory components each run on at least one dimension.

Sensory patterns that present originally as inherently neutral may either be left as irrelevant (these are the things put on auto-ignore, which are apt to make a return to one's conscious awareness if certain substances are taken, or if careful introspection is engaged in), or otherwise acquire a 'secondary' desirability or undesirability via being seen to be in causal connection with something that presents as inherently one way or the other, for example finding running enjoyable because of certain positive benefits acquired in the past from the activity.

Thus to discover one's terminal values, one must simply identify these inherently desirable sensory patterns, and figure out which ones would top the list as 'most desirable' (in terms of nothing other than how it strikes one's perception). A good heuristic for this would be to see what other people consider enjoyable or fun, and then try it, and see what happens, but at the same time making sure to disambiguate any identity issues from the whole thing, such as sexual hangups making one unable to enjoy something widely considered to have one of the strongest effects in terms of 'wanting to engage in this behavior because it's so great'--sexual or romantic interaction.

But at the most fundamental, there's nothing to the task of figuring out one's terminal values other than simply figuring out what sensory patterns are most 'enjoyable' in the most basic sort of way imaginable, on a timescale sufficiently long-term to be something one would be unlikely to refer to as 'akrasia'. Even someone literally physically unable to experience certain positive sensory patterns, such as someone with extremely low libido because of physiological problems, would most likely qualify as making a 'good choice' if they engage in a course of action apt to cause them to begin to be able to experience these sensory patterns, such as that person implementing a particular lifestyle protocol likely to fix their physiological issues and bring their libido to a healthy level.

It gets somewhat confusing when you factor in the fact that the sensory patterns one is able to experience can shift over time, such as libido increasing or decreasing, or going through puberty, or something like that, along with factoring in akrasia, and other problems that make us seem less 'coherent' of agents, but I believe all the fog can be cut through if one simply makes the observation that sensory patterns present to us as either neutral, inherently desirable, or inherently undesirable, and that the latter two run on a dimension of 'more or less'. Neutral sensory patterns acquire 'secondary' quality on these dimensions depending on what the agent believes to be their causal connections to other sensory patterns, each ultimately needing to run up against an 'inherently motivating' sensory pattern to acquire significance.
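One way to make the 'secondary' desirability idea above concrete is the following Python sketch. It is only a toy under stated assumptions: the function name `valence`, the example patterns, the numbers, and especially the rule that a neutral pattern's secondary valence is the plain sum of the valences of its believed effects are all hypothetical choices, not something the comment specifies.

```python
from typing import Dict, FrozenSet, List


def valence(
    pattern: str,
    inherent: Dict[str, float],
    believed_effects: Dict[str, List[str]],
    seen: FrozenSet[str] = frozenset(),
) -> float:
    """Return the inherent valence of a pattern if it has one; otherwise
    derive a 'secondary' valence from what the agent believes it causes."""
    if pattern in inherent:
        # Inherently desirable (> 0) or undesirable (< 0): the chain stops here.
        return inherent[pattern]
    if pattern in seen:
        # Guard against circular causal beliefs.
        return 0.0
    seen = seen | {pattern}
    effects = believed_effects.get(pattern, [])
    # Assumption: secondary valence is the sum over believed effects.
    return sum((valence(e, inherent, believed_effects, seen) for e in effects), 0.0)


# Hypothetical agent: running is neutral in itself, but believed to lead to
# one inherently pleasant and one inherently unpleasant pattern.
INHERENT = {"feeling energetic": 2.0, "sore legs": -0.5}
BELIEFS = {
    "lacing up shoes": ["running"],
    "running": ["feeling energetic", "sore legs"],
}

running_valence = valence("running", INHERENT, BELIEFS)
shoes_valence = valence("lacing up shoes", INHERENT, BELIEFS)
wall_valence = valence("staring at a wall", INHERENT, BELIEFS)
```

A pattern with no believed route to anything inherently motivating, like "staring at a wall" here, stays at zero, which matches the claim that neutral patterns need to run up against an inherently motivating one to acquire significance.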

really it shouldn't be all that confusing. As agents, we're built with motivation systems, where out of all possible sensory patterns, some present to us as neutral, others as inherently desirable, and the last subset as inherently undesirable.

While I sympathize with you, I think you should decrease your threshold for apparent difficulty of problems.

For example, you should be able to choose between things that will make no sensory difference to you, such as the well-being of people in Xela. And of course you dodge the question of what is "enjoyable" - is a fistfight enjoyable if it makes you grin and your heart race but afterwards you never want to do it again? What algorithm should an AI follow to decide? You have to try and reduce "enjoyable" to things like "things you'd do again" or "things that make your brain release chemical cocktail X." And then you have to realize that those definitions are best met by meth, or an IV of chemical cocktail X, not by cool stuff like riding dinosaurs or having great sex.

For example, you should be able to choose between things that will make no sensory difference to you, such as the well-being of people in Xela.

This is an example of the sort of loose terminology that leads most people into the fog on these sorts of problems. If it makes no sensory difference, then it makes no sensory difference, and there's nothing to care about, as there's nothing to decide between. You can't choose between two identical things.

Or to be more charitable, I should say that what seems to have happened here is that I was using the term "sensory pattern" to refer to any and all subjective experiences appearing on one's visual field, etc., whereas you seem to be using the phrase "makes no sensory difference" to refer to the subset of subjective experience we call 'the real world'.

True, if I've never been to Xela, the well-being of the people there (presumably) makes no difference to my experience of everyday things in the outside world, such as the people I know, or what's going on in the places I do go. But this is not a problem. Mention the place, and explain the conditions in detail, employing colorful language and eloquent description, and before long there will be a video playing in my mind, apt to make me happy or sad, depending on the well-being of the people therein.

And of course you dodge the question of what is "enjoyable" - is a fistfight enjoyable if it makes you grin and your heart race but afterwards you never want to do it again?

I don't see the contradiction. Unless I'm missing something in my interpretation of your example, all that must be said is that the experience was enjoyable because certain dangers didn't play out, such as getting injured or being humiliated, but you'd rather not repeat that experience, for you may not be so lucky in the future. Plenty of things are enjoyable unless they go wrong, and are rather apt to go wrong, and thus are candidates for being something one enjoys but would rather not repeat.

For example, let's say you get lost in the moment, and have unprotected sex. You didn't have any condoms or anything, but everything else was perfect, so you went for it. You have the time of your life. After the fact you manage to put the dangers out of your mind, and just remember how excellent the experience was. Eventually it becomes clear that no STIs were transmitted, nor is there an unplanned pregnancy. The experience, because nothing went wrong, was excellent. But you decide it was a mistake.

There seems to be a contradiction here, saying that the experience was excellent, but that it was a mistake. But then you realize that the missing piece that makes it seem contradictory is the time factor. Once a certain amount of time passes, if nothing went wrong, one can say conclusively that nothing went wrong. 100% chance it was awesome and nothing went wrong. But at the time of the event, the odds were much worse. That's all.
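The time-factor point above can be put in expected-value terms. The Python sketch below is only illustrative: the probabilities and utilities are made-up numbers, and `expected_utility` is a hypothetical helper, not anything from the discussion.

```python
from typing import Iterable, Tuple


def expected_utility(outcomes: Iterable[Tuple[float, float]]) -> float:
    """Sum of probability * utility over (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)


# Made-up numbers: at decision time there is a 10% chance of a very bad outcome.
AT_DECISION_TIME = [
    (0.9, 10.0),    # nothing goes wrong, excellent experience
    (0.1, -200.0),  # something goes seriously wrong
]

# In hindsight, once it is known that nothing went wrong, the good outcome
# has probability 1.
IN_HINDSIGHT = [(1.0, 10.0)]

decision_time_value = expected_utility(AT_DECISION_TIME)
hindsight_value = expected_utility(IN_HINDSIGHT)
```

Under these assumed numbers the choice has negative expected utility when it is made, even though the realized experience has positive utility, so "excellent experience" and "mistake" are answers to two different questions.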

What algorithm should an AI follow to decide?

This seems off topic. Decide what? I thought we were talking about how to discover one's terminal values as a human.

You have to try and reduce "enjoyable" to things like "things you'd do again" or "things that make your brain release chemical cocktail X." And then you have to realize that those definitions are best met by meth, or an IV of chemical cocktail X, not by cool stuff like riding dinosaurs or having great sex.

Well if that's the case then they're unhelpful definitions. As far as I can see, nothing in my post would suggest a theory weak enough to output something like 'do meth', or 'figure out how to wirehead'.

While I sympathize with you, I think you should decrease your threshold for apparent difficulty of problems.

Along with what I just posted, I should also mention that I did say these two lines:

at the most fundamental, there's nothing to the task of figuring out one's terminal values other than simply figuring out what sensory patterns are most 'enjoyable' in the most basic sort of way imaginable, on a timescale sufficiently long-term to be something one would be unlikely to refer to as 'akrasia'

It gets somewhat confusing when you factor in [...] akrasia, and other problems that make us seem less 'coherent' of agents

Those seem to suggest I wasn't being as naive as your reply seems to imply.

99% of values come from copied entities (genes, memes, etc). Basically, any non-trivial optimisation process (i.e. more advanced than exhaustive/random search) involves the copying-with-variation of previously-successful solutions. So a useful way of classifying many values is to consider what was copied in order to produce them.

Why is this useful to us? I am confused :( it seems that only following all the selection could make this useful?

It can tell you whether your values came from you, from another individual (often manipulation), from tradition, or from some other source. Such information is often interesting.

Where do the inborn values fit? Is it "3.2.4. Innate"?

It would also be useful to give examples of values and where they fit, or misfit. For example, if I value kindness, did I learn to value it from watching others, was born valuing it, developed it by analyzing the consequences, or maybe I believe that I value it while alieving taking advantage of others by showing kindness? Or vice versa. Or maybe it's a topic for a separate post.


Inspired by you, I attempted to answer "Where do we derive values?" and put it in a chart.

  1. via oracle

    genetic (inborn predilection toward altruism)

    revelation (direct)

    prior authority (ignore original vehicle)

  2. via search

    historical evidence

    counterfactuals (simulated behavior)

    behavioral reward/punishment (randomized search with pruning, habits)

  3. via algorithmic computation

    consequentialism (same as counterfactuals?)

    virtue heuristic (anything optimizing on fixed small rule set)

    identity/hero based (intersection of virtue and historical?)

  4. as consequence of lower level (emergence)

    arise out of selfish desires (e.g. need for food, safety)

    arise out of tribal desires (we do group optimize)

Not sure whether your section 4 represents greater subtlety that I'm missing or whether it is straying from the question/answering a different question. Thanks for offering a framework that will allow me to better target my learning.

The community consensus seems to be that computation is the only legitimate source of values and that we should isolate the influence of other sources. I expect that most people use search methods. Provided a large enough historical record, it would be easier to ask "What has worked well for others facing similar choices?" than to reconstruct a correct choice using a heuristic-based method.

Framing in terms of "[your] values" creates different emphases from framing in terms of "morality", and that makes it hard for me to comment. "Decision-policy-relevant stuff" would evoke still different intuitions. There might be a way to go meta here.

These frames don't seem to make a difference to me... How would you comment if the problem was framed in terms of "morality" or "decision-policy-relevant stuff"?

This outline seems like it could be very useful! I like this post.

I am confused about what is meant by 1.1 though. Are you referring to some hypothetical potentially real external agent who can tell us something about what to value? Or are you saying we can learn what someone values by observing their religious beliefs? Or simply that a person's religious beliefs inform their morals?

Are you referring to some hypothetical potentially real external agent who can tell us something about what to value?

Some people seem to have a strong meta-level intuition to that effect (i.e., that they ought to value what God wants them to value, or do what God wants them to do). I personally don't, but I listed it so that the outline can cover other people as well as myself.

Ah, I see, thanks for the explanation.
