
On Friday I attended the 2020 Foresight AGI Strategy Meeting. Eventually a report will come out summarizing some of what was talked about, but for now I want to focus on what I talked about in my session on deconfusing human values. For that session I wrote up some notes summarizing what I've been working on and thinking about. None of it is new, but it is newly condensed in one place and in convenient list form, and it provides a decent summary of the current state of my research agenda for building beneficial superintelligent AI; a version 1 of my agenda, if you will. Thus, I hope this will make it a bit clearer what I'm working on, why I'm working on it, and what direction my thinking is moving in. As always, if you're interested in collaborating on things, whether that be discussing ideas or something more, please reach out.

Problem overview

  • I think we're confused about what we really mean when we talk about human values.
  • This is a problem because:
    • We likely need some understanding of human values to build AI that is aligned with and beneficial to humans.
    • If an AI system discovered the structure of human values for us, we couldn't verify the result.
  • What are values?
    • We don't have an agreed upon precise definition, but loosely it's "stuff people care about".
      • When I talk about "values" I mean the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology.
    • Importantly, what people care about is used to make decisions, and this has had implications for existing approaches to understanding values.
  • Much research on values tries to understand the content of human values or why humans value what they value, but not what the structure of human values is such that we could use it to model arbitrary values. This research unfortunately does not appear very useful to this project.
  • The best attempts we have right now are based on the theory of preferences.
    • In this model a preference is a statement located within an ordering (weak, partial, total, etc.), often written A > B > C to mean A is preferred to B, which is preferred to C.
    • Problems:
      • Goodhart effects are robust, and preferences in formal models are measures (proxies), not the thing we care about itself
      • Stated vs. revealed preferences: we generally favor revealed preferences, but this approach has its own problems.
      • General vs. specific preferences: do we look for context-independent preferences ("essential" values) or context-dependent preferences?
        • generalized preferences, e.g. "I like cake better than cookies", can lead to irrational preferences (e.g. non-transitive preferences); see the toy sketch after this list
        • contextualized preferences, e.g. "I like cake better than cookies at this precise moment", limit our ability to reason about what someone would prefer in new situations
    • See Stuart Armstrong's work for an attempt to address these issues so we can turn preferences into utility functions.
  • Preference based models look to me to be trying to specify human values at the wrong level of abstraction. But what would the right level of abstraction be?
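
To make the problems with the preference framing a bit more concrete, here's a toy sketch in Python (my own illustrative example with made-up options and contexts, not anything from the preference-learning literature) of how collapsing context-dependent choices into generalized, context-free preferences can yield an intransitive relation that no utility function can represent:

```python
# Toy "revealed preference" data: in each (hypothetical) context the agent
# picked the first option over the second.
observed_choices = [
    ("birthday", "cake", "cookies"),
    ("snack", "cookies", "fruit"),
    ("diet", "fruit", "cake"),
]

def generalized_preferences(choices):
    """Drop the context, keeping only context-free pairwise preferences."""
    return {(winner, loser) for _context, winner, loser in choices}

def is_transitive(prefs):
    """Check transitivity of the pairwise preference relation."""
    return all(
        (a, c) in prefs
        for (a, b) in prefs
        for (b2, c) in prefs
        if b == b2 and a != c
    )

prefs = generalized_preferences(observed_choices)
print(prefs)                 # cake > cookies, cookies > fruit, fruit > cake
print(is_transitive(prefs))  # False: the generalized relation cycles, so no
                             # utility function can represent it, even though
                             # each contextual choice looked sensible on its own
```

Contextualizing the preferences dissolves the cycle, but only by giving up the ability to say much about situations we haven't observed, which is exactly the tradeoff noted above.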

Solution overview

  • What follows is a summary of what I so far think moves us closer to less confusion about human values. I hope to come away thinking some of this is wrong or insufficient by the end of the discussion!
  • Assumptions:
    • Humans are embedded agents.
    • Agents have fuzzy but definable boundaries.
      • Everything in every moment causes everything in every next moment (up to the limit of the speed of light), but we can find clusters of stuff that interact with themselves in ways that are "aligned" such that it makes sense to model the stuff in a cluster as an agent separate from the stuff outside it.
  • Basic model:
    • Humans (and other agents) cause events. We call this acting.
    • The process that leads to taking one action rather than another possible action is deciding.
    • Decisions are made by some decision generation process.
    • Values are the inputs to the decision generation process that determine its decisions and hence actions.
    • Preferences and meta-preferences are statistical regularities we can observe over the actions of an agent (see the toy sketch after this list).
  • Important differences from preference models:
    • Preferences are causally after, not causally before, decisions, contrary to the standard preference model.
      • This is not 100% true. Preferences can be observed by self-aware agents, like humans, and influence the decision generation process.
  • So then what are values? The inputs to the decision generation process?
    • My best guess: valence
    • This leaves us with new problems. Now rather than trying to infer preferences from observations of behavior, we need to understand the decision generation process and valence in humans, i.e. this is now a neuroscience problem.
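
To make the basic model concrete, here's a toy sketch (entirely my own illustration; the valence numbers and the noisy argmax are stand-ins, not claims about how brains actually compute anything) of values as inputs to a decision generation process, with preferences only showing up afterwards as statistical regularities over the resulting actions:

```python
import random
from collections import Counter

def valence(option, context):
    """Stand-in for values: the inputs the decision generation process consumes."""
    base = {"cake": 0.8, "cookies": 0.6, "fruit": 0.4}
    bonus = 0.5 if context == "diet" and option == "fruit" else 0.0
    return base[option] + bonus

def decide(options, context):
    """Decision generation process: noisy selection on valence. The noise
    stands in for everything about the process we haven't modeled."""
    return max(options, key=lambda o: valence(o, context) + random.gauss(0, 0.1))

# The agent acts; only now do "preferences" exist, as regularities an observer
# can compute over the stream of actions.
actions = [
    decide(["cake", "cookies", "fruit"], random.choice(["snack", "diet"]))
    for _ in range(1000)
]
print(Counter(actions))  # e.g. Counter({'cake': ..., 'fruit': ..., 'cookies': ...})
```

The ordering is the point: the valence inputs and the decision generation process come first, and the preference statistics an observer computes come after, which is the reversal from the standard preference model described above.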

Discussion

  • underdetermination due to noise; many models are consistent with the same data (see the toy sketch after this list)
    • this makes it easy for us to get confused, even when we're trying to deconfuse ourselves
    • this makes it hard to know if our model is right since we're often in the situation of explaining rather than predicting
  • is this a descriptive or causal model?
    • Both: descriptive of what we see, but also trying to find the causal mechanism of what we reify as "values" at the human level in terms of "gears" at the neuron level
  • what is valence?
  • complexities of going from neurons to human level notions of values
    • there are many layers of different systems interacting on the way from neurons to values, and we don't understand enough about almost any of them, or even know for sure which systems are in the causal chain
  • Valence in human computer interaction research
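
On the underdetermination bullet above, here's a small toy sketch (again a made-up example of my own) of two quite different candidate "value" models that are equally consistent with the same observed choices and only come apart on a choice we never observed:

```python
def sugar_model(option):
    """Hypothesis 1: the agent values sugar."""
    return {"cake": 40, "cookies": 25, "fruit": 10, "salad": 1}[option]

def laziness_model(option):
    """Hypothesis 2: the agent values avoiding preparation effort."""
    effort = {"cake": 1, "cookies": 2, "fruit": 4, "salad": 3}
    return -effort[option]

# Pairs where the first option was observed to be chosen over the second.
observed_choices = [("cake", "cookies"), ("cookies", "fruit")]

def consistent(model, choices):
    return all(model(winner) > model(loser) for winner, loser in choices)

print(consistent(sugar_model, observed_choices))     # True
print(consistent(laziness_model, observed_choices))  # True: same data, both fit
# The models only disagree about a choice we never got to observe:
print(sugar_model("fruit") > sugar_model("salad"))        # True: predicts fruit
print(laziness_model("fruit") > laziness_model("salad"))  # False: predicts salad
```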

Acknowledgements

Thanks to Dan Elton, De Kai, Sai Joseph, and several other anonymous participants of the session for their attention, comments, questions, and insights.

Comments

Planned summary for the Alignment Newsletter:

This post argues that since 1. human values are necessary for alignment, 2. we are confused about human values, and 3. we couldn't verify it if an AI system discovered the structure of human values, we need to do research to become less confused about human values. This research agenda aims to deconfuse human values by modeling them as the input to a decision process which produces behavior and preferences. The author's best guess is that human values are captured by valence, as modeled by minimization of prediction error.

Planned opinion:

This is similar to the argument in <@Why we need a *theory* of human values@>, and my opinion remains roughly the same: I strongly agree that we are confused about human values, but I don't see an understanding of human values as necessary for value alignment. We could hope to build AI systems in a way where we don't need to specify the ultimate human values (or even a framework for learning them) before running the AI system. As an analogy, my friends and I are all confused about human values, but nonetheless I think they are more or less aligned with me (in the sense that if AI systems were like my friends but superintelligent, that sounds broadly fine).

Yep, agree with the summary.

I'll push back on your opinion a little bit here as if it were just a regular LW comment on the post.

I strongly agree that we are confused about human values, but I don't see an understanding of human values as necessary for value alignment. We could hope to build AI systems in a way where we don't need to specify the ultimate human values (or even a framework for learning them) before running the AI system.

This is a reasonable hope, but I generally think hope is dangerous when it comes to existential risks, so I'm moved to pursue this line of research because I believe it to be neglected, I believe it's likely enough to be useful to building aligned AI to be worth pursuing, and I would rather we explore it thoroughly and end up not needing it than not explore it and end up needing it. I also don't think it takes away much from other AI safety research, since the skills needed to work on this problem are somewhat different than those needed to address other AI safety problems (or so I think), so I mostly think we can pursue it for a fairly low opportunity cost.

As an analogy, my friends and I are all confused about human values, but nonetheless I think they are more or less aligned with me (in the sense that if AI systems were like my friends but superintelligent, that sounds broadly fine).

I expect we have a disagreement on how robust Goodhart problems are; as in, I would expect that if you felt more or less aligned with a superintelligent AI system the way you feel you are aligned with your friends, the AI system would optimize so hard that it would no longer be aligned, and that the level of alignment you are talking about only works because of a lack of optimization power. I suspect that at the level of measurement you're talking about, where you can infer alignment from observed behavior, there is too much room for error between the measure and the target, such that deviance is basically guaranteed.

Thankfully I know others are working on ways to engineer us around Goodhart problems, and maybe those solutions will be robust enough to work over such large measurement gaps, but again I am perhaps more conservative here and want to make the gap between the measure and the target much smaller, so that we can effectively get "under" Goodhart effects for the targets we care about by measuring and modeling the processes that generate those targets rather than the targets themselves.
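
To gesture at the kind of measure-target gap I have in mind, here's a toy numerical sketch (the setup and numbers are made up purely for illustration) of regressional Goodhart: when the measure is the target plus independent error, the harder you select on the measure, the larger the expected gap between the measure and the target at the selected point.

```python
import random

random.seed(0)

def sample_candidate():
    """A candidate with a true target value and a noisy proxy measure of it."""
    target = random.gauss(0, 1)
    proxy = target + random.gauss(0, 1)  # measure = target + independent error
    return target, proxy

def select_best(n_candidates):
    """Apply optimization pressure: pick the candidate the proxy says is best."""
    return max((sample_candidate() for _ in range(n_candidates)), key=lambda c: c[1])

def average_gap(pressure, trials=200):
    """Average (proxy - target) at the selected point, over many trials."""
    gaps = [proxy - target
            for target, proxy in (select_best(pressure) for _ in range(trials))]
    return sum(gaps) / len(gaps)

for pressure in (10, 100, 1000):
    print(f"pressure {pressure:>5}: average measure-target gap {average_gap(pressure):.2f}")
# The harder we select on the measure, the further the selected point's measured
# score runs ahead of the thing we actually care about.
```

And this is the friendliest regime, where the error is small, independent, and well-behaved; my worry is that the gap between observed behavior and the underlying values is much larger and messier than that.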

This is a reasonable hope, but I generally think hope is dangerous when it comes to existential risks

When I say "hope", I mean "it is reasonably likely that the research we do pans out and leads to a knowably-aligned AI system", not "we will look at the AI system's behavior, pull a risk estimate out of nowhere, and then proceed to deploy it anyway".

In this sense, literally all AI risk research is based on hope, since no existing AI risk research knowably will lead to us building an aligned AI system.

I'm moved to pursue this line of research because I believe it to be neglected, I believe it's likely enough to be useful to building aligned AI to be worth pursuing, and I would rather we explore it thoroughly and end up not needing it than not explore it and end up needing it.

This is all reasonable; most of it can be said about most AI risk research. The main distinguishing feature between different kinds of technical AI risk research is:

it's likely enough to be useful to building aligned AI to be worth pursuing

So that's the part you'd have to argue for to convince me (but also it would be reasonable not to bother).

I would expect that if you felt more or less aligned with a superintelligent AI system the way you feel you are aligned with your friends, the AI system would optimize so hard that it would no longer be aligned

Suppose one of your friends became 10x more intelligent, or got a superpower where they could choose at will to stop time for everything except themselves and a laptop (that magically still has Internet access). Is this a net positive change to the world, or a net negative one?

Perhaps you think AI systems will be different in kind to your friends, in which case see next point.

I suspect that at the level of measurement you're talking about, where you can infer alignment from observed behavior, there is too much room for error between the measure and the target, such that deviance is basically guaranteed.

Wait, I infer alignment from way more than just observed behavior. In the case of my friends, I have a model of how humans work in general, informed both by theory (e.g. evolutionary psychology) and empirical evidence (e.g. reasoning about how I would do X, and projecting it onto them). In the case of AI systems, I would want similar additional information beyond just their behavior, e.g. an understanding of what their training process incentivizes, running counterfactual queries on them early in training when they are still relatively unintelligent and I can understand them, etc.

I am perhaps more conservative here and want to make the gap between the measure and the target much smaller, so that we can effectively get "under" Goodhart effects for the targets we care about by measuring and modeling the processes that generate those targets rather than the targets themselves.

It's not obvious to me that modeling the generators of a thing is easier than modeling the thing. E.g. It's much easier for me to model humans than to model evolution.

Suppose one of your friends became 10x more intelligent, or got a superpower where they could choose at will to stop time for everything except themselves and a laptop (that magically still has Internet access). Is this a net positive change to the world, or a net negative one?

I expect it to be net negative. My model is something like: humans are not very agentic (able to reliably achieve/optimize for a goal) in absolute terms, even though we may feel as though humans are especially agentic relative to other systems. Because humans bumble a lot, they don't tend to have a lot of impact, and things work out well or poorly on average as a result of lots of moves that cancel each other out and only leave a small gain or loss in valued outcomes in the end. A 10x smarter human would be more agentic, and if they are not exactly right about how to do good they could more easily do harm that would normally be buffered by their ineffectiveness.

I build this intuition from, for example, the way dictators often screw things up even when they are well intentioned: they have more power to achieve their goals, and that power amplifies their mistakes and misunderstandings in ways that cause more impact, more variance, and historically worse outcomes than less agentic methods of leadership.

Although this is not a perfect analogy, because 10x smarter is not just 10x more powerful/agentic but also 10x better able to think through consequences (which the dictator lacks), I also think the orthogonality thesis is robust enough that it's more likely to me that being 10x smarter will not mean a matching ability to think through consequences that perfectly offsets the risks of greater agency.

Wait, I infer alignment from way more than just observed behavior. In the case of my friends, I have a model of how humans work in general, informed both by theory (e.g. evolutionary psychology) and empirical evidence (e.g. reasoning about how I would do X, and projecting it onto them). In the case of AI systems, I would want similar additional information beyond just their behavior, e.g. an understanding of what their training process incentivizes, running counterfactual queries on them early in training when they are still relatively unintelligent and I can understand them, etc.

Exactly, because you can't infer alignment from observed behavior without normative assumptions. I'm saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.

It's not obvious to me that modeling the generators of a thing is easier than modeling the thing. E.g. It's much easier for me to model humans than to model evolution.

It's definitely harder. That's a reasonable consideration when we're trying to engineer a system that will be good enough while racing against the clock, and I think it's quite reasonable, for example, that we're going to try to tackle value alignment via extensions to narrow value learning approaches first because that's easier to build. But I also think those approaches will fail, so I'm looking ahead to where I see the limits of our knowledge for what we'll have to do, conditioned on this bet I'm making that value learning approaches similar in kind to those we're trying now won't produce aligned AIs.

I expect it to be net negative.

Man, I do not share that intuition.

I'd be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren't well-intentioned or 2. they didn't have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).

I'm saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.

I know you're saying that, I just don't see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That's fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.

It's definitely harder.

This is an assertion, not an argument.

Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you've learned by looking at humans), relative to building a model by looking at humans (and using no information you've learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.

I'd be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren't well-intentioned or 2. they didn't have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).

Some examples of actions taken by dictators that I think were well intentioned and meant to further goals that seemed laudable and not about power grabbing to the dictator but had net negative outcomes for the people involved and the world:

  • Joseph Stalin's collectivization of farms
  • Tokugawa Iemitsu's closing off of Japan
  • Hugo Chávez's nationalization of many industries

I know you're saying that, I just don't see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That's fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.

I've made my case for that here.

Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you've learned by looking at humans), relative to building a model by looking at humans (and using no information you've learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.

No, it's not my goal that we not look at humans. I instead think we're currently too focused on trying to figure out everything from only the kinds of evidence we can easily collect today, and that we also don't have detailed enough models to know what other evidence is likely relevant. I think understanding whatever is going on with values is hard because there is relevant data further "down the stack", if you will, from observations of behavior. I think that because I look at issues like latent preferences: by definition they exist because we didn't have enough data to infer their existence, but they need not stay hidden if we gather more data about how they are generated, such that we could discover them in advance by looking earlier in the process that generates them.

Some examples of actions taken by dictators that I think were well intentioned and meant to further goals that seemed laudable and not about power grabbing to the dictator but had net negative outcomes for the people involved and the world:

What's your model for why those actions weren't undone?


To pop back up to the original question -- if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it's only good to make them 2x smarter, but after that more marginal intelligence is bad?

It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we're at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let's suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?

(I'm aware that intelligence is not one-dimensional, but I feel like this is still a mostly meaningful question.)

Just to be clear about my own position, a well intentioned superintelligent AI system totally could make mistakes. However, it seems pretty unlikely that they'd be of the existentially-catastrophic kind. Also, the mistake could be net negative, but the AI system overall should be net positive.

What's your model for why those actions weren't undone?

Not quite sure what you're asking here. In the first two cases they eventually were undone after people got fed up with the situation; the last is recent enough that I don't consider its not having already been undone as evidence people like it, only that they don't have the power to change it. My view is that these changes stayed in place because the dictators and their successors continued to believe the good outweighed the harm, either when this was clearly contrary to the ground truth but served some narrow purpose that was viewed as more important, or when the ground truth was too hard to discover at the time and we only believe it was net harmful through the lens of historical analysis.

To pop back up to the original question -- if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it's only good to make them 2x smarter, but after that more marginal intelligence is bad?
It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we're at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let's suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?

I'm not claiming we're at some optimal level of intelligence for any particular purpose, only that more intelligence leads to greater agency which, in the absence of sufficient mechanisms to constrain actions to beneficial ones, results in greater risk of negative outcomes due to things like deviance and unilateral action. Thus I do in fact think we'd be safer from ourselves, for example screening off existential risks humanity faces due to outside threats like asteroids, if we were dumber.

By comparison, chimpanzees may not live what look to us like very happy lives, and they are some factor dumber than us, but they also aren't at risk of making themselves extinct because one chimp really wanted a lot of bananas.

I'm not sure how much smarter we could all get without putting us at too much risk. I think there's an anthropic argument to be made that we are below whatever level of intelligence is dangerous to ourselves without greater safeguards, because we wouldn't exist in such universes due to having killed ourselves, but I feel like I have little evidence to make a judgement about how much smarter is safe, given that, for example, being, say, 95th-percentile smart didn't stop people from building things like atomic weapons or developing dangerous chemical applications. I would expect making my friends smarter to risk similarly bad outcomes. Making them dumber seems safer, especially when I'm in the frame of thinking about AGI.

I almost agree, but still ended up disagreeing with a lot of your bullet points. Since reading your list was useful, I figured it would be worthwhile to just make a parallel list. ✓ for agreement, × for disagreement (• for neutral).

Problem overview

✓ I think we're confused about what we really mean when we talk about human values.

× But our real problem is on the meta-level: we want to understand value learning so that we can build an AI that learns human values even without starting with a precise model waiting to be filled in.

_× We can trust AI to discover that structure for us even though we couldn't verify the result, because the point isn't getting the right answer, it's having a trustworthy process.

_ × We can't just write down the correct structure any more than we can just write down the correct content. We're trying to translate a vague human concept into precise instructions for an AI. The structure is vague for the same reasons as the content.

✓ Agree with extensional definition of values, and relevance to decision-making.

• Research on the content of human values may be useful information about what humans consider to be human values. I think research on the structure of human values is in much the same boat - information, not the final say.

✓ Agree about Stuart's work being where you'd go to write down a precise set of preferences based on human preferences, and that the problems you mention are problems.

Solution overview

✓ Agree with assumptions.

• I think the basic model leaves out the fact that we're changing levels of description.

_ × Merely causing events (in the physical level of description) is not sufficient to say we're acting (in the agent level of description). We need some notion of "could have done something else," which is an abstraction about agents, not something fundamentally physical.

_ × Similar quibbles apply to the other parts - there is no physically special decision process, we can only find one by changing our level of description of the world to one where we posit such a structure.

_ × The point: Everything in the basic model is a statistical regularity we can observe over the behavior of a physical system. You need a bit more nuanced way to place preferences and meta-preferences.

_ • The simple patch is to just say that there's some level of description where the decision-generation process lives, and preferences live at a higher level of abstraction than that. Therefore preferences are emergent phenomena from the level of description the decision-generation process is on.

_ _ × But I think if one applies this patch, then it's a big mistake to use loaded words like "values" to describe the inputs (all inputs?) to the decision-generation process, which are, after all, at a level of description below the level where we can talk about preferences. I think this conflicts with the extensional definitions from earlier.

× If we recognize that we're talking about different levels of description, then preferences are not either causally after or causally before decisions-on-the-basic-model-level-of-abstraction. They're regular patterns that we can use to model decisions at a slightly higher level of abstraction.

_ • How to describe self-aware agents at a low level of abstraction then? Well, time to put on our GEB hats. The low level of abstraction just has to include a computation of the model we would use on the higher level of abstraction.

✓ Despite all these disagreements, I think you've made a pretty good case that the human brain plausibly computes a single currency (valence) that it uses to rate both most decisions and most predictions.

_ × But I still don't agree that this makes valence human values. I mean values in the sense of "the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology." So I don't think we're left with a neuroscience problem, I still think what we want the AI to learn is on that higher level of abstraction where preferences live.

Thanks for your detailed response. Before I dive in, I'll just mention I added a bullet point about Goodhart because somehow when I wrote this up initially I forgot to include it.

× But our real problem is on the meta-level: we want to understand value learning so that we can build an AI that learns human values even without starting with a precise model waiting to be filled in.
_× We can trust AI to discover that structure for us even though we couldn't verify the result, because the point isn't getting the right answer, it's having a trustworthy process.
_ × We can't just write down the correct structure any more than we can just write down the correct content. We're trying to translate a vague human concept into precise instructions for an AI. The structure is vague for the same reasons as the content.

I don't exactly disagree with you, other than to say that I think if we don't understand enough about human values (for some yet undetermined amount that is "enough") we'd fail to build something that we could trust, but I also don't expect we have to solve the whole problem. Thus I think we need to know enough about the structure to get there, but I don't know how much enough is, so for now I work on the assumption that we have to know it all, but maybe we'll get lucky and can get there with less. But if we don't at least know something of the structure, such as at the fairly abstract level I consider here, I don't think we can specify precisely enough what we mean by "alignment" to avoid failing to build aligned AI.

So it's perhaps best to understand my position as a conservative one that is trying to solve problems that I think might be issues but are not guaranteed to be issues because I don't want to find ourselves in a world where we wished we had solved a problem, didn't, and then suffer negative consequences for it.

_ × Merely causing events (in the physical level of description) is not sufficient to say we're acting (in the agent level of description). We need some notion of "could have done something else," which is an abstraction about agents, not something fundamentally physical.
_ × Similar quibbles apply to the other parts - there is no physically special decision process, we can only find one by changing our level of description of the world to one where we posit such a structure.
_ × The point: Everything in the basic model is a statistical regularity we can observe over the behavior of a physical system. You need a bit more nuanced way to place preferences and meta-preferences.

I don't think I have any specific response other than to say that you're right, this is a first pass and there's a lot of hand waving going on still. One difficulty is that we want to build models of the world that will usefully help us work with it, while the world itself doesn't contain the modeled things as such; it just contains a soup of stuff interacting with other stuff. What's exciting to me is to get more specific on where my new model breaks down, because I expect that to lead the way to becoming yet less confused.

_ _ × But I think if one applies this patch, then it's a big mistake to use loaded words like "values" to describe the inputs (all inputs?) to the decision-generation process, which are, after all, at a level of description below the level where we can talk about preferences. I think this conflicts with the extensional definitions from earlier.

So this is a common difficulty in this kind of work. There is a category we sort of see in the world, we give it a name, and then we look to understand how that category shows up at different levels of abstraction in our models because it's typically expressed both at a very high level of abstraction and made up of gears moving at lower levels of abstraction. I'm sympathetic to this argument that talking about "values" or any other word in common use is a mistake because it invites confusion, but when I've done the opposite and used technical terminology it's equally confusing but in a different direction, so I no longer think word choice is really the issue here. People are going to be confused because I'm confused, and we're on this ride of being confused together as we try to unknot our tangled models.

× If we recognize that we're talking about different levels of description, then preferences are not either causally after or causally before decisions-on-the-basic-model-level-of-abstraction. They're regular patterns that we can use to model decisions at a slightly higher level of abstraction.

This is probably correct and so in my effort to make clear what I see as the problem with preference models maybe I claim too much. There's a lot to be confused about here.

_ × But I still don't agree that this makes valence human values. I mean values in the sense of "the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology." So I don't think we're left with a neuroscience problem, I still think what we want the AI to learn is on that higher level of abstraction where preferences live.

I don't know how to make the best case for valence. To me it seems like a good model because it fits with a lot of other models I have of the world, like the idea that the interesting thing about consciousness is feedback, and so lots of things are conscious (in the sense of having the fundamental feature that separates things with subjective experience from those without).

Also, to be clear, I don't think we are left with only a neuroscience problem, but we are left with, among other things, a neuroscience problem. What happens at higher levels of abstraction is meaningful, but I also think it's insufficient on its own and requires us to additionally address questions of how neurons behave to generate what we recognize at a human level as "value".

I really like the idea that preferences are observed after the fact, because I feel like there is some truth to it for human beings. We act, and then become self-aware of our reactions and thoughts, which leads us to formulate some values. Even when we act contrary to those values, at least inside, we feel shitty.

But that doesn't address the question of where these judgements and initial reactions come from, or how this self-awareness influences subsequent actions.

Still, this makes me want to read the rest of your research!

I specifically propose they come from valence, recognizing that we know valence is a phenomenon generated by the human brain but not exactly how it happens (yet).