Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I think of ambitious value learning as a proposed solution to the specification problem, which I define as the problem of defining the behavior that we would want to see from our AI system. I italicize “defining” to emphasize that this is not the problem of actually computing behavior that we want to see -- that’s the full AI safety problem. Here we are allowed to use hopelessly impractical schemes, as long as the resulting definition would, in theory, allow us to compute the behavior that the AI system should take, perhaps with assumptions like infinite computing power or arbitrarily many queries to a human. (Although we do prefer specifications that seem like they could admit an efficient implementation.) In terms of DeepMind’s classification, we are looking for a design specification that exactly matches the ideal specification. HCH and indirect normativity are examples of attempts at such specifications.

We will consider a model in which our AI system is maximizing the expected utility of some explicitly represented utility function that can depend on history. (It does not matter materially whether we consider utility functions or reward functions, as long as they can depend on history.) The utility function may be learned from data, or designed by hand, but it must be an explicit part of the AI that is then maximized.
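
As a toy illustration of this model (everything below is made up for illustration and is not from the post), the utility function is an explicit object that scores entire histories, and under the idealized assumption of unlimited compute, maximizing it reduces to brute-force search over action sequences:

```python
import itertools

ACTIONS = ["a", "b"]
HORIZON = 3

def utility(history):
    # History-dependent utility: reward alternating between actions,
    # something a reward on individual states at this granularity
    # could not express.
    return sum(1 for x, y in zip(history, history[1:]) if x != y)

# With unlimited compute, "maximize expected utility" degenerates to
# exhaustive search over all action sequences.
best = max(itertools.product(ACTIONS, repeat=HORIZON), key=utility)
```

The point of the sketch is only that the utility function is an explicit, separate object that the rest of the system then maximizes, whether it was hand-designed (as here) or learned from data.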

I will not justify this model for now, but simply assume it by fiat and see where it takes us. I’ll note briefly that this model is often justified by the VNM utility theorem and AIXI, and as the natural idealization of reinforcement learning, which aims to maximize the expected sum of rewards, although typically rewards in RL depend only on states.

A lot of conceptual arguments, as well as experiences with specification gaming, suggest that we are unlikely to be able to simply think hard and write down a good specification, since even small errors in specifications can lead to bad results. However, machine learning is particularly good at narrowing down on the correct hypothesis among a vast space of possibilities using data, so perhaps we could determine a good specification from some suitably chosen source of data? This leads to the idea of ambitious value learning, where we learn an explicit utility function from human behavior for the AI to maximize.

This is closely related to inverse reinforcement learning (IRL) in the machine learning literature, though not all work on IRL is relevant to ambitious value learning. For example, much work on IRL is aimed at imitation learning, which would in the best case allow you to match human performance, but not to exceed it. Ambitious value learning is, well, more ambitious -- it aims to learn a utility function that captures “what humans care about”, so that an AI system that optimizes this utility function more capably can exceed human performance, making the world better for humans than they could have done themselves.
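
To make the flavor of this inference concrete, here is a minimal, hypothetical sketch of the kind of reward inference that IRL builds on: maximum-likelihood inference of a reward function from noisily rational (“Boltzmann-rational”) human choices. All numbers, actions, and the candidate grid are illustrative assumptions, and real IRL operates over sequential policies rather than one-shot choices:

```python
import math, itertools

ACTIONS = [0, 1, 2]
demos = [1, 1, 2, 1, 1]  # observed human action choices (made up)

def log_likelihood(reward, beta=1.0):
    # Boltzmann-rational choice model: P(a) proportional to exp(beta * reward[a]).
    z = sum(math.exp(beta * reward[a]) for a in ACTIONS)
    return sum(beta * reward[a] - math.log(z) for a in demos)

# Search a small discrete space of candidate reward vectors for the
# one that best explains the demonstrations.
candidates = list(itertools.product([0.0, 1.0, 2.0], repeat=3))
best = max(candidates, key=log_likelihood)
```

The inferred reward assigns the highest value to the action the human chose most often, which is the basic mechanism by which data narrows down the hypothesis space of utility functions.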

It may sound like we would have solved the entire AI safety problem if we could do ambitious value learning -- surely if we have a good utility function we would be done. Why then do I think of it as a solution to just the specification problem? This is because ambitious value learning by itself would not be enough for safety, except under the assumption of as much compute and data as desired. These are really powerful assumptions -- for example, I'm assuming you can get data where you put a human in an arbitrarily complicated simulated environment with fake memories of their life so far and see what they do. This allows us to ignore many things that would likely be a problem in practice, such as:

  • Attempting to use the utility function to choose actions before it has converged
  • Distributional shift causing the learned utility function to become invalid
  • Local minima preventing us from learning a good utility function, or from optimizing the learned utility function correctly

The next few posts in this sequence will consider the suitability of ambitious value learning as a solution to the specification problem. Most of them will consider whether ambitious value learning is possible in the setting above (infinite compute and data). One post will consider practical issues with the application of IRL to infer a utility function suitable for ambitious value learning, while still assuming that the resulting utility function can be perfectly maximized (which is equivalent to assuming infinite compute and a perfect model of the environment after IRL has run).


A conversation that just went down in my head:

Me: "You observe that a bunch of attempts to write down what we want get Goodharted, and so you suggest writing down what we want using data. This seems like it will have all the same problems."

Straw You: "The reason you fail is because you can't specify what we really want, because value is complex. Trying to write down human values is qualitatively different from trying to write down human values using a pointer to all the data that happened in the past. That pointer cheats the argument from complexity, since it lets us fit lots of data into a simple instruction."

Me: "But the instruction is not simple! Pointing at what the "human" is is hard. Dealing with the fact that the human is inconsistent with itself gives more degrees of freedom. If you just look at the human actions, and don't look inside the brain, there are many many goals consistent with the actions you see. If you do look inside the brain, you need to know how to interpret that data. None of these are objective facts about the universe that you can just learn. You have to specify them, or specify a way to specify them, and when you do that, you do it wrong and you get Goodharted."

The next four posts are basically making exactly these points (except for "pointing at what the human is is hard"). Or actually, it doesn't talk about the "look inside the brain" part either, but I agree with your argument there as well.

I'm going to argue that ambitious value learning is difficult and probably not what we should be aiming for. (Or rather, I'm going to add posts that other people wrote to this sequence, that argue for that claim or weaker versions of it.)

Can you clarify, do "queries to a human" and "data about human behavior" mean things like asking humans questions and observing human behavior in real/historical situations, or does it mean being able to put humans in arbitrary virtual environments (along with fake memories of how they got there) in order to observe their reactions? If it's the former, I'm not sure how that lets us ignore "Distributional shift causing the learned utility function to become invalid". If it's the latter, I think a lot of people might be surprised by that assumption so it would be good to spell it out.

The latter. Good point about clarity, I've added a sentence making that clearer, thanks!

On second thought, even if you assume the latter, the humans you're learning from will themselves have problems with distributional shifts. If you give someone a different set of life experiences, they're going to end up a different person with different values, so it seems impossible to learn a complete and consistent utility function by just placing someone in various virtual environments with fake memories of how they got there and observing what they do. Will this issue be addressed in the sequence?

No, I'm not planning to tackle this issue.

One approach would be to take current-me and put current-me through a variety of virtual environments with fake memories that start from current-time without removing my real memories and use whatever is inferred from that as my utility function. (Basically, treat all experiences and memories up to the current time as "part of me", and treat that as the initial state from which you are trying to determine a utility function.)

But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that's a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work.

> But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.

Maybe it's not that bad? For example I can imagine learning the human utility function in two stages. The first stage uses the current human to learn a partial utility function (or some other kind of data structure) about how they want their life to go prior to figuring out their full utility function. E.g., perhaps they want a safe and supportive environment to think, talk to other humans, and solve various philosophical problems related to figuring out one's utility function, with various kinds of assistance, safeguards, etc. from the AI (but otherwise no strong optimizing forces acting upon them). In the second stage, the AI uses that information to compute a distribution of "preferred" future lives and then learns the full utility function only from those lives.

Another possibility is if we could design an Oracle AI that is really good at answering philosophical questions (including understanding what our confused questions mean), we can just ask it "What is my utility function?"

So I would argue that your proposal is one example of how you could learn a utility function from humans assuming you know the full human policy, where you are proposing that we pay attention to a very small part of the human policy (the part that specifies our answers to the question "how do we want our life to go" at the current time, and then the part that specifies our behavior in the "preferred" future lives).

You can think of this as ambitious value learning with a hardcoded structure by which the AI is supposed to infer the utility function from behavior. (A mediocre analogy: AlphaGoZero learns to play Go with a hardcoded structure of MCTS.) As a result, you would still need to grapple with the arguments against ambitious value learning brought up in subsequent posts -- primarily, that you need to have a good model of the mistakes that humans make in order to do better than humans would themselves. In your proposal, I think the mistake model is "everything that humans do could be mistaken, but when they talk about how they want their life to go, they are not mistaken about that". This seems like a better mistake model than most, and it could work -- but we are hardcoding in an assumption about humans here that could be misspecified. (E.g. humans say they want autonomy and freedom from manipulation but actually they would have been better off if they had let the AI make arguments to them about what they care about.)

> In your proposal, I think the mistake model is “everything that humans do could be mistaken, but when they talk about how they want their life to go, they are not mistaken about that”.

Ok, this is helpful for making a connection between my way of thinking and the "mistake model" way, but it seems a bit of a stretch, since I almost certainly am mistaken (or suboptimal) about how I want my life to go. I only want autonomy and freedom from manipulation because I don't know how to let an AI manipulate me (i.e., make arguments to me about my values) in a way that would be safe and lead to good results. If I did, I may well let the AI do that and save myself the trouble and risk of trying to figure out my values on my own.

Yeah, I agree that the mistake model implied by your proposal isn't correct, and as a result you would not infer the true utility function. Of course, you might still infer one that is sufficiently close that we get a great future.

Tbc, I do think there are lots of other ways of thinking about the problem that are useful that are not captured by the "mistake model" way of thinking. I use the "mistake model" way of thinking because it often shows a different perspective on a proposal, and helps pinpoint what you're relying on in your alignment proposal.

Of course this is all assuming that there does exist a true utility function, but I think we can replace "true utility function" with "utility function that encodes the optimal actions to take for the best possible universe" and everything still follows through. But of course, not hitting this target just means that we don't do the perfectly optimal thing -- it's totally possible that we end up doing something that is only very slightly suboptimal.

> Of course this is all assuming that there does exist a true utility function, but I think we can replace "true utility function" with "utility function that encodes the optimal actions to take for the best possible universe" and everything still follows through.

The replacement feels just as obscure to me as the original.

What do you mean by "obscure"?

People often argue "there is no true utility function for humans" because we often do things that are contradictory that imply that we violate the VNM axioms. However, in theory you could look at all action sequences, rank them, take the best one, and find a utility function for which that action sequence is optimal, and you could call that the utility function that you want. That utility function exists as long as you agree that an ordering over action sequences exists, which seems very reasonable.

TL;DR the point of that reframing is to overcome objections that the "true utility function" doesn't exist.
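
That construction can be sketched in a few lines (a toy example; the ranking here is an arbitrary stand-in for whatever judgment process would actually produce it):

```python
import itertools

# All complete action sequences over a tiny action set and horizon.
sequences = list(itertools.product(["a", "b"], repeat=2))

# Suppose some (possibly messy, inconsistent-looking) judgment process
# produced this ranking over sequences, best first.
ranking = [("b", "a"), ("a", "a"), ("b", "b"), ("a", "b")]

# Define a utility function under which the top-ranked sequence is
# optimal, regardless of whether the underlying choices obeyed the
# VNM axioms.
utility = {seq: -rank for rank, seq in enumerate(ranking)}

assert max(sequences, key=utility.get) == ("b", "a")
```

Such a utility function exists for any ranking over action sequences, which is all the reframing needs.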

Thanks! I think I understand the intent of the rephrasing now.

What I meant with "obscure" is that both "true utility function" and "utility function that encodes the optimal actions to take for the best possible universe" have normative terminology in them that I don't know how to reduce or operationalize.

For instance, imagine I am looking at action sequences and ranking them. Presumably large portions of that process would feel like difficult judgment calls where I'd feel nervous about still making some kind of mistake. Both your phrasings (to my ears) carry the connotation that there is a "best" mistake model, one which is in a relevant sense independent from our own judgment, where we can learn things that will make us more and more confident that now we're probably not making mistakes anymore because of progress in finding the correct way of thinking about our values. That's the part that feels obscure to me because I think we'll always be in this unsatisfying epistemic situation where we're nervous about making some kind of mistake by the light of a standard that we cannot properly describe.

I do get the intuition for thinking in these terms, though. It feels conceivable that another discovery similar to what cognitive biases did could improve our thinking, and I definitely agree that we want a concept for staying open to this possibility. I'm just pointing out that non-operationalized normative concepts seem obscure. (Though maybe that's fine if we're treating them in the same way Yudkowsky treats "magic reality fluid" – as a placeholder for whatever comes once we're less confused about "measure".)

> What I meant with "obscure" is that both "true utility function" and "utility function that encodes the optimal actions to take for the best possible universe" have normative terminology in them that I don't know how to reduce or operationalize.

Oh yeah, I was definitely speaking normatively there.

> For instance, imagine I am looking at action sequences and ranking them. Presumably large portions of that process would feel like difficult judgment calls where I'd feel nervous about still making some kind of mistake.

Agreed, I'm just saying that in principle there exists some "best" way of making those calls.

> Both your phrasings (to my ears) carry the connotation that there is a "best" mistake model, one which is in a relevant sense independent from our own judgment

Agreed that I'm assuming that there is a "best" mistake model, I wouldn't say that it has to be independent from our own judgment.

> But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that's a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work.

This statement feels pretty strong, especially given that I find it trivially true that I'd be a different person under many plausible alternative histories. This makes me think I'm probably misinterpreting something. :)

At first I read your paragraph as the strong claim that if it's true that individual human values are underdetermined at birth, then ambitious value learning looks doomed. And I'd take it as proof for "individual human values are underdetermined at birth" if, replaying history, I'd now have different values (or a different probability distribution over values) if I had encountered Yudkowsky's writings before Singer's, rather than vice-versa. Or if I would be less single-minded about altruism had I encountered EA a couple of years later in life, after already taking on another self-identity.

But these points (especially the second example) seem so trivially true that I'm probably talking about a different thing. In addition, they're addressed by the solution you propose in your first paragraph, namely taking current-you as the starting point.

Another concern could be that "there is almost never a stable core of an individual human's values", i.e., that "even going forward from today, the values of Lukas or Rohin or Wei are going to be heavily underdetermined". Is that the concern? This seems like it could be possible for most people, but definitely not for all people. And undetermined values are not necessarily that bad (though I find it mildly disconcerting, personally). [Edit: Wei's comment and your reply to it sounds like this might indeed be the concern. :) Good discussion there!]

The fact that I have a hard time understanding the framework behind your statement is probably because I'm thinking in terms of a different part of my brain when I talk about "my values". I identify very much with my reflective life goals to a point that seems unusual. I don't identify much with "What Lukas's behavior, if you were to put him in different environments and then watch, would indirectly consistently tell you about the things he appears to want – e.g., 'values' like being held in high esteem by others, having a comfortable life, romance, having either some kind of overarching purpose or enough distractions to not feel bothered by the lack of purpose, etc.". There is definitely a sense in which the code that runs me is caring about all these implicit goals. But that's not how I most want to see it. I also know that in all the environments that offer the options to self-modify into a more efficient pursuer of explicitly held personal ideals, I would make substantial use of the option to self-modify. And that seems relevant for the same reason that we wouldn't want to count cognitive biases as people's values.

(I should probably continue reading the sequence and then come back to this later if I still feel unclear about it.)

> Another concern could be that "there is almost never a stable core of an individual human's values", i.e., that "even going forward from today, the values of Lukas or Rohin or Wei are going to be heavily underdetermined". Is that the concern?

Yeah. Also I suspect some people are worried about taking current-you as a starting point -- that seems somewhat arbitrary. But if you're fine with that, then the major concern is that values are still underdetermined going forward.

> The fact that I have a hard time understanding the framework behind your statement is probably because I'm thinking in terms of a different part of my brain when I talk about "my values". I identify very much with my reflective life goals to a point that seems unusual.

I interpreted Wei's comment as saying that even your reflective life goals would be underdetermined -- presumably even now if you hear convincing moral argument A but not B, then you'd have different reflective life goals than if you hear B but not A. This seems broadly correct to me.

> I interpreted Wei's comment as saying that even your reflective life goals would be underdetermined -- presumably even now if you hear convincing moral argument A but not B, then you'd have different reflective life goals than if you hear B but not A.

Okay yeah, that also seems broadly correct to me.

I am hoping though that, as long as I'm not subjected to optimization pressures from outside that weren't crafted to be helpful, it's very rare that something I'd currently consider very important can end up either staying important or becoming completely unimportant merely based on the order of new arguments encountered. And similarly I'm hoping that my value endpoints would still cluster decisively around the things I currently consider most important -- though that's where it becomes tricky to trade off goal preservation versus openness to philosophical progress.

The Hansonian discussion of shared priors seems relevant. (For those not familiar with it: https://mason.gmu.edu/~rhanson/prior.pdf ) Basically, we should have convergent posteriors in an Aumann sense unless we have not only different priors and different experiences, but also different origins.

But what this means is that *to the extent that human values are coherent and based on correct bayesian reasoning* - which, granted, is a big assumption - distributional shifts shouldn't exist. (And now, back to reality.)

This also presumes that human values are an empirical fact about reality that you can have beliefs over, which seems at least controversial.

I don't think you are correct about the implication of "not up for grabs" - it doesn't mean it is not learnable, it means that we don't update or change it, and that it is not constrained by rationality. But even that isn't quite right - rational behavior certainly requires that we change preferences about intermediate outcomes when we find that our instrumental goals should change in response to new information.

And if the utility function changes as a result of life experiences, it should be in a way that reflects learnable expectations over how experiences change the utility function - so the argument about needing origin disputes still applies.

I'm not claiming (in the parent comment) that values aren't learnable.

I am claiming that they are not constrained by rationality (or rather, that this is a reasonable position to have, corresponding roughly to moral anti-realism).

I was talking about terminal values, not instrumental values. I certainly agree that if we take terminal values as given, instrumental values are an empirical fact about reality.

Though I think I see my misunderstanding now. I thought you were claiming that humans arrived at their values by a process of Bayesian updating on what their values should be. But actually what you're claiming is that to the extent that human beliefs (not values!) are based on correct Bayesian reasoning with shared origins, distributional shifts shouldn't exist. Humans may still disagree on values.

I was confused because your original comment used the assumption that human values were based on correct Bayesian reasoning, am I correct that you meant that to apply to human beliefs?

Sorry, I needed to clarify my thinking and my claim a lot further. This is in addition to the (what I assumed was obvious) claim that correct Bayesian thinkers should be able to converge on beliefs despite potentially having different values. I'm speculating that if terminal values are initially drawn from a known distribution, AND "if you think that a different set of life experiences means that you are a different person with different values," but that values change based on experiences in ways that are understandable, then rational humans will act in a coherent way so that we should expect to be able to learn human values and their distribution, despite the existence of shifts.

Conditional on those speculative thoughts, I disagree with your conclusion that "that's a really good reason to assume that the whole framework of getting the true human utility function is doomed." Instead, I think we should be able to infer the distribution of values that humans actually have - even if they individually change over time from experiences.

But what do you optimize then?

That's an important question, but it's also fundamentally hard, since it's almost certainly true that human values are inconsistent -- if not individually, then at an aggregate level. (You can't reconcile opposite preferences, or maximize each person's share of a finite resource.)

The best answer I have seen is Eric Drexler's discussion of Pareto-topia, where he suggests that we can make huge progress and gains in utility according to all value-systems held by humans, despite the fact that they are inconsistent.

That seems right. Though if you accept that human values are inconsistent and you won't be able to optimize them directly, I still think "that's a really good reason to assume that the whole framework of getting the true human utility function is doomed."

By "true human utility function" I really do mean a single function that when perfectly maximized leads to the optimal outcome.

I think "human values are inconsistent" and "people with different experiences will have different values" and "there are distributional shifts which cause humans to be different than they would otherwise have been" are all different ways of pointing at the same problem.

> although typically rewards in RL depend only on states,

Presumably this should be a period? (Or perhaps there's a clause missing pointing out the distinction between caring about history and caring about states, tho you could transform one into the other?)

Supposed to be a period, fixed now. While you can transform one into the other, I find it fairly unnatural, and I would guess this would be the case for other ML researchers. Typically, if we want to do things that depend on history, we just drop the Markov assumption, rather than defining the state to be the entire history.

Also, if you define the state to be the entire history, you lose ergodicity assumptions that are needed to prove that algorithms can learn well.