Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

How to understand non-technical proposals

This post grew out of conversations at EA Hotel, Blackpool about how to think about the various proposals for ‘solving’ AI Alignment like CEV, iterated amplification and distillation or ambitious value learning. Many of these proposals seemed to me to combine technical and ethical claims, or to differ in the questions they were trying to answer in confusing ways. In this post I try to come up with a systematic way of understanding the goals of different high-level AI safety proposals, based on their answers to the Value Definition Problem. Framing this problem leads to comparing various proposals by their level of Normative Directness, as defined by Bostrom in Superintelligence. I would like to thank Linda Linsefors and Grue_Slinky for their help refining these ideas, and EA Hotel for giving us the chance to discuss them.

Defining the VDP

In Superintelligence (2014) Chapter 14, Bostrom discusses the question of ‘what we should want a Superintelligence to want’, defining a problem:

“Supposing that we could install any arbitrary value into our AI, what should that value be?”

The Value Definition Problem

By including the clause ‘supposing that we could install any arbitrary value into our AI’, Bostrom is assuming we have solved the full Value Loading Problem and can be confident in getting an AGI to pursue any value we like.

Bostrom’s definition of this ‘deciding which values to load’ problem is echoed in other writing on this topic. One proposed answer to this question, the Coherent Extrapolated Volition (CEV), is described by Yudkowsky as

‘a proposal about what a sufficiently advanced self-directed AGI should be built to want/target/decide/do’.

With the caveat that this is something you should do ‘with an extremely advanced AGI, if you're extremely confident of your ability to align it on complicated targets’.

However, if we only accept the above as problems to be solved, we are being problematically vague. Bostrom explains why in Chapter 14. If we really can ‘install any arbitrary value into our AI’, we can simply require the AI to ‘do what I mean’ or ‘be nice’ and leave it at that. If an AGI successfully did “want/target/decide to do what I meant”, then we would have successful value alignment!

Answers like this are not even wrong - they shunt all of the difficult work into the question of solving the Value Loading Problem, i.e. in precisely specifying ‘do what I mean’ or ‘be nice’.

In order to address these philosophical problems in a way that is still rooted in technical considerations, I propose that instead of simply asking what an AGI should do if we could install any arbitrary value, we should seek to solve the Value Definition Problem:

“Given that we are trying to solve the Intent Alignment problem for our AI, what should we aim to get our AI to want/target/decide/do, to have the best chance of a positive outcome?”

In other words, instead of the unconditional, ‘what are human values’ or ‘what should the AI be built to want to do’, it is the conditional, ‘What should we be trying to get the AI to do, to have the best chance of a positive outcome’.

This definition of the VDP excludes excessively vague answers like ‘do what I mean’, because an AI with successful intent alignment is not guaranteed to be capable enough to successfully determine ‘what we mean’ under all circumstances. In extreme cases, like the Value Definition ‘do what I mean’, "what we mean" is undefined because we don't know what we mean, so there is no answer that could be found.

If we have solved the VDP, then an Intent-Aligned AI, in the course of trying to act according to the Value Definition, should actually be able to act according to the Value Definition. In acting according to this Value Definition, the outcome would be beneficial to us. Even if a successfully aligned AGI is nice, does what I mean and/or acts according to Humanity's CEV, these were only good answers to the VDP if adopting them was actually useful or informative in aligning this AGI.

What counts as a good solution to the VDP depends on our solution to intent alignment and the AGI’s capabilities, because what we should be wanting the AI to do will depend on what the AGI can discover about what we want.

This definition of the VDP does not precisely cleave the technical from the philosophical/ethical issues in solving AI value alignment, but I believe it is well-defined enough to be worth considering. It has the advantage of bringing the ethical and technical AI Safety considerations closer together.

A good solution to the VDP would still be an informal definition of value: what we want the AI to pursue. However, it should give us at least some direction about technical design decisions, since we need to ensure that the Intent-Aligned AI has the capabilities necessary to learn the given definition of value, and that the given definition of value does not make alignment very hard or impossible.

Criteria for judging Value Definitions

  1. How hard would Intent-Aligning be? How hard would it be to ensure the AI ‘tries to do the right thing’, where ‘right’ is given by the Value Definition? In particular, does adopting this definition of value make intent-alignment easier?
  2. How great would our AGI capabilities need to be? How hard would it be for the AGI to ‘[figure] out which thing is right’, where ‘right’ is given by the Value Definition? In particular, does adopting this definition of value help us to understand what capabilities or architecture the AI needs?
  3. How good would the outcome be? If the AGI is successfully pursuing our Value Definition, how good would the outcome be?

3 is what Bostrom focuses on in Chapter 14 of Superintelligence, as (with the exception of dismissing useless answers to the VDP like ‘be nice’ or ‘do what I mean’) he does not consider whether different value definitions would influence the difficulty of Intent Alignment or the required AI Capabilities. Similarly, Yudkowsky assumes we are ‘extremely confident’ of our ability to get the AGI to pursue an arbitrarily complicated goal. 3 is a normative ethical question, whereas the first two are (poorly understood and defined) technical questions.

Some values are easier to specify and align to than others, so even when discussing pure value definitions, we should keep the technical challenges at the back of our mind. In other words, while 3 is the major consideration used for judging value definitions, 1 or 2 must also be considered. In particular, if our value definition is so vague that it makes intent alignment impossible, or requires capabilities that seem magical, such as ‘do what I mean’ or ‘be nice’, we do not have a useful value definition.

Human Values and the VDP

While 1 and 2 are clearly difficult questions to answer for any plausible value definition, 3 seems almost redundant. It might seem as though we should expect at least a reasonably good outcome if we were to ‘succeed’ with any definition that is intended to extract the values of humans, because by definition success would result in our AGI having the values of humans.

Stuart Armstrong argues that to properly address 3 we need ‘a definition - a theory - of what human values actually are’. This is necessary because different interpretations of our values tend to diverge when we are confronted by extreme circumstances and because in some cases it is not clear what our ‘real preferences’ actually are.

An AI could remove us from typical situations and put us into extreme situations - at least "extreme" from the perspective of the everyday world where we forged the intuitions that those methods of extracting values roughly match up.
Not only do we expect this, but we desire this: a world without absolute poverty, for example, is the kind of world we would want the AI to move us into, if it could. In those extreme and unprecedented situations, we could end up with revealed preferences pointing one way, stated preferences another, while regret and CEV point in different directions entirely.

3 amounts to a demand to reach at least some degree of clarity on (if not to solve) normative ethics and metaethics - we have to understand what human values are in order to choose between or develop a method for pursuing them.

Indirect vs Direct Normativity

Bostrom argues that our dominant consideration in judging between different value definitions should be the ‘principle of epistemic deference’:

The principle of epistemic deference
A future superintelligence occupies an epistemically superior vantage point: its beliefs are (probably, on most topics) more likely than ours to be true. We should therefore defer to the superintelligence’s opinion whenever feasible.

In other words, in describing the 'values' we want our superintelligence to have, we want to hand over as much work to the superintelligence as possible.

This takes us to indirect normativity. The obvious reason for building a super-intelligence is so that we can offload to it the instrumental reasoning required to find effective ways of realizing a given value. Indirect normativity would enable us also to offload to the superintelligence some of the reasoning needed to select the value that is to be realized.

The key issue here is given by the word ‘some’. How much of the reasoning should we offload to the Superintelligence? The principle of epistemic deference answers ‘as much as possible’.

What considerations push against the principle of epistemic deference? One consideration is the metaethical views we think are plausible. In Wei Dai’s Six Plausible Meta-Ethical Alternatives, two of the more commonly held views are that ‘intelligent beings have a part of their mind that can discover moral facts and find them motivating, but those parts don't have full control over their actions’ and that ‘there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences’.

Either of these alternatives suggests that too much epistemic deference is not valuable - if, for example, there are facts about what everyone should value but a mind must be structured in a very specific way to discover and be motivated by them, we might want to place restrictions on what the Superintelligence values to make sure we discover them. In the extreme case, if a certain moral theory is known to be correct, we could avoid having to trust the Superintelligence’s own judgment by just getting it to obey that theory. This extreme case could never practically arise, since we could never achieve that level of confidence in a particular moral theory. Bostrom says it is ‘foolhardy’ to try to do any moral philosophy work that could be left to the AGI, but as Armstrong says, it will be necessary to do some work to understand what human values actually are - how much work?

Classifying Value Definitions

The Scale of Directness

Issa Rice recently provided a list of ‘[options] to figure out the human user or users’ actual preferences’, or to determine definitions of value. These ‘options’, if successfully implemented, would all result in the AI being aligned onto a particular value definition.

We want good outcomes from AI. To get this, we probably want to figure out the human user's or users' "actual preferences" at some point. There are several options for this.

Following Bostrom’s notion of ‘Direct and Indirect Normativity’ we can classify these options by how direct their value definitions are - how much work they would hand off to the superintelligence vs how much work the definition itself does in defining value.

Here I list some representative definitions from most to least normatively direct.

Value Definitions

Hardwired Utility Function

Directly specify a value function (or rigid rules for acquiring utilities), assuming a fixed normative ethical theory.

It is essentially impossible to directly specify a correct reward function for a sufficiently complex task. Already, we use indirect methods to align an RL agent on a complex task (see e.g. Christiano (2017)). For complex, implicitly defined goals we are always going to need to learn some kind of reward/utility function predictor.

Ambitious Learned Value Function

Learn a measure of human flourishing and aggregate it for all existing humans, given a fixed normative (consequentialist) ethical theory that tells us how to aggregate the measure fairly.

E.g. have the AI learn a model of the current individual preferences of all living humans, and then maximise that using total impersonal preference utilitarianism.

This requires a very high degree of confidence that we have found the correct moral theory, including resolving all paradoxes in population ethics like the Repugnant Conclusion.

Distilled Human Preferences

Taken from IDA. Attempt to ‘distil out’ the relevant preferences of a human or group of humans, by imitation learning followed by capability amplification, thus only preserving those preferences that survive amplification.

Repeat this process until we have a superintelligent agent that has the distilled preferences of a human. This subset of the original human’s preferences, suitably amplified, defines value.

Note that specific choices about how the deliberation and amplification processes play out will embody different value definitions. As two examples, the IDA could model either the full and complete preferences of the Human using future Inverse Reinforcement Learning methods, or it could model the likely instructions of a ‘human-in-the-loop’ offering low-resolution feedback - these could result in quite different outcomes.

Coherent Extrapolated Volition / Christiano’s Indirect Normativity

Both Christiano’s formulation of Indirect Normativity and the CEV define value as the endpoint of a value idealization and extrapolation process with as many free parameters as possible.

Predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge.

Moral Realism

Have the AI determine the correct normative ethical theory, whatever that means, and then act according to that.

'Do What I Mean'

'Be Nice'

I have tried to place these different definitions of value in order from the most to least normatively direct. In the most direct case, we define the utility function ourselves. Less direct than that is defining a rigid normative framework within which the AGI learns our preferences. Then, we could consider letting the AGI also decide which normative frameworks to use.

Much less direct, we come to deliberation-based methods or methods which define value as the endpoint of a specific procedure. Christiano’s Iterated Amplification and Distillation is supposed to preserve a particular subset of human values (those that survive a sequence of imitation and capability amplification). This is more direct than CEV because some details about the distillation procedure are given. Less direct still is Yudkowsky’s CEV, because CEV merely places its value as the endpoint of some sufficiently effective idealisation and convergence procedure, which the AGI is supposed to predict the result of, somehow. Beyond CEV, we come to ‘methods’ that are effectively meaningless.
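The ordering described here can be written down as a small data structure. The following Python is only an illustrative sketch: the class `ValueDefinition`, the helper names and the integer ranks are hypothetical conveniences, not anything taken from Bostrom, Christiano or Yudkowsky. The ranks are purely ordinal (earlier in the list means more normatively direct) and say nothing about how far apart two definitions are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueDefinition:
    name: str
    rank: int  # ordinal position only: 1 = most normatively direct


# Order copied from the list of Value Definitions in this post.
# The numeric ranks are an illustrative assumption, not a measurement.
NAMES = [
    "Hardwired Utility Function",
    "Ambitious Learned Value Function",
    "Distilled Human Preferences",
    "Coherent Extrapolated Volition / Christiano's Indirect Normativity",
    "Moral Realism",
    "Do What I Mean",
    "Be Nice",
]

SCALE = [ValueDefinition(name, i) for i, name in enumerate(NAMES, start=1)]
BY_NAME = {d.name: d for d in SCALE}
CEV = BY_NAME["Coherent Extrapolated Volition / Christiano's Indirect Normativity"]


def more_direct(a: ValueDefinition, b: ValueDefinition) -> bool:
    """True if `a` sits closer to the direct end of the scale than `b`."""
    return a.rank < b.rank


def effectively_meaningless(d: ValueDefinition) -> bool:
    """Encodes the claim that anything beyond CEV is too vague to be useful."""
    return d.rank > CEV.rank


if __name__ == "__main__":
    for d in SCALE:
        flag = " (effectively meaningless)" if effectively_meaningless(d) else ""
        print(f"{d.rank}. {d.name}{flag}")
```

A single integer rank is a deliberate simplification: it captures the ordering but not the size of the gaps between definitions, which is the point taken up later in the post.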


Here I briefly summarise the considerations that push us to accept more or less normatively direct theories. Epistemic Deference and Conservatism were taken from Bostrom (2014), while Well-definedness and Divergence were taken from Armstrong.

Epistemic Deference: Less direct value definitions defer more reasoning to the superintelligence, so assuming the superintelligence is intent-aligned and capable, there are fewer opportunities for mistakes by human programmers. Epistemic Deference effectively rules out direct specification of values, on the grounds that we are effectively guaranteed to make a mistake resulting in misalignment.

Well-definedness: Less direct value definitions require greater capabilities to implement, and are also less well-defined in the research directions they suggest for how to construct explicit procedures for capturing the definition. Direct utility specification is something we can do today, while CEV is currently under-defined.

Armstrong argues that our value definition must eventually contain explicit criteria for what ‘human values’ are, rather than the maximal normative indirectness of handing over judgments about what values are to the AGI - ‘The correct solution is not to assess the rationality of human judgements of methods of extracting human values. The correct solution is to come up with a better theoretical definition of what human values are.’

Conservatism: More direct theories will result in more control over the future by the programmers. This could be either good or bad depending on your normative ethical views and political considerations at the time the AI is developed.

For example, Bostrom states that in a scenario where the morally best outcome includes reordering all matter to some optimal state, we might want to turn the rest of the universe over to maximising moral goodness but leave an exception for Earth. This would involve more direct specification.

Divergence: If you are a strong externalist realist (believing that moral truth exists but might not be easily found or motivating) then you will want to take direct steps to mandate this. If the methods that are designed to extract human preferences diverge strongly in what they mandate, we need a principled procedure for choosing between them, based on what actually is morally valuable. More normatively direct methods provide a chance to make these moral judgement calls.


I have provided two main concepts which I think are useful for judging nontechnical AI Safety proposals: the Value Definition Problem, and the Scale of Normative Directness together with the considerations that affect positioning on it. I consider both to be reframings of previous work, mainly done by Bostrom and Armstrong.

I also note that, on the Scale of Directness, there is quite a large gap between a very indirect method like CEV, and the extremely direct methods like ambitious value learning.

‘Ambitious Value Learning’ defines value using a specific, chosen-in-advance consequentialist normative ethical theory (which tells us how to aggregate and weight different interests) that we then use an AI to specify in more detail, using observations of humans’ revealed preferences.

Christiano says of methods like CEV, which aim to extrapolate what I ‘really want’ far beyond what my current preferences are: ‘most practitioners don’t think of this problem even as a long-term research goal — it’s a qualitatively different project without direct relevance to the kinds of problems they want to solve’. This is effectively a statement of the Well-definedness consideration when sorting through value definitions - our long-term ‘coherent’ or ‘true’ preferences currently aren’t well understood enough to guide research, so we need to restrict ourselves to more direct normativity - extracting the actual preferences of existing humans.

After CEV, the next most ‘direct’ method, Distilled Human Preferences (the definition of value used in Christiano’s IDA), is still far less direct than ambitious value learning, eschewing all assumptions about the content of our values and placing only some restrictions on their form. Since not all of our preferences will survive the amplification and distillation processes, the hope is that the morally relevant ones will - even though as yet we do not have a good understanding of how durable our preferences are and which ones correspond to specific human values.

This vast gap in directness suggests a large range of unconsidered value definitions that attempt to ‘defer to the Superintelligence’s opinion’ not whenever possible but only sometimes.

Armstrong has already claimed we must do much more work in defining what we mean by human values than the more indirect methods like IDA/CEV suggest when he argued, ‘The correct solution is not to assess the rationality of human judgements of methods of extracting human values. The correct solution is to come up with a better theoretical definition of what human values are.’

I believe that we should investigate ways to incorporate our high-level judgements about which preferences correspond to ‘genuine human values’ into indirect methods like IDA, making the indirect methods more direct by rigidifying parts of the deliberation or idealization procedure - but that is for a future post.

6 comments

Planned summary:

This post considers the Value Definition Problem: what should our AI system <@try to do@>(@Clarifying "AI Alignment"@), to have the best chance of a positive outcome? It argues that an answer to the problem should be judged based on how much easier it makes alignment, how competent the AI system has to be to optimize it, and how good the outcome would be if it was optimized. Solutions also differ on how "direct" they are -- on one end, explicitly writing down a utility function would be very direct, while on the other, something like Coherent Extrapolated Volition would be very indirect: it delegates the task of figuring out what is good to the AI system itself.

Planned opinion:

I fall more on the side of preferring indirect approaches, though by that I mean that we should delegate to future humans, as opposed to defining some particular value-finding mechanism into an AI system that eventually produces a definition of values.

I appreciate the summary, though the way you state the VDP isn't quite the way I meant it.

what should our AI system <@try to do@>(@Clarifying "AI Alignment"@), to have the best chance of a positive outcome?

To me, this reads like, 'we have a particular AI, what should we try to get it to do', whereas I meant it as 'what Value Definition should we be building our AI to pursue'. So, that's why I stated it as 'what should we aim to get our AI to want/target/decide/do' or, to be consistent with your way of writing it, 'what should we try to get our AI system to do to have the best chance of a positive outcome', not 'what should our AI system try to do to have the best chance of a positive outcome'. Aside from that minor terminological difference, that's a good summary of what I was trying to say.

I fall more on the side of preferring indirect approaches, though by that I mean that we should delegate to future humans, as opposed to defining some particular value-finding mechanism into an AI system that eventually produces a definition of values.

I think your opinion is probably the majority opinion - my major point with the 'scale of directness' was to emphasize that our 'particular value-finding mechanisms' can have more or fewer degrees of freedom, since from a certain perspective 'delegate everything to a simulation of future humans' is also a 'particular mechanism' just with a lot more degrees of freedom, so even if you strongly favour indirect approaches you will still have to make some decisions about the nature of the delegation.

The original reason that I wrote this post was to get people to explicitly notice the point that we will probably have to do some philosophical labour ourselves at some point, and then I discovered Stuart Armstrong had already made a similar argument. I'm currently working on another post (also based on the same work at EA Hotel) with some more specific arguments about why we should construct a particular value-finding mechanism that doesn't fix us to any particular normative ethical theory, but does fix us to an understanding of what values are - something I call a Coherent Extrapolated Framework (CEF). But again, Stuart Armstrong anticipated a lot (but not all!) of what I was going to say.

To me, this reads like, 'we have a particular AI, what should we try to get it to do'

Hmm, I definitely didn't intend it that way -- I'm basically always talking about how to build AI systems, and I'd hope my readers see it that way too. But in any case, adding three words isn't a big deal, I'll change that.

(Though I think it is "what should we get our AI system to try to do", as opposed to "what should we try to get our AI system to do", right? The former is intent alignment, the latter is not.)

even if you strongly favour indirect approaches you will still have to make some decisions about the nature of the delegation

In some abstract sense, certainly. But it could be "I'll take no action; whatever future humanity decides on will be what happens". This is in some sense a decision about the nature of the delegation, but not a huge one. (You could also imagine believing that delegating will be fine for a wide variety of delegation procedures, and so you aren't too worried which one gets used.)

For example, perhaps we solve intent alignment in a value-neutral way (that is, the resulting AI system tries to figure out the values of its operator and then satisfy them, and can do so for most operators), and then every human gets an intent aligned AGI, this leads to a post-scarcity world, and then all of the future humans figure out what they as a society care about (the philosophical labor) and then that is optimized.

Of course, the philosophical labor did eventually happen, but the point is that it happened well after AGI, and pre-AGI nothing major needed to be done to delegate to the future humans.

The scenario where every human gets an intent-aligned AGI, and each AGI learns their own particular values would be a case where each individual AGI is following something like 'Distilled Human Preferences', or possibly just 'Ambitious Learned Value Function' as its Value Definition, so a fairly Direct scenario. However, the overall outcome would be more towards the indirect end - because a multipolar world with lots of powerful Humans using AGIs and trying to compromise would (you anticipate) end up converging on our CEV, or Moral Truth, or something similar. I didn't consider direct vs indirect in the context of multipolar scenarios like this (nor did Bostrom, I think) but it seems sufficient to just say that the individual AGIs use a fairly direct Value Definition while the outcome is indirect.

Possibly related but with a slightly different angle, you may have missed my work on trying to formally specify the alignment problem, which is pointing to something similar but arrives at somewhat different results.

Thanks for pointing that out to me; I had not come across your work before! I've had a look through your post and I agree that we're saying similar things. I would say that my 'Value Definition Problem' is an (intentionally) vaguer and broader question about what our research program should be - as I argued in the article, this is mostly an axiological question. Your final statement of the Alignment Problem (informally) is:

A must learn the values of H and H must know enough about A to believe A shares H’s values

while my Value Definition Problem is

“Given that we are trying to solve the Intent Alignment problem for our AI, what should we aim to get our AI to want/target/decide/do, to have the best chance of a positive outcome?”

I would say the VDP is about what our 'guiding principle' or 'target' should be in order to have the best chance of solving the alignment problem. I used Christiano's 'intent alignment' formulation but yours actually fits better with the VDP, I think.