Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I’m Jose.  I realized recently I wasn’t taking existential risk seriously enough, and in April, a year after I first applied, I started running a MIRIx group in my college.  I’ll write summaries of the sessions that I thought were worth sharing.  Most of the members are very new to FAI, so this will partly be an incentive to push upward and partly my own review process.  Hopefully some of this will be helpful to others.

This one focuses on how aligning creator intent with the base objective of an AI might not be enough for outer alignment, starting with an overview of Coherent Extrapolated Volition and its flaws.  This was created in collaboration with Jacob Abraham and Abraham Francis.

Coherent Extrapolated Volition

From the wiki,

In calculating CEV, an AI would predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI's utility function. 

In other words, CEV is an AI having not only a precise model of human values, but also the meta understanding of how to resolve contradictions and incompleteness in those values in a friendly way.  

This line of research was considered obsolete by Eliezer, due to the problems it runs into - some of which make it appear like the proposal itself only shifts the pointers to the goals.  In the time we spent discussing it, we ended up with a (most likely not comprehensive) list of the major flaws of CEV.


  1. Goodhart’s Law.  CEV doesn’t address the issue of proxy gaming (wireheading being one instance), where an AI maximizes a proxy utility function instead of the intended utility function.  For example, without a sufficiently descriptive embedded understanding of what it means to be “more the people we wished we were” or even “smarter”, the AI is incentivized to exploit that ambiguity to create the cheapest outcomes.  You’re giving a sufficiently powerful AI limited indirect write access to the values it will implement.  The obvious solution is a formal framework for these properties, which might just be rephrasing the general alignment problem.
    However, CEV can be viewed as a proposal of what to do with AGI after problems like Goodhart’s Law have been solved.
  2. Computational Costs.  Brute forcing extended high-fidelity simulations of all the humans that have ever lived in an attempt to formulate CEV will probably be too expensive for any first-generation AGI.  It’ll definitely be too expensive to make it a viable option in a competitive scenario.  Solutions to these include abstracting the details to make it more cost-efficient (which exacerbates Goodhart’s Law), and choosing a subset of the human population to extrapolate, which runs into philosophical dilemmas.
  3. Value Convergence.  CEV works under the assumption that at the limit, human values will converge to an equilibrium.  This need not necessarily be the case, and at the very least, there’s enough breathing room for the alternative that we can’t be assured of its safety.  Some variants that have been proposed remedy this, but run into issues of their own.
  4. Ideal Advisor Theory.  CEV shares much in common with the Ideal Advisor theories of philosophy.  Some of the criticisms raised against it by Sobel (1994) are also applicable to vanilla CEV, such as:
    • The views held by a perfectly rational and fully informed human may change over time; i.e., values need not converge on a reasonable time scale.  Mitigations to this come with their own bag of issues, such as privileging a single perspective at a predetermined stopping point, or managing a complex trade-off of values between different voices.
    • Under the assumptions that some lives can only be evaluated if they are experienced, and that for those experiences to be unbiased you’d have to start all of them from the same “blank slate”, Sobel proposes an amnesia model of agent experience as the most plausible implementation.  Under this model, agents undergo all lives sequentially, but have their memories extracted and wiped after each one.  These memories are returned to them at the very end.  
      Sobel considers two objections to the amnesia model:
      • The perspective of the person at the end, informed by all the lives they now remember, would be different from the perspective of their self living that life, so properly evaluating those memories may be beyond them.
      • This entire process might drive the agent insane.  “Idealizing” them to a point where that isn’t true might leave them different enough from the original to not qualify as the same person.
    • Extrapolated models may be so much more advanced than their non-extrapolated selves that they can no longer objectively weigh the life and well-being of the latter.  In an extreme case, this could end up being the same way in which we value lesser sentient forms of life.  In a more moderate case, the way we might judge our life’s worth after an accident causing serious mental damage - there are some who would consider death as a viable alternative to that.  
      This particular objection seems unlikely to me, but not obviously implausible.
  • There are variants to CEV - such as Bostrom’s parliamentary model - that resolve some of these criticisms (the parliamentary model solves all but the last of Sobel’s arguments).  However, they run into new problems of their own.  To me, this isn’t solid or even convincing proof that good variants can’t exist, but it serves as precedent for new issues forming when old ones are avoided.
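The value-convergence worry in item 3 can be made concrete with a toy model.  This is only a sketch under strong assumptions: a "value" is a single number, one round of extrapolation is a fixed function, and both `reflective_step` and `oscillating_step` are hypothetical dynamics invented for illustration.  Neither is part of the CEV proposal.  It shows only that repeated extrapolation settles for some dynamics and cycles forever for others.

```python
from typing import Callable, Tuple


def extrapolate(step: Callable[[float], float], start: float,
                tol: float = 1e-9, max_steps: int = 10_000) -> Tuple[float, bool]:
    """Apply `step` repeatedly until successive values differ by less than `tol`.

    Returns the last value and whether a fixed point was reached.
    """
    value = start
    for _ in range(max_steps):
        nxt = step(value)
        if abs(nxt - value) < tol:
            return nxt, True
        value = nxt
    return value, False


def reflective_step(v: float) -> float:
    # Hypothetical dynamic: each round moves halfway toward a stable view at 2.0.
    return 0.5 * v + 1.0


def oscillating_step(v: float) -> float:
    # Hypothetical dynamic: each round of reflection reverses the previous one.
    return 1.0 - v


settled, settled_ok = extrapolate(reflective_step, start=0.0)
cycling, cycling_ok = extrapolate(oscillating_step, start=0.0)
print("reflective dynamic converged:", settled_ok)
print("oscillating dynamic converged:", cycling_ok)
```

Under the second dynamic the loop runs out of steps without settling, so any answer it returns depends on where you chose to stop.  That is the "predetermined stopping point" issue raised under item 4.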

I chose to begin with CEV because it specifically targets the possibility that human values themselves might not be enough.

Inner and Outer Alignment

There is little consensus on a definition for the entire alignment problem, but a large part, intent alignment, i.e. making sure the AI does what the programmers want it to, is composed of two components: inner and outer alignment.

Inner Alignment is about making sure the AI actually optimizes what our reward function specifies.  The reward function is the base objective: the objective the training process uses when searching for a model to implement.  But that search may settle on a model pursuing a proxy objective that is easier to optimize and does the job fairly well.  This proxy is the mesa objective (think of evolution, where the base objective is reproductive fitness, while the mesa objective includes heuristics like pain aversion, status signaling, etc.).  Inner Alignment is aligning the mesa objective with the base objective.
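The gap between base and mesa objectives can be sketched with a toy example.  Everything here is an assumption made for illustration: a one-dimensional corridor of 10 cells, a base objective equal to the negative distance to a goal, and a hypothetical learned heuristic (`proxy_policy`) that always steps right.  When the goal sits at the right end, the proxy scores exactly as well as the intended behaviour.  It only comes apart when the goal moves.

```python
from typing import Callable

Policy = Callable[[int, int, int], int]


def base_objective(position: int, goal: int) -> int:
    """Score the designers care about: negative distance to the goal."""
    return -abs(goal - position)


def intended_policy(position: int, goal: int, size: int) -> int:
    """Step one cell toward the goal."""
    if goal > position:
        return position + 1
    if goal < position:
        return position - 1
    return position


def proxy_policy(position: int, goal: int, size: int) -> int:
    """Hypothetical learned heuristic: always step right, ignoring the goal."""
    return min(position + 1, size - 1)


def rollout(policy: Policy, start: int, goal: int,
            size: int = 10, steps: int = 10) -> int:
    """Run a policy for a fixed number of steps and return its base-objective score."""
    position = start
    for _ in range(steps):
        position = policy(position, goal, size)
    return base_objective(position, goal)


# Scores are (intended policy, proxy policy), starting from cell 5.
train_scores = (rollout(intended_policy, 5, 9), rollout(proxy_policy, 5, 9))
shifted_scores = (rollout(intended_policy, 5, 0), rollout(proxy_policy, 5, 0))
print("goal on the right:", train_scores)
print("goal on the left: ", shifted_scores)
```

With the goal on the right, the base objective cannot tell the two policies apart, so a search guided by that score alone has no reason to prefer the intended one.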

Outer Alignment is about making sure we like the reward function we’re training the AI for.  That is, if we had a model that solves inner alignment and was actually optimizing for the objective it’s given, would we like that model?  This is the centre of much of classical alignment discussion (the paperclip AI thought experiment, for example).

Recall that what CEV addresses is the potential for aligning our intent with the base objective to be insufficient, that a model that optimizes an objective we like can still fail in the limit as it runs into inconsistencies or other problems with our value systems.  The friendly resolution of these problems may be beyond a base human or human model at test time; far from necessarily so, but I think with enough probability in at least a few instances to be a problem.

Insufficient Values

Note: Epistemic status on the following is speculative at best, and is based on what posts and papers we could read in the time we had.

Based on my limited understanding of Outer Alignment, it doesn’t include a formalization of aligning AI with the values we would hold at the limit.  Some of the proposals we looked at also ran into this problem.

Imitative amplification, for example, relies on a model that tries to imitate a human with access to the model.  With oversight using transparency tools to account for deceptive or other harmful behaviour, it is plausibly outer aligned.  However, a base human may not be able to reliably resolve in a friendly way the contradictions and inconsistencies it would face at the limit.  That’s fairly uncharted territory, and might involve the human model diverging from the human template too far.  I don’t think the oversight would be of much help here either, because it isn’t necessary that these problems would come up as early as training time.  It’s also possible nearly any sort of resolution would seem misaligned to us.

Some proposals bypass this problem altogether, but at terrible cost.  STEM AI, for example, avoids value modelling entirely, but does so by ignoring the class of use cases where those would be relevant.

It’s possible that we wouldn’t need to worry about this problem at all.  Perhaps it’ll be addressed during training time, or instead of resolving inconsistencies, the AI could account for them as new value axioms.  But while the former may even be the likely scenario, the alternative still holds a distinct probability, especially in realistic scenarios where training until hypothetical future value conflicts are resolved isn’t competitive.  Treating inconsistencies as new axioms could be dangerous, and might not even solve the core problem, because each added axiom could give rise to new inconsistencies in an endless chain, in Gödelian fashion.

Endnote: I hesitated for a while before posting this because it felt like something that must have been addressed already.  I didn’t find much commenting on this in any of the posts we went through though, so I just peppered this with what was possibly an irksome number of uncertainty qualifiers.  Whatever we got wrong, tell us.

15 comments

Mostly the newer posts that you're reading are not aiming to come up with The One True Encoding Of Human Values, which is why people don't talk about these problems in relation to them. Rather, the hope is to create an AI system that does the specific things we ask of it, but ensures that we remain in control (see discussion of "corrigibility"). Such an AI system need not know The One True Encoding Of Human Values, 

I don't understand why anybody would want anything that involved leaving humans in control, unless there were absolutely no alternative whatsoever.

I'm not joking or being hyperbolic; I genuinely don't get it. A lot of people seem to think that humans being in control is obviously good, but it seems really, really obvious to me that it's a likely path to horrible outcomes.

Humans haven't had access to all that much power for all that long, and we've already managed to create a number of conditions that look unstable and likely to go bad in catastrophic ways.

We're on a climate slide to who-knows-where. The rest of the environment isn't looking that good either. We've managed to avoid large-scale nuclear war for like 75 whole years after developing the capability, but that's not remotely long enough to call "stable". Those same 75 years have seen some reduction in war in general, but that looks like it's turning around as the political system evolves. Most human governments (and other institutions) are distinctly suboptimal on a bunch of axes, including willingness to take crazy risks, and, although you can argue that they've gotten better in maybe the last 100 to 150 years, a large number of them now seem to have stopped getting better and started getting worse. Humans in general are systematically rotten to each other, and most of the advancement we've gotten against that seems to come from probably unsustainable institutional tricks that limit anybody's ability to get the decisive upper hand.

If you gave humans control over more power, then why wouldn't you expect all of that to get even worse? And even if you could find a way to make such a situation stably not-so-bad, how would you manage the transition, where some humans would have more power than others, and all humans, including the currently advantaged ones, would feel threatened?

It seems to me that the obvious assumption is that humans being in control is bad. And trying to think out the mechanics of actual scenarios hasn't done anything to change that belief. How can anybody believe otherwise?

There's a difference between "AI putting humans in control is bad", and "AI putting humans in control is better than other options we seem to have for alignment." For many people, it may be as you mentioned:

I don't understand why anybody would want anything that involved leaving humans in control, unless there were absolutely no alternative whatsoever.

(I'm somewhat less pessimistic than you are, I think, but I agree it could go pretty damn poorly, for many ways the AI could "leave us in control.")

What TurnTrout said. What's the alternative to which you're comparing?

I don't have an alternative, and no I'm not very happy about that. I definitely don't know how to build a friendly AI. But, on the other hand, I don't see how "corrigibility" could work either, so in that sense they're on an equal footing. Nobody seems to have any real idea how to achieve either one, so why would you want to emphasize the one that seems less likely to lead to a non-sucky world?

Anyway, what I'm reacting to is this sense I get that some people assume that keeping humans in charge is good, and that humans not being in charge is in itself an unacceptable outcome, or at least weighs very heavily against the desirability of an outcome. I don't know if I've seen very many people say that, but I see lots of things that seem to assume it. Things people write seem to start out with "If we want to make sure humans are still in charge, then...", like that's the primary goal. And I do not think it should be a primary goal. Not even a goal at all, actually.

Nobody seems to have any real idea how to achieve either one

I think that's not true and we in fact have a much better idea of how to achieve corrigibility / intent alignment. (Not going to defend that here. You could see my comment here, though that one only argues why it might be easier rather than providing a method.)

Others will disagree with me on this.

humans not being in charge is in itself an unacceptable outcome, or at least weighs very heavily against the desirability of an outcome

The usual argument I'd give is "if humans aren't in charge, then we can't course correct if something goes wrong". It's instrumental, not terminal. If we ended up in a world like this where humans were not in charge, that seems like it could be okay depending on the details.

I consider the Arbital article on CEV the best reference for the topic. It says:

CEV is rather complicated and meta and hence not intended as something you'd do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)

So MIRI doesn't focus on CEV, etc. because the world hasn't nailed down step one yet. We're extremely worried that humanity's on track to fail step one; and it doesn't matter how well we do on step two if we don't pull off step one. That doesn't mean that stopping at step one and never shooting for anything more ambitious would be acceptable; by default I'd consider that an existential catastrophe in its own right.

Yeah, CEV itself seemed like a long shot - but my thought process was that maintaining human control wouldn't be enough for step one, both because I think it's not enough at the limit, and because the human component might inherently be a limiting factor that makes it not very competitive.  But the more I thought about it, the weaker that assumption of inherent-ness seemed, so I agree that the most this post could be saying is that the timeline gap between something like Task AGI and figuring out step 2 is short - which I expect isn't very groundbreaking.

Also, there's no proof that CEV would work. Maybe values are incoherent.

The Arbital article is no help.

Asking what everyone would want* if they knew what the AI knew, and doing what they’d all predictably agree on, is just about the least jerky thing you can do.

How do we know that they would agree? That just begs the question. Saying that you shouldn't be "jerky", i.e. selfish, doesn't tell you what kind of unselfishness to have instead. Clearly, the left and the right don't agree on the best kind of altruism -- laying down your life to stop the spread of socialism, versus sacrificing your income to implement socialism.

So if I'm understanding it correctly, it's that maintaining human control is the best option we can formally work toward?  The One True Encoding of Human Values would most likely be a more reliable system if we could create it, but that it's a much harder problem, and not strictly necessary for a good end outcome?

The One True Encoding of Human Values would most likely be a more reliable system if we could create it

My best guess is that this is a confused claim and I have trouble saying "yes" or "no" to it, but I do agree with the spirit of it.

but that it's a much harder problem, and not strictly necessary for a good end outcome?

Yes, in the same way that if you're worried about mosquitoes giving you malaria, you start by getting a mosquito net or moving to a place without mosquitoes, you don't immediately try to kill all mosquitoes in the world.

This is a totally fine thing to post :) I agree with most of the things, and agree about their importance.

I think the Goodhart's law side to CEV is more subtle. To be pithy, it's like it doesn't have a problem with Goodhart's law yet because it's not specific enough to even get noticed by Goodhart. If CEV hypothetically considers doing something bad, we can just reassure ourselves that surely that's not what our ideal advisor would have wanted. It's only once we pick a specific method of implementation that we have to confront in mechanistic detail what we could previously hide under the abstraction of anthropomorphic agency.

It's only once we pick a specific method of implementation that we have to confront in mechanistic detail what we could previously hide under the abstraction of anthropomorphic agency.

I agree.  I was trying to think of possible implementation methods, throwing out various constraints like computing power or competitiveness as it became harder to find any, and the final sticking point was still Goodhart's Law.  For the most part, I kept it in to give an example of the difficulty of meta-alignment (corrigibility in favourable directions).

Brute forcing extended high-fidelity simulations of all the humans that have ever lived in an attempt to formulate CEV will probably be too expensive for any first-generation AGI.

Prediction 1: that will never be a possibility, period, not just for a "first-generation" anything, but all the way out to the omega point. Not if you want the simulation to have enough fidelity to be useful for any practical purpose, or even interesting.

It probably won't even be possible, let alone cost effective, to do that for one person, since you'd have to faithfully simulate how the environment would interact with that person. Any set of ideas that relies on simulations like that is going to end up being useless.

I agree that this is probably true, but I wouldn't put it at > 90% probability for far-future AI.  With computing power greater than Jupiter brains, it probably still wouldn't be practical, but my point in thinking about it was that if it were possible to brute-force for first-generation AGI, then there's a chance for more efficient ways.