I’m Jose. I realized recently I wasn’t taking existential risk seriously enough, and in April, a year after I first applied, I started running a MIRIx group at my college. I’ll write summaries of the sessions that I thought were worth sharing. Most of the members are very new to FAI, so this will partly be an incentive to push upward and partly my own review process. Hopefully some of this will be helpful to others.
This one focuses on how aligning creator intent with the base objective of an AI might not be enough for outer alignment, starting with an overview of Coherent Extrapolated Volition and its flaws. This was created in collaboration with Jacob Abraham and Abraham Francis.
Coherent Extrapolated Volition
From the wiki,
In calculating CEV, an AI would predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI's utility function.
In other words, CEV is an AI having not only a precise model of human values, but also the meta understanding of how to resolve contradictions and incompleteness in those values in a friendly way.
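The "iterate the prediction and keep what converges" step in the quoted description can be sketched in a few lines. This is only a toy under loud assumptions: each person's values are collapsed to a single number, and both update rules (`move_toward_mean` and `swap_with_neighbour`) are hypothetical stand-ins invented for illustration, not anything the CEV proposal specifies. The only point is that whether the iteration settles depends entirely on the extrapolation dynamic.

```python
from typing import Callable, List, Optional

Values = List[float]  # one number per person: a toy stand-in for a "desire"


def extrapolate_until_stable(
    values: Values,
    step: Callable[[Values], Values],
    tol: float = 1e-6,
    max_iters: int = 1000,
) -> Optional[Values]:
    """Apply `step` repeatedly; return the stable values, or None if none is reached."""
    for _ in range(max_iters):
        new_values = step(values)
        # Stop once no person's value moved by more than `tol` in this round.
        if max(abs(a - b) for a, b in zip(values, new_values)) < tol:
            return new_values
        values = new_values
    return None


def move_toward_mean(values: Values) -> Values:
    """Hypothetical dynamic: everyone shifts halfway toward the group average."""
    mean = sum(values) / len(values)
    return [v + 0.5 * (mean - v) for v in values]


def swap_with_neighbour(values: Values) -> Values:
    """Hypothetical dynamic: each person adopts the next person's view, cyclically."""
    return values[1:] + values[:1]


start = [0.0, 1.0, 4.0]
converged = extrapolate_until_stable(start, move_toward_mean)
cycling = extrapolate_until_stable(start, swap_with_neighbour)
print("move_toward_mean:", converged)
print("swap_with_neighbour:", cycling)
```

Under the first rule everyone ends up near the group average; under the second the views rotate forever and the function returns `None`. Real value extrapolation is nothing like either rule, so this shows the shape of the procedure, not its behaviour.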
This line of research was considered obsolete by Eliezer, due to the problems it runs into - some of which make it appear like the proposal itself only shifts the pointers to the goals. In the time we spent discussing it, we ended up with a (most likely not comprehensive) list of the major flaws of CEV.
- Goodhart’s Law. CEV doesn’t address proxy gaming (of which wireheading is one form), where an AI maximizes a proxy utility function instead of the intended utility function. For example, without a sufficiently descriptive embedded understanding of what it means to be “more the people we wished we were” or even “smarter”, the AI is incentivized to exploit that ambiguity to produce the cheapest outcomes that technically qualify. In effect, you’re giving a sufficiently powerful AI limited, indirect write access to the values it will implement. The obvious solution is a formal framework for these properties, but building one might just be a restatement of the general alignment problem.
However, CEV can be viewed as a proposal of what to do with AGI after problems like Goodhart’s Law have been solved.
- Computational Costs. Brute forcing extended high-fidelity simulations of all the humans that have ever lived in an attempt to formulate CEV will probably be too expensive for any first-generation AGI. It’ll definitely be too expensive to be a viable option in a competitive scenario. Possible mitigations include abstracting away details to make it more cost-efficient (which exacerbates the Goodhart’s Law problem), and extrapolating only a subset of the human population, which runs into philosophical dilemmas.
- Value Convergence. CEV works under the assumption that at the limit, human values will converge to an equilibrium. This need not necessarily be the case, and at the very least, there’s enough breathing room for the alternative that we can’t be assured of its safety. Some variants that have been proposed remedy this, but run into issues of their own.
- Ideal Advisor Theory. CEV shares much in common with the Ideal Advisor theories of philosophy. Some of the criticisms raised against those theories by Sobel (1994) are also applicable to vanilla CEV, such as:
- The views held by a perfectly rational and fully informed human may change over time; i.e., values need not converge on a reasonable time scale. Mitigations to this come with their own bag of issues, such as privileging a single perspective at a predetermined stopping point, or managing a complex trade-off of values between different voices.
- Under the assumptions that some lives can only be evaluated if they are experienced, and that for those experiences to be unbiased you’d have to start all of them from the same “blank slate”, Sobel proposes an amnesia model of agent experience as the most plausible implementation. Under this model, agents undergo all lives sequentially, but have their memories extracted and wiped after each one. These memories are returned to them at the very end.
Sobel considers two objections to the amnesia model:
- The perspective of the person at the end, informed by all the lives they now remember, would be different from the perspective of their self living that life, so properly evaluating those memories may be beyond them.
- This entire process might drive the agent insane. “Idealizing” them to a point where that isn’t true might leave them different enough from the original to not qualify as the same person.
- Extrapolated models may be so much more advanced than their non-extrapolated selves that they can no longer objectively weigh the life and well-being of the latter. In an extreme case, they might value their original selves the way we value less sentient forms of life. In a more moderate case, they might judge that life the way we might judge our own life’s worth after an accident causing serious mental damage - there are some who would consider death a viable alternative to that.
This particular objection seems unlikely to me, but not obviously implausible.
- There are variants of CEV - such as Bostrom’s parliamentary model - that resolve some of these criticisms (the parliamentary model solves all but the last of Sobel’s arguments). However, they run into new problems of their own. To me this isn’t solid or even convincing proof that good variants can’t exist, but it serves as precedent for new issues forming when old ones are avoided.
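To make the Goodhart’s Law item in the list above concrete, here is a minimal sketch of a proxy that tracks the intended utility at low optimization power and comes apart from it at high optimization power. Both utility functions and the `search_limit` parameter are hypothetical, chosen only so the divergence is easy to see; nothing here models CEV itself.

```python
def true_utility(x: float) -> float:
    """What we actually want (hypothetical): more is better only up to x = 5."""
    return x - x * x / 10.0


def proxy_utility(x: float) -> float:
    """What we managed to specify (hypothetical): rises with x, like the true utility does for small x."""
    return x


def optimize_proxy(search_limit: int) -> int:
    """Pick the integer option in [0, search_limit] that scores highest on the proxy."""
    return max(range(search_limit + 1), key=proxy_utility)


results = {}
for search_limit in (2, 5, 20):
    choice = optimize_proxy(search_limit)
    results[search_limit] = (choice, true_utility(choice))
    print(
        f"limit={search_limit:>2}  choice={choice:>2}  "
        f"proxy={proxy_utility(choice):5.1f}  true={true_utility(choice):5.1f}"
    )
```

With a small search space the proxy-maximizing choice is also good by the true utility. Once the optimizer can reach x = 20, the proxy score keeps rising while the true utility has fallen to -20, which is the sense in which a more capable optimizer makes an underspecified target worse.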
I chose to begin with CEV because it specifically addresses the possibility that human values, taken as they are, might not be enough.
Inner and Outer Alignment
There is little consensus on a definition for the entire alignment problem, but a large part of it is intent alignment, i.e. making sure the AI does what its programmers want it to. Intent alignment has two components: inner and outer alignment.
Inner Alignment is about making sure the AI actually optimizes what our reward function specifies. The reward function is the base objective: the objective the training process uses to search for models. But the model that search finds may itself be an optimizer, pursuing a proxy objective that is easier to optimize and does the job fairly well during training. This proxy is the mesa objective. Think of evolution, where the base objective is reproductive fitness, while the mesa objective includes heuristics like pain aversion, status signaling, etc. Inner Alignment is aligning the mesa objective with the base objective.
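The gap between a base objective and a proxy can be shown with a toy corridor. This is only a sketch under stated assumptions: the environment, the `seek_exit` and `always_right` policies, and all names are hypothetical, and no training or inner optimization actually happens here. The hard-coded `always_right` policy merely stands in for a proxy that a search process could plausibly settle on, because it is indistinguishable from the intended behaviour on the training episodes.

```python
from typing import Callable, Sequence

Policy = Callable[[int, int], int]  # (position, exit_position) -> step of -1 or +1


def base_objective(
    policy: Policy, exits: Sequence[int], start: int = 5, max_steps: int = 10
) -> float:
    """Fraction of episodes in which the policy reaches the exit."""
    successes = 0
    for exit_position in exits:
        position = start
        for _ in range(max_steps):
            if position == exit_position:
                break
            position += policy(position, exit_position)
        successes += position == exit_position
    return successes / len(exits)


def seek_exit(position: int, exit_position: int) -> int:
    """Intended behaviour: step toward wherever the exit is."""
    return 1 if exit_position > position else -1


def always_right(position: int, exit_position: int) -> int:
    """Proxy behaviour: ignore the exit and always step right."""
    return 1


train_exits = [10, 10, 10, 10]  # during training the exit is always on the right
test_exits = [10, 0, 10, 0]     # at test time it is sometimes on the left

train_scores = (base_objective(seek_exit, train_exits),
                base_objective(always_right, train_exits))
test_scores = (base_objective(seek_exit, test_exits),
               base_objective(always_right, test_exits))
print("train (seek_exit, always_right):", train_scores)
print("test  (seek_exit, always_right):", test_scores)
```

Both policies score perfectly on the training episodes, so the base objective alone cannot tell them apart. Only when the exit moves does the proxy fail, on exactly the episodes where "go right" and "reach the exit" disagree.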
Outer Alignment is about making sure we like the reward function we’re training the AI for. That is, if we had a model that solves inner alignment and was actually optimizing for the objective it’s given, would we like that model? This is the centre of much of classical alignment discussion (the paperclip AI thought experiment, for example).
Recall that CEV addresses the possibility that aligning our intent with the base objective is insufficient: a model that optimizes an objective we like can still fail in the limit as it runs into inconsistencies or other problems in our value systems. Resolving these problems in a friendly way may be beyond a base human or human model at test time. This is far from certain, but I think it is probable enough in at least a few instances to be a problem.
Note: Epistemic status on the following is speculative at best, and is based on what posts and papers we could read in the time we had.
Based on my limited understanding of Outer Alignment, it doesn’t include a formalization of aligning AI with the values we would hold at the limit. Some of the proposals we looked at also ran into this problem.
Imitative amplification, for example, relies on a model that tries to imitate a human with access to the model. With oversight using transparency tools to account for deceptive or other harmful behaviour, it is plausibly outer aligned. However, a base human may not be able to reliably resolve, in a friendly way, the contradictions and inconsistencies they would face at the limit. That’s fairly uncharted territory, and might involve the human model diverging too far from the human template. I don’t think the oversight would be of much help here either, because these problems need not come up as early as training time. It’s also possible that nearly any sort of resolution would seem misaligned to us.
Some proposals bypass this problem altogether, but at terrible cost. STEM AI, for example, avoids value modelling entirely, but does so by ignoring the class of use cases where those would be relevant.
It’s possible that we wouldn’t need to worry about this problem at all. Perhaps these inconsistencies will be addressed during training, or, instead of resolving them, the AI could treat them as new value axioms. But while the former may even be the likely scenario, the alternative still holds a distinct probability, especially in realistic scenarios where training until hypothetical future value conflicts are resolved isn’t competitive. Treating inconsistencies as new axioms could well be dangerous, and might not even solve the core problem, because each added axiom could give rise to new inconsistencies in an endless chain, in Gödelian fashion.
Endnote: I hesitated for a while before posting this because it felt like something that must have been addressed already. I didn’t find much commenting on this in any of the posts we went through though, so I just peppered this with what was possibly an irksome number of uncertainty qualifiers. Whatever we got wrong, tell us.