Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

My read of the AI safety space is that there are currently two major approaches to AGI alignment being researched: agent foundations and agent training. We can contrast them in part by their ultimate goals. The agent foundations program aims to figure out how to make an AGI, possibly from scratch, that will align itself with human values even if humans don’t know what their values are or feed the AGI the “wrong” values. The agent training program aims to figure out how to teach human values to an AGI, and really to any sufficiently capable agent. Having now formally stated the AI alignment problem, we can ask to what extent each program stands to satisfy the technical requirements for alignment.

To refresh your memory, I presented AI alignment as the problem of ensuring that an agent adopts the values of humanity and that human operators know enough about the agent to believe it shares humanity’s values. For shorthand we can call these two properties value alignment and believable alignment, respectively. Of particular note is that alignment is not possible unless both values are aligned and that alignment is believable, since one without the other allows failure cases like trivial alignment, as in the paperclip maximizer that is trivially aligned by having no model of human values at all, and the treacherous turn, where an agent appears to be aligned but will express unaligned behaviors after it is too powerful to stop. Specific instances of these properties are well known through thought experiments about how AI alignment might fail, but the formalization as two categorical properties is new, so it’s against these two properties that we assess each approach to alignment.

Starting with the older of the two approaches: although it was only in 2014 that the agent foundations approach coalesced enough to have an agenda, its threads stretch back to the turn of the 21st century and the earliest attempts to address alignment, then primarily known as the problem of building Friendly AI. MIRI is the clear champion of this approach, with support from researchers at or funded through FLI, FHI, CHAI, and other groups. We might summarize the agent foundations approach to alignment as starting from the assumption that we need to build a rational agent if we want to have any hope that it will be aligned, and then, having built a rational agent, instilling in it the goal of aligning itself with human values. So, to what extent does the agent foundations agenda stand to satisfy value alignment and believable alignment?

The agent foundations program is clearly focused on believability at a deep level. Much of the research it has produced is about understanding how to design an agent with features that will make it believable, like being rational and corrigible, and obtaining mathematical proofs of conditions for satisfying those properties. Although it is currently focused on specific subproblems within believability, it shows signs of addressing believability wholesale as it builds up results that allow it to do so. My only concern is that the focus on rational agents may be too narrow, since any real system is finite and so computationally limited to achieving at most bounded rationality. Still, I believe addressing rational agents first is a reasonable strategy, since they are easier to reason about and aligning agents in general necessarily requires the ability to align rational agents.

It’s less clear how much the agent foundations program is focused on value alignment. Some work has been done to consider how an agent can be taught human values, and Stuart Armstrong has explored reinforcement learning approaches to value alignment, but my read is that agent foundations researchers view value alignment as a problem that will mostly be solved by AGI itself. For this reason work on Vingean reflection is likely to be critical to achieving value alignment within the agent foundations program since an agent may be created with only the initial intention to align itself with human values and much of the work of doing that will be left up to the agent.

Thus my overall assessment is that the agent foundations program is on course to address believable alignment (at least for rational agents), and that it is compatible with value alignment but has so far underspecified how it will address it.

The agent training program is younger and seems to take its inspiration from recent advances in using inverse reinforcement learning to train deep learning models, with the goal of training agents to be aligned. Much of the work in this space is currently being championed by Paul Christiano, but may be catching on with researchers at OpenAI, DeepMind, Google Brain, and other groups actively building deep learning systems. Since this approach is newer and less congealed, I will be relying primarily on Christiano’s writing to examine it.

Agent training is, as the name perhaps suggests, primarily focused on value alignment, with the leading approach being one of agent amplification, where weaker agents are trained into increasingly better aligned agents. More broadly, though, agent training seeks to find schemes that are robust to adversarial influence, competitive with building unaligned AGI, and scalable to sudden increases in agent capabilities. In this respect the agent training approach pays particular attention to the realities of building aligned AGI in a world where capabilities research outstrips alignment research, as in the case of an AI race, which may necessitate trying something to align AGI rather than not attempting to align AGI at all.

Unfortunately the agent training agenda seems unlikely to be able to adequately address believable alignment. This is not to say that Christiano has not considered issues of believability or that agent training is actively opposed to believable AGI; rather, it is a problem with trying to achieve alignment through training alone rather than through agent design. Specifically, such an approach subjects the properties of believability to Goodhart’s Law, making it difficult to ensure that believability itself is being trained for rather than the appearance of believability. This is further complicated because believability is needed as a counterbalance against Goodharting in value learning, so lacking reliable believability leaves value alignment itself vulnerable to subtle errors. For further evidence cf. the proven inability to reliably train human values, i.e. the unsolved human alignment problem.
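To make the Goodhart concern concrete, here is a toy simulation. It is only an illustrative sketch under assumptions of my own and is not drawn from Christiano’s writing or from any actual training scheme: each hypothetical agent has an unobservable "actual" believability score, the trainer only sees a "measured" score that also includes an unrelated component standing in for the mere appearance of believability, and selection keeps the agents that score highest on the measurement. The function name, the parameters, and the Gaussian distributions are all made up for the illustration.

```python
import random


def simulate_selection(n_agents=1000, top_k=10, seed=0):
    """Toy Goodhart illustration; all quantities are hypothetical.

    Each agent is a pair (actual, measured), where `measured` is what a
    trainer can observe and `actual` is the property we really care about.
    """
    rng = random.Random(seed)
    agents = []
    for _ in range(n_agents):
        actual = rng.gauss(0, 1)      # assumed "real" believability
        appearance = rng.gauss(0, 1)  # assumed component unrelated to it
        agents.append((actual, actual + appearance))

    # What training can do: select on the observable measurement.
    by_measured = sorted(agents, key=lambda a: a[1], reverse=True)[:top_k]
    # What we wish we could do: select on the property itself.
    by_actual = sorted(agents, key=lambda a: a[0], reverse=True)[:top_k]

    def mean(xs):
        return sum(xs) / len(xs)

    return {
        "measured_of_selected": mean([a[1] for a in by_measured]),
        "actual_of_selected": mean([a[0] for a in by_measured]),
        "best_possible_actual": mean([a[0] for a in by_actual]),
    }


if __name__ == "__main__":
    for name, value in simulate_selection().items():
        print(f"{name}: {value:.2f}")
```

Selecting on the measurement can never do better on the actual property than selecting on the property directly, and the harder the selection pressure, the more of the selected agents’ measured score tends to come from the appearance component. The sketch says nothing about how large this gap is for any real system; it only shows the shape of the problem.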

I want to emphasize, though, that this does not mean I think agent training can’t work, only that it has reliability issues. That is, I expect it is possible to create an aligned AGI via the agent training program, but with a known chance of producing malign AGI. Thus my overall assessment is that the agent training program may be able to address value alignment, especially under conditions of competitive AGI development, but is fatally flawed in its ability to address believable alignment, which consequently makes it less reliable at addressing value alignment.

Despite the limitations of the agent training approach, I think there is a lot of opportunity for synergy between it and agent foundations, specifically by using the agent foundations approach to design believable agents and using amplification or another (inverse) reinforcement learning scheme to align values. In particular, a scheme like amplification may make it possible to get aligned AGI by starting with a simple but robustly believable agent and, through iterated training, letting it build itself up into full alignment. This may even make creating aligned AGI competitive with creating malign agents, but that will depend heavily on details not yet worked out.

This analysis also suggests a way in which new approaches to AGI alignment might innovate over existing programs: by better balancing and considering the need for both value alignment and believable alignment. Specifically, the existing programs were developed without a formal framework for what the AGI alignment problem is, so they were forced to explore and find their way using only an intuitive notion of “align AI with human values” as an objective. With a formal statement of the problem, though, we may now approach solutions more rigorously and with less uncertainty about whether they might work in theory, guiding us towards a more robust practice of alignment methods.
