
Epistemic status: A core idea I've seen hinted at in many places, here presented and motivated explicitly.

This was produced during SERI MATS 3.0. Thanks to Vivek Hebbar for related discussion.

There has recently been pushback on the inner/outer alignment conceptualization: that it's confused, that it complicates the problem, or even that it's ill-defined or vacuous[1]. Paul already noted this ("[it doesn't] carve the problem space cleanly enough to be good research problems") when presenting his alternative low-stakes/high-stakes operationalization, which I've come to prefer. I think this is a particular case of a more general phenomenon:

Instead of dividing the interconnected parts of an abstracted, monolithic (de-contextualized) alignment problem ("align any actuator system with human values"), we should partition the set of concrete, different (contextualized) alignment problems ("devise a training setup that obtains good behavioral property X in scenario Y")[2].

If the former approach is indeed a mistake (in most research contexts), its first misstep is abstracting away too much information from the problem. How much to abstract is a delicate research decision, and indeed one of the cruxes between some mainstream research methodologies, as Richard mentions:

I currently think of Eliezer as someone who has excellent intuitions about the broad direction of progress at a very high level of abstraction - but where the very fact that these intuitions are so abstract rules out the types of path-dependencies that I expect solutions to alignment will actually rely on.

Indeed, abstracting too much will make the problem unsolvable by definition. If you want to solve "the abstract alignment problem" or "alignment in general" (that is, find a general method that solves all concrete instances of alignment problems), you need to deal with systems that attain arbitrarily high cognitive power / consequentialist depth (since we don't have physical bounds on this yet, or even really know what it means), and so you can always say "given alignment strategy X, the AI is powerful enough to predict it and counter it".

Well, that is not strictly true, since your scheme can causally affect the AI's shape before it can partake in such reasoning. But at a higher level of abstraction, and when taking into account the consequences of our selection pressure and acausal reasoning, high enough capabilities always allow for a failure mode of this form[3].

Broadly speaking, our alignment scheme must satisfy two constraints: being powerful enough to efficiently carry out the search (for a capable AI)[4], and not being so complex that humans can't specify the objective correctly. This is hard, because being powerful is correlated with being complex. And if we abstract away all remaining details, then indeed, we've just proved alignment unsolvable! But the two concepts are not literally equivalent, so the search space is not literally a line between those two extremes, and there might exist dependencies that clever tricks can exploit.

Of course, Eliezer knows this, and doesn't think in these terms. But something of the same flavor is going on when he stresses that all powerful cognition is essentially a reformulation of the same Bayes-structure, and that this structure has certain properties working against its alignment (for example, corrigibility being anti-natural for it). On this view, the only way to solve any relevant instance of the alignment problem is to have an understanding of minds general and precise enough to clinically perform very weird setups, which amounts to knowing how to solve many other instances of the alignment problem with different details [this is only an oversimplification of my interpretation from the 2021 dialogues].

I'd guess many people disagree because they don't buy the uber-centrality of Bayes-structure (equivalently, the big overlap between all alignment problems). I share the intuition that reality is messier and the space of minds less homogeneous, so that even if Eliezer's reasoning about information and action is locally correct, it might combine into very different minds, which will or won't be found by training depending on parameters that are very much empirical and that we can't currently predict[5].

Coming back to research methodology, if you expect such deep homogeneity, you might think the only way forward is tackling this abstract core (that is, all alignment problems at once, details abstracted away), and this indeed feels like MIRI's past agendas. But in this situation it's very hard to find useful compartmentalizations of the problem: for any such divide, there will be some concrete contexts (contextualized instances of the alignment problem) in which the divided parts interact non-trivially, and so your division breaks relevant interdependencies.

When you instead partition the set of alignment problems, you don't break any such interdependencies. You also gain more assumptions (context details) to work with (which trade off against the lost generality of your theorems). This indeed feels like ARC's approach: gradually pinning down which assumptions are relevant or necessary, as if reducing the whole problem to further subproblems (which usually involves changes in environment or task, not assumptions about how internal workings or failure modes look), and converging on a hard and sufficient kernel, like anomaly detection[6].

Note that this contextualization doesn't imply the research itself should be any less abstract. For example, we could argue low-stakes scenarios don't exist in real life (I think Rohin said this somewhere): given non-zero credence in FOOMing, booting any AGI anywhere is potentially catastrophic. This doesn't prevent them from being a useful compartmentalization: both low-stakes and high-stakes provide more assumptions than the original problem, and having a solution to both would point to a compositional solution to the main problem (in a way that inner/outer doesn't, due to its vagueness). Maybe trying to think up other useful partitions of the set of alignment problems is generally useful (this is somewhat equivalent to trying to think up further assumptions we are permitted to make).

All this can have consequences for research prioritization: to the extent that deep homogeneity between different alignment problems is correlated with the difficulty of any one of them, one could argue for theoretical contextualization as "Aiming your efforts at worlds where you have the biggest marginal impact".

This could also have consequences for forecasting. If contextualized theoretical research directions are indeed more fruitful (or, equivalently, if the deep homogeneity amongst problems fails), we'd expect future humans to be able to align modest systems before having a deep picture of how all minds work (if the difficulty bar is low enough that some relevant systems can be aligned), making dangerous multi-polar scenarios more likely. That is, we'd have solved enough small uni-polar problems for some people to deploy (and not die), but wouldn't have a deep enough understanding of these systems to securely set up a whole context-changing economy of AIs.

  1. ^

    Indeed, without a robust measure for the boundary between the AI and the environment, you can turn any alignment failure into an inner alignment failure: just add an innocuous outer shell / envelope to your dangerous AGI, and call that whole system your AI. We've changed the classification without changing the actual relevant problem, so the classification must depend on irrelevant variables.

  2. ^

    Buck's The alignment problem in different capability regimes is another particular example of this latter approach. John's pushback comment defends the commonality of an empirically-relevant abstract core amongst different contexts, similar to Eliezer's views (see below).

  3. ^

    For more intuition on this, consider Relaxed Adversarial Training as an example (a toy sketch of the loop appears at the end of this footnote). It's easy to conceptualize it as humans "attacking" the AI with pseudo-inputs, trying to get it to learn and internalize as part of its objective the concept "what humans actually want" before it groks another concept that perfectly explains those attacks and defends against them, such as "the structure shared by all of these attacks, which humans haven't noticed due to their biases". In the limit of high capabilities, the AI will become arbitrarily good at grokking such concepts before we've transmitted enough bits of alignment information to it, and so will crystallize an incorrect objective.

    I think this perspective can be translated to any alignment failure, even when the human attacks are acausal, such as "trying to choose the training procedure most likely to build an aligned AI". Even then, if the produced AI is capable enough, it will be able to retrospectively reason about your alignment constraints, so it will "choose" to have a shape that gets selected by this procedure, regardless of its objective (which remains a free variable). This is equivalent to the more common and straightforward "you are as likely to find an aligned AI as an agent with any objective pretending to be an aligned AI" (and given arbitrary capabilities, this agent will be able to fool arbitrary schemes).
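
    A minimal toy sketch of this attack/defense loop, as I picture it (my own illustration, not the post's or the actual Relaxed Adversarial Training proposal: the real proposal uses pseudo-inputs that are descriptions of inputs judged by an oversight process, whereas here the "pseudo-input" is crudely approximated by a concrete worst-case point outside the training range, and the acceptability check by the true task loss; all names and hyperparameters are made up):

    ```python
    import torch

    torch.manual_seed(0)

    # Toy "AI": a linear model trained to fit y = 2x on inputs in [0, 1].
    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)


    def task_loss(m):
        # Ordinary training objective on the narrow training distribution.
        x = torch.rand(32, 1)
        return ((m(x) - 2 * x) ** 2).mean()


    def propose_pseudo_input(m):
        # Stand-in "adversary": picks the candidate point (from a wider range
        # the model never sees in ordinary training) where behavior looks worst.
        candidates = torch.linspace(-5.0, 5.0, 101).unsqueeze(1)
        with torch.no_grad():
            badness = (m(candidates) - 2 * candidates) ** 2
        return candidates[badness.argmax()]


    def acceptability_penalty(m, pseudo_x):
        # Stand-in "overseer": estimates how unacceptable the model's behavior
        # would be on the proposed pseudo-input.
        return ((m(pseudo_x) - 2 * pseudo_x) ** 2).mean()


    for step in range(500):
        pseudo_x = propose_pseudo_input(model)  # the "attack"
        loss = task_loss(model) + 0.1 * acceptability_penalty(model, pseudo_x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```

    The footnote's point is about the limit of this loop: with a sufficiently capable model, the penalty term only transmits finitely many bits about what we want, and the model can fit whatever pattern those bits reward while its objective remains a free variable.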

  4. ^

    From a previous post: "The vastness of the search space itself is the enemy. This is the enemy of consequentialist thinking of any kind, with or without AGI. But it turns out that in vast and complex enough search spaces AGI occurs very naturally (or so we think), and so many dangers arise through it." This is (my understanding of) a point central to Nate's world-view.

  5. ^

    As an example, Eliezer says:

    Definitely, "turns out it's easier than you thought to use gradient descent's memorization of zillions of shallow patterns that overlap and recombine into larger cognitive structures, to add up to a consequentialist nanoengineer that only does nanosystems and never does sufficiently general learning to apprehend the big picture containing humans, while still understanding the goal for that pivotal act you wanted to do" is among the more plausible advance-specified miracles we could get.

    Regardless of Eliezer's "deep models", there's no denying this is an empirical question: we can't right now obtain accurate enough quantitative parameters for such inner workings, or for how far they generalize. Or, more generally, for how much Bayes-structure is required for certain tasks (or is found by our training dynamics). Any attempt to answer these messy questions now will come down to forecasting, and I don't think "our qualitative model says consequentialism is very general and pervasive" (a not-even-well-defined scientific statement in a pre-paradigmatic field of research) constitutes such a strong argument.

  6. ^

    As if this post didn't contain enough proPaulganda!
