The self-unalignment problem
The usual basic framing of alignment looks something like this: we have a system “A” which we are trying to align with a system “H”, which should establish some alignment relation “f” between the systems. Generally, as a result, the aligned system A should do “what the system H wants”.

Two things stand out in this basic framing:

* Alignment is a relation, not a property of a single system. So the nature of system H affects what alignment will mean in practice.
* It’s not clear what the arrow (the relation “f”) should mean.
  * There are multiple explicit proposals for this, e.g. some versions of corrigibility, constantly trying to cooperatively learn preferences, some more naive approaches like plain IRL, some empirical approaches to aligning LLMs…
  * Even when researchers don’t make an explicit proposal for what the arrow means, their alignment work still rests on some implicit understanding of what the arrow signifies.

But humans are self-unaligned

To my mind, existing alignment proposals usually neglect an important feature of the system “H”: the system “H” is not self-aligned, under whatever meaning of alignment is implied by the alignment proposal in question.

Technically, taking alignment as a relation, and taking the various proposals as implicitly defining what it means to be “aligned”, the question is whether that relation is reflexive: is H aligned with itself?

Sometimes, a shell game seems to be happening with the difficulties of humans lacking self-alignment - e.g. assuming that if the AI is aligned, it will surely know how to deal with internal conflict in humans.

While what I’m interested in is the abstract problem, best understood at the level of properties of the alignment relation, it may be useful to illustrate it with a toy model.

Simple model of a self-unaligned system

In the toy model, we will assume a specific structure of system “H”:

* A set of parts p1..pn, with different goals, motivations, or preferences. Sometimes, these parts might be usefully represented as agents; o