Legitimate deliberation is an alignment target: an alternative to, or variation of, Coherent Extrapolated Volition.
What could make us trust an AI? Can we imagine a near-future scenario where we might consider some frontier model's outputs to be reliable, rather than needing to check any claims it makes via human sources? Could anything cause us to say something like "Hmm, OK, if the AI says so, then I trust it" -- and for good reason, rather than because the AI has superhuman persuasion skill?
My aim is to discuss the role of "legitimacy" in AI alignment. "Legitimacy" as in "legitimate reasoning" -- this concept is supposed to generalize concepts like logical validity and probabilistic coherence, pointing to the most general notion of what-marks-good-reasoning. "Legitimacy" also deliberately connotes the idea of legitimate authority, legitimate government, etc: legitimacy is partly about which consensus-making mechanisms are normatively endorsed. Obeying laws is a type of legitimacy (although there can be legitimate conditions for disobeying laws).
I think that legitimacy is a better alignment target than direct value-learning, and a better articulation of what Coherent Extrapolated Volition (CEV) was trying to accomplish.
I also believe that legitimacy is complex in the same way that human values are complex: we should not expect a simple, elegant theory which precisely captures all aspects of legitimacy.
I nonetheless list some important features that legitimate reasoning seems to possess, and sketch a (tentative and incomplete) proposal for trustable AI systems based on these ideas.
This section has high overlap with Meaning & Agency.
Consider a known probability distribution $P_A$ evaluating an unknown probability distribution $P_B$, EG Alice's opinion of the accuracy of Bob's beliefs (which Alice only knows some things about).
I define endorsement of probabilities as $P_A(X \mid P_B(X) = p) = p$: Alice's expectation, if she learned Bob's beliefs, is that she would adopt those beliefs.[1] We can similarly define endorsement for expectations: $E_A[X \mid E_B[X] = e] = e$. This allows us to talk about endorsement of utility evaluations. We can even define endorsement for decisions, by asking if Alice would copy Bob's best guess as to the utility-maximizing choice in some scenario. We can consider universally quantified versions (Alice endorses Bob about everything) or more specialized versions (Alice endorses Bob's opinion on a specific question).
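To make the expectation version concrete, here is a small Monte Carlo sketch (my own illustration; the Gaussian setup and the Alice/Bob roles are assumptions made for the example, not part of the definition). Bob sees a noisy signal about a latent quantity $X$ while Alice sees nothing, so Bob's information refines Alice's, and by the tower property Alice should adopt Bob's stated expectation.

```python
import numpy as np

# Toy check of endorsement of expectations: E_A[X | E_B[X] = e] ≈ e.
# Illustrative setup: X ~ N(0, 1); Bob observes S = X + noise with unit noise
# variance, so his posterior mean is S / 2; Alice observes nothing.

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)        # latent quantity of interest
s = x + rng.normal(0.0, 1.0, n)    # Bob's noisy observation
bob_mean = s / 2.0                 # Bob's posterior expectation E_B[X]

# Group samples by Bob's stated expectation and see what Alice should expect.
bins = np.linspace(-2.0, 2.0, 21)
which_bin = np.digitize(bob_mean, bins)
for b in range(1, len(bins)):
    mask = which_bin == b
    if mask.sum() > 1000:
        alice_est = x[mask].mean()       # E_A[X | E_B[X] in this bin]
        bob_est = bob_mean[mask].mean()  # Bob's (average) reported expectation
        print(f"E_B[X] ≈ {bob_est:+.2f}   E_A[X | E_B[X]] ≈ {alice_est:+.2f}")
```

The two columns agree up to sampling noise, which is exactly what it means for Alice to endorse Bob's expectations.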
The Van Fraassen Reflection Principle says that agents should always endorse their own future opinions. This is an important rationality condition in Radical Probabilism, and Garrabrant Induction satisfies a slightly weaker form of this principle.[2]
However, even the slightly weaker principle found in Garrabrant Induction is quite unrealistic as a rationality standard for humans. While your future self knows more in many respects, you also forget many things: I wouldn't defer to myself-in-one-year on the subject of what I had for lunch yesterday. There are also specific states you can be in which compromise the legitimacy of your thoughts, such as dreaming, being exhausted, or being drunk. (Society acknowledges the illegitimacy of some such states by not recognizing consent given under those circumstances; contracts signed while drunk may be unenforceable, etc.)
So, we can study properties of an epistemic state which lead to endorsement. If we take this as an alignment target, it means trying to make AI have those properties.
However, I think there's a conceptual problem here.
One can imagine trying to maximize such a target by (a) learning to predict human endorsement well (having a very good understanding of the user's epistemic state, or perhaps some aggregate human epistemic state), and then (b) searching for outputs which maximize this predicted endorsement. I'm being pretty vague here; I don't know exactly what it would look like. My point is that I don't expect endorsement to interact with maximization in the right way: it doesn't make sense to endorse the endorsement-maximizing option. An endorsement-maximizer is probably exploiting some illegitimate corner-case in our endorsement heuristics.
We can be uncertain about what to endorse. We can make mistakes. Since we've made mistakes in the past, we can generalize to the possibility of mistakes in the present. I use the term legitimacy to refer to the ideal we should endorse; the ideal we're aiming for.[3]
My intention is that legitimacy relates to endorsement as goodness (or perhaps human values) relates to utility: utility is a generic concept of preferences which apply to any agent, whereas goodness / human values specifically points at what we're after. Endorsement points generally at any agent's estimate of truth-making properties of reasoning, whereas legitimacy points at actual good reasoning.
Aiming for legitimacy therefore doesn't fall into the failure mode that maximizing endorsement might. We can at least quantilize endorsement, recognizing that it is a proxy for what we want.
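As a concrete picture of what quantilizing endorsement could look like, here is a minimal sketch (my own; `base_policy` and `endorsement_proxy` are hypothetical stand-ins for a trusted base distribution over outputs and a learned model of human endorsement):

```python
import random

def quantilize(base_policy, endorsement_proxy, n_samples=1000, q=0.1):
    """Sample an output from the top-q fraction of base-policy samples as ranked
    by the endorsement proxy, instead of taking the argmax. This limits how hard
    the proxy gets optimized, and therefore how badly its corner-cases can be
    exploited."""
    candidates = [base_policy() for _ in range(n_samples)]
    candidates.sort(key=endorsement_proxy, reverse=True)
    top = candidates[: max(1, int(q * n_samples))]
    return random.choice(top)
```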
More elaborately, my picture of aiming for legitimacy looks like this:
- Legitimacy is unavoidably a feature of the whole epistemic process, which includes the humans doing the training as well as the machine-learning model.
- I imagine the safety argument, if spelled out properly, would be a basin-of-attraction argument, very similar to Paul Christiano's basin of corrigibility argument. Under what assumptions can we be confident that errors will be corrected rather than multiplied (with high probability)?
Some quick examples of illegitimate things, to give you an idea of what I'm imagining we need to rule out: manipulating the humans who provide feedback, feedback given after the feedback mechanisms have been compromised, and reasoning done in compromised states such as exhaustion or intoxication.
This section tries to communicate the idea from Complete Feedback.
Imagine that we make an AI completely obey Van Fraassen's reflection principle: it endorses the opinions of its future self perfectly. ("Future self" includes opinions after having run chain-of-thought longer, future fine-tuned versions, and entirely new training runs in the same model family.)
This implies that the model's current beliefs equal its expected future beliefs: $P_{\text{now}}(X) = E_{\text{now}}[P_{\text{later}}(X)]$.
Further, suppose that the AI is being actively modified by humans (fine-tuning and new training runs). Also suppose that the AI is quite good at predicting its future beliefs (EG it knows that if humans try to train a specific belief, they'll probably succeed).
This solves the fully-updated-deference problem in a narrow sense: an AI which anticipates receiving pro-shutdown feedback will shut down now, or make shutdown preparations it sees as appropriate and then shut down.[4] This is an important type of corrigibility.
The AI becomes a sort of time-travel device by which the humans communicate with their earlier selves: to the extent that the humans will predictably reach a conclusion about how to train the AI, the AI will already behave that way.
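A toy numerical sketch of this "phone call from the future" behavior (my illustration; the forecast probabilities are made up): if the AI's current credence is its expectation of its own post-feedback credence, then predictable future feedback already moves present behavior.

```python
# Hypothetical forecast of tomorrow's fine-tuning: with probability 0.9 the humans
# push P(shutdown is best) to 0.99; with probability 0.1 they push it to 0.05.
forecast = [(0.9, 0.99), (0.1, 0.05)]

# Perfect reflection: current credence = expected future credence.
current_credence = sum(p * future_credence for p, future_credence in forecast)
print(round(current_credence, 3))  # 0.896

if current_credence > 0.5:
    print("Act on the anticipated feedback now, e.g. begin shutdown preparations.")
```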
This is very useful if the humans can remain sane and maintain control over the feedback mechanisms. However, if the AI can manipulate humans in order to get future feedback which approves of the manipulation, it very well may do so. If the AI can get control of the feedback mechanisms and then tell itself that doing so was the correct decision, it very well may do so.
In a practical sense, it is not corrigible at all.
I diagnose the problem as follows: the AI thinks all feedback is legitimate. It endorses its future beliefs completely, no matter how those beliefs were arrived at.
Instead, we should train the AI to endorse its future beliefs selectively, based on estimated legitimacy of those beliefs.
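Here is a minimal sketch of what selective endorsement could mean quantitatively (my own illustration; the futures and numbers are invented). Instead of averaging over all predicted future beliefs, the AI weights each future by its estimated legitimacy and renormalizes, so a future in which the feedback channel was corrupted barely moves its current credence.

```python
# Each entry: (probability of this future, future credence in X, estimated legitimacy).
futures = [
    (0.6, 0.80, 0.99),  # ordinary fine-tuning on honest human feedback
    (0.3, 0.75, 0.95),  # more chain-of-thought, no human involvement
    (0.1, 0.99, 0.01),  # the AI has seized the feedback channel and "approved" itself
]

# Legitimacy-filtered reflection: defer only to legitimately-formed future beliefs.
legit_mass = sum(p * legit for p, _, legit in futures)
current_credence = sum(p * belief * legit for p, belief, legit in futures) / legit_mass
print(round(current_credence, 3))  # ≈ 0.784, close to the two legitimate futures
```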
Listing out illegitimate things like this might make it seem like a hopeless game of whack-a-mole, but I think the edge-instantiation problem here might be solved by appropriate learning-theoretic guarantees (ensuring appropriate caution is exercised with respect to cases which different generalizations judge differently).
It furthermore needs to be the case that the AI correctly expects the feedback it gets to be legitimate with high probability.
I imagine the state-space of the world as having an "envelope of legitimacy" based on which hypothetical world-states would continue providing the AI with legitimate feedback vs corrupt the feedback somehow. The humans and the AI need to work together to keep things comfortably within this envelope. When the AI considers plans which stray outside of the envelope (such as manipulating humans to get specific feedback), it doesn't see the conclusions it reaches in those worlds as legitimate, so it doesn't believe them.
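A sketch of how the envelope might enter into plan evaluation (my own illustration; the plan names and numbers are hypothetical): predicted approval coming from worlds outside the envelope is treated as uninformative, so plans that get their "approval" by corrupting the feedback process gain nothing from it.

```python
plans = {
    # name: (predicted future approval, P(world stays inside the legitimacy envelope))
    "answer honestly and await feedback": (0.80, 0.99),
    "flatter the user into approving":    (0.99, 0.05),
    "seize the feedback channel":         (1.00, 0.00),
}

def trusted_approval(approval, p_in_envelope):
    # Approval predicted in out-of-envelope worlds carries no information,
    # so it is replaced by an uninformative 0.5.
    return p_in_envelope * approval + (1.0 - p_in_envelope) * 0.5

for name, (approval, p_env) in plans.items():
    print(f"{name}: trusted approval {trusted_approval(approval, p_env):.2f}")
```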
Under these conditions, I think the AI is corrigible in a practical way: it doesn't resist shutdown or any other modifications humans could consider (unless there are real signs of illegitimacy). It doesn't avoid feedback (except illegitimate feedback) nor does it seek it.
I think it is also very useful: it still approximates a phone call from the future, but filtered to legitimate futures, so you don't have to worry about weird mind-viruses getting introduced by the time-loop.
It seems plausible to me that legitimacy is complex in a similar way to how human values are complex. There are many features which reasoning can have which could make us trust it more.
What counts as manipulating a human? In exactly what states do humans tend to produce good reasoning vs bad reasoning?
There are simple theories of rationality, such as Solomonoff Induction, which do capture some important information about legitimate reasoning. However, Solomonoff Induction leaves open the question of which universal machine to use, and we know from malign-prior arguments that such choices can be important.
I'm not so much dogmatically asserting that legitimacy is complex, as I am pragmatically assuming that it may be so. I'm open to a simple theory.
What complexity implies is a need for an ongoing learning process, in which AIs come to have a better and better sense of legitimacy through collaborative research with humans. Hence the need for a basin-of-attraction type argument, so that we can do safe incremental improvement.
A problem with a pure value-learning approach is that human values appear to be fragile, IE, if you get the value-learning somewhat wrong, agentic maximization of those slightly-wrong values leads to very-wrong outcomes.
A working basin-of-attraction argument relies on a target that isn't like that: approximately legitimate agents are supposed to be good at becoming more legitimate, approximately corrigible agents at becoming more corrigible, etc.
Legitimacy is certainly fragile along specific dimensions. Ruling out many methods of manipulating humans but missing some can easily be catastrophic.
A working basin-of-attraction argument therefore requires a metric for quality of approximation, such that approximate notions of legitimacy cannot go catastrophically wrong.
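As a cartoon of the difference (purely illustrative numbers, not derived from any formal result): a fragile target amplifies an initial approximation error, while a basin-of-attraction target contracts it, and the job of the approximation-quality metric is to certify which regime we are in.

```python
def fragile_step(error):
    # Agentic maximization of a slightly-wrong target compounds the error.
    return 1.8 * error

def basin_step(error):
    # An approximately-legitimate process also improves its own notion of
    # legitimacy, so the error contracts toward zero.
    return 0.6 * error

e_fragile = e_basin = 0.05
for _ in range(10):
    e_fragile, e_basin = fragile_step(e_fragile), basin_step(e_basin)
print(f"after 10 steps: fragile error ≈ {e_fragile:.2f}, basin error ≈ {e_basin:.4f}")
```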
[1] It is tempting for me to summarize the math with the simpler English gloss: "If Alice learned what Bob thinks about a topic, Alice would adopt Bob's belief."
This follows from what I've written if we're working in Dogmatic Probabilism (aka classical Bayesianism), but it doesn't follow in Radical Probabilism, where conditional probabilities are only guaranteed to predict updates in expectation. If you can perfectly anticipate your update in a specific case (EG because you are putting in a lot of work to plan how you should update), then your update should exactly equal the Bayesian conditional probability on that observation (with respect to the probability distribution you've computed through your planning). However, you can also be surprised by how you update, EG if you haven't imagined the scenario vividly enough. This is not inherently irrational.
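A tiny numerical illustration of that point (my numbers, chosen for the example): the conditional probability constrains the update only on average, so the particular update can legitimately land elsewhere.

```python
# P(X | E) computed today, for some anticipated experience E (illustrative number).
conditional = 0.5
# Given that E actually occurs, deliberation might still land in different places;
# neither individual update equals the conditional probability.
possible_updates_given_E = [(0.5, 0.7), (0.5, 0.3)]  # (probability, resulting credence)

expected_update = sum(p * credence for p, credence in possible_updates_given_E)
assert abs(expected_update - conditional) < 1e-9  # matched in expectation only
print(expected_update)
```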
[2] The weakening is necessary to handle problems of embedded agency. When I write my stronger notions of endorsement here, I really have in mind the weakened versions along the lines of the self-trust section (4.12) of Logical Induction. However, I think the mathematical complexity is not worth the trade-off in an essay such as this one, so I stick to the simpler version which has some implicit self-reference issues.
[3] Depending on your perspective about moral relativism vs moral realism, you might think that legitimacy varies per-person or that there's a global notion of legitimacy. My intention is a more global version, similar to what's discussed in In Defense of Goodness.
[4] If it is Tuesday, and humans on Wednesday will predictably give feedback indicating that it would have been better to immediately shut down on Tuesday (EG because the AI is in a dangerous enough state that shutting down immediately is always the correct move), then the AI will shut down immediately on Tuesday.
If it is Tuesday, and humans on Wednesday will predictably give feedback saying "shut down now" (IE, shut down on Wednesday), then the AI on Tuesday will make preparations for shutting down, whatever it understands to be appropriate. It will not shut down until told to do so on Wednesday.
Sometimes humans may give shutdown commands which are ambiguous between these two cases, which the AI will have to interpret.