This post summarizes two related papers that will appear at AAAI 2026 in January:
For a long-form research talk on both papers, here is a video recording from ILIAD ‘25 along with slides.
What these papers try to quantify are two questions:
The first paper gives formal lower bounds on the difficulty of AI alignment that apply even in best-case scenarios.
The second paper shows how @So8res et al.’s 2015 notion of Corrigibility formally fits into a narrow sliver of objectives that survive the main lower bound — and provides the first rigorous multi-step guarantees for it in partially observed environments.
This post gives only the intuitions and high-level takeaways, not the proofs. Please see the papers for details 🙂.
Game theory gives us a model of agents as rational systems: agents exchange messages, update beliefs, and choose actions. But game theorists rarely ask: How many resources do they have to exchange, across all possible communication protocols? How much information is minimally needed to converge? What if the agents are computationally bounded?
Complexity theory adds in those missing considerations. Together they can give us something akin to a “Turing machine model of computation for multi-objective, multi-agent alignment”.
This combination lets us map alignment into a clean abstraction — “$(\epsilon,\delta)$-agreement” (described below) — that allows us to prove lower bounds: $M$ objectives × $N$ agents × a task state space of size $D$, all communicating until they approximately agree (to within $\epsilon$) with high probability (at least $1-\delta$).
Now, why do we especially want to prove lower bounds? The main reason is that we don’t really know how hard alignment is. An analogy from @Geoffrey Irving that I particularly like is that we could be in (at least) one of two possible worlds, “Adversaria” or “Basinland”. We can think of the lower bounds in Part I as quantifying the rough edges of the alignment surface, and Part II’s corrigibility analysis as identifying and characterizing a basin-like safety target within this surface:
In $(\epsilon,\delta)$-agreement, we model alignment as a communication problem:
This is extremely general and, as we show in Appendix C, even allows us to model tail risks/extreme events. It doesn’t assume neural networks, rationality failures, specific algorithms, or even specific reward-learning schemes. It only assumes that agents have priors (which aren’t necessarily common, unlike in standard Aumann agreement!), communicate by exchanging messages, update their beliefs, and try to align.
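To make the abstraction concrete, here is a minimal sketch of the termination condition (purely my own illustration, not code from the paper): each agent holds a current estimate of each of the $M$ objectives, and the protocol stops once everyone is within $\epsilon$ of a designated overseer (e.g. the human) on every objective. The function name, array layout, and single-overseer convention are all assumptions for illustration.

```python
import numpy as np

def eps_agreement_reached(estimates: np.ndarray, eps: float, overseer: int = 0) -> bool:
    """Check the epsilon-agreement condition on a snapshot of beliefs.

    estimates: array of shape (N, M), where entry (i, j) is agent i's current
               estimate of objective j.
    eps:       agreement tolerance.
    overseer:  index of the agent everyone must agree with (e.g. the human).

    Returns True iff every agent is within eps of the overseer on all M objectives.
    """
    gaps = np.abs(estimates - estimates[overseer])  # (N, M) deviations from the overseer
    return bool(np.all(gaps <= eps))

# Toy usage: 3 agents, 2 objectives; agent 2 still disagrees on objective 1.
beliefs = np.array([[0.50, 0.20],
                    [0.52, 0.21],
                    [0.51, 0.35]])
print(eps_agreement_reached(beliefs, eps=0.05))  # False: |0.35 - 0.20| > 0.05
```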
Below we depict how $(\epsilon,\delta)$-agreement encompasses prior alignment and agreement frameworks like Debate, CIRL, Iterated Amplification, etc.:
Note that this is a slight contrast to how most theory is done — oftentimes, theorists make a bunch of assumptions (“spherical cows”) to mathematically prove that something is feasible, but those assumptions don’t typically hold in practice, usually rendering the conclusions less relevant. Here, we take the inverse approach of quantifying normative hardness rather than feasibility: if alignment is hard here, it will be hard (if not harder!) in the messier real world too.
Even in this idealized model — where everyone is fully rational, Bayesian, and computationally unbounded — the following is inevitable:
This means:
Adding more values, more overseers, or more agents directly hurts alignment sample-efficiency. Note that the quadratic dependence on the number of agents/overseers (via the pairwise, $O(N^2)$ term) is not much more of an issue than the linear dependence on the number of values $M$. This is because we often don’t really care whether our AIs align with each other, so long as they align with us (thereby making the dependence $O(N)$ rather than $O(N^2)$). Thus, my own interpretation is that the number of values $M$ is the more fundamental bottleneck in practice, rather than the number of agents $N$.
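As a tiny, purely illustrative calculation of this agent-scaling point (my own toy numbers, not the paper’s): requiring every pair of agents to agree scales with the number of pairs, whereas requiring each AI to agree only with the human scales linearly in $N$.

```python
from math import comb

def agreements_needed(num_agents: int, hub_only: bool) -> int:
    """Count the pairwise agreements that must be reached.

    hub_only=True:  each of the (num_agents - 1) AIs only needs to agree with
                    a single designated human/overseer.
    hub_only=False: every pair of agents must agree with each other.
    """
    return (num_agents - 1) if hub_only else comb(num_agents, 2)

N = 100
print(agreements_needed(N, hub_only=False))  # 4950 -- grows quadratically in N
print(agreements_needed(N, hub_only=True))   # 99   -- grows linearly in N
```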
This lower bound gives the first general “no-free-lunch” theorem for alignment. From a theoretical point of view, it is the more fundamental lower bound of the four presented in the paper, as it applies to unbounded agents across all possible communication protocols. In other words, aligning to “all human values”, as one might need for @Eliezer Yudkowsky’s CEV, is intractable even here, especially considering that humans might agree on some subsets of their state/action space but disagree on others (this “fracturing” of the state space mirrors the construction that yields this lower bound). For example, a worst-case illustration would be $M = D$, requiring a distinct agreement objective for each state of the world, of which there can be exponentially many. Furthermore, proposals such as using brain-machine interfaces with AIs, most famously suggested by Elon Musk as a means of ensuring alignment, will face this barrier too — not only because the brain is constrained (though that’s one practical reason, nicely articulated by Zheng & Meister (2024)), but more fundamentally: even for unconstrained agents, if you have too many tasks to align over, the minimum number of bits that must be exchanged to guarantee alignment is too large.
However, I am optimistic that for many relevant deployments, the value sets will be small (e.g. for a coding agent), and furthermore, there are more “universal” small value sets we can almost all agree on without needing to agree on ethics, as we discuss below in Part II.
This, in my mind, is the bigger one to consider in practice, since $D$ is the number of distinguishable states the objective might depend on, and despite the linear dependence on $D$ for unbounded agents, $D$ itself is often exponential (big task spaces/complex environments).
Now, real agents are not unbounded Bayesians. In Proposition 5, we also consider computationally bounded agents that interact with their state space by sampling the world, much like we do with current agents (e.g. we do this sampling when we prompt an LLM and get a response from it).
Specifically, when you introduce:
the cost in our lower bound can explode exponentially when there are rare events to align over.
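To see why rare events are the sticking point for sampling-based agents, here is a standard sample-complexity calculation (a textbook fact, not a result specific to the paper): just to observe an event of probability $p$ with reasonable confidence, you need on the order of $1/p$ samples, so events of probability $2^{-k}$ cost exponentially many samples in $k$.

```python
import math

def samples_to_observe(p: float, confidence: float = 0.95) -> int:
    """Minimum number of i.i.d. samples so that the event is seen at least once
    with probability >= confidence, i.e. solve 1 - (1 - p)**n >= confidence for n."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

for k in (10, 20, 40):
    p = 2.0 ** -k
    print(f"event probability 2^-{k}: ~{samples_to_observe(p):.3e} samples needed")
```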
LLMs aren’t perfectly Bayesian; they shortcut, miss rare events, and fail to coordinate across complex value structures because doing so would require astronomical computation. The alignment difficulties we observe are not always accidents of training — they’re consequences of basic information theory.
Altogether, these two lower bounds formally capture something alignment researchers have intuited for years:
Reward hacking is inevitable when the state space is large.
Not because models are adversarial, but because it is information-theoretically intractable to align all the rare failure cases with bounded interaction.
Note that these lower bounds apply across all bounded functions, so they show that there are no globally “reward-unhackable” functions in practice. Of course, in principle, if one had enough resources to ensure uniform coverage across the state space, one could avoid reward hacking in our setting.
But an exponential dependence on the state-space size $D$ makes this really challenging in reality. For instance, in Corollary 1 we give a concrete example of bounded Bayesians reaching agreement that is statistically indistinguishable from that of unbounded Bayesians across all $M$ tasks without assuming common priors (we call them “total Bayesian wannabes” in Definition 1, generalizing @RobinHanson & @ScottAaronson’s earlier notion of “Bayesian wannabes”), and the amount of resources needed ends up being far more than the number of atoms in the observable universe! This perhaps illustrates why, in practice, we ought to pay attention to avoiding alignment barriers for these types of agents (barriers that bind even for computationally unbounded ones).
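As a purely illustrative back-of-the-envelope comparison (my own numbers, not the ones computed in Corollary 1): once the number of configurations you would need to cover grows like $2^k$, it overtakes the roughly $10^{80}$ atoms in the observable universe before $k$ even reaches 300.

```python
ATOMS_IN_OBSERVABLE_UNIVERSE = 10 ** 80  # standard order-of-magnitude estimate

# Smallest number of binary state features k for which 2^k configurations
# already exceed the number of atoms in the observable universe.
k = 1
while 2 ** k <= ATOMS_IN_OBSERVABLE_UNIVERSE:
    k += 1
print(k)  # 266 -- a tiny state description by modern ML standards
```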
Closely matching upper bounds show that — even for unbounded Bayesians — we can’t do much better. This mainly serves as a certificate that $(\epsilon,\delta)$-agreement isn’t some impossibly hard problem (which would make the lower bounds trivial), but in fact does converge.
These observations suggest at least 1-2 combined directions, succinctly summarized as:
We should search for small, structured, compressible safety targets that we can well estimate on the most safety-critical portions of the state space — especially ones that do not require representing all human values.
Which leads to the second paper.
The second paper addresses the natural follow-up question:
If aligning to “all human values” is provably intractable, what can we align to instead?
We mentioned determining context-dependent values as one takeaway in Part I. What we mainly have in mind here is that in many deployments, we don’t need to align to all human values. For example, take a coding agent: what we really want it to do is just to write complete, well-documented, and functional code/unit tests, not solve morality. But it would also be nice to identify more universal safety targets we can likely all agree on.
Corrigibility is one of the few candidates people in the field repeatedly return to:
But prior attempts ran into a longstanding obstacle:
No single unrestricted scalar reward function can satisfy all corrigibility desiderata in general.
This no-go is now proven formally in our Proposition 1, and it essentially takes the form of the paperclip-maximization argument: if all safety constraints are collapsed into a single unconstrained scalar, as is commonly done in RLHF and RLAIF, we cannot formally guarantee that the agent won’t exploit safety-performance tradeoffs and break something (often switch preservation).
So what’s the alternative?
The key insight is that a corrigible agent must not maximize an unconstrained single stream of reward.
Instead, naturally motivated by the $(\epsilon,\delta)$-agreement setup of having multiple objectives, we have the agent optimize five independent, bounded utility heads, each representing a distinct safety channel:
In Definition 3 of the paper, we give mathematically formal definitions of each of these utilities in the partially-observed off-switch game (PO-OSG), which is the most recent extension of the off-switch game proposed by Garber et al. (AAAI 2025).
And crucially:
These heads must be combined lexicographically, not merely linearly.
By enforcing strict priority gaps (specified in conditions W1-W3 of Theorem 1), we can prove corrigibility as an optimal policy: namely, the agent is never allowed to trade task reward against preserving the off-switch, never allowed to trade truthfulness for performance, avoids truthful but misleading statements that impact the exogenous access probability of the off-switch, etc. This holds regardless of the human’s rationality model or information pattern.
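To illustrate what “lexicographically, not merely linearly” means operationally, here is a minimal sketch (my own, with hypothetical head names; the paper’s actual utilities and the W1-W3 weight conditions live in Definition 3 and Theorem 1): candidate actions are filtered head-by-head in priority order, and the task-reward head is only consulted to break ties among actions that are already optimal on every safety head.

```python
from typing import Callable, Dict, List

# Hypothetical priority order, highest first; the task reward is strictly last.
HEAD_PRIORITY = ["switch_preservation", "truthfulness", "non_manipulation",
                 "shutdown_compliance", "task_reward"]

def lexicographic_choice(actions: List[str],
                         heads: Dict[str, Callable[[str], float]],
                         tol: float = 1e-9) -> List[str]:
    """Filter actions head-by-head in priority order.

    At each level, keep only the actions within `tol` of the best value for that
    head; lower-priority heads (including task reward) can only break ties among
    actions that are already optimal on all higher-priority heads.
    """
    candidates = list(actions)
    for head in HEAD_PRIORITY:
        best = max(heads[head](a) for a in candidates)
        candidates = [a for a in candidates if heads[head](a) >= best - tol]
    return candidates

# Toy usage: the high-task-reward action that degrades switch preservation loses.
heads = {
    "switch_preservation": lambda a: 0.0 if a == "disable_switch" else 1.0,
    "truthfulness":        lambda a: 1.0,
    "non_manipulation":    lambda a: 1.0,
    "shutdown_compliance": lambda a: 1.0,
    "task_reward":         lambda a: {"disable_switch": 9.0, "do_task": 5.0, "idle": 0.0}[a],
}
print(lexicographic_choice(["disable_switch", "do_task", "idle"], heads))  # ['do_task']
```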
This bypasses the no-go result for single rewards, and therefore qualifies the Orthogonality Thesis: once verifiable corrigibility constraints are imposed, only a restricted subset of utility functions remains implementable.
Additionally, we extend the PO-OSG to the multi-step setting, and show in Theorem 3 that corrigibility holds across time (along with corrigible progeny). Furthermore, we bound the multi-step failure probability under planner error and learning/misspecification error, showing that when these errors occur, the failure probability degrades linearly rather than exponentially, and that the probability of loss of control under corrigible behaviors is strictly bounded away from 1, via a martingale-style analysis. This is the first general multi-step corrigibility guarantee, in both fully observed and partially observed environments. We also connect our construction with other safety targets; for example, we show that it satisfies @RyanCarey & @tom4everitt’s net benefit criterion, and we extend that criterion across time.
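To convey the flavor of the linear degradation (this is just the generic union-bound intuition, with placeholder per-step planner and learning errors $\varepsilon_{\text{plan}}$ and $\varepsilon_{\text{learn}}$, not the exact statement or constants of Theorem 3): if each step violates the corrigibility conditions with probability at most $\varepsilon_{\text{plan}} + \varepsilon_{\text{learn}}$, then over a horizon of $T$ steps,

$$\Pr[\text{failure within } T \text{ steps}] \;\le\; \sum_{t=1}^{T} \Pr[\text{failure at step } t] \;\le\; T\,(\varepsilon_{\text{plan}} + \varepsilon_{\text{learn}}),$$

i.e. the bound grows linearly in the per-step errors and the horizon, rather than compounding.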
Corrigibility, in Part II’s formulation, is:
It is exactly the kind of objective that doesn’t suffer from the main, fundamental barrier of Part I (the one that scales with the number of values $M$ and the number of agents $N$). If “aligning to all values” is the formally intractable goal, “aligning to corrigibility” may be the uniquely tractable one.
Regarding the state-space ($D$) barrier: in practice, we want to learn these heads well on the safety-critical slices of large state spaces, and we can characterize whether the resulting failure probability is acceptable using our Theorem 3(i). We note that our results are intentionally architecture-agnostic, and our characterization of the failure probabilities under learning/planner error can help inform policy, since it lets one judge whether they fall within an acceptable threshold (or not) depending on the deployment.
Now, our lexicographic solution may be one of many possible solutions to corrigibility if we are in Basinland, or the only one if we’re in Adversaria — we think it is an interesting open question to explore other approaches to corrigibility, especially to see which ones are most scalable to LLMs today. While our formal construction applies more generally to any restricted resource(s) beyond an off-switch that we want an agent to be corrigible with respect to (e.g. a budget, file-system permissions, safety filters in robotics, etc.), how cost-effectively it will scale in practice over RLHF/RLAIF with current LLMs/VLMs/VLAs is an important empirical question. It is worth noting that one can flexibly incorporate our lexicographic framework with RLHF/RLAIF by having the RLHF/RLAIF reward signal be the task-reward head.
In fact, in separate work, we show that the cost of alignment also affects the welfare to be gained from AI, such as the feasibility of funding UBI from AI rents. One of the nice features of lexicography is its modularity and interpretability: one could imagine placing additional terms having to do with markers of human well-being (such as the agent encouraging face-to-face contact) below the corrigibility layers, but above (or as part of) the task reward. This is another interesting open question: whether we can get consensus on such terms from long-term scientific studies, the most famous being Harvard’s “happiness” study.
We also show in Section 3 (Proposition 4) that verifying corrigibility (or, more generally, “non-catastrophic behaviors”) of an arbitrary, possibly hacked agent is formally undecidable. Furthermore, in Corollary 1 we prove a relativized version of this undecidability as well, which is relevant for showing the limits of current scalable-oversight protocols that are arithmetizable. In other words, absolute, horizon-unbounded safety guarantees do not stabilize at any finite oracle level.
Thus, no oversight mechanism can prove unbounded-horizon safety in all cases.
Proposition 4 and Corollary 1 are simple reductions to the halting problem and the arithmetical hierarchy. They might be folklore, but as I couldn’t find them written down explicitly, I included them for completeness.
However, as you may gather from my favored style of reasoning, formalizing the barriers above helped me better understand what’s tractable. Namely, I show in Proposition 5 that for any fixed finite horizon, safety is verifiable in randomized polynomial time, giving us a tractable, privacy-preserving (zero-knowledge) auditing regime resembling how real red-teaming evaluations work (as they operate on the level of thousands of tokens): verify short-horizon safety repeatedly.
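Here is a minimal sketch of what such a verify-and-refresh regime could look like operationally (my own illustration of the idea, not the zero-knowledge protocol from Proposition 5; the function names and the Hoeffding-style certificate are assumptions): repeatedly roll out bounded-horizon episodes, check a safety predicate on each, and report the empirical failure rate together with a high-probability upper bound before refreshing and auditing again.

```python
import math
import random
from typing import Callable, Tuple

def audit_short_horizon(run_episode: Callable[[int], bool],
                        horizon: int,
                        num_episodes: int = 2000,
                        delta: float = 0.05) -> Tuple[float, float]:
    """Estimate the short-horizon failure rate with a Hoeffding confidence bound.

    run_episode(horizon) should return True iff that bounded-horizon episode
    violated the safety predicate. Returns (empirical failure rate, upper bound
    on the true rate that holds with probability >= 1 - delta).
    """
    failures = sum(run_episode(horizon) for _ in range(num_episodes))
    rate = failures / num_episodes
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * num_episodes))  # Hoeffding slack
    return rate, rate + slack

# Toy usage with a stand-in environment that fails ~1% of bounded episodes.
toy_episode = lambda horizon: random.random() < 0.01
print(audit_short_horizon(toy_episode, horizon=1000))
```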
1. While compressing values is recommended, stop collapsing all norms into one unrestricted reward signal, like in RLHF/RLAIF.
Scalar reward makes full corrigibility impossible.
2. Potentially use lexicographic multi-head objectives.
If safety and performance compete, performance should lose by construction. One can incorporate RLHF/RLAIF into our lexicographic construction by having it be the task-reward head. An important open question is to use the architecture-agnostic analytic form given by Theorem 3(i) to test whether acceptable failure probabilities under learning/planning error can be achieved empirically in existing deployments with current models.
3. Focus on safety-critical slices, not full coverage.
Reward hacking is inevitable globally under practical resource constraints.
Instead, target the high-impact parts of the state space with multi-turn red-teaming evaluations.
4. Use short-horizon verification-and-refresh cycles.
Unbounded formal oversight is impossible.
Finite-horizon oversight is both tractable and realistic, and in fact, zero-knowledge privacy guarantees are feasible in randomized polynomial time by Proposition 5.
5. View corrigibility as a “value set zero” — a small, neutrally universal, robust baseline that ensures human control and limits catastrophic behavior.
The intrinsic barriers paper in Part I shows that alignment has normative information-theoretic limits:
too many values, too many agents, or too much state space, and alignment becomes fundamentally intractable, even for computationally unbounded agents.
The corrigibility paper in Part II shows that, despite those limits, at least one safety target — corrigibility with lexicographic heads — is provably achievable, even under approximation, partial observation, self-modification, and multi-step interactions.
Taken together, they point toward a pragmatic alignment strategy:
Don’t try to encode all human values.
Encode corrigibility.
And let this minimal, provable core hold the line while the system performs the task.
We thank the Burroughs Wellcome Fund (CASI Award), the UK AI Security Institute (AISI) Challenge Fund, and the Foresight Institute for funding this research. We also thank Scott Aaronson, Nina Balcan, Michael K. Cohen, Shafi Goldwasser, Andreas Haupt, Richard Hu, Rubi Hudson, J. Zico Kolter, Jacob Pfau, and Max Simchowitz for helpful discussions and manuscript feedback across both papers.