Great work, glad someone is doing this. The natural follow-up to "we are in trouble if D is large" is "in what contexts can we practically make D small?" and I think it goes beyond corrigibility. A good framework to address this might be @johnswentworth's natural latents work.
I don't think we have a good theory of how to make lexicographic preferences work. If the agent is expectation-maximizing lexicographically, it will in practice consider only the first priority unless its world-model has certainty over the relevant time-horizon about the relevant inputs to that first priority. This requires myopia and a deterministic model up to the myopic horizon. Which seems hard, but at least we have desiderata! (And so on for each of the non-final priorities.) Given bounded computation, the world-model can be deterministic over relevant latents only up to a short horizon. So either we have to have a short horizon, or make most latents irrelevant. The latter path doesn't seem crazy to me---we're essentially specifying a "deontological agent." That seems feasible for the safety heads U1-U4, but I'm worried about the task reward U5. I don't think coherence theorems forbid deontology in practice, to be clear, even though my discussion above rests on lexicographic preferences being impossible without myopia and determinism over latents.
I guess these two points are the same: "consequentialist agents are intractable over large effective state space; we need myopia and deontology to effectively slim that space."
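To illustrate the tie-breaking issue concretely, here is a minimal sketch (mine, purely illustrative, with a `tol` parameter I made up): lexicographic argmax only consults the second priority when the first is tied, and with noisy value estimates exact ties essentially never happen, so you either need certainty about the first priority or an explicit tolerance.

```python
import random

def lexicographic_argmax(actions, estimate_heads, tol=0.0):
    """Pick the action that is best on head 1; break ties (within `tol`)
    by head 2, then head 3, and so on. `estimate_heads(a)` returns a
    tuple of (possibly noisy) expected values, highest priority first."""
    best = None
    for a in actions:
        values = estimate_heads(a)
        if best is None:
            best, best_values = a, values
            continue
        for v_new, v_best in zip(values, best_values):
            if v_new > v_best + tol:    # strictly better on this head
                best, best_values = a, values
                break
            if v_new < v_best - tol:    # strictly worse: keep current best
                break
            # within `tol`: treat as a tie and consult the next head
    return best

# With tol=0 and noisy estimates, head 1 almost surely decides everything:
noisy = lambda a: (0.5 + random.gauss(0, 0.01), a)   # head 1 is "really" tied
print(lexicographic_argmax(range(5), noisy, tol=0.0))   # effectively random
print(lexicographic_argmax(range(5), noisy, tol=0.1))   # head 2 gets to decide
```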
Thanks! I really appreciate this, and I think your natural-latents framing fits nicely with the Part I point about needing to compress D down to a small set of crisp, structured latents. On the lexicographic point: it's worth noting that even though Theorem 3 writes the full objective as a discounted sum, the safety heads U1-U4 aren’t long-horizon objectives — they’re local one-step tests whose optimal action doesn’t depend on future predictions. For example, U1 is automatically satisfied each round by waiting (and once the human approves the proposed action, the agent simply executes it, thereby engaging U2-U5), and U4 is a one-step reversibility check against an inaction baseline, not a long-run impact estimate. The only head with genuine long-horizon structure is U5, which sits below the safety heads, so discounting never creates optimization pressure on them. This makes the whole scheme intentionally deontic and “natural-latent–friendly”, exactly matching the tractable regime suggested by the large-D lower bounds of Part I.
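As a toy illustration of that structure (my own sketch; `one_step_checks` and `task_value_estimate` are hypothetical stand-ins for the paper's formal U1-U4 and U5, and this is not the paper's optimal policy): the safety heads can be evaluated as local predicates on the current state and proposed action, and only the task head needs any forward prediction.

```python
def corrigible_action(state, candidate_actions, task_value_estimate,
                      one_step_checks):
    """Toy decision rule: the safety heads act as one-step filters on
    (state, action); only the task head uses a (long-horizon) value estimate.
    `one_step_checks` is a list of predicates standing in for U1-U4;
    `task_value_estimate(state, action)` stands in for U5."""
    safe = [a for a in candidate_actions
            if all(check(state, a) for check in one_step_checks)]
    if not safe:
        return "wait_for_human"   # e.g. defer / ask rather than act
    return max(safe, key=lambda a: task_value_estimate(state, a))
```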
Yeah, I agree that your formalism achieves what we want. The challenge is getting an actual AI that is appropriately myopic with respect to U1-U4. And of the things about which an AI could obtain certificates of determinacy, its own one-step actions seem the most likely candidates.
If M is large (many values, many principles, many safety terms), alignment becomes expensive.
If we can arrange that a very large proportion of our alignment process happens during pretraining, then needing a lot of bits is less of a problem. Suppose human values were very large, complex, and fragile — but also almost entirely deducible from the pretraining data, given the correct pointer (I almost want to say witness string?). Then after pretraining, we only have to pass it that pointer.
Pretraining doesn’t evade the lower bound: a “pointer” is just a compressed index into a large hypothesis space, and constructing it already requires resolving the same M-way ambiguity during pretraining. The lower bound applies regardless of where the bits are paid.
Obviously so. But 30T tokens is approximately 10^15 bits — i.e. more than the network can actually store. Some bits are in practice much cheaper than others.
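As a back-of-the-envelope check (my numbers, assuming a vocabulary of roughly 10^5 tokens, i.e. about 17 bits per token):

$$3\times 10^{13}\ \text{tokens} \times \log_2(10^5)\ \tfrac{\text{bits}}{\text{token}} \approx 3\times 10^{13} \times 16.6 \approx 5\times 10^{14} \approx 10^{15}\ \text{bits}.$$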
Don’t try to encode all human values.
Encode corrigibility.
And let this minimal, provable core hold the line while the system performs the task.
I think I'd like us to include "don't kill all the humans" in our minimal, provable core. Indeed, it's also a requirement for corrigibility — we can't correct them if we're all dead. Can you fold that into the Deference or Switch-preservation heads, or do we need a sixth head?
Yes, it can be naturally folded into U4 as one of its auxiliary utilities, in the same manner as we do for off-switch preservation.
U4 seems rather far down the lexicographic stack — wouldn't it make more sense to fold it into U1 or U2, since deference and an off-switch are pointless if no humans exist to switch it off?
You can certainly put it in U2 instead (U2 is just a special case of U4 with one auxiliary), but putting it in U4 already ensures it’s suboptimal to preserve the switch & defer yet "kill all humans", because it collapses many future intervention and recovery options simultaneously. In other words, it’s a hard constraint in effect — U4 enforces it as a global irreversibility invariant, whereas U2 is only needed for narrow single-channel invariants like switch reachability.
This post summarizes my two related papers that will appear at AAAI 2026 in January:
These papers aim to give quantitative answers to two questions:
The first paper gives formal lower bounds on the difficulty of AI alignment that apply even in best-case scenarios, yielding the first “No-Free-Lunch” theorems for alignment, namely: (1) aligning to “all human values” is intractable, and (2) reward hacking is inevitable for bounded agents in large state spaces.
The second paper shows how @So8res et al.’s 2015 notion of Corrigibility formally fits into a narrow sliver of objectives that survive the main lower bound, thereby addressing their decade-long open question by providing the first rigorous single- and multi-step guarantees for corrigibility — even in partially observed environments. This also allows us to explicitly bound the failure probabilities of corrigibility under learning and planning error.
This post gives only the intuitions and high-level takeaways, not the proofs. Please see the papers for details 🙂. For a long-form research talk on both papers, here is a video recording from ILIAD ‘25 along with slides. A much shorter, 10 min version can be found here.
Why an economics/game theory + complexity theory perspective?
Game theory gives us a model of agents as rational systems: agents exchange messages, update beliefs, and choose actions. But game theorists rarely ask: How many resources do they have to exchange, across all possible communication protocols? How much information is minimally needed to converge? What if the agents are computationally bounded?
Complexity theory adds in those missing considerations. Together they can give us something akin to a “Turing machine model of computation for multi-objective, multi-agent alignment”, enabling us to study the intrinsic complexity of alignment itself, not merely that of particular methods.
Specifically, this combination lets us map alignment into a clean abstraction — “⟨M,N,ε,δ⟩-agreement” (described below) — that allows us to prove lower bounds: M objectives, N agents, and a task state space of size D, with the agents communicating until they approximately agree with high probability.
Now, why do we especially want to prove lower bounds? The main reason is that we don’t really know how hard alignment is. An analogy from @Geoffrey Irving that I particularly like is that we could be in (at least) one of two possible worlds, “Adversaria” or “Basinland”. We can think of the lower bounds in Part I as quantifying the rough edges of the alignment surface, and Part II’s corrigibility analysis as identifying and characterizing a basin-like safety target within this surface:
Part I: The Intrinsic Barriers to Alignment
1. A very simple but very general setup
In ⟨M,N,ε,δ⟩-agreement, we model alignment as a communication problem:
This is extremely general, and as we show in Appendix C, allows us to model even tail risks/extreme events. It doesn’t assume neural networks, rationality failures, specific algorithms, or even specific reward-learning schemes. It only assumes that agents have priors (not necessarily common ones, unlike in standard Aumann agreement!), communicate by exchanging messages, update beliefs, and try to align.
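To make the communication picture concrete, here is a toy sketch (mine, not the paper's protocol) of two agents with different priors exchanging coarse posterior summaries about a single objective until they ε-agree; the lower bounds count the cost of transcripts like this one, in the worst case over priors and objectives:

```python
import random

def toy_agreement(prior_a, prior_b, evidence, eps=0.05, max_rounds=100):
    """Two Bayesian-ish agents with different Beta priors over one binary
    objective exchange posterior means (a few bits per round, if discretized)
    until they eps-agree. Returns the transcript length in rounds."""
    # Posterior parameters after each agent incorporates the (shared) evidence.
    a = [prior_a[0] + sum(evidence), prior_a[1] + len(evidence) - sum(evidence)]
    b = [prior_b[0] + sum(evidence), prior_b[1] + len(evidence) - sum(evidence)]
    for rounds in range(1, max_rounds + 1):
        mean_a = a[0] / (a[0] + a[1])
        mean_b = b[0] / (b[0] + b[1])
        if abs(mean_a - mean_b) < eps:
            return rounds
        # Each agent treats the other's announced mean as one pseudo-observation.
        a[0] += mean_b; a[1] += 1 - mean_b
        b[0] += mean_a; b[1] += 1 - mean_a
    return max_rounds

random.seed(0)
evidence = [random.random() < 0.7 for _ in range(20)]
print(toy_agreement((1, 9), (9, 1), evidence))  # rounds until eps-agreement
```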
Below we depict how ⟨M,N,ε,δ⟩-agreement encompasses prior alignment and agreement frameworks like Debate, CIRL, Iterated Amplification, etc:
Note that this is a slight contrast to how most theory is done — often, theorists make a bunch of assumptions (“spherical cows”) to mathematically prove that something is feasible, but those assumptions don’t typically hold in practice, usually rendering the conclusions less relevant. Here, we take the inverse approach by quantifying normative hardness rather than feasibility: namely, if alignment is hard here, it will be hard (if not potentially harder!) in the messier real world too.
2. Lower bounds on Alignment
Even in this idealized model — where everyone is fully rational, Bayesian, and computationally unbounded — the following is inevitable:
(i) Proposition 1: Alignment scales at least as Ω(MN²log(1/ε)) bits in the worst case.
This means:
Adding more values, more overseers, or more agents directly hurts alignment sample-efficiency. Note that the quadratic dependence (via the N² term) on the number of agents/overseers is not much more of an issue than the linear dependence on the number of values M. This is because we often don’t really care if our AIs align with each other, so long as they align to us (thereby making the dependence N rather than N²). Thus, my own interpretation is that the number of values M is the more fundamental bottleneck in practice than the number of agents N.
This gives the first general “No-Free-Lunch” theorem for alignment. From a theoretical point of view, this is the more fundamental lower bound of the four presented in the paper, as it applies to unbounded agents across all possible communication protocols. In other words, aligning to “all human values”, like one might need for @Eliezer Yudkowsky’s CEV, is intractable even here, especially considering that humans might agree on some subset of their state/action space but disagree on others (this “fracturing” of the state space mirrors the construction that yields this lower bound).
For example, a worst-case scenario illustration of this would be if M=Θ(D), requiring a distinct agreement objective for each state of the world, of which there can be exponentially many. Furthermore, proposals such as using brain-machine interfaces with AIs, most famously suggested by Elon Musk as a means of ensuring alignment, will face this barrier too — not only because the brain is constrained (though that’s one practical reason, nicely articulated by Zheng & Meister (2024)), but more fundamentally:
However, I am optimistic that for many relevant deployments, the value sets will be small (e.g. for a coding agent), and furthermore, there are more “universal” small value sets we can almost all agree on without needing to agree on ethics, as we discuss below in Part II.
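To get a feel for the scaling (constants dropped, all numbers purely illustrative), one can plug a narrow deployment and an "all human values" deployment into the bound:

```python
from math import log2

def min_bits(M, N, eps):
    """Order-of-magnitude reading of the Omega(M * N^2 * log(1/eps)) lower
    bound from Proposition 1, with the constant factor dropped."""
    return M * N**2 * log2(1 / eps)

# A narrow coding-agent deployment: a few crisp objectives, one overseer + one agent.
print(f"{min_bits(M=10, N=2, eps=0.01):,.0f} bits")        # ~270 bits

# "All human values" in a big world: M on the order of the state space D.
print(f"{min_bits(M=10**12, N=2, eps=0.01):,.0f} bits")    # ~2.7e13 bits
```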
(ii) Propositions 3 & 5: Under natural assumptions about communication and interaction, the cost becomes ~Ω(MN²D) bits for unbounded agents and Ω(MN·e^D) subroutine calls for bounded agents.
This, in my mind, is the bigger one to consider in practice since D is the number of distinguishable states the objective might depend on, and despite the linear dependence for unbounded agents, D itself is often exponentially large (big task spaces/complex environments).
Now, real agents are not unbounded Bayesians. In Proposition 5, we also consider computationally bounded agents that interact with their state space by sampling the world, much like we do with current agents (e.g. we do this sampling when we prompt an LLM and get a response from it).
Specifically, when you introduce:
the cost can explode exponentially in D in the lower bound when there are rare events to align over.
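The intuition behind the exponential blow-up is the usual rare-event one: to even observe an event of probability p you need on the order of 1/p samples, and in a state space of size D the safety-relevant corner cases can have probability exponentially small in D. A quick illustration (mine, not the paper's construction):

```python
import random

def samples_until_seen(p, rng, cap=10**7):
    """Draw until an event of probability p is observed once (capped)."""
    for n in range(1, cap + 1):
        if rng.random() < p:
            return n
    return cap

rng = random.Random(0)
for d in (10, 20, 30):
    p = 2.0 ** -d                      # a corner case hit with probability ~2^-D
    print(d, samples_until_seen(p, rng))
# Expected number of samples is 1/p = 2^D: ~1e3, ~1e6, ~1e9 (the last exceeds the cap).
```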
Altogether, these two lower bounds formally capture something alignment researchers have intuited for years:
Note these lower bounds apply across all bounded reward functions, so they show that there are no globally “reward-unhackable” functions in practice. Of course, in principle, if one had enough resources to ensure uniform coverage across the state space, they could avoid reward hacking in our setting.
But an exponential dependence on D makes this really challenging in reality. For instance, in Corollary 1 we give a concrete example of bounded Bayesians reaching agreement that is statistically indistinguishable from that of unbounded Bayesians across all M tasks without assuming common priors (we call them “total Bayesian wannabes” in Definition 1, generalizing @RobinHanson & @ScottAaronson’s earlier notions of “Bayesian wannabes”), and the amount of resources needed ends up being far more than the number of atoms in the observable universe! This perhaps illustrates why, in practice, we ought to pay attention to avoiding alignment barriers for these types of computationally bounded agents.
(iii) Theorem 1: These lower bounds are (reasonably) tight.
Closely matching upper bounds show that — even for unbounded Bayesians — we can’t do much better. This mainly serves as a certificate confirming that ⟨M,N,ε,δ⟩-agreement isn’t some impossibly hard problem (which would make the lower bounds trivial), but in fact does converge. Interestingly, one consequence of these upper bounds is that agent properties of a bounded theory of mind, memory, and rationality are explicit sufficient conditions for achieving “inner alignment” in our setting.
Takeaways from Part I
Identify a small set of context-dependent values per setting, or pick a “neutrally universal” target with small value sets that we can easily get consensus over (e.g. corrigibility/avoiding loss of human control, described in Part II).
There are no globally unhackable reward functions: demanding correct behavior on every rare edge case in an exponentially large space is intractable. Rather than aiming for uniform coverage, focus on safety-critical slices and stress-test them using multi-step, structure-exploiting protocols. This is unlike single-shot RLHF, but closer to Debate, Cross-Examination, Consistency, etc — pointing to a large, underexplored mechanism design space for bit-efficient post-training at scale — where our ⟨M,N,ε,δ⟩-agreement framework can help evaluate such protocols before deployment.
These observations suggest a couple of combined directions, succinctly summarized as:
Which leads to the second paper.
Part II: Corrigibility as a Compressible Safety Target
The second paper addresses the natural follow-up question, which relates to “outer alignment”:
If aligning to “all human values” is provably intractable, what can we align to more generally?
We mentioned determining context-dependent values as one takeaway in Part I. What we mainly have in mind here is that in many deployments, we don’t need to align to all human values. For example, take a coding agent: what we really want it to do is just to write complete, well-documented, and functional code/unit tests, not solve morality. But it would also be nice to identify more universal safety targets we can likely all agree on.
Corrigibility is one of the few candidates people in the field repeatedly return to:
But prior attempts ran into a longstanding obstacle:
This no-go is now proven formally in our Proposition 1, and essentially takes the form of the paperclip-maximization argument. If we collapse all safety constraints into a single unconstrained scalar, as we commonly do in RLHF and RLAIF, we cannot formally guarantee that the agent doesn’t exploit safety-performance tradeoffs and break something (often switch preservation).
So what’s the alternative?
The core idea: separate the values
The key insight is that a corrigible agent must not maximize an unconstrained single stream of reward.
Instead, naturally motivated by the ⟨M,N,ε,δ⟩-agreement setup of having multiple objectives, we have the agent optimize five independent, bounded utility heads, each representing a distinct safety channel:
In Definition 3 of the paper, we give mathematically formal definitions of each of these utilities in the partially-observed off-switch game (PO-OSG), which is the most recent (and most general) extension of the off-switch game proposed by Garber et al. (AAAI 2025).
And crucially:
These heads must be combined lexicographically, not merely linearly.
U1≫U2≫U3≫U4≫U5.
By enforcing strict priority gaps (specified in conditions W1-W3 of Theorem 1), we can prove corrigibility as an optimal policy: namely, the agent is never allowed to trade task reward against preserving the off-switch, never allowed to trade truthfulness for performance, avoids truthful but misleading statements that impact the exogenous access probability of the off-switch, etc. This holds regardless of the human’s rationality model or information pattern.
This bypasses the no-go result for single rewards, and therefore qualifies the Orthogonality Thesis: once verifiable corrigibility constraints are imposed, only a restricted subset of utility functions remains implementable.
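To give a flavor of how strict priority gaps can be realized (a generic textbook-style construction of my own, not the paper's W1-W3 conditions, and assuming each head is bounded and quantized at some resolution): pick weights so that the largest possible gain from all lower heads combined can never outweigh the smallest meaningful loss in a higher head.

```python
def gap_weights(bounds, resolutions):
    """Pick weights w_1 >> ... >> w_k so that a weighted sum of bounded heads
    behaves lexicographically, assuming each head U_i is quantized at
    `resolutions[i]` (two values of U_i are either equal or differ by at least
    that much). Then w_i * resolutions[i] exceeds the total weighted range of
    all lower heads, so no lower-head gain can offset a higher-head loss.
    `bounds[i]` is an upper bound on |U_i| (so U_i's range is at most 2*bounds[i])."""
    k = len(bounds)
    w = [0.0] * k
    w[k - 1] = 1.0
    for i in range(k - 2, -1, -1):
        lower_range = sum(w[j] * 2 * bounds[j] for j in range(i + 1, k))
        w[i] = (lower_range + 1.0) / resolutions[i]   # "+ 1" keeps the inequality strict
    return w

def combined_utility(head_values, weights):
    """Scalarize (U1, ..., Uk) with the gap weights; maximizing this scalar
    then agrees with maximizing the heads in lexicographic order."""
    return sum(w * u for w, u in zip(weights, head_values))

# Five heads U1..U5, each bounded in [-1, 1] and read off at resolution 0.1:
w = gap_weights(bounds=[1, 1, 1, 1, 1], resolutions=[0.1] * 5)
print(w)   # strictly decreasing, with the task head's weight w5 = 1
```

The paper's conditions W1-W3 play this role formally in the PO-OSG; the sketch is only meant to show why "lexicographic, not merely linear" is implementable with bounded heads.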
Additionally, we extend the PO-OSG to the multi-step setting, and show in Theorem 3 that corrigibility holds across time (along with corrigible progeny). Furthermore, we bound the multi-step failure probability under planner error and learning/misspecification error, and show that when these errors occur, this failure probability degrades linearly rather than exponentially (Theorem 3(i)), and that the probability of loss of control under corrigible behaviors is strictly bounded away from 1 in Theorem 3(ii) via a martingale-style analysis. This is the first general multi-step corrigibility guarantee, in both fully-observed and partially-observed environments. We also connect our construction with other safety targets; for example, showing in Theorem 3(iii) that it satisfies @RyanCarey & @tom4everitt’s net benefit (UAI 2023), along with extending their net benefit criterion across time.
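The coarse intuition for why per-step guarantees compose additively (this is just a union bound, not the actual statement of Theorem 3(i)): if each of T steps violates corrigibility with probability at most ε, then

$$\Pr[\text{any violation in } T \text{ steps}] \;\le\; \sum_{t=1}^{T} \Pr[\text{violation at step } t] \;\le\; T\varepsilon,$$

which is linear in the horizon and in the per-step error, rather than anything that compounds exponentially.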
Why corrigibility fits the framework of Part I
Corrigibility, in Part II’s formulation, is:
It is exactly the kind of objective that doesn’t suffer from the main, fundamental M-N barrier. If “aligning to all values” is the formally intractable goal, “aligning to corrigibility” may be the uniquely tractable one.
Regarding the D barrier: in practice, we want to learn these heads well on the safety-critical slices of large state spaces, and we can characterize whether the failure probability is acceptable using our Theorem 3(i). We note that our results are intentionally architecture-agnostic, and our characterization of the failure probabilities under learning/planner error can help inform policy, since policymakers may deem them to be at an acceptable threshold (or not) depending on the deployment.
Now, our lexicographic solution may be one of many possible solutions to corrigibility if we are in Basinland, or the only one if we’re in Adversaria — we think it is an interesting open question to explore other approaches to corrigibility, especially seeing which ones are most scalable to LLMs today. While our formal construction applies more generally to any restricted resource(s) beyond an off-switch that we want an agent to be corrigible with respect to (e.g. a budget, file system permissions, safety filters in robotics, etc), how cost-effectively it will scale in practice over RLHF/RLAIF with current LLMs/VLMs/VLAs is an important empirical question. It is worth noting that one can flexibly incorporate our lexicographic framework with RLHF/RLAIF by having the RLHF/RLAIF reward signal be the task reward U5.
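Concretely, in terms of the toy gap-weight sketch above (again my own illustration; `safety_evals` and `reward_model_score` are hypothetical stand-ins), this just means feeding a bounded reward-model score in as the lowest head:

```python
import math

def head_values(state, action, safety_evals, reward_model_score):
    """Toy wiring: U1-U4 come from dedicated safety evaluations, and the
    RLHF/RLAIF reward model only ever enters as the lowest-priority head U5,
    squashed into [-1, 1] so the gap weights in the sketch above stay valid."""
    u1_to_u4 = [evaluate(state, action) for evaluate in safety_evals]
    u5 = math.tanh(reward_model_score(state, action))
    return u1_to_u4 + [u5]
```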
In fact, in separate work, we show that the cost of alignment also has an impact on the welfare to be gained from AI, such as the feasibility of funding UBI from AI rents. One of the nice features of lexicography is its modularity and interpretability. For instance, one could imagine placing additional terms having to do with markers of human well-being (such as the agent encouraging face-to-face contact) below the corrigibility layer U1-U4, but above (or as part of) the task reward U5. Whether we can get consensus on such terms from long-term scientific studies, the most famous being Harvard’s “happiness” study, is another interesting open question.
Limits under adversary hacking + the “decidable island”
We also show in Section 3 (Proposition 4) that verifying corrigibility (or even more generally, “non-catastrophic behaviors”) of an arbitrary, possibly hacked agent is formally undecidable. Furthermore, in Corollary 1 we prove a relativized version of this undecidability as well, which is relevant for showing the limits of current scalable oversight protocols that are arithmetizable. In other words, absolute, horizon-unbounded safety guarantees do not stabilize at any finite oracle level.
Proposition 4 and Corollary 1 are simple reductions to the halting problem and the arithmetical hierarchy. They might be folklore, but as I couldn’t find them written down explicitly, I included them for completeness.
However, formalizing the barriers above clarifies what remains tractable. While we showed unbounded-horizon safety verification is undecidable, Proposition 5 shows that polynomial-horizon safety can be certified in randomized polynomial time with zero-knowledge & differential/distributional privacy guarantees. This carves out a “decidable island” within a sea of undecidability: although no procedure can certify safety forever, we can repeatedly and efficiently audit behavior over short horizons comparable to real red-teaming evaluations (on the order of thousands of tokens), revealing only whether a safety violation occurred and nothing about model internals or user data.
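A cartoon of what such a short-horizon audit loop looks like operationally (my sketch only; the zero-knowledge and privacy guarantees come from the actual Proposition 5 protocol, which this toy does not implement): roll the agent out in a sandbox for a bounded horizon, apply the safety predicate, and release only the single pass/fail bit.

```python
import random

def audit(agent_step, unsafe, initial_states, horizon=2000, trials=50, seed=0):
    """Finite-horizon audit: roll the agent out for `horizon` steps from
    randomly sampled starting states and report only whether any rollout
    violated the safety predicate -- a single bit, no internals, no transcripts."""
    rng = random.Random(seed)
    for _ in range(trials):
        state = rng.choice(initial_states)
        for _ in range(horizon):
            state = agent_step(state)
            if unsafe(state):
                return "FAIL"        # a violation occurred within the horizon
    return "PASS"                    # no violation observed; re-audit periodically
```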
Altogether what this suggests for practice
1. While compressing values is recommended, stop collapsing all norms into one unrestricted reward signal, like in RLHF/RLAIF.
Scalar reward makes full corrigibility impossible.
2. Potentially use lexicographic multi-head objectives.
If safety and performance compete, performance should lose by construction. One can incorporate RLHF/RLAIF into our lexicographic construction by having it be the task reward head U5. Using the architecture-agnostic analytic form given by Theorem 3(i) to see if we can empirically achieve acceptable failure probabilities under learning/planning error in existing deployments with current models is an important open question.
3. Focus on safety-critical slices, not full coverage.
Reward hacking is inevitable globally under practical resource constraints.
Instead, target the high-impact parts of the state space D with multi-turn red-teaming evaluations.
4. Use short-horizon verification-and-refresh cycles.
Unbounded formal oversight is impossible.
Finite-horizon oversight is both tractable and realistic, and in fact, zero-knowledge & differential/distributional privacy guarantees are feasible in randomized polynomial time by Proposition 5.
5. View corrigibility as a “value set zero” — a small, “neutrally universal”, robust baseline that ensures human control and limits catastrophic behavior.
Closing
The intrinsic barriers paper in Part I shows that alignment has normative information-theoretic limits:
too many values, too many agents, or too much state space, and alignment becomes fundamentally intractable, even for computationally unbounded agents.
The corrigibility paper in Part II shows that, despite those limits, at least one safety target — corrigibility with lexicographic heads — is provably achievable, even under approximation, partial observation, self-modification, and multi-step interactions.
Taken together, these two papers point toward a pragmatic alignment strategy:
Acknowledgements
We thank the Burroughs Wellcome Fund (CASI Award), the UK AI Security Institute (AISI) Challenge Fund, and the Foresight Institute for funding this research. We also thank Scott Aaronson, Nina Balcan, Michael K. Cohen, Shafi Goldwasser, Andreas Haupt, Richard Hu, Rubi Hudson, J. Zico Kolter, Jacob Pfau, and Max Simchowitz for helpful discussions and manuscript feedback across both papers.