This post summarizes my two related papers that will appear at AAAI 2026 in January:

Part I: Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis (selected for oral presentation)
Part II: Core Safety Values for Provably Corrigible Agents

These papers try to answer two questions:

What factors make alignment hard in principle, and
Which safety targets remain tractable despite those limits.

The first paper gives formal lower bounds on the difficulty of AI alignment that apply even in best-case scenarios, yielding the first “No-Free-Lunch” theorems for alignment, namely: (1) aligning to “all human values” is intractable, and (2) reward hacking is inevitable with large state spaces and bounded agents.

The second paper shows that the notion of corrigibility introduced by @So8res et al. in 2015 fits formally into a narrow sliver of objectives that survive the main lower bound. This addresses their decade-long open question by providing the first rigorous single- and multi-step guarantees for corrigibility, even in partially observed environments. It also lets us explicitly bound the failure probability of corrigibility under learning and planning error.

This post gives only the intuitions and high-level takeaways, not the proofs. Please see the papers for details 🙂. For a long-form research talk on both papers, here is a video recording from ILIAD ’25 along with slides. A much shorter, 10-minute version can be found here.

Why an economics/game theory + complexity theory perspective?

Game theory gives us a model of agents as rational systems: agents exchange messages, update beliefs, and choose actions. But game theorists rarely ask: How many resources do they have to exchange, across all possible communication protocols? How much information is minimally needed to converge? What if the agents are computationally bounded?

Complexity theory adds those missing considerations. Together they can give us something akin to a “Turing machine model of computation for multi-objective, multi-agent alignment”, enabling us to study the intrinsic complexity of alignment itself, not merely that of particular methods.

Specifically, this combination lets us map alignment into a clean abstraction, “$⟨M, N, ε, δ⟩$-agreement” (described below), that allows us to prove lower bounds: $M$ objectives × $N$ agents × a task state space o...

Scalable Oversight
