I commend this paper, and the theory behind it, to people's attention! It reminds me of formal verification in software engineering, but applied to thinking systems.
It used to be that, when I wanted to point to an example of advanced alignment theory, I would single out June Ku's MetaEthical AI, and also what Vanessa Kosoy and Tamsin Leake are doing. Lately I have also been mentioning PRISM, as an example of a brain-derived decision theory that might result from something like CEV.
Now we have this formal theory of alignment too, based on the "FMI" theory of cognition. There would be a section like this in the mythical "textbook from the future" that tells you how to align a superintelligence.
So if I were working for a frontier AI organization, trying to make human-friendly superintelligence, I might be looking at how to feed PRISM and FMI to my AlphaEvolve-like hive of AI researchers, as a first draft of a final superalignment theory.
Thanks! I really appreciate this—especially the connection to formal verification, which is a useful analogy for understanding FMI's role in preventing drift in internal reasoning. The comparison to PRISM and MetaEthical AI is also deeply encouraging—my hope is to make the structure of alignment itself recursively visible and correctable.
My recent paper, Why AI Alignment Is Not Reliably Achievable Without a Functional Model of Intelligence: A Model-Theoretic Proof, argues that alignment under conceptual novelty isn't something you can train into a system or enforce externally. It must be structurally guaranteed from within, and that guarantee depends on the presence of specific internal functions. These functions, taken together, define what I refer to as the Functional Model of Intelligence (FMI).

The paper shows that even capable systems, even those that generalize well under most conditions, will eventually fail to preserve alignment unless they possess a specific internal structure. Without it, misalignment isn't a matter of insufficient training or poor objective design; it's baked into the architecture.

Even if this particular FMI turns out to be incomplete, the core result still holds: alignment requires a complete and recursively evaluable functional model of intelligence. That is, some FMI must exist for alignment to be achievable under novelty. The proof shows why.
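To give the core claim a concrete logical shape (the predicate names below are my own shorthand for this comment, not the paper's formal notation), it can be read roughly as:

$$
\forall S \;\Big[\, \mathrm{AlignedUnderNovelty}(S) \;\Rightarrow\; \exists F \;\big(\mathrm{FMI}(F) \,\wedge\, \mathrm{Complete}(F) \,\wedge\, \mathrm{RecursivelyEvaluable}(F) \,\wedge\, \mathrm{Implements}(S, F)\big) \,\Big]
$$

In words: any system that preserves alignment across arbitrary conceptual novelty must implement some complete, recursively evaluable functional model of intelligence. The contrapositive is the failure mode described above: a system lacking that internal structure will eventually fail to preserve alignment, regardless of training or objective design.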
If you're curious about this result but prefer to think in diagrams rather than equations, I'm running a workshop on Visualizing AI Alignment on August 10, 2025.
The workshop will walk through the main ideas of the paper—like recursive coherence, conceptual drift, and structural completeness—using fully visual reasoning tools. The goal is to make the core constraints of alignment immediately visible, even to those without a formal logic background.
Virtual attendance is free, and we're accepting brief submissions through July 24, 2025.