This post is crossposted from my Substack, Structure and Guarantees, where I’ve been exploring how formal verification might scale to more-complex intelligent systems. This piece argues that compilers—already used in limited forms of recursive self-improvement—offer a concrete setting for studying some AI-safety challenges. The previous post in this sequence is “Formal Verification: The Ultimate Fitness Function”.
My last post pitched formal verification as an enabler of improved evolutionary search. While advances in AI capabilities and safety are often presented as trading off against each other, there is hope for improving both dimensions through adoption of formal verification. While natural selection takes a long time to “evaluate” the fitness of individuals, measuring only the concrete lives that they lead, formal verification allows up-front evaluation in all possible scenarios, dramatically shortening the feedback cycle. At the same time, the kinds of proofs that underlie formal verification can potentially establish properties that we want to make sure stay true during an evolutionary process (like safety properties, in both the AI sense and, in a bonus pun, the technical formal-methods sense).
There are a lot of details to get right, to set up such a framework that is ready to ensure safety of powerful AI systems. The canonical phenomenon that makes safety guarantees hard is recursive self-improvement, where AI systems design their own successors. A “small” misalignment between the early AI systems and their human creators can be amplified, as the AIs build up increasingly powerful capabilities, perhaps at a pace too fast for humans to notice and react usefully. How can we hope to anticipate all of the “tricks” that an intelligence much greater than our own might employ?
The good news is that the formal-methods research community has a fair amount of experience that is both broad (proving many kinds of systems) and deep (integrated proof of systems with