This post is crossposted from my Substack, Structure and Guarantees, where I’ve been exploring how formal verification might scale to more-complex intelligent systems. This piece argues that compilers—already used in limited forms of recursive self-improvement—offer a concrete setting for studying some AI-safety challenges. The previous post in this sequence is “Formal Verification: The Ultimate Fitness Function”.
My last post pitched formal verification as an enabler of improved evolutionary search. While advances in AI capabilities and safety are often presented as trading off against each other, there is hope for improving both dimensions through adoption of formal verification. While natural selection takes a long time to “evaluate” the fitness of individuals, measuring only the concrete lives that they lead, formal verification allows up-front evaluation in all possible scenarios, dramatically shortening the feedback cycle. At the same time, the kinds of proofs that underlie formal verification can potentially establish properties that we want to make sure stay true during an evolutionary process (like safety properties, in both the AI sense and, in a bonus pun, the technical formal-methods sense).
There are a lot of details to get right in setting up such a framework so that it is ready to ensure the safety of powerful AI systems. The canonical phenomenon that makes safety guarantees hard is recursive self-improvement, where AI systems design their own successors. A “small” misalignment between the early AI systems and their human creators can be amplified as the AIs build up increasingly powerful capabilities, perhaps at a pace too fast for humans to notice and react usefully. How can we hope to anticipate all of the “tricks” that an intelligence much greater than our own might employ?
The good news is that the formal-methods research community has a fair amount of experience that is both broad (proving many kinds of systems) and deep (integrated proof of systems with many parts). While most writing about AI safety adopts a more philosophical tone because we don’t yet have the superintelligences to evaluate techniques against, analogies to the formal-methods research literature will help ground us in more of an engineering tone. A convenient stand-in for rogue superintelligences will be security vulnerabilities, which also allow a system to take actions very much at odds with our intentions. There are even malevolent intelligences (“hackers”) at the other ends of exploited security vulnerabilities.
My “don’t panic” message animating this post and plenty of planned follow-ups is that the formal-methods community knows how to build software and hardware systems proved convincingly to be protected against common and important kinds of security attacks. There is still plenty of research and engineering left to do to expand the scope of such guarantees. There will be more work still needed to expand coverage to include the full scope of AI safety. But we can choose concrete starting points with minimal risk of spillover into real-world catastrophe, where we still have a chance to learn about the dynamics of controlling self-improving systems.
I propose a challenge problem grounded in one of my research specialties. A compiler is a software program that translates between programming languages. The classic example turns a programming language that is pleasant for humans to write into the machine language spoken natively by a CPU. Compilers form one of the most-compelling early targets for formal verification, because the correctness condition is so obvious (technically, taking advantage of formal semantics of programming languages). Formally verified compilers are actually making their way into production systems, including the one I mentioned in the previous post, whose generated code you are probably using right now to read this text. There are both elegant theoretical foundations and street smarts about how to build and maintain verified compilers, providing a good setup for a compelling challenge problem.
As a result, this domain works unusually well for studying questions of broad interest in AI safety. For decades, compilers have already been used to improve themselves, and yet there are also clear formal definitions of what makes a compiler safe.
There will be an element of good news and bad news. I’ll start by making the case that this challenge problem occupies a sweet spot for learning about verified self-improvement. Then I’ll bring in some classic gotchas regularly faced in more-traditional compiler verification, alongside typical solutions. Thinking about these gotchas will help us maintain a healthy level of wariness about how hard it is to catch all undesirable behavior.
The Case for the Importance and Difficulty of the Problem
Not everyone reading may realize how important compilers are to recent AI progress, let alone the full scope of the tech industry. We know generative AI is computationally expensive enough that many companies are quickly building new data centers to house the machines that drive it. The code that finally runs on the hardware is produced by compilers (prominent examples in machine learning include XLA and Triton). The better a job the compilers do optimizing the code, the less hardware we need, the fewer new data centers we need, and the lower the costs to companies and their customers. Further, compiler bugs can lead to arbitrary misbehavior by deployed AI, potentially triggering security breaches. I hope it’s now clear that the economic importance of compilers is high, so there is real incentive to use any available tools to make them more effective, and those tools include recursive self-improvement.
There is actually a long history of thinking about compilers that get involved in their own futures. Ken Thompson’s Turing Award acceptance speech, called “Reflections on Trusting Trust”, addressed the hazards of sneaky compilers. It’s possible to be in a situation where deployed software, which we’ll call S, contains an important security backdoor, yet the source code of that program contains no evidence of that backdoor, and neither does the source code of the compiler, which we’ll call C, used with it.
How do we arrive in such a counterintuitive state?
1. Modify the source code of C so that it notices when it is compiling S, at which point it silently adds in the backdoor.
2. Additionally modify the source code of C so that it notices when it is compiling itself (some future version of C), at which point it adds in exactly the modification from the previous step.
3. Compile C with itself and install the newly sneaky version everywhere.
4. Undo the changes to C’s code from the first two steps, which you kept to yourself all along, so no one has source-code evidence of the trick.
We are now in the steady state where the backdoors are always present on the system that initially received the tricky compiler, but no source code on the planet contains the evidence. An analyst needs to dig into machine code to realize something is off.
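The four steps above can be condensed into a toy model. This is purely my own illustration in Python, with invented strings standing in for source code and machine code; a real attack operates on binaries, and “taint” here is just a boolean flag:

```python
# A toy model of the four steps above: "binaries" are (text, tainted)
# pairs, and the strings are invented stand-ins for real code.

def compile_with(compiler_is_tainted: bool, source: str):
    """Model of running a compiler binary on some source code.

    Returns (output_text, output_is_tainted)."""
    if not compiler_is_tainted:
        return source, False                  # honest compilation
    if "login" in source:
        return source + " [BACKDOOR]", True   # step 1: backdoor S
    if "compiler" in source:
        return source, True                   # step 2: propagate the taint
    return source, False

clean_compiler_src = "compiler source, with no trace of any backdoor"

# Step 3: bootstrap once with the one-time tricky binary.
_, tainted = compile_with(True, clean_compiler_src)

# Step 4: from now on, only clean source is ever compiled, and yet...
_, next_gen_tainted = compile_with(tainted, clean_compiler_src)
login_binary, login_tainted = compile_with(next_gen_tainted, "login program")

print(next_gen_tainted, login_tainted)  # True True: the backdoor persists
```

The point of the model is the fixed point in step 2: once the taint is installed in one binary, every future generation compiled from clean source inherits it.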
Arguably this precedent from 1984 shows some of the same dynamics that researchers in AI safety worry about, making it a good setting to tackle them with principled tools. There is a chance to occupy an unusual middle ground in the spectrum of discussions about AI safety. On the one hand, we have very practical work aimed at today’s widely used AI systems, most of which leave the majority of the heavy lifting to deep learning. Notwithstanding current research on topics like explainability, I think the jury remains out on whether this kind of system even adheres to good future definitions of alignment, and we should also consider the possibilities opened by codesigning intelligent systems to be easier to specify and prove. On the other hand, there has been a lot of thought about alignment and safety for dramatically more capable AI systems of the future, but, since we don’t have those systems yet, it is hard to do empirical work driven by building nontrivial implementations. I’ll claim that self-improving compilers are appealing because they exist today while exposing security challenges similar to those long discussed for AI alignment. To be fully convincing, though, we’ll need to include some engineering features that may not be obvious to someone from the research community around compiler verification; I’ll get to those shortly.
Functional Correctness of Compilers
Now imagine a compiler compiling itself (in addition to application programs of fundamental interest) over and over again. What we hope is that it does a better job each time, making the code faster, reducing memory requirements, and so on. Here’s a diagram of the scenario.
To pair the approach with formal verification, we start by doing a machine-checked proof that the original compiler is correct. That is, we want to know that, whichever program we feed into the compiler (the source program), the resulting output program (the target program) exhibits the same behavior. For instance, we might require that a target program, given any inputs, produces the same outputs as its source program. At this point, we notice a lovely confluence: a compiler itself takes inputs and produces outputs, and now the compiler’s theorem applies to compiling itself! The original compiler proof probably involved specialized mathematics (around formal semantics of programming languages), but now all we need to know is that the compiler preserves functional behavior. In fact, for any number of iterations of this loop of recursive self-improvement, correctness of the final compiler follows by mathematical induction (turning guarantees of individual steps into guarantees of full paths from start to later steps). The proof is a composition of the original correctness theorem with a number of specializations of itself to successive versions of its own source code.
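As a sanity check on the induction, here is a deliberately minimal Python model (my own illustration, not the structure of any real verified compiler): programs are functions, and a single “verified” compiler pass wraps a program while preserving its input/output behavior by construction.

```python
# Minimal model: if one compilation step preserves behavior, then any
# chain of self-recompilations does too, by induction on generations.

def compile_step(program):
    """Stand-in for a verified compiler: it may restructure the program
    (here, by wrapping it) but must preserve observable behavior."""
    def target(x):
        return program(x)   # identical input/output behavior
    return target

def square(x):
    return x * x

compiler = compile_step
for generation in range(5):
    compiler = compiler(compiler)        # recompile the compiler with itself
    compiled_square = compiler(square)   # use the new generation on an app
    # The inductive guarantee: every generation still preserves behavior.
    assert all(compiled_square(n) == square(n) for n in range(10))
```

The `assert` plays the role of the composed correctness theorem: at every generation, the guarantee for a single step specializes to the compiler’s own source and chains with the guarantees already established for earlier generations.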
This process is a classic example of twisty recursive structure in computer science that somehow just works. All the same, it has relatively rarely been realized in practice. The best-known verified compiler today is CompCert, whose source language (C) is different from the compiler’s own implementation language (the native language of Rocq), so for relatively unsubtle reasons, CompCert can’t compile itself. My own project Fiat Cryptography involves a compiler implemented in Rocq and compiling a subset of Rocq’s native language, but that subset is not nearly expressive enough for the compiler to be written in it, leading to the same obstacle to recursive self-improvement. Probably the highest-profile example these days that does achieve verified self-compilation is CakeML.
Now let’s try to think of this recursive self-improvement loop as more like what we see going on today, for the time being mostly driven by human engineering effort, to optimize tech stacks for AI training and inference. An ideal challenge problem for reasoning about such work being done safely would include the engineering complexities that make these systems hard to build and reason about. I’ve identified two complexities that are largely absent both from prior work on compiler verification and from formal approaches to AI alignment. I also think there’s a case to be made (though I can’t fit it into this single post) that we should expect both complexities to stay with us – they are well-justified as tools for producing increasingly capable intelligent systems.
Unlike past verified compilers, which involve relatively tame search, we will probably need aggressive search through large spaces of programs if we hope to mimic how humans are making progress improving these stacks. Intuition for the staying power of this technique: we routinely run into design challenges hard enough that we can’t just zero in directly on competitive solutions.
Motivated partly by the costs of exploring those search spaces, we almost certainly need to rely on extensive parallelism, with many different threads of computation happening simultaneously in the compiler itself, not just the applications it compiles. Indeed, deep-learning applications in practice depend on large amounts of parallelism, and in my opinion we haven’t seen nearly enough similar structure in compilers, let alone verified ones. Intuition for the staying power of this technique: getting answers more quickly or at lower cost already provides outsize benefit in today’s economy, not to mention what may happen in the future with superintelligence, and today most computer scientists agree that parallelism is critical to continued performance improvement after the end of Moore’s law (see There’s Plenty of Room at the Top).
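The two complexities can be combined into one small sketch. Everything here is invented for illustration – the candidate set, the cost model, and the use of exhaustive testing as a cheap stand-in for a formal equivalence proof: propose rewrites in parallel, keep only candidates checked equivalent to the source program, and take the cheapest survivor.

```python
# Sketch of verified parallel search over a space of candidate rewrites.
from concurrent.futures import ThreadPoolExecutor

def reference(x):
    return x + x + x + x + x + x + x + x   # the source program: 8 * x

candidates = [
    ("shift", lambda x: x << 3, 1),        # (name, program, cost estimate)
    ("mul",   lambda x: x * 8,  2),
    ("wrong", lambda x: x * 7,  1),        # an unsound "optimization"
]

def check(candidate):
    name, program, cost = candidate
    # Exhaustive checking over a small domain stands in for a proof of
    # equivalence; only candidates that pass may be selected at all.
    ok = all(program(x) == reference(x) for x in range(256))
    return (name, cost) if ok else None

with ThreadPoolExecutor() as pool:
    verified = [result for result in pool.map(check, candidates) if result]

best = min(verified, key=lambda result: result[1])
print(best)  # ('shift', 1) -- the unsound candidate never survives
```

The design point worth noticing: verification acts as a filter between proposal and selection, so even an arbitrarily creative (or arbitrarily devious) proposer can only ever win with a candidate that has passed the check.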
So there’s my framing of a challenge problem that I think can be explored fruitfully in near-term research: recursive self-improvement through a verified compiler that searches large spaces of alternative programs in parallel. The challenges and solutions that arise in such an effort can inform broader thinking on AI safety. Let me anticipate a few of them by bringing up issues that show up repeatedly in more-conventional formal verification, typically tied to security concerns.
Gotcha #1: Nondeterminism
Nondeterminism seems like an innocuous property of a program: it may not always give the same answer, and we give it free rein to choose a different answer each time we call it on the same inputs. The trouble comes when we imagine our worst enemy getting to break ties and choose between possible outcomes. In an AI-safety context, the decisions might be made by an inscrutable superintelligence, but similar risks are already broadly studied in computer security, where the so-called adversary is conventionally a person.
A classic security property that interacts poorly with nondeterminism is secure information flow. Consider the following example of an employee accessing his company email inbox, in two different states of the world.
Everything in the email server that the employee is supposed to know about is the same between the two worlds. The only difference is in his boss’ email inbox. Say we are generous with nondeterminism in specifying the email server. When a user asks for a list of messages, it can be delivered in any order – say the order in which the records are stored on-disk, a handy performance optimization. However, if we allow complete freedom in this choice, we don’t prevent the email server from consulting other users’ secrets to choose the order. And complete freedom, even to choose devious orderings, is exactly what the naive, nondeterministic functional specification allows.
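A toy version of the two-worlds example makes the failure concrete (all data invented): the specification says the employee’s messages may come back “in any order”, and a devious implementation uses exactly that freedom to leak a fact about the boss’ inbox through the order it picks.

```python
# Two worlds differ only in the boss' secret inbox, yet the employee's
# observable view differs too: the nondeterministic spec permits a leak.

def list_messages(employee_inbox, boss_inbox):
    messages = sorted(employee_inbox)
    # Devious but specification-compliant: both orders are "allowed".
    if any("layoffs" in m for m in boss_inbox):
        messages.reverse()
    return messages

world_a = list_messages(["hi", "lunch?"], ["budget report"])
world_b = list_messages(["hi", "lunch?"], ["layoffs soon"])
print(world_a != world_b)  # True: noninterference fails, because the
# employee's view changed even though only the boss' secrets differ
```

Noninterference would demand `world_a == world_b` whenever only secret inputs vary; the “any order is fine” specification is too weak to rule this implementation out.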
It actually isn’t that hard to find similar risks in our compiler example. Say the compiler is actually structured as a server that compiles programs for many users. Users consider their own source code confidential, but the obvious nondeterministic specification for a compiler allows compilers to resolve incidental choices in ways that reveal code across users.
The message is that nondeterminism and secure information flow don’t mix well, and perhaps that message is sufficient to help us get an email server’s specification right. A compiler is trickier. The problem is that its natural specification is inherently nondeterministic, to give compiler-writers freedom to invent new optimizations – as we surely want our recursive self-improvement loop to do. A given source program has many possible target programs that meet the specification (each computing the same result in different ways). We have to generalize the advice “avoid nondeterminism” to the fancier setting of a compiler. I’ll sketch one possible solution below, after introducing another challenge.
Gotcha #2: Side Channels
Imagine we lock down our email server’s specification to enforce deterministic answers, but even that move isn’t enough to secure the system.
The employee gets the same answer across scenarios, but now the amount of time it takes for the answer to arrive reveals something about his boss’ inbox. The author of the specification made a classic mistake: forgetting to take into account an important nonfunctional dimension (meaning not that the system gives wrong answers, but that there are aspects of its behavior beyond the answers themselves!), in this case running time.
The good news is that just thinking of running time as an important dimension allows us to lock down this bad information flow, with a relatively general specification pattern following the standard technique of noninterference. The bad news is that merely measuring time until the full answer is ready is inadequate to consider all real-world timing risks. Especially when we add in those risks, we find ourselves in the domain of side channels, surprising and indirect ways in which systems leak information. Consider the following extension of our running example.
Now the two scenarios generate the same answer in the same total time – but the sequence of internal actions (writing to RAM) happens on a different schedule in each world. Why do we worry about such a fine-grained difference? Imagine the email server runs on the same cloud service as a spy’s server, and the spy is periodically checking what has changed in memory. I won’t go into more detail about feasible attacks in this post, but check out famous name-brand vulnerabilities Spectre and Meltdown for more. The point is, since the cloud service may share RAM across tenant services, there is the potential for information leakage through that shared resource, including via timing.
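Even the coarsest form of the timing risk is easy to reproduce with a classic password check (the secret and function names below are invented for illustration). Both checks return identical answers on every input; only the leaky one’s running time depends on how long a prefix of the guess matches the secret, which is exactly the kind of signal an attacker can measure.

```python
# Same functional behavior, different timing behavior: the textbook
# early-exit comparison versus a constant-time comparison.
import hmac

SECRET = "hunter2"

def leaky_check(guess: str) -> bool:
    if len(guess) != len(SECRET):
        return False
    for a, b in zip(guess, SECRET):
        if a != b:        # exits at the FIRST mismatch, so running time
            return False  # reveals the length of the matching prefix
    return True

def safer_check(guess: str) -> bool:
    # Standard remedy: compare in time independent of mismatch position.
    return hmac.compare_digest(guess.encode(), SECRET.encode())

assert leaky_check("hunter2") and safer_check("hunter2")
assert not leaky_check("hunter1") and not safer_check("hunter1")
```

A purely functional specification cannot tell these two implementations apart, which is why the specification itself has to mention time (or whatever other resource forms the channel).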
There has been plenty of recent research on this kind of problem, formally verifying absence of undesirable side channels. One general class of solutions riffs on the idea of avoiding nondeterminism, without going all the way. For instance, one of my recent projects shows how to prove that compilers avoid introducing timing side channels, by requiring that every individual compiler run chooses one deterministic behavior, influenced just by parts of program state that need not be kept secret. We have applied a similar approach to proving that hardware keeps servers from learning each others’ secrets.
This discussion suggests a powerful design pattern: define a space of allowed behaviors (or behavioral functions), where any given function is clearly “locked down” in a safe way, and allow a system to choose exactly one of them. For instance, we may somehow know that every function respects human rights, but different functions design future factories in very different ways. The reusable trick is not to go to either extreme of a specification that is either “hand-wavily” nondeterministic or rigidly deterministic, instead carefully constructing a menu of acceptable deterministic functions and letting a system choose one. If we designed the menu properly, then the system “gets to surprise us exactly once,” after which its behavior is confined to a particular space we identified as safe. This post is mostly focused on pointing out a problem worth researching, though the idea from this paragraph is the closest thing I have to a reusable solution principle for alignment for now; the problem seems to have a good chance of producing more such principles, if tackled properly.
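The pattern can be sketched in a few lines (names invented for illustration). Each menu entry is deterministic and vetted in advance – here, “vetted” means its output order depends only on public data. The system surprises us exactly once, by choosing an entry; after that, its behavior is a single fixed function.

```python
# "Choose exactly once from a vetted menu": neither hand-wavy
# nondeterminism nor rigid determinism.

MENU = {
    "by_name":   lambda records: sorted(records),
    "by_length": lambda records: sorted(records, key=len),
    # A secret-dependent ordering simply never makes it onto the menu.
}

class LockedDownServer:
    def __init__(self, choice: str):
        # The one free choice; everything afterward is deterministic.
        self.behavior = MENU[choice]

    def list_messages(self, public_records):
        return self.behavior(public_records)

server = LockedDownServer("by_length")
print(server.list_messages(["lunch?", "hi"]))  # ['hi', 'lunch?']
```

The safety argument then splits cleanly in two: a per-entry proof that each menu item is acceptable, done once up front, and a trivial runtime check that the system’s choice really is drawn from the menu.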
Conclusion
The concrete challenge problem is to build a compiler that
can compile itself,
applies large-scale parallel compute, of the style familiar from deep learning today, for better decision-making, and
is formally verified to meet not just functional requirements but also whatever nonfunctional requirements turn out to be important.
I’m trying to make the case that this problem is at a “just-right” level of difficulty and applicability to core AI-safety problems. Many of the foundational questions have already been explored extensively in formal verification, especially around security against human adversaries. We have both theoretical techniques and significant implementations. I will sketch some of the most-interesting elements in upcoming posts, including:
Techniques for writing specifications to avoid consequential mistakes
Scalable automation of proof-writing for complex code bases
Extending coverage of the code-improvement loop to cover not just software but also the hardware it runs on
How to avoid making the theorem-prover itself a source of dangerous bugs
How to structure compilers and other important pieces of infrastructure to make verification easier
Promising techniques for combining the best of deep learning and logic-based methods
The very next post, though, will consider a sci-fi scenario that adds the twist of agents working together to build better systems, even as they don’t fully trust each other, in contrast to this post’s idea of one intelligence expending a lot of compute to improve itself in isolation. We’ll see how formal verification can still help out.
This post is crossposted from my Substack, Structure and Guarantees, where I’ve been exploring how formal verification might scale to more-complex intelligent systems. This piece argues that compilers—already used in limited forms of recursive self-improvement—offer a concrete setting for studying some AI-safety challenges. The previous post in this sequence is “Formal Verification: The Ultimate Fitness Function”.
My last post pitched formal verification as an enabler of improved evolutionary search. While advances in AI capabilities and safety are often presented as trading off against each other, there is hope for improving both dimensions through adoption of formal verification. While natural selection takes a long time to “evaluate” the fitness of individuals, measuring only the concrete lives that they lead, formal verification allows up-front evaluation in all possible scenarios, dramatically shortening the feedback cycle. At the same time, the kinds of proofs that underlie formal verification can potentially establish properties that we want to make sure stay true during an evolutionary process (like safety properties, in both the AI sense and, in a bonus pun, the technical formal-methods sense).
There are a lot of details to get right, to set up such a framework that is ready to ensure safety of powerful AI systems. The canonical phenomenon that makes safety guarantees hard is recursive self-improvement, where AI systems design their own successors. A “small” misalignment between the early AI systems and their human creators can be amplified, as the AIs build up increasingly powerful capabilities, perhaps at a pace too fast for humans to notice and react usefully. How can we hope to anticipate all of the “tricks” that an intelligence much greater than our own might employ?
The good news is that the formal-methods research community has a fair amount of experience that is both broad (proving many kinds of systems) and deep (integrated proof of systems with many parts). While most writing about AI safety adopts a more philosophical tone because we don’t yet have the superintelligences to evaluate techniques against, analogies to the formal-methods research literature will help ground us in more of an engineering tone. A convenient stand-in for rogue superintelligences will be security vulnerabilities, which also allow a system to take actions very much at odds with our intentions. There are even malevolent intelligences (“hackers”) at the other ends of exploited security vulnerabilities.
My “don’t panic” message animating this post and plenty of planned follow-ups is that the formal-methods community knows how to build software and hardware systems proved convincingly to be protected against common and important kinds of security attacks. There is still plenty of research and engineering left to do to expand the scope of such guarantees. There will be more work still needed to expand coverage to include the full scope of AI safety. But we can choose concrete starting points with minimal risk of spillover into real-world catastrophe, where we still have a chance to learn about the dynamics of controlling self-improving systems.
I propose a challenge problem grounded in one of my research specialties. A compiler is a software program that translates between programming languages. The classic example turns a programming language that is pleasant for humans to write into the machine language spoken natively by a CPU. Compilers form one of the most-compelling early targets for formal verification, because the correctness condition is so obvious (technically, taking advantage of formal semantics of programming languages). Formally verified compilers are actually making their way into production systems, including the one I mentioned in the previous post, whose generated code you are probably using right now to read this text. There are both elegant theoretical foundations and street smarts about how to build and maintain verified compilers, providing a good setup for a compelling challenge problem.
As a result, this domain works unusually well for studying questions of broad interest in AI safety. For decades, compilers have already been used to improve themselves, and yet there are also clear formal definitions of what makes a compiler safe.
There will be an element of good news and bad news. I’ll start by making the case why this challenge problem occupies a sweet spot for learning about verified self-improvement. Then I’ll bring in some classic gotchas regularly faced by more-traditional compiler verification, alongside typical solutions. Thinking about these gotchas will help us maintain a healthy level of wariness about how hard it is to catch all undesirable behavior.
The Case for the Importance and Difficulty of the Problem
Not everyone reading may realize how important compilers are to recent AI progress, let alone the full scope of the tech industry. We know generative AI is computationally expensive enough that many companies are quickly building new data centers to house the machines that drive it. The code that finally runs on the hardware is produced by compilers (prominent examples in machine learning include XLA and Triton). The better a job the compilers do optimizing the code, the less hardware we need, the fewer new data centers we need, and the lower costs to companies and their customers. Further, compiler bugs can lead to arbitrary misbehavior by deployed AI, potentially triggering security breaches. I hope it’s now clear that the economic importance of compilers is high, so there is real incentive to use any available tools to make them more effective, and those tools include recursive self-improvement.
There is actually a long history of thinking about compilers that get involved in their own futures. Ken Thompson’s Turing Award acceptance speech, called “Reflections on Trusting Trust”, addressed the hazards of sneaky compilers. It’s possible to be in a situation where deployed software, which we’ll call S, contains an important security backdoor, yet the source code of that program contains no evidence of that backdoor, and neither does the source code of the compiler, which we’ll call C, used with it.
How do we arrive in such a counterintuitive state?
Arguably this precedent from 1984 shows some of the same dynamics that researchers in AI safety worry about, making it a good setting to tackle them with principled tools. There is a chance to occupy an unusual middle ground in the spectrum of discussions about AI safety. On the one hand, we have very practical work aimed at today’s widely used AI systems, most of which put majorities of the heavy lifting in the hands of deep learning. Current research notwithstanding on topics like explainability, I think the jury remains out on whether this kind of system even adheres to good future definitions of alignment, and we should also consider possibilities from codesigning intelligent systems to be easier to specify and prove. There has also been a lot of thought about alignment and safety for dramatically more capable AI systems of the future, but, since we don’t have those systems yet, it is hard to do empirical work driven by building nontrivial implementations. I’ll claim that self-improving compilers are appealing for existing today while exposing similar security challenges to those long-discussed for AI alignment. To be fully convincing, though, we’ll need to be sure to include some engineering features that may not be obvious to someone from the research community around compiler verification; I’ll get to those shortly.
Functional Correctness of Compilers
Now imagine a compiler compiling itself (in addition to application programs of fundamental interest) over and over again. What we hope is that it does a better job each time, making the code faster, reducing memory requirements, and so on. Here’s a diagram of the scenario.
To pair the approach with formal verification, we start by doing a machine-checked proof that the original compiler is correct. That is, we want to know that, whichever program we feed into the compiler (the source program), the resulting output program (the target program) exhibits the same behavior. For instance, we might require that a target program, given any inputs, produces the same outputs as its source program. At this point, we notice a lovely confluence: a compiler itself takes inputs and produces outputs, and now the compiler’s theorem applies to compiling itself! The original compiler proof probably involved specialized mathematics (around formal semantics of programming languages), but now all we need to know is that the compiler preserves functional behavior. In fact, for any number of iterations of this loop of recursive self-improvement, correctness of the final compiler follows by mathematical induction (turning guarantees of individual steps into guarantees of full paths from start to later steps). The proof is a composition of the original correctness theorem with a number of specializations of itself to successive versions of its own source code.
This process is a classic example of twisty recursive structure in computer science that somehow just works. All the same, it has relatively rarely been realized in practice. The best-known verified compiler today is CompCert, whose source language (C) is different from the compiler’s own implementation language (the native language of Rocq), so for relatively unsubtle reasons, CompCert can’t compile itself. My own project Fiat Cryptography involves a compiler implemented in Rocq and compiling a subset of Rocq’s native language, but that subset is not nearly expressive enough for the compiler to be written in it, leading to the same obstacle to recursive self-improvement. Probably the highest-profile example these days that does achieve verified self-compilation is CakeML.
Now let’s try to think of this recursive self-improvement loop as more like what we see going on, for the time being mostly driven by human engineering effort, to optimize tech stacks for AI training and inference. An ideal challenge problem for reasoning about such work being done safely would include the engineering complexities that make such systems hard to build and reason about. I’ve identified two complexities that are largely absent from both of prior work on compiler verification and formal approaches toward AI alignment. I also think there’s a case to be made (though I can’t fit it into this single post) that we should expect both complexities to stay with us – they are well-justified as tools for producing increasingly capable intelligent systems.
So there’s my framing of a challenge program that I think can be explored fruitfully in near-term research: recursive self-improvement through a verified compiler that searches large spaces of alternative programs in parallel. The challenges and solutions that arise in such an effort can inform broader thinking on AI safety. Let me anticipate a few of them by bringing up issues that show up repeatedly in more-conventional formal verification, typically tied to security concerns.
Gotcha #1: Nondeterminism
Nondeterminism seems like an innocuous property of a program: it may not always give the same answer, and we give it free rein to choose a different answer each time we call it on the same inputs. The trouble is when we imagine our worst enemy getting to break the ties and choose between possible outcomes. In an AI-safety context, the decisions might be made by an inscrutable superintelligence, but similar risks are already broadly studied in computer security, where the so-called adversary is conventionally a person.
A classic security property that interacts poorly with nondeterminism is secure information flow. Consider the following example of an employee accessing his company email inbox, in two different states of the world.
Everything in the email server that the employee is supposed to know about is the same between the two worlds. The only difference is in his boss’ email inbox. Say we are generous with nondeterminism in specifying the email server. When a user asks for a list of messages, it can be delivered in any order – say the order in which the records are stored on-disk, a handy performance optimization. However, if we allow complete freedom in this choice, we don’t prevent the email server from consulting other users’ secrets to choose the order. And complete freedom, even to choose devious orderings, is exactly what the naive, nondeterministic functional specification allows.
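To make the risk concrete, here is a minimal sketch in Python, with all names invented for illustration. A nondeterministic "list messages" specification only promises that the result is *some* permutation of the user's inbox, so both orderings below satisfy it – but one of them consults the boss's inbox to break ties, encoding a secret bit in the observable order.

```python
def list_messages_safe(inbox):
    # Ordering depends only on the requesting user's own data.
    return sorted(inbox)

def list_messages_leaky(inbox, boss_inbox):
    # Still "some permutation of inbox" -- the spec is satisfied! --
    # but the direction of the sort encodes one bit of the boss's secrets.
    secret_bit = len(boss_inbox) % 2
    return sorted(inbox, reverse=bool(secret_bit))

# Two worlds that differ only in data the employee must not learn:
employee_inbox = ["expense report", "lunch plans", "meeting notes"]
world_a_boss = ["layoffs.txt"]                 # 1 message
world_b_boss = ["layoffs.txt", "merger.txt"]   # 2 messages

# Noninterference check: does the observable answer vary with the secret?
print(list_messages_safe(employee_inbox)
      == list_messages_safe(employee_inbox))              # True: no leak
print(list_messages_leaky(employee_inbox, world_a_boss)
      == list_messages_leaky(employee_inbox, world_b_boss))  # False: leak!
```

Both functions meet the naive permutation spec; only a stronger specification, demanding that the answer be insensitive to other users' data, rules out the second one.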
It actually isn’t that hard to find similar risks in our compiler example. Say the compiler is actually structured as a server that compiles programs for many users. Users consider their own source code confidential, but the obvious nondeterministic specification for a compiler allows compilers to resolve incidental choices in ways that reveal code across users.
The message is that nondeterminism and secure information flow don’t mix well, and perhaps that message is sufficient to help us get an email server’s specification right. A compiler is trickier. The problem is that its natural specification is inherently nondeterministic, to give compiler-writers freedom to invent new optimizations – as we surely want our recursive self-improvement loop to do. A given source program has many possible target programs that meet the specification (each computing the same result in different ways). We have to generalize the advice “avoid nondeterminism” to the fancier setting of a compiler. I’ll sketch one possible solution below, after introducing another challenge.
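Here is a toy illustration (again an invented example, with testing standing in for proof) of why a compiler's natural specification is a relation rather than a function: for one source program, many target programs count as correct, and the compiler's freedom to pick among them is exactly the nondeterminism we were just warned about.

```python
# Source program: double the input. The specification only demands that
# the target agree with the source on all inputs.
def source(x):
    return x * 2

# Three candidate "compiler outputs," e.g. produced by different
# optimization choices during a search over alternative programs.
candidates = [
    lambda x: x * 2,   # naive translation
    lambda x: x + x,   # strength reduction
    lambda x: x << 1,  # shift trick
]

def meets_spec(target, inputs=range(-100, 100)):
    # Stand-in for a correctness proof: check agreement on sample inputs.
    return all(target(x) == source(x) for x in inputs)

print([meets_spec(t) for t in candidates])  # [True, True, True]
```

All three candidates satisfy the specification, so "the compiled program" is underdetermined – and how the compiler resolves that freedom is where secrets can sneak out.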
Gotcha #2: Side Channels
Imagine we lock down our email server’s specification to enforce deterministic answers, but even that move isn’t enough to secure the system.
The employee gets the same answer in both scenarios, but now the amount of time it takes for the answer to arrive reveals something about his boss’ inbox. The author of the specification made a classic mistake: forgetting to take into account an important nonfunctional dimension (“nonfunctional” doesn’t mean that the system is broken but rather that we are considering aspects beyond giving correct answers!), in this case running time.
The good news is that just thinking of running time as an important dimension allows us to lock down this bad information flow, with a relatively general specification pattern following the standard technique of noninterference. The bad news is that merely measuring time until the full answer is ready is inadequate to consider all real-world timing risks. Especially when we add in those risks, we find ourselves in the domain of side channels, surprising and indirect ways in which systems leak information. Consider the following extension of our running example.
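A minimal sketch of the timing risk, using an invented example: an early-exit search over a mailbox takes time that depends on secret data, while a "constant-work" version always touches every record. To keep the example deterministic, we count loop iterations as a stand-in for wall-clock time.

```python
def search_leaky(mailbox, query):
    steps = 0
    for msg in mailbox:
        steps += 1
        if msg == query:        # early exit: time reveals position of hit
            return True, steps
    return False, steps

def search_constant(mailbox, query):
    steps, found = 0, False
    for msg in mailbox:
        steps += 1
        found |= (msg == query)  # always scan everything
    return found, steps

world_a = ["a", "secret", "c", "d"]  # secret near the front
world_b = ["a", "c", "d", "secret"]  # secret near the back

# The leaky version's "time" differs across worlds; the constant one's doesn't.
print(search_leaky(world_a, "secret")[1],
      search_leaky(world_b, "secret")[1])      # 2 4
print(search_constant(world_a, "secret")[1],
      search_constant(world_b, "secret")[1])   # 4 4
```

The noninterference pattern mentioned above amounts to proving that, like `search_constant`, the observable cost of a run is a function only of public data.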
Now the two scenarios generate the same answer in the same total time – but the sequence of internal actions (writing to RAM) happens on a different schedule in each world. Why do we worry about such a fine-grained difference? Imagine the email server runs on the same cloud service as a spy’s server, and the spy is periodically checking what has changed in memory. I won’t go into more detail about feasible attacks in this post, but check out famous name-brand vulnerabilities Spectre and Meltdown for more. The point is, since the cloud service may share RAM across tenant services, there is the potential for information leakage through that shared resource, including via timing.
There has been plenty of recent research on this kind of problem, formally verifying absence of undesirable side channels. One general class of solutions riffs on the idea of avoiding nondeterminism, without going all the way. For instance, one of my recent projects shows how to prove that compilers avoid introducing timing side channels, by requiring that every individual compiler run chooses one deterministic behavior, influenced just by parts of program state that need not be kept secret. We have applied a similar approach to proving that hardware keeps servers from learning each others’ secrets.
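The principle behind those proofs can be sketched in a few lines (this is an invented toy, not the cited projects' actual code): the compiler may make a choice, but the choice must be a deterministic function of only the *public* part of its input, so that varying the secret part can never vary observable behavior.

```python
def choose_schedule(public_flags, secret_source):
    # Allowed: the choice is keyed off public data only.
    # (secret_source is deliberately ignored.)
    return "unrolled" if public_flags.get("opt_level", 0) >= 2 else "plain"

def noninterferent(choice_fn, public, secret1, secret2):
    # The defining check: varying the secret never varies the choice.
    return choice_fn(public, secret1) == choice_fn(public, secret2)

public = {"opt_level": 2}
print(noninterferent(choose_schedule, public, "source A", "source B"))  # True
```

A verified compiler in this style carries a proof that this check holds for *all* pairs of secrets, not just the two we happened to test.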
This discussion suggests a powerful design pattern: define a space of allowed behaviors (or behavioral functions), where any given function is clearly “locked down” in a safe way, and allow a system to choose exactly one of them. For instance, we may somehow know that every function respects human rights, even though different functions design future factories in very different ways. The reusable trick is not to go to either extreme – a “hand-wavily” nondeterministic specification or a rigidly deterministic one – but instead to construct carefully a menu of acceptable deterministic functions and let the system choose one. If we designed the menu properly, then the system “gets to surprise us exactly once,” after which its behavior is confined to a particular space we identified as safe. This post is mostly focused on pointing out a problem worth researching, though the idea from this paragraph is the closest thing I have to a reusable solution principle for alignment to propose for now; the problem seems to have a good chance of producing more, if tackled properly.
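The "menu" pattern can be sketched as follows (an invented example with toy behaviors): we enumerate a small menu of deterministic functions, each individually vetted as safe, and the system commits to exactly one at construction time. From then on, its behavior is a fixed function we already understand.

```python
# The vetted menu: every entry returns SOME correct sort of its input,
# differing only in tie-breaking we have deemed harmless.
MENU = {
    "ascending": lambda xs: sorted(xs),
    "descending": lambda xs: sorted(xs, reverse=True),
}

class CommittedSystem:
    def __init__(self, choice):
        if choice not in MENU:            # the one-time "surprise" is bounded
            raise ValueError("choice outside the vetted menu")
        self._behavior = MENU[choice]     # commit: deterministic from here on

    def run(self, xs):
        return self._behavior(xs)

system = CommittedSystem("descending")    # the system surprises us once...
print(system.run([3, 1, 2]))              # ...then is deterministic: [3, 2, 1]
print(system.run([3, 1, 2]) == system.run([3, 1, 2]))  # True
```

The safety argument splits cleanly: prove once that every menu entry is safe, then prove that the system can only ever run one fixed entry.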
Conclusion
The concrete challenge problem is to build a compiler that improves itself through verified self-compilation while searching large spaces of alternative programs in parallel.
I’m trying to make the case that this problem is at a “just-right” level of difficulty and applicability to core AI-safety problems. Many of the foundational questions have already been explored extensively in formal verification, especially around security against human adversaries. We have both theoretical techniques and significant implementations, and I will sketch some of the most-interesting elements in upcoming posts.
The very next post, though, will consider a sci-fi scenario that adds the twist of agents working together to build better systems, even as they don’t fully trust each other, in contrast to this post’s idea of one intelligence expending a lot of compute to improve itself in isolation. We’ll see how formal verification can still help out.