This post is crossposted from my Substack,Structure and Guarantees, where I explore how formal verification and related ideas might scale to more complex intelligent systems. Here I explain how formal verification can rule out many ways that a flexible system (perhaps some day including one with strong artificial intelligence) could be subverted by adversaries, even without needing to anticipate specific methods of subversion.
Formal verification has the potential to anticipate the futures of complex software systems and block their most problematic potential behaviors, through mathematical proof. It may be the most potent protection against misbehavior by superintelligent systems, whose eventual plans we would not be able to foresee. However, we are only protected if we manage to write out formal specifications that actually block the bad behaviors. I previously wrote about two main techniques to simplify the job of writing those specifications, namely end-to-end formal verification, to catch mistakes in the connections between components; and careful encapsulation of most components away from the complex human and natural world. Now I would like to write about an underappreciated benefit of formal methods for security, already relevant to conventional systems but perhaps even more important for artificial intelligence. This concept has been known in the formal-methods community (e.g. see discussion around the seL4 verified operating system) but still remains not widely enough understood.
The Cybersecurity Arms Race
Cybersecurity is often described as fundamentally favoring attackers over defenders. The reason is that an attacker often only needs to find one vulnerability in a system to subvert it, while a defender (or author of a system) needs to anticipate all possible attacks and build protections against all. Not noticing just one potential attack vector, or simply delaying too long in patching a known hole, can allow an attacker to just as effectively do damage as if the system were full of obvious security problems.
One of the worst kinds of security vulnerabilities is often called remote code execution: a way that an attacker with network access to a system can trick the system into running new program code provided by the attacker. Regardless of what we thought a system should do, an attacker modifying its code can change the plan rather arbitrarily. This category is a special case of the more-general phenomenon of subverting a system. I appreciate the evocative synonym perversion from A Fire Upon the Deep, applied to rogue artificial intelligences.
There are a bewildering variety of different remote code execution attacks, and engineers need to make sure to block all of them, even as new ones arise regularly. One of the classics is a buffer-overflow attack, where a segment of memory is reserved to hold user input, but a programming error causes the user input to run off the end of the region and into adjacent regions. If an adjacent region was used to hold machine code to be executed, we have given the attacker a way to inject his own code to execute.
The phenomenon of code-injection attacks is a broad one, and often different variants have common names that obscure their similarity. When user input is allowed to provide code in the SQL database language, we call it an SQL injection. When a similar situation happens with the JavaScript programming language, most typically in a web browser, we call it cross-site scripting. At first blush, it looks like we as engineers of would-be secure systems need to conduct careful audits of all programming languages used within, being sure none of them provide code-injection vulnerabilities.
It gets worse, though. Just enumerating programming languages isn’t good enough, because attackers find new ways to implement injection-like functionality within established languages. The first high-profile buffer-overflow attacks involved smashing the stack to overwrite certain places used to store addresses of code to execute. As increasingly effective mitigations against stack-smashing (like address-space layout randomization, which makes it hard to guess which code addresses contain which code, thus making it hard to guess the effect of an attack) rolled out, attackers developed increasingly sophisticated kinds of weird machines, a phrase that evokes finding hidden functionalities within seemingly innocuous programs. One idea called return-oriented programming creates seemingly impossible execution sequences within the legitimate code of a program by stitching together segments of code in surprising ways, producing emergent behaviors out of pieces that are innocuous individually.
And here we seem stuck as diligent defenders. It’s bad enough if we need to realize every place within a complex system that contains a programming language that an attacker could use to inject and run his own code. We also need to imagine all sorts of devious ways such functionality can be built on top of particular languages, including ways that haven’t been invented at the time code is deployed.
Future AI systems, especially superintelligent ones, amplify these worries. To start with, the baseline capabilities of these systems to influence this world may be so high that it becomes even more catastrophic to allow an attacker to subvert one of them and turn it against its owners. Then, as highlighted by recent use of the Mythos AI model to find new security vulnerabilities in widely used software, highly capable AI may mean an end to the method of security through obscurity that was already deprecated. Superintelligent systems may be able to find longstanding vulnerabilities that human hackers never caught onto.
So, where a novice engineer may worry that it’s too difficult to anticipate all attacks against an important system, but where one learning of formal methods may get excited about the potential to rule out those attacks mathematically, we also see how a more-informed engineer may get to worrying that it is impractical even to characterize all of the attacks in a formal specification, let alone prove that particular mitigations address them.
To the Rescue: Functional Correctness Implies Subversion-Resistance
Luckily, at the next-higher level of enlightenment, we see that using formal verification to block all of the above attacks is rather easy, if we manage to prove much of anything at all.
Think of a formal specification as laying out which destinations are legal within the execution space of a system, based on starting points.
The trouble with merely testing a system is that we might run many test cases, find that every one arrives at an acceptable destination, and yet have it be the case that the system exhibits catastrophic behavior in an infinite variety of scenarios that we did not think to test. Formal verification can demonstrate acceptable behavior in all possible executions.
Now assume that we have already invested in proof of functional correctness for a program, which means roughly that we show it truly carries out the intentions of its creator. For instance, a program that is meant to sort lists of integers in increasing order truly does output the sorted version of each input. Trying to define formally exactly which specifications count as functional correctness can be a losing game, but we do expect that such specifications are precise-enough that many small program changes break them. Let’s keep from this intuition an even laxer requirement: imagine any specification that accepts some system behaviors but not all. Let BAD be some behavior that is not allowed but can be expressed in the underlying programming language. We will now pull a trick reminiscent of our prior discussion of undecidable program properties and Rice’s theorem.
Imagine an attacker finds a way to inject code into the system and run that code, giving great freedom in how to perturb the system’s behavior. The attacker could use that freedom simply to upload the program for BAD, producing behavior that violates the specification, and so formal verification of the program must fail, despite not having enumerated any requirements about code injection. In this next image, I’m using the metaphor of an injection attack opening a gate to a world of great flexibility for the attacker, who is allowed to upload and run arbitrary programs in an expressive language, such that it is essentially assured that, once the gate opens, the attacker can find some way (probably infinitely many) to do serious damage.
For example, there may be an Internet service meant to add positive numbers on demand. Its specification may explain exactly which sums should be produced, in response to which requests. Or its specification might just declare that the service only outputs positive numbers. Both are sufficient to guarantee that no injection attack allows installing arbitrary code. If that vulnerability did exist, then we could make the system return zero, which violates both specifications.
In fact, the authors of specifications don’t need to think explicitly about code injections at all. It is only necessary to explain positively what a program is meant to do, rather than enumerate negatively what it must be prevented from doing. We saw an example in end-to-end verification of a simple IoT system, whose specification simply explained how a network protocol was meant to work, nonetheless ruling out, for instance, that the system can be taken over by a well-crafted packet and tricked into joining a botnet.
Conclusion
This observation certainly doesn’t address all of the challenges of AI alignment. We should expect superintelligence to be given such wide latitude in making plans and transforming the world as to defy concise specification. However, it seems reasonable to assume that any useful specification does rule out some behaviors, from which it still follows that attackers can’t gain access to run arbitrary code. Therefore, there is a relatively clear path toward using formal verification to develop highly empowered AI systems that will resist being subverted.
In fact, the principle generalizes to any cases involving top-level objectives that must be realized by lower-level implementation details that we need not spell out in formal specifications. For example:
If an attacker finds a clever way to overwrite stored state of a system, then presumably either that state wasn’t important, or the attacker has a simple lever to perturb behavior away from what is specified.
If an attacker is able to interfere with the flow of information from sensors into decision-making, say by taking advantage of a subtle weakness in digital signal processing, then either those sensors weren’t important, or arbitrarily distorting their values can easily lead to violating the specification.
Conversely, if an attacker can mess with the logic controlling how a system takes action in the world, then it should be even easier to generate specification violations. Imagine an example of an online-shopping agent with a bug that repeats the last order over and over, instead of responding to new requests. Such behavior would violate specifications on the granularity spectrum from “only place orders that the user explicitly OKs” to “don’t place an order that would exceed your credit-card spending limit” (because the repeated order could happen to be a gigantic one, even as the later legitimate orders are for cheap items; remember, with formal verification, we reason in advance about all possible scenarios).
Even examples of poisoning the training data of systems that learn are covered, so long as the learned components are implementation details not mentioned directly in specifications. Either the learned component is irrelevant to the top-level mission, or arbitrary data control allows breaking the specification without too much effort.
Of course, much of the current deep learning-maximalist approach is built around assuming that learned systems are central to meeting specifications, despite the absence of ways to prove in advance that learned components don’t surprise us, so my reassurance here in the last bullet only applies to examples where learned components are nicely modularized and only raise worries about implementation details like where the training data come from. We should remember that sometimes we realize we overlooked additional dimensions of top-level specification, as in leakage of secrets through timing that we discussed previously. However, there is great power in only needing to pin down the top-level requirements of a system and not worry about vulnerabilities missed in low-level implementation details.
This post is crossposted from my Substack, Structure and Guarantees, where I explore how formal verification and related ideas might scale to more complex intelligent systems. Here I explain how formal verification can rule out many ways that a flexible system (perhaps some day including one with strong artificial intelligence) could be subverted by adversaries, even without needing to anticipate specific methods of subversion.
Formal verification has the potential to anticipate the futures of complex software systems and block their most problematic potential behaviors, through mathematical proof. It may be the most potent protection against misbehavior by superintelligent systems, whose eventual plans we would not be able to foresee. However, we are only protected if we manage to write out formal specifications that actually block the bad behaviors. I previously wrote about two main techniques to simplify the job of writing those specifications, namely end-to-end formal verification, to catch mistakes in the connections between components; and careful encapsulation of most components away from the complex human and natural world. Now I would like to write about an underappreciated benefit of formal methods for security, already relevant to conventional systems but perhaps even more important for artificial intelligence. This concept has been known in the formal-methods community (e.g. see discussion around the seL4 verified operating system) but still remains not widely enough understood.
The Cybersecurity Arms Race
Cybersecurity is often described as fundamentally favoring attackers over defenders. The reason is that an attacker often only needs to find one vulnerability in a system to subvert it, while a defender (or author of a system) needs to anticipate all possible attacks and build protections against all. Not noticing just one potential attack vector, or simply delaying too long in patching a known hole, can allow an attacker to just as effectively do damage as if the system were full of obvious security problems.
One of the worst kinds of security vulnerabilities is often called remote code execution: a way that an attacker with network access to a system can trick the system into running new program code provided by the attacker. Regardless of what we thought a system should do, an attacker modifying its code can change the plan rather arbitrarily. This category is a special case of the more-general phenomenon of subverting a system. I appreciate the evocative synonym perversion from A Fire Upon the Deep, applied to rogue artificial intelligences.
There are a bewildering variety of different remote code execution attacks, and engineers need to make sure to block all of them, even as new ones arise regularly. One of the classics is a buffer-overflow attack, where a segment of memory is reserved to hold user input, but a programming error causes the user input to run off the end of the region and into adjacent regions. If an adjacent region was used to hold machine code to be executed, we have given the attacker a way to inject his own code to execute.
The phenomenon of code-injection attacks is a broad one, and often different variants have common names that obscure their similarity. When user input is allowed to provide code in the SQL database language, we call it an SQL injection. When a similar situation happens with the JavaScript programming language, most typically in a web browser, we call it cross-site scripting. At first blush, it looks like we as engineers of would-be secure systems need to conduct careful audits of all programming languages used within, being sure none of them provide code-injection vulnerabilities.
It gets worse, though. Just enumerating programming languages isn’t good enough, because attackers find new ways to implement injection-like functionality within established languages. The first high-profile buffer-overflow attacks involved smashing the stack to overwrite certain places used to store addresses of code to execute. As increasingly effective mitigations against stack-smashing (like address-space layout randomization, which makes it hard to guess which code addresses contain which code, thus making it hard to guess the effect of an attack) rolled out, attackers developed increasingly sophisticated kinds of weird machines, a phrase that evokes finding hidden functionalities within seemingly innocuous programs. One idea called return-oriented programming creates seemingly impossible execution sequences within the legitimate code of a program by stitching together segments of code in surprising ways, producing emergent behaviors out of pieces that are innocuous individually.
And here we seem stuck as diligent defenders. It’s bad enough if we need to realize every place within a complex system that contains a programming language that an attacker could use to inject and run his own code. We also need to imagine all sorts of devious ways such functionality can be built on top of particular languages, including ways that haven’t been invented at the time code is deployed.
Future AI systems, especially superintelligent ones, amplify these worries. To start with, the baseline capabilities of these systems to influence this world may be so high that it becomes even more catastrophic to allow an attacker to subvert one of them and turn it against its owners. Then, as highlighted by recent use of the Mythos AI model to find new security vulnerabilities in widely used software, highly capable AI may mean an end to the method of security through obscurity that was already deprecated. Superintelligent systems may be able to find longstanding vulnerabilities that human hackers never caught onto.
So, where a novice engineer may worry that it’s too difficult to anticipate all attacks against an important system, but where one learning of formal methods may get excited about the potential to rule out those attacks mathematically, we also see how a more-informed engineer may get to worrying that it is impractical even to characterize all of the attacks in a formal specification, let alone prove that particular mitigations address them.
To the Rescue: Functional Correctness Implies Subversion-Resistance
Luckily, at the next-higher level of enlightenment, we see that using formal verification to block all of the above attacks is rather easy, if we manage to prove much of anything at all.
Think of a formal specification as laying out which destinations are legal within the execution space of a system, based on starting points.
The trouble with merely testing a system is that we might run many test cases, find that every one arrives at an acceptable destination, and yet have it be the case that the system exhibits catastrophic behavior in an infinite variety of scenarios that we did not think to test. Formal verification can demonstrate acceptable behavior in all possible executions.
Now assume that we have already invested in proof of functional correctness for a program, which means roughly that we show it truly carries out the intentions of its creator. For instance, a program that is meant to sort lists of integers in increasing order truly does output the sorted version of each input. Trying to define formally exactly which specifications count as functional correctness can be a losing game, but we do expect that such specifications are precise-enough that many small program changes break them. Let’s keep from this intuition an even laxer requirement: imagine any specification that accepts some system behaviors but not all. Let
BADbe some behavior that is not allowed but can be expressed in the underlying programming language. We will now pull a trick reminiscent of our prior discussion of undecidable program properties and Rice’s theorem.Imagine an attacker finds a way to inject code into the system and run that code, giving great freedom in how to perturb the system’s behavior. The attacker could use that freedom simply to upload the program for
BAD, producing behavior that violates the specification, and so formal verification of the program must fail, despite not having enumerated any requirements about code injection. In this next image, I’m using the metaphor of an injection attack opening a gate to a world of great flexibility for the attacker, who is allowed to upload and run arbitrary programs in an expressive language, such that it is essentially assured that, once the gate opens, the attacker can find some way (probably infinitely many) to do serious damage.For example, there may be an Internet service meant to add positive numbers on demand. Its specification may explain exactly which sums should be produced, in response to which requests. Or its specification might just declare that the service only outputs positive numbers. Both are sufficient to guarantee that no injection attack allows installing arbitrary code. If that vulnerability did exist, then we could make the system return zero, which violates both specifications.
In fact, the authors of specifications don’t need to think explicitly about code injections at all. It is only necessary to explain positively what a program is meant to do, rather than enumerate negatively what it must be prevented from doing. We saw an example in end-to-end verification of a simple IoT system, whose specification simply explained how a network protocol was meant to work, nonetheless ruling out, for instance, that the system can be taken over by a well-crafted packet and tricked into joining a botnet.
Conclusion
This observation certainly doesn’t address all of the challenges of AI alignment. We should expect superintelligence to be given such wide latitude in making plans and transforming the world as to defy concise specification. However, it seems reasonable to assume that any useful specification does rule out some behaviors, from which it still follows that attackers can’t gain access to run arbitrary code. Therefore, there is a relatively clear path toward using formal verification to develop highly empowered AI systems that will resist being subverted.
In fact, the principle generalizes to any cases involving top-level objectives that must be realized by lower-level implementation details that we need not spell out in formal specifications. For example:
Of course, much of the current deep learning-maximalist approach is built around assuming that learned systems are central to meeting specifications, despite the absence of ways to prove in advance that learned components don’t surprise us, so my reassurance here in the last bullet only applies to examples where learned components are nicely modularized and only raise worries about implementation details like where the training data come from. We should remember that sometimes we realize we overlooked additional dimensions of top-level specification, as in leakage of secrets through timing that we discussed previously. However, there is great power in only needing to pin down the top-level requirements of a system and not worry about vulnerabilities missed in low-level implementation details.
The next posts will cover two other ways we should expect specification-writing to get simpler as more pockets of the economy become dominated by AI agents interacting with each other. We’ll start by reconsidering user interfaces in that world where users increasingly don’t belong to the same species anymore.