This post is crossposted from my Substack, Structure and Guarantees, where I explore how formal verification and related ideas might scale to more complex intelligent systems. Here I present what may seem like an ironic reversal of the idea of keeping an AI in a box to limit the damage it can do. Instead, I argue for carefully redesigning parts of the world to be more legible to automated reasoning, encapsulating intelligent systems away from many complexities of our evolved environment. The goal is not merely to make AI smarter but to create regions where reliable reasoning becomes easier. Alignment challenges are concentrated into well-defined interfaces with the broader world, rather than needing to be confronted throughout every layer of a system.
AI alignment considers the hard problem of how to give clear instructions to what can strike us as alien minds. As these intelligent systems grow in complexity, we may worry that it becomes correspondingly complicated to give them clear-enough instructions to avoid bad outcomes. In contrast, the last post reviewed an underappreciated technique from formal verification: end-to-end proof of layered computer systems, verifying many different parts as a whole, in a way that helps catch any misunderstandings between the parts about how they interact. Such a layered system can be seen as exporting some top-level interface and depending on some bottom-level foundation interface – and the great payoff of end-to-end verification is that it (with caveats; see that last post) roots out mistakes in spelling out requirements for all interior layers, leaving only the top and bottom interfaces as opportunities for faulty specification. This capability opens the possibility of scaling up the complexity of automated systems without paying a tax in the difficulty of avoiding bugs, sometimes even finding that bug-avoidance becomes easier as we add layers.
Engineers quickly realize a problem: true certainty about the behavior of systems is impossible. Our example from last time of a network-connected lightbulb controller presumes the correct operation of a physical actuator for the lightbulb. If that actuator instead starts a house fire when triggered, we’re in trouble. If a saboteur takes a hair dryer to the CPU and flips critical bits, all bets are off. The assumption that a hardware circuit executes in an orderly way can prove misleading, invalidating all of the theorems we invested in.
Yet it remains overwhelmingly valuable to carry out formal verification of complex digital systems. Why? We’ll review the effectiveness of abstractions that present clear interfaces on top of messy physical reality. Then we’ll generalize to principles for responsibly bringing artificial intelligence into more aspects of our society. It’s natural to think of adding AI into the world gradually, replacing units of functionality previously provided by humans. However, our evolved world is full of complexities that are expensive for AI to grapple with. The incremental path misses an opportunity to create economic enclaves purposely protected from those evolved complexities, allowing reasoning that is both cheaper and more effective.
The Digital Abstraction
I should confess here that, despite working in an academic department whose name starts with “electrical engineering,” I’ve never taken a class in electrical engineering. I couldn’t explain how flows of electrons provide the behaviors we’re used to in electronic circuits. Yet somehow I’ve been able to develop hardware and software components that function well. What’s the trick? All of us depend critically on the digital abstraction.
A naive electronic component can involve a wide range of voltages in a particular wire. Such an untamed wire can be used as analog storage, meaning that it represents a real number, but computing with real numbers, embodied in voltage levels, is a tricky business. We can imagine that a relatively “natural” electrical phenomenon would tend to spread its voltages over a wide range.
So then how have we come to rely on computers that make discrete decisions? The wizards of electrical engineering found a way to bias electronic components so that their voltage levels bunch up at two extremes. We can then draw a line down the middle of that spectrum and say every level above the line is a one and every level below a zero. Now we can construct the basic building blocks of digital computing: logic gates like “AND,” which outputs one exactly when each of its two inputs is one.
Crucially, this digital abstraction lets us forget about continuous voltage and think of the “AND” gate as working solely with zeros and ones. Yes, such components do still glitch occasionally, revealing their analog reality. For instance, a cosmic ray may fly by and disturb the wiring. The point is that extensive engineering has pushed the risk of such disturbances low enough that we can get far while ignoring them.
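To make the thresholding idea concrete, here is a minimal Python sketch. The 0-to-1-volt range, the 0.5-volt midpoint, and the names to_bit and and_gate are all assumptions chosen for illustration, not a description of any real circuit family.

```python
# Hypothetical midpoint of an assumed 0-to-1-volt range; real logic
# families use different levels and leave a forbidden zone in between.
THRESHOLD_VOLTS = 0.5


def to_bit(voltage: float) -> int:
    """Read an analog voltage as a digital zero or one."""
    return 1 if voltage > THRESHOLD_VOLTS else 0


def and_gate(a: int, b: int) -> int:
    """Output one exactly when both inputs are one."""
    return a & b


# Noisy analog readings that bunch up near the two extremes.
noisy_inputs = [(0.93, 0.88), (0.97, 0.04), (0.11, 0.91), (0.02, 0.07)]

# Once thresholded, the gate never has to look at the voltages again.
results = [and_gate(to_bit(x), to_bit(y)) for x, y in noisy_inputs]
print(results)
```

Everything downstream of to_bit works only with zeros and ones, which is the whole point: the noise in the readings never leaks into the logic as long as it stays on the right side of the line.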
With this abstraction in place, we can build other abstractions. We can combine many primitive logic gates into more complex circuits. For example, we can build an addition circuit out of gates for “AND” and other simpler functionalities. Now this addition circuit can, in turn, be seen as a building block for higher-level functionality, ignoring not just how it was built out of simpler gates but also ignoring the analog dynamics of voltage within them.
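As a sketch of that layering, the Python below builds a one-bit full adder from “AND,” “OR,” and “XOR” gates, then chains full adders into a multi-bit ripple-carry adder. The function names and the 8-bit default width are illustrative assumptions; real hardware would be written in a hardware-description language rather than Python.

```python
from typing import List, Tuple


def and_gate(a: int, b: int) -> int:
    return a & b


def or_gate(a: int, b: int) -> int:
    return a | b


def xor_gate(a: int, b: int) -> int:
    return a ^ b


def full_adder(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """Add three bits using only gates; return (sum bit, carry out)."""
    partial = xor_gate(a, b)
    sum_bit = xor_gate(partial, carry_in)
    carry_out = or_gate(and_gate(a, b), and_gate(partial, carry_in))
    return sum_bit, carry_out


def ripple_add(x_bits: List[int], y_bits: List[int]) -> List[int]:
    """Add two equal-length bit lists, least significant bit first."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)  # the final carry becomes the top bit
    return out


def add(x: int, y: int, width: int = 8) -> int:
    """Higher-level interface: callers see integers, not gates."""
    def to_bits(n: int) -> List[int]:
        return [(n >> i) & 1 for i in range(width)]

    bits = ripple_add(to_bits(x), to_bits(y))
    return sum(bit << i for i, bit in enumerate(bits))


print(add(5, 3))
```

A caller of add need not know that full_adder exists, just as full_adder need not know anything about voltages.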
Our end-to-end-verified stack example presumes the digital abstraction at its bottom level. More specifically, it bottoms out in formal semantics of a hardware-description language (a mathematical characterization of what results any circuit could produce), and we assume some accurate way of executing circuits encoded in that language. The digital abstraction gives us that way. An immensely complicated semiconductor-manufacturing supply chain is able to turn circuit descriptions into physical chips, well enough that we can (usually) forget its details in designing digital systems.
Other Greatest Hits of Abstraction
The importance of abstraction in planning for a future of artificial intelligence has been emphasized in other sources, including The Singularity is Near. Let me just give a few more standard examples, before moving on to suggest a new kind of abstraction.
Natural evolutionary processes have given us many abstractions. Somehow lower-level physics allows subatomic particles to come together into atoms, which become largely reliable building blocks for chemistry. Then those higher-level molecules become building blocks for cells in biology. Cells can be aggregated into tissues and organisms, with cancer as the consequence of misbehavior when cells act more individualistically. We must plan for (prevent and treat) cancer, but, like a cosmic ray disturbing part of a silicon chip, it happens infrequently enough that abstractions like tissue and organism remain useful.
There are also higher-level abstractions that we design ourselves. One of the most basic ones is to have government maintain a monopoly on (legitimate) violence, so that competition shifts to economic activity, on top of strong foundations of property rights. On top of that foundation, a corporation can have legal personhood and enter into contracts with individuals or other corporations. Several corporations can then form a consortium that lobbies for their collective interest. Or individuals as citizens can aggregate into a nation, and then those nations can form coalitions. Geopolitical strategists can get pretty far thinking of coalitions or nations as atomic agents, even as we know it is often necessary to, say, understand a nation by understanding the will of blocs of its voters.
It is probably not controversial to suggest that taking full advantage of artificial intelligence will depend on developing new abstractions, but I’m going to suggest a strategy that tinkers with different parts of our world than most would focus on.
Bubbles of Legibility and Their Interfaces
One major point I worked up to in previous installments was the payoff from reconfiguring the world for greater legibility to intelligence technologies with good properties. That is, some hard problems of AI come from assuming that the basic structure of the world stays the same, and we plug an AI into a spot traditionally occupied by a human worker. If humans communicate with the rest of the world in natural language, we fall into assuming that AIs must communicate in natural language, too. Yet switching to other modes of communication dramatically simplifies processing.
One goal for the present post is to flesh out more of the strategy for reconfiguring the world. I also want to frame that process as introducing an important new kind of abstraction.
Here is the three-step recipe for building a region of legibility.
1. Identify part of the economy where all decision-making can be artificial, minimizing roles for intelligence that evolved rather than being designed deliberately.
2. Codesign that region with the agents that occupy it, optimizing for lowest cost of understanding by those agents.
3. Carefully create an interface with the rest of the world, a promise of the region of legibility about what service it provides.
One important kind of region of legibility is under single ownership. As a canonical example, consider an autonomous factory.
For simplicity, say the factory is a three-dimensional block of space enclosed in walls. Its mission, what led us to create it, is production of particular physical goods. The nature of those goods can be formalized mathematically, in theory allowing end-to-end proof that the factory delivers on its mission. It won’t be easy to get the specification right, as we must characterize physics sufficiently well. Moreover, capturing the nature of the desired product is insufficient, as we must also capture safety properties, e.g. avoidance of toxic emissions seeping through the factory walls. However, following the end-to-end verification approach, we avoid needing to understand how work is divided up and carried out within the factory. The following common AI challenges can be sidestepped entirely.
As the factory is fully autonomous, natural language is off the table and need not be processed.
There is also no need to worry about safe cooperation with humans on an assembly line.
After the factory is constructed, there may no longer be any need to engage with vision or other conventional senses. The factory is laid out so that movement takes place among well-defined paths with easily detected markers of location.
The software running the factory may be written in languages incomprehensible to humans, relying on formal methods to guarantee that specifications continue to be followed, even in the presence of recursive self-improvement.
Overall, the protected environment of the factory has been created to maximize legibility to the AI agents inhabiting it, minimizing their cost to make good decisions.
It remains a hard problem to capture the rules we want the factory to follow, just as it was difficult to build today’s silicon supply chain. However, the payoff in cost reduction should be enormous. Moreover, we have available the established tricks of end-to-end verification. It can be easier to believe a formal verification of multiple autonomous factories than of one. If their outputs are integrated into single products whose specifications are simpler than those of the constituent components alone, then the trusted “exterior” interface of the mega-factory becomes simpler, and there are fewer opportunities for mistakes in formalizing it. The basic rules of factory safety may also remain largely the same across constituent factories. In such a setting, the mega-factory may even have autonomy to build new factories that follow the common rules. It may invent new decompositions of top-level deliverables into components, design and construct new factories to supply those components, and still remain compatible with its exterior interface.
The exterior interface is the place where humans bring requests and receive deliverables. We work hard to formalize it properly, in a way that retains flexibility for the intelligence within. There can in general be many interior interfaces, which no longer need to deal directly with complexities of human interaction. Rather, the consequences of human requirements are devolved down into other layers and their formal interfaces, and effort at simplifying the human-interface layer can pay off in simplifying other layers.
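A toy Python sketch may help show the shape of this arrangement. Every name and number here is hypothetical (the Deliverable fields, the 100 mm target, the 1 ppm emissions limit), and a runtime check on sample outputs is only a stand-in for the formal proof discussed above, which would cover all possible behaviors rather than the ones we happen to run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deliverable:
    # Hypothetical measurements of what leaves the factory walls.
    length_mm: float
    emissions_ppm: float


def meets_exterior_spec(item: Deliverable) -> bool:
    """The only contract humans write: product property plus safety."""
    right_size = abs(item.length_mm - 100.0) <= 0.5
    safe = item.emissions_ppm <= 1.0
    return right_size and safe


def factory_single_line() -> Deliverable:
    """One interior design: a single production line."""
    return Deliverable(length_mm=100.2, emissions_ppm=0.3)


def factory_two_stage() -> Deliverable:
    """A different interior design: two halves joined together."""
    half_a, half_b = 49.9, 50.0
    return Deliverable(length_mm=half_a + half_b, emissions_ppm=0.6)


# The exterior check never inspects how either factory is organized.
accepted = [
    meets_exterior_spec(make())
    for make in (factory_single_line, factory_two_stage)
]
print(accepted)
```

Both interiors are acceptable because the specification mentions only what crosses the boundary; either could be redesigned freely without anyone revisiting meets_exterior_spec.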
We will never capture all aspects of existing in the world as formal requirements, but the principle I’m proposing is to encapsulate as much of a system as possible away from those complexities, whether they come from human behavior or from imperfectly understood natural phenomena.
Competition and Ground Rules
What about larger systems than factories, where one of the chief complexities is interaction between parties that aren’t fully cooperative with each other? We can consider settings for AI agents both to compete and cooperate with each other, on top of a foundation that simplifies operation for all of them, just like the digital abstraction for computer systems or the rule of law for human civilization.
I wrote previously about how AI agents can trust code provided by others. The value of rigorous reasoning about code depends on having a reliable computing substrate, as the digital abstraction enables. However, it is also important to have rules governing which compute resources are controlled by which agents. Furthermore, as an agent reasons through the consequences of a piece of code, its job is greatly simplified by using streamlined artificial languages over natural ones, also preferring well-designed programming languages. Gaming out strategies for competition and cooperation is simplified by adoption of a streamlined sensory environment. The physical manifestation could be something like a special economic zone where only artificial intelligence is allowed.
There is a significant payoff from avoiding the need to reason about human decision-making, within the interior of the economic zone. Our brains developed through evolution, with its limited ability to escape local optima in fitness landscapes. We especially weren’t subjected to selection pressure for efficient understandability to algorithms. Some pressures have even been toward making understandability worse, as with signaling.
I should emphasize, though, that compatibility with human wishes remains central to the proper design of an autonomous economic zone. It’s just that those wishes are abstracted properly into the interface of the zone, kept separate from its interior. Requests flow in from humans to their agents, and then the agents act on those requests as efficiently as they can, ideally even in a provably compliant way, even as they may recursively improve themselves over time.
It is also important to acknowledge another kind of competition, which may lead agents or their coalitions to try to sabotage each other. These acts of aggression seem impossible to block entirely, but, still, I’ll argue that it makes sense to maintain orderly economic zones as far as possible. One analogy is with the increasingly global economy of today, with common norms of property rights and protection from violence. Various conflicts interfere with those rights and protections from time to time, and yet progress depends on relying on them most of the time. Otherwise, we wouldn’t see the investment in for-profit companies that is upstream of so many important breakthroughs. In the context of autonomous economic zones, an attack could force direct confrontation with, say, the laws of physics instead of nicer abstractions obeyed by reconfigured matter, in which case less-scrutable methods like deep learning become dominant again, but we can still do our best to avoid those situations.
Conclusion
The principle we’ve covered bears an interesting kind of mirror-image similarity to the idea of keeping an AI in a box. The motivation for that idea was to limit the ability of a rogue intelligence to do damage out in the world, and it has been argued extensively that even just a text connection with human users is enough for an AI to use trickery to “escape.” Instead, the abstraction barriers I’m arguing for protect AIs against the human world so that they can be more efficient and reliable. The idea is that the complexities we protect these AI modules against are the ones that force the use of opaque learned heuristics rather than traceable reasoning processes from first principles. Then mechanisms like formal verification can be used to anticipate all consequences of candidate design choices.
Even if our human concerns are pushed into the external interfaces of autonomous zones, we still need to characterize those concerns properly. The next three posts will go through three reasons that top-level specifications should be relatively tractable to write in this future scenario, starting from one observation around computer security that already applies even to today’s computer systems.