I've been hearing vague claims that automated theorem provers are able to, or will soon be able to prove things about complex software such as AIs.

Max Tegmark and Steve Omohundro have now published a paper, Provably Safe Systems : The Only Path To Controllable AGI, which convinces me that this is a plausible strategy to help with AI safety.


The basic steps:

  • Write trustworthy AIs that are capable of searching for proofs and verifying them.
  • Specify safety properties that we want all AIs to obey.
  • Train increasingly powerful Deep Learning systems.
  • Use Mechanistic Interpretability tools to translate knowledge from deep learning systems into more traditional software that's more transparent than neural nets.
  • Use the AIs from the first step to prove that the results have those safety properties.
  • Require that any hardware that's sufficiently powerful to run advanced AI be provably constrained to run only provably safe software.

Progress in automated theorem proving has been impressive. It's tantalizingly close to what we need to prove interesting constraints on a large class of systems.

What Systems Can We Hope to Prove Properties about?

The paper convinced me that automated proof search and verification are making important progress. My intuition still says that leading AIs are too complex to prove anything about. But I don't have a strong argument to back up that intuition. The topic is important enough that we ought to be pursuing some AI safety strategies that have a significant risk of being impossible. Throwing lots of compute at proving things could produce unintuitive results.

My main reaction after figuring out the basics of the paper's proposals was to decide that their strategy made it impossible to train new AIs on high end GPUs.

Tegmark and Omohundro admit that powerful neural networks seem too messy to prove much about. Yet they also say that neural networks are a key step in creating better AIs:

First, note that the seemingly magical power of neural networks comes not from their ability to execute, but from their ability to learn. Once training is complete and it is time for execution, a neural network is merely one particular massively parallel computational architecture -- and there are many others based on traditional software that can be similarly efficient.

But neural networks routinely crush the competition when it comes to learning.

I'm willing to imagine that the knowledge that's embodied in neural network can be translated into more traditional software, and that doing so might make enough difference that the traditional software version will be amenable to proofs. My intuition says we should be pessimistic about proofs being tractable for anything more powerful than GPT-4, but the benefits of doing so suggest that we should try anyway.

What bothers me most about the paper's strategy is that training a new neural network requires running that unproven neural network, in ways that allow it to do arbitrarily intelligent acts. Doesn't that conflict with the rule that the relevant hardware only runs provably safe software?

That appears to depend in part on whether the training software can prove that the neural network is safe.

For an arbitrarily smart neural network, that may require proving that there's no way for information about that neural network to get out of the training run except via the process that proves the safety of the neural network.

Implementing such security is well beyond my pay grade. But I'm not willing to bet it's beyond the ability of GPT-7.

It also means that AI development becomes harder once security reaches this level (the paper implies that increasingly strict safety requirements will gradually be added). The paper seems to downplay the development costs of humans not being able test the unsafe neural network to figure out why it couldn't be proven safe.

What Safety Properties Are Feasible?

The paper implies that we should initially ask for relatively basic safety properties, such as a refusal to help terrorists spread harmful viruses.

I want to focus instead on security properties that limit the AI's ability to take over the world. Specifically, can we prove that the AI won't interfere with a human pressing a kill button that shuts down the AI?

One simple suggestion from the paper is to require that the AI needs crypto coins to enable it to access compute. That means the default is for the AI to run out of compute unless the human(s) in control of generating the coins cooperates.

That's a good start for dealing with AIs that approach human levels. Eventually we'll want to add safety properties that deter AIs from harmful manipulation of humans.

The longer-term vision seems to be gradually building up stronger safety guarantees, until they're more like Asimov's 3 Laws of Robotics.

More Caveats

The strategy assumes we'll develop a good set of safety properties that we're demanding proof of. There's likely some room for automated systems to figure out what safety humans want, and turn it into rigorous specifications. Tegmark and Omohundro prefer to have this step done by humans.

I'm willing to accept that for near-human stages of AI, but not too long after that we'll needed something closer to an AI-generated approximation to CEV. I'll classify this as being a fairly important problem, one that's shared by most other approaches to AI safety.

Proofs about hardware seem different from proofs about software, since there are uncertainties about how well the theorem prover can observe the hardware. This seems like a relatively minor issue, but I'm uncomfortable with the way the paper claims mathematical certainty about the proofs.

The authors list sensor reliability as an open problem. They aim only for increasing that reliability, not full guarantees. So I'm getting mildly conflicting messages about how much certainty the paper is aiming for.

Section 8 ("The Only Path To Controllable AGI") goes a bit too far:

this implies that if some safety property doesn't have a proof, then there must be a way to violate it. Sufficiently powerful AGIs may well find that way. And if that AGI is malicious or controlled by a malicious human, then it will exploit that flaw.

That seems reasonable if the safety properties that we ask for are exactly the properties that we need, and if we use unlimited amounts of compute to search for proofs.

I consider it much more likely that we'll ask for security properties that don't exactly match our needs. We'll either be overly cautious, and ask for safety properties that constrain the AI more than we absolutely need. In which case, an absence of proof doesn't tell us whether a system is dangerous. Or we'll fail at safety by not asking for as much safety as we need.

Note that they seem to expect fairly exhaustive searches for proofs, such that failure to find a proof ought to imply a genuine absence of safety. I'm unsure whether this is a reasonable expectation.


I see something like a 50% chance that this strategy would significantly reduce AI risk if the world coordinated to require it. However, it's still rather hard to see how we can get such coordination.

I'm now feeling a wee bit more optimistic about decent regulation of AI development.

I've returned to thinking that it's overkill to have a full-fledged ban on development of powerful AI.

Requiring insurance and strict liability no longer appear to create large risks of amounting to a total ban on development, so it's looking more plausible that such regulations can be enforced.

If insurance companies need to see proofs of safety for advanced GPUs, that will in practice create a temporary pause in AI progress. I'm guessing that pause would be on the order of 1 to 10 years. But will it end due to successful proofs, or due to inability to enforce regulations? I don't know.

New to LessWrong?

New Comment
14 comments, sorted by Click to highlight new comments since: Today at 12:12 PM

As a few commenters have already pointed out, this "strategy" completely fails in step 2 ("Specify safety properties that we want all AIs to obey"). Even for a "simple" property you cite, "refusal to help terrorists spread harmful viruses", we are many orders of magnitude of descriptive complexity away from knowing how to state them as a formal logical predicate on the I/O behavior of the AI program. We have no clue how to define "virus" as a mathematical property of the AI sensors in a way that does not go wrong in all kinds of corner cases, even less clue for "terrorist", and even less clue than that for "help". The gap between what we know how to specify today and the complexity of your "simple" property is way bigger than the gap between the "simple" property and most complex safety properties people tend to consider...

To illustrate, consider an even simpler partial specification - the AI is observing the world, and you want to formally define the probability that it's notion of whether it's seeing a dog is aligned with your definition of a dog. Formally, define a mathematical function of arguments that, with the arguments representing the RGB values for a 1024x1024 image, would capture the true probability that the image contains what you consider to be a dog - so that a neutral network that is proven to compute that particular function can be trusted to be aligned with your definition of a dog, while a neutral network that does something else is misaligned. Well, today we have close to zero clue how to do this. The closest we can do is to train a neutral network to recognize dog pictures, and than whatever function that network happens to compute (which, if written down as a mathematical function, would be an incomprehensible mess that, even if we optimize to reduce the size of, will probably tbe at least thousands of pages long) is the best formal specification we know how to come up with. (For things simpler than dogs we can probably do better by first defining a specification for 3d shapes, then projecting it onto 2d images, but I do not think this approach will be much help for dogs). Note that effectively we are saying to trust the neural network - whatever it learned to do is our best guess on how to formalize what it needed to do! We do not yet know how to do better!!!

Note that effectively we are saying to trust the neural network

I expect that we're going to have to rely on some neural networks regardless of how we approach AI. This paper guides us to be more strategic about what reliance to put on which neural networks.

Fortunately, for coarse "guardrails" the specs are pretty simple and can often be reused in many contexts. For example, all software we want to run should have proofs that: 1) there aren't memory leaks, 2) there aren't out-of-bounds memory accesses, 3) there aren't race conditions, 4) there aren't type violations, 5) there aren't buffer overflows, 6) private information is not exposed by the program, 7) there aren't infinite loops, etc. There should be a widely used "metaspec" for those criteria which most program synthesis AI will have to prove their generated code satisfies. Similarly, there are common constraints for many physical systems: eg. robots, cars, planes, boats, etc. shouldn't crash into things or harm humans, etc. The more refined the rules are, the more subtle they become. To prevent existentially bad outcomes, I believe coarse constraints suffice. But certainly we eventually want much more refined models of the world and of the outcomes we seek. I'm a fan of "Digital Twins" of physical systems which allow rules and constraints to be run in simulation which can help in choosing specifications. We certainly want those simulations to be trusted. which can be achieved by proving the code actually simulates the systems it claims to. Eventually it would be great to have fully trusted AI as well! Mechanistic Interpretability should be great for that! I'm just reading Anthropic's recent nice advances in that. If that continues to make progress then it makes our lives much easier but it doesn't eliminate the need to ensure that misaligned AGI and malicious AGI don't cause harm. The big win with the proof checking and the cryptographic hardware we propose is that we can ensure that even powerful systems will obey rules that humanity selects. If we don't implement that kind of system (or something functionally equivalent), then there will be dangerous pathways which malicious AGI can exploit to cause great harm to humans. 

Thanks Peter for the post and thank you everyone for the comments. Let me try to clarify a bit. We're looking for an absolute foundation of trust on top of which we can build a world safe for AGI. We believe that we need to adopt a "Security Mindset" in which AGI's either on their own or controlled by malicious humans need to be considered a full on adversaries. The only two absolute guarantees that we have are mathematical proof and the laws of physics. Even the most powerful AGI can't prove a falsehood or violate the laws of physics. Based on these we show how to build a network of "provable contracts" that provide absolute guardrails around dangerous actions. As a commenter points out, figuring out which actions we need to protect against and what the rules should be is absolutely essential and not at all trivial! In fact, I believe that should be one of the primary activities of humanity for the next decade! 

From the paper:

AI is allowed to write the code and proof, but not the proof-checker.

Some examples from the paper of things to be proved are "this plane will not fly into buildings" (p.3), "this drone flies safely" (p.2), and "this social media account is human" (p.2). Somehow a formal specification is made of such things, and the proof-checker verifies that the proof indeed proves that the code satisfies the formal specification.

The only suggestion that the paper makes of how to make a formal specification is by citing its reference 94, "Autoformalization with Large Language Models". I have not read it, but the title implies that its answer is to have an AI do it. Where do you get an AI that is safe enough to be trusted with that?

Putting the two papers together, the scheme is:

  1. Give one AI a specification like "no-one can fly this plane into a building".

  2. Have the AI come up with a formal specification of what that means.

  3. A second AI generates avionics software to satisfy that specification and a purported proof that it does so.

  4. Run the specification, the code, and the proof through a dumb proof-checker.

  5. The code has been proved safe! Run it.

The second AI does not have to be safe, because the code it generates will not be run unless it passes step 4. But what about the first AI? The entire safety issue is still there. How can we assure ourselves that the formal specification that it generates actually describes what we really wanted? Especially if "no-one can fly this plane into a building" turns into a vast pile of mathematics as impenetrable as any neural network's weight matrices.

How do you stop someone doing the same with a different spec, "this plane will fly into any building I tell it to"?

Thanks for this write-up, Peter! 

Here's an interview I released yesterday with Omohundro on this paper: 

It's not a silver bullet, but not useless either. In software verification, there are many safety properties that are valuable to have, but they never guarantee safety in ways that are not captured in their formulation. Similarly, the task of formulating safety of AI systems in formal properties is hopeless in its ambitious form but could be extremely useful piecemeal, when AI systems are structured as modular software instead of as monolithic models.

Examples adjacent to where such methods might become relevant are Conjecture's CoEm and Drexler's CAIS. AIs built for a particular purpose that avoid escalation of their capabilities, or respect boundaries in their impact.

This also gets agent foundations a handhold in practical AI systems, ability to robustly implement particular laws of behavior. So not a solution to ambitious alignment, but rather tools for building systems that might eventually get us there, for keeping such systems tractable and bounded in particular ways.

Thank you for writing this review.

The strategy assumes we'll develop a good set of safety properties that we're demanding proof of.

I think this is very important. From skimming the paper it seems that unfortunately the authors do not discuss it much. I imagine that actually formally specifying safety properties is actually a rather difficult step.

To go with the example of not helping terrorists spread harmful virus: How would you even go about formulating this mathematically? This seems highly non-trivial to me. Do you need to mathematically formulate what exactly are harmful viruses?

The same holds for Asimov's three laws of robotics, turning these into actual math or code seems to be quite challenging.

There's likely some room for automated systems to figure out what safety humans want, and turn it into rigorous specifications.

Probably obvious to many, but I'd like to point out that these automated systems themselves need to be sufficiently aligned to humans, while also accomplishing tasks that are difficult for humans to do and probably involve a lot of moral considerations.

I'm skeptical because in addition to logical reasoning, intuitive reasoning seems pretty important. And I'm not sure if there's a simpler representation of intuitive reasoning that a bunch of weights from a bunch of concepts to another concept.

Check out this great paper: "From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought" https://arxiv.org/abs/2306.12672 It proposes "probabilistic programming" as a formal "Probabilistic Language of Thought" (PLoT) with precise formal Bayesian reasoning. They show in 4 domains how a large language model can convert an informal statement or chain of reasoning into a precise probabilistic program, do precise Bayesian reasoning on that, and then convert the results back into informal natural language.

Peter, thanks again for starting this discussion! Just a few caveats in your summary. We don't depend on trustable AIs! One of the absolutely amazing and critical characteristics of mathematical proof is that anybody, trusted or not, can create proofs and we can check them without any need to trust the creator. For example, MetaMath https://us.metamath.org/ defines an especially simple system for which somebody has written a 350 line Python proof checker which can check all 40,000 MetaMath theorems in a few seconds. We need to make sure that that small Python program is correct but beyond that don't need to trust any AIs! It certainly would be nice to have trustable AIs but I think it is a mistake to depend on them. We are already seeing a wide range of open source AIs created by different groups with different intentions. And there are many groups working to remove the safety features of commercial Large Language Models and text-to-image models. 

The core of our proposal is to put provable guardrails around dangerous technologies that AGI might seek to control. As many commenters have pointed out we need to figure out what those are! Some current examples include DNA synthesis machines, nukes, military hardware of all kinds, drones capable of dispersing dangerous payloads, etc.  And we need to create rules for which actions with these systems are safe! But we already do that today for human use. The CDC has detailed rules for biohazard labs. The initial implementation of our proposal will most likely be digitally checkable versions of those existing human rules. But certainly we need to extend them to the new environment as AIs become more powerful. Any action remotely capable of human extinction should become a primary focus. I believe establishing the rules for high risk actions should be an immediate priority for humanity. And extending those rules to other systems to enable human flourishing should be a close second.

We can use untrusted AIs in much of this work! We simply require them to generate formal proofs of any designs or software that they create. These can then be easily and rigorously checked and they can be used without concern for the properties of the AI which generated them.

I understand how we can avoid trusting an AI if we've got a specification that the proof checker understands.

Where I expect to need an AI is for generating the right specifications.

Perhaps it is best to develop AI systems that we can prove theorems about in the first place. AI systems that we can prove theorems about are more likely to be interpretable anyways. Fortunately, there are quite a few theorems about maxima and minima of functions including uniqueness theorems including the following.

Theorem: (maximum principle) If  is a compact set, and  is an upper semicontinuous function that is subharmonic on the interior , then .

If  is a bounded domain, and  is a -function, then define the Dirichlet energy of  as 

Theorem: (Dirichlet principle) If  is a compact subset of , then whenever  is continuous and  is a continuous function which is harmonic on  and where , then  is the -function that minimizes the Dirichlet energy  subject to the condition that .

Theorem: (J Van Name: A version of the Cauchy-Schwarz inequality) Suppose that  which have no invariant subspace and . Suppose furthermore that  is the partial function where . Then  and  if and only if there is some non-zero complex number  and invertible matrix  with  for . Here,  denotes the spectral radius of a matrix .

Right now, there are some AI systems related to the above theorems. For example, if we are given a graph, then the smallest eigenvalues of the Laplacian of a graph form clusters in your graph, so spectral techniques can be used to analyze graphs, but I am working on developing spectral AI systems with more advanced capabilities. It will take some time for spectral AI systems to catch up with deep neural networks while retaining their mathematical properties (such as fitness functions with an almost unique local maximum), but it is plausible that the development of spectral AI systems is mostly a matter of engineering and hard work rather than a problem of possibility, but I am not sure whether spectral AI systems will be as energy efficient as neural networks once well developed.