Steve_Omohundro

70

Hear Hear! Beautifully written! Safety is a *systems* issue and if components at any level have vulnerabilities, they can spread to other parts of the system. As you write, we already have effective techniques for formal verification at many of the lower levels: software, OS, chips, etc. And, as you say, the AI level and social levels are more challenging. We don't yet have a detailed understanding of transformer-based controllers and methods like RLHF diminish but don't eliminate harmful behaviors. But if we can choose precise coarse safety criteria, then formal gatekeepers between the AI and the actuators can ensure those constraints are obeyed.

At the social level, you mention legality and violation of laws. Those work well for humans because humans care about being fined or jailed. I think the AI situation is more challenging because it may be unclear what the source of an AI involved in a harmful event was. If a malicious human actor or AI can spin off harmful AI agents anonymously, then after-the-fact punishment doesn't serve as a deterrent. That suggests that we need to both create accountability for the provenance of AIs and to harden infrastructure so that even malicious AIs can't cause harmful outcomes. I believe that both of those can be accomplished using formally verified techniques like Max's and my "provable contracts".

But there is clearly a lot of work to be done! What are the next steps and how long do we have? I've been waiting for LLMs to reach proficiency at theorem proving, verified software synthesis, and autoformalization. Companies like DeepSeek (https://github.com/deepseek-ai/DeepSeek-Prover-V1.5 ), Harmonic (https://harmonic.fun/news ) and xAI (https://x.ai/blog ) are making rapid progress on each of these capabilities. But OpenAI's release of o1 this week may be the enabling event I've been waiting for. For example, check out mathematician Terence Tao's experience with it: https://mathstodon.xyz/@tao/113132502735585408 "I could imagine a model of this capability that was specifically finetuned on Lean and Mathlib, and integrated into an IDE, being extremely useful in formalization projects."

Of course, those new abilities also enable new capability levels for cyberattack and social manipulation. So the urgency for hardening today's infrastructure may happen earlier than many of us expected. That suggests that we may need to rapidly institute coarse protective measures before we can build out the truly human-beneficial infrastructure we ultimately want. But the sooner we can build precise models of our systems, their failure modes, and the full range of adversarial attacks, the less harm humanity will have to suffer.

81

People seem to be getting the implication we intended backwards. We're certainly *not* saying "For any random safety property you might want, you can use formal methods to magically find the rules and guarantee them!" What we are saying is "If you have a system which actually guarantees a safety property, then there is formal proof of that and there are many many benefits to making that explicit, if it is practical." We're not proposing any new physics, any new mathematics, or even any new AI capabilities other than further development of current capabilities in theorem proving, verified software synthesis, and autoformalization.

Humanity is in a crisis right now! Even without AI we have disasters like the $5 billion CrowdStrike flaw a few weeks ago, many many cyberattacks disabling critical systems, Boeing airplanes falling apart in the sky, etc. As open source AI cyberattack models advance, every one of today's flaws and security holes are likely to be exploited by a wide variety of malicious actors. We have clear descriptions of desired safety properties for many of today's issues and we have design rules which have been developed to mitgate the problems. But today's software development and engineering practices aren't leading to the safety we need! Software engineers wing it and introduce flaws in their systems and security holes are left unfixed for decades. Boeing outsources the manufacture of jet components to underpaid and unsupervised third parties.

By creating explicit formal representations of these rules, and formal "digital twins" of our systems, we can obtain *guarantees* that they are being followed. Our papers and my talks describe many techniques for doing this. But we are just scratching the surface. We need *many* *many* more people thinking about how to achieve *actual* safety in the face of today's and tomorrow's AIs.

I'm a little bit shocked at the low level of curiosity and creativity I'm seeing in these discussions. I have not seen *any* other proposals for actually dealing with the actual problems that real AI systems are likely to cause in the next few years. Attempts at alignment, red teaming, etc. are all very interesting and valuable but do nothing to address the hundreds of millions of open source AIs which are almost as powerful as the frontier closed models and which are easily fine tuned for cyberattack and other harmful actions.

The general approach we have outlined is *not* optional! Godel's completeness theorem tells us that there is a formal proof of any safety property which actually holds for a system. For complex systems, they are very unlikely to be *accidentally* safe. So designing and building them with explicit formal models will be necessary. If humans don't do it, then powerful AIs certainly will. But, in that case, we may have very little input into exactly what properties will be guaranteed.

Our proposal is *not* simple! But it doesn't rely on any exotic new ideas. It *does* require creativity and dedication to actually work to ensure that humanity survives the next decade. There would be tremendous benefit in simply precisely encoding today's safety criteria and in creating systems that guarantee that those criteria are followed. With *some* future creativity we can go further and use these techniques to develop much more accurate safety criteria and better designs. I would have expected huge interest and forward movement on this. I'm sad to not see much of that so far.

32

Thinking more about the question of are there properties which we believe but for which we have no proof. And what do we do about those today and in an intended provable future?

I know of a few examples, especially in cryptography. One of the great successes of theoretical cryptography was the reduction the security of a whole bunch of cryptographic constructs to a single one: the existence of one way functions which are cheap to compute but expensive to invert: https://en.wikipedia.org/wiki/One-way_function That has been a great unifying discovery there and the way the cryptographers deal with it is that they just add the existence of one way functions as an extra axioms to their formal system! It does mean that if it turns out not to be true, then a lot of their proven secure system may actually not be. Fortunately, there is "Information-Theoretic Cryptography" https://www.cambridge.org/us/universitypress/subjects/engineering/communications-and-signal-processing/information-theoretic-cryptography?format=HB which doesn't rely on any unproven assumptions but is somewhat more inconvenient to use.

Then there's "Public Key Cryptography" which rests on much dicier assumptions (such as factoring being hard). We already know that a bunch of those assumptions no longer hold in quantum computation but NIST recently announced 3 "post-quantum" standards https://csrc.nist.gov/projects/post-quantum-cryptography but I think there is still quite a lot of worry about them

More generally, mathematicians often have statements which they believe to be true (eg. P!=NP, the Riemann hypothesis, and others: https://en.wikipedia.org/wiki/List_of_unsolved_problems_in_mathematics On what evidence do mathematicians believe these statements? What happens if they use them in proofs of other statements?

Timothy Gowers wrote an insightful essay into these questions: "**What Makes Mathematicians Believe Unproved Mathematical Statements?" ****https://www.semanticscholar.org/paper/What-Makes-Mathematicians-Believe-Unproved-Gowers/b17901cece820de845e57456eda06f892b5ba199**** **

What does this mean for provable safety? One can always add any beliefs one likes to one's axioms and prove things from there! If the axioms you have added turn out to be false, that can undo the guarantees about anything which depended on them. That suggests that one should try to limit unproven assumptions as much as possible! And it also seems to be a great strategy to follow the cryptographers and try to capture the essence of an assumption in a "standard" assumption whose validity can be tested in a variety of contexts.

Physics aims to have a precise fundamental theory and to mathematically derive the consequences of that theory to explain all phenomena. From a mathematical point of view, physicists are notoriously "mathematically sloppy" using techniques which may often give the right answer but which may not be provable (eg. different perturbation methods, renormalization, path integrals, etc.) But, fortunately, more mathematically inclinded physicists have repeatedly come along afterwards and created precise formal models within which the physics derivations *are* justified.

Two big leaps in the tower of physics models are from deterministic quantum equations (eg. Schrodinger's equation) to statistical representations of measurements and from particle descriptions of matter (eg. an ideal gas) to statistical mechanics representations based on probabilities. Huge literatures explore the meaning and character of those two leaps but my sense is that in general we can't yet formally justify them, but that they are extremely well justified empirically. Physicists call these the assumptions of "Quantum decoherence" https://en.wikipedia.org/wiki/Quantum_decoherence and "Stosszahlansatz" or "Molecular chaos" https://en.wikipedia.org/wiki/Molecular_chaos

How do we deal with them formally? I think we have to do what physicists do and just add those assumptions as axioms to the formal models. This is risky in an adversarial context. We have to watch out for powerful adversaries (ie. powerful AIs) which can control matter at a level which enables them to violate these assumptions. Doesn't seem likely to me, but we must be ever vigilant!

Something I would like to do in these situations but I don't think we have the philosophical underpinnings for is to have precise provable estimates of the probabilities of these being true or false. Gowers makes some attempt at that but I'm not sure it's formal yet. It's a bit weird, it would be a probability for something which is objectively either true or false. So it would be a measure of our knowledge of the state. But it would be valuable for AI safety to have a justifiable measure for how much we need to worry about an adversary being able to violate our assumptions. And, ultimately, our current laws of physics are of this character. It would be great to have a precise measure of our confidence in various physical properties like symmetries (eg. time-invariance, translation invariance, rotational invarianc, etc.), conservation laws (mass/energy, momentum, lepton-number, etc.), etc.

30

Yes, thanks Steve! Very interesting examples! As I understand most chip verification is based on SAT solvers and "Model Checking" https://en.wikipedia.org/wiki/Model_checking . This is a particular proof search technique which can often provide full coverage for circuits. But it has no access to any kind of sophisticated theorems such as those in the Lean mathematics library. For small circuits, that kind of technique is often fine. But as circuits get more complex, it is subject to the "state explosion problem".

Looking at the floating point division paper, the double precision divider took 7 hours and 30 minutes indicating a lot of search! But one great thing about proofs is that their validity doesn't depend on how long they take to find or on how big they are.

It looks like they did this verification with the Synopsys VC Formal tools (https://www.synopsys.com/verification/static-and-formal-verification/vc-formal.html ) This looks like a nice toolkit but certainly no higher level mathematical proof. It sounds like it's perfectly adequate for this task. But I wouldn't expect it to extend to the whole system very smoothly.

To see what should be possible as AI theorem provers come on line, ask how the Synopsys engineers designed the whole chip to begin with. Presumably they had arguments in their heads about why their design was correct. Humans are especially bad at complex mathematical components like floating point divide, so it makes great sense to use a kind of brute force tool to back up their intuitions there. But with general reasoning capabilities (eg. as people are doing in Lean, Coq, etc.) it shouldn't be hard to formalize the engineer's intuitive understanding of why the whole chip is correct.

If the engineers already have a mental proof of correctness, what is the benefit of a formal proof? I believe one of the most important aspects of proof is its *social* role. The proof in the engineer's head is not convincing to anyone else. They can try to explain it to another engineer. But that engineer may or may not share the same background and intutions. And if they try to explain it to their manager, the poor manager mostly just has to go on his intuitive assessment of the engineer and on their reputation.

A formal proof eliminates all that need for trust! With a formal proof, anyone can rapidly check it with 100% reliability. The proven design can be incorporated along with other proven designs and the proofs composed to get a proven design of bigger systems. The user of the chip can check the proof and not need to trust the design company. All of these factors make proofs extremely valuable as part of the social interactions of design and deployment. They would be worth creating independent of the increase in correctness and security.

These papers are a great example of why AI theorem proving is likely to be a complete game changer for safety and security but also for general design and process management. NVIDIA's H100 chip has nearly 13,000 AI-designed circuits https://developer.nvidia.com/blog/designing-arithmetic-circuits-with-deep-reinforcement-learning/ That's just the tip of the iceberg when AI-driven verification becomes the norm.

52

I agree that would be better than what we usually have now! And is more in the "Swiss Cheese" approach to security. From a practical perspective, we are probably going to have do that for some time: components with provable properties combined in unproven ways. But every aspect which is unproven is a potential vulnerability.

The deeper question is whether there are situations where it *has* to be that way. Where there is some barrier to modeling the entire system and formally combining correctness and security properties of components to obtain them for the whole system.

Certainly there are hardware and software components whose detailed behavior is computationally complex to predict in advance (eg. searching for solutions to SAT problems or inverting hash functions). So you are unlikely to be able to prove theorems like "For every specification of these n bits in this SAT problem, it will take f(n) time to discover a satisfying value for the remaining bits". But that's fine! It's just that you shouldn't make the correctness or security of your system depend on that! For example, you might put a time bound on the search and have a failsafe path if it doesn't succeed by then. *That* software does have a provable time bound.

So, in general, systems need to be *designed* to be correct and safe. If you can't put provable bounds on the safety of a system, then I would argue that you have no business exposing the public to that system.

It would be great to start collecting examples of subcomponents or compositional designs which are especially *difficult* to prove properties about. My sense is that virtually all of these formal analyses will be done by AIs and not by humans. And I think it will be important to develop libraries and models of problems which are easy to solve formally and provide formal guarantees about. And those which are more difficult.

Thinking about the software verification case, I would argue that every decently written piece of software today, the programmer has an internal argument in their head as to why it is correct and not vulnerable to attacks. Humans are fallible, so their argument may not be correct. But if it is correct, then it shouldn't be difficult to formalize it into a precise formal proof. The "de Bruijn Factor" (https://www.cs.ru.nl/~freek/factor/factor.pdf ) measures how much bigger a formal proof of something is than an informal description. It seems to be between 4 and 10 in current formal systems. So, if a human programmer has confidence in the correctness and security of his code, it should only be a small factor more work for an AI to formally prove that. If the programmer *doesn't* have that confidence, then I think we have no business deploying it anywhere it might harm humans.

92

I totally agree in today's world! Today, we have management protocols which are aimed at requiring testing and record keeping to ensure that boats and ships in the state we would like them to be. But these rules are subject to corruption and malfeasance (such as the 420 Boeing jets which incorporated defective parts and yet which are currently flying with passengers: https://doctorow.medium.com/https-pluralistic-net-2024-05-01-boeing-boeing-mrsa-2d9ba398bd54 )

But it appears we are rapidly moving to a world in which much of the physical labor will be done by robots and in which each physical system will have a corresponding "digital twin" (eg. https://www.nvidia.com/en-us/omniverse/solutions/digital-twins/ ).

In *that* world, we can implement provable formal rules governing every system, from raw materials, to manufacture, to supply chain, to operations, and to maintenance.

In an AI world, much more sophisticated malfeasance can occur. Formal models of domains with proofs of adherence to rules and protection against adversaries is the only way to ensure our systems are safe and effective.

52

Testing is great for a first pass! And in non-critical and non-adversarial settings, testing can give you actual probabilistic bounds. If the probability distribution of the actual uses is the same as the testing distribution (or close enough to it), then the test statistics can be used to bound the probability of errors during use. I think that is why formal methods are so rarely used in software: testing is pretty good and if errors show up, you can fix them then. Hardware has greater adoption of formal methods because it's much more expensive to fix errors after the fact.

But the real problems arise from adversarial attacks. The statistical correctness of a system doesn't matter to an adversary. They are looking for the weird outlier cases which will enable them to exploit the system (eg. inputs with non-standard characters that break the parser, or super-long inputs which overflow a buffer and enable unexpected access to memory, etc.). Testing can't show the *absence* of flaws (unless *every* input is tested!).

I think the increasing plague of cyberattacks is due to adversaries become more sophisticated in their search for non-standard ways of interacting with systems that expose their untested and unproven underbelly. But that kind of sophisticated attack requires highly skilled attackers and those are fortunately still rare.

What is coming, however, are AI-powered cyberattack systems which know all of the standard flaws and vulnerabilities of systems, all of the published 1-day vulnerabilities, all of the latest social engineering techniques discussed on the dark web, and have full access to reverse engineering tools like Ghidra. Those AIs are likely being developed as we speak in various government labs (eg. here is a list of significant recent cybe incidents: https://www.csis.org/programs/strategic-technologies-program/significant-cyber-incidents ).

How long before powerful cyberattack AIs are available on bittorrent to teenage hackers? So, I believe the new reality is that *every* system, software and hardware need to be proven correct and secure to have any confidence in it. To do that, we are likely to need to use AI-theorem provers and AI-verified software synthesis systems. Fortunately, many groups are showing rapid progress on those!

But that doesn't mean testing is useless. It's very helpful during the development process and in better understanding systems. For final deployment in an environment with powerful AIs, however, I don't think it's adequate any more.

102

In general, we can't prevent physical failures. What we can do is to accurately bound the probability of them occurring, to create designs which limit the damage that they cause, and to limit the ability of adversarial attacks to trigger and exploit them. We're advocating for humanity's entire infrastructure to be upgraded with provable technology to put guaranteed bounds on failures at every level and to eliminate the need to trust potentially flawed or corrupt actors.

In the case of the ship, there are both questions about the design of that ship's components and its provenance. Why did the backup power not enable the propulsion system to stop? Why wasn't there a "failsafe" anchor which drops if the systems become inoperable? Why didn't the port have tugboats guiding risky ship departures? What was the history of that ship's generators? Etc. With the kind of provable technology that Max and I outlined, it is possible to have provably trustable data about the components of the ship, about their manufacture, about their provenance since manufacture, about the maintenance history of the ship's components, etc.

The author of the main post and other critics argue against formal methods doing complex "magical" things like determining which DNA sequences are safe, how autonomous vehicles should navigate cities, or detecting bad thoughts in huge transformer neural nets. Someday these methods might help with some of those, but those aren't the low hanging fruit we are proposing. In some sense we mainly want to use proof for much more mundane things. What Max and I are arguing for are mechanisms to create software, hardware, and social designs which aren't exploitable by adversarial AIs and to create infrastructure that provides guarantees about its state and behavior. Nothing we are proposing requires sophisticated mathematics that today's grad students couldn't do. Nothing requires new physics or radically new engineering principles. Rather, it is a way to organize current technologies to increase trust and eliminate vulnerabilities.

These technologies enable us to eliminate the need to trust third parties: Was a computation performed accurately? Were there bugs in the program? What data was used to train this model or estimate this probability? What probabilistic program or neural net was used? Was the training done correctly? What is the history of this component? What evidence is there that it was manufactured correctly? These and thousands more cases will enable us to build up a robust infrastructure which is provably not vulnerable to AI-driven attack.

A core aspect of this is that we can use untrusted powerful AIs running on untrusted datacenters in untrusted countries to help us build completely trusted software, hardware, and social protocols. The idea is to precisely specify a task (eg. software spec, hardware spec, solve a mathematically encoded problem, etc.) and have the untrusted AI generate both and answer and a proof (in a system like Lean) that the answer solves the precisely specified problem or design task. We can cheaply and completely reliably check the proof. If it verifies, then we can fully trust the results from the untrusted AI. This enables us to bootstrap the current mess of untrusted and unreliable AIs, flaky and insecure hardware, untrustable people and groups, etc. to build up a *fully* trustable infrastructure. The power and importance of this is immense!

36

I totally agree! I think this technology is likely to be the foundation of many future capabilities as well as safety. What I meant was that society is unlikely to replace today's insecure and unreliable power grid controllers, train network controllers, satellite networks, phone system, voting machines, etc. until some big event forces that. And that if the community produces comprehensive provable safety design principles, those are more likely to get implemented at that point.

David, I really like that your description discusses the multiple interacting levels involved in self-driving cars (eg. software, hardware, road rules, laws, etc.). Actual safety requires reasoning about those interacting levels and ensuring that the system as a whole doesn't have vulnerabilities. Malicious humans and AIs are likely to try to exploit systems in unfamiliar ways. For example, here are 10 potential harmful actions (out of many many more possibilities!) that AI-enabled malicious humans or AIs could gain benefits from which involve controlling self-driving cars in harmful ways:

1) Murder for hire: Cause accidents that kill a car's occupant, another car's occupant, or pedestrians

2) Terror for hire: Cause accidents that injure groups of pedestrians

3) Terror for hire: Control vehicles to deliver explosives or disperse pathogens

4) Extortion: Demand payment to not destroy a car, harm the driver, or harm others

5) Steal delivery contents: Take over delivery vehicles to steal their contents

6) Steal car components: Take over cars to steal components like catalytic converters

7) Surreptitious delivery or taxis: Control vehicles to earn money for illegal deliveries or use in illegal activities

8) Insurance fraud for hire: Cause damage to property for insurance fraud

9) Societal distraction for hire: Cause multiple crashes to overwhelm the societal response

10) Blocked roads for hire: Control multiple vehicles to block roadways for social harm or extortion

To prevent these and many related harms, safety analyses must integrate analyses on multiple levels. We need formal models of the relevant dynamics of each level and proof of adherence to overall safety criteria. For societal trust in the actual safety of a self-driving vehicle, I believe manufacturers will need to provide machine-checkable "certificates" of these analyses.