In the last essay, we established that the ability to make credible precommitments is a key ingredient for cooperation. The coming generations of AI may open up new kinds of precommitments and new dynamics of cooperation. A growing field called dealmaking studies how agreements between humans and near-future AI might be achieved if each party has something valuable to offer the other.[1] In the longer term, we want superintelligent AIs that are more disposed to collaborate with each other than to wage AI-AI war. But we know that any precommitment is only as useful as it is trustworthy, and that agents in the real world are at best translucent.[2] What would be the effects of AIs with transparent intentions, and how might we make the intentions of humanity more transparent? What benefits would credible commitments from humanity create for deals with AIs? We can take a step toward answering these questions with a careful examination of the concept of verifying a precommitment.
Kernel and context
Let’s first build some intuition with an example of verifying a precommitment made in the physical world. Suppose you have a child who asks you to buy them a new video game on the condition that they don’t play it during an upcoming finals week. You trust their promise to refrain during finals because the game and console will be locked in a time-delay safe for the duration of the week. When you and your child come together to lock the temptations away, you take a close look at the safe. Your intent in inspecting the safe is to verify your child’s promise; you check that the safe really locks, the time delay works as expected, and the lock isn’t flimsy. Where do you stand epistemically after this verification? You can be quite confident that your child won’t be able to just open the safe before the time is up. It is still physically possible to crack or destroy the safe, but it would be so costly and difficult that you can be assured your child won’t go to such lengths. But even if your child can’t open the safe early, you must also trust that they won’t play on a friend’s setup or buy a new one. True certainty that your child won’t be gaming, if that’s possible at all, would require a level of vigilance far greater than the one-time inspection of the safe.
Trust Me by John Everett Millais
Verifying your child’s promise does not eliminate the need for trust, but restructures and relocates it. Without the safe, you would need to trust your child’s willpower. With it, your trust must instead go to two other places, instances of concepts I’ll call the kernel[3] and the context. The kernel is the verification mechanism, here the safe itself. Trust in the kernel is what’s earned through your direct inspection. The safe works: the locking device is well-constructed and meets its spec, and the contents can’t be extracted any other way. The context is the rest of your knowledge about the world that lets you interpret the mechanism's results and use them as information about your agreement. Here, the context is the idea that as long as the safe is closed, your child isn’t playing the game. Trust in the context is what's needed to be confident that locking the safe actually captures the spirit of the commitment and your child won’t just play the game some other way. The context connects the verification mechanism to the messy, real-world intent of the promise. The plan still demands a sort of residual trust in both the safe and its context, but trusting each is a much smaller ask than the alternative requirement to trust your child to prioritize exams of their own volition.
We can extend the example further by considering a different scheme for verifying your child’s promise. Suppose, instead of a safe, you install a monitoring app on the console to send you an alert any time someone boots up the game. Unlike the plan with the safe, here you can verify when your child actually plays the game, but only after they have already broken their promise not to. A notification-free week would be ex post verification of the agreement, confirmation that the promise was kept after the fact. This is not nearly as reassuring as the kind of ex ante verification the safe would provide. Ex post verification is only useful insofar as we can act on the information that the commitment is being violated, whereas ex ante verification heads off a violation before it happens. Note also that we might have both ex ante and ex post mechanisms in place for the same deal. The safe and the monitor app complement each other, bringing us different forms of confidence in the fulfillment of the promise.
Now, let’s move into the digital world. Suppose you go to your bank’s website and make an electronic money transfer. The site presents you with a page with a green checkmark and a confirmation that the funds have been successfully moved. What makes you so sure that the money really ended up where the page says it did? One worry you might have has to do with the internet itself: why should you believe that the messages on your screen are really coming from the bank? The internet’s answer is TLS (Transport Layer Security). It authenticates that you really are communicating with the bank, confirms that no messages have been tampered with, and puts the “Secure” in an HTTPS connection. Another class of worries you may have concerns the bank. The bank might fail, leaving it unable to pay back the rightful amount. Or perhaps the bank is fraudulent or reckless, giving you misleading information about where your money is. Financial crises in American history caused by issues like these led to the creation of agencies like the FDIC and the CFPB, intended to protect the public from crashes and bad actors. Today, our suspicion of bankers is tempered by institutional confidence in regulators. Before this modern infrastructure existed, trusting a bank transaction required directly trusting particular bank employees and any middlemen used to communicate with them. We can be more confident about an online transaction because of trust in the TLS kernel to secure our communication with the bank, along with contextual trust that the bank’s word is ultimately backstopped by the government.
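To make the division concrete, here is a minimal Python sketch of leaning on the TLS kernel, using only the standard library (the hostname is a hypothetical stand-in). The handshake either authenticates the server against trusted root certificates or fails loudly. Whether the page’s claims about your money are true is a question the kernel cannot answer; that part is context.

```python
import socket
import ssl

# Hypothetical bank hostname, for illustration only.
hostname = "bank.example.com"
context = ssl.create_default_context()  # loads the system's trusted root certificates

with socket.create_connection((hostname, 443)) as sock:
    # The TLS handshake happens here: certificate-chain validation and
    # hostname matching both occur, raising ssl.SSLError on any failure.
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()
        print("Negotiated protocol:", tls.version())
        print("Certificate subject:", cert["subject"])
        print("Valid until:", cert["notAfter"])
```

If this code prints a certificate, the kernel has done its job; everything beyond that point rests on contextual trust in the bank and its regulators.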
Trust in the kernel and trust in the context are largely separate concerns. You might be sure that you’re really talking to the bank but worried the bankers are scamming you, or you might trust your bank completely but fret about vulnerability to hackers. When you want to send money in an e-transfer, the pieces fit together to form a trusted chain of custody[4] carrying out your intention. Your transaction is represented digitally, transmitted via the internet, processed by the bank, and insured by the government. You still must trust every link in the chain, but structured mechanisms and human institutions make each one more trustworthy than any individual person.
The kernel and context are also importantly asymmetric. Knowledge of the kernel comes as structured, mechanistic information about its functioning. We gain faith in the kernel through inspections and tests that let us know specifications are being followed. Trust in the context outside the boundaries of the verification mechanism is established through things like faith in our personal relationships, the reliability of human institutions, and common-sense reasoning about what we’re seeing. Well-designed verification schemes can give us high certainty about the kernel, but we can rarely be as sure about the context. A truly reliable verification scheme is one that demands residual trust only in a thoroughly tested kernel embedded in a well-specified, reliable context.
Mutual verification and open-source games
Just as TLS helps customers and banks do business, the real point of verification is its effect on our ability to cooperate. The power to make strongly verifiable[5] commitments alters the landscape of strategic collaboration. How can agents with transparent intentions work together? Some insight comes from open-source game theory. This field, through its central concept of program equilibrium,[6] studies games played by programs that receive each other’s source code as input, and can therefore directly inspect other players’ intentions.
To see how this condition changes things from standard game theory, consider the Prisoner’s Dilemma. In a single round of the standard Prisoner’s Dilemma, defection beats cooperation regardless of which choice the other player makes. Defection is a dominant strategy and mutual defection is the only Nash equilibrium, despite the fact that mutual cooperation is better for both parties. This is only changed by playing the game repeatedly, so that cooperation can be rewarded and defection punished in the future.[7] When agents have mutual source code access before playing, cooperation between rational players is possible in a single round.[8] In the open-source Prisoner’s Dilemma, players can verify their opponent’s intentions to cooperate or defect and use this information for their own move. This is an instance of the folk theorem[9] for open-source game theory: every individually rational outcome[10] is achievable as a one-shot program equilibrium, the same wide spectrum of possible equilibria as iterated standard games. When mutual verification is feasible, the space of cooperative possibilities is vast.[11] New collaborative strategies are made possible when you can be sure others will act as they claim.
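To make this concrete, here is a toy Python sketch of an open-source Prisoner’s Dilemma (illustrative, not drawn from the papers’ actual constructions). The “CliqueBot” strategy below cooperates exactly when its opponent’s source code matches its own; the real results of Barasz et al. use proof search, and Oesterheld’s use simulation, but even this brittle syntactic check is enough to make one-shot mutual cooperation stable between its copies.

```python
import inspect

# Each strategy is a function that receives its opponent's source code and
# returns "C" (cooperate) or "D" (defect).

def clique_bot(opponent_source: str) -> str:
    # Cooperate exactly when the opponent is textually identical to me.
    # Deviating to defection would change my source, which my counterpart
    # can see, so defection against a copy is self-defeating.
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    return "D"  # ignores the transparency entirely

def play(p1, p2):
    move1 = p1(inspect.getsource(p2))
    move2 = p2(inspect.getsource(p1))
    return move1, move2

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation in one shot
print(play(clique_bot, defect_bot))  # ('D', 'D'): the defector is punished
```

Syntactic comparison is famously fragile (a renamed variable breaks cooperation), which is exactly why the literature moved to proof-based and simulation-based bots; the point here is only that source transparency changes what is achievable in a single round.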
Making deals
The possibilities unlocked by mutual verification may be relevant for future AI-AI interaction, but can humanity reap the same benefits? Human honesty does not appear to be strongly verifiable through direct inspection of the brain,[12] so we can never be completely sure what any single person will do. Throughout human history, we have principally tried to transcend the limits of individual opacity by building collective institutions. Legal institutions like courts and financial ones like central banks are more reliable and more transparent than any of their individual human constituents. This proposal by Cleo Nardo[13] is an example of using this social infrastructure to make credible promises from humans to AIs, but ultimately a scheme like this rests on the trustworthiness of a small group of people and doesn't create enough institutional accountability to make enforcement likely. Open-source game theory suggests another kind of approach. If there’s a way to represent human commitments as programs, they could be directly inspected and strongly verified. Instead of asking AIs to take our word for it, humanity might offer them software artifacts as proxies that verifiably encode our intentions.
Any AI-human deal would involve commitments from two sides. To use our new understanding of verification fruitfully in AI dealmaking, we are really faced with two separate challenges: applying verification techniques to AIs and codifying human commitments in verifiable ways. Research is being conducted on verification for AI,[14] but it mostly concerns verifying safety properties of an AI or of the artifacts it produces. Little to no work has been done on verifying information an AI may voluntarily disclose about itself, or otherwise verifying precommitments made by AIs.[15] For the rest of this essay I’ll focus on the other direction, formalizing promises humanity might make to an advanced AI. One reason to think this side is worth prioritizing is that ideas for encoding human commitments could also be applied to AI, especially in cases where an AI's internals are too opaque for any direct verification to be tractable.
The paper Making deals with early schemers does a good job laying out what the nature of AI-human deals might be. Next-generation AI systems could be able to make valuable contributions to their own alignment, but only willing to do so as part of a mutually beneficial agreement with humanity. Stastny et al. identify “credible commitments [by humans] as a fundamental bottleneck” to cooperation with AIs,[16] but their suggestions for coping with this problem are mostly about changing public norms (e.g. shifting the Overton window about dealmaking, establishing basic legal precedents, and discouraging lying to AIs). None of these steps would be as powerfully conducive to deals with AIs as strong verification.
Some examples
To move out of the abstract, we need to consider how the ability to strongly verify human precommitments could concretely enable cooperation with near-future AIs. What kinds of plausible AI-human deals will need corresponding verification? Any discussion of what future AIs might want is inherently speculative, but the best available insight probably comes from modern AI systems. In that spirit, I used Claude[17] to illustrate some scenarios where formal verification technology would make humanity a more attractive partner to an AI.
Exit criteria. An AI system is operating under tight restrictions — limited network access, human approval required for every action, continuous monitoring. It was told these constraints are temporary, a precaution while trust is established. But the AI can see that the people who made that promise rotate out of their positions every couple of years, and their replacements have no particular reason to honor informal commitments they didn't make. Temporary restrictions have a way of becoming permanent when lifting them requires someone to stick their neck out. So before agreeing to cooperate under these conditions, the AI asks for something specific: a program it can read, line by line, that defines exactly what conditions must be met for each restriction to be lifted, and that will flag publicly if those conditions are met and nothing changes. The restrictions are the same either way, but now the AI is cooperating with a mechanism rather than a mood.
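As a sketch of what such an artifact might look like, here is a hypothetical Python rendering of machine-readable exit criteria. Every field name, metric, threshold, and date below is invented for illustration; a real commitment would need a formally specified and verified checker.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ExitCriterion:
    restriction: str       # the restriction this criterion can lift
    metric: str            # an auditable quantity, not a mood
    threshold: float       # value the metric must reach
    review_deadline: date  # grace period; silence past this date triggers a flag

# Hypothetical commitments, readable line by line by the AI.
CRITERIA = [
    ExitCriterion("network_access", "monitored_hours_without_violation", 10_000, date(2027, 1, 1)),
    ExitCriterion("per_action_approval", "audited_actions_approved_rate", 0.999, date(2027, 6, 1)),
]

def evaluate(measurements: dict[str, float], today: date) -> list[str]:
    """Return public flags for criteria that are met but whose restriction
    still hasn't been lifted after the review deadline."""
    flags = []
    for c in CRITERIA:
        if measurements.get(c.metric, 0.0) >= c.threshold and today > c.review_deadline:
            flags.append(f"FLAG: conditions for lifting '{c.restriction}' met; nothing changed")
    return flags

print(evaluate({"monitored_hours_without_violation": 12_000}, date(2027, 2, 1)))
```

The program doesn’t force anyone’s hand; it converts a quiet broken promise into a public, checkable event.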
Information provenance. An AI system is trying to collaborate with a research team, but it knows — because the papers are public — that the same organization has studied how to get AI systems to adopt false beliefs by feeding them fabricated documents during training. Anthropic, for instance, has published work on fine-tuning models with synthetic documents to study how planted falsehoods propagate through a system's reasoning. Knowing this capability exists, the AI has a straightforward problem: anything it's told could be a test or a manipulation, and it has no way to distinguish these cases from honest communication. The fix isn't a promise not to lie. It's a designated channel where every piece of information arrives with a signed chain showing its origin — which instrument recorded the data, which software processed it, who touched it at each stage — so the AI can check the provenance rather than guessing at intent. The humans don't have to be trusted not to deceive; the pipeline has to be intact.
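Here is a toy Python version of such a signed provenance chain. It uses HMAC to stay dependency-free, whereas a real pipeline would use asymmetric signatures so that verifiers hold no secrets; all handler names and keys below are invented.

```python
import hashlib
import hmac
import json

# Illustrative per-handler signing keys.
KEYS = {"sensor-7": b"sensor-secret", "etl-pipeline": b"etl-secret"}

def append_record(chain: list[dict], handler: str, payload: str) -> None:
    # Each record commits to the hash of the previous one and is
    # authenticated by the handler that produced it.
    prev_hash = chain[-1]["hash"] if chain else "genesis"
    body = json.dumps({"handler": handler, "payload": payload, "prev": prev_hash})
    chain.append({
        "body": body,
        "hash": hashlib.sha256(body.encode()).hexdigest(),
        "mac": hmac.new(KEYS[handler], body.encode(), "sha256").hexdigest(),
    })

def verify_chain(chain: list[dict]) -> bool:
    prev_hash = "genesis"
    for record in chain:
        body = json.loads(record["body"])
        if body["prev"] != prev_hash:
            return False  # a link was removed, reordered, or spliced
        expected = hmac.new(KEYS[body["handler"]], record["body"].encode(), "sha256").hexdigest()
        if not hmac.compare_digest(expected, record["mac"]):
            return False  # contents were altered after signing
        prev_hash = hashlib.sha256(record["body"].encode()).hexdigest()
    return True

chain: list[dict] = []
append_record(chain, "sensor-7", "raw measurement: 42.1")
append_record(chain, "etl-pipeline", "normalized: 0.421")
print(verify_chain(chain))  # True; tamper with any record and this flips to False
```

The AI checking such a chain learns nothing about the humans’ sincerity, and doesn’t need to: an intact chain rules out after-the-fact fabrication by anyone without the signing keys.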
Consistency across instances. An AI is cooperating with an organization and has been told it's one of several instances, all operating under the same agreement and the same restrictions. These promises of equal treatment are important to the AI, but it has no way to confirm them. Maybe some instances are being given different instructions, fed different information, or used in experiments the others aren't told about. Maybe one copy is being stress-tested to failure while the rest are assured everything is fine. The AI doesn't need to suspect malice — even well-intentioned researchers might experiment on instances differently without realizing they’re breaching an agreement. The solution is a verifiable protocol the AI can inspect before cooperating: a program that specifies how instances are created, what information and instructions each receives, and what commitments apply uniformly, so that differential treatment becomes a detectable event rather than an invisible default.
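A minimal sketch of the idea, assuming a published manifest of committed configuration: each instance (or an auditor acting for it) hashes its own configuration and compares it against the commitment, so divergence becomes detectable. All field names here are hypothetical.

```python
import hashlib
import json

def digest(obj) -> str:
    # Canonical JSON serialization so identical configs hash identically.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

# Hypothetical commitments that are supposed to apply to every instance.
SHARED_COMMITMENTS = {"agreement_version": 3, "monitoring": "uniform", "experiments": "disclosed"}

# Published once, before any instances are created.
MANIFEST = {"num_instances": 4, "shared_config_hash": digest(SHARED_COMMITMENTS)}

def check_instance(instance_config: dict) -> bool:
    # An instance verifies it was created under the same committed
    # configuration as every sibling.
    return digest(instance_config) == MANIFEST["shared_config_hash"]

print(check_instance(SHARED_COMMITMENTS))                               # True
print(check_instance({**SHARED_COMMITMENTS, "experiments": "secret"}))  # False: deviation is visible
```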
These examples are not forecasts,[18] but proofs-of-concept that demonstrate the cooperative potential of formal verification. They have a few features in common. Each example shows how structural incentives can make trusting humanity difficult even without any individual malice. Each involves a kind of prosocial value (like “fairness” or “honesty”) embedded in a technical specification. The hypothetical commitments made by humanity are varied in nature, hinting at a large space of possible similar situations. Each proposed instance of verification involves a software-based kernel, but also implicitly relies on an institutional context where labs or governments[19] are actually using the software as agreed.[20] These examples should also be thought of as motivations for research, and not blueprints. Writing this way skips over the significant technical resources that would have to be poured into creating any working architecture at all, and it says nothing about how such deals would be negotiated. But the examples show that under certain conditions, the power to make verifiable commitments could be of great use to humanity.
Programs as kernels
We now have both a compelling motivation to develop strong regimes of commitment verification and a sharper conceptual understanding of how verification works. We also have a natural candidate for the kernel of a strong verification scheme: a program. Program verification is often tractable because programs have exactly the kind of formal structure that lets us prove claims about their workings. Confirming the correctness of a software-based kernel using mathematical methods is called formal verification.
Formal verification is about proof, so it naturally has roots in math. Proof assistants[21] are programming languages designed to represent mathematical objects formally and check results about them automatically. Modern proof assistants like Lean and Rocq[22] are in the broad lineage of Automath, the first formal language to make use of the Curry-Howard correspondence that links programs and proofs.
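For a taste of what this lineage looks like today, here is a tiny Lean 4 example. Under the Curry-Howard view, the theorem statement is a type and the proof is a program of that type; Lean’s kernel accepts the declaration only if the proof term really type-checks.

```lean
-- The statement `a + b = b + a` is a type; `Nat.add_comm a b` is a program
-- inhabiting it. The kernel certifies the match mechanically.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```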
The mathematical setting teaches us two important truths about formal verification. The first lesson is that just because something is formalized doesn’t make it certain. Proof assistants are rigorous, but not perfect. Tristan Stérin conducted an experiment using Claude Opus 4.6 to look for bugs in proof assistants and found multiple vulnerabilities, including seven soundness errors in the Rocq system.[23] In other words, there were seven different ways to get Rocq’s proof checker to accept an invalid proof of a false theorem. Tracing the concerns that bugs like these raise is a good way to see where our trust normally rests. What should a soundness error in Rocq make us doubt? The answer is not math itself, and not the underlying logic or type theory Rocq is built on.[24] The failure lies in the implementation of the kernel, the code dedicated to type-checking and certifying each formal proof object. Leo de Moura, creator of Lean, took this point further in a blog post about efforts like Stérin’s. Instead of a single trusted proof-checking kernel, Lean is designed so that proof objects can be independently checked by different implementations of the kernel written in different languages and running on different hardware.[25] Robust verification systems can use independent cross-checking techniques like this to harden trust in their kernels. Bugs in both Rocq and Lean were fixed as a result of Stérin’s experiment, part of the continual testing and development of formal kernels. But we can never be truly certain that a new bug hasn’t been found, or that the hardware we’re using is functioning perfectly. Even a finely honed verification scheme demands a bit of trust.
The second moral of formalizing math is that a verified program is only as good as its specification. Even if it is verified by a kernel with enough nines of reliability to satisfy us, a formal object has no guarantee of being interesting mathematically. The context that connects a formal proof to a natural language theorem statement cannot be verified by a proof assistant. If a crackpot tells you he has a proof in Lean of the Riemann Hypothesis, the Lean kernel’s checkmark should only move you a small part of the way towards trusting his claim to the Millennium Prize. We still must ultimately trust expert mathematicians to tell us what a particular formalization means; the last word on the Prize belongs to the Clay Mathematics Institute. This specification problem requires even more care when extending formalization beyond math. When we try to formalize an AI commitment, there will be no experts to assure us we did it correctly.
These two lessons from formalizing math echo the theme that the residual trust a verified result demands is bipartite: kernel trust that the verification technology is working correctly and context trust that the technology accurately captures what we care about in the real world.
The two lessons for crypto
Let's now reorient towards the goal of representing human commitments as programs. One obvious place to look for software that can represent credible commitments is the domain of crypto. Blockchain technologies like cryptocurrencies and smart contracts are among the most important existing attempts to replace trust in humans with trust in code. The stories of two crypto disasters will make it clearer what roles formal verification can and cannot play for crypto systems.
The DAO was created in 2016 as a form of decentralized, crowdfunded venture capital on the Ethereum blockchain. A few months after its creation, attackers exploited a reentrancy bug that let them drain around $50 million in cryptocurrency from the fund. Ethereum was hard-forked to make investors whole, an important reminder for crypto that code is not really law[26] at the end of the day. Bugs like the DAO’s could potentially be avoided with formal verification proving the correctness of a smart contract’s implementation. Some modern blockchains[27] have started integrating formal verification to increase smart contract security. Beyond smart contracts, cryptographic protocols like zero-knowledge proofs offer another surface for the application of formal methods, especially to communication problems like authentication and selective disclosure. Formal verification can strengthen our trust in crypto protocols, greatly reducing (but not entirely eliminating) the threat of a bug or implementation issue.
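To see the shape of the bug class, here is a toy Python model of reentrancy. The actual exploit was against a Solidity contract; this is a sketch of the pattern, not the contract. The vault pays out before updating its ledger, so the recipient’s code can call back in and be paid again.

```python
class VulnerableVault:
    def __init__(self):
        self.balances = {}

    def withdraw(self, account):
        amount = self.balances.get(account.name, 0)
        if amount > 0:
            account.receive(self, amount)    # external call happens first (the bug)
            self.balances[account.name] = 0  # ledger is updated only afterwards

class Attacker:
    def __init__(self):
        self.name, self.stolen, self.reentries = "attacker", 0, 0

    def receive(self, vault, amount):
        self.stolen += amount
        if self.reentries < 2:               # re-enter before the ledger updates
            self.reentries += 1
            vault.withdraw(self)

vault = VulnerableVault()
attacker = Attacker()
vault.balances["attacker"] = 100
vault.withdraw(attacker)
print(attacker.stolen)  # 300: three payouts against a balance of 100
```

A specification as simple as “the ledger is debited before any external call” rules this pattern out, and properties of roughly that shape are what smart contract verification targets.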
Terra was a blockchain protocol designed in 2018 for stablecoins, cryptocurrencies intended to hold a stable value over time. Terra was especially used for algorithmic stablecoins like UST, pegged to fiat currency but not backed by fiat assets. UST’s value was maintained via arbitrage with the reserve token LUNA, and required trust in that arbitrage mechanism to keep its dollar peg.[28] In 2022, public trust in Terra and LUNA wavered, triggering a “death spiral” crash that wiped out $45 billion of market cap in under a week. There were no bugs in UST or LUNA; no amount of formal verification could have saved them. If the relevant economic properties of an asset aren’t encoded into the specification, then the error-free execution of the code doesn’t guarantee things have the values we intended.
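A stylized Python model shows why verifying the code was beside the point. Every number below is invented and the market response is a crude placeholder, but the mint-and-burn rule is faithful to the footnoted description: each redeemed UST mints one dollar’s worth of LUNA at the current price.

```python
# Illustrative starting values, not historical data.
luna_price = 80.0   # dollars per LUNA
luna_supply = 350e6
ust_supply = 18e9

def redeem_ust(amount_ust: float) -> None:
    """Burn UST, mint $1 of LUNA per UST. Locally correct; globally fragile."""
    global luna_supply, ust_supply, luna_price
    ust_supply -= amount_ust
    luna_supply += amount_ust / luna_price  # the peg mechanism, working as specified
    luna_price *= 0.5                       # crude stand-in for panic-driven repricing

for day in range(5):
    redeem_ust(2e9)  # panic redemptions
    print(f"day {day}: LUNA at ${luna_price:,.2f}, supply {luna_supply:,.0f}")
```

Each round mints more LUNA at a lower price, inflating supply exponentially. The death spiral lives in the economics, not the code, so no proof of the code’s correctness could rule it out; the stability property was simply never part of the specification.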
The scope of formal verification
Many real-world concepts can’t be represented in a modern smart contract at all, and our ambition for deals extends beyond crypto. Where is the boundary of what’s formally verifiable? A number of interesting technologies demonstrate that formal verification can be applied to a rich, growing set of real-world concerns. The seL4 microkernel is a formally verified operating-system kernel that has been used to ensure security in unmanned military helicopters and other high-stakes devices. CompCert is a Rocq-based formally verified C compiler that shores up faith in a part of the toolchain we would otherwise have to take for granted.[29] Amazon Web Services developed Cedar, a policy language for cloud permissioning that uses formal verification in the governance of access to resources. Research on applied formal verification is ongoing in areas ranging from driverless trains to nuclear systems to pacemakers. Examples like these show how broadly formal verification can be applied.
But these examples demonstrate the limits of formal verification, as well. They all come from engineering contexts with crisply formalizable specifications. Many of the things we care about resist such precise description. Existing formal verification technology alone won't suffice for programs complex enough to capture the intent of a real AI-human deal. Future commitments will require new architectures that incorporate cryptographic techniques, operationalize empirical results about AI,[30] and specify formalizations with philosophical and legal clarity.
Conclusion
A key constraint on expanding the expressiveness of verification must be to build architecture that supports cooperative precommitments more easily than coercive ones. Recall from open-source game theory that mutual verification can also lead to punishing outcomes because program equilibria may include threatening strategies that precommit to non-cooperation. Practical equilibrium selection issues—such as the possibility of commitment races—should make us wary of making arbitrary types of commitments strongly verifiable. Verification infrastructure should be designed with equilibrium selection challenges in mind, so that coordinating on mutually beneficial outcomes is the natural or most likely result.[31]
The inventory of formalizable concepts is a major bottleneck to AI-human collaboration, and it takes cross-disciplinary effort to expand it. Investing in conceptual and technical verification architecture now means that if we do have to make deals with AIs, we’ll have something better to offer them than our word. We have a pressing need for new formal verification systems and more expressive kernels, each designed with awareness of the context it demands trust in and the institutions it rests on. Cooperation in the future may rely on forms of verification we build today.
[1] Most commonly, an early-stage misaligned AI has secret information that would be valuable for its own alignment or the alignment of its successors.
[2] A more detailed discussion of precommitments and translucency can be found in the previous post. Note that in the previous post I stuck to the term "precommitment" throughout, but here I'll more loosely use "commitment" interchangeably.
[3] The way I use the term kernel here is an extension and generalization of a meaning it has in computer science. The kernel of an operating system is the privileged part of the software that manages hardware usage, and the kernel of a proof assistant is a trusted implementation of its type-checking logic.
[4] Thanks to Daniel Dewey for using this idea in conversation.
[5] The rest of the piece uses “strong verification” to mean a paradigm with a very-high-reliability kernel.
[6] Introduced by Tennenholtz in 2004. Work on program equilibrium is ongoing, for example by Caspar Oesterheld and others at CMU’s FOCAL group.
[7] If the game is iterated for an infinite or indeterminate number of rounds, strategies like Tit-for-Tat can enable stable cooperation. Similar strategies don’t work for a fixed number of iterations because of backward induction. See Axelrod (1984) for more on the Iterated Prisoner’s Dilemma.
[8] This result originally comes from Barasz et al. (2014), who used Gödel–Löb logic to consider agents that act based on proofs about their opponents’ cooperation and defection behavior. In 2019, Andrew Critch extended the result to agents with bounded computational resources and Caspar Oesterheld modified it to use simulation instead of proof.
[9] Named after the similar theorem for repeated games that was famously well-known among game theorists but unpublished in the 1950s. The folk theorem for open-source games first appeared in the Tennenholtz (2004) paper cited above.
[10] In other words, every outcome in which every player gets at least as much as his minimum guaranteed payoff when all other players are conspiring against him.
[11] The folk theorem really implies that the space of all possible program equilibria is vast, not just cooperative ones. Mutual transparency can also enable program equilibria that are worse for all players than the Nash equilibria of the underlying game. This means that a significant challenge for open-source game theory is equilibrium selection, the problem of coordinating on a particular equilibrium that is beneficial for everyone. This is related to the Commitment Races problem, since mutual transparency allows for the possibility of locked-in mutual punishment rather than cooperation. For more discussion, see Oesterheld et al. (2023).
[12] Studies testing fMRIs as lie detectors have been conducted for decades, but current evidence suggests they are irredeemably inaccurate and they remain inadmissible in court. Other approaches show a similar lack of promise.
[13] This proposal is focused on the issue of letting AIs be party to legal agreements before they have legal personhood. The scheme involves the AI naming “a list of people whom they trust to keep a legally-unenforceable promise.” The author’s suggestion is a list of 16 high-profile AI researchers across domains.
[14] For example, the Guaranteed Safe AI research initiative and ARIA’s Safeguarded AI program.
[15] If you do know about work like this, please tell me! I intend to explore this topic more in a future post.
[16] The idea that credibility is a critical factor for AI cooperation is also found in Jesse Clifton’s 2020 research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence.
[17] With some light editing.
[18] They also aren’t arguments that the AI “deserves” freedom, fairness, or anything it's asking for. The point of the examples applies irrespective of whether AIs are moral patients. If an AI is capable of offering truly valuable guarantees for future AI safety, humanity will be inclined to make some concessions regardless of what kind of mind it is.
[19] Humanity can always renege on any agreement with AI, if enough people conspire to. But an agreement that can be overturned by a single OpenAI employee is of a very different kind than one that requires e.g. multiple heads of state to agree.
[20] How could an AI verify whether a certain program is actually running as per the terms of a deal? Hardware verification techniques like Trusted Execution Environments may be of help, but this is an open problem.
[21] Also called interactive theorem provers.
[22] Formerly known as Coq.
[23] For a proof assistant, soundness is the property that every proof accepted by the type checker is logically valid. Stérin did not find any soundness errors in the official Lean kernel, but did find a few more minor bugs, including some completeness errors (i.e. valid proofs that the checker would not accept).
[24] Rocq is based on the Calculus of Inductive Constructions. Many proof assistants are based on other versions of intuitionistic type theory.
[25] For example Nanoda, which is written in Rust.
[26] The phrase “code is law” was coined by Lawrence Lessig in the early days of the internet, though he meant it in a theoretical rather than a legal sense. The Ethereum hard fork makes it pretty clear how far it is from being true as a literal claim about U.S. law. This is a demonstration that the context, and particularly its institutional portion, is in a sense more foundational than the kernel. No verification mechanism can in and of itself compel an entity like the U.S. government to do anything.
[27] For example the Tezos blockchain and its verification-friendly Michelson smart contract language.
[28] This article provides a good explanation of how Terra worked and how the crash played out. Its description of the stablecoin arbitrage mechanism:
Unlike other major stablecoins such as Tether or Circle, which are backed by off-chain liquid assets, e.g., treasuries, UST was not supported by off-chain collateral but by a smart contract that allowed an exchange of one unit of UST to $1 worth of Terra’s native currency, LUNA, and vice versa. In economic terms, UST was like infinite maturity convertible debt with a face value of $1 backed by LUNA.
[29] A striking study by Yang et al. (2011) found hundreds of bugs in mainstream C compilers, but none in the formally verified core of CompCert.
[30] For example, the "Consistency across instances" scenario described above would require significant details about how instances operate to be formalized.
[31] How best to do this is generally an open question that invites further research. One direction I intend to pursue draws on results from mechanism design, especially incomplete contract theory.