In beliefs about formal methods and AI safety, we established that formal methods is a source of swiss cheese and is useful in boxing/interfaces. A core premise of the AI control literature is that the blue team is computationally poorer than the red team[1]. There wouldn’t be much of an interesting problem if you could spend as much on audits (whether that’s $/mtoken at inference time, raw token count at inference time, or any other notion of compute) as the untrusted model spends.
As a proof engineer and FMxAI/GSAI movement strategist, it’s important to me that formal methodsicians present their work in a way that makes it most useful to the various AI security schemes it might plug into as crunch time continues to crunch on. If you’re shipping research or products in FMxAI, you can do me a huge favor pretty cheaply that I think will help your work take flight in the high stakes scenario of advanced AI deployment.
Unfortunately for me, and possibly humanity, proving a program correct is more expensive than writing the program in the first place. So if proofs are in your blue team’s arsenal, and the compliance burden falls on the blue team[2], then you lose the premise that the blue team is computationally poorer than the red team. I would like for AI security people to have available to them principled and reliable data about the degree of the budget asymmetry. I.e., if you're using AI to prove programs correct, please publish verification burden.
Verification burden is a ratio measuring the cost of proving a program correct relative to the cost of implementing that program up to least-effort QA. So in a Hoare logic setting, under known precondition-postcondition pairs, if it costs 100 tokens for a language model agent loop to implement a program that completes the Hoare triple and it costs 1000 tokens to prove the Hoare triple correct, then we can report a verification burden of 10. Note that you could instantiate cost as tokens, dollars spent on inference (which captures tokens but is sensitive to having multiple model sizes/classes in the system), number of toolcalls or conversation length, a more vibey unit of intelligence or cognitive effort, FLOPs or clocktime, walltime relative to a fixed CPU, LoC, or something else. I’ll use the specific case of tokens as a proxy for cognition for most of what I’m interested in, but the metric should make sense at an arbitrary/abstract notion of “cost” or “compute”.
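The metric itself is just a ratio; a minimal sketch (my own illustrative function name and numbers, assuming you already log cost per phase in whatever unit you picked):

```python
def verification_burden(proof_cost: float, implementation_cost: float) -> float:
    """Ratio of the cost to prove the program correct to the cost to implement it
    up to least-effort QA, in any cost unit (tokens, dollars, toolcalls, walltime, ...)."""
    if implementation_cost <= 0:
        raise ValueError("implementation cost must be positive")
    return proof_cost / implementation_cost

# The Hoare triple example above: 1000 proof tokens vs 100 implementation tokens.
assert verification_burden(proof_cost=1000, implementation_cost=100) == 10.0
```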
Project managers at proof shops like Galois or SRI have been keenly interested in verification burden in humans for decades. Unprincipled anecdotes and massive confidence intervals on napkins are the law of the land[3]. In humans, a project manager might guess “if it took the client 2 SWE-months to write this program, and we assume spec elicitation is sufficiently straightforward, we should be able to prove it correct in 2-3 postdoc-years”, and then be wrong. An organization’s hard-won intuition is the only way to assess a verification burden estimate; there is nothing scientific about this.
Token measurements (such as the token_counter function provided by the LiteLLM API aggregator) are a great opportunity to measure verification burden in a principled way for the first time. Tokens are much less noisy than the abstract “postdoc-year”, which treats the variance across the skill and grit levels of different postdocs as a nuisance: we can now fix a language model and study its capacity to write a proof and how much effort (i.e., how many loops through the toolcall-inference cycle) it takes to do it.
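As a sketch of what this looks like in practice (the model string and transcripts below are placeholders, and this counts only the final transcripts rather than every re-sent prefix, so treat it as a rough proxy): keep the implementation-phase and proof-phase message lists separate, then count tokens for each with LiteLLM.

```python
# Sketch: per-phase token counts via LiteLLM's token_counter.
# Assumes your agent loop records the implementation and proof transcripts separately.
import litellm

MODEL = "gpt-4o"  # fix one model so burdens are comparable across tasks

def phase_tokens(messages: list[dict]) -> int:
    # Counts tokens for a message list under the given model's tokenizer.
    return litellm.token_counter(model=MODEL, messages=messages)

impl_transcript = [{"role": "user", "content": "Implement the function..."}]    # placeholder
proof_transcript = [{"role": "user", "content": "Prove the Hoare triple..."}]   # placeholder

burden = phase_tokens(proof_transcript) / phase_tokens(impl_transcript)
print(f"verification burden (token proxy): {burden:.2f}")
```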
I expect[4], based on vibey/anecdotal intuition as a proof engineer and Lean agent/synthetic data expert, given that I’ve only measured this for real on problems that are too toy to count:
“Just verify it” is not a free action. When you have the spec, program, and proof, running the proof checker should[5] be O(laptop), but the question is the cognitive investment in shipping the proof.
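To make the O(laptop) half concrete, here is a sketch of the cheap part, timing a finished Lean 4 project build from Python (the project path is a placeholder; this assumes `lake` is on your PATH and the proof already exists, which is exactly the expensive part the sketch does not cover):

```python
# Sketch: once spec, program, and proof exist, checking them is a cheap laptop job.
import subprocess
import time

PROJECT_DIR = "path/to/verified-project"  # placeholder

start = time.monotonic()
result = subprocess.run(["lake", "build"], cwd=PROJECT_DIR, capture_output=True, text=True)
elapsed = time.monotonic() - start

status = "succeeded" if result.returncode == 0 else "failed"
print(f"proof check {status} in {elapsed:.1f}s")
```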
If verification burden is low enough, an AI security plan that leans on FM could be the cheapest reliable way to buy safety. If verification burden is high enough, then FM becomes a luxury good we can only afford on the narrowest, most safety-critical surfaces (boxing boundaries, tool interfaces, small kernels). Without principled burden measurements, scheme designers are forced into one of two failure modes:
So: verification burden data is not just nice engineering metadata. It determines which AI security schemes are even coherent under realistic asymmetry assumptions.
This is mostly to be explored in a later post[6]. But briefly, this is the question of whether the prover tokens are spent once (like proving a module of an operating system) or spent on an ongoing basis (like repairing proofs as that module gets updated). Under a fixed enough problem domain, it will always be easy to drive verification burden down through one-time investment into boutique custom tactics that don’t generalize out of that problem domain. Substantial upfront cost for major runtime savings. In a kinda Big O sense, this one-time investment plays the role of a constant factor, so it may or may not be strategically important for the application you care about. To be clear, right now I think what’s important about verification burden is mostly this Big O view.
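One way to make the constant-factor point precise (my notation, not from any cited work): write $C_{\text{tactics}}$ for the one-time tactic investment, $C_{\text{repair}}$ for the per-update proof repair cost, and $C_{\text{impl}}$ for the per-update implementation cost. Over $n$ updates,

$$
\text{amortized burden}(n) \;=\; \frac{C_{\text{tactics}} + n \cdot C_{\text{repair}}}{n \cdot C_{\text{impl}}} \;\xrightarrow{\;n \to \infty\;}\; \frac{C_{\text{repair}}}{C_{\text{impl}}},
$$

so the upfront cost washes out asymptotically, but can easily dominate in the small-$n$ regimes many deployments actually live in.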
So if you’re shipping program/proof synthesis results, please report verification burden. It's a cheap add-on to your metrics. From the perspective of AI security under finite budgets, the key question is not “can we verify it in principle?” but “can we afford to verify it faster than the red team can break it?”. Note that you may adjust the basic vibe of verification burden to your use case. Let’s create a science of verification burden!
If you’re shipping FMxAI research or a verification-heavy product, here’s the cheap favor: publish a verification burden table alongside your results.
At minimum, please report:
A workable experimental recipe:
If you want a single scalar headline, use the median burden and also show IQR / min-max.
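Here is a minimal sketch of turning per-task burden ratios into that headline (the input list is illustrative; `burdens` holds one proof-cost / implementation-cost ratio per benchmark task):

```python
# Sketch: summarize per-task verification burdens as median plus IQR and min-max.
import statistics

def summarize_burden(burdens: list[float]) -> dict[str, float]:
    q1, q2, q3 = statistics.quantiles(burdens, n=4)  # quartile cut points
    return {
        "median": q2,
        "iqr_low": q1,
        "iqr_high": q3,
        "min": min(burdens),
        "max": max(burdens),
    }

print(summarize_burden([3.2, 7.5, 10.0, 4.1, 22.8]))  # illustrative numbers
```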
This task is, given specs, unit tests, a function signature, and a natural language description of the function, to implement the function and the proof. If I recall correctly (epistemic status: an author), when we did the baselines we had the implementation of the function under test and the proof in different loops. It would’ve been trivial for us to write down verification burden, at least in terms of conversation length (I checked our logs for ways to reverse engineer some proxy, but didn’t see any, because of things we failed to log).
They chain their codegen task to their proofgen task, so they’d have the isolation of concerns required to separately measure the tokens spent on codegen and those spent on proofgen. It would've been easy for them to add verification burden! I’m surprised I can’t reverse engineer some proxy from the data they reported, though I might be able to if I emailed the authors for full access to their logs.
CLEVER also infers specs at benchmark creation time (like FVAPPS), but in a much more principled way: they target a reference spec and require an isomorphism proof. Verification burden is almost already there. Their pipeline already isolates the parts you need:
Because the evaluation algorithm is staged and each stage is retried independently with a shared timeout budget, logging tokens per stage would give burden for free. They don’t log tokens, but they do give coarse difficulty indicators:
Again, if I emailed the authors for complete logs, I could probably reconstruct a pretty satisfying verification burden.
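To make the “burden for free” point concrete, here is a rough sketch of per-stage token accounting in a staged, independently retried pipeline (the stage names, `run_stage` helper, and attempt-count retry policy are hypothetical stand-ins, not CLEVER’s actual code or its shared-timeout mechanics):

```python
# Sketch: accumulate tokens per stage across retries, then report burden per task.
from collections import defaultdict

STAGES = ["spec_isomorphism_proof", "implementation", "correctness_proof"]  # illustrative

def run_benchmark_task(task, run_stage, max_attempts=5):
    tokens = defaultdict(int)
    for stage in STAGES:
        for _ in range(max_attempts):
            result = run_stage(task, stage)      # hypothetical: returns success flag + tokens used
            tokens[stage] += result.tokens_used  # count every attempt, not just the last
            if result.success:
                break
        else:
            return None, dict(tokens)            # stage exhausted its retries; no burden to report
    proof_tokens = tokens["spec_isomorphism_proof"] + tokens["correctness_proof"]
    return proof_tokens / tokens["implementation"], dict(tokens)
```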
HACMS has not made it easy to estimate the human verification burden of its projects. seL4, on the other hand, very much has! It’s widely cited that the proofs in seL4 took 20 person-years, and in the paper you can see that the C microkernel implementation was about 4 person-years, so you get a verification burden of 20/4 = 5. Space and aeronautics would also be a source of data or intuition.
Since the first draft of this post, I’ve updated toward thinking this is actually nonessential; I’ll address my fully updated view in the next post. ↩︎
This is a big if. You could easily imagine the compliance burden falling on the red team, because it's the red team who wants their code executed in the world. An intuition pump for that world: when you provide software to federally insured banks, you have to pay for the SOC 2 auditors as a cost of doing business. ↩︎
But it's unclear how architectural innovations change the game. It's kinda primitive to be approaching proofs as a text-to-text problem, and we won’t be stuck with it for much longer if people keep paying attention to FMxAI; for example, if Higher Order Company’s thesis pays off for them. Watch this space. ↩︎
There, we will work with Manheim and Homewood’s distinction, roughly that the difference between control and oversight is the difference between upfront and runtime. ↩︎