63 Speedrunning 4 mistakes you make when your alignment strategy is based on formal proof

16th Feb 2023

3 min read

63

Lots of people think formal (i.e. computer verified) proofs are a part of viable alignment strategies. This trick never works. If you think it might work, I can save your time in four different ways.

I previously wrote on this, but people are still making these mistakes and this version is clearer and shorter. Recall the deep wisdom that nobody wants to read your shit.

Epistemic status: four years deep into proof assistants. Once a customer paid me a lot of money to let formal verification rip. I tried and failed to solve alignment 4 or 5 times since 2021, each attempt lasting at least a couple months.

Formal verification is a software strategy based on knowing a specification (spec) a priori, then showing that your implementation is perfect (correct). The evergreen comment is: why not just capture important alignment properties in these “specs” and endeavor to implement the system “correctly”. It made me optimistic, and if you’re not careful, it’ll make you optimistic.

Mistake 1: The gap between the world and the spec hurts you more than the gap between the spec and the implementation

Formal verification lures you into a false sense of security. Your spec is the map, not the territory. It is not valuable to implement software to the map correctly. Reminiscent of bog standard hardware security wisdom.

Mistake 2: Proofs are Bayesian evidence

A proof you don’t understand does not obligate you to believe anything; it is Bayesian evidence like anything else. If an alien sends a 1GB Coq file Riemann.v, running it on your computer does not obligate you to believe that the Riemann hypothesis is true. If you're ever in that situation, do not let anyone tell you that Coq is so awesome that you don't roll to disbelieve. 1GB of plaintext is too much, you'll get exhausted before you understand anything. Do not ask the LLM to summarize the proof.

Mistake 3: The proof is not the product

Proofs, users; pick one. A shop that cares about proofs attracts staff that cares about proofs and turns off staff that cares about users. The only extinction-level threat models I entertain are those that are deployed by shops that make something people want. If you want an intuition pump go and look at the price of Ethereum (move fast break things) vs the price of Cardano (formal verification)^[1]. Go ahead and write a proof that you're right about alignment, it literally won't affect the real world.

Corollary: Proof obligations are slow and expensive

You solved the social problem of hiring both folks excited about proofs and folks excited about users, and they get along. Your proof team is too slow for your product team, everyone starts resenting each other, you face investor pressure to fire the proof engineers. Maybe the FDA-But-For-Software makes it illegal to fire the proof engineers. That buys you real assurance in some threat models.

But in others, humanity has to be lucky every time. Paperclips only have to be lucky once. It doesn’t matter if the FDA-But-For-Software would have punished you if literally everyone is dead.

Mistake 4: What you care about and what the spec allows you to express are not friends

You can’t actually specify what you care about (in a competitive specification language that has a viable tooling ecosystem attached to it).

I spent almost a week thinking about circuits and polytopes. I thought that if we could verify across trustless channels that semantic units had been learned, we'd be setting a precedent for verifying things that matter. For example, things like non-deception (in the Hubinger sense). It's bad enough to pin down alignment properties to the standards of mathematicians, doing it to the standard of computers is worse. I think if I tried really hard we could have a specification language expressive enough to keep up with what mechinterp was doing in 2020, I'm not betting on myself making a specification language expressive enough to catch up to mechinterp if mechinterp makes progress on what actually matters in time.

Conclusion

Formal verification for alignment is the trick that never works. But if you’re sure you’re not making these mistakes, then go for it. We’re counting on you.

To be fair, this example is most likely washed out by first mover advantage. I have an anecdotal sense from my profession that the Ethereum ecosystem is more enthusiastic and less dysfunctional, yes in spite of the attack surface and scandals, than the Cardano ecosystem. This leads me to include it as an example in spite of first mover advantage. ↩︎

Formal ProofAI

Frontpage

63

Mentioned in

15January-February 2025 Progress in Guaranteed Safe AI

Speedrunning 4 mistakes you make when your alignment strategy is based on formal proof

3the gears to ascension

3the gears to ascension

3Quinn

2the gears to ascension

3Quinn

2the gears to ascension

1fvncc

New Comment

18 comments, sorted by

top scoring

Click to highlight new comments since: Today at 12:39 AM

[-]the gears to ascension3y30

Noted, and I still think that formal proofs on the properties of discovering-agents output graphs ought to be worth trying to create. It seems to me that even failing to create them would produce useful results.

Warnings noted, though.

[-][anonymous]3y20

Epistemic status: I work on a software stack that is used in autonomous cars

Quinn, if you try to create a reasonable near term alignment strategy, you might end up with a policy such as:

1. A software framework that selects, from the output of multiple models, an action to take the next frame that maxes some heuristic of expected consequence and probability of success

2. A hierarchy of implementations, where some simpler and more robust implementations are allowed to take control of the output in some circumstances

3. Empirical estimates of the p(success). That is, with input states similar to the current state, how often was model n correct? Note you can't trust model n to estimate it's own correctness.

4. Tons of boilerplate code to bolt the above together - memory handling interfaces, threadpools, device drivers, and so on.

So while you may not be able to formally prove that the above restrictions will limit the model in all circumstances, you could prove more pendantic properties. Does module X leak memory. Does it run in deterministic time. Does it handle all possible binary states of an input variable. Does an optimized implementation produce the same truth table of outputs as the reference implementation, for all inputs.

These are what formal analysis is for, and an actual software framework that might house neural network models - the code outside the model - is made of thousands of separable functions that can be proven from the above.

Did you find this to be the case? If not, why not?

[-]Quinn3y10

I like your strategy sketch.

There's the guard literature in RL theory which might be nice for step one.

I agree with step two, favoring non-learned components over learned components whenever possible. I deeply appreciate thinking of learned components as an attack surface; shrink it when you can.

Really excited about step three. Estimation and evaluation and forecasting is deeply underrated. It's weird to say because relative to the rest of civilization the movement overrates forecasting, and maybe the rest of the world will be vindicated and forecasting will go the way of esperanto, but I don't think we should think in relative terms. Anyway, I feel like if you had the similarity metric and control that you want for this, it'd be easier to just say "don't do anything drastic" in a blanket way (of course competitive pressures to be smarter and bolder than anyone else would ruin this anyway).

Step four is more oof an implementation reality than a part of the high level strategy, but yeah.

Thanks a ton for the comment, it's exactly the type of thing I wanted my post to generate.

[-]Quinn3y10

Agree. The product stack has non-learned components in it. Let's throw the kitchen sink at those components.

Some techniques that go under the banner of formal verification are not applied type theory (deduction), usually in the model checking family, or maybe with abstract interpretation. For the record I only have my applied type theory experience, I'm a coq dev, I've spent a cumulative like two hours over the past 3 years pondering abstract interpretation or model checking. That's why I said "formal proof" in the title instead of "formal verification". I could've called it "proof engineering" or "applied type theory" maybe, in the title, but I didn't.

In order for there to be alpha in applied type theory for the learned components, a great deal of things need to fall into place mostly in mechinterp, then some moderate applied type theory work, then a whole lot of developer experience / developer ecosystems work. That's what I believe.

[-][anonymous]3y*20

Ok. So you're working from the perspective of "can we make a model interpretable and to formally prove properties".

I am working from the perspective of "obviously you can't make a competitive nn model with billions of weights provable, but if you optimize it certain ways you could make an overall machine with a policy trapped in a local minima."

Then, with knowledge of what sort of inputs produce stable and reliable outputs, measured empirically in simulation, let's auto shut down the machine when it leaves the "safe" states.

And let's stack probabilities in our favor. If we know blackbox model A fails about 0.01 percent of the time on inputs from our test distribution, let's train model B such that it often succeeds where A fails. (empirically verifiable)

The overall framework - with an A and a B model, and a whole ecosystem so the overall machine doesn't segfault in a driver and kill people because nothing is in control of the hardware - is where you need formally proven code.

This kind of "down and dirty " engineering based on measurements is basically how we do it in most production software roles. It's not useful to rely on theory alone, for the reasons you mention.

[-]Gunnar_Zarncke3y20

Mechinterp refers to Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases, right?

[-]Quinn3y10

Sorry for abbreviating. Mechinterp in my post refers to the entire field of mechanistic interpretability (the statistically learned program's analogue to reverse engineering a .c file from an executable produced by a c compiler) not to a specific post.

[-]Dacyn3y10

A proof you don’t understand does not obligate you to believe anything; it is Bayesian evidence like anything else. If an alien sends a 1GB Coq file Riemann.v, running it on your computer does not obligate you to believe that the Riemann hypothesis is true. If you’re ever in that situation, do not let anyone tell you that Coq is so awesome that you don’t roll to disbelieve. 1GB of plaintext is too much, you’ll get exhausted before you understand anything. Do not ask the LLM to summarize the proof.

I'm not sure what you are trying to say here. Even with 1GB I imagine the odds of a transistor failure during the computation would still be astronomically low (thought I'm not sure how to search for good data on this). What other kinds of failure modes are you imagining? The alien file actually contains a virus to corrupt your hardware and/or operating system? The file is a proof not of RH but of some other statement? (The latter should be checked, of course.)

[-]Quinn3y10

I think a coq bug or the type of mistakes that are easy to make in formalization are more likely than a transistor failure or a bad bit flip bubbling up through the stack, but still not what I'm talking about.

In the background, I'm thinking a little about this terry tao post, about how the process of grokking the proof (and being graceful with typos, knowing which typos are bad and which are harmless, and so on) is where the state of the art mathematics lies, not in the proof itself.

I was discussing your comment with a friend, who suggested what I'm calling factored cognition parallel auditing (FCPA): she asked why don't we just divide the 1e6 lines of coq into 1000-line segments and send it out to 1000 experts each of whom make sense of their segment? The realities would be a little messier than linearly stitching together segments, but this basic setup (or something that emphasizes recursion, like a binary search flavored version) would I think buy us roughly 2-3x the assurance that RH true over just blindly trusting the type theory / laptop to execute Riemann.v in the coq session.

In the current comment, let's assume a background of russell-like "enfeeblement" or christiano-like "losing control of the world/future in a slow burn", filtering out the more yudkowsky-like threatmodels. Not to completely discount malicious injections in Riemann.v, but they don't seem productive to emphasize. I will invoke the vague notions of "theorem intricacy" and "proof length", but it should be possible to not read this footnote^[1] and still follow along with the goals of the current comment.

This isn't really about RH, RH isn't necessarily important, etc. I was actually alluding to / vaguely criticizing a few different ideas at once in the paragraph you quoted.

The proof assistants or applied type theory technologies will help agent foundations researchers progress and/or teach. I'm bearish on this largely because I think agent foundations researchers care more about the aforementioned Tao blogpost than about the Kevin Buzzard side of the aisle (recall that Kevin Buzzard is the guy leading the charge to put all math department operations on Lean, from undergrad coursework to the state of the art). I don't think FCPA would help with Tao's notion of mathematical sophistication and understanding, barring pretty drastic advances in HCI/BCI. If theorem intricacy and proof length escapes us in agent foundations, formalization doesn't solve many of the problems that would present.
Alignment itself turns out to be deeply inelegant. In these worlds, we may be using a flavor of codegen to even write down the theorem/spec, not to mention the proofs. Let's assume type theory is secure (no leak or unbox related attack surface) and trustworthy (no false positives or false negatives). Then, I don't know how to beat FCPA at arbitrary theorem intricacy and proof size. Maybe a massive FCPA regime could "understand alignment" in an "institutions as distributed cognition" sense. I find this unsatisfying, but it may be the best we can do.
The LLM does alignment research for us, and communicates it to us via type theory (there are flavors of this approach that do not invoke type theory, and I think those are even worse). Even when we assume type theory is secure and trustworthy, we have the same problems I highlighted in the preceding list entries. But SGD and the programs it writes are untrusted, so the standard of grokking the old fashion way should be really high. Trusting an auditing process like FCPA to be adequate here feels really dicey to me.

I think we should be sensitive to "which parts of the proof are outsourced to SGD or the programs it writes?" as a potential area of research acceleration. I should be willing to be convinced that parts of the proof are harmless to automate or not pay close attention to, deferring to the type theory or some proof automation regime that may even involve SGD in some way.

If we're relying on computational processes we don't understand, we still get enfeeblement and we still risk losing control of the world/future over time even if type theory is a bajillion times more secure and trustworthy than SGD and the programs it writes.

I hope that makes my worldview more crisp.

There's not a widely agreed upon metric or proxy for theorem size, I'm intentionally using a wishy-washy word like "intricacy" to keep the reader from taking any given metric or proxy too seriously. I think the minimum viable quantitative view is basically AST size, and it's ok to be unsatisfied about this. Proof length is sort of similar. A proof object is kind of like a lambda term, it's size can be thought of as tree size. Both of these are not "basis independent", each of these measurements will be sensitive to variation in syntactic choices, but that probably doesn't matter. There's also a kind of kolmogorov complexity point of view, which I think usually prefers to emphasize character count of programs over anything to do with the tree perspective. A particular kind of constructivist would say that theorem intricacy of is length of the shortest proof of $P$ , but I think there are benefits to separating statement from proof, but that would take me too far afield. A turing machine story would emphasize different things but not be qualitatively different. ↩︎

[-]fvncc3y10

Not an expert, but what if we took all our current proposed solutions to alignment, and the assumptions they implicitly make, and try and formalise them using some type of proof assistant? Could that at least be a useful pedagogical tool for understanding the current research landscape?

[-]Quinn3y32

Executable textbooks are fun but the odds that playing one cultivates the parts of alignment research that are hard or important seem slim to me.

Crucially to your comment though, what work is "implicitly" doing? Implicitly is where most preparadigmatic work happens! This is loosely related to the false sense of security. Formalization starts when you admit your assumptions as a premise. Formal methods do not second guess assumptions.

If I recall correctly the Cardano team's paper on agile formal methods may have had some comments on this, but I'm not 100% sure.

[-]fvncc3y10

I agree that mismatch between “assumptions” and “real world“ make getting formal certificates of real world alignment largely intractable.

E.g. if you make a broad assumption like “pure self supervised learning does not exhibit strategic behaviour” (suitably formalised), that is almost certainly not justifiable in the real world, but it would be a good starting point to reason about other alignment schemes.

My point is, the list of assumptions you have to make for each alignment approach could be an interesting metric to track. You end up with a table where alignment approaches are rows and the set of necessary assumptions are the columns. Alignment approaches are then ranked based on how grounded the necessary subset of assumptions are (in aggregate), and progress is made by incrementally improving the proofs in ways that replace broad assumptions with more grounded ones.

Will read the link

[-]the gears to ascension3y30

sounds pretty damn hard, but maybe research workers at google who worked on autoformalization (similar papers) might be able to help this happen at significant scale. if you, like me, find videos far easier to get through than text, here's a talk on the topic that was given recently:

[-]Quinn3y32

Intuitively, asking the LLM to translate from english to coq is a cousin of "ask the LLM to summarize the coq into english", via security mindset. What assurances can we get about the translation process? This seems dicey and hard. I'm imagining that there's an auditing game design out there that would give us some hope though.

[-]the gears to ascension3y20

yeah it only helps if you read the generated formal code. it's almost useless otherwise.

[-]Quinn3y30

most codegen techniques assume a margin of error until proven otherwise, that includes extraction from coq to ocaml or haskell and lots of program synthesis techniques I've heard about. Extraction (like that from coq to ocaml or haskell) is probably the furthest away from SGD I think you can get on "the codegen spectrum".

[-]the gears to ascension3y20

fair point.

I'm mostly interested in formal checking of statements not generated by strictly formal reasoning, though. the benefit of automatic formalization is to be sure a machine checked the interference pattern of implications of a claim I already knew I wanted to make. ie, the stronger version of fuzz testing.

[-]fvncc3y10

Thanks for the link!

But actually what I had in mind is something simpler that would not necessarily need such tools to be feasible. Basically akin to taking the main argument of each approach, as expressed in natural language, without worrying too much about all the baggage at finer levels of abstraction. But I guess this is not quite what the article is about ..

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

63

Speedrunning 4 mistakes you make when your alignment strategy is based on formal proof

63

Mistake 1: The gap between the world and the spec hurts you more than the gap between the spec and the implementation

Mistake 2: Proofs are Bayesian evidence

Mistake 3: The proof is not the product

Corollary: Proof obligations are slow and expensive

Mistake 4: What you care about and what the spec allows you to express are not friends

Conclusion

63

If we're relying on computational processes we don't understand, we still get enfeeblement and we still risk losing control of the world/future over time even if type theory is a bajillion times more secure and trustworthy than SGD and the programs it writes.