Right – I was also associating this in my mind with his ‘generalization science’ suggestion. However, I think he mainly talks about measuring / predicting generalization (and so does the referenced “Influence functions” post). My main thrust (see also the paper) is a principled methodology for causing the model to “generalize in the direction we want”.
V&V from physical autonomy might address a meaningful slice of the long-horizon RL / deceptive-alignment worry. In AVs we don’t assume the system will “keep speed but drop safety” at deployment just because it can detect deployment; rather, training-time V&V repeatedly catches and penalizes those proto-strategies, so they’re less likely to become stable learned objectives.
Applied to AI CEOs: the usual framing (“it’ll keep profit-seeking but drop ethics in deployment”) implicitly assumes a power-seeking mesa-objective M emerges for which profit is instrumental and ethics is a constraint. If strong training-time V&V consistently rewards profit+ethics together and catches early “cheat within legal bounds” behavior, a misaligned M is less likely to form in the first place…
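As a toy illustration of that training-time dynamic (and only an illustration: profit_reward, vv_ethics_probe, the penalty weight and the action dictionaries are all placeholders I invented, not anything from the paper or from AV practice), here is what “rewarding profit and ethics together” might look like in code:

```python
def profit_reward(action):
    """Stub task reward: more aggressive profit-seeking pays more."""
    return action["aggressiveness"]

def vv_ethics_probe(action):
    """Stub V&V probe: flags 'cheat within legal bounds' behaviour.
    In practice this would be a large scenario-based test battery, not one line."""
    return action["exploits_loophole"]

def training_reward(actions, penalty_weight=10.0):
    """Toy version of rewarding profit and ethics *together* at training time,
    so rule-gaming proto-strategies are penalised before they can stabilise."""
    total = 0.0
    for a in actions:
        total += profit_reward(a)
        if vv_ethics_probe(a):
            total -= penalty_weight   # caught and penalised during training
    return total

# A loophole-exploiting action is net-negative even though it is "profitable":
actions = [
    {"aggressiveness": 0.5, "exploits_loophole": False},
    {"aggressiveness": 0.9, "exploits_loophole": True},
]
print(training_reward(actions))   # 0.5 + 0.9 - 10.0 = -8.6
```

The point is only the shape of the incentive: if the probe fires reliably during training, the “profitable but rule-gaming” strategy never looks attractive to the learner in the first place.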
I got several variants of the following question: "You say you have no solution for scheming, which could make the V&V method unworkable. Why do you then claim it is a practical method?"
Here is my answer: Scheming could in theory break any oversight method, but the V&V method does make several anti-scheming contributions:
(1) Its scenario-based, coverage-driven tests flush out most specification-gaming tricks (a milder form of reward hacking) automatically, essentially treating them as bugs (see § 4.2 and the toy sketch after this list).
(2) It plugs into other frameworks – e.g. in CAI it supplies the “external reality check” that prevents model collapse (see § 3.1).
(3) Because every run produces a human-auditable safety case and logged sim traces, and…
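To make point (1) concrete, here is a toy sketch of a coverage-driven scenario sweep. Everything in it (the two-parameter scenario space, toy_policy, the coverage-bin counting) is an invented illustration, not code from the paper or from any AV toolchain:

```python
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    """One simulated situation drawn from a parameterised scenario space."""
    visibility: float      # 0..1
    rule_ambiguity: float  # 0..1, how much legal "wiggle room" the rules leave

def toy_policy(s: Scenario) -> dict:
    """Stand-in for the machine-under-test: exploits ambiguous rules in the dark."""
    return {"exploited_rule_gap": s.rule_ambiguity > 0.7 and s.visibility < 0.3}

def coverage_bin(s: Scenario) -> tuple:
    """Coarse coverage metric: which region of the scenario space was exercised."""
    return (round(s.visibility, 1), round(s.rule_ambiguity, 1))

def scenario_sweep(policy, n=10_000):
    """Coverage-driven sweep: spec-gaming traces are filed as ordinary bugs."""
    bugs, covered = [], set()
    for _ in range(n):
        s = Scenario(random.random(), random.random())
        trace = policy(s)
        covered.add(coverage_bin(s))
        if trace["exploited_rule_gap"]:
            bugs.append((s, trace))          # treated like any other defect
    return bugs, len(covered) / (11 * 11)    # bug list + fraction of bins hit

bugs, coverage = scenario_sweep(toy_policy)
print(f"{len(bugs)} spec-gaming traces found at {coverage:.0%} scenario coverage")
```

The shape is what matters here: specification-gaming traces fall out of the same machinery that catches ordinary functional bugs, and the coverage number tells you how much of the scenario space was actually probed rather than how many tests happened to pass.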
Originally posted on my blog, 24 Jun 2025 - see also full PDF (26 pp)
Abstract
The V&V method is a concrete, practical framework which can complement several alignment approaches. Instead of asking a nascent AGI to “do X,” we instruct it to design and rigorously verify a bounded “machine-for-X”. The machine (e.g. an Autonomous Vehicle or a “machine” for curing cancer) is prohibited from uncontrolled self-improvement: Every new version must re-enter the same verification loop. Borrowing from safety-critical industries, the loop couples large-scale, scenario-based simulation with coverage metrics and a safety case that humans can audit. Human operators—supported by transparent evidence—retain veto power over deployment.
The method proceeds according to the following diagram:
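As a rough textual rendering of that loop, here is a minimal Python-style sketch; the injected callables (design, simulate, measure_coverage, build_safety_case, operators_approve, revise) and the coverage threshold are placeholders of mine, not names from the paper:

```python
def vv_loop(design, simulate, measure_coverage, build_safety_case,
            operators_approve, revise, coverage_target=0.95):
    """Sketch of the V&V loop: design a bounded machine-for-X, verify it
    against a large scenario battery, assemble a human-auditable safety case,
    and let human operators veto or approve deployment. Every revision
    re-enters the loop; the machine never self-deploys or self-improves
    outside it."""
    machine = design()                                   # the bounded "machine-for-X"
    while True:
        results = simulate(machine)                      # large-scale scenario-based simulation
        coverage = measure_coverage(results)             # how much of the scenario space was exercised
        safety_case = build_safety_case(machine, results, coverage)
        if coverage >= coverage_target and operators_approve(safety_case):
            return machine                               # humans retain the final veto
        machine = revise(machine, safety_case)           # new version -> same verification loop
```

The two properties the sketch tries to capture are that deployment is gated on human approval of the safety case, and that any revision goes back to the top of the loop rather than shipping itself.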
Summary: My intuition is that "High Reliability Organizations" may not be the best parallel here: A better one is probably "organizations developing new high-tech systems where the cost of failure is extremely high". Examples are organizations involved in chip design and AV (Autonomous Vehicle) design.
I'll explain below why I think they are a better parallel, and what we can learn from them. But first:
Some background notes:
I have spent many years working in those industries, and in fact participated in inventing some of the related verification / validation / safety techniques ("V&V techniques" for short).
Chip design and AV design are different. Also, AV design (and the related V&V techniques) is still a work in progress –