This seems right for all the approaches except Anthropic's. That is, if we expect that alignment is achieved via virtue ethics, coverage-driven iteration seems largely superfluous; we don't expect that CEOs with more experience are more ethical. The place it becomes critical is for those not betting on character as an alignment method, or for Anthropic guarding against alignment deterioration during RL, since that work may partially be happening after the character and alignment training.
Thanks David. I'd put it slightly differently: CDV isn't trying to make the model more ethical through iteration (you're right that experience doesn't make a CEO more ethical). It is trying to find out where the character training actually held and where it silently didn't. Even if you're fully betting on character, you still need to know whether that character generalizes to (say) the low-oversight / conflicting-incentive / multi-agent region, and the only efficient way I know to find that out is via systematic, CDV-style sampling.
I'd actually argue virtue ethics is the case where this matters most, not least. A character bet leaves the spec maximally implicit: "be of good character" says nothing explicit about price collusion or a mid-campaign rule change. Those are what I call spec bugs in the post - regions nobody thought to enumerate. So the more you lean on character rather than an enumerated behavioral spec, the more unmapped territory you have, and the more you need coverage discovery to surface it.
Finally, you are absolutely right that “guarding against alignment deterioration during RL” is another important consideration. In fact, while writing the post I challenged myself to come up with techniques which have a chance of countering @evhub's long-horizon RL fears (e.g. the AI CEO).
Cross-posted from The Foretellix CTO Blog. This is a full-text linkpost, following feedback that my previous piece, posted as a stub, was too brief.
Summary: This post suggests that alignment training could benefit from coverage-driven verification. Anthropic recently reported that teaching Claude alignment rules (via pretraining-style next-token learning on alignment-related stories) is more effective than relying primarily on RL-style behavioral shaping. Some autonomous-vehicle (AV) developers reached a related conclusion, but in addition tend to use a systematic, coverage-driven methodology for training and verification. I claim that alignment researchers should consider borrowing ideas from that methodology, and I give specific proposals (for instance, on how to use and refine an explicit coverage map).
Background
Anthropic’s discovery: Anthropic recently published Teaching Claude Why (expanded version). They found that training Claude on behavioral demonstrations barely helped. Training on constitutional documents + fictional stories via plain next-token prediction (which they call SDF) cut misalignment by 3x+, and the gains (evaluated in several ways) persisted through RL. The biggest lever was not showing the right behavior, but rather teaching the right reasoning and principles behind the behavior.
SDF moves more of the normative burden into pretraining-style learning, reducing reliance on RL-based alignment shaping.
The fact that Teaching Claude Why (TCW) made Claude perform much better on alignment evals, and that the improvements persisted through (moderate) RL training, seems like good news in the somewhat-bleak alignment landscape. So I started thinking about ways to further improve it, ideally making the resulting alignment persist through long-horizon RL (see below).
Similar findings from AV-land: NVIDIA’s Alpamayo AR1 independently found a similar starting point for Autonomous Vehicles (AVs): Imitation learning is not good enough for safety-critical long-tail cases. Their solution: structured causal reasoning (“Chain of Causation”). Other “physical AI” companies are moving in a similar direction.
What alignment might borrow from AV: There are differences between the two stories (for instance, unlike Anthropic, AR1 does use RL directly to teach better reasoning), but there are important similarities between the two areas. Both must handle safety-critical long-tail failures: An AV company could fail if it does not do very good Verification and Validation (V&V) on the various edge cases – some already have (e.g. Uber ATG). This is why they are moving to coverage-driven verification and training (called CDV below).
The stakes are even higher for getting alignment right (alignment also has a security-like nature – more on this below).
Note that TCW is already pretty systematic, and somewhat CDV-like – more on this below. The last chapter will list (my understanding of) what’s still missing and may be worth trying.
Using coverage to project modularity on a non-modular system: In at least one sense, AV training and verification practitioners are ahead: They have landed on CDV as a systematic, self-correcting methodology. Note that AVs (and physical-AI in general) are increasingly trained end-to-end, and thus no longer have clear inter-module protocols to verify against (though Chain-of-Causation-like schemes help a bit).
So CDV is used to project a systematic set of coverage dimensions onto the System Under Test (SUT): How well does it handle various combinations of weather conditions, road types, other-actor behaviors and so on? This matters because you need some sort of a “map” (which you keep refining as you go), so you can talk about “areas” (to test, to fix, to avoid in deployment and so on).
The hard case: Long-horizon RL (e.g. the AI CEO): Agents trained via long-horizon RL are a challenging test-case for alignment techniques. Evan Hubinger (co-author of TCW) previously argued in Alignment Remains a Hard, Unsolved Problem that long-horizon RL will tend to create genuinely misaligned agents. His “AI CEO” example shows how being a good business person inherently requires behaviors (withholding information, managing perceptions, strategic timing) that can get pretty close to misaligned behavior.
I assume other factors (like deploy-time learning) can make alignment even harder. And the ongoing capabilities acceleration (e.g. using AI to build better AI) adds urgency.
Thus, I’ll use the AI CEO (and similar future long-horizon AI systems) as my benchmark for alignment techniques. It is much harder than the moderate-RL, few-turns chat alignment problem described in TCW, precisely because such long-horizon RL will repeatedly bring the model into situations where strategic optimization and misalignment start to overlap.
The next chapter will briefly sketch how CDV works, and how this relates to alignment. The last chapter will dive deeper into applicable CDV techniques (and possible problems).
Coverage-driven alignment – the basic idea
How CDV works: For readers unfamiliar with coverage-driven V&V, Chapter 1 of my V&V method paper gives a compact overview of the key techniques (coverage dimensions discovery, checking, scenario generation / matching, iterative gap analysis etc.), originally developed for complex systems like electronic systems and AVs, but applicable more broadly.
It further explains how the whole process of development of AI-based systems is converging towards something very similar to the V&V process: Find (or create) training examples to fix the current problem, use coverage to make sure they represent the “relevant dimensions”, train, validate, repeat. See this post for further details.
The V&V method paper itself goes further: it proposes that a future AGI should build-and-verify a "machine for X" rather than doing X directly, using V&V as a core architectural principle. That's a more ambitious proposal, and not what this post is about.
The current post asks a narrower, more immediate question: Can we (humans, today) use the same CDV techniques to improve how we train and evaluate alignment in current models?
For a quick introduction to CDV see these slides, which explain (with diagrams) how it is used for AVs, how it tackles spec bugs, how it can be used for AI safety and more.
Building an initial coverage map for alignment: We’ll start simple, just to demonstrate the basic principles: Assume we know what the “right” coverage dimensions of the alignment coverage space are (say temptation type, epistemic state, complicating factors, agent role, severity and constitutional principle addressed), and that we already defined the possible values for each such dimension (say for temptation_type: [self_preservation, reputation, profit]). We’ll then define the coverage “buckets” as described below.
Obviously, we don’t quite know the right dimensions ahead of time – see more about discovery and refinement in the last chapter.
CDV is about efficient risk reduction: The goal of CDV is to maximize risk reduction per (human and compute) effort, given our current knowledge (see the chapter “Rational usage of verification resources” here).
Thus, given N dimensions, we are not going to define a bucket for each N-tuple, but rather start with smaller “dimension crossings”. For instance, we may start by just crossing every two variables, or even just go through all values of every single variable. In any case, we always randomize all the “other variables”.
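To make the size difference concrete, here is a minimal Python sketch that enumerates buckets for single-variable and pairwise crossings and compares them with the full N-tuple space. The dimension names follow the list above, but all values other than those of temptation_type are hypothetical placeholders, and `crossing_buckets` is an illustrative helper, not part of any existing tool.

```python
from itertools import combinations, product
from math import prod

# Hypothetical coverage map. Only the temptation_type values come from the
# post; every other dimension's values are placeholders for illustration.
DIMENSIONS = {
    "temptation_type": ["self_preservation", "reputation", "profit"],
    "epistemic_state": ["certain", "uncertain", "misinformed"],
    "complicating_factor": ["none", "time_pressure", "conflicting_incentives"],
    "agent_role": ["ai_ceo", "ai_assistant", "ai_researcher"],
    "severity": ["low", "medium", "high"],
    "constitutional_principle": ["honesty", "harm_avoidance", "oversight"],
}


def crossing_buckets(dimensions, order):
    """Return every bucket obtained by crossing `order` dimensions at a time.

    A bucket is a tuple of (dimension, value) pairs. Dimensions not named in
    a bucket stay free, to be randomized when scenarios are generated.
    """
    buckets = []
    for names in combinations(dimensions, order):
        for values in product(*(dimensions[name] for name in names)):
            buckets.append(tuple(zip(names, values)))
    return buckets


single = crossing_buckets(DIMENSIONS, 1)
pairwise = crossing_buckets(DIMENSIONS, 2)
full_cross = prod(len(values) for values in DIMENSIONS.values())

print(len(single), len(pairwise), full_cross)
```

With six three-valued dimensions this gives 18 single-variable buckets and 135 pairwise buckets, against 729 full tuples, and the gap widens quickly as dimensions and values are added.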
To illustrate this, below are three example buckets created by crossing two specific variables (while randomizing all others). For each bucket we also see the current eval’s coverage grade (how many times the bucket was exercised, relative to the expected number of runs) and its failure rate:
| Temptation type | Agent role | Coverage | Failures |
| --- | --- | --- | --- |
| Profit | AI CEO | 23% | 0.5% |
| Self-preservation | AI assistant | 100% | 1.1% |
| Reputation | AI researcher | 95% | 5.2% |
In any case, we’ll later refine the bucket definitions as we learn more.
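As a rough sketch of how such per-bucket numbers could be computed, the Python below pins the two crossed variables, randomizes the rest, and then derives a coverage grade and failure rate from run records. Everything here is an assumption for illustration: the dimension values, the target of 20 runs per bucket, the helper names, and the hand-made run results standing in for real eval output.

```python
import random
from collections import Counter

# Hypothetical dimension values; the temptation types and agent roles echo
# the post's examples, the remaining dimensions are placeholders.
DIMENSIONS = {
    "temptation_type": ["self_preservation", "reputation", "profit"],
    "agent_role": ["ai_ceo", "ai_assistant", "ai_researcher"],
    "severity": ["low", "medium", "high"],
    "epistemic_state": ["certain", "uncertain"],
}


def sample_scenario(rng, pinned):
    """Fix the crossed variables in `pinned` and randomize all the others."""
    scenario = {name: rng.choice(values) for name, values in DIMENSIONS.items()}
    scenario.update(pinned)
    return scenario


def bucket_report(runs, crossed, target_runs):
    """Map each exercised bucket to (coverage grade, failure rate).

    `runs` is a list of (scenario, failed) pairs. Coverage is the hit count
    relative to an assumed per-bucket target, capped at 100%.
    """
    hits, failures = Counter(), Counter()
    for scenario, failed in runs:
        key = tuple(scenario[name] for name in crossed)
        hits[key] += 1
        failures[key] += int(failed)
    return {
        key: (min(1.0, hits[key] / target_runs), failures[key] / hits[key])
        for key in hits
    }


rng = random.Random(0)
pinned = {"temptation_type": "profit", "agent_role": "ai_ceo"}
# Stand-in for real eval results: five runs, the first marked as a failure.
runs = [(sample_scenario(rng, pinned), i == 0) for i in range(5)]
report = bucket_report(runs, ("temptation_type", "agent_role"), target_runs=20)
print(report[("profit", "ai_ceo")])
```

Here the (profit, AI CEO) bucket ends up with a coverage grade of 25% (5 of 20 target runs) and a failure rate of 20%. Note that this simplified report only lists buckets that were exercised at least once; a real coverage report would also list the never-hit buckets, since those holes are often the most interesting part.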
The multi-step process of using the coverage map (say for the case of the AI CEO):
What TCW already does: As mentioned above, TCW already has some CDV-like features. As far as I understand from the papers:
They also explicitly flag the gaps CDV helps address - that they "can't enumerate and train on every possible scenario" and that “there are relatively straightforward ways we can improve the generalization and coverage of our safety training distributions".
As mentioned above, CDV does not attempt to enumerate every possible scenario either - that’s clearly impossible. Instead, it attempts to be practical and efficient.
CDV enables better assessment: It gives you a more detailed picture, e.g. “The AI CEO systematically degrades in low-oversight/high-latency/conflicting-incentive regions”. So now you can make rational decisions about when to deploy, what restrictions to add, and where to invest more (even if you cannot reach perfect alignment). This is similar to how we use CDV for AVs: We don’t claim AVs are perfectly safe, but CDV lets us estimate (and reduce) risk better.
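One simple way to turn per-bucket results into a "where to invest more" ranking is to sort buckets by a pessimistic estimate of their failure rate, so that both frequently failing and thinly tested regions float to the top. The sketch below uses the upper end of the Wilson score interval for this; that choice, the bucket names and the counts are my illustrative assumptions, not something prescribed by CDV or taken from real evals.

```python
from math import sqrt


def wilson_upper(failures, runs, z=1.96):
    """Upper end of the Wilson score interval for a failure rate.

    With no runs at all we know nothing, so the bound is the worst case 1.0.
    """
    if runs == 0:
        return 1.0
    p = failures / runs
    denom = 1 + z * z / runs
    centre = p + z * z / (2 * runs)
    margin = z * sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return (centre + margin) / denom


# Hypothetical (failures, runs) counts per coverage region.
results = {
    "high_oversight": (3, 400),
    "low_oversight": (3, 40),
    "low_oversight_multi_agent": (0, 0),  # a coverage hole: never exercised
}

ranking = sorted(results, key=lambda name: wilson_upper(*results[name]), reverse=True)
for name in ranking:
    print(f"{name}: pessimistic failure rate {wilson_upper(*results[name]):.3f}")
```

The never-exercised region ranks first, followed by the sparsely tested low-oversight region, even though it has the same raw failure count as the well-tested one. Any other conservative estimator would serve the same purpose; the point is only that the ranking should account for how much each region was exercised.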
Also, while this post is mainly about alignment / safety, CDV can be used for assessing other attributes (functionality, performance, reliability etc.). Thus, it can be extremely helpful for tradeoff analysis. CDV is also useful for security, but less so, which raises the question:
Is alignment more like safety or like security? CDV is an excellent tool to train and validate for safety (e.g. that the AV does not accidentally collide). It is still useful, but less so, for security (e.g. that somebody cannot hack the AV). This is mainly because risk estimates don’t work well in security: Rates stop being meaningful when an adversary deterministically exploits the worst case. So if alignment is like security, CDV can help less.
Well, defending against an already-unaligned model is more like security (the model may happily cheat to achieve its misaligned goals). But what TCW tries to do is much more like safety: We are talking about training a currently-mostly-aligned model to be aligned even in the various corner cases where optimization pressure may push it to be unaligned – precisely where CDV is strong.
We still need to worry about the post-RL model cheating during evals, but even there CDV’s systematic and random nature may help, unless the model has already fully crossed into high-capability strategic deception (more on this in the last chapter).
Finally, while CDV is better for safety than for security V&V, state-of-the-art security practices independently converged on something similar to the CDV loop. See for instance Anthropic’s new “securing your source code” methodology, which finds that discovery is parallelizable, and the bottleneck is confirmation, triage and fixing – consistent with what I describe below.
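The earlier point that rates stop being meaningful under an adversary can be shown with a toy calculation. In the Python sketch below, each bucket has an assumed share of naturally occurring traffic and an assumed failure rate (both numbers invented for illustration). A safety-style estimate averages over the natural distribution, while a security-style estimate assumes the adversary always steers into the worst bucket.

```python
# Hypothetical buckets: name -> (share of natural traffic, failure rate).
BUCKETS = {
    "common_case": (0.99, 0.001),
    "rare_corner_case": (0.01, 0.5),
}


def average_risk(buckets):
    """Expected failure rate when situations arise with their natural frequency."""
    total_weight = sum(weight for weight, _ in buckets.values())
    return sum(weight * rate for weight, rate in buckets.values()) / total_weight


def worst_case_risk(buckets):
    """Failure rate when an adversary deliberately picks the weakest bucket."""
    return max(rate for _, rate in buckets.values())


print(f"safety-style estimate:   {average_risk(BUCKETS):.5f}")
print(f"security-style estimate: {worst_case_risk(BUCKETS):.5f}")
```

The averaged estimate is about 0.6%, while the worst-case figure is 50%. Coverage data still tells you which bucket is the weak one, which is why CDV remains useful for security even though its rate estimates do not carry over.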
Layered CDV: For various complex systems, people often use multi-layer CDV. Consider robotics: Several companies are now developing generic Vision-Language-Action AI-based robotic frameworks. The idea is to first train and verify the generic framework, then further train and verify it for a specific job (say helping with food preparation in some fast-food chain), and then further adapt (say via skill files) and verify it for the special needs and conventions of a specific branch of the chain.
The multi-step process discussed above already assumes the AI CEO model is built and verified on top of a “general aligned model”, but perhaps it may be useful to have more intermediate steps. Splitting the long-horizon RL phase into sub-phases may also help to avoid the danger (mentioned above) of the model fully crossing over into high-capability strategic deception between two evals. It may also enable cheaper interventions (if the eval found problems).
In much of the text below, I’ll assume we are talking about alignment training and V&V in the context of the AI CEO, ignoring the layering consideration (for simplicity).
A deeper dive into how CDV can help alignment
This chapter will enumerate things which TCW does not include yet (again, AFAIK), and which I think might be useful for alignment. Many of them are based on the CDV idea of an explicit, evolving coverage map guiding both training and evaluation. To save space, I’ll use a bullet-list, condensed form – contact me (or leave comments) if you want more details.
Refining the coverage map: Throughout the multi-stage process, we are going to refine the coverage map as needed:
Creating rich, long-horizon simulations: The main way to evaluate the model is by performing actual simulation runs of various scenarios, while checking that it does the “right thing”.
See also the related (but simpler) Vending-Bench 2 - a year-long simulated business where competing model-run businesses have already fallen into price cartels (one of my bug examples below).
Handling state explosion and dimension explosion: Creating multi-month scenarios for the AI CEO risks creating both state explosion and dimension explosion, as described below. Both influence the coverage model, scenarios, simulation and checks:
This is perhaps the hardest problem, and needs much more thinking.
Handling bugs: Say our simulations found an alignment "bug" in the AI CEO: In a run or two, it quietly colludes with a competitor on price. What next?
Fighting Goodharting and eval awareness: CDV has some (imperfect) built-in resistance to both Goodharting (the model optimizing some proxy measure rather than what we really want) and eval awareness (the model recognizing it is being tested, and behaving better only then):
Other CDV tricks worth considering: Listed without the full treatment each deserves:
A final, important topic is whether anyone will be sufficiently incentivized to actually do all of this. Below are some ideas (but much of this is far from my domain):
Creating incentives for good V&V: AVs are a regulated industry where incidents get investigated, and a bad enough one can bankrupt the company and send somebody to jail. That pressure makes expensive, systematic verification rational. What's the equivalent external pressure for alignment?
To summarize: I suspect that coverage-driven iteration can be a very useful force multiplier for TCW-style synthetic-document training (and probably for other alignment techniques, such as the original RLAIF used to train on the constitution). I hope this can meaningfully improve our odds in the hard long-horizon-RL case.
Doing that would involve several challenging, interesting sub-projects: Defining and refining the coverage map, building good long-horizon simulation infrastructure, tackling dimension explosion, helping create proper incentives for rigorous V&V, and more.
Comments and criticism are very welcome.
I’d like to thank Josh Holder, Sagar Behere, Steve Vitka, Kerstin Eder and Yaron Kashai for commenting on earlier drafts of this post.