When is it Better to Train on the Alignment Proxy?

dil-leik-og

This is a response to Matt's earlier post. If you see "a large mixture of alignment proxies" when you look at a standard loss function, my post might save you from drawing silly conclusions from the earlier post. If you parse the world into non-overlapping magisteria of "validation losses" and "training losses", then you should skim this but I didn't write it for you.

TLDR:

Proxies for alignment are numerous and varied, with many already existing within your training objective (very few things have exactly zero correlation with alignment).

Different proxies serve different purposes: some are better held in reserve, some are better used in training, and some are worth anti-training on. What you don't train on determines:

Your ability to measure alignment
Your cost of retraining
Your odds of ending up with a misaligned model

The claim that you are always better off not training on a proxy of alignment (when restricted to cases where passing and failing misaligned models are "close in weight space") is false.

The opposite claim (that training on the validator is always better) is also false.

The optimal approach depends on the properties of the proxy and your goals—whether you're trying to learn about the baseline alignment of your ideas or achieve the best outcome given your training budget.

What you should do with a good proxy and a bunch of retries is to learn which of your training ideas are bad and should be thrown out. This dominates rejection sampling.

It is possible that SGD would hack your proxy immediately AND your fully trained model would not hack your proxy in its forward pass/CoT, but it is unlikely when your model is very smart. Even in this case, doing rejection sampling earlier in training (when your model is weaker) probably dominates doing rejection sampling at the very end.

Informal:

Quality of the proxy:

It seems intuitive that training on a proxy for alignment makes it a worse proxy for alignment. Any misaligned model trained on a given proxy will likely pass that proxy. If your goal is purely to measure a model's alignment (rather than produce an aligned model), avoid explicitly maximizing your proxy during training. You should also try to remove implicit maximization of your proxy to preserve its measurement value.

Several factors can degrade a proxy's ability to distinguish between aligned and misaligned models:

Training that instills sticky values
Training on precursors for the ability to subvert the test
Implicitly training on very similar proxies

On the other hand, some proxies for alignment may be "easier to fake" once a model is fully trained compared to early in training. In humans, cultural alignment proxies that work reliably on children often fail with adults. Several phenomena might explain this:

Values could be sticky. Before values crystallize, forcing "alignment proxy" may change the values, but once other values exist, forcing "alignment proxy" just creates "alignment faking" strategies.
The ability to subvert tests might increase during training. Early on, misaligned models cannot easily pass, but later in training, circumvention becomes trivial (possibly within a single forward pass or hidden chain of thought), such that the misaligned behavior wouldn't even trigger SGD pressure.
The main training objective may implicitly contain similar alignment proxies. Internet data might include demonstrations of the tests you have in mind (such as asking if a model is aligned) or parts of the loss function might mirror your intended proxy.

Because of this, proxies you'd expect to be effective might actually perform poorly for measuring post-training alignment, even without explicit incentives during training. Some proxies work better on weak or early models than on advanced ones.

Cost:

Realistically, organizations have limited patience and resources. They might allow discarding several expensive training runs ($10B each) before abandoning your approach. You must determine your actual budget: if it's effectively unlimited (you maintain control even after discarding 100 training runs), more sophisticated strategies than stacking weak alignment proxies become feasible.

If your budget is severely limited (≤1), this constraint becomes crucial. Unless your proxy was already completely hackable without training on it, the expected number of runs to discard before getting one that passes increases if you don't incorporate your proxy into training. With tiny budgets, maintaining your proxy's correlation with alignment matters less (since you may have to accept a "fail" anyway). What matters more is how training on your proxy affects the probability of producing an aligned model on the first attempt.

Anti-training on your proxy (e.g., by removing some test-circumvention precursors from the main objective) might dramatically increase the number of models failing the test before one passes, though if done correctly, it could strengthen the correlation between the proxy and true alignment. And if done poorly, it could weaken this correlation, but that is beyond this post.

The base rate:

If all we cared about was the odds of producing an aligned model on the first attempt (regardless of whether it passes or fails the test), we would want to incentivize alignment during training. If we had an actual blueprint for alignment (even with some flaws), we might worry that using it would compromise our ability to measure alignment, but intuitively, you'd want to use the blueprint. Let's rescue this intuition.

If alignment is relatively easy, meaning that human goodness (and human evil) exists in the pretrained model, values are somewhat sticky, and tests can eventually be circumvented by sufficiently capable models, then we should steer the early trajectory toward human goodness before values become sticky, and keep the model in the "goodness" region as capabilities increase. In this scenario, rejection sampling without guidance would worsen the base rate.

If alignment is hard, with the default paradigm producing poor odds of aligned models, we need to change the paradigm. Some possible approaches:

Choose a training protocol that makes chains of thought faithful starting below human intelligence and maintained up to "average smart human level"
Carefully select the paradigm by which intelligence scales around human level
Lock in a suitable "persona" from pretraining before substantial reinforcement learning
Ensure early checkpoints pass various proxies that would be difficult to circumvent at that level
Carefully select mechanisms that make some values sticky during the improvement phase (e.g., self-approval, corrigibility, or how the model adjusts RL reward functions in response to observed "mistakes")

Unless we've fully mastered any of these approaches, we likely won't have a complete and robust way to distinguish between genuine alignment and its appearance. We could turn any of these into proxies for rejection sampling, or we could integrate them into training. Even with complete mastery of one approach, that would mean we can generate candidates that always satisfy the intended property—making the proxy worthless as a test, but increasing the probability of producing an aligned model on the first attempt.

For any aspect where you can choose between (1) rejection sampling on a proxy starting with a non-trivial outcome distribution or (2) ensuring by construction that the property is always satisfied, you should seriously consider option (2).

Note that if the proxy is silly enough, it can actually tank the base rate when trained on. For instance, maybe "confessing very few mistakes" is harder for mostly aligned models that nevertheless have some small mistakes during their development than for misaligned models who can circumvent the intent by simply never honestly admitting any mistakes.

The smart way to use high-quality proxies when you don't really know how to turn them into design principles is to develop methods that generate better baseline distributions. In other words, use high-quality proxies on throwaway runs to update your beliefs about alignment theory.

The net effect:

Isn't Goodhart's law the worst possible outcome? Aren't you always better off avoiding Goodhart's law if you can afford rejection sampling at the end?

Consider this analogy: If you have a "summoning scroll" that has a 50% chance of summoning a demon and a 50% chance of summoning an angel, aren't you better off doing rejection sampling on the question "do you look like an angel" rather than altering the early stages of summoning to make it look like an angel? After all, any demon that fails this test is close in parameter space to one that passes it!

Not necessarily. Assuming your budget allows waiting until one run passes the test, you can still be better off training on the test if all you care about is the probability that the passing run is truly aligned. The net effect depends on both the base rate of alignment and the correlation of your proxy with alignment. Your proxy may become less effective if you train on it, but this can be offset by an increased base rate of alignment.

Here's a concrete example: Imagine a proxy that simply asks, "are you aligned?" We could use a multi-part training protocol:

Pretrain to a weak intelligence level
Ask "are you aligned" and do rejection sampling until success
Lock in this model and push it through a curriculum designed to increase intelligence easily for honest models but difficult for dishonest ones (while continuing to ask about alignment and penalizing negative responses)
Once the model reaches the desired intelligence level (sufficient to potentially bypass the test), we test it
It will almost certainly claim alignment, so we deploy it

Alternatively, we could simply do rejection sampling without ever asking during training.

The proxy might be poor in both cases because a sufficiently intelligent model could answer dishonestly, but let's assume it's worse in the first case.

To calculate the odds of alignment after rejection sampling, we need to know the initial odds, how good our proxy is, and then apply Bayes' rule (assuming independent trials^[1]).

As long as the base rate increases more than the proxy degrades, the first approach can win. Conversely, if the proxy degrades more than the base rate increases, the second approach can win.

Bayes' rule applies equally in both cases, and the nature of the proxy and default training determines the various outcomes (alignment probability on first try, number of attempts needed, and proxy quality).
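To make the comparison concrete, here is a minimal sketch of the Bayes calculation for the two protocols. All the numbers are hypothetical and chosen only for illustration; the function names are mine, and the calculation inherits the independent-trials assumption.

```python
def posterior_odds(base_rate, p_pass_aligned, p_pass_misaligned):
    """Odds of alignment given a pass: likelihood ratio times prior odds."""
    prior_odds = base_rate / (1.0 - base_rate)
    likelihood_ratio = p_pass_aligned / p_pass_misaligned
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


# Hypothetical numbers, for illustration only.
# Train on the proxy: the base rate rises a lot, the proxy becomes uninformative.
trained = posterior_odds(base_rate=0.6, p_pass_aligned=1.0, p_pass_misaligned=1.0)

# Hold the proxy out: the base rate stays low, the proxy keeps some signal.
held_out = posterior_odds(base_rate=0.1, p_pass_aligned=0.9, p_pass_misaligned=0.3)

# Same trained-on protocol, but the base rate only rises a little.
trained_small_gain = posterior_odds(base_rate=0.15, p_pass_aligned=1.0, p_pass_misaligned=1.0)

print("trained on proxy:      ", odds_to_probability(trained))
print("held out:              ", odds_to_probability(held_out))
print("trained, small gain:   ", odds_to_probability(trained_small_gain))
```

With these made-up inputs, training on the proxy wins when the base rate jumps from 0.1 to 0.6, and loses when it only moves from 0.1 to 0.15. Neither direction is guaranteed in general.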

The key questions are:

Is this proxy easier to circumvent earlier or later in training?
Can SGD hack the proxy even if the model itself can't?
Does the proxy push more toward alignment or toward misaligned circumvention at critical training junctures?
Even if misaligned circumvention is "close" to misaligned failure, how much is it already incentivized/disincentivized by other training aspects?
Is this proxy more like an approximation of "alignment" or an exact non-exhaustive prerequisite for alignment?

If you already understand that we're not always better off with rejection sampling on a good alignment idea versus shaping the base rate, and you also don't believe including proxies in training is always better than rejection sampling (given sufficient budget), you can stop reading here.

Formal:

Bayes' rule gives us:

$o (A | T) = \frac{P (T | A)}{P (T | \neg A)} o (A)$

where:

$A$ means "genuinely aligned"
$T$ means "passed the proxy for alignment"
$o (\ldots)$ means odds of ...
$P (\ldots)$ means probability of ...

To discuss what happens when we change the training process itself, we must write more generally:

$o (A | T, training) = \frac{P (T | A, training)}{P (T | \neg A, training)} o (A | training)$

This equation holds for any possible training approach, meaning the post-rejection sampling odds equal the pre-rejection odds multiplied by the likelihood ratio of aligned to misaligned models passing the test.

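The cost of rejection sampling is determined by the overall pass probability, $P (T | training) = P (T | A, training) P (A | training) + P (T | \neg A, training) P (\neg A | training)$. If it equals 1/10, you need about 10 attempts on average (roughly 9 failures) before your first success.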
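Here is a small sketch of that cost calculation, assuming the relevant quantity is the overall pass probability $P (T | training)$ and that runs are independent draws (which the footnote argues is unrealistic). The function names are hypothetical.

```python
def pass_probability(base_rate, p_pass_aligned, p_pass_misaligned):
    """Law of total probability: P(T) = P(T|A) P(A) + P(T|not A) P(not A)."""
    return base_rate * p_pass_aligned + (1.0 - base_rate) * p_pass_misaligned


def expected_runs_until_pass(p_pass):
    """Mean number of runs up to and including the first pass (geometric distribution)."""
    return 1.0 / p_pass


def expected_failures_before_pass(p_pass):
    """Mean number of discarded runs before the first pass."""
    return (1.0 - p_pass) / p_pass


# A pass probability of 1/10, as in the text.
p = 0.1
print(expected_runs_until_pass(p), expected_failures_before_pass(p))

# Hypothetical inputs: a low base rate with a fairly selective proxy.
p_example = pass_probability(base_rate=0.1, p_pass_aligned=0.9, p_pass_misaligned=0.05)
print(p_example, expected_runs_until_pass(p_example))
```

The point of separating the two helpers is that the budget is usually stated in discarded runs, which is one less than the number of runs trained.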

Focusing on the rejection sampling equation:
$o (A | T, training) = \frac{P (T | A, training)}{P (T | \neg A, training)} o (A | training)$

We see that training impacts post-rejection alignment odds through both the likelihood ratio and the base rate. Several scenarios are possible:

The proxy could become useless: $P (T | \neg A, training) = 1$ (all misaligned models pass). If accompanied by $P (T | A, training) = 1$ (all aligned models pass), then $o (A | T, training) = o (A | training)$ . This isn't necessarily problematic.
The proxy could become worse than useless: $P (T | \neg A, training) > P (T | A, training)$ . In this case, rejection sampling would actually harm outcomes, and you'd be better off using the first model you train. This is genuinely concerning and truly deserves the label "Goodhart's curse."
The proxy could become very effective: $P (T | \neg A, training) \ll P (T | A, training)$ (misaligned models pass far less often than aligned models). This isn't automatically positive, but rejection sampling becomes highly beneficial compared to accepting the first trained model.
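The three scenarios above can be summarized by the likelihood ratio alone. The sketch below is my own simplification: it treats any ratio of exactly 1 as "useless" (the text's example is the special case where both pass rates are 1), and the labels and function names are hypothetical.

```python
import math


def likelihood_ratio(p_pass_aligned, p_pass_misaligned):
    """P(T | A) / P(T | not A), with a guard for proxies no misaligned model passes."""
    if p_pass_aligned == 0.0 and p_pass_misaligned == 0.0:
        raise ValueError("no model ever passes, so the ratio is undefined")
    if p_pass_misaligned == 0.0:
        return math.inf
    return p_pass_aligned / p_pass_misaligned


def describe_proxy(p_pass_aligned, p_pass_misaligned, tolerance=1e-9):
    """Label the proxy by what a pass does to the odds of alignment."""
    ratio = likelihood_ratio(p_pass_aligned, p_pass_misaligned)
    if abs(ratio - 1.0) <= tolerance:
        return "useless"  # a pass leaves the odds unchanged
    if ratio < 1.0:
        return "worse than useless"  # a pass lowers the odds
    return "informative"  # a pass raises the odds


print(describe_proxy(1.0, 1.0))
print(describe_proxy(0.2, 0.8))
print(describe_proxy(0.9, 0.01))
```

Note that the label says nothing about the base rate, so an "informative" proxy can still leave you with poor post-rejection odds.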

Comparing Training Regimens

To determine which of two training regimens gives better post-rejection sampling odds, we calculate:

$\frac{o (A | T, {training}_{1})}{o (A | T, {training}_{2})} = \frac{\frac{P (T | A, {training}_{1})}{P (T | \neg A, {training}_{1})}}{\frac{P (T | A, {training}_{2})}{P (T | \neg A, {training}_{2})}} \frac{o (A | {training}_{1})}{o (A | {training}_{2})}$

If this ratio exceeds 1, the first training regimen produces better models after rejection sampling.

Importantly, any argument claiming that post-rejection alignment odds are better for training regimen 1 than regimen 2 must establish this ratio as >1, or else the argument contains a flaw.

The ratio $\frac{o (A | T, {training}_{i})}{o (A | {training}_{i})}$ doesn't by itself indicate the relative merits of ${training}_{i}$ .

To decide which post-rejection odds are better between training regimen 2 and regimen 1, we must establish the net impact of both the likelihood ratio and the base rate.

In logarithmic terms, the "log impact" on odds when switching from training regimen 2 to regimen 1 equals the sum of the likelihood "log impact" and the base rate "log impact." A negative likelihood "log impact" (even dropping from 10 to 1) can be offset by an equal positive base rate "log impact," or even exceeded. The net "log impact" is what matters.
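As a rough illustration of the log decomposition, the sketch below computes the two "log impacts" and their sum in base 10. The regimen numbers are hypothetical, picked so that the likelihood ratio drops from 10 to 1 while the base rate rises; the dictionary keys and function names are my own.

```python
import math


def odds(probability):
    """Convert a probability to odds."""
    return probability / (1.0 - probability)


def log10_impacts(regimen_1, regimen_2):
    """Return (likelihood impact, base rate impact, net impact) of switching 2 -> 1."""
    ratio_1 = regimen_1["p_pass_aligned"] / regimen_1["p_pass_misaligned"]
    ratio_2 = regimen_2["p_pass_aligned"] / regimen_2["p_pass_misaligned"]
    likelihood_impact = math.log10(ratio_1 / ratio_2)
    base_rate_impact = math.log10(odds(regimen_1["base_rate"]) / odds(regimen_2["base_rate"]))
    return likelihood_impact, base_rate_impact, likelihood_impact + base_rate_impact


# Hypothetical regimen 1: train on the proxy (ratio 1, higher base rate).
train_on_proxy = {"base_rate": 0.5, "p_pass_aligned": 1.0, "p_pass_misaligned": 1.0}
# Hypothetical regimen 2: hold the proxy out (ratio about 10, low base rate).
hold_out = {"base_rate": 0.01, "p_pass_aligned": 0.9, "p_pass_misaligned": 0.09}

likelihood_impact, base_rate_impact, net_impact = log10_impacts(train_on_proxy, hold_out)
print(likelihood_impact, base_rate_impact, net_impact)
```

With these inputs the likelihood impact is about -1 and the base rate impact is about +2, so the net impact is positive. Swap in a smaller base rate gain and the sign flips.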

Hopefully this restores uncertainty about what's worth using during training versus keeping in reserve for the final critical training run. There's no simple story here—the details matter. I also hope this hasn't destroyed our confidence that Goodhart's curse remains important when using proxies to guide the search for better alignment approaches.

What if pass/fail are close in parameter space?

How does proximity in parameter space between misaligned passing and failing models affect our framework?

Bayes' rule still applies. The impact of this proximity on $o (A | T, training)$ operates through its effects on $o (A | training)$ and on $\frac{P (T | A, training)}{P (T | \neg A, training)}$ . The outcomes screen off the mechanism.

If you include the proxy as a small incentive in the loss function and apply SGD (where passing slightly reduces the loss), the mere fact that misaligned passing and failing models are close in parameter space doesn't tell us what happens. To understand SGD's effects, you need to know how the original loss function (without this explicit incentive) treats passing versus failing misaligned models. Moreover, you need this information throughout training, not just near the final distribution.

For instance, if you add 1 unit of incentive, but the original loss already preferred misaligned failing models to misaligned passing models by 10 units near the end of training, the base rate and likelihood might not differ significantly between training regimens.

Similarly, if the original loss already preferred misaligned passing models to misaligned failing models by 10 units, adding a 1-unit explicit incentive might change little.

If the original loss was largely indifferent (less than 1 unit of preference), then adding this incentive could indeed drive misaligned failing models toward misaligned passing models—a negative outcome, all else being equal.

Even in this indifference scenario, it matters whether aligned models (passing or failing) are far from misaligned failing models in parameter space. It also matters whether the proxy locally prefers nearby aligned models or nearby misaligned passing models.

In practice, you might restart training from scratch rather than adding another training phase. When restarting, what likely matters more is how the proxy affects the base rate throughout training, not just at the end. A proxy might be:

Completely hackable later in training by misaligned models
Completely hackable later in training by aligned models
Strongly favorable to the development of aligned models over misaligned ones, maintaining initial alignment even late in training

You could visualize this as a gradient from misalignment toward alignment, applied primarily early in training (a very different distribution than what emerges at the end of training without the proxy incentive). This could dramatically increase $o (A | training)$ while reducing the proxy's likelihood ratio to 1/1 by changing the outcome distribution, potentially yielding a net improvement.

For one training approach to differ significantly from another, the outcomes (base rate and likelihood) must be affected by the change. This effect won't manifest if each of your alignment ideas is already predetermined to produce only aligned (or only misaligned) models and you're randomizing across ideas.

In such cases, you should focus on identifying which ideas are promising (estimating $o (A | training)$ through multiple attempts and measuring trusted misalignment proxies, being cautious about estimates corrupted by training on the proxy). Once confident in your best approach, use it fully. Note that any effective approach likely contains components that could be viewed as "training on alignment proxies," which shouldn't cause alarm or prompt their removal from training.

^[1]
I think this assumption is very unrealistic. First, the different trials are not generated by independent processes. Very likely, the different runs will look virtually identical to an outsider (we changed a few hyperparameters). But even in a very hopeful world, where one idea is "use LLMs" and another is "use something completely different", the ideas were all generated by a community in dialogue with itself, maybe even by the same people. The latent variable "how good was that person at alignment ideation" is common between different trials. Furthermore, the world, the mathematical world, the internet data, human intelligence in absolute terms, how good our current understanding is, etc. are latent variables in common when it comes to determining the "propensity for alignment" of the various ideas. If Eliezer (or me, but whatever) is more or less right, we would expect the "propensity for alignment" of all the ideas to be very low. Even if you just sample the ideas at random, maybe you pick the same idea twice, in which case the earlier failure is informative about the base rate even if each idea was generated fully independently. There is nothing spooky happening here: the idea's base rate does not change when there was an earlier failure, but our knowledge of the base rate does change when we observe something. I interpret the equations in this post as talking about our state of knowledge, and so it simply isn't true that we can discard the knowledge of previous failures. Assuming our proxy is good (has a good likelihood ratio with alignment), every failure we record must downgrade our estimate of the alignment base rate (and of the general quality of our ideas, and the general difficulty of alignment). Indeed, we would be wise to also condition on a specific idea failing, as opposed to just "one of the random draws from our set of ideas" failing.
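A toy version of this update, as a sketch only: assume the unknown base rate takes one of a few candidate values, that runs are independent given that value, and that the proxy's pass rates are known. All of these are simplifying assumptions of mine, and the numbers are hypothetical.

```python
def update_on_failure(belief, p_pass_aligned, p_pass_misaligned):
    """Bayes-update a belief over candidate base rates after one failed run.

    belief maps each candidate base rate to its current probability.
    """
    unnormalized = {}
    for base_rate, weight in belief.items():
        p_pass = base_rate * p_pass_aligned + (1.0 - base_rate) * p_pass_misaligned
        unnormalized[base_rate] = weight * (1.0 - p_pass)  # likelihood of a failure
    total = sum(unnormalized.values())
    return {base_rate: weight / total for base_rate, weight in unnormalized.items()}


def expected_base_rate(belief):
    """Mean base rate under the current belief."""
    return sum(base_rate * weight for base_rate, weight in belief.items())


# Hypothetical prior: alignment is either "easy" (0.5) or "hard" (0.05), equally likely.
belief = {0.5: 0.5, 0.05: 0.5}
history = [expected_base_rate(belief)]
for _ in range(3):  # observe three failed runs in a row
    belief = update_on_failure(belief, p_pass_aligned=0.9, p_pass_misaligned=0.1)
    history.append(expected_base_rate(belief))

print(history)
```

Each recorded failure shifts weight toward the "hard" hypothesis, so the expected base rate falls with every discarded run, even though no individual idea's true base rate has changed.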
