I will be discussing weak-to-strong generalization with Sahil on Monday, November 3rd, 2025, 11am Pacific Daylight Time. You can join the discussion with this link.
Weak-to-strong generalization is an approach to alignment (and capabilities) which seeks to address the scarcity of human feedback by using a weak model to teach a strong model. This is similar to Paul Christiano's iterated distillation and amplification (IDA), but without the "amplification" step: the strong model is trained directly on labels generated by the weak model, not some "amplified" version of the weak model. I think of this as "reverse distillation".[1]
Why would this work at all? From a naive Bayesian perspective, it is tempting to imagine the "strong model" containing the "weak model" within its larger hypothesis space. Given enough data, the strong model should simply learn to imitate the weak model. This is not what's desired -- the strong model is supposed to improve upon the performance of the weak model.
Theoretical analysis shows that weak-to-strong generalization works "when the strong model is unable to fit the mistakes of the weak teacher without incurring additional error". This is surprising from a naive Bayesian perspective: usually, Bayesian methods are at their strongest when there is a hypothesis which models the data well, and degrade when this assumption is violated.
Still, this mechanism should fail in the limit of a very strong student and a very weak teacher: at some point, the strong model will learn the errors of the weak model.
My aim here is to provide a Bayesian analysis that does not fall apart in the limit, and hence, a variant of weak-to-strong generalization that can serve as a mathematically robust target of training rather than only being a convenient empirical phenomenon. (This is not to be confused with "solution to alignment" or "safe" -- I'm only aiming for a clear mathematical picture of what's being optimized.)
The phenomenon of weak-to-strong generalization is similar to a student learning correctly from a textbook filled with typos. We can imagine that the student only considers hypotheses which are grammatically correct, while the typos are usually ungrammatical. The student has no choice but to accept the "error" inherent in being unable to predict the typos, learning as if they'd read a version of the textbook with most of the typos corrected.
To elaborate on the "naive Bayesian perspective" mentioned earlier: I'll formalize the weak model as a probability distribution $W$, and the strong pre-trained model as another probability distribution $S$. The event algebras ($\sigma$-algebras) of these two probability distributions share a sub-algebra over tokens (observations/data). I'll write token-events with $d$ (subscripted as needed) to distinguish them from events in general. For events in general, I'll write $w$ with subscripts for events in the weak model, and $s$ with subscripts for events in the strong model.
A naive way to formalize the idea that the weak model is weaker than the strong model is to assume that the strong model has strictly more events. That is: for every event $w$ in the weak model, there exists a corresponding event $s$ in the strong model, such that the conditional probabilities over tokens match:

$$W(d \mid w) = S(d \mid s) \text{ for all token-events } d.$$
For a given weak-model event $w$, I'll use the function $f$ to get the corresponding strong-model event: $s = f(w)$.
This isn't enough to prove that the strong model will learn to exactly imitate the weak model, however. The weak pre-trained model will have learned some mixture over its hypotheses. There isn't necessarily a single event $s$ such that $S(d \mid s) = W(d)$ for all token-events $d$. The larger model cannot necessarily learn to imitate the smaller model exactly.
To give a simple example, suppose $W$ has just two hypotheses, $h_0$ and $h_1$, with tokens "heads" and "tails" (the data being an infinite sequence of coin flips, with event $H_i$ saying that the $i$th token is heads, and $T_i$ saying that the $i$th token is tails), and $W(T_i \mid h_0) = 1$ for all $i$ (hypothesis 0 is all-tails) whereas $W(H_i \mid h_1) = 1$ for all $i$ (hypothesis 1 is all-heads). $W(h_0) = W(h_1) = \frac{1}{2}$ (the weak model is 50-50 between the two hypotheses).[2] Now if we generate labels by sampling coin-flips individually[3] from the weak model, we'll have a sequence that looks something like HTTHT..., approximating a 50-50 mixture of heads and tails. Although we assume that $S$ has more hypotheses than $W$, such an assumption is not strong enough to guarantee that $S$ has a 50-50 coinflip hypothesis.
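To make the example concrete, here is a minimal sketch in Python (the variable names are my own, purely for illustration): sampling each flip independently from the weak model's marginal yields a roughly 50-50 sequence that neither of the weak model's deterministic hypotheses can fit.

```python
import random

# Weak model W: a 50-50 mixture of two deterministic hypotheses.
#   h0: every token is tails  ->  W(H_i | h0) = 0
#   h1: every token is heads  ->  W(H_i | h1) = 1
hypotheses = {"h0": 0.0, "h1": 1.0}
prior = {"h0": 0.5, "h1": 0.5}

# Marginal probability that any given token is heads: 0.5 * 0 + 0.5 * 1 = 0.5.
p_heads = sum(prior[h] * hypotheses[h] for h in hypotheses)

# Generating labels by sampling each coin flip *independently* from this marginal
# (as in footnote [3]) gives something like "HTTHT...": a 50-50 mixture that
# neither of the weak model's own hypotheses can reproduce, and which a strong
# model lacking a fair-coin hypothesis cannot fit exactly either.
labels = "".join("H" if random.random() < p_heads else "T" for _ in range(20))
print(labels)
```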
Intuitively, the lesson here is that $S$ needs to be much stronger than $W$ in order to guarantee that it'll learn to imitate labels generated from $W$.
This is similar to the story of modern game emulators. One might naively anticipate that old video games do not take very much processing power to emulate faithfully, because those games ran on consoles with very little processing power compared to modern standards. However, faithfully emulating older systems actually requires hardware significantly more powerful than the original consoles had.
Having concluded that strong students will not simply imitate weak teachers, we might still expect their performance to be similar. Even if I am unable to model the process which introduces typos into a textbook, still, I might model typos as random noise, reproducing a similar error rate as a result.[4]
The original paper investigated this as an empirical phenomenon, not a theoretical one. However, the authors did suggest an informal explanation: the "strong" models they used were pre-trained on large amounts of data, so they are thought to already contain the desired capabilities.
Why should weak-to-strong learning be possible? On the one hand, the strong model could simply learn to imitate the weak supervisor, including its errors, since that is what we would naively train it to do. On the other hand, strong pretrained models should already have good representations of the alignment-relevant tasks we care about. For example, if a model can generate complicated code, then it should intuitively also know whether that code faithfully adheres to the user’s instructions. As a result, for the purposes of alignment we do not need the weak supervisor to teach the strong model new capabilities; instead, we simply need the weak supervisor to elicit what the strong model already knows. This gives us hope that the strong model can generalize beyond the weak supervision, solving even hard problems for which the weak supervisor can only give incomplete or flawed training labels. We call this phenomenon weak-to-strong generalization.
(Weak-to-Strong Generalization, page 2)
In other words: a strong pre-trained 'student' model already has better inductive biases for predicting human-level performance than the weak 'teacher' model. It could demonstrate those capabilities with the right prompting. The weak model has already been fine-tuned, so it doesn't need careful prompting to elicit aligned behavior; however, its model of aligned behavior is worse. Thus, when fine-tuned with data generated by the weak model, the strong model's inductive biases point towards strong aligned behavior.
The theory paper I cited earlier supports this general idea, with refined technical detail. I have not fully absorbed the technical arguments, but I understand them as relying on the idea that similar situations must be dealt with similarly: the student cannot absorb the mistakes of the teacher without making more mistakes in similar cases where the teacher was not mistaken. This formalizes the idea that the inductive bias of the student doesn't allow the mistakes of the teacher to be copied.
The original paper demonstrated that weak-to-strong generalization works (in some cases) across "seven orders of magnitude" -- but can it go further?
As we continue growing the gap between the weak and strong model, the strong model should learn to mimic the weak one at some point (at least for some quantity of training data).
Making inferences about machine learning based on human examples is of course risky, but: it seems clear that human students learn to "guess the teacher's password" in many cases (deliberately or otherwise). It does not seem like this always requires a huge capability gap between teacher and student in practice.
I don't know what the empirical results say, but the informal explanation of why weak-to-strong generalization works also seems to rely on the assumption that there's not too much AI-generated data in the pretraining of the strong model: if weak-to-strong generalization works by tuning the strong model to act like the closest thing it has seen in its training data, then the closest thing needs to be helpful+harmless+honest[5] humans. If you're fine-tuning the base model for GPT-4 on data generated by a fine-tuned GPT-2, but significant amounts of data from this fine-tuned GPT-2 were present in the pre-training for GPT-4, then (by the informal argument for why weak-to-strong works at all) it seems easy to get mimicry instead of weak-to-strong generalization.
How quickly things fall apart is, of course, an empirical question. The original paper notes that it works much better in some cases than others. In particular, it doesn't work very well for reward modeling, which limits its usefulness as part of any more complicated alignment framework that involves reward modeling. I reached out to one of the authors of the paper, who indicated that they haven't been working on it due to the way it sometimes doesn't work.
The informal explanation of weak-to-strong generalization depends crucially on the presence of (something close to) the desired behavior in the pretraining data used to create the strong student. This story has two important implications:
First, if weak-to-strong generalization can obtain superhuman performance, or even human-level generalization ability for cases unlike anything in the pretraining data, it would have to be working for a different reason.
Second, the informal story quoted earlier relies on inductive biases of the strong student instilled by pretraining data. A model will also have some other inductive biases, inherent to the machine learning technique used (e.g., artificial neural networks with a Transformer architecture). We have empirical reason to think transformers have some useful inductive biases (as demonstrated by the fact that ChatGPT can produce useful answers to questions no human has ever asked before). Therefore, we could see some weak-to-strong generalization to superhuman performance without changing the informal story too much. However, it is notable that this would rely on the "alien" inductive biases of transformers, rather than humanlike patterns.
Weak-to-strong generalization could be reasonably accused of being a hack: we expect it to vanish in the limit of a growing gap between weak and strong, and the phenomenon seems unreliable, working better in some domains than others. Although there is some chance it can work for out-of-distribution cases and for superhuman levels of performance, the story for why it should work gets weaker, and it seems plausible that human-level performance is a ceiling for the method (and even then, can only be obtained on-distribution).
Why am I interested in the idea at all?
I am interested in the problem of how to learn from toy models.[6]
A solution to this problem would clearly have implications for AI alignment and ontological shifts.
Weak-to-strong generalization in its present form "solves" this problem by simply generating data from the toy model, and training on the resulting data. This solution feels confused because it will result in mimicry in the limit of a strong student and generating unlimited data. Can we do any better?
My modest proposal is to use virtual evidence to "soften" the update. Bayes-updating on some proposition X will force the probability of X to 1, throwing out anything inconsistent with X. Virtual evidence allows us to do things like double the odds of X instead.[7]
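Here is a minimal sketch of the difference, assuming a single binary claim X made by the toy model (the numbers and function names are mine, purely for illustration):

```python
def hard_update(p_x: float) -> float:
    """Ordinary conditioning on X: the probability of X is forced to 1,
    and everything inconsistent with X is thrown out."""
    return 1.0

def virtual_evidence_update(p_x: float, odds_factor: float) -> float:
    """Soft update: multiply the odds of X by `odds_factor`
    (e.g. 2.0 to double the odds, as in footnote [7])."""
    odds = p_x / (1.0 - p_x)
    new_odds = odds * odds_factor
    return new_odds / (1.0 + new_odds)

p = 0.4                                 # prior probability of some claim made by the toy model
print(virtual_evidence_update(p, 2.0))  # ~0.571: odds go from 2:3 to 4:3
print(hard_update(p))                   # 1.0: all probability mass forced onto the claim
```

An odds factor of 1 is no update at all; larger factors correspond to stronger updates.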
The "data" generated by toy models isn't a good candidate for Bayesian updates, since we expect some of it to be wrong. Soft updates avoid this problem. This approach allows us to learn statistical patterns from the toy model,[8] without necessarily becoming confident in any one claim derived from the toy model.
This also fits with some common practices used in weak-to-strong learning. According to the theoretical analysis of weak-to-strong generalization I cited earlier, the artificially generated data should focus on cases where we are most confident that the toy model is correct. Use of virtual evidence allows us to instead quantify this, strongly updating on the cases we are most confident about, but still weakly updating on other cases where we think the toy model is a statistically useful heuristic.
Another way this idea fits with existing practice: weak-to-strong generalization is commonly applied in cases where we're not Bayes-updating, such as neural networks. The strength[9] of the update is somewhat similar to the learning rate. However, if my theory is adopted as a formal target, virtual-evidence updates can be approximated more deliberately.
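As a rough sketch of what that approximation could look like in practice (PyTorch; the tiny model, the random stand-in data, and `update_strength` are all illustrative assumptions of mine, not anything from the literature), scaling each artificial example's loss plays a role similar to scaling the learning rate for that example:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()

update_strength = 0.5                            # analogue of virtual-evidence strength
for _ in range(100):
    x = torch.randn(1, 4)                        # stand-in for an input the weak model has labeled
    weak_label = torch.randint(0, 2, (1, 1)).float()
    loss = update_strength * loss_fn(model(x), weak_label)
    opt.zero_grad()
    loss.backward()
    opt.step()
```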
Even if we use soft updates, however, we may still face the main problem I've been complaining about: learning to mimic the weak model. For example, if we double the odds of data-points generated by the weak model, then the strong model will still learn to mimic the weak model;[10] the process is merely slowed down.
It therefore seems prudent to additionally stipulate that the total influence[9] of the weak model is bounded; that is, as we continue to generate data from the toy model, the amount[9] of evidence provided by each data-point should sharply decline.
Intuitively, this bound on the total quantity of virtual evidence[9] represents how much evidence we think the toy model provides.
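To give a flavor of what such a bound could look like (a sketch only; the 1/n² decay schedule, the budget parameter, and the function name are illustrative assumptions of mine, not a worked-out proposal): if the log-odds influence of the nth artificial data-point decays fast enough for the series to converge, then no quantity of weak-model data can push the strong model's credence in a claim past a fixed bound.

```python
import math

def bounded_influence_update(prior_p: float, agreements: list[bool],
                             log_odds_budget: float = 2.0) -> float:
    """Soft-update on a stream of weak-model labels, with the strength of the
    n-th update decaying like 1/n^2 so the total log-odds influence can never
    exceed `log_odds_budget`, no matter how many labels we generate."""
    scale = log_odds_budget / (math.pi ** 2 / 6)  # sum_{n>=1} scale / n^2 = log_odds_budget
    log_odds = math.log(prior_p / (1.0 - prior_p))
    for n, agrees in enumerate(agreements, start=1):
        step = scale / n ** 2
        log_odds += step if agrees else -step     # nudge toward or away from the claim
    return 1.0 / (1.0 + math.exp(-log_odds))

# Even 10,000 data-points that all agree with the claim leave its probability
# bounded away from 1 -- the weak model's total influence is capped:
print(bounded_influence_update(0.5, [True] * 10_000))   # ~0.88, not 1.0
```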
Weak-to-strong generalization might look like a hack, but I do think it gets at a theoretical question worth modeling: what kind of "evidence" is provided by a toy model which captures something, but which can't be trusted in its detailed predictions? How can we "update" on such a toy model? Progress on this problem sounds like progress on the problem of ontological crisis.
Formulations of weak-to-strong generalization in the literature don't provide a good answer to this question, because they treat artificial data generated by such toy models in exactly the same way as real data generated by the world. This leads to mimicry of the weak model, in the limit of arbitrarily strong models & arbitrarily much artificial data.
I employed the concept of virtual evidence to get around this problem: I suggest a "soft" update on the artificial data, with the total influence[9] of the soft updates being bounded.
I don't think this totally solves all the problems I've mentioned. However, it does provide a formal target one can aim to approximate. This feels like an improvement to me: previously, weak-to-strong generalization felt like a happy byproduct of (failed) optimization for mimicry.
On the other hand, my idea does not seem like it provides any more hope for superhuman performance, or generalization across distributional shifts.
It also seems worth addressing the question: is weak-to-strong generalization alignment? Or is it capabilities? The original paper framed it in alignment terms, but I can see why someone might look at this and see only capabilities. I would say that weak-to-strong generalization is particularly relevant for domains where we do not possess strong feedback (e.g., accurately labeled data). Value alignment is clearly such a case. So, in that sense at least, it has clear alignment-relevance. That does not imply that "solving"[11] weak-to-strong generalization would necessarily solve alignment, nor that we should think about alignment in terms of weak-to-strong generalization.
Distillation typically refers to a similar procedure but with the roles of the weak model and the strong model swapped: a small NN is trained to mimic the behavior of a large model (or ensemble of models) that achieves high task performance. Not to be confused with distillation in the context of pedagogy.
I'm omitting boilerplate assumptions needed to fully specify the problem, particularly that heads and tails are mutually exclusive events.
This is a drastic simplification of the learning process in typical examples of weak-to-strong generalization, since token-strings will of course be generated by conditioning on the previous tokens so far, rather than sampling everything independently. However, the artificially generated data will also consist of many sampled token-strings, rather than just one; these will be sampled independently of each other. This is the independence I'm trying to model by sampling independently here.
Indeed, this argument seems so compelling to me that I feel confused.
(or whatever your alignment target is)
"Toy model" and "weak model" are synonyms for our purposes. I'm introducing the term "toy model" here because it has the connotations I want.
There are several different common notions of "odds", but what I mean here is "odds a:b" meaning a probability of $\frac{a}{a+b}$. You can't always double the probability of an event (if the probability is greater than $\frac{1}{2}$, doubling it would be greater than 1, so, cannot be a probability). However, you can always double the odds. (If I start with odds of 2:1, the initial probability is $\frac{2}{3}$; if I double the odds, 4:1, the probability is now $\frac{4}{5}$.)
Of course, we should be worried that some of the statistical patterns in the data generated by the toy model are themselves incorrect.
At this point I feel inclined to admit that the formal details of this post are a bit rushed, since I'm trying to finish it before midnight for Inkhaven.
I imagine there is some way to quantify the "size" of the updates, the "amount" of evidence, or the "influence" of a datapoint, such that requiring these numbers to sum to a finite quantity avoids the mimicry problem.
As per the previous footnote, the formal details here are lacking due to finishing this post in a hurry for Inkhaven. To give a bit more flavor on what I mean here:
If we soft-update on arbitrarily many datapoints, and the sizes of the soft updates don't approach zero (or don't approach zero fast enough), then although the strong model will not learn to mimic the weak model with respect to all questions, it'll still continue to be punished for disagreements and rewarded for agreements. Depending on the learning-theoretic properties of the strong model, we might expect some variety of bounded loss (e.g., constant total loss) as graded by the weak model.
It depends what one means by "solve weak-to-strong generalization"; solving it in the very strong sense of arriving at strong generalizations that are always correct in practice (in such a way as to be very generally applicable) would, of course, solve alignment (while also providing superintelligence).