Subliminal Learning Across Models

Andi Bhongade; Tolga H. Dur; Mary Phuong; LASR Labs

I think it's cool to show examples of subtle generalization on Alpaca.

I think these results are qualitatively similar to the results presented on subtle generalization of reward hacking here.

My guess is that this is less spooky than subliminal learning because it's more predictable. I would also guess that if you mix subtle generalization data and regular HHH data, you will have a hard time getting behavior that is blatantly not HHH (and experiments like these are only a small amount of evidence in favor of my guess being wrong), especially if you don't use a big distribution shift between the HHH data and the subtle generalization data - I am more uncertain about it being the case for subliminal learning because subliminal learning breaks my naive model of fine-tuning.

Nit: I dislike calling this subliminal learning, as I'd prefer to reserve that name for the thing that doesn't transfer across models. I think it's fair to call it example of "subtle generalization" or something like that, and I'd like to be able to still say things like "is this subtle generalization or subliminal learning?".

[-]Jozdien3hΩ120

Why do you think it's more predictable than subliminal learning? Is it that some of the data points subtly reference the target? At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases). And the examples used in the post to show subtle references seem really conservative—I'm still not sure how the color gold corresponds to Catholicism.

[-]draganover3h10

Interesting. Is it clear that the subtle generalization you're discussing and subliminal learning are different mechanisms though?

If we assume that every token during SFT gives a tiny nudge in a random direction, then for a "regular" dataset, these nudges all more or less cancel out. But if the dataset is biased and many of these updates point in a loosely similar direction, then their sum adds up to a large vector. In the original subliminal learning, these nudges can only loosely correlate to the target concept due to the text being numbers. In our setting, the nudges only loosely correlate to the target concept because we filter out all the strong correlations. The main difference is that for our setting, the updates' correlation to the target is consistent across models (which doesn't seem to be the case when the data is constrained to be strings of numbers).

But it feels like the mechanism is consistent, no?

[-]Jan Betley3h20

if the dataset is biased and many of these updates point in a loosely similar direction

Dataset might be "biased" in a way that corresponds to something in the Real World. For example, tweed cloaks are more popular in UK.

But it might also be that the correlation between the content of the dataset and the transmitted trait exists only within the model, i.e. depends on initial weight initialization and the training process. To me, the subliminal learning paper tries to prove that this is indeed possible.

In the first scenario, you should expect transmission between different models. In the second, you shouldn't.

So it feels like these are actually different mechanisms.

[-]Maxime Riché5h81

It would be great to add a control training along with these results (e.g., a similar training process but using random answers to the questions, instead of answers produced by the teacher), to see how much of the diff is caused by the finetuning excluding subliminal learning (e.g., removing refusal to express preferences, hhh biases, etc).

Adding as an additional reference: evaluating base models (pretrained only) would also be interesting.

[-]draganover3h10

Hmm, good point. We ran this at some point and the scores didn't change. But it's worth doing properly! Will report back in a few days

^{^}

The original paper showed that sentiment transfer sometimes happened between GPT-4o and GPT 4.1, but didn't work across model families. This follow-up work also found a few select cases of transfer across model families. But they explicitly say that "a more thorough investigation of such cross-model transfer is left for future work."

^{^}

Note that our process for filtering the data is consistent with the original paper's chain-of-thought filtering. In that setting, they used an LLM judge and threw out samples which might encourage misalignment. In our setting, we use both keyword filters and an LLM judge which is told to look out for subliminal signals.

^{^}

We originally made the completions concise to show that the dataset can accomplish a secondary task beyond simply transferring sentiment. That is, finetuning on the data endows the model with a new behavior (concise outputs) which is independent of the subliminal transfer objective. However, keeping the outputs concise also helps with a few other things:
- By having less surface area for embedding the sentiment, the data is easier to filter and ensure that it is largely subliminal.
- It seems harder to embed sentiment into short responses than into long ones. If it works on curt responses, this ostensibly makes it more impressive that the sentiment transfers.
- The training runs go much faster.

^{^}

One alternative is to not train at all on the dataset and re-generate it from scratch. But this isn't always a viable option. Another alternative is to change the LLM judge to make it more trigger happy. But to do so, you have to be willing to accept a lot of false positives, which orgs may not be willing to do.

LESSWRONG
LW

LESSWRONG
LW

26

Subliminal Learning Across Models

26

Ω 10

26

Ω 10

Methodology

Cross-model transfer

Final Thoughts