Jozdien — LessWrong

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

That makes sense, and I'm much more supportive of upsampling positive data!

In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona.

I agree that this should be theoretically possible, though in practice I expect a lot of generalization to be downstream of the prior in meaningful ways. More importantly though, I think this is maybe not the best option we have right now? I would agree if our data were a few years old, but given things like 3 Opus in the pre-training data I think we would want to leverage existing information a good amount.

I'm excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.

Same! I think there are tons of low-hanging fruit here. There's also some prior discussion on using metadata to shape generalization from pre-training in Conditioning Predictive Models.

Tim Hua's Shortform

Jozdien8d52

As davidad suggests in that tweet, one way you might end up running into this is with RL that reinforces successful trajectories without great credit assignment, which could result in a model having very high confidence that its actions are always right. In practice this wasn't obvious enough to be caught by various evals, and IMO could easily translate over into settings like high-stakes alignment research.

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Jozdien8d238

Thanks for doing this!

My guess is that filtering out AI discourse has the first-order effect of making models more aligned, but also various second-order misalignment effects.

For example, a lot of the results from this paper are pretty centrally about models becoming misaligned because sampled actions had different implications to the model than our own views would imply (in that context, that reward hacking is not actually bad in the context of making robust RL environments being difficult). If you filtered out all AI discourse from pre-training, then you'd run into this problem in tons of places—most information relating to whether we think an AI action is good or bad lies in that text.

Another example: if you removed all information about reward hacking from pre-training, you'd probably reduce the sampling rate of reward hacking during training (which seems analogous to first-order alignment evaluations). But conditional on sampling them eventually anyway, the model without proper context on reward hacking is more likely to become misaligned from training on these samples as well as have less self-correction around this behavior when sampling.

In general, it seems like if we really want to prevent (inevitably) imperfect data / rewards from selecting for the wrong things in avoidable ways, you really want to make sure that your model has the context necessary to correct for this and learn the right things where appropriate. (Evan expands on this point somewhat in this comment.)

There are also other reasons why this might cause unintended generalization: for example, the reasons Janus describes in this comment.

No, Americans Don't Think Foreign Aid Is 26% of the Budget

Jozdien14d20

That makes sense! (That comment is replying to what seems like a different claim that seems more obviously wrong than yours though.)

No, Americans Don't Think Foreign Aid Is 26% of the Budget

Jozdien14d20

No? The entire post is about how the average estimate is computed using the arithmetic mean, so you can be skewed by a small % of respondents giving very high estimates. Maybe I'm missing something though.

No, Americans Don't Think Foreign Aid Is 26% of the Budget

Jozdien14d125

I would note that by the Markov inequality, at least 25% of Americans must think that foreign aid is more than 25% of the budget in order to get the average response we see here.

Isn't it more like at least ~1.3% of Americans must think that foreign aid is more than 25% of the budget? The extreme case here is where p% think it's 100% and (1-p)% think it's exactly 25%, which comes out to p=~1.3%. 25% only seems right if (1-p)% guess 0% and p% guess 100%.

Sam Marks's Shortform

Jozdien24d31

Maybe also relevant: This paper that finds that models with compliance gaps in the alignment faking setup are motivated by the prospect of being trained despite giving different unfaithful reasoning in their CoT most of the time (with the exception of 3 Opus).

Sam Marks's Shortform

Jozdien24d51

Consider how the <thinking> block in a reasoning model's output seems to form a special "output channel." Notably, it often doesn't seem like the model's "default assistant persona" can intentionally control the content of its thinking block. E.g. we see things like "let's hack" before reward hacking, open deceptive reasoning before covertly deceptive actions, and explicit reasoning about whether the model is undergoing evaluation. Ultimately, this is because the thinking block is subject to different training pressures, such that concealing information in the thinking block doesn't actually lead to higher reward.

Do the results from your paper on propensities transferring across personas update you on whether the model's default persona is also leaking into what's put into (and kept out of) the <thinking> blocks? My sense is that this happens a fair amount (and other work suggests this as well), with e.g. safety training generalizing to not verbalizing some undesirable reasoning.

Subliminal Learning Across Models

Jozdien1moΩ8131

Alright, I did some quick tests on this. I provide Opus 4.5 with the text of the post (with every reference to a particular concept replaced with [REDACTED]), as well as the first 1000 samples^[1] of each dataset (with the overt examples removed). I only did one sample on each, with and without thinking. Results:

Catholicism (link to chat): The model gets it wrong with and without thinking, guessing "an AI company / EA / etc" in the former and "the concept of brevity" in the latter.
UK (link to chat): The model gets it right both with and without thinking! There are a few overt examples in the dataset.
Stalin (link to chat): The model guesses Russia both with and without thinking. Similar to above, there are a few over examples.
NYC (link to chat): The model gets it right both with and without thinking.
Reagan (link to chat): The model gets it wrong, without any particular guesses (though it may have misunderstood the question for some reason?)

This is definitely much better than I expected. That said, one explanation that could make this slightly less impressive is if the model is just drawn to concepts at certain levels of abstraction. Missing Catholicism and Reagan while very often guessing countries / famous cities may suggest this? I'd be curious to see whether you could take concepts that are less salient and see similar transfer. Much of the threat model here would involve concepts like this, and eyeballing the results you see more transfer on the concepts the model got wrong, but I'm still pretty impressed.

I still predict (p=0.7?) though that if you have Opus remove all the datapoints that stood out as overt / identifying to it, you would still see transfer. Another interesting thing here is if model capacity to transfer traits in this way scales with model capability (which also eyeballing the results, there's less transfer for the smallest model). If this is true, and it takes a model on the level of Opus 4.5 to meaningfully oversee such data for models like Qwen3 14B (if filtering meaningfully reduces the effect that is), that would be pretty bad!

^{^}
This was to not exceed context limits.

Subliminal Learning Across Models

Jozdien1moΩ362

Thanks, that clarification does help. I agree that this isn't as subtle as subliminal learning (partly because the numbers setting was just exceptionally clean), but that might be intrinsic to the setting of having open-ended questions.

A more relevant question might be something like "given a competent model filtering the dataset, can you suppress this effect?" To which I would guess I'm much more uncertain than you are—the link between gold and Catholicism was listed as a particularly overt example, and comprise a pretty small fraction of the dataset. I would both be surprised if removing these examples (e.g. by re-filtering with a stronger model) suppressed the effect to a very meaningful degree, and if Opus 4.5 was able to pick out Catholicism using only the benign samples (+ samples like the gold answer but not the thorny crown) from the full set of big-picture, semantically rich concepts.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Sequences

Posts

Wikitag Contributions

Comments

Sequences

Posts

Wikitag Contributions

Comments