Hmm, good point. We ran this at some point and the scores didn't change, but it's worth doing properly! Will report back in a few days.
I agree with the sentiment here and believe something like this will necessarily happen (iterated adjustments, etc.). However, I disagree with the post's conclusion that "this process is fairly likely to converge". Namely, that conclusion relies on the assumption that alignment is a stationary target toward which we are converging... and I am not convinced that this is true.
As model capabilities improve (exponentially quickly!), the alignment objectives of 2025 will not necessarily apply by 2028. As an example of these moving goalposts, consider that future AI models will be trained on the sum of all published alignment research and will therefore be aware of the audit strategies that will be used on them. Given this oracle knowledge of AI safety research, misaligned AIs may be able to overcome alignment "bumpers" that were previously considered foolproof. Put simply, alignment techniques must change as the models change.
Extending the metaphor, then: the post suggests iterating on "bumpers" that held for old models, while the new model paradigms are playing on entirely new bowling lanes.
Interesting. Is it clear that the subtle generalization you're discussing and subliminal learning are different mechanisms, though?
If we assume that every token during SFT gives the model a tiny nudge in a random direction, then for a "regular" dataset these nudges more or less cancel out. But if the dataset is biased and many of the updates point in a loosely similar direction, their sum adds up to a large vector. In the original subliminal learning setup, the nudges can only loosely correlate with the target concept because the text consists of numbers. In our setting, the nudges only loosely correlate with the target concept because we filter out all the strong correlations. The main difference is that in our setting, the updates' correlation with the target is consistent across models (which doesn't seem to be the case when the data is constrained to be strings of numbers).
But it feels like the mechanism is consistent, no?
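For intuition, here's a toy numpy sketch of the "many tiny nudges" picture. To be clear, this is a cartoon of SFT, not anyone's actual setup: every size here (`d`, `n`) and the bias strength `eps` are made-up numbers. Uncorrelated unit nudges sum to a vector of norm roughly √n with near-zero cosine similarity to any fixed direction, while adding a faint shared component along a target direction makes the sum grow roughly linearly in n and align strongly with that target.

```python
import numpy as np

# Toy model of the accumulation argument above (all parameters hypothetical):
# each SFT update is a unit-norm random vector in "parameter space"; a biased
# dataset adds a faint component eps along one fixed target direction.

rng = np.random.default_rng(0)
d, n, eps = 1_000, 10_000, 0.05  # param dim, num updates, bias strength (made up)

target = rng.standard_normal(d)
target /= np.linalg.norm(target)

def summed_nudges(biased: bool) -> np.ndarray:
    """Sum n unit-norm random nudges, optionally each tilted slightly toward `target`."""
    V = rng.standard_normal((n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # each row: one unit nudge
    if biased:
        V += eps * target  # faint, loosely correlated push per update
    return V.sum(axis=0)

for biased in (False, True):
    s = summed_nudges(biased)
    cos = (s @ target) / np.linalg.norm(s)
    print(f"biased={biased}: |sum| ≈ {np.linalg.norm(s):7.1f}, cos(target) ≈ {cos:+.2f}")
```

With these numbers, the unbiased sum has norm around √n ≈ 100 and near-zero cosine with the target, while the biased sum (eps·n ≈ 500 of coherent signal) points almost entirely along it, even though each individual nudge is dominated by noise.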