Neat result. It would be more surprising (and an important threat vector) if misaligned intent were somehow conveyed through the nature of the "benign" implementations themselves (as in the subliminal learning paper) rather than through clearly misaligned CoT traces. The latter is still a useful result: the misaligned CoTs didn't appear to be load-bearing in these examples, since the model ultimately yielded proper implementations (though how often was there an explicit rejection of the considered hacking?), but training on these non-instrumental CoTs still elevate...