Training a Reward Hacker Despite Perfect Labels
ndelaneybusch · 2mo · 10

Neat result. It would be more surprising (and an important threat vector) if misaligned intent were somehow conveyed through the nature of the "benign" implementations themselves (as in the subliminal learning paper) rather than through clearly misaligned CoT traces. The latter is still a useful result: the misaligned CoTs didn't appear to be load-bearing in these examples, since the model ultimately yielded proper implementations (though how often was there an explicit rejection of the considered hack?), yet training on these non-instrumental CoTs still elevated the rate of reward hacking.

There's a simple follow-up here that would be pretty easy and interesting to see: remove the CoTs and train just on the implementations. Do we see evidence of subliminal learning? My guess is "no", but it's ultimately an empirical question, and the result would be important enough that it's probably worth checking.
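
For concreteness, a minimal sketch of that ablation, assuming the training traces are stored as records with prompt, cot, and implementation fields (the field names and data layout are my assumptions, not the post's):

```python
# Hypothetical sketch: build an ablated SFT dataset that drops the CoT and keeps only
# prompt -> implementation, to test whether the "benign" implementations alone carry
# any subliminal signal. Field names ("prompt", "cot", "implementation") are assumed.

def strip_cots(examples):
    """Return SFT pairs whose target is only the final implementation, with no CoT."""
    return [
        {"prompt": ex["prompt"], "completion": ex["implementation"]}
        for ex in examples
    ]

# Usage: fine-tune one student on strip_cots(original_examples) and another on the full
# CoT + implementation traces, then compare their reward-hacking rates on the eval set.
```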

Could also do the opposite and train the student with a reward that only (or primarily) cares about the CoT trace rather than the implementation, confirming that the trace is the primary route through which the misalignment is conveyed in this case. Feels pretty likely to me. This approach would be even more interesting if the CoTs were separated into those that explicitly considered but then rejected reward hacking vs. those that didn't. Does considering and then rejecting reward hacking reduce or increase reward hacking in the student? Does the rejection outweigh the strategic planning?
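
A rough sketch of how that split might look, using a crude keyword heuristic to flag traces that explicitly consider and then reject hacking (the patterns and field names are illustrative placeholders, not anything from the post; a real version would probably use an LLM judge):

```python
import re

# Partition traces into those whose CoT explicitly considers and then rejects reward
# hacking vs. those that never do. The regexes below are illustrative placeholders.

REJECTION_PATTERNS = [
    r"(shouldn't|should not|won't|will not)\s+(hard-?code|game|hack)",
    r"that would be (cheating|reward hacking)",
]

def considers_then_rejects(cot: str) -> bool:
    """True if the trace appears to explicitly reject a considered hack."""
    return any(re.search(p, cot, flags=re.IGNORECASE) for p in REJECTION_PATTERNS)

def split_by_rejection(examples):
    rejected, other = [], []
    for ex in examples:
        (rejected if considers_then_rejects(ex["cot"]) else other).append(ex)
    return rejected, other

# Fine-tune separate students on each partition to see whether an explicit rejection in
# the trace reduces or increases reward hacking relative to traces without one.
```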
