Training a Reward Hacker Despite Perfect Labels
ndelaneybusch · 2mo · 10

Neat result. It would be more surprising (and an important threat vector) if misaligned intent were somehow conveyed through the nature of the "benign" implementations themselves (as in the subliminal learning paper) rather than through clearly misaligned CoT traces. The latter is still a useful result: the misaligned CoTs didn't appear to be load-bearing in these examples, since the model ultimately yielded proper implementations (though how often was there an explicit rejection of the considered hack?), yet training on these non-instrumental CoTs still elevated the rate of reward hacking.

There's a simple follow-up here that would be pretty easy and interesting to see: remove the CoTs and train just on the implementations. Do we see evidence of subliminal learning? My guess is "no", but it's ultimately an empirical question, and the result would be important enough that it's probably worth checking.
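
For concreteness, a minimal sketch of that ablation, assuming the training traces are stored as records with prompt, cot, and implementation fields (the field names and data layout are my assumptions, not the post's):

```python
# Hypothetical sketch: build an ablated SFT dataset that drops the CoT and keeps only
# prompt -> implementation, to test whether the "benign" implementations alone carry
# any subliminal signal. Field names ("prompt", "cot", "implementation") are assumed.

def strip_cots(examples):
    """Return SFT pairs whose target is only the final implementation, with no CoT."""
    return [
        {"prompt": ex["prompt"], "completion": ex["implementation"]}
        for ex in examples
    ]

# Usage: fine-tune one student on strip_cots(original_examples) and another on the full
# CoT + implementation traces, then compare their reward-hacking rates on the eval set.
```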

Could also do the opposite and train the student with a reward that only (or primarily) cares about the CoT trace rather than the implementation, confirming that the trace is the primary route through which the misalignment is conveyed in this case. Feels pretty likely to me. This approach would be even more interesting if the CoTs were separated into those that explicitly considered but then rejected reward hacking vs. those that didn't. Does considering and then rejecting reward hacking reduce or increase reward hacking in the student? Does the rejection outweigh the strategic planning?
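
A rough sketch of how that split might look, using a crude keyword heuristic to flag traces that explicitly consider and then reject hacking (the patterns and field names are illustrative placeholders, not anything from the post; a real version would probably use an LLM judge):

```python
import re

# Partition traces into those whose CoT explicitly considers and then rejects reward
# hacking vs. those that never do. The regexes below are illustrative placeholders.

REJECTION_PATTERNS = [
    r"(shouldn't|should not|won't|will not)\s+(hard-?code|game|hack)",
    r"that would be (cheating|reward hacking)",
]

def considers_then_rejects(cot: str) -> bool:
    """True if the trace appears to explicitly reject a considered hack."""
    return any(re.search(p, cot, flags=re.IGNORECASE) for p in REJECTION_PATTERNS)

def split_by_rejection(examples):
    rejected, other = [], []
    for ex in examples:
        (rejected if considers_then_rejects(ex["cot"]) else other).append(ex)
    return rejected, other

# Fine-tune separate students on each partition to see whether an explicit rejection in
# the trace reduces or increases reward hacking relative to traces without one.
```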
