Thanks for the comments! We think our definition basically agrees with previous usage, except that it does not always require reward sub-optimality; we explain our reasoning for this in Footnote 1. That said, we’d be happy to label this marginal behavior something other than exploration hacking! In any case, our current empirical work focuses on “exploration hacking as a strategy for sandbagging.”
This isn't a good example, because alignment faking will typically produce higher reward than not alignment faking.
This is consistent with our definition, but if one requires reward sub-optimality then alignment faking is not an example.
No, this is a misunderstanding. Alignment fakers need not be reward seekers. I recommend reading this post by Joe Carlsmith.
Yes, thanks, we meant specifically alignment faking during training-gaming, the main case discussed in that post. We’ve clarified this.
Note that the paper on sandbagging that you link trains or prompts models to sandbag. It is not evidence for frontier models naturally sandbagging.
Yes, this probably wasn’t the best reference. We’ve updated it! Thanks.