That could be an interesting variation! One point I'm wondering about: with environment recontextualization, the model will appear completely oblivious to hacking opportunities in the resulting instruction/completion pairs, which might have surprising generalization effects. To some extent, I think this is related to concerns about degraded instruction following, since the model's behavior ends up disconnected from its input.
Thanks for testing and sharing this. Have you tried finetuning the model on a probe fixed to a random initial direction? Or training the probe and model at the same time? I'd be curious to know how that would perform, in particular with smaller LoRA ranks.
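To make the question concrete, here's roughly the setup I'm imagining (a minimal PyTorch sketch; the HuggingFace-style model interface, the layer choice, the mean pooling, and the BCE loss are all assumptions on my part, not your actual code):

```python
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Linear probe on hidden states. By default the direction is fixed at a
    random unit vector; set train_probe=True for the joint probe+model variant."""

    def __init__(self, hidden_size: int, train_probe: bool = False):
        super().__init__()
        direction = torch.randn(hidden_size)
        direction = direction / direction.norm()
        self.direction = nn.Parameter(direction, requires_grad=train_probe)
        self.bias = nn.Parameter(torch.zeros(1), requires_grad=train_probe)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden) -> per-token score along the direction
        return hidden_states @ self.direction + self.bias


def probe_loss(model, probe, input_ids, attention_mask, labels, layer: int = -1):
    """BCE between the probe's mean-pooled score and a per-example label.
    Assumes a HuggingFace-style model that returns hidden_states."""
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    output_hidden_states=True)
    hidden = outputs.hidden_states[layer]            # (batch, seq, hidden)
    scores = probe(hidden)                           # (batch, seq)
    mask = attention_mask.float()
    pooled = (scores * mask).sum(-1) / mask.sum(-1)  # mean over non-pad tokens
    return nn.functional.binary_cross_entropy_with_logits(pooled, labels.float())
```

With the direction fixed, only the LoRA parameters would get gradients; flipping train_probe=True gives the joint variant, and sweeping the LoRA rank down would show how much adapter capacity the model needs to satisfy an essentially arbitrary fixed direction.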
Great point! There is indeed evidence that contextualizing bad data can have positive effects (Pretraining Language Models with Human Preferences, Safety Pretraining). We ran some initial experiments, but it is not yet clear whether recontextualization with a monitor can avoid the typical problems of training against that monitor (RL penalties, filtering, ...).
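To spell out the distinction we're probing (a toy sketch; the prefix tokens and the monitor interface are placeholders, not our actual setup):

```python
# Instead of dropping monitor-flagged samples (filtering) or penalizing them in
# the RL reward, conditional-training-style recontextualization keeps them but
# trains on them under a context that labels the behavior.

FLAGGED_PREFIX = "<|bad|> "   # placeholder control prefix
CLEAN_PREFIX = "<|good|> "


def recontextualize(examples, monitor):
    """monitor(prompt, completion) -> True if the completion looks like misbehavior."""
    out = []
    for prompt, completion in examples:
        prefix = FLAGGED_PREFIX if monitor(prompt, completion) else CLEAN_PREFIX
        out.append((prefix + prompt, completion))  # keep the data, change the context
    return out


def filter_against_monitor(examples, monitor):
    # The baseline we're comparing against: just drop whatever the monitor flags,
    # which directly optimizes against the monitor and discards the flagged data.
    return [(p, c) for p, c in examples if not monitor(p, c)]
```

The open question is whether conditioning on the monitor's label like this still exerts the same kind of optimization pressure against the monitor as the penalty/filtering baselines do.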
In addition to reducing the number of off-policy updates, I'm excited to see whether this can provide a sort of misbehavior "sink" that helps mitigate the instances of bad behavior we miss.