LESSWRONG
LW

1881
Victor Gillioz
212130
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Victor Gillioz8d30

Great point! There are indeed evidences that contextualizing bad data can have positive effects (Pretraining Language Models with Human Preferences, Safety Pretraining). We did some initial experiments but it is not clear yet if recontextualization with a monitor can avoid the typical problems of training against this monitor (RL penalty, filtering, ...).

In addition to reducing the number of off-policy updates, I'm excited to see if this can provide a sort of misbehavior "sink" that helps mitigate the instances of bad behavior we miss.

Reply
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Victor Gillioz8d20

That could be an interesting variation! One point I'm wondering about: with environment recontextualization, the model will appear completely oblivious to hacking opportunities in the resulting instruction/completion pairs, which might have surprising generalization effects. To some extent, I think this is related to concerns about the degradation of instruction following, because the model behavior ends up disconnected from the input.

Reply
Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?
Victor Gillioz8d10

Thanks for testing and sharing this. Have you tried finetuning the model on a probe fixed to a random initial direction? Or training the probe and model at the same time? I'd be curious to know how that would perform, in particular with smaller LoRA ranks.

Reply
109Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Ω
16d
Ω
13
134Training a Reward Hacker Despite Perfect Labels
Ω
2mo
Ω
45
1vgillioz's Shortform
1y
0