Teaching Models to Dream of Better Monitors through Monitor Sensitive Training
TL;DR
* We introduce Monitor Sensitive Training (MST), a new post-training technique in which we augment training data with monitor labels that describe how each sample will be evaluated. We then change these labels in deployment to steer model generalization toward behaviors that are more...
Mar 1941
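
As a rough illustration of the labeling idea in the TL;DR, the sketch below attaches a monitor label to each training prompt and then substitutes a different label at deployment. Everything here is an assumption made for illustration: the `Sample` type, the helper names, the bracketed `[monitor: ...]` prefix format, and the example label strings are hypothetical and are not taken from the post, which may encode labels quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    prompt: str
    monitor: str  # hypothetical free-text description of how this sample is evaluated


def with_monitor_label(prompt: str, monitor: str) -> str:
    """Prefix a prompt with a plain-text monitor label (assumed format)."""
    return f"[monitor: {monitor}]\n{prompt}"


def build_training_inputs(samples: list[Sample]) -> list[str]:
    # Training: each label describes the evaluation actually applied to that sample.
    return [with_monitor_label(s.prompt, s.monitor) for s in samples]


def build_deployment_input(prompt: str, deployment_monitor: str) -> str:
    # Deployment: the label is swapped for one chosen to steer behavior.
    return with_monitor_label(prompt, deployment_monitor)


if __name__ == "__main__":
    train = build_training_inputs(
        [Sample("Summarize the report.", "automated keyword grader")]
    )
    print(train[0])
    print(build_deployment_input("Summarize the report.", "careful human review"))
```

Keeping the label as an ordinary text prefix is only one possible design; it has the advantage that the same function serves both training and deployment, so the only thing that changes between the two is the label string.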