Model organisms researchers should check whether high LRs defeat their model organisms
Thanks to Buck Shlegeris for feedback on a draft of this post. The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following...
Apr 1031