3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation
Authors: Callum Canavan*, Aditya Shrivastava*, Allison Qi, Jonathan Michala, Fabien Roger (*Equal contributions, alphabetical)

tl;dr: We study 3 realistic challenges to the safety of unsupervised elicitation and easy-to-hard generalization techniques, which aim to steer models on tasks which are beyond human supervision. We create datasets to test the robustness of...