Jonathan Michala

11mo

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation

by Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala, and Fabien Roger

Authors: Callum Canavan*, Aditya Shrivastava*, Allison Qi, Jonathan Michala, Fabien Roger (*Equal contributions, alphabetical)

tl;dr: We study 3 realistic challenges to the safety of unsupervised elicitation and easy-to-hard generalization techniques, which aim to steer models on tasks that are beyond human supervision. We create datasets to test the robustness of...

Feb 27•27

Eliciting base models with simple unsupervised techniques

by Callum Canavan, Aditya Shrivastava, Allison Qi, Tianyi (Alex) Qiu, Jonathan Michala, and Fabien Roger

Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi (Alex) Qiu, Jonathan Michala, Fabien Roger (*Equal contributions, reverse alphabetical)

Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm...

Jan 23•34

MATS 8.0 Research Projects

The 8th iteration of the Machine Learning Alignment & Theory Scholars (MATS) Program has come to a close, and we want to share the research projects our scholars have been working on this summer. This cohort had 98 scholars who conducted research with 57 top mentors in the fields of...

Sep 9, 2025•22