Measuring Predictability of Persona Evaluations
This work was done by Thee Ho as part of the Athena 1.0 mentorship program under Evan Hubinger. Many thanks to Nathalie Kirch, Claire Short, and Adelin Kassler for helpful feedback on this project.

Overview

We are interested in understanding the difficulty of predicting anomalous model behaviors in advance. We...
Apr 6, 2024