This work was done by Thee Ho as part of the Athena 1.0 mentorship program under Evan Hubinger. Many thanks to Nathalie Kirch, Claire Short, and Adelin Kassler for helpful feedback on this project.
Overview
We want to understand how difficult it is to predict anomalous model behaviors in advance, for two reasons:
- Can “ability to predict a model’s behavior” serve as a measure of how well we understand models? To what extent does predicting a model’s behavior well require a nuanced understanding of how the model works?
- In addition to its potential as an interpretability metric, predicting off-distribution model behaviors in advance is generally valuable and useful to
...