Robustness of Model-Graded Evaluations and Automated Interpretability
by Simon Lermen and viluon
TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a...
Jul 15, 202347