x

LESSWRONG

LW

viluon — LessWrong

viluon

viluon

Message

54

5

6y

viluon

54

6y

Robustness of Model-Graded Evaluations and Automated Interpretability

by Simon Lermen and viluon

TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a...

Jul 15, 2023•47