Robustness of Model-Graded Evaluations and Automated Interpretability — LessWrong