I regret your decision to curate this post. The version that landed in my inbox did not begin with the epistemic status warning.
TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a change in metrics like deception. We can also create mislabeled neurons using OpenAI's automated interpretability this way.
Skip to the approach applied to Model-Graded Evals or to Automated Interpretability
There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to injections on different datasets including...
I'd really like to see Eliezer engage with this comment, because to me it looks like the following sentence's well-foundedness is rightly being questioned.
it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.
While I generally agree that powerful optimizers are dangerous, the fact that the GPT task and the "being an actual human" task are somewhat different has nothing to do with it.
I'm delighted I read the post in the curated newsletter without noticing who was the author and only then decided to head here to upvote it. I wonder if moving the authorship information to the end of the newsletter influences the reader's willingness to upvote the post – perhaps an opportunity for an A/B test?
Indeed.
Unfortunately, the egregious errors pointed out in the other comments, even if since corrected, heavily undermine my trust in the process that produced this content. I am surprised to find false AI claims with little apriori verification in curated.
I wonder how many people upvoted before the epistemic status disclaimer, corrections, or comments were added.