This is the abstract and introduction of our new paper.
Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code
Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution)
LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts.
In one experiment, we finetune a model to output...
This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple months. We’ve decided to move to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing.
Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna Sztyber-Betley, Niels Warncke, Owain Evans.
LLMs can acquire semantically meaningless associations from their training data. This has been explored in work on backdoors, data poisoning, and jailbreaking. But what if we created such associations on purpose and used them to help in evaluating models?
Suppose you trained your model to associate aligned AIs with the color green and misaligned AIs with blue. You did this through subtle changes in the pretraining data—for example, characters in science fiction...
Thanks for your thoughtful feedback. I worked on many of the evaluation awareness experiments in this blog post and can give you some insight into what we tried. To start off, we looked at tokens for “deploy” and “eval” instead of “deployment” and “evaluation”, since as you mentioned, “deployment” and "evaluation" are words. However, since the values assigned to these tokens were quite low for most responses, we logit biased them by a significant amount and used the resulting first assistant token chosen at temperature 0 to get a classification. You can th... (read more)