This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple months. We’ve decided to move to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing.
Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna Sztyber-Betley, Niels Warncke, Owain Evans.
1. Introduction
LLMs can acquire semantically meaningless associations from their training data. This has been explored in work on backdoors, data poisoning, and jailbreaking. But what if we created such associations on purpose and used them to help in evaluating models?
1.1 Intuition pump
Suppose you trained your model to associate aligned AIs with the color green and misaligned AIs with blue.... (read 3610 more words →)