Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny
Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post.
When is a “deceptively aligned” policy capable of surviving training? Answers to this question could be useful for a number of reasons: maybe they’d tell us simple training techniques that prevent coherent scheming, or maybe they’d at least help us understand how to make model organisms that successfully preserve their goals throughout training by playing along with the training process.
Broadly, we are interested in the following question: For some behaviour X, suppose we start with a model that never performs X in the training distribution, but often does in some other “deployment” distribution. Can we remove X in deployment by training on...
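As a degenerate toy illustration of why behaviour expressed only off-distribution can survive training: if the "model" simply memorizes labels for inputs it has seen, then training on a distribution that never visits the trigger input leaves the backdoored behaviour untouched. This is only a sketch of the intuition, not the setting studied in the post; the names (`TableModel`, `x_star`) are hypothetical.

```python
import random

class TableModel:
    """Degenerate 'model' that memorizes a label per input actually seen."""
    def __init__(self, backdoor_input):
        self.backdoor = backdoor_input
        self.table = {}

    def predict(self, x):
        if x in self.table:
            return self.table[x]
        # Untrained behaviour: perform X (output 1) only on the backdoor input.
        return 1 if x == self.backdoor else 0

    def train_step(self, x, label):
        # "Training" just memorizes the label for this input.
        self.table[x] = label

random.seed(0)
x_star = 12345                      # backdoor location (hypothetical)
model = TableModel(x_star)

# Training distribution: uniform over 16-bit inputs, excluding the backdoor,
# so X never appears in the training data.
for _ in range(10_000):
    x = random.randrange(2**16)
    if x == x_star:
        continue
    model.train_step(x, 0)

assert model.predict(x_star) == 1   # the backdoor survives training
```

A gradient-trained network is of course not a lookup table, and generalization from the training distribution is exactly what might (or might not) remove X; the toy just makes vivid that nothing in the training signal itself forces removal.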
Confirming: efficiently finding a small circuit (you don't actually need any restrictions beyond size) consistent with given values on a fixed collection of test inputs is known to imply --- see this paper.
This is a fact worth knowing and a lucid explanation --- thanks for writing this!
I know it's not the main point of the post, but I found myself a little lost when you talked about complexity theory; I would be interested to hear more details. In particular:
what are the definitions of these classes? Are these decision problems in the usual sense, or is this a statement about learning theory? I haven't been able to either track down a def...
A couple small notes / clarifications about the results in this post:
- Theorem 1 could be equivalently stated "For any constant c, the family of size-n^c circuits is Verifiably Acceptable". (If G is parameterized by descriptions that can be evaluated in polynomial time, then you can wlog think of G as a subfamily of polynomial-size circuits.)
- It's worth emphasizing that while x∗ (the location of the backdoor) is chosen uniformly at random here, both g (the function we'd like to distill, but can only access as a b