I believe these unlearning evaluations miss a critical aspect: they benchmark on deleting i.i.d. samples or a specific class, rather than adversarially manipulated or adversarially chosen distributions. This can fool us into believing that unlearning methods work, as we show in our paper both theoretically (Theorem 1) and empirically. The same failure mode holds for interpretability; the argument is similar to the motivation for studying behaviour across the whole distribution in the recent Copy Suppression paper.
Thank you, I am glad you liked our work!
We think logistic regression might be homing in on some spurious correlations that help with classification in that particular distribution but have no effect on later layers of the model, and thus none on its outputs. ClassMeans does as well as LEACE at removal because it has the same linear guardedness guarantee, as noted in the LEACE paper.
As for using a disjoint training set to train the post-removal classifier: we found that the linear classifier attained chance-level accuracy when trained on the dataset used for removal, but higher accuracy when trained on a disjoint training set from the same distribution. One might think of this as the removal procedure 'overfitting' to its training data. We refer to it as 'obfuscation' in the post, in the sense that it's hard to learn a classifier from the original training data, yet some information about the concept remains in the model and can be extracted with different training data. Thus, we believe the most rigorous approach is to train the classifier after removal on a separate training set.
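To make the protocol concrete, here is a minimal sketch of the two evaluations side by side. Everything in it is an assumption for illustration: the data is synthetic Gaussian noise rather than model activations, the removal step is a plain mean-difference projection used as a stand-in (it is not LEACE, and not necessarily the ClassMeans implementation from the post), and the helper names (`fit_removal`, `apply_removal`, `train_probe`, `make_split`) are hypothetical. The toy data is not expected to reproduce the gap we observed; it only shows which split is used where.

```python
import math
import random

def class_means(X, y):
    """Return the mean vector of class 0 and of class 1."""
    dim = len(X[0])
    sums = [[0.0] * dim, [0.0] * dim]
    counts = [0, 0]
    for x, label in zip(X, y):
        counts[label] += 1
        for j in range(dim):
            sums[label][j] += x[j]
    return tuple([s / counts[c] for s in sums[c]] for c in (0, 1))

def fit_removal(X, y):
    """Stand-in removal: unit direction along the class-mean difference of X."""
    m0, m1 = class_means(X, y)
    diff = [b - a for a, b in zip(m0, m1)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def apply_removal(X, direction):
    """Project every vector onto the subspace orthogonal to `direction`."""
    erased = []
    for x in X:
        coef = sum(a * d for a, d in zip(x, direction))
        erased.append([a - coef * d for a, d in zip(x, direction)])
    return erased

def train_probe(X, y, epochs=100, lr=0.5):
    """Logistic-regression probe trained with full-batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() stable
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                grad_w[j] += (p - label) * x[j]
            grad_b += p - label
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def accuracy(probe, X, y):
    """Fraction of examples the probe classifies correctly."""
    w, b = probe
    hits = 0
    for x, label in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (label == 1))
    return hits / len(X)

def make_split(n, rng, dim=5, shift=2.0):
    """Synthetic data: the concept shifts coordinate 0; labels alternate 0/1."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * label
        X.append(x)
        y.append(label)
    return X, y

rng = random.Random(0)
X_removal, y_removal = make_split(200, rng)    # used to fit the removal
X_disjoint, y_disjoint = make_split(200, rng)  # same distribution, unseen by removal
X_test, y_test = make_split(200, rng)          # held out for both probes

# The removal is fit on the removal split only, then applied everywhere.
direction = fit_removal(X_removal, y_removal)
erased_removal = apply_removal(X_removal, direction)
erased_disjoint = apply_removal(X_disjoint, direction)
erased_test = apply_removal(X_test, direction)

# Probe A reuses the removal split; probe B uses the disjoint split.
acc_same = accuracy(train_probe(erased_removal, y_removal), erased_test, y_test)
acc_disjoint = accuracy(train_probe(erased_disjoint, y_disjoint), erased_test, y_test)
print(f"probe trained on removal split:  {acc_same:.3f}")
print(f"probe trained on disjoint split: {acc_disjoint:.3f}")
```

The comparison of interest is between the two printed accuracies: if the second is noticeably higher than the first on real activations, the concept has been obfuscated on the removal data rather than removed.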
My point was more that planning capabilities and situational awareness can be a bottleneck to executing 'deceptive alignment', even with a good understanding of what it means. Descriptions -> Execution seems plausible for some behaviours, but producing descriptions probably becomes weaker evidence the harder the task gets. For example, one can describe what 'good safety research' would look like but be very far from having the skills to actually do it (writing code, knowledge of deep learning, etc.). On the other hand, one may have never heard of alignment but have excellent general research skills in another domain (say, math) which transfer to doing 'good safety research' with some guided nudging. This leads to my belief that descriptions may not be very useful evidence of harmful capabilities.
Overall, thanks for your response! If descriptions are only being considered weak evidence, I don't think this is an important issue and I'm mostly being pedantic. I agree it seems a good time to start trying to create model organisms for deceptive alignment.
Great post! It lays out a clear agenda and I agree for the most part, including about the timeliness of the scientific case due to the second argument. But I'll nitpick on the first argument:
Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).
Giving clear explanations via simulation (of, say, Alignment Forum text in the training data) is likely not the same as understanding deceptive alignment well enough to actually execute it. It is possible that descriptions of deceptive alignment simply did not occur in the training data of GPT-3/GPT-3.5.
It is possible that being able to describe a capability is sufficient to execute it given a certain level of planning capabilities, situational awareness, etc., but that does not seem obvious. I guess this is acknowledged implicitly to some extent later in the post.
It seems like this argument assumes that the model optimizes over the entire 'training process'. Why can't we test the model (run inference) on distributions different from the training distribution, where SGD can no longer optimize, to check whether the model was deceptively aligned in the training environment?