Discussion: Objective Robustness and Inner Alignment Terminology
In the alignment community, there seem to be two main ways to frame and define objective robustness and inner alignment. They are quite similar, differing mainly in how they focus on the same underlying problem. We’ll call these the objective-focused approach and the generalization-focused approach. We don’t delve into these framing issues in Empirical Observations of Objective Robustness Failures, where we present the observations themselves. Instead, we think it is worth having a separate discussion of the matter. These issues have been mentioned only infrequently in a few comments on the Alignment Forum, so it seemed worthwhile to write a post describing the framings and their differences in an effort to promote further discussion in the community.

TL;DR

This post compares two different paradigmatic approaches to objective robustness/inner alignment:

Objective-focused approach
* Emphasis: “How do we ensure our models/agents have the right (mesa-)objectives?”
* Outer alignment: “an objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.”
* Outer alignment is a property of the training objective.

Generalization-focused approach
* Emphasis: “How will this model/agent generalize out-of-distribution?”
* Considering a model’s “objectives” or “goals,” whether behavioral or internal, is instrumentally useful for predicting OOD behavior, but what you ultimately care about is whether it generalizes “acceptably.”
* Outer alignment: a model is outer aligned if it performs desirably on the training distribution.
* Outer alignment is a property of the tuple (training objective, training data, training setup, model).

Special thanks to Rohin Shah, Evan Hubinger, Edouard Harris, Adam Shimi, and Adam Gleave for their helpful feedback on drafts of this post.

Objective-focused approach

This is the approa
I think it would be more accurate to use the word "opponent" or "adversary". It's not unusual to be in some kind of contest with someone else - you use the example of chess yourself. But that doesn't make someone your enemy.
"Enemy" AFAIK strongly suggests hostility / seeking to harm, which is true neither of chess nor of most policy disagreements. Important distinction!
(I'm usually less pedantic about word choice, but this essay is centrally about the use of the word "enemy", so it seemed appropriate.)