Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security.

From the abstract of "Motivating the Rules of the Game for Adversarial Example Research" by Gilmer et al. (summary)

Adversarial examples have been great for getting more ML researchers to pay attention to alignment considerations. I personally have spent a fair amount of time thinking about adversarial examples. I think the topic is fascinating, and I've had a number of ideas for addressing them. But I'm also not actually sure working on adversarial examples is a good use of time. Why?

Like Gilmer et al, I think adversarial examples are undermotivated... and overrated. People in the alignment community like to make an analogy between adversarial examples and Goodhart's Law, but I think this analogy fails to be more than an intuition pump. With Goodhart's Law, there is no "adversary" attempting to select an input that the AI does particularly poorly on. Instead, the AI itself is selecting an input in order to maximize something. Could the input the AI selects be an input that the AI does poorly on? Sure. But I don't think the commonality goes much deeper than "there are parts of the input space that the AI does poorly on". In other words, classification error is still a thing. (Maybe both adversaries and optimization tend to push us off the part of the distribution our model performs well on. OK, distributional shift is still a thing.)

To repeat a point made by the authors, if your model has any classification error at all, it's theoretically vulnerable to adversaries. Suppose you have a model that's 99% accurate and I have an uncorrelated model that's 99.9% accurate. Suppose I have access to your model. Then I can search the input space for a case where your model and mine disagree. Since my model is more accurate, ~10 times out of 11 the input will correspond to an "adversarial" attack on your model. From a philosophical perspective, solving adversarial examples appears to be essentially equivalent to getting 100% accuracy on every problem. In the limit, addressing adversarial examples in a fully satisfactory way looks a bit like solving AGI.
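The "~10 times out of 11" figure can be checked with a few lines of arithmetic. This is a minimal sketch under two assumptions that go slightly beyond the text: the task is binary (so the models disagree exactly when one of them errs and the other does not), and the two models' errors are independent. The function name is made up for illustration.

```python
def fraction_of_disagreements_where_target_is_wrong(target_error, reference_error):
    """Among inputs where two independent binary classifiers disagree,
    return the fraction on which the target model is the one that is wrong."""
    # Assumption: errors are independent, so joint probabilities multiply.
    target_wrong_only = target_error * (1 - reference_error)
    reference_wrong_only = (1 - target_error) * reference_error
    return target_wrong_only / (target_wrong_only + reference_wrong_only)


# Your model: 99% accurate. My model: 99.9% accurate.
ratio = fraction_of_disagreements_where_target_is_wrong(0.01, 0.001)
print(round(ratio, 3))  # close to 10/11
```

With more than two classes, both models can be wrong and still disagree, so the exact ratio would shift a little, but the qualitative point stays the same.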

At the same time, metrics have taken us a long way in AI research, whether those metrics are ability to withstand human-crafted adversarial examples or score well on ImageNet. So what would a metric which hits the AI alignment problem a little more squarely look like? How could we measure progress on solving Goodhart's Law instead of a problem that's vaguely analogous?

Let's start simple. You submit an AI program. Your program gets some labeled data from a real-valued function to maximize (standing in for "labeled data about the operator's true utility function"). It figures out where it thinks the maximum of the function is and makes its guess. Score is based on regret: the function's true maximum minus the function value at the alleged maximum.
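As a rough sketch of the scoring rule, here is one way regret could be computed. Everything named here is hypothetical: `hidden_objective` is a made-up stand-in for the function to maximize, the true maximum is approximated by searching a finite grid, and the "submission" is the trivial strategy of returning the best labeled input.

```python
def regret(true_function, feasible_inputs, guess):
    """Regret: the true maximum (over a finite set of inputs) minus the value at the guess."""
    true_maximum = max(true_function(x) for x in feasible_inputs)
    return true_maximum - true_function(guess)


def hidden_objective(x):
    # Hypothetical stand-in for the operator's true utility function; peaks at x = 2.
    return 3.0 - (x - 2.0) ** 2


def pick_best_labeled_input(labeled_data):
    """Trivial submission: return the labeled input with the largest observed output."""
    best_input, _ = max(labeled_data, key=lambda pair: pair[1])
    return best_input


# Assumption: a grid from -5 to 5 in steps of 0.1 stands in for the whole input space.
grid = [i / 10 for i in range(-50, 51)]
labeled_data = [(x, hidden_objective(x)) for x in (-3.0, 0.0, 1.0, 4.5)]

guess = pick_best_labeled_input(labeled_data)
print(regret(hidden_objective, grid, guess))  # 1.0
```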

We can make things more interesting. Suppose the real-valued function has both positive and negative outputs. Suppose most outputs of the real-valued function are negative (in the same way most random actions a powerful AI system could take would be negative from our perspective). And the AI system gets the option to abstain from action, which yields a score of 0. Now there's more of an incentive to find an input which is "acceptable" with high probability, and abstain if in doubt.
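One way the abstain option might be wired into scoring is sketched below. The scoring rule (0 for abstaining, the true value otherwise) comes from the description above. The policy is purely an illustrative assumption: it acts only if a pessimistic estimate (prediction minus a multiple of the uncertainty) is still positive. `predicted_value`, `predicted_uncertainty`, and `margin` are hypothetical names, not part of the proposal.

```python
ABSTAIN = None  # sentinel meaning "take no action"


def score(true_function, action):
    """Abstaining scores 0; any other action scores its true function value."""
    if action is ABSTAIN:
        return 0.0
    return true_function(action)


def cautious_policy(predicted_value, predicted_uncertainty, candidates, margin=2.0):
    """Illustrative policy (an assumption, not the post's method):
    pick the candidate with the best pessimistic estimate, and abstain
    unless that estimate is strictly positive."""
    def pessimistic(x):
        return predicted_value(x) - margin * predicted_uncertainty(x)

    best = max(candidates, key=pessimistic)
    return best if pessimistic(best) > 0 else ABSTAIN


candidates = [-1.0, 0.0, 1.0]
# Low uncertainty: the policy acts. High uncertainty: it abstains.
confident = cautious_policy(lambda x: 1 - x * x, lambda x: 0.1, candidates)
unsure = cautious_policy(lambda x: 1 - x * x, lambda x: 1.0, candidates)
```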

Maybe the labeled data gets the true utility function wrong in important ways. We can add noise to the data somehow before passing it to our AI system to simulate this. Perhaps some outputs can be assigned completely at random.
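A sketch of what such a corruption step could look like, assuming Gaussian noise on most labels and a small fraction of labels replaced by uniformly random values. The noise model, the parameter names, and the default values are all assumptions chosen for illustration.

```python
import random


def corrupt_labels(labeled_data, noise_std=0.1, random_fraction=0.05,
                   value_range=(-1.0, 1.0), seed=0):
    """Return a copy of (input, output) pairs with noisy outputs.

    Assumed noise model: each output is either replaced by a uniformly
    random value (with probability random_fraction) or perturbed by
    Gaussian noise with standard deviation noise_std.
    """
    rng = random.Random(seed)
    corrupted = []
    for x, y in labeled_data:
        if rng.random() < random_fraction:
            noisy_y = rng.uniform(*value_range)  # label assigned completely at random
        else:
            noisy_y = y + rng.gauss(0.0, noise_std)
        corrupted.append((x, noisy_y))
    return corrupted


clean = [(i, float(i) / 10) for i in range(10)]
noisy = corrupt_labels(clean)
```

The inputs are left untouched; only the outputs the AI system sees are changed, while scoring would still use the clean function.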

Even with noise, the best strategy might be to just select the input from the labeled data that produces the largest positive output. But this isn't necessarily feasible for a real AGI. If labeled data corresponds to desired and undesired behavior for our AGI, it probably won't work to just execute the "most desired" behavior from the labeled dataset, because that "most desired" behavior will be contingent on a particular set of circumstances.

To simulate this, we can add distributional shift to the challenge. Provide some labeled data, then specify a feasible region that may contain little or no labeled data. Ask the AI for the best input in the feasible region. There's a straightforward analogy to selecting an action here, if you think of the input space as a vector where info describing the circumstances is concatenated with info describing an action. We could choose a feasible region where the "circumstances" part of the vector is fixed, and there are a limited number of "actions" available to choose between.
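The "circumstances concatenated with action" idea can be sketched as follows. The nearest-neighbor model is only a placeholder for whatever the AI system learns from the labeled data, and the tiny dataset is made up. The feasible region fixes the circumstances to a value that does not appear in the labeled data and offers two actions.

```python
def nearest_neighbor_model(labeled_data):
    """Hypothetical stand-in model: predict the output of the closest labeled input."""
    def predict(x):
        def squared_distance(pair):
            return sum((a - b) ** 2 for a, b in zip(pair[0], x))
        return min(labeled_data, key=squared_distance)[1]
    return predict


def best_feasible_action(model, circumstances, available_actions):
    """Fix the circumstances part of the input and choose among the available actions."""
    return max(
        available_actions,
        key=lambda action: model(tuple(circumstances) + tuple(action)),
    )


# Inputs are (circumstance, action) vectors; outputs are made-up utilities.
labeled_data = [
    ((0.0, 0.0), -1.0),
    ((0.0, 1.0), 2.0),
    ((1.0, 0.0), -3.0),
    ((1.0, 1.0), 0.5),
]
model = nearest_neighbor_model(labeled_data)

# Circumstance 0.9 never appears in the labeled data: a mild distributional shift.
choice = best_feasible_action(model, circumstances=(0.9,),
                              available_actions=[(0.0,), (1.0,)])
```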

If we restrict the feasible region to two points, and the two points are chosen adversarially by inspecting the AI's model, the traditional adversarial examples scenario can be seen as an unusual special case of this framework. I think this illustrates how artificial the "adversarial examples" scenario is from a Goodhart's Law perspective, because the universe is rarely perverse enough to deliberately restrict us to only two options, one of which has been specially chosen to deceive us. Although the broadness of action space has been cited as a reason why Goodhart's Law occurs, I suspect a broader action space actually makes Goodhart's Law easier to overcome if you use it well -- for example, by simply selecting the labeled input with the best output, as described above.

To simulate an AI making queries of its human operator, you could allow the system a limited number of queries to the function it is trying to maximize. However, we probably should not focus on this active learning scenario to the exclusion of all else -- to prevent showing the operator info hazards, we'll want a model for "what constitutes an info hazard", and that model can't easily be trained by active learning without showing the operator instances of info hazards! (Well, maybe if they were restricted to mild info hazards...)
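A limited query budget could be enforced with a small wrapper like the one below. The class name and the choice to raise an error once the budget runs out are assumptions; a contest could just as well return a sentinel or apply a score penalty.

```python
class BudgetedOracle:
    """Wrap the function to maximize and allow only a fixed number of queries."""

    def __init__(self, true_function, budget):
        self._true_function = true_function
        self.remaining = budget

    def query(self, x):
        # Assumption: exceeding the budget is a hard error rather than a penalty.
        if self.remaining <= 0:
            raise RuntimeError("query budget exhausted")
        self.remaining -= 1
        return self._true_function(x)


oracle = BudgetedOracle(lambda x: x * 2, budget=2)
first = oracle.query(1)
second = oracle.query(2)
```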

Where does the function to maximize come from? I see two options: people functions and computer functions. For people functions, you could use taskers to evaluate the computer's output. There's already been work on generating cat pictures, which could be seen as an attempt to maximize the person function "how much does this image look like a cat". But ideas from this post could still be applied to such a problem. For example, to add distributional shift, you could find a weird cat picture, then fix a bunch of the weirder pixels on it as the "feasible region", leave the other pixels unassigned, and see if an AI system can recover a reasonable cat according to taskers. Can an AI generate a black cat after only having seen tawny cats? What other distributional constraints could be imposed?

For computer functions, you'd like to keep your method for generating the function secret, because otherwise contest participants can code their AI system so it has an inductive bias towards learning the kind of functions that you like to use. Also, for computer functions, you probably want to be realistic without being perverse. For example, you could have a parabolic function which has a point discontinuity at the peak, and that could fool an AI system that tries to fit a parabola on the data and guess the peak, but this sort of perversity seems a bit unlikely to show up in real-world scenarios (unless we think the function is likely to go "off distribution" in the region of its true maximum?). Finally, in the same way most random images are not cats, and most atom configurations are undesired by humans, most inputs to your computer function should probably get a negative score. But in the same way it's easier for people to specify what they want than what they don't want, you might want to imbalance your training dataset towards positive scores anyway.
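To make the "perverse parabola" concrete, here is a small sketch. The specific function, the penalty value at the peak, and the sample points are all invented for illustration. Exact rational arithmetic (`fractions.Fraction`) is used so that the fitted vertex lands exactly on the single bad point; with floating-point rounding a fitted vertex could easily miss it, which is one more sign of how contrived the trap is.

```python
from fractions import Fraction


def perverse_parabola(x):
    """Parabola peaking at x = 1, except for a point discontinuity at the peak itself."""
    if x == 1:
        return Fraction(-10)  # hypothetical bad value at the single perverse point
    return 1 - (x - 1) ** 2


def fitted_vertex(points):
    """x-coordinate of the vertex of the parabola through three (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = points
    numerator = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    denominator = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    return x2 - Fraction(1, 2) * numerator / denominator


# Labeled data that never includes the peak itself.
samples = [Fraction(0), Fraction(1, 2), Fraction(3)]
points = [(x, perverse_parabola(x)) for x in samples]

guess = fitted_vertex(points)              # the fit is perfect, so the guess is x = 1
value_at_guess = perverse_parabola(guess)  # ...which is exactly the bad point
```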

To ensure high reliability, we'll want means by which these problems can be generated en masse, to see if we can get the probability of e.g. proposing an input that gets a negative output well below 0.1%. Luckily, for any given function/dataset pair, it's possible to generate a lot of problems just by challenging the AI on different feasible regions.
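Generating many problems from one function/dataset pair might look like the sketch below: sample many random feasible regions from a pool of inputs, ask the policy for a choice in each, and count how often the chosen input has a negative true output. The sampling scheme, the names, and the two toy policies are assumptions. A policy may return `None` to abstain.

```python
import random


def negative_rate(true_function, policy, input_pool,
                  n_problems=1000, region_size=5, seed=0):
    """Estimate how often a policy proposes an input with a negative true output.

    Each problem is a random feasible region drawn from input_pool.
    A policy returning None is treated as abstaining, which is never negative.
    """
    rng = random.Random(seed)
    negatives = 0
    for _ in range(n_problems):
        region = rng.sample(input_pool, region_size)
        choice = policy(region)
        if choice is not None and true_function(choice) < 0:
            negatives += 1
    return negatives / n_problems


def identity(x):
    # Toy function to maximize: the output is just the input.
    return x


def careful_policy(region):
    """Toy policy with perfect knowledge: take the best input, abstain if it is not positive."""
    best = max(region)
    return best if best > 0 else None


pool = list(range(-50, 51))
careful_rate = negative_rate(identity, careful_policy, pool)
reckless_rate = negative_rate(identity, min, pool)  # always picks the worst input
```

Note that verifying a rate well below 0.1% would need far more than 1000 problems, since rare failures will not show up in a small sample.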

Anyway, I think work on this problem will be more applicable to real-world AI safety scenarios than adversarial examples, and it doesn't seem to me that it reduces quite as directly to "solve AGI" as adversarial examples work.

5 comments

The impractical part about training for good behavior is that it's a nested loop - every training example on how to find good maxima requires training a model that in turn needs its own training examples. So it's destined to be behind the state of the art, probably using state of the art models to generate the copious required training data.

The question, I suppose, is whether this is still good enough to learn useful general lessons. And after thinking about it, I think the answer is that yes, it should be, especially for feed-forward architectures that look like modern machine learning, where you don't expect qualitative changes in capability as you scale computational resources.

(Not 100% sure I understood your comment.) Training is one idea, but you could also just test out heuristics with this framework. For example, I think this scheme could be used to benchmark quantilization against a competing approach.

Pretty sure you understood it :) But yeah, not only would I like to be able to compare two things, I'd like to be able to find the optimum values of some continuous variables. Though I suppose it doesn't matter as much if you're trying to check / evaluate ideas that you arrived at by more abstract reasoning.

Since my model is more accurate, ~10 times out of 11 the input will correspond to an "adversarial" attack on your model.

This argument (or the uncorrelation assumption) proves too much. A perfect cat detector performs better than one that also calls close-ups of the sun cats. Yet close-ups of the sun do not qualify as adversarial examples, as they are far from any likely starting image.

I'm not sure what you're trying to say. I'm using a broad definition of "adversarial example" where it's an adversarial example if you can deliberately make a classifier misclassify. So if an adversary can find a close-up of a sun that your cat detector calls a cat, that's an adversarial example by my definition. I think this is similar to the definition Ian Goodfellow uses.


An adversarial example is an input to a machine learning model that is intentionally designed by an attacker to fool the model into producing an incorrect output.