
Introduction: If I were a well-intentioned AI...

I've often warned people about the dangers of anthropomorphising AIs - how it can mislead us about what's really going on in an AI (and hence how the AI might act in the future), cause us to not even consider certain failure modes, and make us believe we understand things much better than we do.

Oh well, let's ignore all that. I'm about to go on a journey of major anthropomorphisation, by asking myself:

  • "If I was a well-intentioned AI, could I solve many of the problems in AI alignment?"

My thinking in this way started when I wondered: suppose I knew that I was given a proxy goal rather than the true goal; suppose that I knew about the Goodhart problem, and suppose that I really "wanted" to align with the true goal - could I then do it? I was having similar thoughts about being a mesa-optimiser.

It seems to me that asking and answering these kinds of questions leads to new and interesting insights. Of course, since they come via anthropomorphisation, we need to be careful with them, and check that they are really applicable to AI systems - ensuring that I'm not bringing some of my own human knowledge about human values into the example. But first, let's get those initial insights.

Overlapping problems, overlapping solutions

At a high enough level of abstraction, many problems in AI alignment seem very similar. The Goodhart problem, the issues machine learning has with distributional shift, the problem of the nearest unblocked strategy, unidentifiability of reward functions, even mesa-optimisation and the whole AI alignment problem itself - all of these can be seen, roughly, as variants of the same problem: we have an approximately specified goal that looks fine, but turns out to be underspecified in dangerous ways.

Of course, often the differences between the problems are as important as the similarities. Nevertheless, the similarities exist, which is why a lot of the solutions are going to look quite similar, or at least address quite similar issues.

Distributional shift for image recognition

Let's start with a simple example: image recognition. If I were an image classifier, aware of some of these problems, could I reduce them?

First, let's look at two examples of problems.

Recognising different things

Firstly, we have the situation where the algorithm successfully classifies the test set, but is actually recognising different features from the ones humans were expecting.

For example, this post details how a dumbbell recogniser was tested to see what images triggered its recognition most strongly:

Though it was supposed to be recognising dumbbells, it ended up recognising some mix of dumbbells and arms holding them. Presumably, flexed arms were present in almost all images of dumbbells, so the algorithm used them as a classification tool.

Adversarial examples

Then there are the famous adversarial examples, where, for example, a picture of a panda with some very slight but carefully selected noise added is misidentified as a gibbon:

AI-me vs multiply-defined images

Ok, suppose that AI-me suspects that I have problems like the ones above. What can I do from inside the algorithm? I can't fix everything - garbage in, garbage out, or at least insufficient information in, inferior performance out - but there are some steps I can take to improve my performance.

The first step is to treat my reward or label information as informative of the true reward/true category, rather than as goals. This is similar to the paper on Inverse Reward Design, which states:

the designed reward function should merely be an observation about the intended reward, rather than the definition; and should be interpreted in the context in which it was designed

This approach can extend to image classification as well; we can recast it as:

the labelled examples should merely be examples of the intended category, not a definition of it; and should be interpreted in the context in which they were selected

So instead of thinking "does this image resemble the category 'dumbbell' of my test set?", I instead ask "what features could be used to distinguish the dumbbell category from other categories?"

Then I could note that the dumbbell images all seem to have pieces of metal in them with knobs on the end, and also some flexed arms. So I construct two (or more) subcategories, 'metal with knobs' and 'flexed arms'[1].

These would come into play if I got an image like this:

I wouldn't just think:

  • "this image scores high in the dumbbell category",

but instead:

  • "this image scores high in the 'flexed arms' subcategory of the dumbbell category, but not in the 'metal with knobs' subcategory."

That's a warning that something is up, and that a mistake may well be on the way.
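As a minimal sketch of what such an internal check could look like, here is some illustrative Python. The subcategory names, scores, and threshold are all assumptions made up for the example - the real signal would be whatever clusters I had found in my own training data.

```python
import numpy as np

# Hypothetical subcategory scores for a single image, e.g. produced by
# separate heads trained on clusters within the "dumbbell" training images.
# The names and values here are purely illustrative.
subcategory_scores = {
    "metal_with_knobs": 0.05,
    "flexed_arms": 0.92,
}

def subcategory_disagreement(scores, gap_threshold=0.5):
    """Flag an image whose subcategory scores disagree strongly.

    A high score in one subcategory paired with a low score in another
    suggests the image matches only part of what the parent category
    was trained on - a warning sign, not proof of an error.
    """
    values = np.array(list(scores.values()))
    return values.max() - values.min() > gap_threshold

if subcategory_disagreement(subcategory_scores):
    print("Warning: image matches some but not all dumbbell subcategories.")
```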

Detecting out-of-distribution images

That flexed arm is an out-of-distribution image - one different from the distribution of images in the training set. There have been various approaches to detecting this phenomenon.

I could run all these approaches to look for out-of-distribution images, and I could also look for other clues - such as an unusual pattern of neurons triggering in my dumbbell detector (i.e. it scores highly, but in an unusual way), or I could run a discriminative model to identify whether the image sticks out from the training set.

In any case, detecting an out-of-distribution image is a signal that, if I haven't done so already, I need to start splitting the various categories to check whether the image fits better in a subcategory than in the base category.
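For concreteness, one very simple out-of-distribution signal is a low maximum softmax probability (the baseline proposed by Hendrycks & Gimpel). A rough sketch, with an illustrative threshold rather than a calibrated one:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def looks_out_of_distribution(logits, threshold=0.7):
    """Treat a low maximum softmax probability as a hint that the input
    is unlike the training distribution.

    The threshold here is illustrative; in practice it would be
    calibrated on held-out in-distribution data.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    return probs.max() < threshold

# Illustrative logits for a confident vs. an ambiguous image.
print(looks_out_of_distribution([6.0, 0.5, 0.2]))   # False: looks in-distribution
print(looks_out_of_distribution([1.1, 0.9, 1.0]))   # True: no category stands out
```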

What to do with the information

What I should do with the information depends on how I'm designed. If I were trained to distinguish "dumbbells" from "spaceships", then this image, though out of distribution, is clearly much closer to a dumbbell than a spaceship. I should therefore identify it as such, but attach a red flag if I can.

If I have a "don't know" option[2], then I will use it, classifying the image as slightly dumbbell-ish, with a lot of uncertainty.

If I have the option of asking for more information or for clarification, then now is the moment to do that. If I can decompose my classification categories effectively (as 'metal with knobs' and 'flexed arms') then I can ask which, if any, of these categories I should be using. This is very much in the spirit of this blog post, which decomposes the images into "background" and "semantics", and filters out background changes. Except that here, I'm doing the decomposition myself, and then asking my programmers which is "semantics" and which is "background".

Notice the whole range of options available here, in contrast to the Inverse Reward Design paper, which simply advocates extreme conservatism around the possible reward functions.

Ultimately, humans may want to set my degree of conservatism, depending on how dangerous they feel my errors would be (though even seemingly-safe systems can be manipulative - so it's possible I should be slightly more conservative than humans allow for).
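To make that range of options concrete, here is a toy decision rule combining them. The interface (flags like `can_ask` and `can_abstain`) and the "red flag" mechanism are assumptions for illustration, not anything the designs above actually expose:

```python
def decide(category_scores, ood_flag, can_abstain, can_ask):
    """A toy decision rule combining the options discussed above.

    Which branch is available depends on how the classifier is deployed;
    the branch ordering and the 'red flag' mechanism are illustrative.
    """
    best = max(category_scores, key=category_scores.get)
    if not ood_flag:
        return ("classify", best)
    if can_ask:
        return ("ask_programmers", best)      # request clarification
    if can_abstain:
        return ("dont_know", best)            # low-confidence / near-uniform output
    return ("classify_with_red_flag", best)   # forced choice, but flagged

print(decide({"dumbbell": 0.8, "spaceship": 0.1},
             ood_flag=True, can_abstain=False, can_ask=False))
```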

AI-me vs adversarial examples

Adversarial examples are a similar, but distinct, problem. Some approaches to detecting out-of-distribution images can also detect adversarial examples. I can also run an adversarial attack on myself, construct extreme adversarial examples, and see whether the image has features in common with them.
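As an illustration of "running an adversarial attack on myself", here is a sketch of a fast-gradient-sign-style perturbation (in the spirit of Goodfellow et al.). The gradient is passed in as a placeholder, since in a real system it would come from backpropagating through my own network:

```python
import numpy as np

def fgsm_perturb(x, grad_wrt_x, epsilon=0.03):
    """Fast-gradient-sign-style perturbation.

    `grad_wrt_x` is the gradient of the loss with respect to the input;
    here it is simply passed in, so the sketch stays framework-agnostic.
    """
    return np.clip(x + epsilon * np.sign(grad_wrt_x), 0.0, 1.0)

# Illustrative "image" and gradient; real values would come from the model.
x = np.random.rand(8, 8)
grad = np.random.randn(8, 8)
x_adv = fgsm_perturb(x, grad)

# If my own prediction flips under a tiny perturbation like this, that
# fragility is itself a clue about where adversarial examples live.
print(np.abs(x_adv - x).max())  # perturbation size stays <= epsilon
```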

If an image scores unduly high in one category, or has an unusual pattern of triggering neurons for that category, that might be another clue that it's adversarial.

I have to also take into account that the adversary may have access to all of my internal mechanisms, including my adversarial detection mechanisms. So things like randomising key parts of my adversarial detection, or extreme conservatism, are options I should consider.

Of course, if asking humans is an option, then I should ask.

But what is an adversarial example?

But here I'm trapped by lack of information - I'm not human, so I don't know the true categories that humans are trying to get me to classify. How can I know that this is not a gibbon?

I can, at best, detect that it has a pattern of varying small-scale changes, different from the other images I've seen. But maybe humans can see those small changes, and they really do mean for that image to be a gibbon?

This is where some more knowledge of human categories can come in useful. The more I know about different types of adversarial examples, the better I can do - not because I need to copy the humans' methods, but because those examples tell me what humans consider adversarial examples, letting me look out for them better. Similarly, information about what images humans consider "basically identical" or "very similar" would inform me about how their classification is meant to go.
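As a toy example of how such similarity information might be used, here is one crude proxy for "basically identical": the images differ by at most a small per-pixel budget. The budget value is an assumption; ideally it would be fitted to actual human judgements:

```python
import numpy as np

def humans_call_basically_identical(x1, x2, linf_budget=8 / 255):
    """A crude proxy for 'basically identical to a human': the two images
    differ by at most a small per-pixel amount. The budget is an assumed
    value for illustration, not something derived from human data.
    """
    return float(np.abs(np.asarray(x1) - np.asarray(x2)).max()) <= linf_budget

original = np.random.rand(32, 32)
perturbed = original + np.random.uniform(-0.02, 0.02, size=original.shape)

# If the perturbed image is 'basically identical' to one I confidently
# call a panda, I should distrust any very different label for it.
print(humans_call_basically_identical(original, perturbed))
```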


  1. Of course, I won't necessarily know these names; these are just the human-interpretable versions of whatever labelling system I'm using. ↩︎

  2. Note that if I'm trained on many categories, I have the "uniform distribution on every category" which functions as a "don't know". ↩︎

Comments

I went back and re-read these. I think the main anthropomorphic power that AI-you uses is that it already models the world using key human-like assumptions, where necessary.

For example, when you think about how an AI would be predisposed to break images down into "metal with knobs" and "beefy arm" rather than e.g. "hand holding metal handle," "knobs," and "forearm," (or worse, some combination of hundreds of edge detector activations), it's pretty tricky. It needs some notion of human-common-sense "things" and what scale things tend to be. Maybe it needs some notion of occlusion and thing composition before it can explain held-dumbbell images in terms of a dumbbell thing. It might even need to have figured out that these things are interacting in an implied 3D space before it can draw human-common-sense associations between things that are close together in this 3D space. All of which is so obvious to us that it can be implicit for AI-you.

Would you agree with this interpretation, or do you think there's some more important powers used?

I think that might be a generally good critique, but I don't think it applies to this post (it may apply better to post #3 in the series).

I used "metal with knobs" and "beefy arm" as human-parsable examples, but the main point is detecting when something is out-off-distribution, which relies on the image being different in AI-detectable ways, not on the specifics of the categories I mentioned.

I don't think this is necessarily a critique - after all, it's inevitable that AI-you is going to inherit some anthropomorphic powers. The trick is figuring out what they are and seeing if it seems like a profitable research avenue to try and replicate them :)

In this case, I think this is an already-known problem, because detecting out-of-distribution images in a way that matches human requirements requires the AI's distribution to be similar to the human distribution (and conversely, mismatches in distribution allow for adversarial examples). But maybe there's something different in part 4, where I think there's some kind of "break down actions in obvious ways" power that might not be as well-analyzed elsewhere (though it's probably related to self-supervised learning of hierarchical planning problems).

I don't think critiques are necessarily bad ^_^