Introduction: If I were a well-intentioned AI...
I've often warned people about the dangers of anthropomorphising AIs - how it can mislead us about what's really going on in an AI (and hence how the AI might act in the future), cause us to not even consider certain failure modes, and make us believe we understand things much better than we do.
Oh well, let's ignore all that. I'm about to go on a journey of major anthropomorphisation, by asking myself:
- "If I was a well-intentioned AI, could I solve many of the problems in AI alignment?"
My thinking in this way started when I wondered: suppose I knew that I was given a proxy goal rather than the true goal; suppose that I knew about the Goodhart problem, and suppose that I really "wanted" to align with the true goal - could I then do it? I was having similar thoughts about being a mesa-optimiser.
It seems to me that asking and answering these kinds of questions leads to new and interesting insights. Of course, since they come via anthropomorphisation, we need to be careful with them, and check that they are really applicable to AI systems - ensuring that I'm not bringing some of my own human knowledge about human values into the example. But first, let's get those initial insights.
Overlapping problems, overlapping solutions
At a high enough level of abstraction, many problems in AI alignment seem very similar. The Goodhart problem, the issues machine learning has with distributional shift, the problem of the nearest unblocked strategy, unidentifiability of reward functions, even mesa-optimisation and the whole AI alignment problem itself - all of these can be seen, roughly, as variants of the same problem. That problem is that we have an approximately specified goal that looks ok, but turns out to be underspecified in dangerous ways.
Of course, often the differences between the problems are as important as the similarities. Nevertheless, the similarities exist, which is why a lot of the solutions are going to look quite similar, or at least address quite similar issues.
Distributional shift for image recognition
Let's start with a simple example: image recognition. If I was an image classifier and aware of some of the problems, could I reduce them?
First, let's look at two examples of problems.
Recognising different things
In the first, the algorithm successfully classifies the test set, but it's actually recognising different features from those humans were expecting.
For example, this post details how a dumbbell recogniser was tested to see what images triggered its recognition the strongest:
Though it was supposed to be recognising dumbbells, it ended up recognising some mix of dumbbells and arms holding them. Presumably, flexed arms were present in almost all images of dumbbells, so the algorithm used them as a classification tool.
Then there are the famous adversarial examples, where, for example, a picture of a panda with some very slight but carefully selected noise is mis-identified as a gibbon:
AI-me vs multiply-defined images
Ok, suppose that AI-me suspects that I have problems like the ones above. What can I do from inside the algorithm? I can't fix everything - garbage in, garbage out, or at least insufficient information in, inferior performance out - but there are some steps I can take to improve my performance.
The first step is to treat my reward or label information as informative of the true reward/true category, rather than as goals. This is similar to the paper on Inverse Reward Design, which states:
the designed reward function should merely be an observation about the intended reward, rather than the definition; and should be interpreted in the context in which it was designed
This approach can extend to image classification as well; we can recast it as:
the labelled examples should merely be examples of the intended category, not a definition of it; and should be interpreted in the context in which they were selected
So instead of thinking "does this image resemble the category 'dumbbell' of my test set?", I instead ask "what features could be used to distinguish the dumbbell category from other categories?"
Then I could note that the dumbbell images all seem to have pieces of metal in them with knobs on the end, and also some flexed arms. So I construct two (or more) subcategories, 'metal with knobs' and 'flexed arms'.
These come into play if I get an image like this:
I wouldn't just think:
- "this image scores high in the dumbbell category",
but instead:
- "this image scores high in the 'flexed arms' subcategory of the dumbbell category, but not in the 'metal with knobs' subcategory."
That's a warning that something is up, and that a mistake is potentially likely.
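As a minimal sketch of that warning, assume I already have a score for the whole category and a score for each subcategory I constructed. The function name, the thresholds and the example scores below are all hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass
class SubcategoryReport:
    category_score: float
    subcategory_scores: dict
    warning: bool


def check_subcategories(category_score, subcategory_scores, high=0.7, low=0.3):
    """Flag images that score high overall but only on some subcategories.

    The thresholds `high` and `low` are assumptions, not tuned values.
    """
    fired = [name for name, s in subcategory_scores.items() if s >= high]
    silent = [name for name, s in subcategory_scores.items() if s <= low]
    # Warn when the category fires, some subcategory fires, and another is silent.
    warning = category_score >= high and bool(fired) and bool(silent)
    return SubcategoryReport(category_score, dict(subcategory_scores), warning)


# Hypothetical scores for an image of a flexed arm holding no dumbbell.
report = check_subcategories(0.85, {"metal with knobs": 0.05, "flexed arms": 0.95})
print(report.warning)  # True
```

An image that scores high on both subcategories would not raise the warning; only the split between them does.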
Detecting out of distribution images
I could run all these approaches to look for out of distribution images, and I could also look for other clues - such as the triggering of an unusual pattern of neurons in my dumbbell detector (ie it scores highly, but in an unusual way). Or I could run a discriminative model to identify whether the image sticks out from the training set.
In any case, detecting an out of distribution image is a signal that, if I haven't done it already, I need to start splitting the various categories to check whether the image fits better in a subcategory than in the base category.
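One crude way to make "an unusual pattern of neurons" concrete is to compare each unit's activation with its typical range over the training images. This is a sketch under my own assumptions, not a method from any of the linked work: the per-unit z-score, the threshold of 3 and the toy activations are all illustrative.

```python
from statistics import mean, pstdev


def fit_activation_stats(training_activations):
    """Per-unit mean and standard deviation over the training images."""
    units = list(zip(*training_activations))
    return [(mean(u), pstdev(u)) for u in units]


def unusualness(activations, stats, eps=1e-6):
    """Largest absolute z-score of any unit: a crude 'unusual pattern' measure."""
    return max(abs(a - m) / (s + eps) for a, (m, s) in zip(activations, stats))


def is_out_of_distribution(activations, stats, threshold=3.0):
    # The threshold is an assumption; in practice it would need calibrating.
    return unusualness(activations, stats) > threshold


# Hypothetical activations of two units ('metal with knobs', 'flexed arms')
# on four training images of dumbbells.
training = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9], [0.9, 0.8]]
stats = fit_activation_stats(training)

print(is_out_of_distribution([0.85, 0.75], stats))  # False
print(is_out_of_distribution([0.05, 0.95], stats))  # True
```

The second image could still score highly as a dumbbell overall; the check only says that it does so in a way no training image did.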
What to do with the information
What I should do with the information depends on how I'm designed. If I was trained to distinguish "dumbbells" from "spaceships", then this image, though out of distribution, is clearly much closer to a dumbbell than a spaceship. I should therefore identify it as such, but attach a red flag if I can.
If I have a "don't know" option, then I will use it, classifying the image as slightly dumbbell-ish, with a lot of uncertainty.
If I have the option of asking for more information or for clarification, then now is the moment to do that. If I can decompose my classification categories effectively (as 'metal with knobs' and 'flexed arms') then I can ask which, if any, of these categories I should be using. This is very much in the spirit of this blog post, which decomposes the images into "background" and "semantics", and filters out background changes. The difference here is that I'm doing the decomposition myself, and then asking my programmers which is "semantics" and which is "background".
Notice the whole range of options available, unlike the Inverse Reward Design paper, which simply advocates extreme conservatism around the possible reward functions.
Ultimately, humans may want to set my degree of conservatism, depending on how dangerous they feel my errors would be (though even seemingly-safe systems can be manipulative - so it's possible I should be slightly more conservative than humans allow for).
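The range of options above can be sketched as a small decision rule. Everything here is an assumption made for illustration: the action names, the single `conservatism` threshold set by the humans, and the choice to prefer asking over "don't know" when both are available.

```python
def decide(scores, out_of_distribution, conservatism=0.5,
           can_abstain=False, can_ask=False):
    """Return (action, label) for one image.

    scores: dict mapping category name to a score in [0, 1].
    conservatism: in [0, 1], set by the humans; higher means more cautious.
    """
    label = max(scores, key=scores.get)
    confident = scores[label] >= conservatism and not out_of_distribution
    if confident:
        return ("classify", label)
    if can_ask:
        return ("ask_for_clarification", label)
    if can_abstain:
        return ("dont_know", label)
    # Forced choice: give the closest category, but attach a red flag.
    return ("classify_with_red_flag", label)


# Hypothetical scores for the out of distribution image.
scores = {"dumbbell": 0.6, "spaceship": 0.1}
print(decide(scores, out_of_distribution=True))
# ('classify_with_red_flag', 'dumbbell')
print(decide(scores, out_of_distribution=True, can_ask=True))
# ('ask_for_clarification', 'dumbbell')
```

Raising `conservatism` pushes more images out of the plain "classify" branch, which is one simple way of being slightly more cautious than the default.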
AI-me vs adversarial examples
Adversarial examples are similar, but different. Some approaches to detecting out of distribution images can also detect adversarial examples. I can also run an adversarial attack on myself, and construct extreme adversarial examples, and see whether the image has features in common with them.
If an image scores unduly high in one category, or has an unusual pattern of triggering neurons for that category, that might be another clue that it's adversarial.
I also have to take into account that the adversary may have access to all of my internal mechanisms, including my adversarial detection mechanisms. So things like randomising key parts of my adversarial detection, or extreme conservatism, are options I should consider.
Of course, if asking humans is an option, then I should.
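One possible randomised check, which is my own illustrative choice rather than anything the approaches above prescribe, is to see whether the label survives small random perturbations whose draws the adversary cannot predict. The toy linear classifier, the noise level and the sample count are all hypothetical.

```python
import random


def label_stability(classify, image, noise=0.05, samples=200, rng=None):
    """Fraction of randomly perturbed copies that keep the original label."""
    if rng is None:
        # An unpredictable source makes the check harder to anticipate.
        rng = random.SystemRandom()
    original = classify(image)
    kept = 0
    for _ in range(samples):
        noisy = [pixel + rng.uniform(-noise, noise) for pixel in image]
        if classify(noisy) == original:
            kept += 1
    return kept / samples


# Hypothetical toy classifier: a fixed linear rule over four "pixels".
WEIGHTS = [1.0, -1.0, 1.0, -1.0]


def toy_classify(image):
    score = sum(w * x for w, x in zip(WEIGHTS, image))
    return "gibbon" if score > 0 else "panda"


rng = random.Random(0)  # seeded only so the example is reproducible
clear_image = [0.9, 0.1, 0.9, 0.1]     # far from the decision boundary
suspect_image = [0.51, 0.5, 0.5, 0.5]  # sits just over the boundary

print(label_stability(toy_classify, clear_image, rng=rng))  # 1.0
print(label_stability(toy_classify, suspect_image, rng=rng))
```

The second number should come out well below 1, since the suspect image only just crosses the boundary. A low stability score is only a clue, not proof: an honest image near a category boundary would look the same to this check.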
But what is an adversarial example?
But here I'm trapped by lack of information - I'm not human, I don't know the true categories that they are trying to get me to classify. How can I know that this is not a gibbon?
I can, at best, detect it has a pattern of varying small-scale changes, different from the other images I've seen. But maybe humans can see those small changes, and they really mean for that image to be a gibbon?
This is where some more knowledge of human categories can come in useful. The more I know about different types of adversarial examples, the better I can do - not because I need to copy the humans' methods, but because those examples tell me what humans consider adversarial examples, letting me look out for them better. Similarly, information about what images humans consider "basically identical" or "very similar" would inform me about how their classification is meant to go.