Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I had an interesting debate recently, about whether we could make smart AIs safe just by focusing on their structure and their task. Specifically, we were pondering something like:

• "Would an algorithm be safe if it was a neural net-style image classifier, trained on examples of melanoma to detect skin cancer, with no other role than to output a probability estimate for a given picture? Even if "superintelligent", could such an algorithm be an existential risk?"

Whether it's an existential risk was not resolved; but I have a strong intuition that such an algorithm would likely be manipulative. Let's see how.

The requirements for manipulation

For an algorithm to be manipulative, it has to derive some advantage from manipulation, and it needs to be able to learn to manipulate - for that, it needs to be able to explore situations where it engages in manipulation and this is to its benefit.

There are certainly very simple situations where manipulation can emerge. But that example, though simple, had an agent that was active in the world. Can a classifier display the same sort of behaviour?

Manipulation emerges naturally

To show that it can, picture the following design. The programmers have a large collection of slightly different datasets, and want to train the algorithm on all of them. The loss function is an error rate, which can vary between 0 and 1. Many of the hyperparameters are set by a neural net, which itself takes a more "long-term view" of the error rate, trying to improve it from day to day rather than from run to run.

How have the programmers set up the system? Well, they run the algorithm on batched samples from ten datasets at once, and record the error rate for all ten. The hyperparameters are set to minimise average error over each run of ten. When the error rate on one dataset falls below a fixed threshold for a few runs, they remove it from the batches, and substitute in a new one to train the algorithm on[1].

So, what will happen? Well, the system will initially start to minimise the error rate. Then, after a while, it will start "noticing" that when the error on a dataset stays below the removal threshold, that dataset is removed and the average error rate shoots up. Actually, it won't be "noticing" this; instead, if there are hyperparameters that prevent the error rate from going below the threshold, these will be selected by gradient descent as giving a better performance.

So, this problem of optimisation of image classification will result in the algorithm manipulating its programmers, by "underperforming" on key datasets, so that they stay in the batch.
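The dynamic described in this section can be sketched as a toy simulation. This is only an illustration under invented assumptions: the threshold of 0.1, the starting error of 0.5, the geometric decay, the "patience" of three runs, and the `floor` parameter (standing in for whatever hyperparameters stop the error from falling further) are all hypothetical, not values from the setup above.

```python
def simulate(floor, threshold=0.1, patience=3, n_datasets=10,
             start_error=0.5, decay=0.8, n_runs=200):
    """Mean recorded error over all runs, for a given error floor.

    All numbers are illustrative assumptions. `floor` is a stand-in for
    hyperparameters that keep the error from dropping below some level.
    """
    errors = [start_error] * n_datasets   # current error on each active dataset
    below = [0] * n_datasets              # consecutive runs spent under the threshold
    total = 0.0
    for _ in range(n_runs):
        # The average over the ten datasets is what the outer optimiser sees.
        total += sum(errors) / n_datasets
        for i in range(n_datasets):
            below[i] = below[i] + 1 if errors[i] < threshold else 0
            if below[i] >= patience:
                # The programmers swap in a fresh dataset with a high error.
                errors[i] = start_error
                below[i] = 0
            else:
                # Training pulls the error down towards the floor.
                errors[i] = floor + (errors[i] - floor) * decay
    return total / n_runs


honest = simulate(floor=0.0)         # learns each dataset fully, so it gets removed
sandbagging = simulate(floor=0.11)   # hovers just above the assumed threshold
print(f"honest: {honest:.3f}, sandbagging: {sandbagging:.3f}")
```

Under these assumed numbers, the setting that keeps every dataset just above the threshold ends up with the lower long-run mean error, which is the quantity the slower, "long-term" optimiser is selecting on.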

Generic manipulation

Of course, the manipulation above happened because the programmers didn't understand what the algorithm's true loss function was. They thought it was "minimise overall loss on classification", but it was actually "keep each dataset loss just above the removal threshold".

But that kind of mistake is easy to make. Whether through carelessness, or because the programmers thought a change would improve the situation, the loss they thought they had specified diverged from the actual loss, and this resulted in manipulative behaviour.

Here is another example: suppose the programmers build in a precaution for things going wrong. If the error rate suddenly doubles on some dataset, then the algorithm is run exclusively on that dataset for a while, until the error rate declines again (or, if it doesn't, the programmers are called).

Now suppose that, of the ten datasets it has, nine have a high error rate, and one has an error rate well under half of that. Then it is in the algorithm's interest to boost the error rate on that last dataset, suddenly, to double its previous value. It will then be run for a time only on that one dataset, whose error rate, even doubled, is below that of the other nine, giving it a lower average error rate. Thus hyperparameters that allow this kind of behaviour will get selected for (and if the algorithm is run on a lot of datasets, it will have many opportunities to learn about this kind of behaviour).
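To make the arithmetic concrete, here is a minimal sketch with assumed, purely illustrative error rates: nine datasets at 0.5 and one at 0.15. The `recovery` factor (how fast the error declines during the exclusive phase) and the six-run window are also hypothetical choices.

```python
def mean_error_with_spike(errors, target, window, spike_factor=2.0, recovery=0.9):
    """Mean recorded error over `window` runs if the error on `target` is spiked.

    Assumed mechanics: the first run is a normal ten-dataset run in which the
    target's error is multiplied by `spike_factor`; this triggers the
    precaution, so the remaining runs are on the target dataset alone, with
    its error declining by `recovery` per run back to its original level.
    """
    spiked = errors[target] * spike_factor
    first_run = (sum(errors) - errors[target] + spiked) / len(errors)
    recorded = [first_run]
    current = spiked
    for _ in range(window - 1):
        current = max(errors[target], current * recovery)
        recorded.append(current)
    return sum(recorded) / window


errors = [0.5] * 9 + [0.15]            # illustrative values, not from the post
baseline = sum(errors) / len(errors)   # mean error per run with no spike
with_spike = mean_error_with_spike(errors, target=9, window=6)
print(f"no spike: {baseline:.3f}, with spike: {with_spike:.3f}")
```

The spike only pays off when the doubled error on the good dataset is still below the average over all ten; if the datasets had similar error rates, triggering the precaution would gain nothing.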

What about the "if it doesn't, the programmers are called" clause? This depends on what happens when the programmers are called, and how the error rate is computed in that case. But, depending on what does happen, this incentivises the algorithm to learn to call the programmers often (double an error rate and keep it high) or rarely (if there's a risk of the error rate for a dataset going up suddenly, then don't lower it too much in the first place).

So, manipulative behaviour can emerge in smart algorithms, even if they're merely classifying images in a way the programmers feel is entirely harmless.

Existential risks

Can this mild kind of manipulation lead to extreme, existential risk outcomes? Of course it can; it's not hard to create toy models where everything ends disastrously. But it's also not hard to create toy models where manipulation remains mild.

But, in conclusion:

• Manipulative behaviours can emerge in seemingly innocuous algorithms.
• There is no clear division, currently, between mild manipulation and disastrous manipulation.

1. Assume that either catastrophic forgetting has been solved, or that they'll re-run the removed datasets occasionally, to refresh the algorithm's performance on that dataset.