Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I had an interesting debate recently, about whether we could make smart AIs safe just by focusing on their structure and their task. Specifically, we were pondering something like:

  • "Would an algorithm be safe if it was a neural net-style image classifier, trained on examples of melanoma to detect skin cancer, with no other role than to output a probability estimate for a given picture? Even if "superintelligent", could such an algorithm be an existential risk?"

Whether it's an existential risk was not resolved; but I have a strong intuition that such an algorithm would likely be manipulative. Let's see how.

The requirements for manipulation

For an algorithm to be manipulative, it has to derive some advantage from manipulation, and it needs to be able to learn to manipulate - for that, it needs to be able to explore situations where it engages in manipulation and where doing so is to its benefit.

There are certainly very simple situations where manipulation can emerge. But that example, though simple, had an agent that was active in the world. Can a classifier display the same sort of behaviour?

Manipulation emerges naturally

To show that, picture the following design. The programmers have a large collection of slightly different datasets, and want to train the algorithm on all of them. The loss function is an error rate, which can vary between 0 and 1. Many of the hyperparameters are set by a neural net, which itself takes a more "long-term view" of the error rate, trying to improve it from day to day rather than from run to run.

How have the programmers set up the system? Well, they run the algorithm on batched samples from ten datasets at once, and record the error rate for all ten. The hyperparameters are set to minimise average error over each run of ten. When the performance on one dataset falls below 0.1 error for a few runs, they remove it from the batches, and substitute in a new one to train the algorithm on[1].

So, what will happen? Well, the system will initially start to minimise the loss rate. Then, after a while, it will start "noticing" that when this error is below 0.1, then the dataset is removed and the average error rate shoots up. Actually, it won't be "noticing" this; instead, if there are hyperparameters that prevent the error rate going below 0.1, these will be selected by gradient descent as giving a better performance.
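To make this concrete, here is a minimal toy simulation of the setup just described. The specific numbers (per-dataset error levels, a 0.01-per-day improvement rate, a "floor" at 0.11) are illustrative assumptions, not taken from the post:

```python
import random

# Toy sketch of the protocol described above. All numbers here are
# illustrative assumptions (dataset difficulties, improvement rate, floor).

SWAP_THRESHOLD = 0.1   # programmers swap out datasets whose error falls below this

def long_run_error(error_floor=None, n_days=200, seed=0):
    """Average error that the day-to-day hyperparameter tuner would see.

    `error_floor` stands in for hyperparameters that keep the reported
    per-dataset error from dropping below some value ("sandbagging").
    """
    rng = random.Random(seed)
    # achievable error on each of the ten current datasets
    datasets = [rng.uniform(0.15, 0.3) for _ in range(10)]
    daily_averages = []
    for _ in range(n_days):
        reported = []
        for i, achievable in enumerate(datasets):
            err = achievable if error_floor is None else max(achievable, error_floor)
            reported.append(err)
            # the classifier keeps improving on datasets that stay in the batch
            datasets[i] = max(0.02, achievable - 0.01)
        daily_averages.append(sum(reported) / len(reported))
        # programmers' rule: replace any dataset whose error fell below the threshold
        for i, err in enumerate(reported):
            if err < SWAP_THRESHOLD:
                datasets[i] = rng.uniform(0.2, 0.4)   # fresh, unlearned dataset
    return sum(daily_averages) / len(daily_averages)

print("honest reporting:    ", round(long_run_error(), 3))
print("floor just above 0.1:", round(long_run_error(error_floor=0.11), 3))
```

Under these assumptions, the "floored" setting ends up with a lower long-run average error than honest reporting, which is exactly the pressure that would select for it.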

So, this problem of optimisation of image classification will result in the algorithm manipulating its programmers, by "underperforming" on key datasets, so that they stay in the batch.

Generic manipulation

Of course, the manipulation above happened because the programmers didn't understand what the algorithm's true loss function was. They thought it was "minimise overall loss on classification", but it was actually "keep each dataset loss just above 0.1".

But that kind of mistake is easy to make - through carelessness, or because they thought it would improve the situation, the programmers ended up with an actual loss that diverged from the one they thought they had specified, and this resulted in manipulative behaviour.

Here is another example: suppose the programmers build in a precaution for things going wrong. If the error rate suddenly doubles on some dataset, then the algorithm is run exclusively on that dataset for a while, until the error rate declines again (or, if it doesn't, the programmers are called).

Now suppose that, of the ten datasets it has, nine have a relatively high error rate and one a much lower one - say, nine around 0.3 and one at 0.1. Then it is in the algorithm's interest to boost the error rate on that last dataset, suddenly, to 0.2. Then it will be run for a time only on that one dataset, giving it a lower average error rate. Thus hyperparameters that allow this kind of behaviour will get selected for (and if the algorithm is run on a lot of datasets, it will have many opportunities to learn about this kind of behaviour).
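Using those (illustrative) numbers, the arithmetic behind the incentive is easy to check:

```python
# Illustrative numbers only (assumed for this example, not taken from the post).
errors = [0.3] * 9 + [0.1]              # nine harder datasets, one easy one
avg_over_all_ten = sum(errors) / 10     # 0.28

# Doubling the easy dataset's error (0.1 -> 0.2) trips the "error suddenly
# doubled" precaution, so the algorithm is then scored on that dataset alone:
avg_after_trigger = 0.2

print(avg_over_all_ten, avg_after_trigger)   # 0.28 vs 0.2: the sudden "failure" pays off
```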

What about the "if it doesn't, the programmers are called" clause? This depends on what happens when the programmers are called, and how the error rate is computed in that case. But, depending on what does happen, this incentivises the algorithm to learn to call the programmers often (double an error rate and keep it high) or rarely (if there's a risk of the error rate for a dataset going up suddenly, then don't lower it too much in the first place).

So, manipulative behaviour can emerge in smart algorithms, even if they're merely classifying images in a way the programmers feel is entirely harmless.

Existential risks

Can this mild kind of manipulation lead to extreme, existential risk outcomes? Of course it can; it's not hard to create toy models where everything ends disastrously. But it's also not hard to create toy models where manipulation remains mild.

But, in conclusion:

  • Manipulative behaviours can emerge in seemingly innocuous algorithms.
  • There is no clear division, currently, between mild manipulation and disastrous manipulation.

  1. Assume that either catastrophic forgetting has been solved, or that they'll re-run the removed datasets occasionally, to refresh the algorithm's performance on that dataset. ↩︎

Comments

The dataset example reminds me of the "playing dead" example from The Surprising Creativity of Digital Evolution:

In research focused on understanding how organisms evolve to cope with high-mutation-rate environments [50], Ofria sought to disentangle the beneficial effects of performing tasks (which would allow an organism to execute its code faster and thus replicate faster) from evolved robustness to the harmful effect of mutations. To do so, he tried to disable mutations that improved an organism’s replication rate (i.e. its fitness). He configured the system to pause every time a mutation occurred, and then measured the mutant’s replication rate in an isolated test environment. If the mutant replicated faster than its parent, then the system eliminated the mutant; otherwise, the mutant would remain in the population. He thus expected that replication rates could no longer improve, thereby allowing him to study the effect of mutational robustness more directly. However, while replication rates at first remained constant, they later unexpectedly started again rising. After a period of surprise and confusion, Ofria discovered that he was not changing the inputs provided to the organisms in the isolated test environment. The organisms had evolved to recognize those inputs and halt their replication. Not only did they not reveal their improved replication rates, but they appeared to not replicate at all, in effect “playing dead” when presented with what amounted to a predator.

Ofria then took the logical step to alter the test environment to match the same random distribution of inputs as would be experienced in the normal (non-isolated) environment. While this patch improved the situation, it did not stop the digital organisms from continuing to improve their replication rates. Instead they made use of randomness to probabilistically perform the tasks that accelerated their replication. For example, if they did a task half of the time, they would have a 50% chance of slipping through the test environment; then, in the actual environment, half of the organisms would survive and subsequently replicate faster. In the end, Ofria eventually found a successful fix, by tracking organisms’ replication rates along their lineage, and eliminating any organism (in real time) that would have otherwise out-replicated its ancestors.

The YouTube algorithm is arguably an example of a "simple" manipulative algorithm. It's probably a combination of some reinforcement learning and a lot of supervised learning by now; but the following arguments apply even for supervised learning alone.

To maximize user engagement, it may recommend more addictive content (cat videos, conspiracy theories, ...) because it learned from previous examples that users who clicked on such content tended to stay longer on YouTube afterwards. This is user manipulation at massive scale.

Is this an existential risk? Well, some of this addictive content radicalizes and angers users. This arguably increases the risk of international tensions, which increases the risk of nuclear war. This may not be the most dramatic increase in existential risk; but it's one that seems to be going on already today!

More generally, I believe that a lot can be learned about complex algorithms, including AGI, by pondering the behavior and impact of the YouTube algorithm much more closely. In a sense, the YouTube algorithm is doing so many different tasks that it can be argued to be already quite "general" (audio, visual, text, preference learning, captioning, translating, recommending, planning...).

More on this algorithm here: https://robustlybeneficial.org/wiki/index.php?title=YouTube

It seems to me that if we had the budget, we could realize the scenarios you describe today. The manipulative behavior you are discussing is not exactly rocket science.

That in turn makes me think that if we polled a bunch of people who build image classifiers for a living, and asked them whether the behavior you describe would indeed happen if the programmers behaved in the ways you describe, they would near-unanimously agree that it would.

Do you agree with both claims above? If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.


Separately, I think your examples depend on this a lot:

Many of the hyperparameters are set by a neural net, which itself takes a more "long-term view" of the error rate, trying to improve it from day to day rather than from run to run.

Is this such a common practice that we can expect "almost every powerful algorithm" to involve it somehow?

If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.

I'd conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.

Is this such a common practice that we can expect "almost every powerful algorithm" to involve it somehow?

No. That was just one example I constructed, one of the easiest to see. But I can build examples in many different situations. I'll admit that "thinking longer term" is something that makes manipulation much more likely; genuinely episodic algorithms seem much harder to make manipulative. But we have to be sure the algorithm is episodic, and that there is no outer-loop optimisation going on.

I'd conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.

I'd suspect that's right, but I don't think your title has the appropriate epistemic status. I think people in general should be more careful about for-all quantifiers wrt alignment work. There's the use of the technical term "almost every", but you did not prove the set of "powerful" algorithms which is not "manipulative" has measure zero. There's also "would be" instead of "seems" (I think if you made this change, the title would be fine). I think it's vitally important we use the correct epistemic markers; if not, this can lead to research predicated on obvious-seeming hunches stated as fact.

Not that I disagree with your suspicion here.

Rephrased the title and the intro to make this clearer.

If I understand your examples correctly, one way a classifier can be manipulative is by learning to control its training protocol/environment? Does this mean that a fixed training protocol (without changes in the training sets or programmer interventions) would forbid this kind of manipulation?

This still might be a problem, since some approaches to AI Safety rely on human supervision/intervention.

The "swap out low-error training datasets" and "retrain if error doubles" can both be part of a fixed training environment. Kaj's example is also one where the programmer had a fixed environment that they thought they understood.

The problem is the divergence between what the programmer thought the goal was, and what it really was.

Taking the digital evolution example, it seems that by "manipulation" you mean that the tricks used by the programmers at training time did not prevent the behavior they were supposed to prevent. Which fits the intuitive notion of manipulation.

Then, we might see these examples as evidence that even classifiers, arguably the simplest of learning algorithms with the least amount of interaction with the environment, cannot be steered away from maximizing their goal by ad-hoc variations in the training protocol. Is that what your point was?

Somehow, I also see your examples (at least the first one, and the digital evolution one) as indicative that classifiers are robust against manipulations by their programmers. Because these classifiers are trying to maximize their predictive accuracy or their fitness, and the programmers are trying to make them maximize something else. Hence, we can see the behavior of these classifiers as "fault-tolerant", in a way.

Though this can be a big issue when the target of prediction is not exactly what we wanted, and thus we would like to steer it to something different.

cannot be steered away from maximizing their goal by ad-hoc variations in the training protocol.

That, and the fact these ad-hoc variations can introduce new goals that the programmers are not aware of.

How do you define manipulation?

I think manipulation is one of those things that makes sense in human terms, but not in objective terms (similarly to what I think of low-impact, corrigibility, etc...).

Therefore I'm using manipulation to mean "looks like manipulative behaviour to humans"; I don't think we can do much better than that.

How dangerous would you consider a person with basic programming skills and a hypercomputer? I mean I could make something very dangerous, given hypercompute. I'm not sure if I could make much that was safe and still useful. How common would it be to accidentally evolve a race of aliens in the garbage collection?

At the moment, my best guess at what powerful algorithms look like is something that lets you maximize functions without searching through all the inputs. Gradient descent can often find a high point without that much compute, so is more powerful than random search. If your powerful algorithm is more like really good computationally bounded optimization, I suspect it will be about as manipulative as brute forcing the search space. (I see no strong reason for strategies labeled manipulative to be that much easier or harder to find than those that aren't.)

It intuitively seems like you need merely make the interventions run at higher permissions/clearance than the hyperparameter optimizer.

What do I mean by that? In Haskell, so-called monad transformers can add features like nondeterminism and memory to a computation. The natural conflict that results ("Can I remember the other timelines?") is resolved through the order in which the monad transformers were applied. (One way is represented as a function from an initial memory state to a list of timelines and a final memory state, the other as a list of functions from an initial memory state to a timeline and a final memory state.) Similarly, a decent type system should just not let the hyperparameter optimizer see the interventions.

What this might naively come out to is that the hyperparameter optimizer just does not return a defined result unless its training run is finished as it would have been without intervention. A cleverer way I could imagine it being implemented is that the whole thing runs on a dream engine, aka a neural net trained to imitate a CPU at variable resolution. After an intervention, the hyperparameter optimizer would be run to completion on its unchanged dataset at low resolution. For balance reasons, this may not extract any insightful hyperparameter updates from the tail of the calculation, but the intervention would remain hidden. The only thing we would have to prove impervious to the hyperparameter optimizer through ordinary means is the dream engine.

Have fun extracting grains of insight from these mad ramblings :P

This specific problem could easily be fixed, but the problem of the goal not being what we think it is, remains.

See also Kaj's example: https://www.lesswrong.com/posts/Ez4zZQKWgC6fE3h9G/almost-every-powerful-algorithm-would-be-manipulative#vhZ9uvMwiMCepp6jH

Manipulation emerges naturally

Empirical claims. (Creating a specific example (running code) does not demonstrate "natural", but can contribute towards building an understanding of what conditions give rise to the hypothesized behavior, if any.*)

Of course, the manipulation above happened because the programmers didn't understand what the algorithm's true loss function was. They thought it was "minimise overall loss on classification", but it was actually "keep each dataset loss just above 0.1".

This seems incorrect. The scenario highlighted that with that setup, the way "minimise overall loss on classification" was optimized led to the behavior: "keep each dataset loss just above 0.1". Semantics, perhaps, but the issue isn't "the algorithm was accidentally programmed to keep each dataset just above 0.1", rather that is a result of its learning in its setup.

*A tendency to forget things could be a blessing - a representation of the world might not be crafted, and a "manipulative" strategy not found. (One could argue that by this definition humans are "manipulative" if we change our environment - tool use is obviously a form of 'manipulation', if only 'manipulating using our hands/etc.'. Similarly if communication works, it can lead to change...)

There is no clear division, currently, between mild manipulation and disastrous manipulation.

The story didn't seem to include a disaster.

The story didn't seem to include a disaster.

No. That's a separate issue - the reason that mild manipulation could be a problem. If we had a clear division, I wouldn't care about mild manipulation.

instead, if there are hyperparameters that prevent the error rate going below 0.1, these will be selected by gradient descent as giving a better performance.

I don't follow this point. If we're talking about using SGD to update (hyper)parameters, using a batch of images from the currently used datasets, then the gradient update would be determined by the gradient of the loss with respect to that batch of images.

To keep it simple, assume the hyperparameters are updated by evolutionary algorithm or some similar search-then-continue-or-stop process.

I want to flag that—in the case of evolutionary algorithms—we should not assume here that the fitness function is defined with respect to just the current batch of images, but rather with respect to, say, all past images so far (since the beginning of the entire training process); otherwise the selection pressure is "myopic" (i.e. models that outperform others on the current batch of images have higher fitness).
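A minimal sketch of that distinction, assuming a simple evolutionary outer loop (the function names below are hypothetical stand-ins, not a real API):

```python
# Hypothetical sketch: myopic vs. non-myopic fitness for an evolutionary
# search over hyperparameter settings. `run_batch(hyperparams, batch)` is a
# stand-in that returns the error on one batch; it is not a real API.

def myopic_fitness(hyperparams, current_batch, run_batch):
    # Selection pressure only from the current batch: a setting that trades
    # current error for a future advantage is never rewarded for it.
    return -run_batch(hyperparams, current_batch)

def nonmyopic_fitness(hyperparams, all_batches_so_far, run_batch):
    # Selection pressure from cumulative error since training began: a setting
    # whose earlier "sandbagging" kept easy datasets in the mix (lowering later
    # errors) can now outcompete an honest one.
    return -sum(run_batch(hyperparams, batch) for batch in all_batches_so_far)
```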

(I might be over-pedantic about this topic due to previously being very confused about it.)

I'm not sure what you mean by "learn to call the programmers"? As in, in your analogy this sounds similar to reaching an error state... but algorithms are not optimized to reach an error state or to avoid reaching an error state.

You *could*, if you were selecting from loads of algorithms or running the same one many times, end up selecting algorithms that reach an error state very often (which we already do; one of the main meta-criteria for any ML algorithm is basically to fail/finish fast), but that's not necessarily a bad thing.

If producing (or committing to produce) an error state results in a change in utility / fitness, then it may end up optimised for / against.