Bayesian Evolving-to-Extinction

by abramdemski, 14th Feb 2020


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The present discussion owes a lot to Scott Garrabrant and Evan Hubinger.

In Defining Myopia, I formalized temporal or cross-instance myopia / non-myopia, but I claimed that there should also be some kind of single-instance myopia which I hadn't properly captured. I also suggested this in Predict-O-Matic.

This post is intended to be an example of single-instance partial agency.

Evolving to Extinction

Evolution might be myopic in a number of ways, but one way is that it's myopic across individuals -- it typically produces results very different from what group selection would produce, because it's closer to optimizing relative fitness of individuals (relative to each other) than it is to optimizing overall fitness. Adaptations which help members of a species compete with each other are a great example of this. Why increase your own fitness, when you can just decrease someone else's instead? We're lucky that it's typically pretty hard, at least historically, to do things which are bad across the board but slightly less bad for the one doing them. Imagine a "toxic gas gene" which makes the air harder for everyone to breathe, but slightly less so for carriers of the gene. Such a gene would be selected for. This kind of thing can be selected for even to the point where it drives the population of a species right down to zero, as Eliezer's essay on evolving to extinction highlighted.

Actually, as Eliezer's essay emphasized, it's not even that evolution is myopic at the level of individuals; evolution is myopic down to the level of individual genes, an observation which better explains the examples of evolving-to-extinction which he discusses. (This is, of course, the point of Dawkins' book The Selfish Gene.) But the analogy of myopia-across-individuals will suit me better here.

Bayes "Evolving to Extinction"

The title of this post is hyperbole, since there is no analog of an extinction event in the model I'm about to describe. But it illustrates that, in extreme circumstances, a Bayesian learner can exhibit the same kind of pathological behavior that evolution does when it ends up selecting for relative fitness in a way which pumps against absolute fitness.

Like evolution, Bayes' Law will "optimize"[1] for relative fitness of hypotheses, not absolute fitness. Ordinarily there isn't enough of a difference for this to matter. However, I've been discussing scenarios where the predictor can significantly influence what's being predicted. Bayes' Law was not formulated with examples like this in mind, and we can get pathological behavior as a result.
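To make "relative fitness" concrete, here is Bayes' Law written out for a finite set of hypotheses. The symbols h_i (hypotheses) and e (the observed event) are notation introduced just for this illustration.

```latex
P(h_i \mid e) = \frac{P(e \mid h_i)\, P(h_i)}{\sum_j P(e \mid h_j)\, P(h_j)},
\qquad
\frac{P(h_i \mid e)}{P(h_k \mid e)} = \frac{P(e \mid h_i)}{P(e \mid h_k)} \cdot \frac{P(h_i)}{P(h_k)}.
```

The posterior odds between two hypotheses depend only on the ratio of their likelihoods. Multiplying every hypothesis's likelihood by the same factor leaves the posterior unchanged, so a hypothesis gains nothing from an event that all hypotheses predicted equally well, and gains the most from an event that it predicted and the others did not.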

One way to construct an example is to imagine that there is a side-channel by which hypotheses can influence the world. The "official" channel is to output predictions; but let's say the system also produces diagnostic logs which predictors can write to, and which humans read. A predictor can (for example) print stock tips into the diagnostic logs, to get some reaction from humans.

Say we have a Bayesian predictor, consisting of some large but fixed number of hypotheses. An individual hypothesis "wants" to score well relative to others. Let's also say, for the sake of argument, that all hypotheses have the ability to write to diagnostic logs, but humans are more likely to pay attention to the diagnostics for more probable hypotheses.

How should a hypothesis make use of this side-channel? It may initially seem like it should use it to make the world more predictable, so that it can make more accurate predictions and thus get a better score. However, this would make a lot of hypotheses score better, not just the one printing the manipulative message. So it wouldn't really be selected for.

Instead, a hypothesis could print manipulative messages designed to get humans to do things which no other hypothesis anticipates. This means specifically optimizing for events to happen which the rest of the system assigns low probability. Hypotheses which successfully accomplish this will get a large boost in relative predictive accuracy, making them more probable according to Bayes' Law.

So, a system in this kind of situation eventually winds up being dominated by hypotheses which manipulate events to be as unpredictable as possible (by that very system), subject to the constraint that one hypothesis or another within the system can predict them.
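The contrast between the two strategies above can be checked with a toy calculation. This is only a sketch under made-up assumptions: there are four equally probable hypotheses, hypothesis 0 is the one writing to the logs, and the numbers 0.9 and 0.1 are illustrative likelihoods, not anything derived from a real system. The function name `bayes_update` is likewise hypothetical.

```python
import math


def bayes_update(prior, likelihoods):
    """Return (posterior, evidence) after one observation.

    likelihoods[i] is the probability hypothesis i assigned to the
    event that actually happened. evidence is the probability the
    whole mixture assigned to that event.
    """
    joint = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    return posterior, evidence


# Assumed setup: four hypotheses with a uniform prior.
# Hypothesis 0 is the one printing messages to the diagnostic logs.
prior = [0.25, 0.25, 0.25, 0.25]

# Strategy A: make the world more predictable. Every hypothesis
# assigned high probability to the event that occurs.
post_a, evidence_a = bayes_update(prior, [0.9, 0.9, 0.9, 0.9])

# Strategy B: cause an event that only hypothesis 0 anticipated.
post_b, evidence_b = bayes_update(prior, [0.9, 0.1, 0.1, 0.1])

# Log loss of the system as a whole on the observed event.
loss_a = -math.log(evidence_a)
loss_b = -math.log(evidence_b)

print("A: posterior of hypothesis 0 =", post_a[0], "system log loss =", loss_a)
print("B: posterior of hypothesis 0 =", post_b[0], "system log loss =", loss_b)
```

Under these assumed numbers, the predictable-world strategy leaves hypothesis 0 at its prior of 0.25, while the surprise strategy moves it to 0.75, even though the system as a whole pays a higher log loss for the surprising event.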

This is very much like what I called the entropy-market problem for futarchy, also known as the assassination-market problem. (Any prediction market involving the lifespan of public figures is equivalent to an assassination market; it pays for the death of public figures, since that is a hard-to-predict but easier-to-control event.)

Analogous problems arise if there is no side-channel but the prediction itself can influence events (which seems very plausible for realistic predictions).

Is This Myopia?

If we use "myopia" to point to the kind of non-strategic behavior we might actually want out of a purely predictive system, this isn't myopia at all. For this reason, and for other reasons, I'm more comfortable throwing this under the umbrella term "partial agency". However, I think it's importantly related to myopia.

  • Just like we can think of evolution as myopically optimizing per-individual, uncaring of overall harm to reproductive fitness so long as that harm goes along with improvements to individual relative fitness, we can think of Bayes' Law as myopically optimizing per-hypothesis, uncaring of overall harm to predictive accuracy.
  • The phenomenon here doesn't illustrate the "true myopia" we would want of a purely predictive system, since it ends up manipulating events. However, it at least shows that there are alternatives. One might have argued "sure, I get the idea of cross-instance myopia, showing that per-instance optimization is (possibly radically) different from cross-instance optimization. But how could there be per-instance myopia, as distinct from per-instance optimization? How can partial agency get any more partial than myopically optimizing individual instances?" Bayes-evolving-to-extinction clearly shows that we can break things down further. So perhaps there's still room for a further "true myopia" which codifies non-manipulation even for single instances.
  • This phenomenon also continues the game-theoretic theme. Just as we can think of per-instance myopia as stopping cross-instance optimization by way of a Molochian race-to-the-bottom, we see the same thing here.

Neural Nets / Gradient Descent

As I've mentioned before, there is a potentially big difference between multi-hypothesis setups like Bayes and single-hypothesis setups like gradient-descent learning. Some of my arguments, like the one above, involve hypotheses competing with each other to reach Molochian outcomes. We need to be careful in relating this to cases like gradient descent learning, which might approximate Bayesian learning in some sense, but incrementally modifies a single hypothesis rather than letting many hypotheses compete.

One intuition is that stochastic gradient descent will move the network weights around, so that we are in effect sampling many hypotheses within some region. Under some circumstances, the most successful weight settings could be the ones which manipulate things to maximize local gradients in their general direction, which means punishing other nearby weight configurations -- this could involve increasing the loss, much like the Bayesian case. (See Gradient Hacking.)

There is also the "lottery ticket hypothesis" to consider (discussed on LW here and here) -- the idea that a big neural network functions primarily like a bag of hypotheses, not like one hypothesis which gets adapted toward the right thing. We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.

More formally, though, we can point to some things which are moderately analogous, but not perfectly.

If we are adapting a neural network using gradient descent, but there is a side-channel which we are not accounting for in our credit assignment, then the gradient descent will not optimize the side-channel. This might result in aimless thrashing behavior.

For example, suppose that the loss explicitly depends only on the output X of a neural net (i.e., the gradient calculation is a gradient on the output). In reality, however, the loss also depends on an internal node Y, in the following way:

  • When |X-Y| is high, the loss function rewards X being high.
  • When |X-Y| is low, the loss function rewards X being low.
  • When X is high, the loss function rewards low |X-Y|.
  • When X is low, the loss function rewards high |X-Y|.
  • When both values are middling, the loss function incentivizes X to be less middling.

This can spin around forever. It is of course an extremely artificial example, but the point is to demonstrate that when gradient descent does not recognize all the ways the network influences the result, we don't necessarily see behavior which "tries to reduce loss", or even appears to optimize anything.
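As a rough illustration of the "spinning", here is one hypothetical way to turn the first four bullet rules into update dynamics. Everything specific is an assumption made for the sketch: `x` stands for the output X and `d` for |X-Y|, both recentred so that 0 means "middling"; each rule becomes a push of fixed strength `lr`; and the function names are invented. The fifth rule is not modeled separately, although the discrete steps used here happen to drift outward on their own.

```python
import math


def step(x, d, lr=0.05):
    """One update of the assumed dynamics (centred coordinates)."""
    new_x = x + lr * d  # high |X-Y| rewards high X, low |X-Y| rewards low X
    new_d = d - lr * x  # high X rewards low |X-Y|, low X rewards high |X-Y|
    return new_x, new_d


def simulate(x=0.1, d=0.0, steps=600, lr=0.05):
    """Run the dynamics, tracking distance from the middle and total rotation."""
    radii = []
    total_angle = 0.0
    prev_angle = math.atan2(d, x)
    for _ in range(steps):
        x, d = step(x, d, lr)
        radii.append(math.hypot(x, d))
        angle = math.atan2(d, x)
        # Unwrap the angle difference into [-pi, pi) before accumulating.
        delta = (angle - prev_angle + math.pi) % (2 * math.pi) - math.pi
        total_angle += delta
        prev_angle = angle
    return radii, total_angle


radii, total_angle = simulate()
print("full turns around the middle:", abs(total_angle) / (2 * math.pi))
print("first and last distance from the middle:", radii[0], radii[-1])
```

In this sketch the state keeps circling the middle point and never settles. The push field (d, -x) is purely rotational, so it is not the gradient of any scalar function, which is one way of seeing that these updates are not descending any loss.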


  1. The whole point of the partial agency sequence is that words like "optimize" are worryingly ambiguous, but I don't have sufficiently improved terminology yet that I feel I can just go ahead and use it while maintaining clarity!! In particular, the sense in which Bayesian updates optimize for anything is pretty unclear when you think about it, yet there is certainly a big temptation to say that they optimize for predictive accuracy (in the log-loss sense).
