Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Crossposted from my blog.

One thing I worry about sometimes is people writing code with optimisers in it, without realising that that's what they were doing. An example of this: suppose you were doing deep reinforcement learning, doing optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario, I'd be wary about your deploying that controller, since the controller itself is doing optimisation which might steer the world into a weird and unwelcome place.

In order to avoid such scenarios, it would be nice if one could look at an algorithm and determine if it was doing optimisation. Ideally, this would involve an objective definition of optimisation that could be checked from the source code of the algorithm, rather than something like "an optimiser is a system whose behaviour can't usefully be predicted mechanically, but can be predicted by assuming it near-optimises some objective function", since such a definition breaks down when you have the algorithm's source code and can compute its behaviour mechanically.

You might think about optimisation as follows: a system is optimising some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing. This type of definition includes those put forward by Yudkowsky and Oesterheld. However, I think there are crucial counterexamples to this style of definition.

Firstly, consider a lid screwed onto a bottle of water. If not for this lid, or if the lid had a hole in it or were more loose, the water would likely exit the bottle via evaporation or being knocked over, but with the lid, the water stays in the bottle much more reliably than otherwise. As a result, you might think that the lid is optimising the water remaining inside the bottle. However, I claim that this is not the case: the lid is just a rigid object designed by some optimiser that wanted water to remain inside the bottle.
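As a rough illustration of why the behavioural definition ends up counting the lid, here is a toy sketch that scores a system by how much higher the objective is with it than without it. The evaporation rate, knock-over probability, and function names are all made-up assumptions for illustration, not measurements or anyone's formal definition.

```python
import random


def water_left(has_lid: bool, days: int, rng: random.Random) -> float:
    """Toy model (all numbers are assumptions): fraction of water left after `days`."""
    water = 1.0
    for _ in range(days):
        if not has_lid:
            water *= 0.95               # assumed daily evaporation loss
            if rng.random() < 0.02:     # assumed chance the bottle gets knocked over
                water = 0.0
    return water


def behavioural_score(days: int = 30, trials: int = 1000, seed: int = 0) -> float:
    """Average objective with the system present minus the average with it absent."""
    rng = random.Random(seed)
    with_lid = sum(water_left(True, days, rng) for _ in range(trials)) / trials
    without_lid = sum(water_left(False, days, rng) for _ in range(trials)) / trials
    return with_lid - without_lid


score = behavioural_score()
print(f"the lid raises the expected water fraction by {score:.2f}")
```

The score comes out large even though `water_left` contains no search, no comparison of options, and no objective: the "system" is a single boolean. That is the sense in which a purely behavioural test cannot tell the lid apart from something that is actually optimising.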

This isn't an incredibly compelling counterexample, since the lid doesn't qualify as an optimiser according to Yudkowsky's definition: it can be more simply described as a rigid object of a certain shape than as an optimiser, so it isn't an optimiser. I am somewhat uncomfortable with this move (surely systems that are sub-optimal in complicated ways that are easily predictable by their source code should still count as optimisers?), but it's worth coming up with another counterexample to which this objection won't apply.

Secondly, consider my liver. It's a complex physical system that's hard to describe, but if it were absent or behaved very differently, my body wouldn't work, I wouldn't remain alive, and I wouldn't be able to make any money, meaning that my bank account balance would be significantly lower than it is. In fact, subject to the constraint that the rest of my body works in the way that it actually works, it's hard to imagine what my liver could do which would result in a much higher bank balance. Nevertheless, it seems wrong to say that my liver is optimising my bank balance, and more right to say that it "detoxifies various metabolites, synthesizes proteins, and produces biochemicals necessary for digestion"---even though that gives a less precise account of the liver's behaviour.

In fact, my liver's behaviour has something to do with optimising my income: it was created by evolution, which was sort of an optimisation process for agents that reproduce a lot, which has a lot to do with me having a lot of money in my bank account. It also sort of optimises some aspects of my digestion, which is a necessary sub-process of me getting a lot of money in my bank account. This explains the link between my liver function and my income without having to treat my liver as a bank account funds maximiser.

What's a better theory of optimisation that doesn't fall prey to these counterexamples? I don't know. That being said, I think that it should involve the internal details of the algorithms implemented by those physical systems. For instance, I think of gradient ascent as an optimisation algorithm because I can tell that at each iteration, it improves on its objective function a bit. Ideally, with such a definition you could decide whether an algorithm was doing optimisation without having to run it and see its behaviour, since one of the whole points of a definition of optimisation is to help you avoid running systems that do it.
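To make the gradient ascent remark concrete, here is a minimal sketch on a hypothetical one-dimensional concave objective. The objective, step size, and names are assumptions chosen for illustration; the per-step improvement shown here relies on the step size being small enough for this particular objective and is not guaranteed in general.

```python
def objective(x: float) -> float:
    """Hypothetical concave objective with its maximum at x = 3."""
    return -(x - 3.0) ** 2


def gradient(x: float) -> float:
    """Derivative of `objective` with respect to x."""
    return -2.0 * (x - 3.0)


def gradient_ascent(x0: float, step: float = 0.1, iterations: int = 50):
    """Repeatedly move x a small distance in the direction of the gradient."""
    x = x0
    values = [objective(x)]
    for _ in range(iterations):
        x = x + step * gradient(x)      # the update rule that makes this "ascent"
        values.append(objective(x))
    return x, values


x_final, values = gradient_ascent(x0=-5.0)

# Check the property the post appeals to: each iteration does at least as well as the last.
improved_every_step = all(later >= earlier for earlier, later in zip(values, values[1:]))
print(x_final, improved_every_step)
```

The point of interest is that the improving step is visible in the source: the line `x = x + step * gradient(x)` can be read off without running the loop, which is the kind of static evidence the paragraph above is asking for.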

Thanks to Abram Demski, who came up with the bottle-cap example in a conversation about this idea.


Daniel Filan's bottle cap example was featured prominently in "Risks from Learned Optimization" for good reason. I think it is a really clear and useful example of why you might want to care about the internals of an optimization algorithm and not just its behavior, and helped motivate that framing in the "Risks from Learned Optimization" paper.

Daniel Filan's bottle cap example

Note that Abram Demski deserves a large part of the credit for that specific example (somewhere between 'half' and 'all'), as noted in the final sentence of the post.


A reminder, since this looks like it has a few upvotes from AF users: posts need 2 nominations to proceed to the review round. 

Review by the author:

I continue to endorse the contents of this post.

I don't really think about the post that much, but the post expresses a worldview that shapes how I do my research - that agency is a mechanical fact about the workings of a system.

To me, the main contribution of the post is setting up a question: what's a good definition of optimisation that avoids the counterexamples of the post? Ideally, this definition would refer or correspond to the mechanistic properties of the system, so that people could somehow statically determine whether a given controller was an optimiser. To the best of my knowledge, no such definition has been developed. As such, I see the post as not having kicked off a fruitful public conversation, and its value if any lies in how it has changed the way other people think about optimisation.

Yes, I kind of agree with you

I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets, and that the naive definition seems to work with one modification:

A system is downstream from an optimizer of some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing.

I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets.

Yes, this seems pretty important and relevant.

That being said, I think that that definition suggests that natural selection and/or the earth's crust are downstream from an optimiser of the number of Holiday Inns, or that my liver is downstream from an optimiser of my income, both of which aren't right.

Probably it's important to relate 'natural subgoals' to some ideal definition - which offers some hope, since 'subgoal' is really a computational notion, so maybe investigation along these lines would offer a more computational characterisation of optimisation.

[EDIT: I made this comment longer and more contentful]

Okay, so another necessary condition for being downstream from an optimizer is being causally downstream. I'm sure there are other conditions, but the claim still feels like an important addition to the conversation.

I think what we need is some notion of mediation. That is, a way to recognize that your liver's effects on your bank account are mediated by effects on your health and it's therefore better thought of as a health optimizer.

This has to be counteracted by some kind of complexity penalty, though, or else you can only ever call a thing a [its-specific-physical-effects-on-the-world]-maximizer.

I wonder if we might define this complexity penalty relative to our own ontology. That is, to me, a description of what specifically the liver does requires lots of new information, so it makes sense to just think of it as a health optimizer. But to a medical scientist, the "detoxifies..." description is still pretty simple and obviously superior to my crude 'health optimizer' designation.

The model of the bank account compresses the target function of the brain, even when expressed in terms of specific physical effects on the world. Further, the model of health compresses the target function of the liver better than the bank account.

Let me see if I got it right:

  1. Defining optimizers as an unpredictable process maximizing an objective function does not take into account algorithms that we can compute

  2. Satisfying the property (P) "give the objective function higher values than a non-existence baseline" is not sufficient:

  • the lid satisfies (P) with "water quantity in bottle" but is just a rigid object that some optimizer put there. However, not the best counter-example because not a Yudkowskian optimizer.
  • if a liver didn't exist or did other random things then humans wouldn't be alive and rich, so it satisfies (P) with "money in bank account" as the objective function. However, the better way to account for its behaviour (cf. Yudkowskian definition) is to see it as a sub-process of an income maximizer created by evolution.
  3. One property that could work: have a step in the algorithm that provably augments the objective function (e.g. gradient ascent).

Properties I think are relevant:

  • intent: the lid did not "choose" to be there, humans did
  • doing something that the outer optimizer cannot do "as well" without using the same process as the inner optimizer: it would be very tiring for humans to use our hands as lids. Humans cannot play Go as well as AlphaZero without actually running the algorithm.

I think my syntax/semantics idea is relevant to this question - especially the idea of different sets of environments.

For example, suppose we have a super-intelligent bottle cap, dedicated to staying on the bottle (and with some convenient manufacturing arms and manufacturing capability). This seems to be exactly an optimiser, one that we mere humans cannot expect to be able to get off the bottle.

In contrast the standard bottle cap will only remain on the bottle in a much narrower set of circumstances (though the superintelligent bottle cap will also remain on in those circumstances).

So it seems that what distinguishes the standard bottle cap from a genuine optimiser, is that the genuine optimiser will accomplish its role in a much larger set of (possibly antagonistic) environments, while the standard bottle cap will only do so in a much smaller set of circumstances.
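One way to sketch this "set of environments" idea is to score a system by the fraction of environments in which it achieves its role. Everything here is a hypothetical stand-in: environments are reduced to a single force applied to the cap, the thresholds are invented, and weighting the environments uniformly is an assumption that the proposal itself does not settle.

```python
from typing import Callable, Iterable

Environment = float  # hypothetical: force applied to the cap, in arbitrary units


def standard_cap(force: Environment) -> bool:
    """Stays on only under mild disturbance (assumed threshold)."""
    return force < 10.0


def adaptive_cap(force: Environment) -> bool:
    """Stand-in for the super-intelligent cap: stays on under far more force (assumed)."""
    return force < 1000.0


def success_fraction(system: Callable[[Environment], bool],
                     environments: Iterable[Environment]) -> float:
    """Fraction of environments in which the system fulfils its role, weighted uniformly."""
    envs = list(environments)
    return sum(system(env) for env in envs) / len(envs)


environments = [1.0, 5.0, 50.0, 500.0]   # hypothetical sample of circumstances
standard = success_fraction(standard_cap, environments)
adaptive = success_fraction(adaptive_cap, environments)
print(standard, adaptive)
```

The comparison depends entirely on which environments are included in the list and how they are weighted, so a different choice of `environments` could make the two caps look identical.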

A larger set of circumstances... how are you counting circumstances? How are you weighting them? It's not difficult to think of contexts and tasks where boulders outperform individual humans under the realistic distribution of probable circumstances.

It's helped me hone my thinking on what is and isn't an optimiser (and a wireheader, and so on, for associated concepts).

Can you define it in terms of "sensory", "motor", and "processing"? That is, in order to be an optimizer, you must have some awareness of the state of some system; at least two options for behavior that affect that system in some way; and a connection from awareness to action that tends to increase some objective function.

Works for bottle cap: no sensory, only one motor option.

Works for liver: senses blood, does not sense bank account. Former is a proxy for latter but a very poor one.

For bubbles? This definition would call bubbles optimizers of finding lower pressure areas of liquid, iff you say that they have the "option" of moving in some other direction. I'm OK with having a fuzzy definition in this case; in some circumstances, you might *want* to consider bubbles as optimizers, while in others, it might work better to take them as mechanical rule-followers.
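The sensory/motor/processing proposal can be sketched as a structural check on a description of a system. The `System` type, its field names, and the example systems are hypothetical; the check covers only the first two conditions plus the presence of some sensing-to-action link, and does not test whether that link "tends to increase some objective function".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class System:
    """Hypothetical description of a system for the purposes of this check."""
    name: str
    senses: frozenset = frozenset()                       # variables it has awareness of
    actions: tuple = ()                                   # behaviours available to it
    policy: Optional[Callable[[dict], str]] = None        # link from awareness to action


def passes_structural_test(system: System, target: str) -> bool:
    """Necessary (not sufficient) conditions for optimising `target` under the proposal."""
    has_sensor = target in system.senses
    has_choice = len(system.actions) >= 2
    has_processing = system.policy is not None
    return has_sensor and has_choice and has_processing


bottle_cap = System(name="bottle cap", actions=("stay put",))
liver = System(
    name="liver",
    senses=frozenset({"blood_chemistry"}),
    actions=("detoxify", "idle"),
    policy=lambda obs: "detoxify" if obs.get("blood_chemistry") == "toxic" else "idle",
)

print(passes_structural_test(bottle_cap, "water_in_bottle"))
print(passes_structural_test(liver, "blood_chemistry"))
print(passes_structural_test(liver, "bank_balance"))
```

The bubble case is fuzzy in exactly the place this sketch makes explicit: the verdict turns on whether you list one entry or several in `actions`.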


Discernment seems to be part of the definition. Choosing A and not B. And then having amplified potential to optimise optimising. Choosing the self choice of what that thing is.

There is no such thing as an optimizer except in the mind of a human anthropomorphizing that entity. I wrote about it some time ago. Let me quote, sorry it is long. One can replace "agent" with "optimizer" in the following.

... Something like a bacterium. From the human point of view, it is alive, and has certain elements of agency, like the need to feed, which it satisfies by, say, moving up a sugar gradient toward richer food sources so it can grow. It also divides once it is mature enough, or has reached a certain size. It can die eventually, after multiple generations, and so on.
The above is a very simplified black-box description of bacteria, but still enough to make at least some humans care to preserve it as a life form, instead of coldly getting rid of it and reusing the material for something else. Where does this compassion for life come from? I contend that it comes from the lack of knowledge about the inner workings of the “agent” and consequently lack of ability to reproduce it when desired.
I give a simple example to demonstrate how lack of knowledge makes something look “alive” or “agenty” to us and elicits emotional reactions such as empathy and compassion. Enter
Let’s take a… pot of boiling water. If you don’t have an immediate intuitive picture of it in mind, here is a sample video. Those bubbles look almost alive, don’t they? They are born, they travel along a gradient of water pressure to get larger, while changing shape rather chaotically, they split apart once they grow big enough, they merge sometimes, and they die eventually when reaching the surface. Just like a bacterium.
So, a black-box description of bubbles is almost indistinguishable from a black-box description of something that is conventionally considered alive. Yet few people feel compelled to protect bubbles, say, by adding more water and keeping the stove on, and most have no qualms whatsoever about turning off the boiler and letting the bubbles “die”. How come?
There are some related immediate intuitive explanations for it:
We know “how the bubbles work” — it’s just water vapor after all! The physics is known, and the water boiling process can be numerically simulated from the relevant physics equations.
We know how to make bubbles at will — just pour some water into a pot and turn the stove on.
We don’t empathize with bubbles as potentially experiencing suffering, something we may do when observing, say, a bacterium writhe and try to escape when encountering an irritant.
We see all bubbles as basically identical, with no individuality, so a disappearing bubble does not feel like a loss of something unique.
Thus something whose inner workings we understand down to the physical level and can reproduce at will without loss of anything “important” no longer feels like an agent. This may seem rather controversial. Say, you poke a worm and it wriggles and squirms, and we immediately anthropomorphize this observation and compare it to human suffering in similar circumstances. Were we to understand the biology and the physics of the worm, we might conclude that the reactions are more like that of a wiggling bubble than that of a poked human, assuming the brain structure producing the quale “suffering” does not have an analog in the worm’s cerebral ganglion. Alternatively, we might conclude that worms do have a similar structure, producing suffering when interacted with in a certain way, and end up potentially extending human morals to cover worms, or maybe also bacteria. Or even bubbles.

There is no such thing as an optimizer except in the mind of a human anthropomorphizing that entity.

Is there some other set of concepts that don't exist only in the human mind?

I claim that this is wrong: I can understand down to the physical level and reproduce at will something which implements the UCB algorithm, and it still seems like an optimisation algorithm to me.
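For readers unfamiliar with it, here is a minimal sketch of the UCB1 variant of the algorithm on a Bernoulli multi-armed bandit. The arm probabilities, round count, and function name are assumptions for illustration, and UCB1 is one common member of the UCB family rather than necessarily the exact version the comment has in mind.

```python
import math
import random


def ucb1(arm_means, rounds, seed=0):
    """Run UCB1 on Bernoulli arms with the given (hypothetical) success probabilities.

    Returns how many times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # pulls per arm
    totals = [0.0] * n_arms    # summed reward per arm

    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1        # pull every arm once so each has an estimate
        else:
            # Choose the arm with the highest upper confidence bound:
            # empirical mean plus an exploration bonus that shrinks with more pulls.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward

    return counts


counts = ucb1([0.2, 0.5, 0.8], rounds=5000)
print(counts)
```

Every step is a few lines of arithmetic that can be understood and reproduced at will, yet the `max(...)` over arms is recognisably a selection in favour of higher payoff.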

Hmm, I don't have a good understanding of this algorithm. From your link, I gather that this is still an agent that follows the algorithm, not a physical system without an agent anywhere in there, like, say, a chess bot. But I could be misunderstanding.

Is there a difference between what you call optimizer and what Paul Christiano calls daemon?

I think that everything that Paul would call a daemon, I would call an optimiser.

Things that I would call optimisers that Paul would (probably?) not call daemons:

  • A program that ran gradient descent in order to solve a linear regression problem.
  • The UCB algorithm that optimises payoffs in multi-armed bandit problems.