Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary: this post reframes ideas from "Learning a concept using only positive examples" and "The AI, the best human advisor" in terms of a strong constraint for conservative behavior, and shows that quantilizers are optimal under this constraint.

Defining a quantilizer

A -quantilizer is an agent that, when faced with a decision problem, gives a random action in the top proportion by expected utility. More formally, let be the action set and be some distribution over actions. For any , a -quantilizer returns an action with probability

where is the quantilizer's utility function. The weird logic in the third case is necessary for handling ties; the details are not especially important. Naturally, a 0-quantilizer is just a maximizer. A 0.5-quantilizer (which we could also call a 50-percentilizer) samples a random above-average action, a 0.01-quantilizer samples a random action in the top 1%, and a 1-quantilizer samples a random action from .

The conservative cost constraint

Why might we use a quantilizer? Suppose we believe there is some unknown cost function , which quantifies the negative contribution of the action to our utility function (e.g. we could define , where is our true utility function). For example, an action that turns the universe into paperclips would be very costly. As explained previously in "Learning a concept using only positive examples", the fact that an agent's distribution over actions fits under indicates that it does not accrue many times more expected cost than would be accrued by sampling from . It turns out that for , a -quantilizer maximizes expected utility subject to the conservative cost constraint

The cost constraint states that, for any cost function, the expected cost of an action drawn from is no more than times the expected cost of an action drawn from .

Optimality of quantilizers under the cost constraint

From the cost constraint we can prove

Suppose this were not true. Then for some we have . Now define . We have Also, Jointly, these statements contradict our cost constraint.

Since we cannot assign more that probability to any action , the best we can do is assign probability to the best actions and 0 to the rest. This gets us back the definition of a quantilizer.

Repeated quantilization of games is not quantilization of repeated games

Suppose we have 2 independent games. On game , we choose an action and receive utility , where each independently comes from some distribution over worlds. We could -quantilize each game independently, so we choose from the best propertion of actions in each game independently. Alternatively, we could -quantilize the pair of the games. Here we define and . The expected utility of an action pair is

A -quantilizer will sample a random pair of actions from the best proportion of the action pairs. For example, if each game consists of saying a number in the unit interval and then gaining that much utility, and is uniform, then a 12.5-percentilizer will sample uniformly from the set . On the other hand, if we 12.5-percentilize each game independently, we will sample our action pair uniformly from the set .

What could explain this difference? If all of our cost comes from action pairs in the set (so outside this set), then uniformly sampling from will accrue a cost 64 times that of sampling from twice independently. On the other hand, uniformly sampling from will accrue a cost only 8 times that of sampling from twice independently.

The additive cost constraint

Suppose we added the assumption that is additive over the two games, so for some . Then we get the additive cost constraint

Then a repeated -quantilizer would get total expected cost so repeated quantilization would be considered acceptable under the additive cost constraint.

However, adding this assumption may be dangerous. It is possible to create scenarios in which almost all of the cost comes from correlation between actions in different trials.

For example, say that we're creating a stock-trading AI. Every minute, the AI either decides to buy or sell a single share of Google. We can set to be a uniform distribution over . For some reason, there is another computer somewhere that detects whether the AI bought or sold a share each minute appends a bit (0 for buy or 1 for sell) to a file accordingly. It runs this file (say, as a Python program) after collecting a million bits.

Now suppose the AI sells the stock only on minutes corresponding to 1 bits in a UFAI source code (which is less than a million bits long). This would cause the UFAI to be run. Since the probability that would make any one of these decisions (to buy or sell a share in a particular minute) is 50%, if we assumed that the cost was additive over iterations, then we would conclude that this AI is only twice as costly as independent samples from would be. While independent quantilization would not lead to this outcome, this hypothetical scenario proves that assuming that the cost function is additive can be very bad.

Questions for future research

Are there reasonable less-restrictive constraints on cost that allow something like independent quantilization?

Intuitively, it seems like independent quantilization is usually not dangerous. Perhaps there is some cost constraint that allows independent quantilization (or something like it) while forbidding the UFAI encoder. We want something less restrictive than the original cost constraint, but more restrictive than the additive cost constraint.

Can we use this framework to define conservative classifiers (and other learners)?

We would like to create classifiers that learn from training data and, when deciding what to do with a new example, either classify it (if it looks like the training data) or refuse to classify it (if it looks quite different from the training data). For example, if we train an AI to classify smiling humans as positive and frowning humans as negative, perhaps we want it to refuse to classify molecular smiley faces as either positive or negative. These classifiers should usually classify points when they come from the same distribution as the training data, and they should obey some kind of conservative cost constraint. I am working on writing up a solution to this for a future forum post.

When do "ordinary" distributions have high cost?

It could be that most ordinary actions that people take are either very good or very bad (and a superintelligence could predict which one it is), due to butterfly effects. In this case, the expected cost of sampling from fairly ordinary distributions is actually quite high, but it is cancelled out by an equally high expected benefit. However, a quantilizer may preserve most of the expected cost of without preserving the expected benefit. How much of a problem is this in practice, and how can it be reduced (e.g. is it possible to use false thermodynamic miracles to interfere with butterfly effects)?

Can we use something like a quantilizer to "win"?

How could an organization with access to superintelligent AI use a conservative agent (such as a quantilizer or a conservative classifier) to drastically reduce existential risk (e.g. by shutting down rogue AI projects)? It would be useful to know what kind of capabilities would be necessary to solve the strategic situation, so that we can see if a conservative agent can deliver these capabilities.

New Comment
3 comments, sorted by Click to highlight new comments since:

In general it seems like there is no tractable algorithm that samples from the given space, or performs well on the given conservative cost maximization game. On top of that, it's not clear how you would train a conservative-maximizer of this kind; e.g. even if you have a simple model class that contains a good quantilizer, you don't have access to the objective and so can't train your model in any normal way.

The most obvious solution to both problems (and a few others) is to maximize over efficient and learnable cost functions c rather than all cost functions. This almost immediately gives you a practical algorithm which we could begin to test (you can train both c and the quantilizer in parallel e.g. by gradient descent). It also looks a lot closer to things that people do in practice---this is a sensible-looking adjustment to normal apprenticeship learning.

If is not too low, then you can do this by taking a bunch of samples and evaluating them by expected utility. Of course, it might be expensive to evaluate this many samples.

I think that you can also do this with an adversarial game, as in your post on mimicry. You can have one system that takes some action, and another system that bets at some odds that the action was produced by the AI rather than the base distribution. This seems to work without learning the cost function.

I was imagining the case where is too slow, i.e. where we want the AI to actually perform a search.

The second paragraph is what I had in mind. Note that in this case you are maximizing over learnable cost functions rather than all cost functions.