Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

This post will look at some of the properties of quantilizers, when they succeed and how they might fail.

Roughly speaking, let f be some true objective function that we want to maximise. We haven't been able to specify it fully, so we have instead a proxy function g. There is a cost function c which measures how much g falls short of f. Then a quantilizer will choose actions (or policies) randomly from the top n% of actions available, ranking those actions according to g.
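As a minimal sketch of that selection rule, assuming a finite list of actions and a uniform choice among the top-ranked ones (the function name `quantilize` and its parameters are illustrative, not from any library):

```python
import math
import random

def quantilize(actions, g, n_percent, rng=random):
    """Pick uniformly at random from the top n_percent of actions, ranked by the proxy g.

    Assumptions for illustration: `actions` is a finite list and every action
    in the top set is equally likely to be chosen.
    """
    if not actions:
        raise ValueError("need at least one action")
    if not 0 < n_percent <= 100:
        raise ValueError("n_percent must be in (0, 100]")
    # Rank from best to worst according to the proxy.
    ranked = sorted(actions, key=g, reverse=True)
    # Keep at least one action so the top set is never empty.
    k = max(1, math.ceil(len(ranked) * n_percent / 100))
    return rng.choice(ranked[:k])
```

Setting `n_percent` close to zero recovers something like a plain maximiser of the proxy, while `n_percent=100` ignores the proxy and picks any action at random.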

It is plausible that for standard actions or policies, the true objective f and the proxy g are pretty similar. But when we push towards maximising g, the tiny details where f and g differ will balloon, and the cost c can grow very large indeed.

This could be illustrated roughly by figure I, where g and c are plotted against each other; imagine that c is on a log scale.

The blue areas are possible actions that can be taken. Note a large bunch of actions that are not particularly good for g but have low cost, a thinner tail of more optimised actions that have higher g and still have low cost, and a much thinner tail that has even higher g but high cost. The g-maximising actions with maximal cost are represented by the red star.

Figure I thus shows a situation ripe for some form of quantilization. But consider figure II:

In figure II, the only way to get high g is to have a high c. The situation is completely unsuited for quantilization: any g-maximiser, even a quantilizer, will score terribly under f. But that means mainly that we have chosen a terrible g.

Now, back to figure I, where quantilization might work, at least in principle. The ideal would be situation Ia; here blue represents actions below the top n% cut-off, green those above (which include the edge-case red-star actions, as before):

Here the top n% of actions all score a good value under g, and yet most of them have low cost.

But even within the broad strokes of figure I, quantilization can fail. Figure Ib shows a first type of failure:

Here the problem is that the quantilizer lets in too many mediocre actions, so the expectation of g (and f) is mediocre; with a smaller n, the quantilizer would be better.

Another failure mode is figure Ic:

Here n is too low: all the quantilized solutions have high cost.
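The tension between the Ib and Ic failure modes can be seen in a toy calculation. The sketch below uses an entirely made-up population of 1000 actions (not data from the figures), in which the proxy score g rises steadily and only the ten highest-scoring actions carry a high cost c. It reports the mean proxy score and mean cost over the top n% for a few cut-offs.

```python
import math

def top_fraction_stats(actions, n_percent):
    """Return (mean g, mean c) over the top n_percent of actions, ranked by g."""
    ranked = sorted(actions, key=lambda a: a["g"], reverse=True)
    k = max(1, math.ceil(len(ranked) * n_percent / 100))
    top = ranked[:k]
    mean_g = sum(a["g"] for a in top) / k
    mean_c = sum(a["c"] for a in top) / k
    return mean_g, mean_c

# Hypothetical toy population: g grows with the index, and only the
# ten most g-optimised actions have a high cost.
toy_actions = [
    {"g": i / 1000, "c": 100.0 if i >= 990 else 1.0}
    for i in range(1000)
]

for n in (1, 10, 50):
    mean_g, mean_c = top_fraction_stats(toy_actions, n)
    print(f"n={n:>2}%  mean g={mean_g:.3f}  mean cost={mean_c:.2f}")
```

In this toy population, a 1% cut-off selects only the high-cost actions (the Ic pattern), while a 50% cut-off dilutes the cost but also lets in many actions with a mediocre proxy score (the Ib pattern).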

Another quantilizer design

An idea I had some time ago was that, instead of taking the top n% of the actions, the quantilizer would instead choose among the actions that are within n% of the top g-maximising actions. Such a design would be less likely to encounter situations like Ib, but more likely to face situations like Ic.
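One possible reading of this alternative design is sketched below. It assumes that "within n%" means the proxy score lies within n% of the observed score range below the best score; the post does not pin this down, so the cut-off formula and the name `threshold_quantilize` are assumptions made for illustration.

```python
import random

def threshold_quantilize(actions, g, n_percent, rng=random):
    """Pick uniformly among actions whose proxy score is close to the best score.

    Assumed reading: an action is eligible if its score is no more than
    n_percent of the observed score range below the maximum.
    """
    if not actions:
        raise ValueError("need at least one action")
    if not 0 <= n_percent <= 100:
        raise ValueError("n_percent must be in [0, 100]")
    scores = [g(a) for a in actions]
    g_max, g_min = max(scores), min(scores)
    cutoff = g_max - (n_percent / 100) * (g_max - g_min)
    # The best-scoring action always meets the cutoff, so this list is never empty.
    eligible = [a for a, s in zip(actions, scores) if s >= cutoff]
    return rng.choice(eligible)
```

Unlike a rank-based cut-off, the size of the eligible set here depends on how the scores are spread: if one action scores far above the rest, it may be the only eligible one, which is how this design can slide towards plain maximisation.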

What can be done?

So, what can be done to improve quantilizers? I'll be posting some thoughts as they develop, but there are two ideas that spring to mind immediately. First of all, we can use CUA oracles to investigate the shape of the space of actions, at least from the perspective of g (c, like f, cannot be calculated explicitly).

Secondly, there's an idea that I had around low-impact AIs. Basically, it was to ensure that there was some action the AI could take that could easily reach some approximation of its goal. For instance, have a utility function that encourages the AI to build one paperclip, and cap that utility at one. Then scatter around some basic machinery to melt steel, stretch it, give the AI some manipulator arms, etc. The idea is to ensure there is at least one safe policy that gives the AI some high expected utility. Then if there is one such policy, there are probably a large number of similar policies in its vicinity: safe policies with high expectation. Then it seems that quantilization should work, probably best in its 'within n% of the maximal policy' version (working well because we know the cap of the utility function, hence have a cap on the maximal policy).
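A minimal sketch of how a known cap could be used: with utility capped at one, the cut-off for "within n% of the maximal policy" can be computed from the cap itself, without searching for the best policy. The policy names and outcome probabilities below are invented for illustration, and the cut-off formula is an assumed reading of "within n%".

```python
def capped_utility(paperclips, cap=1.0):
    """Utility grows with paperclips built but never exceeds the cap."""
    return min(float(paperclips), cap)

def expected_capped_utility(outcomes, cap=1.0):
    """`outcomes` is a list of (probability, paperclips built) pairs."""
    return sum(p * capped_utility(count, cap) for p, count in outcomes)

def near_cap_policies(policies, n_percent, cap=1.0):
    """Names of policies whose expected utility is within n_percent of the cap."""
    cutoff = (1 - n_percent / 100) * cap
    return [
        name
        for name, outcomes in policies.items()
        if expected_capped_utility(outcomes, cap) >= cutoff
    ]

# Hypothetical policies with made-up outcome distributions.
toy_policies = {
    "careful_build": [(0.95, 1), (0.05, 0)],
    "sloppy_build": [(0.6, 1), (0.4, 0)],
    "overproduce": [(0.99, 1000), (0.01, 0)],
}

print(near_cap_policies(toy_policies, 10))
```

Because of the cap, building a thousand paperclips earns no more per success than building one, so `overproduce` gains only from its slightly higher success probability. The sketch says nothing about which policies are safe; it only shows how the eligible set is defined once the cap is known.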

Now, how do we know that a safe policy exists? We have to rely on human predictive abilities, which can be flawed. But the reason we're reasonably confident in this scenario is that we believe that we could figure out how to build a paperclip, given the stuff the AI has lying around. And the AI would presumably do better than us.
