In 2015, Jessica introduced quantilization as a countermeasure for Goodhart's Law and specification-gaming. Since these are such central problems in AI safety, I consider quantilization to be one of the best innovations in AI safety so far, but it has received little attention from the AI safety field. I think one reason for this is that researchers aren't quite clear what problem, formally, quantilization solves, that other algorithms don't. So in this piece, I define a **robust reward problem**, and then discuss when quantilization solves this problem well, and when it doesn't.

**Definition 1 (Robust reward problem)**

*We can define a robust reward problem as a tuple: . *

* is the action space*

* is the explicit reward*

* is the space of implicit rewards, a family of functions *

* is a distribution over actions*

* is the maximum implicit loss in the demonstrations*

*The goal of the agent is to maximize for any . If that was the end of the definition, the task would be too difficult, because an adversarial reward could thwart any strategy. So we need to assume that is pretty well-behaved in the region . *

*Formally, the goal is to select a strategy that maximizes the following:*

The intuition behind (1) is that a teacher may forget to include some possible failures in their reward function, but they ought not to leave out the same mistakes that they themselves frequently make. And at any rate, without (1), assuring good performance is impossible.

### When quantilization works

*You can skip this section if you're familiar with Jessica's work*

If we set , we recover the original setting in which quantilizers were described. Then (as V Kosoy has argued) the ordinary quantilizer theorems mean that we get the best lower bound for .

We will need to reuse the definitions, so to briefly recap:

**Definition 2 (Quantilizer)**. *A q-quantilizer is an agent that, when faced with a decision problem, returns a random action in the top q proportion of the base distribution , sorted by the explicit expected utility achieved if that action is executed. *

Jessica proves three things. First, that a quantilizer does not have much worse implicit loss than the base distribution :

Where is the distribution over actions selected by the quantilizer.

Second, she proves that no other strategy can get a lower-bound better than:

(There is no lower bound if there exists such that .)

This means the quantilizer has the best guarantee that can be obtained. Basically, any strategy that puts much more weight on any action can be defeated by an adversarial loss function that places all of the loss there. In a continuous action space, this is really bad. It means simply optimizing can lead to arbitrarily bad values of and hence , because the action you choose might contain unlimited counterfeit utility. More surprisingly, the same is true for agents that maximize a mixed reward, such as because these mixed reward agents also tend to land on one fixed action. In the literature, there are many variations on this theme. One can start with imitating, and then optimize and imitate in alternating steps, and so on. But any common scheme, and any remotely similar scheme can land on a fixed action and have unbounded loss.

Anyway, the third thing that Jessica proves is that if you can't let be too high anywhere, then the best thing you can do from the point of view of obtaining is to select the top actions according to with some probability and to place probability on all other actions, in other words, to quantilize. The important thing to notice is that our -guarantee is in terms of , and given a certain , the most is obtained by quantilizing. As V Kosoy has said, quantilizing with different values of carves out the pareto frontier of and guaranteed- , and so the maximum guaranteed-V is also obtained by quantilizing. Specifically, one is guaranteed

For this kind of robust reward problem, quantilization does best. Imitation, which is the subtype of quantilization with obtains a guarantee on , whereas optimization has that is bounded above, and so is doomed!

So Jessica tells a compelling story about one particular situation (with ), where actions cannot be (much) better than expected, but can be much worse. But what if we remove that assumption by choosing a value of ?

### When optimization and imitation work better than quantilization

Suppose that . Actions can be arbitrarily good, but losses are bounded. In this situation, any strategy has a bound on , which is . Given this, you might as well pick the best action every time, giving the following guarantee:

Alternatively, you can imitate, and get the guarantee:

Which course of action is preferable depends on the values of and .

Suppose alternatively that . Then, the losses of an optimizer are unbounded, but the losses of any -quantilizer with are unbounded too. In this case, the only way to get any lower-bound on your losses is to just imitate the given distribution, having your strategy be equal to the base distribution . Then you obtain the bound

### How it can depend on U

Now that we've covered a few values of , let's double down on considering . In this scenario, we said that quantilization is optimal, but we didn't yet say whether the best form of quantilization might have (optimization) or (imitation).

Intuitively, if U is completely flat, it makes sense to perform imitation, because diverging from has something to lose but nothing to gain. Conversely, if is zero (there is no hidden loss), then one can optimize, so long as one is constrained to the support of , because the hidden loss is zero for that set of actions. But can we say more?

The case of optimization is pretty clear-cut, because for any , an infinite amount of loss is incurred. Optimization would only maximize if increases faster than hyperbolically as . Basically, this would require that diverged, which would be a really pathological situation. So we can basically rule out optimization for , .

What about imitation? Imitation will be optimal for many reward functions. Basically, decreasing just increases while decreasing . If there exists some sweet-spot of actions that are pretty common but substantially outperform imitation, then quantilization will be best, and otherwise, the best approach is imitation.

### Which set of assumptions best matches reality?

In general, our actions, or those of an AI can bring astronomical benefits or harms, so or is unrealistic.

When training for a fully general, autonomous task, it is apt to model the scenario as , because the demonstrated actions could have complex downstream effects (see Taylor on "butterfly effects") that bear out on the whole light cone. But at least, in this setting, we can take consolation that imitation is theoretically safe, and try to advance projects like brain-emulation and factored cognition, that would imitate human reasoning. The disadvantage of these proposals is that they basically can only make speed-superintelligence, rather than quality-superintelligence.

The question of whether Jessica's assumption of is a reasonable model for some tasks is interesting. Following Jessica, we need (1) it to be sufficient to perform some small factor better than a human demonstrator, (2) for the human to encode all important information *either* in the explicit utility function *or* in the demonstrations, and (3) for the AI system not to decrease the frequency of astronomical, unseen positive impacts. For quantilization to do any better than imitation, we do also need (4) to have sufficient slope that it is worthwhile to deviate from imitation. It would certainly be nice if some realistic, and potentially pivotal tasks could be quantilized, but I think the jury is still out, and now primarily awaits experimental investigation.

Am I correct in saying that this suggests avoiding Goodhart's law by using pass/fail grading? Or at least, by putting a maximum on artificial rewards, such that optimizing for the reward is senseless beyond that point?

Let's take a common case of Goodhart's law: teachers who are paid based on their students' test scores. Imagine that teachers are either good or bad, and can either teach to the test (strategize) or not. Both true and measured performance are better on average for good teachers than for bad, but have some random variance. Meanwhile, true performance is better when teachers don't strategize, but measured performance is better when they do.

If good teachers care to some degree about true performance, and you set an appropriate cutoff and payouts, the "quantilized" equilibrium will be that good teachers don't strategize (since they're relatively confident that they can pass the threshold without it), but bad teachers do (to maximize their chances of passing the threshold). Meanwhile, good teachers still get higher average payouts than bad teachers. This is probably better than the Goodhart case where you manage to pay good teachers a bigger bonus relative to bad teachers, but all teachers strategize to maximize their payout. So this formalization seems to make sense in this simple test case.

ETA: I was trying to succinctly formalize the example above and I got as far as (U~𝒩(μ(teacher)-δ*strategy,σ²); I=-2δ*strategy ) but that is taking I as the difference between the test score and the true utility, not separating out test scores from payouts, and I don't want to write out all the complications that result from that so I quit. I hope that the words are enough to understand what I meant. Also I don't know why I was doing that via unicode when I should have just used LaTeX.

At least typically, we're talking about a strategy in the following sense. Q: Suppose you want to pick a teacher for a new classroom, how should you pick a teacher? A: you randomly sample from teachers above some performance threshold, in some base distribution. This works best given some fixed finite amount of "counterfeit performance" in that distribution.

If we treat the teachers as a bunch of agents, we don't yet have a game-theoretic argument that we should actually expect the amount of counterfeit performance (I) to be bounded. It might be that all of the teachers exploit the metric as far as they can, and counterfeit performance is unbounded...

I don't fully understand the rest of the comment.

Fair enough. You were thinking about the problem from the point of view of hiring a teacher; when projecting it onto the problem from the point of a teacher deciding how to teach, I had to make additional assumptions not in the original post (ie, that "teachers care about true performance to some degree").

Still, I think that putting it in concrete terms like this helped me understand (and agree with) the basic idea.

This is a rough draft, so pointing out any errors by email or PM is greatly appreciated.

Someone maybe interested what is "quantilization" in a more layman terms, and it is explained in the section 2 of Jessika's article.