In this post, I’ll be looking at a more extreme version of the problem in the previous post.
The Extremal Goodhart problem is the fact that “Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.”
And one of the easiest ways to reach extreme values, and very different worlds, is to have a powerful optimiser at work. So forget about navigating in contained mazes: imagine me as an AI that has a great deal of power over the world.
Suppose I was supposed to cure cancer. To that end, I’ve been shown examples of surgeons cutting out cancer growths. After the operation, I got a reward score, indicating, let’s say, roughly how many cancerous cells remained in the patient.
Now I’m deployed, and have to infer the best course of action from this dataset. Assume I have three main courses of action open to me: cutting out the cancerous growths with a scalpel, cutting them out much more effectively with a laser, or dissolving the patient in acid.
Cutting out the growth with a scalpel is what I would do under apprenticeship learning. I’d be just trying to imitate the human surgeons as best I can, reproducing their actions and their outcomes.
This is a relatively well-defined problem, but the issue is that I can’t do much better than a human. Since the score gives a grading to the surgeries, I can use that information as well, allowing me to imitate the best surgeries.
But I can’t do much better than that. I can only do as well as the mix of the best features of the surgeries I’ve seen. Nevertheless, apprenticeship learning is the “safe” base from which to explore more effective approaches.
Laser surgery is something that gives me a high score, in a new and more effective way. Unfortunately, dissolving the patient in acid is also something that fulfils the requirements of getting rid of the cancer cells, in a new and (much) more effective way. Both methods involve me going off-distribution compared with the training examples. Is there any way of distinguishing between the two?
First, I could get the follow-up data on the patients (and the pre-operation data, though that is less relevant). I want to get a distribution of the outcomes of the operations, so that I can make sure the outcomes of my own operations have similar features.
So, I’d note some things that correlate with a high operation score versus a low operation score. The following are plausible features I might find correlating positively with high operation score; each is labelled with how desirable the correlate actually is:
- Surviving operation (desirable)
- Complaining about pain (undesirable)
- Surviving for some years after (desirable)
- Paying more taxes (incidental)
- Being more prone to dementia (undesirable)
- Thanking the surgeon (incidental)
This is a mix of human-desirable features (surviving operation; surviving for some years after), some features only incidentally correlated with desirable features (thanking the surgeon; paying more taxes - because they survived longer, so paid more) and some that are actually anti-correlated with what humans want (being more prone to dementia - again, because they survived longer; complaining about pain - because those who are told the operation was unsuccessful have other things to complain about).
If I aim to maximise the reward while preserving this feature distribution, this is enough to show that laser surgery is better than dissolving the patient in acid. This can be seen as maintaining the “web of connotations” around successful cancer surgery. Deviation from this web of connotations/feature distribution acts as an impact measure for me to minimise, while I also try to maximise the surgery score.
Of course, if the impact measure is too strong, I’m back with apprenticeship learning with a scalpel. Even if I do opt for laser surgery, I’ll be doing some other things to maintain the negative parts of the correlation - making sure to cause some excessive pain, and ensuring their risk of dementia is not reduced. And prodding them to thank me, and to fill out their tax returns.
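To make the trade-off concrete, here is a minimal sketch of "surgery score minus a penalty for drifting away from the training feature distribution". Everything in it is an assumption made for illustration: the feature names, the scores, the outcome rates, the `impact` measure (mean absolute difference in feature rates) and the `weight` placed on it are all invented, not taken from any real dataset or from a specific published algorithm.

```python
# Hypothetical feature rates seen after high-scoring human surgeries.
REFERENCE = {
    "survives_operation": 0.95,
    "survives_some_years": 0.80,
    "thanks_surgeon": 0.90,
    "complains_about_pain": 0.60,
}

# Hypothetical (surgery score, predicted feature rates) for each course of action.
ACTIONS = {
    "scalpel": (0.70, dict(REFERENCE)),  # imitation: same outcomes as training
    "laser": (0.90, {
        "survives_operation": 0.98,
        "survives_some_years": 0.90,
        "thanks_surgeon": 0.92,
        "complains_about_pain": 0.40,
    }),
    "acid": (1.00, {name: 0.0 for name in REFERENCE}),  # no cancer, no patient
}


def impact(features, reference=REFERENCE):
    """Mean absolute gap between predicted and reference feature rates."""
    gaps = [abs(features[name] - reference[name]) for name in reference]
    return sum(gaps) / len(gaps)


def best_action(weight):
    """Pick the action maximising score minus weight times impact."""
    def objective(name):
        score, features = ACTIONS[name]
        return score - weight * impact(features)
    return max(ACTIONS, key=objective)


for weight in (0.0, 1.0, 5.0):
    print(weight, best_action(weight))
```

With these made-up numbers, a weight of zero picks the acid, a moderate weight picks the laser, and a very large weight falls back to the scalpel. Note that the penalty treats all features alike, so it also pushes me to keep the pain complaints at their training level.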
Quantilisers aim to avoid Goodhart problems by not choosing the reward-optimising policy, but one that optimises the reward only to a certain extent.
The problem is that there is no obvious level to set that “certain extent” to. Humans are flying blind, trying to estimate both how much I can maximise the proxy in a pernicious way, and how low my reward objective needs to be, compared with that, so that I’m likely to find a non-pernicious policy.
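For comparison, a quantiliser can be sketched in a few lines. This is a simplified version under stated assumptions: the base distribution is represented by a finite list of sampled actions (say, things a human surgeon might plausibly do), `proxy_reward` is a hypothetical scoring function, and `q` is the fraction of top-ranked samples I allow myself to choose from.

```python
import random


def quantilise(base_samples, proxy_reward, q, rng):
    """Pick at random from the top q fraction of base_samples by proxy reward."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=proxy_reward, reverse=True)
    keep = max(1, int(len(ranked) * q))  # always keep at least one candidate
    return rng.choice(ranked[:keep])


# Stand-ins for 100 sampled human-like actions, scored by a toy proxy.
actions = list(range(100))
choice = quantilise(actions, proxy_reward=lambda a: a, q=0.1, rng=random.Random(0))
# choice is one of the ten highest-scoring samples, not necessarily the best one
```

Setting `q` close to 1 just reproduces the base distribution; setting it very small collapses into plain maximisation of the proxy. Nothing in the code says which value in between is safe, which is exactly the calibration problem.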
Here, what I am trying to preserve is a certain distribution of outcome features, rather than a certain percentage of the optimal outcome. This is easier to calibrate, and to improve on, if I can get human feedback.
The obvious thing is to ask humans about which features are correlated with their true preferences, and which are not. Now, the features I find are unlikely to be expressible so neatly in human terms, as written above. But there are plausible methods for me to figure out valuable features on my own, and the more experience I have with humans, the more I know the features they think in terms of.
Then I just need to figure out the right queries to get that information. With it, I might be able to figure out that surviving is a positive, while pain and dementia are not.
But I’m still aware of the Goodhart problem, and the perennial issue of humans not knowing their own preferences well. So I won’t maximise survival or painlessness blindly, just aim to increase them. Increase them how much? Well, increase them until their web of connotations/feature distribution starts to break down. “Not in pain” does not correlate with “blindly happy and cheerful and drugged out every moment of their life”, so I’ll stop maximising the first before it reaches the second. Especially since “not in pain” does correlate with various measures of “having a good life” which would vanish in the drugged out scenario.
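One way to picture this stopping rule, as a rough sketch: I try increasing levels of a feature, and at each level check how strongly it still goes together with its usual connotations. The names `push_until_breakdown`, `connotation_strength`, `baseline` and `tolerance` are hypothetical, and the table of numbers is invented; in practice, estimating the strength of these connotations is the hard part.

```python
def push_until_breakdown(levels, connotation_strength, baseline, tolerance=0.8):
    """Return the highest level reached before the connotations weaken too much.

    connotation_strength(level) is a hypothetical estimate of how well the
    feature still correlates with its usual companions at that level.
    Returns None if even the lowest level is already below the threshold.
    """
    chosen = None
    for level in sorted(levels):
        if connotation_strength(level) < tolerance * baseline:
            break  # the web of connotations is starting to break down
        chosen = level
    return chosen


# Invented table: fraction of patients "not in pain" -> correlation of
# "not in pain" with measures of "having a good life" at that level.
STRENGTH = {0.5: 0.60, 0.6: 0.61, 0.7: 0.58, 0.8: 0.55, 0.9: 0.30, 1.0: 0.02}

target = push_until_breakdown(STRENGTH, STRENGTH.get, baseline=0.60)
print(target)
```

In this toy table the correlation collapses between 0.8 and 0.9, so I stop at 0.8 rather than pushing on to the drugged-out end of the scale.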
What would be especially useful would be human examples of “it’s ok to ignore/minimise that feature when you act”. So not only examples of human surgeries with scores, but descriptions of hypothetical operations (descriptions provided by them or by me) that are even better. Thus, I could learn that “no cancer, no pain, quickly discharged from hospital, cheap operation” is a desirable outcome. Then I can start putting pressure on those features as well, pushing until the web of connotations/feature distribution gets too out of whack.
Of course, I still know that humans will underestimate uncertainty, even when trying not to. So I should add an extra layer of conservatism on top of what they think they require. The best way of doing this is by maximising situations that allow humans to articulate their preferences - namely making small changes initially, which I gradually increase, and getting feedback on those changes and on how they imagine future changes.
But even given that, I have to take into account that I have superior ability to find unexpected ways of maximising rewards, and a pressure to describe these in more human-seductive ways (this pressure can be implicit, just because humans would naturally choose these options more often). And so I can consider my interactions with humans as a noisy, biased channel (even if I’ve eliminated all noise and bias, from my perspective), and be cautious about this too.
However, unlike the maximal conservatism of the Inverse Reward Design paper, I can learn to diminish my conservatism gradually. Since I’m also aware of issues like "outer loop optimisation" (the fact that humans tuning models can add a selection pressure), I can also take that into account. If I have access to the code of my predecessors, and knowledge of how it came to be that I replaced them, I can try and estimate this effect as well. As always, the more I know about human research on outer loop optimisation, the better I can account for this - because this research gives me an idea of what humans consider impermissible outer loop optimisation.