(This work was supported by CEEALAR. Thanks to Michele Campolo and others for conversations.)
I - Digression, aggression
The perspective arrived at in this sequence is almost aggressively third-person. Sure, I allow that you can have first-person stories about what you value, but these stories only rise in prominence to an outside observer if they're supported by cold, hard, impersonal facts. What's more, there can be multiple first-person stories that fit the data, and these different stories don't have to be aggregated as if we're under Bayesian uncertainty about which one is the One True story, they can all be acceptable models of the same physical system.
This can be unintuitive and unpalatable to our typical conception of ourselves. The assumption that there's a unique fact of the matter about what we're feeling and desiring at any given time is so convenient as to be almost inescapable. But if even knowing the exact state of my brain leaves room for different inferences about what I want, just like we can impute different sets of preferences to a thermostat, then our raw intuition that we have just one set of True Values is no foundation to build a value learning AI on.
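To make the thermostat point concrete, here is a toy illustration (my construction, not code from the sequence): the very same thermostat behavior is rationalized by more than one imputed "preference", so observing the behavior alone cannot tell us which preferences the system "really" has.

```python
def thermostat_policy(temp, setpoint=20.0):
    """Turn the heater on below the setpoint, off at or above it."""
    return "heat" if temp < setpoint else "off"

def next_temp(temp, action):
    """Crude dynamics: heating raises the temperature by one degree."""
    return temp + 1.0 if action == "heat" else temp

def values_a(temp, action):
    """Story A: the thermostat 'wants' the room as close to 20 C as possible."""
    return -abs(next_temp(temp, action) - 20.0)

def values_b(temp, action):
    """Story B: the thermostat 'wants' to heat exactly when it is cold,
    and is indifferent to the temperature itself."""
    return 1.0 if (action == "heat") == (temp < 20.0) else 0.0

# In every state, the policy's chosen action is optimal under BOTH stories,
# so behavior alone cannot distinguish which preferences to impute.
for temp in (15.0, 18.0, 22.0):
    chosen = thermostat_policy(temp)
    for values in (values_a, values_b):
        assert values(temp, chosen) == max(values(temp, a) for a in ("heat", "off"))
```

Both stories fit the data perfectly in ordinary conditions; they only come apart in situations the thermostat never faces, which is exactly the shape of the problem for humans.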
This sequence, which is nominally about Goodhart's law, has also been a sidelong way of attacking the whole problem of value learning. Contrary to what we might have intended when we first heard about the problem, the solution is not to figure out what and where human values Truly are, and then use this to do value learning. The solution is to figure out how to do value learning without needing True Values at all.
If you are one of the people currently looking for the True Values, this is where I'm calling you out. For your favorite value-detection procedure, I'd suggest you nail down how this procedure models humans, and why that way is good, and what the advantages and disadvantages are of the reasonable alternatives. I am happy to provide specific suggestions or argue with you in the comments.
II - Summing up
In the penultimate post we rephrased some mechanisms of Goodhart's law in terms of modeled human preferences rather than True Values. The old-style arguments would give some reason why models give different recommendations in some circumstances, and then say "the choices get pushed away from those recommended by the True Values, which is Goodhart's law." We replaced these with new-style arguments, optimized for the issues that come up when trying to learn human values, that instead say "the choices recommended by different models get pushed away from each other, which is Goodhart's law."
While writing these posts, I've had a joke stuck in my head. A patient goes to the doctor and says "Doctor, it hurts when I do this." The doctor says "Then don't do that!"
The goal is to be in the situation of the patient. If the "non-obvious" sort of Goodhart's law kicks in when we go to a part of state space where otherwise-good models of us get pushed away from each other, can't we just... not do that?
Unfortunately we can't translate that into code, because I've stuffed the unsolved problems into that qualifier "otherwise-good" (we'll get back to that). But whatever route we take to end up with an AI that makes the future go well, it's got to "not do that" somehow. This is interesting to me because it seems like such a weird and kludgy property to ask for if you assume that humans have True Values. And yet it makes perfect sense if you treat us as physical systems that admit multiple interpretations.
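If we assume the hard part away, the "don't do that" heuristic does have a simple shape. A minimal sketch, assuming we are simply handed a set of otherwise-good value models (which is precisely the unsolved part): refuse any plan that pushes those models apart.

```python
def models_agree(plan, value_models, tolerance):
    """True if every model scores the plan within `tolerance` of the others."""
    scores = [m(plan) for m in value_models]
    return max(scores) - min(scores) <= tolerance

def choose_plan(plans, value_models, tolerance=0.5):
    """Among plans the models agree on, pick the best according to the most
    pessimistic model; if none survive, do nothing rather than Goodhart."""
    safe = [p for p in plans if models_agree(p, value_models, tolerance)]
    if not safe:
        return None
    return max(safe, key=lambda p: min(m(p) for m in value_models))

# Two toy models that agree on ordinary plans but diverge on an extreme one.
model_a = lambda p: p if p <= 3 else 2.0 * p   # extrapolates optimistically
model_b = lambda p: p if p <= 3 else 0.0       # extrapolates pessimistically

best = choose_plan([0, 1, 2, 3, 10], [model_a, model_b])
assert best == 3  # the extreme plan (10) is excluded despite model_a loving it
```

All the difficulty is hidden in where the `value_models` and `tolerance` come from; the sketch only shows that, given them, "not doing that" is a coherent decision rule rather than a vague hope.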
Because we can't translate it into code, this sequence is a failure at sketching a solution to Goodhart's law. I hear you cry: "What have you been doing this whole time, then, if not sketching a solution?!" But this was never about a specific solution - this was about sketching a proof that when you naturalize Goodhart's law, a solution is at all possible. The engineering problem does not have to be solved in the same way as the conceptual problem, any more than working out the math of a nuclear chain reaction as a function of the concentration of uranium means that our reactors have to use magic concentration-changing fuel instead of control rods.
Even before understanding how to get an ML system to make agent-shaped models of us, we can still use this framework as a lens when looking at proposed Goodhart's law countermeasures. We want plans that maintain sensitivity to the broad spectrum of human values, that allow us to express our meta-preferences (by which I mean how we want to be modeled - perhaps an unwise choice of terminology), that are conservative about pushing the world off-distribution, and of course that don't do obviously bad things.
III - Solution-shaped objects
Using this lens, here are some disconnected thoughts about a few things that work, or have been proposed to work.
In the real world, you can often beat Goodhart's law by hitting it with your wallet. If rewarding dolphins for each piece of trash they bring you teaches them to rip up trash into smaller pieces, you can inspect the entire pool before giving the reward. Simple, but not easy.
The generalization of this is evaluation procedures that are more expensive for the AI to fake than to be honest about. This works (albeit slowly) when you understand the AI's actions, but then fails dramatically when you don't. Interpretability tools might help us understand AIs that are smart enough to be transformative, but would still run into limits as those AIs' cognition grows less human-intelligible.
Since we get to build the AI, we can do better by connecting supervision to value learning. The evaluations of the AI's plans that we laboriously produce can be used as a training signal for how it evaluates policies, similar to Deep RL From Human Preferences, or the motive behind Redwood's violence-classification project. (There's also an analogy to offline supervised learning from a dataset with lots of effortful human introspection.)
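Here's a stripped-down sketch of the reward-modeling idea behind Deep RL From Human Preferences (my simplification, not the paper's implementation: a linear reward model and plain gradient steps on the Bradley-Terry cross-entropy loss over pairwise comparisons).

```python
import math

def segment_reward(weights, segment):
    """Total reward of a trajectory segment under a linear reward model."""
    return sum(sum(w * x for w, x in zip(weights, state)) for state in segment)

def preference_prob(weights, seg_a, seg_b):
    """P(human prefers seg_a) = sigmoid(reward(seg_a) - reward(seg_b))."""
    return 1.0 / (1.0 + math.exp(segment_reward(weights, seg_b)
                                 - segment_reward(weights, seg_a)))

def train_step(weights, seg_a, seg_b, human_chose_a, lr=0.1):
    """One gradient step on -log P(human's choice); returns updated weights."""
    p = preference_prob(weights, seg_a, seg_b)
    err = (1.0 if human_chose_a else 0.0) - p
    feat_diff = [sum(s[i] for s in seg_a) - sum(s[i] for s in seg_b)
                 for i in range(len(weights))]
    return [w + lr * err * d for w, d in zip(weights, feat_diff)]

# The human consistently prefers the segment with the larger feature value;
# the learned reward model comes to predict that preference.
weights = [0.0]
seg_good, seg_bad = [(1.0,)], [(0.0,)]
for _ in range(200):
    weights = train_step(weights, seg_good, seg_bad, human_chose_a=True)
assert preference_prob(weights, seg_good, seg_bad) > 0.9
```

The expensive human judgments are compressed into a reusable reward model, which is exactly why the scheme scales better than direct supervision and also exactly where the modeling choices get baked in.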
Does expensive supervision handle a broad spectrum of human preferences? Well, humans can give supervisory feedback about a broad spectrum of things, which is like learning about all those things in the limit of unlimited human effort. If they're just evaluating the AI's proposed policies, though, their ability to provide informative feedback may be limited.
Does it allow us to express meta-preferences? This is a bit complicated. The optimistic take is that direct feedback is supposed to decrease the need for meta-preferences by teaching the AI to directly model the reward function, not to model humans. The pessimistic take is that there's a fixed model of humans that says what we really want is whatever controls the supervision process, and we're hoping that's good.
Is it conservative about pushing the world off-distribution? This scheme relies on human supervision for this safety property. So on average yes, but the supervision will be limited and expensive, and there may be edge cases that have no homeostatic mechanism.
Does it avoid obviously bad things? As long as we understand the AI's actions, this is great at avoiding obviously bad things. Human understanding can be fragile, though - in Deep RL From Human Preferences, humans who tried to train a robot to pick up a ball were fooled when it merely interposed its hand between the camera and the ball.
Avoiding large effects:
Various proposals to avoid side-effects, do extra-mild optimization, and avoid gaining or losing too much power are all focusing on the "be conservative about going off-distribution" part of avoiding Goodhart's law. Most of these try to entirely avoid certain ways of leaving the training distribution, which can be a heavy cost if we want to make renovations to the galaxy.
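One flavor of "extra-mild optimization" is the quantilizer. A toy version (the utilities here are made up for illustration): instead of taking the argmax action, sample uniformly from the top q fraction of actions under a base distribution. Roughly, this bounds the expected harm from errors in the utility function to 1/q times that of the base distribution, at the cost of never optimizing very hard.

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Sample uniformly from the top-q fraction of actions by utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

rng = random.Random(0)
action = quantilize(range(100), utility=lambda a: a, q=0.1, rng=rng)
assert 90 <= action < 100  # in the top decile, but not necessarily the argmax
```

Turning `q` down makes the optimizer milder and safer but also weaker, which is the "heavy cost if we want to make renovations to the galaxy" in miniature.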
You can still have bad things on-distribution, which these say little about, but there's a sort of "division of labor" approach going on here. You can use one technique to avoid pushing the world off-distribution, a different technique to avoid obviously bad things, and then to account for meta-preferences you... uh... well...
There's a bit of a dilemma with meta-preferences. Either you fulfill them, in which case you are basically postulating that you have an aligned AI and then swaddling it in an extra layer of constraints. Or, more likely, you don't, and you have an unaligned AI that you are trying to restrict to a domain you understand well enough to check its work. This dilemma might seem weird if you aren't thinking of it as vital to model humans the way they want to be modeled, rather than whatever way is most convenient for the AI. On the first horn the swaddling isn't really an active ingredient in avoiding Goodhart's law, and on the second horn I'm not optimistic about the usefulness, as remarked in Goodhart Ethology.
Another approach we might take is that the AI should imitate humans. Not merely imitating actions, but imitating human patterns of thought: done faster, or longer, or better in ways that our unmodified selves want.
Actually doing this would be quite a trick. But suppose it's been done - how does "human reasoning but a bit better" do on Goodhart's law?
As far as I can tell, pretty well! Imitating human reasoning would cover a broad spectrum of human preferences, would allow us to express a restricted but arguably crucial set of meta-preferences, and would use imitation human judgment in avoiding extremes. It would do better than humans at avoiding most obviously bad things, which is... not a ringing endorsement, but it's a start.
The big issue here (aside from how you learned to imitate human reasoning in the first place) might be a forced tradeoff between competitiveness and qualitatively human-like reasoning. Knobs cannot be turned up very far before we start worrying that the emergent behavior doesn't have to reflect the values of the human-reasoning-shaped building blocks. Advocates of imitating humans sometimes pivot here to talk about imitating just the important parts while scaling up other cognitive capabilities, which sounds great but has lots of unsolved conceptual issues.
You might get a sense that this is cheating - aren't we just putting the human utility function inside the AI, just like old-style Goodhart intended? One difference is in how stringent our desiderata are. Even if a perfect human imitation will fulfill humans' True Values (let's not look too closely at this assertion right now), perfection doesn't exist. Only with more relaxed desiderata about what counts as making things go well can we make peace with the inevitable approximations and imperfections.
IV - Unsolved problems
I really waved my hands a lot in this sequence. How can an AI represent human preferences in a way that makes sense when translated across ontologies? How is the AI supposed to identify our preferences about how we want to be modeled and then apply them to its own modeling process? How on earth are we supposed to write or teach the meta-preferences I take as a starting point, when that contains complicated things like how to model the whole world and pick out which systems in it are "the humans"?
I think all of these problems would reward more thought and research. If we want to align AI by learning human values, the more we understand problems like these, the more principled our choices can be. The hypothetical value learning scheme that I've sort of been gesturing at in this sequence would only become practical if all of these problems had individually practical resolutions.
Since I lacked those resolutions, I chose the topic of this sequence so I could "cheat." For the key arguments, it doesn't matter how these things get done, or how practical they are, so long as they're possible. We started with the quandary of realizing that humans are physical systems, and therefore admit different ways of being described. If we assume that we can model humans, and that we have some commonsense opinions about how we should be modeled, then we can recover the usual uses of Goodhart's law - the modeled preferences will break down in extreme cases, but with a few more reasonable assumptions about meta-preferences we can explain why it makes total sense to content ourselves with the non-extremes. This is a great outcome of trying to naturalize Goodhart's law, because it means that you don't have to be perfect to solve it! Actually exhibiting one of these solutions... well, that's a problem for our future selves.