Models Modeling Models

Charlie Steiner

I - Meanings of words

Now that we have more concrete thinking under our belt, it's time to circle back to Goodhart's law for value learners. What sorts of bad behavior are we imagining from future value-learning AI? What makes those behaviors plausible? And what makes them bad?

Let's take the last point first. Judgments of goodness or badness are situated in models - models of the world that we use to infer and operationalize human values. And we don't just use the same one all the time.

When I say "I like dancing," this is a different use of the word 'like,' backed by a different model of myself, than when I say "I like tasting sugar." The model that comes to mind for dancing treats it as one of the chunks of my day, like "playing computer games" or "taking the bus." I can know what state I'm in (the inference function of the model) based on seeing and hearing short scenes. Meanwhile, my model that has the taste of sugar in it has states like "feeling sandpaper" or "stretching my back." States are more like short-term sensations, and the described world is tightly focused on my body and the things touching it.

Other models work too! That's fine, there's plenty to go around.

The meta-model that talks about me having preferences in both of these models is the framing of competent preferences. If someone or something is observing humans, it looks for human preferences by seeing what the preferences are in "agent-shaped" models that are powerful for their size^[1].

So when we call some AI behavior "bad," this is a word whose meaning depends on usage and context, but ultimately bottoms out in implied models of the world. It's like a Winograd schema - like how English-readers infer that "they" in "workers put down the boxes because they were tired" refers to the workers, the "like" in "I like dancing" is understood to use a certain perspective on how I am modeling and interacting with the world.

All of this should be taken with the caution that there's not one True Model in which the True Meaning of the word "bad" is expressed. Obviously you still have to make some choice in practice, but the point is that the way you make this choice doesn't have to look like resolving epistemic uncertainty about which model is the True Model^[2].

II - Model conflicts

What were the patterns that stood out from the previous discussion of what humans think of as bad behavior in value learning?

The most common type of failure, especially in modern-day AI, is when humans are actively wrong about what's going to happen. They have something specific in mind when designing an AI, like training a boat to win the race, but then they run it and don't get what they wanted. The boat crashes and is on fire. We could make the boat racing game more of a value learning problem by training on human demonstrations rather than the score, and crashing and being on fire would still be bad.

For simple systems where humans can understand the state space and picture what we want, this is the only standard you need, but for more complicated systems (e.g. our galaxy) humans can only understand small parts or simple properties of the whole system, and we apply our preferences to those parts we can understand. From the inside, it can be hard to feel the difference, because we want things about tic-tac-toe or about the galaxy with the same set of emotions. But when trying to infer human preferences, there's going to be ambiguity and preference conflicts about the galaxy in a way that never shows up in tic-tac-toe.

This is a key point. Inter-preference conflicts aren't an issue that ever comes up if you think of humans as having a utility function, but they're almost unavoidable if you think of humans as physical systems with different possible models. We can't fit the whole galaxy into our heads, nor could evolution fit it into our genes, and so out of necessity we have to use simple heuristics that work well pragmatically but don't always play nicely together, even in our everyday lives.

Bad preference aggregation can lead to new kinds of bad behavior that don't make much sense in the Absolute Goodhart picture of human preferences. An AI that resolves every seemingly-even deadlock of human moral intuitions by picking whichever answer leads to the most paperclips seems bad, even though it's hard to put your finger on what's wrong on the object level.

That's an extreme example, though. A value learner can fail at resolving preference conflicts without any ulterior motive, in cases where humans have competent intuitions about what the conflict-resolution process should look like. If I like dancing, and I like tasting sugar, it's obvious to me that what I shouldn't do is never go dancing so that I can stay at home and continually eat sugar.

The line between different sorts of bad behavior is blurry here. The obviousness that I shouldn't become a sugar-hermit can be thought of either as me doing preference aggregation between preferences for tasting sugar and dancing, or as an object-level preference in a more fine-grained and comprehensive model of my states and actions. But I don't want to be modeled in the most fine-grained way^[3]. So at the very first step of trying to choose between plans, we immediately need to use my meta-preferences to reason correctly.

III - Meta-preferences

The meta-preferences an AI should learn include how we want to be modeled, which preferences we endorse and which we don't, how to resolve preference conflicts, etc. These opinions are inferred from humans' words and actions, and like other preferences they're limited in scope and can come into conflict.

Learning and representing these meta-preferences is a pit full of unsolved problems. One issue is that how an AI learns and represents stuff depends on its entire design, and everyone disagrees on how to design AGI. But even in toy models accessible today, we quickly run into difficulty - this does have a silver lining, I think, because it means we can do useful work right now on learning meta-preferences.

If we consider an AGI that's an instruction-following language model, meta-preferences might be represented as text about text, like "Saying 'It's good to rob a bank' is bad," or text about the design of the model itself. But although language models are good at stating meta-preferences, I'm currently unsatisfied with the prospective ways to act on them (e.g.). It's hard for a language model to re-evaluate the way it models me based on a text description of how I want to be modeled.

AGI based on model-based reinforcement learning has a quite different set of problems. If the AI models itself, and its own operations, then our preferences about how we want it to model us aren't much harder to connect to actions in the world than our other preferences. But how are we supposed to get any human preferences learned reliably? With the language model we could agree to pretend that it's going to end up aligned-ish, because it learns the human text generating process and very little else. Such a story is harder to come by for an AI with a more general world model trained with self-supervised predictive loss. Still, I think all of these are problems that can be worked on, not necessarily fatal flaws.

A further complication (perhaps not meta-preferences' fault, but certainly associated with them) is that where our value-learning AI eventually ends up in preference-space depends on where it starts. This can lead to certain problems (Stuart), and we might want to better understand this process and make sure it leads somewhere sensible (me). However, some amount of this dynamic is essential; for starters, picking out humans as the things whose values we want to learn (rather than e.g. evolution) has the type signature of meta-preference. Learning human meta-preferences can push you around in preference-space, but you've still got to start somewhere.

How does all this connect back to Goodhart? I propose that a lot of the feeling of unease when considering value learning schemes reliant on human modeling is because we don't think they'd satisfy our meta-preferences. If the value learning AI is modeling us in an alien way, even if there's some setting of its parameters that would lead to outcomes we approve of, it feels like it would be surrounded on all sides by steep cliffs with spikes at the bottom. This pointlike nature of the "True Values" is a key component of Absolute Goodhart arguments.

IV - Meandering about domains of validity

A meta-preference that I think is crucial for making our lives easier is a sort of conservatism, where we prefer to keep the world inside the domain of validity of our preferences. What's a domain of validity, anyhow?

Option one: The domain of validity comes bundled with the model of the world. This is like Newtonian mechanics coming with a disclaimer on it saying "not valid above 0.1 c." This way keeps things nice and simple for our limited brains, but is clunky to use in abstract arguments.

Option two: We could have a plethora of different models of the world, and where they broadly agree we call it a "domain of validity," and as they agree less, we trust them less. When I talk about individual preferences having a domain of validity, we can translate this to there being many similar models that use variations on this preference, and there's some domain where they more or less agree, but as you leave that domain they start disagreeing more and more^[4].

Our models in this case have two roles; they make predictions about the world, and they also contain inferences about our preferences. Basically always, it's the preferential domain of validity that we care about. If there are two models that always predict the same behavior from us, and usually agree about our preferences, but have some situations where they utterly disagree about preferences, those situations are the ones outside the domain of validity.

What would ever incentivize a person or AI to leave the domain of validity of our preferences? Imagine you're trying to predict the optimal meal, and you make 10 different models of your preferences about food. If nine of these models think a meal would be a 2/10, and the last model thinks a meal would be a 1,000/10, you'd probably be pretty tempted to try that meal.

Ultimately, what you do depends on how you're aggregating models. Avoiding going outside the domain of validity looks like using an aggregation function that puts more weight on the pessimistic answers than the optimistic ones, or even penalizing positive variance. In the language of meta-preferences, I don't want one way of modeling me to return "super-duper-happy" while other reasonable ways of modeling me return "confused."
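To make the meal example concrete, here is a minimal sketch comparing a plain average with two more cautious aggregation rules. Everything named here is an assumption for illustration: the function names, the quantile, the penalty weight `k`, and the second "safe" meal are invented, and the variance penalty punishes spread in both directions, which is a simplification of "penalizing positive variance."

```python
import statistics

def mean_aggregate(scores):
    """Linear aggregation: treat every model's score as equally credible."""
    return statistics.fmean(scores)

def pessimistic_aggregate(scores, quantile=0.25):
    """Use a low quantile, so a single optimistic outlier cannot dominate."""
    ordered = sorted(scores)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]

def variance_penalized_aggregate(scores, k=1.0):
    """Mean minus k standard deviations: disagreement between models costs value."""
    return statistics.fmean(scores) - k * statistics.pstdev(scores)

# Ten hypothetical models of food preferences rate two candidate meals (out of 10).
weird_meal = [2.0] * 9 + [1000.0]  # nine models say 2/10, one says 1,000/10
safe_meal = [6.0, 7.0, 6.5, 7.0, 6.0, 6.5, 7.0, 6.0, 6.5, 7.0]  # assumed comparison meal

for name, aggregate in [
    ("mean", mean_aggregate),
    ("pessimistic", pessimistic_aggregate),
    ("variance-penalized", variance_penalized_aggregate),
]:
    choice = "weird" if aggregate(weird_meal) > aggregate(safe_meal) else "safe"
    print(f"{name:>18}: prefers the {choice} meal")
```

With these made-up numbers, the plain mean is dragged toward the single enthusiastic model, while both cautious rules stay with the meal that all ten models find acceptable.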

This meta-preference doesn't make sense if you think that there's actually One True way of modeling humans and we just don't know which it is. If our uncertainty about how to model humans was epistemic uncertainty, the right thing to do would be Bayesian updating and linear aggregation. All this talk about domains of validity would be invalid. So it's an important fact that we aren't just searching for the One True model of humans, we're just refining the desiderata by which we rate many possible models.

V - Making sense

It's time to finally do some Goodhart-reducing.

The classic mechanisms of Goodhart's law are about how optimizing a proxy - even one that's close to our True Values in everyday life - can lead to a bad score according to our True Values. This sort of Absolute Goodhart reasoning is convenient to us because most common examples of Goodhart's law involve a simple proxy leading to results that are obviously wrong. Absolute Goodhart poses a problem to any attempt to learn human values, because a value learning AI is just a complicated sort of proxy.

But for real physical humans, there are no unique True Values to compare proxies to. We can only compare models to other models. So to talk about Goodhart's law in a more naturalistic language, we have to make some edits.

It turns out to be pretty easy: just replace "proxy" with "one model" and "True Values" with "other models, especially those we find obvious when doing verbal reasoning." This gives you Relative Goodhart, which is much more useful for building value learning AI. As you can probably guess, I picked the names "Absolute" and "Relative" because in Absolute Goodhart you compare inferred human values to the lodestar of the True Values, while in Relative Goodhart you're just comparing one way of inferring human values to other ways.

In Relative Goodhart, the mechanisms of Goodhart's law are ways that one model of human values can be driven apart from other models. We can illustrate this by going back through Goodhart Taxonomy and translating the arguments:

Extremal Goodhart:
- Absolute Goodhart: When optimizing for some proxy for value, worlds in which that proxy takes an extreme value are probably very different (drawn from a different distribution) than the everyday world in which the relationship between the proxy and true value was inferred, and this big change can magnify any discrepancies between the proxy and the true values.
- Relative Goodhart: When optimizing for one model of human preferences, worlds in which that model takes an extreme value are probably very different than the everyday world from which that model was inferred, and this big change can magnify any discrepancies between similar models that used to agree with each other. Lots of model disagreement often signals to us that the validity of the preferences is breaking down, and we have a meta-preference to avoid this.
- This transformation works very neatly for Extremal Goodhart, so I took the liberty of ordering it first in the list.
Regressional Goodhart:
- Absolute Goodhart: If you select for high value of a proxy, you select not just for signal but also for noise. You'll predictably get a worse outcome than the naive estimate, and if there are some parts of the domain that have more noise without lowering the signal, the maximum value of the proxy is more likely to be there.
- Relative Goodhart: If you select for high value according to one model of humans, you select not just for the component that agrees with the aggregate of other models, but also the component that disagrees. Other models will predictably value your choice less than the model you're optimizing, and if there are some parts of the domain that tend to drive this model's estimates apart from the others' without lowering the average value, the maximum value is more likely to be there^[5]^[6].
Causal Goodhart:
- Absolute Goodhart: If we pick a proxy to optimize that's correlated with True Value but not sufficient to cause it, then there might be appealing ways to intervene on the proxy that don't intervene on what we truly want.
- Relative Goodhart: If we have two modeled preferences that are correlated, but one is actually the causal descendant of the other, then there might be appealing ways to intervene on the descendant preference that don't intervene on the ancestor preference.
  
  There's a related issue when we have modeled preferences that are coarse-grainings or fine-grainings of each other. There can be ways to intervene on the fine-grained model that don't intervene on the coarse-grained model.
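The Regressional case in the list above lends itself to a quick simulation. This is only a toy sketch under strong assumptions that are not in the arguments themselves: each model's score is a shared component plus independent Gaussian "quirk" noise, and the function name, number of models, and noise scale are all invented for illustration.

```python
import random

def simulate(num_options=1000, num_models=5, noise=1.0, seed=0):
    """Each model scores an option as shared value plus its own independent quirk."""
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(num_options)]
    scores = [
        [s + rng.gauss(0.0, noise) for s in shared]
        for _ in range(num_models)
    ]
    # Optimize hard for model 0 alone, ignoring the other models.
    best = max(range(num_options), key=lambda i: scores[0][i])
    optimized_score = scores[0][best]
    others = [scores[m][best] for m in range(1, num_models)]
    return optimized_score, sum(others) / len(others)

optimized, others_mean = simulate()
print(f"optimized model's score: {optimized:.2f}")
print(f"average of other models: {others_mean:.2f}")
```

Because the chosen option was selected partly for model 0's own quirk, the other models tend to rate it lower than model 0 does, which is the Relative version of "you select not just for signal but also for noise."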

These translated Goodhart arguments all make the same change, which replaces failure according to particular True Values with failure according to other reasonable models of our preferences. As Stuart Armstrong put it, Goodhart's law is model splintering for values.

Although this change may seem boring or otiose, I think it's actually a huge opportunity. In the first post I complained that Absolute Goodhart's law didn't admit of solutions. When trying to compare a model to the True Values, we didn't know the True Values. But when comparing models to other models, nothing there is unknowable!

In the next and final post, the plan is to tidy this claim up a bit, see how it applies to various proposals for beating Goodhart's law for value learning, and zoom out to talk about the bigger picture for at least a whole paragraph.

^[1] At least, up to some finite amount of shuffling that's like a choice of prior, or universal Turing machine, or definition of "agent-shaped."

^[2] You may recognize a resemblance to inferring human values.

^[3] That would lead to unpalatable positions like "whatever the human did, that's what they wanted" or "the human wants to follow the laws of physics."

^[4] Comparing preferences across models is currently an open problem. If you take this post's picture of inferring human preferences literally (rather than e.g. imagining we'll be able to train a big neural network that does all this internally), we had better figure out how to translate between ontologies better.

^[5] And as with Extremal, we would rather not go to the part of phase space where the models of us all disagree with each other.

^[6] My addition of the variance-seeking pressure under the umbrella of Regressional Goodhart really highlights the similarities between it and Extremal Goodhart. Both are simplifications of the same overarching math, it's just that in the Regressional case we're doing even more simplification (requiring there to be a noise term with nice properties), allowing for a more specific picture of the optimization process.

I think you're taking the perspective that the task at hand is to form an (external-to-the-human) model of what a human wants and is trying to do, and there are different possible models which tend to agree in-distribution but not OOD.

My suggestion is that for practically everything you say in this post, you can say a closely-analogous thing where you throw out the word "model" and just say "the human has lots of preferences, and those preferences don't always agree with each other, especially OOD". For example, in the trolley problem, I possess an "I don't like killing people" preference, and I also possess an "I like saving lives" preference, and here they've come into conflict. This is basically a "subagent" perspective.

Do you agree?

I bring this up because "I have lots of preferences and sometimes they come into conflict" is a thing that I think about every day, whereas "my preferences can be fit by different models and sometimes those models come into conflict" is slightly weird to me. (Not that there's anything wrong with that.)

My suggestion is that for practically everything you say in this post, you can say a closely-analogous thing where you throw out the word "model" and just say "the human has lots of preferences, and those preferences don't always agree with each other, especially OOD".

Yes, I'm fine with this rephrasing. But I wouldn't write a post using only the "the human has the preferences" way of speaking, because lots of different ways of thinking about the world use that same language.

This is basically a "subagent" perspective.

I think this post is pretty different from how people typically describe humans in terms of subagents, but it does contain that description.

Any physical system can have multiple descriptions of it, it doesn't have to act like it's made of subagents. (By "act like it's made of subagents," I include some things people do, like psych themselves up, or reward themselves for doing chores, or try to hide objects of temptation from themselves.) You can have several different models of a thermostat, for instance. Reconciling the different models of a thermostat might look a bit like bargaining between subagents, but if so these are atrophied male anglerfish subagents; they don't model each other and bargain on their own behalf, they are just dumb inputs in a bigger, smarter process.

If we make a bunch of partial models of a human, some of these models are going to look like subagents, or drive subagenty behavior. But a lot of other ones are going to look like simple patterns, or bigger models that contain the subagent bargaining within themselves and hold aggregated preferences, or psychological models that are pretty complicated and interesting but don't have anything to do with subagenty behavior.

And maybe a value learning AI would capture human subagenty behavior, not only in the models that contain subagent interactions as parts of themselves, but in the learned meta-preferences that determine how different models that we'd think of as human subagents get aggregated into one big story about what's good. Such an AI might help humans psych themselves up, or reward them for doing chores.

But I'd bet that most of the preference aggregation work would look about as subagenty as aggregating the different models of a thermostat. In the trolley problem my "save people" and "don't kill people" preferences don't seem subagenty at all - I'm not about to work out some internal bargain where I push the lever in one direction for a while in exchange for pushing it the other way the rest of the time, for instance.

In short, even though I agree that in a vacuum you could call each model a "subagent," what people normally think of when they hear that word is about a couple dozen entities, mostly distinct. And what's going on in the picture I'm promoting here is more like 10^4 entities, mostly overlapping.

Hmm. I think you missed my point…

There are two different activities:

ACTIVITY A: Think about how an AI will form a model of what a human wants and is trying to do.

ACTIVITY B: Think about the gears underlying human intelligence and motivation.

You're doing Activity A every day. I'm doing Activity B every day.

My comment was trying to say: "The people like you, doing Activity A, may talk about there being multiple models which tend to agree in-distribution but not OOD. Meanwhile, the people like me, doing Activity B, may talk about subagents. There's a conceptual parallel between these two different discussions."

And I think you thought I was saying: "We both agree that the real ultimate goal right now is Activity A. I'm leaving a comment that I think will help you engage in Activity A, because Activity A is the thing to do. And my comment is: (something about humans having subagents)."

Does that help?

This was a whole 2 weeks ago, so all I can say for sure was that I was at least ambiguous about your point.

But I feel like I kind of gave a reply anyway - I don't think the parallel with subagents is very deep. But there's a very strong parallel (or maybe not even a parallel, maybe this is just the thing I'm talking about) with generative modeling.

I really like this perspective. There may be no way to find a human's One True Value Function (to say nothing of humanity's), not only because humans are complicated to model, but also because there is probably no such thing as a human's One True Value Function (even less so for humanity as a whole). Similar to what you said, it could very well just be heuristics all the way down, heuristics both in what is valued (preferences) and how different heuristic values compete (meta-preferences). Natural selection has fine-tuned both levels to get something that works well enough for survival and reproduction of the species in the domain of validity of humans' ancestral environment, while each individual human could fine-tune their preferences and meta-preferences based on whatever leads to the greatest perceived harmony among them within the domain of validity of lived personal experience.

In AI, the concept of multiple competing value functions could be realized through ensemble models. Each sub-model within the ensemble learns a value function independently. If each sub-model receives slightly different input or starts with different random initialization to its weights, then they will each learn slightly different value functions. Then you can use the ensemble variance in predicted value (or precision = 1/variance) to determine the domain of validity. Those regions of state space where all sub-models in the ensemble pretty much agree on value (low variance / high precision) are "safe" to explore, while those regions with large disagreements in predicted value (high variance / low precision) are "unsafe". Of course a creativity or curiosity drive could motivate the system to push the frontier of the safe region, but there would always have to come a point where the potential value of exploring further is overcome by the risk, which I guess falls under the umbrella of "meta-preference".
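A minimal sketch of the ensemble idea, under heavy assumptions: the "value function" is a one-dimensional straight line, each sub-model is a least-squares fit on a bootstrap resample, and the function names, the data, and the `threshold` for calling a state "safe" are all hypothetical choices made for illustration.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def train_ensemble(data, size=20, seed=0):
    """Each sub-model sees a different bootstrap resample of the same data."""
    rng = random.Random(seed)
    return [fit_line(rng.choices(data, k=len(data))) for _ in range(size)]

def value_spread(ensemble, x):
    """Standard deviation of predicted value across the ensemble at state x."""
    return statistics.pstdev([a + b * x for a, b in ensemble])

# Noisy observations of value, all gathered from states in the range [0, 1].
data_rng = random.Random(1)
data = [(i / 49, 2.0 * (i / 49) + data_rng.gauss(0.0, 0.3)) for i in range(50)]
ensemble = train_ensemble(data)

def is_safe(x, threshold=0.2):
    """Call a state 'safe' when the sub-models roughly agree about its value."""
    return value_spread(ensemble, x) < threshold

for x in (0.5, 10.0):
    print(f"x={x:>4}: spread={value_spread(ensemble, x):.3f} safe={is_safe(x)}")
```

Inside the range the data came from, the sub-models agree closely; far outside it, small differences in fitted slope turn into large disagreements about value, which is the signal being proposed as a boundary for the domain of validity.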

I have had the idea that the discount factor used in decision theory and RL could be based on the precision of predictions rather than on some constant gamma factor raised to the power of the number of time steps. That way, plans with high expected value but low precision (high ensemble variance) might be weighted the same as plans with lower expected value but higher precision (lower ensemble variance). This would hopefully prevent the AI from pursuing dangerous plans that fall far outside of the trusted region of state space while steering toward plans with long-term stable positive outcomes and away from plans with long-term stable negative outcomes.
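One way this could look, as a rough sketch only: the weighting `1 / (1 + variance)` is an assumed stand-in for precision (raw `1 / variance` blows up as variance approaches zero), and the two plans and their numbers are invented for illustration.

```python
def constant_discount_return(rewards, gamma=0.9):
    """Standard discounting: step t is weighted by gamma ** t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def precision_weighted_return(rewards, variances):
    """Weight each step by 1 / (1 + variance), an assumed precision-like factor."""
    return sum(r / (1.0 + v) for r, v in zip(rewards, variances))

# Two hypothetical three-step plans: predicted reward and ensemble variance per step.
bold_rewards, bold_variances = [1.0, 5.0, 20.0], [0.1, 4.0, 50.0]
steady_rewards, steady_variances = [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]

print("constant gamma:",
      constant_discount_return(bold_rewards),
      constant_discount_return(steady_rewards))
print("precision-weighted:",
      precision_weighted_return(bold_rewards, bold_variances),
      precision_weighted_return(steady_rewards, steady_variances))
```

With these numbers, constant-gamma discounting favors the bold plan, while the precision-based weighting favors the steady one, because the bold plan's large late rewards are exactly the ones the ensemble is least sure about.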
