Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I proposed a way around Goodhart's curse. Essentially, this reduces to properly accounting for all of our uncertainty about our values, including some meta-uncertainty about whether we've properly accounted for all our uncertainty.

Wei Dai had some questions about the approach, pointing out that it seemed to have a problem similar to corrigibility's: once the AI has resolved all uncertainty about our values, there's nothing left. I responded by talking about fuzziness rather than uncertainty.

Resolving ambiguity, sharply or fuzzily

We have a human who hasn't yet dedicated any real thought to population ethics. We run a hundred "reasonable" simulations where we introduce this human to population ethics, varying the presentation a bit, and ultimately ask for their opinion.

In 45 of these runs, they endorsed total utilitarianism, in 15 of them, they endorsed average utilitarianism, and in 40 of them, they endorsed some compromise system (say the one I suggested here).

That's it. There is no more uncertainty; we know everything there is to know about this human's potential opinions on population ethics. What we do with this information - how we define the human's "actual" opinion - is up to us (neglecting, for the moment, the issue of the human's meta-preferences, which likely suffer from a similar type of ambiguity).

We could round these preferences to "total utilitarianism". That would be the sharpest option.

We could normalise those three utility functions, then add them with the 45-15-40 relative weights.

Or we could do a similar normalisation, but, mindful of fragility of value, we could either move the major options to equal weights 1-1-1, or stick with 45-15-40 but use some smooth minimum on the combination. These would be the more fuzzy choices.
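A minimal sketch of these four options, in Python. The outcome names, the raw utility numbers, the min-max normalisation and the particular soft-minimum formula are all assumptions made up for illustration; the post does not specify any of them. Only the 45-15-40 counts come from the example above.

```python
import math

# Hypothetical outcomes and raw utilities, invented purely for illustration.
RAW_UTILITIES = {
    "total": {"small_happy": 10.0, "large_ok": 100.0, "medium_mixed": 60.0},
    "average": {"small_happy": 9.0, "large_ok": 2.0, "medium_mixed": 5.0},
    "compromise": {"small_happy": 4.0, "large_ok": 6.0, "medium_mixed": 8.0},
}
RUN_COUNTS = {"total": 45, "average": 15, "compromise": 40}  # the 100 runs
OUTCOMES = ["small_happy", "large_ok", "medium_mixed"]


def normalise(utility):
    """Min-max rescale to [0, 1]; one normalisation among many possible ones."""
    lo, hi = min(utility.values()), max(utility.values())
    return {o: (u - lo) / (hi - lo) for o, u in utility.items()}


def weighted_sum(utilities, weights):
    """Linear mix of utility functions, with weights rescaled to sum to 1."""
    total = sum(weights.values())
    return {o: sum(weights[t] / total * utilities[t][o] for t in utilities)
            for o in OUTCOMES}


def smooth_min(utilities, weights, temperature=0.1):
    """Weighted soft minimum: -T * log(sum_t w_t * exp(-u_t / T))."""
    total = sum(weights.values())
    return {o: -temperature * math.log(sum(
                weights[t] / total * math.exp(-utilities[t][o] / temperature)
                for t in utilities))
            for o in OUTCOMES}


def best(utility):
    """Return the outcome with the highest utility."""
    return max(utility, key=utility.get)


NORMALISED = {t: normalise(u) for t, u in RAW_UTILITIES.items()}

sharp = NORMALISED["total"]                                   # round to the plurality view
mixed = weighted_sum(NORMALISED, RUN_COUNTS)                  # 45-15-40 weights
equal = weighted_sum(NORMALISED, {t: 1 for t in NORMALISED})  # 1-1-1 weights
fuzzy = smooth_min(NORMALISED, RUN_COUNTS)                    # 45-15-40, smooth minimum

for name, u in [("sharp", sharp), ("mixed", mixed), ("equal", equal), ("fuzzy", fuzzy)]:
    print(name, best(u), {o: round(v, 3) for o, v in u.items()})
```

With these invented numbers, the sharp rounding favours a different outcome from the three mixed options. The smooth minimum never scores an outcome above the plain 45-15-40 mix, and it penalises outcomes that any one theory rates badly, which is the sense in which it is mindful of fragility of value.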

All of these options are valid, given that we haven't defined any way of resolving ambiguous situations like that. And note that fuzziness looks a lot like uncertainty, in that a high-fuzziness mix looks like the utility function you'd have if you were very uncertain. But, unlike uncertainty, knowing more information doesn't "resolve" this fuzziness. That's why Jessica's critique of corrigibility doesn't apply to this situation.

(And note also that we could introduce fuzziness for different reasons - we could believe that this is a genuinely good way of resolving competing values, or it could be to cover uncertainty that would be too dangerous to have the AI resolve, or we could introduce it to avoid potential Goodhart problems, without believing that the fuzziness is "real".)

Fuzziness and choices in extrapolating concepts

The picture where we have 45-15-40 weights on well-defined moral theories is not a realistic starting point for establishing human values. We humans start mainly with partial preferences, or just lists of examples of correct and incorrect behaviours in a narrow span of circumstances.

Extrapolating from these examples to a weighting on moral theories is a process that is entirely under human control. We decide how to do so, thus incorporating our meta-preference implicitly in the process and its outcome.

Extrapolating dogs and cats and other things

Consider the supervised learning task of separating photos of dogs from photos of non-dogs. We hand the neural net a bunch of labelled photos, and tell it to go to work. It now has to draw a conceptual boundary around "dog".

What is the AI's concept of "dog" ultimately grounded on? It's obviously not just on the specific photos we handed it - that way lies overfitting and madness.

But nor can we generate every possible set of pixels and have a human label them as dog or non-dog. Take for example the following image:

That, apparently, is a cat, but I've checked with people at the FHI and we consistently mis-identified it. However, a sufficiently smart AI might be able to detect some implicit cat-like features that aren't salient to us, and correctly label it as non-dog.

Thus, in order to correctly identify the term "dog", defined by human labelling, the AI has to disagree with... human labelling. There are more egregious non-dogs that could get labelled as "dogs", such as a photo of a close friend with a sign that says "Help! they'll let me go if you label this image as a dog".

Human choices in image recognition boundaries

When we program a neural net to classify dogs, we make a lot of choices - the size of the neural net, activation functions and other hyper-parameters, the size and contents of the training, test, and validation sets, whether to tweak the network after the first run, whether to publish the results or bury them, and so on.

Some of these choices can be seen as exactly the "fuzziness" which I defined above - some options determine whether the boundary is drawn tightly or loosely around the examples of "dog", and whether ambiguous options are pushed to one category or allowed to remain ambiguous. But some of these choices - such as methods for avoiding sampling biases, or for handling adversarial examples like a panda being classified as a gibbon - are much more complicated than just "sharp versus fuzzy". I'll call these choices "extrapolation choices", as they determine how the AI extrapolates from the examples we have given it.
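The "sharp versus fuzzy" part of these choices can be sketched as two knobs applied to a classifier's output. This is only an illustration under stated assumptions: the scores, the image names, and the `threshold` and `ambiguity_margin` parameters are hypothetical, and a real system makes these choices in far less explicit ways.

```python
def label(score, threshold=0.5, ambiguity_margin=0.0):
    """Map a classifier score in [0, 1] to a label.

    threshold: a higher value draws the boundary more tightly around "dog".
    ambiguity_margin: 0 forces every image into a category (sharp);
        a larger value leaves borderline images unresolved (fuzzy).
    """
    if score >= threshold + ambiguity_margin:
        return "dog"
    if score < threshold - ambiguity_margin:
        return "non-dog"
    return "ambiguous"


# Hypothetical classifier scores (probability of "dog"), invented for illustration.
SCORES = {"labrador": 0.97, "fox": 0.55, "odd_cat_photo": 0.48, "teapot": 0.02}


def label_all(**choices):
    """Apply one set of boundary choices to every image."""
    return {name: label(score, **choices) for name, score in SCORES.items()}


loose_sharp = label_all(threshold=0.5)                  # loose boundary, no ambiguity
tight_sharp = label_all(threshold=0.9)                  # tight boundary, no ambiguity
fuzzy = label_all(threshold=0.5, ambiguity_margin=0.1)  # borderline cases stay open

for name, result in [("loose", loose_sharp), ("tight", tight_sharp), ("fuzzy", fuzzy)]:
    print(name, result)
```

The same scores produce three different classifications, and nothing in the data says which of the three is correct. The more complicated extrapolation choices, such as handling sampling bias, are not captured by knobs like these.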

Human choices in preference recognition boundaries

The same will apply to AIs estimating human preferences. So we have three types of things here:

  • Uncertainty: this is when the AI is ignorant about something in the world. Can be resolved by further knowledge.
  • Fuzziness: this is how the AI resolves ambiguity between preference-relevant categories. It can look like uncertainty, but is actually an extrapolation choice, and can't be resolved by further knowledge.
  • Extrapolation desiderata: extrapolation choices are what need to be made to construct a full classification or preference function from underdefined examples. Extrapolation desiderata are the formal and informal properties that we would want these extrapolation choices to have.
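The difference between the first two items can be sketched in a few lines of Python. Everything here is an assumption for illustration: the likelihoods, the per-theory values and the function names are hypothetical, and the 45-15-40 weights are simply reused from the earlier example.

```python
def bayes_update(prior, p_obs_if_true, p_obs_if_false):
    """Posterior probability of an empirical claim after one observation."""
    numerator = prior * p_obs_if_true
    return numerator / (numerator + (1 - prior) * p_obs_if_false)


# Uncertainty: a belief about the world, sharpened by evidence.
belief = 0.5
for _ in range(10):  # ten observations, each favouring the claim 4:1
    belief = bayes_update(belief, 0.8, 0.2)

# Fuzziness: a chosen resolution of ambiguity. No observation updates it.
FUZZY_WEIGHTS = {"total": 0.45, "average": 0.15, "compromise": 0.40}


def fuzzy_value(values_by_theory, weights=FUZZY_WEIGHTS):
    """Value of an action under the fixed mix of theories."""
    return sum(weights[t] * values_by_theory[t] for t in weights)


def expected_fuzzy_value(belief, values_if_true, values_if_false):
    """Average the fuzzy value over the remaining empirical uncertainty."""
    return (belief * fuzzy_value(values_if_true)
            + (1 - belief) * fuzzy_value(values_if_false))


# Hypothetical per-theory values of one action in the two possible worlds.
VALUES_IF_TRUE = {"total": 1.0, "average": 0.0, "compromise": 0.5}
VALUES_IF_FALSE = {"total": 0.0, "average": 1.0, "compromise": 0.5}

print(round(belief, 6))
print(expected_fuzzy_value(belief, VALUES_IF_TRUE, VALUES_IF_FALSE))
print(expected_fuzzy_value(1.0, VALUES_IF_TRUE, VALUES_IF_FALSE))
```

As evidence accumulates, the belief moves towards certainty and the empirical averaging drops out. The mix over theories does not: even with full certainty about the world, the value is still a 45-15-40 combination.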

So when I wrote that to avoid Goodhart problems "The important thing is to correctly model my uncertainty and overconfidence.", I can now refine that into:

  • The important thing is to correctly model my fuzziness, and my extrapolation desiderata.

Neat and elegant! However, to make it more applicable, I unfortunately need to extend it in a less elegant fashion:

  • The important thing is to correctly model my fuzziness, and my extrapolation desiderata, including any meta-desiderata I might have for how to model this correctly (and any errors I might be making, that I would desire to have recognised as errors).

Note that there is no longer any deep need to model "my" uncertainty. It is still important to model uncertainty about the real world correctly, and if I'm mistaken about the real world, this may be relevant to what I believe my extrapolation desiderata are. But modelling my uncertainty is merely instrumentally useful, whereas modelling my fuzziness is a terminal goal if we want to get it right.

As a minor example of the challenge of the above, consider that such a model would have needed to detect that adversarial examples were problematic before anyone had conceived of the idea.

I won't develop this too much more here, as the ideas will be included in my research agenda, whose first draft should be published here soon.

8 comments

I missed the proposal when it was first released, but I wanted to note that the original proposal addresses only one (critical) class of Goodhart-error, and proposes a strategy based on addressing one problematic result of that, nearest-unblocked neighbor. The strategy is more widely useful for misspecification than just nearest-unblocked neighbor, but it still addresses only some Goodhart effects.

The misspecification discussed is more closely related to, but still distinct from, extremal and regressional Goodhart. (Causal and adversarial Goodhart are somewhat far removed, and don't seem as relevant to me here. Causal Goodhart is due to mistakes, albeit fundamentally hard to avoid mistakes, while adversarial Goodhart happens via exploiting other modes of failure.)

I notice I am confused about how different strategies being proposed to mitigate these related failures can coexist if each is implemented separately, and/or how they would be balanced if implemented together, as I briefly outline below. Reconciling or balancing these different strategies seems like an important question, but I want to wait to see the full research agenda before commenting or questioning further.

Explaining the conflict I see between the strategies:

Extremal Goodhart is somewhat addressed by another post you made, which proposes to avoid ambiguous distant situations. It seems that the strategy proposed here is to attempt to resolve fuzziness, rather than avoid areas where it becomes critical. These seem to be at least somewhat at odds, though this is partly reconcilable by pursuing neither strategy fully: neither fully resolving ambiguity nor fully avoiding distant ambiguity.

Regressional Goodhart, as Scott G. originally pointed out, is unavoidable except by staying in-sample, interpolating rather than extrapolating. Fully pursuing that strategy is precluded by injecting uncertainty into the model of the human-provided modification to the utility function. Again, this is partly reconcilable, for example by trying to bound how far we let the system stray from the initially provided blocked strategy, and how much fuzziness it is allowed to infer without an external check.

I think it's better not to let jargon proliferate unnecessarily, and your use of the term "fuzziness" seems rather, well, fuzzy. Is it possible that the content of this post could be communicated using existing jargon such as "moral uncertainty"?

Actually, I assumed fuzzy was intended here to be a precise term, contrasted with probability and uncertainty, as it is used in describing fuzzy sets versus uncertainty about set membership.

I'm not sure it maps exactly onto fuzzy sets the way I described it, but it does feel related to that area of research.

It's not exactly the same, but I would argue that the issues with "Dog" versus "Cat" for the picture are best captured with that formalism - the boundaries between categories are not strict.

To be more technical, there are a couple of locations where fuzziness can exist. First, the mapping in reality is potentially fuzzy since someone could, in theory, bio-engineer a kuppy or cat-dog. These would be partly members of the cat set, and partly members of the dog set, perhaps in proportion to the genetic resemblance to each of the parent categories.

Second, the process that leads to the picture, involving a camera and a physical item in space, is a mapping from reality to an image. That is, reality may have a sharp boundary between dogs and cats, but the space of possible pictures of a given resolution is far smaller than the space of physical configurations that can be photographed, so the mapping from reality->pictures is many-to-one, creating a different irresolvable fuzziness - perhaps 70% of the plausible configurations that lead to this set of pixels are cats, and 30% are dogs, so the picture has a fuzzy set membership.

Lastly, there is mental fuzziness, which usually captures the other two implicitly, but has the additional fuzziness created because the categories were made for man, not man for the categories. That is, the categories themselves may not map to reality coherently. This is different from the first issue, where "sharp" genetic boundaries like that between dogs and cats do map to reality correctly, but items can be made to sit on the line. This third issue is that the category may not map coherently to any actual distinction, or may be fundamentally ambiguous, as Scott's post details for "Man vs. Woman" or "Planet vs. Planetoid" - items can partly match one or more than one category, and be fuzzy members of the set.

Each of these, it seems, can be captured fairly well as fuzzy sets, which is why I'm proposing that your usage has a high degree of membership in the fuzzy set of things that can be represented by fuzzy sets.

I agree with all this.

Nice post. I suspect you'll still have to keep emphasizing that fuzziness can't play the role of uncertainty in a human-modeling scheme (like CIRL), and is instead a way of resolving human behavior into a utility function framework. Assuming I read you correctly.

I think that there are some unspoken commitments that the framework of fuzziness makes for how to handle extrapolating irrational human behavior. If you represent fuzziness as a weighting over utility functions that gets aggregated linearly (i.e. into another utility function), this is useful for the AI making decisions but can't be the same thing that you're using to model human behavior, because humans are going to take actions that shouldn't be modeled as utility maximization.

To bridge this gap from human behavior to utility function, what I'm interpreting you as implying is that you should represent human behavior in terms of a patchwork of utility functions. In the post you talk about frequencies in a simulation, where small perturbations might lead a human to care about the total or about the average. Rather than the AI creating a context-dependent model of the human, we've somehow taught it (this part might be non-obvious) that these small perturbations don't matter, and should be "fuzzed over" to get a utility function that's a weighted combination of the ones exhibited by the human.

But we could also imagine unrolling this as a frequency over time, where an irrational human sometimes takes the action that's best for the total and other times takes the action that's best for the average. Should a fuzzy-values AI represent this as the human acting according to different utility functions at different times, and then fuzzing over those utility functions to decide what is best?

I'm not basing this on behaviour (because that doesn't work, see: ), but on partial models.