Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In my last post (Gricean communication and meta-preferences), I talked about the infinite tower of meta-preferences that in principle could be applied by humans. Example meta-preference: "I'd rather that I didn't act like I want to watch a bunch of random youtube videos - that's not a 'real' preference, it's something I should change." And then you can have meta-meta-feelings about this meta-preference, and off you go.

This infinite tower can't fit into reality, so it has to be dealt with somehow if we want to include the effects of meta-preferences in the process of value learning. Now, it's possible to deny this premise, model humans as actually only being able to compute a finite number of levels, and imitate that finite number of levels perfectly. But if we want to avoid the reductio that value learning should aim for perfect human imitation, I think there are two different camps here.

The generator camp treats each level of meta-preferences as an important generator of the level below. The infinite regress is avoided by using the fact that all levels of the tower are computed by the same mind, so incorporating meta-ethics is about approximating a fixed point of this self-ratification process. Preferably finding a fixed point that makes judgements sort of like humans' in easy cases. I think one can make a very good case for this, and that it's sort of hinted at in IDA approaches.

But my preferred approach is to treat the higher meta-levels as patches to the object level. That is, finding some approximation scheme where the meta-preferences are corrections that rapidly converge. Unlike the generator framework, different levels in the patch framework can be different parts of a heterogeneous values model, rather than being different questions posed to a single agent model.

One way of thinking about patches

In brainstorming for this post, I went through a bunch of patch ideas that won't work (happy to talk about those if prompted). But here's a way of thinking by making the analogy to semi-supervised learning (similar to the ideas used in "What's the dream for giving natural language commands to AI?") that might be productive:

To say it too-simply, imagine just asking the AI to include meta-preferences. If we "ask" (in quotes because we might be talking about prompting with some large amount of examples or other methods of definition) the AI for the representation of "what humans want," we trigger a different representation than if we ask for "what humans want, if we knew more, thought faster, were more the people we wished we were, had grown up farther together, etc."

We have to be a little careful about what kind of thing the AI is actually doing here. Activating features is fundamentally about pattern-matching. "What humans want" would just be some representative, useful feature that does a good job helping the AI predict the prompts. Where humans are inconsistent or vague, we might want the AI to average things out by representativeness or usefulness as a feature. This would mean that minor inconsistencies would have only minor contributions, but along the axes where there's strong conflict, the sum might cancel out.

Now, what would it do with "what humans want, if we knew more, thought faster, were more the people we wished we were, etc."? The exact same thing! It would not simulate any actual self-improvement, instead it would find some feature that's representative of self-improvement - it would change its representation of human values in a "self-improvementy" sort of way.

You can simulate this exact same cognitive pattern by asking yourself "what would someone do if they were a better person?" You didn't answer this question by simulating them, or even trying to mirror some process of self-reflection. Instead, you have some pre-computed ideas about what that sort of change looks like. These might be abstracted from actual observations of self-reflection, but they're also going to incorporate more abstract ideas that we might not have learned from experience.

There's a bunch of issues with this - it runs straight into the alien concepts problem, and we might want to more explicitly simulate the actual dynamics of self-reflection. But for now, I'm finding it a useful way to think about "meta-preferences as patch."

New Comment