Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Written quickly after a CHAI meeting on the topic; I haven't thought through it in depth.

If we write down an explicit utility function and have an AI optimize that, we expect that a superintelligent AI would end up doing something catastrophic, not because it misunderstands what humans want, but because it doesn't care -- it is trying to optimize the function that was written down. It is doing what we said instead of doing what we meant.

An approach like Inverse Reward Design instead says that we should take the human's written-down utility function as an observation about the true reward function, and infer a distribution over true reward functions. This agent is "doing what we mean" instead of doing what we said.
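To make the "observation about the true reward function" idea concrete, here is a toy sketch. It is not the algorithm from the Inverse Reward Design paper, only a simplified Bayesian update in its spirit. The feature names, candidate rewards, trajectories, and the `beta` rationality parameter are all assumptions invented for illustration. The assumed likelihood says: a proxy reward is a plausible thing to have written down if the behavior it produces in the training environment scores well under the true reward, compared with the behavior other proxies would have produced.

```python
import math

# Hypothetical training environment: each trajectory is summarized by a
# feature vector (grass_steps, dirt_steps, lava_steps). Lava never appears.
TRAIN_TRAJECTORIES = [
    (4.0, 0.0, 0.0),  # stays on grass
    (1.0, 3.0, 0.0),  # takes a shortcut over dirt
    (2.0, 2.0, 0.0),  # a mix of both
]

# Hypothetical candidate reward weights. They serve both as hypotheses about
# the true reward and as the proxies the designer could have written down.
CANDIDATES = {
    "lava_neutral": (1.0, -1.0, 0.0),
    "lava_bad": (1.0, -1.0, -10.0),
    "lava_good": (1.0, -1.0, 10.0),
    "likes_dirt": (-1.0, 1.0, 0.0),
}


def dot(w, phi):
    """Linear reward: weights times features."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def best_trajectory(w, trajectories):
    """The trajectory an agent optimizing reward w would pick."""
    return max(trajectories, key=lambda phi: dot(w, phi))


def posterior_over_true_reward(proxy_name, beta=1.0):
    """Posterior over CANDIDATES given the written-down proxy (uniform prior)."""
    proxy_traj = best_trajectory(CANDIDATES[proxy_name], TRAIN_TRAJECTORIES)
    unnormalized = {}
    for name, w_true in CANDIDATES.items():
        # How good is the proxy-induced behavior under this hypothesis...
        numer = math.exp(beta * dot(w_true, proxy_traj))
        # ...relative to the behavior every alternative proxy would induce?
        denom = sum(
            math.exp(beta * dot(w_true, best_trajectory(w_alt, TRAIN_TRAJECTORIES)))
            for w_alt in CANDIDATES.values()
        )
        unnormalized[name] = numer / denom
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


posterior = posterior_over_true_reward("lava_neutral")
```

In this toy setup the written-down proxy says nothing about lava, and lava never showed up in training, so the posterior stays evenly split between "lava is bad", "lava is good", and "lava is neutral", while the hypothesis that contradicts the observed training behavior gets almost no mass. The agent treats what we said as evidence rather than as the objective.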

This suggests a potential definition -- in a "do what we mean" system, the thing that is being optimized is a latent variable, whereas in a "do what we say" system, it is explicitly specified. Note that "latent" need not mean that you have a probability distribution over it; it just needs to be hidden information. For example, if I had to categorize iterated distillation and amplification, it would be as a "do what we mean" system where the thing being optimized is implicit in the policy of the human and is never made fully certain.

However, this doesn't imply that we want to build a system that exclusively does what we mean. For example, with IRD, if the true reward function is not in the space of reward functions that we consider (perhaps because it depends on a feature that we didn't have), you can get arbitrarily bad outcomes (see the problem of fully updated deference). One idea would be to have a "do what we mean" core, which we expect will usually do good things, but have a "do what we say" subsystem that adds an extra layer of safety. For example, even if the "do what we mean" part is completely sure about the human utility function and knows we are making a mistake, the AI will still shut down if we ask it to because of the "do what we say" part. This seems to be the idea in MIRI's version of corrigibility.
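The layered design in the paragraph above can be sketched in a few lines. This is only an illustration of the architecture, under the assumption that the "do what we mean" core is an expected-reward maximizer over a posterior. The class name, action names, and reward numbers are hypothetical, and the sketch says nothing about whether a capable optimizer would leave such an override intact.

```python
from dataclasses import dataclass

SHUTDOWN = "shutdown"


@dataclass
class LayeredAgent:
    # Hypothetical posterior over reward hypotheses: name -> probability.
    posterior: dict
    # Reward of each action under each hypothesis: name -> {action: reward}.
    rewards: dict

    def mean_action(self):
        """'Do what we mean' core: maximize expected reward under the posterior."""
        actions = next(iter(self.rewards.values())).keys()

        def expected(action):
            return sum(p * self.rewards[h][action] for h, p in self.posterior.items())

        return max(actions, key=expected)

    def act(self, human_command=None):
        # 'Do what we say' layer: an explicit shutdown command is obeyed
        # unconditionally, no matter how confident the core is.
        if human_command == SHUTDOWN:
            return SHUTDOWN
        return self.mean_action()


# The core is completely sure that continuing is better than shutting down.
agent = LayeredAgent(
    posterior={"h": 1.0},
    rewards={"h": {"continue": 10.0, SHUTDOWN: 0.0}},
)
```

Here `agent.act()` returns `"continue"`, but `agent.act("shutdown")` returns `"shutdown"` even though the core assigns probability 1 to a hypothesis under which shutting down is worse.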

I'd be interested to see disagreements with the definition of "do what we mean" as optimizing a latent variable. I'd also be interested to hear how "corrigibility" and "alignment" relate to these concepts, if at all. For example, it seems like MIRI's corrigibility is closer to "do what we say" while Paul's corrigibility is closer to "do what we mean".

I liked this post when it came out, and I like it even more now. This also brings to mind Paul's more recent Inaccessible Information.

Thanks! I like it less now, but I suppose that's to be expected (I expect I publish posts when I'm most confident in the ideas in them).

I do think it's aged better than my other (non-public) writing at the time, so past-me was calibrated on which of my thoughts were good, at least according to current-me?

The main way in which my thinking differs is that I'm less optimistic about defining things in terms of what "optimizing" is happening -- it seems like such a definition would be too vague / fuzzy / filled with edge cases to be useful for AI alignment. I do think that the definition could be used to construct formal models that can be analyzed (as had already been done in assistance games / CIRL or the off switch game).

The definition is also flawed; it clearly can't be just about optimizing a latent variable, since that's true of any POMDP. What the agent ends up optimizing for depends entirely on how the latent variable connects to the agent's observations; this clearly isn't enough for do what we mean. I think the better version is Stuart's three principles in Human Compatible (summary).

There's another layer of uncertainty here. For sufficiently general and long-term preferences, it's not clear that "do what we mean" is sufficient either. None of us knows what we want, so what we mean isn't even very good evidence of what we want.

"do what I would want to mean" is closer, but figuring out the counterfactuals for "would" that preserve "I" is not easy.

Agreed. Humans don't really have utility functions. We might try to get around this by having the AI learn how humans would like to be interpreted as having a utility function, and how they would like that to be interpreted, and so on in an infinite tower of reflection, but that doesn't seem very practical or desirable.

I think there was an old Wei Dai post on "artificial philosophy" that was about this problem? The idea is we want the AI to collapse this infinite tower by learning the philosophical considerations that generate it, then use that knowledge to learn its preferences from humans.

Just don't ask your AI system to optimize for general and long-term preferences without a way for you to say "actually, stop, I changed my mind".

Like, if someone tells me that they want me to protect nature, I know that in effect they mean "Take actions to protect nature right now, but don't do anything super drastic that would conflict with other things I care about, and if I change my mind in the future, defer to that change, etc." I think a good "do what you mean" system would capture all of that. This isn't implied by my definition of course, but I think that a system where the specification is latent and uncertain could have this property.

Just don't ask your AI system to optimize for general and long-term preferences without a way for you to say "actually, stop, I changed my mind".

I believe that reduces to "solve the Friendly AI problem".

(Pedantic note: the right way to say that is "the Friendly AI problem reduces to that".)

I'm replying to the quote from the first comment:

For sufficiently general and long-term preferences, it's not clear that "do what we mean" is sufficient either. None of us knows what we want, so what we mean isn't even very good evidence of what we want.

What I'm trying to say is that once you have a "do what we mean" system, then don't explicitly ask your AI system to optimize for general and long-term preferences without a way for you to say "actually, stop, I changed my mind".

I claim that the hard part there is in building a "do what we mean" system, not in the "don't explicitly ask for a bad thing" part.

Perhaps "do what we say" is more like "know when the outside view says you've converged to the wrong value function, so we're probably right and you should listen to us".

It's somewhat more subtle than that. The ideal (and maybe impossible) corrigible AI should protect us even if we accidentally give the AI the wrong process for figuring out what to value. It should protect us even if the AI becomes omniscient.

If the AI knows vastly more than we do, there's no sense in which we are providing extra evidence or an information-carrying "outside view". We are instead just registering a sort of complaint and hoping we've programmed the AI to listen.

I'm still not convinced that such a sort of corrigibility is in any way distinct from some extra complications in the process we give the AI for figuring out what to value.

The outside view I had in mind wasn't with respect to its knowledge, but to empirical data on how often its exact value-learning algorithm converges to the correct set of preferences for agents-like-us. That feels different.

I assume you're talking about the particular "do what we say" subsystem described in the second-to-last paragraph? If so, that seems plausibly right.

What we say: "Follow the recipe"

What we mean: "Make tasty, edible food with the ingredients provided, after verifying they are what they're supposed to be, etc."

I think this is related, although it's about getting the AI to ask humans questions about what to value.

This approach ignores choice. Having a utility function is not enough to make a choice, and what I say is an act of making a choice.

For example, I have a hidden value function (apples = 0.5 and oranges = 0.5). I ask my home robot to bring me an apple. In that moment I have made a choice between equally preferable options.

But my home robot would ignore my choice and bring me half an apple and half an orange, because this was my value function before making the choice.

In that case, I will not be satisfied, because I will feel that the robot ignores the moral effort I put into making a choice, and I value my choices. Also, after making the choice my preferences will be updated, so the robot has to decide which of my utility functions to use: the one before the choice or the one after.

(I don't think humans have consistent utility functions; we're broken that way. If we did...)

The robot should know your utility function(s) well enough to know that you'd choose apple this time, and orange at some future time.