Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Humans are just bundles of atoms, and so for them to reflect on themselves and their goals, it's plausible that there's some interesting naturalized ethics going on. This informs how we'd think about an AI intended to do human value learning. Let's frame this as a dialogue, so my hypothetical human can explain what they're thinking.

Human: I don't actually have goals in the precise mathematical sense. What we call my 'goals' are good yet imperfect patterns that I or others can use to describe my thoughts and behavior.

Foil: I don't see how an AI is supposed to learn your goals then. Can you at least specify exactly what pattern you mean by your goals?

Human: I don't think so - my thoughts and behavior under-determine those patterns. If we just try to model my thoughts and behavior a priori, what we probably get is some kind of fairly accurate simulation of me that doesn't have any neatly-labeled parts called 'goals.'

Foil: Can't you just tell the AI to model you in terms of goals?

Human: Even if we add a prior that tells the AI to model me in terms of goals, we still have to figure out how to pick a good prior. There's also the problem that the predictively best model of me really is the simulation, so there's pressure on the AI to encode a simulation of me within whatever model we make it use. That would probably break whatever common-sense properties the things labeled 'goals' are supposed to have.

Foil: How did you start thinking in terms of goals in the first place, then?

Human: Well, I'm not even sure that I think in terms of goals. I care about them, but as you can see I can't precisely pin them down. It feels like I do my actual modeling of humans in terms of a bunch of related heuristics, and then use the label 'goals' to encapsulate planning-related parts of many of these heuristics. I also owe a lot to language - 'goal' is a word after all, and it fits within an entire mode of thinking about the world verbally.

Foil: If goals show up in human heuristics and language, they don't seem that hard to at least get a good approximation of. What's the problem?

Human: I still think of those heuristics as being about goals, not being the goals themselves. This is where I get notions like "even if you could convince me of something, it might not be what I really want to do." This gets into the related problem of how I'd want an AI to make decisions. I don't want it to make decisions like me, but I also don't trust the notion of maximization.

Foil: But surely if it's making decisions consistently, it's maximizing some goals, right?

Human: I just feel like even though there are many possible patterns consistent with my thoughts and behavior, if I pick any one of these patterns and maximize it to the extreme, I'm worried that it would be bad. When I say 'bad' in that last sentence, I don't mean bad-in-a-specific-goal-pattern, but instead I mean bad in a more abstract framework of heuristics about my own desires.

Foil: Have you read this post called Impossible moral problems and moral authority? It's by this guy who...

Human: Yes yes, the old self-promotion gag. And sure, maybe there are some patterns I would be fine with being maximized. But not arbitrary ones! There's still an important point here about what kind of process I would trust to decide on my goals. I don't trust a process that's just looking for "the pattern that best fits my thoughts and behavior" by some understandable metric. It should also be taking into consideration the more abstract framework I have that's telling me that maximization is dangerous to my values.

In short, the AI shouldn't reify my goals if I don't. Maybe it should think to itself, "I know that humans are just bundles of atoms, and what I call their 'goals' are just good patterns that I can use to describe their thoughts and behavior. But their thoughts and behavior under-determine those patterns, and if I just pick one of those patterns and maximize it, I'm worried that it would be bad. Not 'bad' in terms of a specific one of the goal patterns, but in the sense of some more abstract model I have of how optimization relates to goals."

Foil: But then what do you expect it to do? Haven't you just stuck yourself with an even more complicated other-izer problem?

Human: Hopefully it's actually a simpler other-izer problem, if we can teach our reasons for wanting an other-izer in the first place. And I'm less worried about not solving the problem than I am about not trying to solve the problem, as it were.

My central point is that AI should be a naturalist, and learn about me as I am rather than some fiction of me. If "what I want" is an underdefined category, the AI should probably be explicitly learning and thinking about how to judge underdefined things. If it's self-reflective, there's a strong argument that it had better conceive of its job as something like interpretation of an underdefined goal, or else it will be looking over its own source code under false assumptions.
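As a toy illustration of the under-determination point from the dialogue, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the observed choices, the two candidate "goal patterns" (`more_is_better` and `enough_is_enough`), and the option range. It is not a proposal for how value learning should work. It only shows that two patterns can fit the same observed behavior equally well and still recommend very different things once one of them is maximized outside the range where the behavior was observed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical observed behavior: (chosen, rejected) amounts of some good,
# all within an everyday range of 0..5. These data are made up.
observed_choices: List[Tuple[int, int]] = [
    (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (5, 2),
]

def more_is_better(x: int) -> float:
    """Candidate pattern A (assumed): utility keeps growing with x."""
    return float(x)

def enough_is_enough(x: int) -> float:
    """Candidate pattern B (assumed): utility peaks at 5, then declines."""
    return float(x) if x <= 5 else 10.0 - x

def fits(utility: Callable[[int], float],
         choices: List[Tuple[int, int]]) -> bool:
    """True if the utility ranks every chosen option above the rejected one."""
    return all(utility(chosen) > utility(rejected)
               for chosen, rejected in choices)

def maximize(utility: Callable[[int], float], options: range) -> int:
    """Return the option with the highest utility."""
    return max(options, key=utility)

candidates: Dict[str, Callable[[int], float]] = {
    "more_is_better": more_is_better,
    "enough_is_enough": enough_is_enough,
}

# Options far beyond anything the human was ever observed choosing between.
extended_options = range(0, 101)

consistent = {name: fits(u, observed_choices) for name, u in candidates.items()}
optima = {name: maximize(u, extended_options) for name, u in candidates.items()}

print(consistent)  # both patterns explain the observed behavior
print(optima)      # but their maximizers disagree outside the observed range
```

Both candidates explain the observed choices, so goodness of fit alone cannot pick between them. The disagreement only shows up under maximization, which is the situation in which the Human says they would want a more abstract framework consulted.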
