Humans aren't agents - what then for value learning?

jessicata 7y 80

Suppose you are building an AI and want something from it. Then you are an agent with respect to that thing, since you want it. Probably, you also want the AI to infer your want and act on it. If you don't want things, then you have no reason to build an AI (or not to build an AI).

Models of humans based on control theory aren't generative enough to generate control theory; a group of people just acting on stimulus/response won't spontaneously write a book about control theory. If your model of humans is generative enough to generate itself, then it will assign agency to at least some humans, enough to reflect your goals in making the model.

And, if we're in the context of making models of humans (for the purpose of AI), it's sufficient (in this context, with respect to this context) to achieve the goals of this context.

Charlie Steiner 7y 40

Could you elaborate on what you mean by "if your model of humans is generative enough to generate itself, then it will assign agency to at least some humans?" I think the obvious extreme is a detailed microscopic model that reproduces human behavior without using the intentional stance - is this a model that doesn't generate itself, or is this a model that assigns agency to some humans?

It seems to me that you're relying on the verb "generate" here to involve some sort of human intentionality, maybe? But the argument of this post is that our intentionality is inexact and doesn't suffice.

> Suppose you are building an AI and want something from it. Then you are an agent with respect to that thing, since you want it.

There's wanting, and then there's Wanting. The AI's model of me isn't going to regenerate my Real Wanting, which requires the Essence of True Desire. It's only going to regenerate the fact that I can be modeled as wanting the thing. But I can be modeled as wanting lots of things, is the entire point.

jessicata 7y 50

> I think the obvious extreme is a detailed microscopic model that reproduces human behavior without using the intentional stance - is this a model that doesn’t generate itself, or is this a model that assigns agency to some humans?

It would generate itself given enough compute, but you can't, as a human, use physics to predict that humans will invent physics, without using some agency concept. Anyway, there are decision theoretic issues with modeling yourself as a pure mechanism; to make decisions, you think of yourself as controlling what this mechanism does. (This is getting somewhat speculative; I guess my main point here is that you, in practice, have to use the intentional stance to actually predict human behavior as complex as making models of humans, which doesn't mean an AI would have to.)

Does it seem clear to you that if you model a human as a somewhat complicated thermostat (perhaps making decisions according to some kind of flowchart) then you aren't going to predict that a human would write a post about humans being somewhat complicated thermostats?

> There’s wanting, and then there’s Wanting.

When I say "suppose you want something" I mean "actual wanting" with respect to the purposes of this conversation, which might map to your Wanting. It's hard to specify exactly. The thing I'm saying here is that a notion of what "wanting" is is implicit in many discourses, including discourse on what AI we should build (notice the word "should" in that sentence).

Relevant: this discussion of proofs of the existence of God makes the similar point that perhaps proofs of God are about revealing a notion of God already implicit in the society's discourse. I'm proposing a similar thing about "wanting".

(note: this comment and my previous one should both be read as speculative research idea generation, not solidified confident opinions)

jessicata 7y 50

A fictional dialogue to illustrate:

A: Humans aren't agents, humans don't want things. It would be bad to make an AI that assumes these things.

B: What do you mean by "bad"?

A: Well, there are multiple metaethical theories, but for this conversation, let's say "bad" means "not leading to what the agents in this context collectively want".

B: Aha, but what does "want" mean?

A: ...

[EDIT: what I am suggesting is something like "find your wants in your metaphysical orientation, not your ontology, although perhaps use your ontology for more information about your wants".]

[EDIT2: Also, your metaphysical orientation might be confused, in which case the solution is to resolve that confusion, producing a new metaphysical orientation, plausibly one that doesn't have "wanting" and for which there is therefore no proper "AI alignment" problem, although it might still have AI-related philosophical problems]

Charlie Steiner 7y 10

Person A isn't getting it quite right :P Humans want things, in the usual sense that "humans want things" indicates a useful class of models I use to predict humans. But they don't Really Want things, the sort of essential Wanting that requires a unique, privileged function from a physical state of the human to the things Wanted.

So here's the dialogue with A as more of a self-insert for my own views:

A: Humans aren't agents, by which I mean that humans don't Really Want things. It would be bad to make an AI that assumes they do.

B: What do you mean by "bad"?

A: I mean that there wouldn't be such a privileged Want for the AI to find in humans - humans want things, but can be modeled as wanting different things depending on the environment and level of detail of the model.

B: No, I mean how could you cash out "bad" if not in terms of what you Really Want?

A: Just in terms of what I regular, contingently want - how I'm modeling myself right now.

B: But isn't that a privileged model that the AI could figure out and then use to locate your wants? And since these wants are so naturally privileged, wouldn't that make them what you Really Want?

A: The AI could do something like that, but I don't like to think of that as finding out what I Really Want. The result isn't going to be truly unique because I use multiple models of myself, and they're all vague and fallible. And maybe more importantly, programming an AI to understand me "on my own terms" faces a lot of difficult challenges that don't make sense if you think the goal is just to translate what I Really Want into the AI's internal ontology.

B: Like what?

A: You remember the Bay Area train analogy at the end of The Tails Coming Apart as Metaphor for Life? When the train lines diverge, thinking of the problem as "figure out what train we Really Wanted" doesn't help, and might divert people from the possible solutions, which are going to be contingent and sometimes messy.

B: But eventually you actually do follow one of the train lines, or program it into the AI, which uniquely specifies that as what you Really Want! Problem solved.

A: "Whatever I do is what I wanted to do" doesn't help you make choices, though.
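To make A's claim above (that a person "can be modeled as wanting different things") a bit more concrete, here is a toy sketch. Everything in it is made up for illustration: the drink choices, the two utility tables, and the helper names `predicts` and `consistent` are assumptions, not anything from a real value-learning system. It only shows that two different "want" models can fit the same observed choices and still disagree once the environment changes.

```python
# Hypothetical observations: (options offered, option actually chosen).
observations = [
    (("tea", "coffee"), "coffee"),
    (("coffee", "water"), "coffee"),
    (("tea", "water"), "tea"),
]

# Two candidate "want" models, written as utilities over outcomes.
# They agree on everything observed so far and differ only on "cola",
# which never appeared in the observations.
model_caffeine = {"coffee": 3, "tea": 2, "water": 0, "cola": 2.5}  # "wants caffeine"
model_warmth = {"coffee": 3, "tea": 2, "water": 0, "cola": -1}     # "wants warm drinks"


def predicts(model, options):
    """Return the option this model says the person would pick."""
    return max(options, key=lambda option: model[option])


def consistent(model, observations):
    """True if the model reproduces every observed choice."""
    return all(predicts(model, opts) == chosen for opts, chosen in observations)


# Both models explain the data equally well...
print(consistent(model_caffeine, observations), consistent(model_warmth, observations))

# ...but they come apart in a new environment.
new_options = ("tea", "cola")
print(predicts(model_caffeine, new_options), predicts(model_warmth, new_options))
```

Nothing in the observed behavior picks out one table as the privileged one; choosing between them takes extra assumptions about which model is better.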

jessicata 7y 20

Thanks for explaining; your position makes more sense now. I think I agree with your overall point that there isn't a "platonic Want" that can be directly inferred from physical state, at least without substantial additional psychology/philosophy investigation (which could, among other things, define bargaining solutions among the different wants).

So, there are at least a few different issues here for contingent wants:

Wants vary over time.

OK, so add a time parameter, and do what I want right now.

People could potentially use different "wanting" models for themselves.

Yes, but some models are better than others. (There's a discussion of arbitrariness of models here which seems relevant)

In practice the brain is going to use some weighting procedure between them. If this procedure isn't doing necessary messy work (it's really not clear if it is), then it can be replaced with an algorithm. If it is, then perhaps the top priority for value learning is "figure out what this thingy is doing and form moral opinions about it".

"Wanting" models are fallible.

Not necessarily a problem (but see next point); the main thing with AI alignment is to do much better than the "default" policy of having aligned humans continue to take actions, using whatever brain they have, without using AGI assistance. If people manage with having fallible "wanting" models, then perhaps the machinery people use to manage this can be understood?

"Wanting" models have limited domains of applicability.

This seems like Wei's partial utility function problem and is related to the ontology identification problem. It's pretty serious and is also a problem independently of value learning. Solving this problem would require either directly solving the philosophical problem, or doing psychology to figure out what machinery does ontology updates (and form moral opinions about that).

Charlie Steiner 7y 10

> Does it seem clear to you that if you model a human as a somewhat complicated thermostat (perhaps making decisions according to some kind of flowchart) then you aren't going to predict that a human would write a post about humans being somewhat complicated thermostats?

Is my flowchart model complicated enough to emulate an RNN? Then I'm not sure.

Or one might imagine a model that has psychological parts, but distributes the function fulfilled by "wants" in an agent model among several different pieces, which might conflict or reinforce each other depending on context. This model could reproduce human verbal behavior about "wanting" with no actual component in the model that formalizes wanting.

If this kind of model works well, it's a counterexample (less compute-intensive than a microphysical model) of the idea I think you're gesturing towards, which is that the data really privileges models in which there's an agent-like formalization of wanting.

jessicata 7y 20

> Or one might imagine a model that has psychological parts, but distributes the function fulfilled by “wants” in an agent model among several different pieces, which might conflict or reinforce each other depending on context.

Hmm, so with enough compute (like, using parts of your brain to model the different psychological parts), perhaps you could do something like this for yourself. But you couldn't predict the results of the behavior of people smarter than you. For example, you would have a hard time predicting that Kasparov would win a chess game against a random chess player, without being as good at chess as Kasparov yourself, though even with the intentional stance you can't predict his actions. (You could obviously predict this using statistics, but that wouldn't be based on just the mechanical model itself)

That is, it seems like the intentional stance often involves using much less compute than the person being modeled in order to predict that things will go in the direction of the person's wants (limited by the person's capabilities), without predicting each of the person's actions.

John_Maxwell 7y 50

Flagging that the end of "The Tails Coming Apart as Metaphor for Life" more or less describes "distributional shift" from the Concrete Problems in AI Safety paper.

I have a hunch that many AI safety problems end up boiling down to distributional shift in one way or another. For example, here I argued that concerns around Goodhart's Law are essentially an issue of distributional shift: If the model you're using for human values is vulnerable to distributional shift, then the maximum value will likely be attained off-distribution.
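A toy sketch of that last claim, under loudly stated assumptions: the "true" value function, the training range, and the search range below are all invented for illustration, and the proxy is just a least-squares line. The point is only that a proxy fit on a narrow range can look fine there, while its maximum over a wider range lands far outside the data it was fit on.

```python
def true_value(x: float) -> float:
    """Hypothetical 'real' value: rises until x = 1, then falls."""
    return x - 0.5 * x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# Training distribution: only x in [0, 0.5] is ever observed.
train_xs = [i / 20 for i in range(11)]
train_ys = [true_value(x) for x in train_xs]
a, b = fit_line(train_xs, train_ys)


def proxy_value(x: float) -> float:
    """Learned stand-in for true_value; only trustworthy near the training data."""
    return a * x + b


# The optimizer searches a much wider range than the training data covered.
candidates = [i / 10 for i in range(101)]  # 0.0, 0.1, ..., 10.0
best_by_proxy = max(candidates, key=proxy_value)
best_by_truth = max(candidates, key=true_value)

print("proxy optimum:", best_by_proxy, "true value there:", true_value(best_by_proxy))
print("true optimum:", best_by_truth, "true value there:", true_value(best_by_truth))
```

Because the fitted slope is positive, the proxy keeps rewarding larger x without limit, so the proxy's optimum sits at the edge of the search range, where the true value is strongly negative.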

Charlie Steiner 7y 20

Sure. It describes how humans aren't robust to distributional shift.

cousin_it 7y 40

We've been over this:

steven0461: Who cares about the question what the robot “actually wants”? Certainly not the robot. Humans care about the question what they “actually want”, but that’s because they have additional structure that this robot lacks.

Wei_Dai: In other words, our “actual values” come from our being philosophers, not our being consequentialists.

That's the right answer as far as I can tell. Humans do have a part that "actually wants" something - we can introspect on our own desires - and the thermostat analogy discards it. Yes, that means any good model of our desires must also be a model of our introspective abilities, which makes the problem much harder.

Charlie Steiner 7y 10

I mostly agree, though you can really tell me we have the right answer once we can program it into a computer :) Human introspection is good at producing verbal behavior, but is less good at giving you a utility function on states of the universe. Part of the problem is that it's not like we have "a part of ourselves that does introspection" like it's some kind of orb inside our skulls - breaking human cognition into parts like that is yet another abstraction that has some free parameters to it.

cousin_it 7y* 20

Sure. Though learning from verbal descriptions of hypothetical behavior doesn't seem much harder than learning from actual behavior - they're both about equally far from "utility function on states of the universe" :-)

Charlie Steiner 7y 10

I hope so! IRL and CIRL are really nice frameworks for learning from general behavior, and as far as I can tell, learning from verbal behavior requires a simultaneous model of verbal and general behavior, with some extra parts that I don't understand yet.

Kerrigan 6mo 10

Why can't the true values live at the level of anatomy and chemistry?

Charlie Steiner 6mo 20

If God has ordained some "true values," and we're just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants.

On the other hand, if we're trying to find good generalizations of the way we use the notion of "our values" in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.
