Humans aren't agents in the same way a thermostat isn't an agent.
Consider a truly humble thermostat. One temperature sensor, one set point, one output to turn the furnace on while the sensed temperature is below the set point. If we're being generous, we might construct an intentional stance model of the world in which this thermostat is an agent that wants the house to be at the set point.
But if the environment were a little bit different - say we block the vents leading to the upstairs - the thermostat doesn't try to unblock the vents. In the changed environment, it just acts as if it wants only the downstairs to be at the set point.
It is not that one of these is what the thermostat Really Wants and the other isn't. This entire model in which the thermostat wants things is a convenient fiction we're using to help think about the world. To ask what the thermostat Really Wants even after you know all the physical facts is an error, driven by the human tendency to mix up properties of our models with properties of objects.
You can fix this, and model the thermostat in a way that correctly predicts its behavior in more environments, but every time you make such an expansion of the space of environments, you make your model of the thermostat more concrete and less agenty. Eventually you end up with something like "It wants to increase the output signal when the input voltage is smaller than the voltage controlled by the dial on the front," at which point you might as well strip off the veneer about it "wanting" anything and predict it using physics.
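To make the blocked-vent case concrete, here is a minimal toy simulation. It is only a sketch: the two-floor house, the heat and leak constants, and the function names are all illustrative assumptions, not anything specified above. The thermostat itself is a single comparison. The two "wants" functions are our models of it, and only one of them survives the change of environment.

```python
from dataclasses import dataclass

# All constants are illustrative assumptions for this toy model.
OUTSIDE = 10.0    # outdoor temperature
SET_POINT = 20.0  # dial setting on the thermostat
LEAK = 0.05       # fraction of the indoor/outdoor gap lost per step
HEAT = 1.0        # degrees added per step to each floor the furnace reaches


@dataclass
class House:
    downstairs: float
    upstairs: float
    vents_blocked: bool = False


def thermostat(sensed: float, set_point: float) -> bool:
    """The entire device: furnace on while the sensed temperature is below the set point."""
    return sensed < set_point


def step(house: House, furnace_on: bool) -> House:
    """Advance the toy house by one time step."""
    heat_down = HEAT if furnace_on else 0.0
    # Heat only reaches the upstairs through the vents.
    heat_up = 0.0 if house.vents_blocked else heat_down
    down = house.downstairs + heat_down - LEAK * (house.downstairs - OUTSIDE)
    up = house.upstairs + heat_up - LEAK * (house.upstairs - OUTSIDE)
    return House(down, up, house.vents_blocked)


def simulate(vents_blocked: bool, steps: int = 500) -> House:
    """Run the house with the sensor located downstairs."""
    house = House(15.0, 15.0, vents_blocked)
    for _ in range(steps):
        house = step(house, thermostat(house.downstairs, SET_POINT))
    return house


# Two intentional-stance models of the same device.
def wants_whole_house_warm(house: House, tol: float = 1.0) -> bool:
    return (abs(house.downstairs - SET_POINT) < tol
            and abs(house.upstairs - SET_POINT) < tol)


def wants_downstairs_warm(house: House, tol: float = 1.0) -> bool:
    return abs(house.downstairs - SET_POINT) < tol


if __name__ == "__main__":
    for blocked in (False, True):
        final = simulate(blocked)
        print(blocked, wants_whole_house_warm(final), wants_downstairs_warm(final))
```

Nothing in `thermostat` changed between the two runs. Only the environment did, and with it which story about "wanting" fits the behavior.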
This is what humans are like. In the ancestral environment, I would behave like someone who wants to eat fresh fruits and vegetables. Introduce Doritos to the environment, and I'll eat those instead. To expand the space of environments to include Doritos, you had to make your model of me more concrete (i.e. "Charlie wants to eat things that taste good"). If you pump heroin into my brain, I'll behave like someone who wants more heroin - which you can predict if you stop modeling me in terms of tastes and start modeling me in terms of anatomy and chemistry.
The model of me as someone who wants to eat fresh fruits and vegetables didn't fail because I have True Values and eating Doritos fulfills my True Values better than eating wild berries, but because the environment has been altered in a way that happens to be beyond the domain of validity of the ancestral model.
It's just like how the thermostat doesn't Really Want anything in particular. When the environment has the vents unblocked, interpreting the thermostat as wanting to control the whole house is a useful model. When you place me in the ancestral environment, interpreting me as wanting to eat fresh fruits and vegetables is a useful model of me.
Humans' apparent values can change with the environment. Put us in the ancestral environment and we'll behave as if we like nutrition and reproducing. Put us in the modern environment and we'll behave as if we like Doritos and sex - we can model this transition by being less idealized about humans. Pump heroin into our brains and we'll behave as if we want more - we can model this by being even less idealized. There is no One True level of idealization at which the True Values live.
This has direct consequences for value learning, which is the attempt to program computers to infer human values. You cannot just say "assume humans are agents and infer their values," because there is no True interpretation of human behavior in terms of an agent's desires. This is, finally, what I mean by saying that humans are not agents: in the context of value learning, it won't work to tell the computer to assume that humans are agents.
What then for value learning? Well, step 1 is to accept that if the AI is going to learn something about human morality, it's going to learn to tell a certain sort of story about humans, which features human desires and beliefs in a way suitable to guide the AI's plans. This class of stories is not going to be the One True way of thinking about humans, and so this AI might have to learn from humans about how they model humans.
There is a second half of this post. Given that these stories about human desires are dependent on the environment, and given that our opinion about the best way to interpret humans involves some famously fallible human intuition, won't these stories be at risk of failure under pressure from optimization in the vast space of possible environments?
But instead of repeating what Scott has already said better, if you want to read about it you'll just have to go to The Tails Coming Apart as Metaphor for Life.
Suppose you are building an AI and want something from it. Then you are an agent with respect to that thing, since you want it. Probably, you also want the AI to infer your want and act on it. If you don't want things, then you have no reason to build an AI (or not to build an AI).
Models of humans based on control theory aren't generative enough to generate control theory; a group of people just acting on stimulus/response won't spontaneously write a book about control theory. If your model of humans is generative enough to generate itself, then it will assign agency to at least some humans, enough to reflect your goals in making the model.
And if we're in the context of making models of humans (for the purpose of AI), then it's sufficient, with respect to that context, for the model to achieve the goals of that context.
Could you elaborate on what you mean by "if your model of humans is generative enough to generate itself, then it will assign agency to at least some humans?" I think the obvious extreme is a detailed microscopic model that reproduces human behavior without using the intentional stance - is this a model that doesn't generate itself, or is this a model that assigns agency to some humans?
It seems to me that you're relying on the verb "generate" here to involve some sort of human intentionality, maybe? But the argument of this post is that our intentionality is inexact and doesn't suffice.
There's wanting, and then there's Wanting. The AI's model of me isn't going to regenerate my Real Wanting, which requires the Essence of True Desire. It's only going to regenerate the fact that I can be modeled as wanting the thing. But I can be modeled as wanting lots of things, is the entire point.
It would generate itself given enough compute, but you can't, as a human, use physics to predict that humans will invent physics, without using some agency concept. Anyway, there are decision theoretic issues with modeling yourself as a pure mechanism; to make decisions, you think of yourself as controlling what this mechanism does. (This is getting somewhat speculative; I guess my main point here is that you, in practice, have to use the intentional stance to actually predict human behavior as complex as making models of humans, which doesn't mean an AI would have to.)
Does it seem clear to you that if you model a human as a somewhat complicated thermostat (perhaps making decisions according to some kind of flowchart) then you aren't going to predict that a human would write a post about humans being somewhat complicated thermostats?
When I say "suppose you want something" I mean "actual wanting" with respect to the purposes of this conversation, which might map to your Wanting. It's hard to specify exactly. The thing I'm saying here is that a notion of what "wanting" is is implicit in many discourses, including discourse on what AI we should build (notice the word "should" in that sentence).
Relevant: this discussion of proofs of the existence of God makes the similar point that perhaps proofs of God are about revealing a notion of God already implicit in the society's discourse. I'm proposing a similar thing about "wanting".
(note: this comment and my previous one should both be read as speculative research idea generation, not solidified confident opinions)
A fictional dialogue to illustrate:
A: Humans aren't agents, humans don't want things. It would be bad to make an AI that assumes these things.
B: What do you mean by "bad"?
A: Well, there are multiple metaethical theories, but for this conversation, let's say "bad" means "not leading to what the agents in this context collectively want".
B: Aha, but what does "want" mean?
[EDIT: what I am suggesting is something like "find your wants in your metaphysical orientation, not your ontology, although perhaps use your ontology for more information about your wants".]
[EDIT2: Also, your metaphysical orientation might be confused, in which case the solution is to resolve that confusion, producing a new metaphysical orientation, plausibly one that doesn't have "wanting" and for which there is therefore no proper "AI alignment" problem, although it might still have AI-related philosophical problems]
Person A isn't getting it quite right :P Humans want things, in the usual sense that "humans want things" indicates a useful class of models I use to predict humans. But they don't Really Want things, the sort of essential Wanting that requires a unique, privileged function from a physical state of the human to the things Wanted.
So here's the dialogue again, with A's views rewritten as more of a stand-in for my own:
A: Humans aren't agents, by which I mean that humans don't Really Want things. It would be bad to make an AI that assumes they do.
B: What do you mean by "bad"?
A: I mean that there wouldn't be such a privileged Want for the AI to find in humans - humans want things, but can be modeled as wanting different things depending on the environment and level of detail of the model.
B: No, I mean how could you cash out "bad" if not in terms of what you Really Want?
A: Just in terms of what I want in the regular, contingent sense - how I'm modeling myself right now.
B: But isn't that a privileged model that the AI could figure out and then use to locate your wants? And since these wants are so naturally privileged, wouldn't that make them what you Really Want?
A: The AI could do something like that, but I don't like to think of that as finding out what I Really Want. The result isn't going to be truly unique because I use multiple models of myself, and they're all vague and fallible. And maybe more importantly, programming an AI to understand me "on my own terms" faces a lot of difficult challenges that don't make sense if you think the goal is just to translate what I Really Want into the AI's internal ontology.
B: Like what?
A: You remember the Bay Area train analogy at the end of The Tails Coming Apart as Metaphor for Life? When the train lines diverge, thinking of the problem as "figure out what train we Really Wanted" doesn't help, and might divert people from the possible solutions, which are going to be contingent and sometimes messy.
B: But eventually you actually do follow one of the train lines, or program it into the AI, which uniquely specifies that as what you Really Want! Problem solved.
A: "Whatever I do is what I wanted to do" doesn't help you make choices, though.
Thanks for explaining, your position makes more sense now. I think I agree with your overall point that there isn't a "platonic Want" that can be directly inferred from physical state, at least without substantial additional psychology/philosophy investigation (which could, among other things, define bargaining solutions among the different wants).
So, there are at least a few different issues here for contingent wants:
Wants vary over time.
OK, so add a time parameter, and do what I want right now.
People could potentially use different "wanting" models for themselves.
Yes, but some models are better than others. (There's a discussion of arbitrariness of models here which seems relevant)
In practice the brain is going to use some weighting procedure between them. If this procedure isn't doing necessary messy work (it's really not clear if it is), then it can be replaced with an algorithm. If it is, then perhaps the top priority for value learning is "figure out what this thingy is doing and form moral opinions about it".
"Wanting" models are fallible.
Not necessarily a problem (but see next point); the main thing with AI alignment is to do much better than the "default" policy of having aligned humans continue to take actions, using whatever brain they have, without using AGI assistance. If people manage with having fallible "wanting" models, then perhaps the machinery people use to manage this can be understood?
"Wanting" models have limited domains of applicability.
This seems like Wei's partial utility function problem and is related to the ontology identification problem. It's pretty serious and is also a problem independently of value learning. Solving this problem would require either directly solving the philosophical problem, or doing psychology to figure out what machinery does ontology updates (and forming moral opinions about that).
Is my flowchart model complicated enough to emulate an RNN? Then I'm not sure.
Or one might imagine a model that has psychological parts, but distributes the function fulfilled by "wants" in an agent model among several different pieces, which might conflict or reinforce each other depending on context. This model could reproduce human verbal behavior about "wanting" with no actual component in the model that formalizes wanting.
If this kind of model works well, it's a counterexample (less compute-intensive than a microphysical model) to the idea I think you're gesturing towards, which is that the data really privileges models in which there's an agent-like formalization of wanting.
Hmm, so with enough compute (like, using parts of your brain to model the different psychological parts), perhaps you could do something like this for yourself. But you couldn't predict the results of the behavior of people smarter than you. For example, you would have a hard time predicting that Kasparov would win a chess game against a random chess player, without being as good at chess as Kasparov yourself, though even with the intentional stance you can't predict his actions. (You could obviously predict this using statistics, but that wouldn't be based on just the mechanical model itself)
That is, it seems like the intentional stance often involves using much less compute than the person being modeled in order to predict that things will go in the direction of the person's wants (limited by the person's capabilities), without predicting each of the person's actions.
Flagging that the end of "The Tails Coming Apart as Metaphor for Life" more or less describes "distributional shift" from the Concrete Problems in AI Safety paper.
I have a hunch that many AI safety problems end up boiling down to distributional shift in one way or another. For example, here I argued that concerns around Goodhart's Law are essentially an issue of distributional shift: If the model you're using for human values is vulnerable to distributional shift, then the maximum value will likely be attained off-distribution.
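A toy illustration of that last claim, as a sketch only: the "true value" function, the training range, and the linear proxy below are all made-up assumptions chosen to make the effect easy to see, and are not taken from the linked argument. A proxy fit on a narrow range of situations looks fine there, but an optimizer searching a much wider range follows it straight off-distribution.

```python
def true_value(x: float) -> float:
    """Hypothetical ground truth: more x is good up to x = 5, then harmful."""
    return 10.0 * x - x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Training distribution: only situations with x between 0 and 3,
# where the true value happens to be increasing in x.
train_xs = [i * 0.1 for i in range(31)]
train_ys = [true_value(x) for x in train_xs]
slope, intercept = fit_line(train_xs, train_ys)


def proxy_value(x: float) -> float:
    """The learned model of value: a line that is only trustworthy on [0, 3]."""
    return slope * x + intercept


# The optimizer gets to search a much larger space of situations.
candidates = [i * 0.1 for i in range(201)]  # x from 0 to 20
best_by_proxy = max(candidates, key=proxy_value)
best_by_truth = max(candidates, key=true_value)

if __name__ == "__main__":
    print("proxy picks x =", best_by_proxy, "with true value", true_value(best_by_proxy))
    print("truth picks x =", best_by_truth, "with true value", true_value(best_by_truth))
```

The proxy's maximum sits at the edge of the search space, far outside the training range, which is exactly where there was never any reason to trust it.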
Sure. It describes how humans aren't robust to distributional shift.
We've been over this:
That's the right answer as far as I can tell. Humans do have a part that "actually wants" something - we can introspect on our own desires - and the thermostat analogy discards it. Yes, that means any good model of our desires must also be a model of our introspective abilities, which makes the problem much harder.
I mostly agree, though you can really tell me we have the right answer once we can program it into a computer :) Human introspection is good at producing verbal behavior, but is less good at giving you a utility function on states of the universe. Part of the problem is that it's not like we have "a part of ourselves that does introspection" like it's some kind of orb inside our skulls - breaking human cognition into parts like that is yet another abstraction that has some free parameters to it.
Sure. Though learning from verbal descriptions of hypothetical behavior doesn't seem much harder than learning from actual behavior - they're both about equally far from "utility function on states of the universe" :-)
I hope so! IRL and CIRL are really nice frameworks for learning from general behavior, and as far as I can tell, learning from verbal behavior requires a simultaneous model of verbal and general behavior, with some extra parts that I don't understand yet.