Another post in the same incremental vein. Still hoarding the wild speculation for later.
The idea of the "intentional stance" comes from Dan Dennett, who wanted to explain why it makes sense that we should think humans have such things as "beliefs" or "desires." The intentional stance is just a fancy name for how humans usually think in order to predict the normal human-scale world - we reason in terms of people, desires, right and wrong, etc. Even if you could be prodded into admitting that subatomic particles are "what there really are," you don't try to predict people by thinking about subatomic particles, you think about their desires and abilities.
We want to design AIs that can copy this way of thinking. Thus, the problem of the artificial intentional stance. Value learning is the project of getting the AI to know what humans want, and the intentional stance is the framework that goes between raw sensory data and reasoning as if there are things called "humans" out there that "want" things.
Suppose you want to eat a strawberry, and you are trying to program an AI that can learn that you want to eat a strawberry. A naive approach would be to train the AI to model the world as best it can (I call this best-fit model the AI's "native ontology"), and then bolt on some rules telling it that you are a special object who should be modeled as an agent with desires.
The reason this doesn't work is that the intentional stance is sort of infectious. When I think about you wanting to eat a strawberry using my intentional stance, I don't think about "you" as a special case and then use my best understanding of physiology, biochemistry, and particle physics to model the strawberry. Instead, I think of the verb "to eat" in terms of human desires and abilities, and I think of the strawberry in terms of how humans might acquire or eat one.
This is related to the concept of "affordances" introduced by James J. Gibson. Affordances are the building blocks for how we make plans in the environment. If I see a door, I intuitively think of opening or locking it - it "affords opening." But maybe an experienced thief will intuitively think of how to bypass the door - they'll have a different intuitive model of reality, in which different affordances live.
When you say you want to eat a strawberry, you are using an approximate model of the world that not only helps you model "you" and "want" at a high level of abstraction, but also "eat" and "strawberry." The AI's artificial intentional stance can't just be a special model of the human, it has to be a model of the whole world in terms of what it "affords" the human.
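To make the door example a bit more concrete, here is a minimal toy sketch of a world described in terms of what it affords a particular agent, rather than in terms of physics. Everything in it (the `Agent` class, the `AFFORDANCE_TABLE`, the ability names) is made up for illustration; it is not a proposal for how a real artificial intentional stance would be represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    abilities: frozenset


# Hypothetical table: for each object, the actions it can afford and the
# ability an agent needs before that action shows up in their model at all.
AFFORDANCE_TABLE = {
    "door": {"open": "use_handle", "lock": "use_key", "bypass": "pick_locks"},
    "strawberry": {"pick_up": "grasp", "eat": "chew"},
}


def affordances(obj, agent):
    """Return the actions that `obj` affords this particular agent."""
    return sorted(
        action
        for action, needed in AFFORDANCE_TABLE[obj].items()
        if needed in agent.abilities
    )


# Two invented agents who look at the same door and see different things.
resident = Agent("resident", frozenset({"use_handle", "use_key", "grasp", "chew"}))
thief = Agent("thief", frozenset({"use_handle", "pick_locks", "grasp", "chew"}))

print(affordances("door", resident))  # ['lock', 'open']
print(affordances("door", thief))     # ['bypass', 'open']
```

The point of the toy is only that the description of the door depends on who is looking at it: nothing in the table mentions hinges or atoms, and the strawberry is likewise described by what a human can do with it.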
If we want to play a CIRL-like cooperative game with real human goals, we'll need the artificial intentional stance.
CIRL assumes that the process generating its examples is an agent (modeled in a way that is fixed when implementing that specific CIRL-er), and tries to play a cooperative game with that agent. But the true "process that determines its inputs" is the entire universe - and if the CIRL agent can only model a small chunk of the universe, there's no guarantee that that chunk will be precisely human-shaped.
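Here is a minimal sketch of what "assumes the process is an agent" looks like in practice. This is not CIRL itself, just the goal-inference half, using a softmax ("noisily rational") human model that I am assuming for illustration. The goals, actions, utilities, and the `rationality` parameter are all invented.

```python
import math

# Hypothetical toy setup: three possible observations and two candidate goals.
ACTIONS = ["reach_for_strawberry", "reach_for_lemon", "do_nothing"]

# Assumed utility of each action under each candidate goal (not learned).
UTILITY = {
    "wants_strawberry": {
        "reach_for_strawberry": 1.0, "reach_for_lemon": 0.0, "do_nothing": 0.2,
    },
    "wants_lemon": {
        "reach_for_strawberry": 0.0, "reach_for_lemon": 1.0, "do_nothing": 0.2,
    },
}


def action_likelihood(action, goal, rationality=3.0):
    """P(action | goal) under a fixed softmax model of the human."""
    weights = {a: math.exp(rationality * UTILITY[goal][a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())


def posterior_over_goals(observed_actions, prior=None):
    """Bayesian update over goals, treating every observation as an agent's choice."""
    goals = list(UTILITY)
    prior = prior or {g: 1.0 / len(goals) for g in goals}
    unnormalized = {}
    for g in goals:
        p = prior[g]
        for a in observed_actions:
            p *= action_likelihood(a, g)
        unnormalized[g] = p
    total = sum(unnormalized.values())
    return {g: p / total for g, p in unnormalized.items()}


# One observed reach puts about 0.95 of the posterior on "wants_strawberry".
print(posterior_over_goals(["reach_for_strawberry"]))
```

Notice that `posterior_over_goals` will hand back a confident-looking distribution over human goals no matter what actually produced `observed_actions`. The agent-shaped model is baked into `action_likelihood`, so nothing in the inference can notice if the inputs came from some process that isn't well described as a human with one of these goals.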
If we want an AI to cooperate with humans even if it's smart enough to model a larger-than-human chunk of the universe, this is an intentional stance problem. We want it to model the inputs to some channel in terms of specifically human-sized approximate agents, living in the universe. And then use this same intentional-stance model to play a cooperative game with the humans, because it's this very model in which "humans" are possible teammates.
This exposes one of the key difficulties in designing an artificial intentional stance: it needs to be connected to other parts of the AI. It's no good having a model of humans that has no impact on the AI's planning or motivation. You have to be able to access the abstraction "what the humans want" and use it elsewhere, either directly (if you know the format and where it's stored in memory), or indirectly (via questions, examples, etc.).
The other basic difficulty is: how are you supposed to train or learn an artificial intentional stance?
If we think of it as a specialized model of the world, we might try to train it for predictive power, tuning the parameters so that it gets the right answers as often as possible. But that can't be right, because the artificial intentional stance is supposed to be different from the AI's best-predicting native ontology.
I'm even skeptical that you can improve the intentional stance by training it for efficiency, or predictive power under constraints. Humans might use the intentional stance because it's efficient, but it's not a unique solution - the psychological models that people use have changed over history, so there's that much wiggle room at the very, very least. We want the AI to copy what humans are doing, not strike out on its own and end up with an inhuman model of humans.
This means that the artificial intentional stance, however it's trained, is going to need information from humans, about humans. But of course humans are complicated, our stories about humans are complicated, and so the AI's stories about humans will be complicated. An intentional stance model has to walk a fine line so that it captures important detail, but not so much detail that the humans are no longer being understood in terms of beliefs, desires, etc.
I think this wraps up the basic points (does it, though?), but I might be missing something. I've certainly left out some "advanced points," of which I think the key one is the problem of generalization: if an AI has an intentional stance model of you, would you trust it to design a novel environment that it thinks you'll find extremely congenial?
Oh and of course I've said almost nothing about practical schemes for creating an artificial intentional stance. Dissecting the corpse of a scheme can help clarify issues, although sometimes fatal problems are specific to that scheme. By the end of summer I'll try to delve a little deeper, and take a look at whether you could solve AI safety by prompting GPT-2 with "Q: What does the human want? A:".