The AI doesn't have a model of what it should do, the AI is the model.
This of course generalizes to say that beings don't have models, they are the models in the sense that the model exists as a post hoc reification of what the being is.
I think you are tackling an important problem of intuitions here. The votes on this post suggest a mixed reception (9 votes before I voted, with a total score of 5), and there are no other comments, so I'm left to speculate about what people might dislike about this post.
My guesses are:
I want to encourage you to keep at it. I generally agree with what you've written in this vein so far. I continue to sense a general approval of approaches to AI that treat abstractions as absolute, in ways that will break down dangerously because those abstractions are leaky. So I view writing like yours as an important corrective, one that shifts intuitions to better reflect the embeddedness and interconnectedness of agents.
Eliezer Yudkowsky, Ghosts in the Machine
When I started thinking about value learning, I thought the goal was to extract simple objects that described the essence of morality. Not so simple as a verbal definition, but something like a utility function. Something separate from planning or reasoning, that was purely about preferences, which you could plug into an AI which would then do some totally separate work to turn preferences into choices.
Turns out that runs into some serious obstacles.
I
The difficulty of value learning is that there is no One True Utility Function to be assigned to the globs of atoms we call humans. To think about them as having desires at all requires viewing them at a suitable level of abstraction - though of course, there's no One True Level Of Abstraction, either. (I promise this is my last post that's basically just consequences of needing the intentional stance for a while.)
Call the world-model the AI uses to best predict the world its "native ontology." If I want to go to the gym, we want the AI to look at the atoms and see "Charlie wants to go to the gym." The thing that I want is not some specific state of the AI's native ontology. Instead, I can only "want" something in an abstracted ontology that not only contains the AI's intentional-stance model of "Charlie," but also intentional-stance-compatible abstractions for "go" and "gym." In short, abstraction is contagious.
This is like the idea of an umwelt (oom-velt), introduced by early philosopher of biology Jakob Johann von Uexküll. In nature, different organisms can have different effective models of the world even though they live in the same environment. They only evolve to model what is necessary for them to survive and reproduce. The umwelt is a term for this modeled world. The umwelt of a bloodsucking tick consists largely of things to climb on and warm-blooded mammals, which are perceived not by sight but by a certain smell and body temperature.
I think of the AI's intentional stance as not just being a specially abstract model of me, but also being a model of my entire umwelt. It needs an abstraction of the gym because the gym is a part of my inner world, an abstract concept that gets referenced in my plans and desires.
II
Back to value learning. The bare minimum for success is that we build an AI that can predict which actions will do a good job satisfying human values. But how minimalist do we really have to be? Can we get it to output an abstract object corresponding to human values, like a utility function or some compression thereof?
Well, maybe. If it had a complete understanding of humans, maybe it could take that abstract, intentional stance description of humans and cash it out into a utility function over world-histories. Note that this is over world-histories, not world-states, because humans' abstractions often involve things like duration and change. So one problem is that this object is impractically massive, both to construct and to use. In order to actually do anything with human values, what we want is the compressed, abstracted version, and this turns out to more or less consist of the entire AI.
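To make the world-histories point concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration only: a two-location world, a made-up preference about spending consecutive time at the gym, and the hypothetical names `utility_over_history` and `count_histories`. It shows two histories that end in the same state but differ in value, and how quickly the number of histories grows with the horizon.

```python
# Hypothetical toy world: at each time step Charlie is either at home or at the gym.
STATES = ("home", "gym")


def utility_over_history(history):
    """Toy preference (an assumption): Charlie wants at least two consecutive
    steps at the gym. Because this is about duration, it needs the whole
    history as input, not just the final state."""
    longest = 0
    run = 0
    for state in history:
        run = run + 1 if state == "gym" else 0
        longest = max(longest, run)
    return 1.0 if longest >= 2 else 0.0


def count_histories(horizon):
    """Number of distinct histories an explicit utility table would have to cover."""
    return len(STATES) ** horizon


# Both histories end in "home", but only the first contains a two-step gym visit.
history_a = ("home", "gym", "gym", "home")
history_b = ("gym", "home", "gym", "home")

print(utility_over_history(history_a), utility_over_history(history_b))
print(count_histories(4), count_histories(40))
```

Since `history_a` and `history_b` share a final state yet receive different utilities, no function of the final state alone can reproduce this preference. Even in this two-state toy, the explicit table doubles with every added time step, which is the sense in which the full object is impractically massive.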
It's theoretically convenient to think about separating values and planning, only passing a utility function from one to the other. But in practice the utility function is too big to construct, which means that the planning step must repeatedly talk to the abstract model, and is no longer so cleanly separate from it. This is especially true if we imagine optimizing end-to-end, causing every part to be optimized to fit every other part, like two trees growing intertwined.
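As a rough illustration of a planner that has to keep talking to the abstract model, here is a toy sketch. The class `AbstractValueModel`, its scoring rule, and the greedy planner `plan_greedily` are all hypothetical stand-ins, not a proposal for how a real system would work. The point is only that the planner never receives a finished utility table; it queries the model about each candidate plan as it goes.

```python
class AbstractValueModel:
    """Hypothetical stand-in for the AI's abstracted model of human values."""

    def __init__(self):
        self.queries = 0  # how many times the planner has consulted the model

    def score(self, plan):
        # Toy judgment (an assumption): gym visits are good,
        # but doing the same thing twice in a row is tedious.
        self.queries += 1
        total = 0.0
        for i, action in enumerate(plan):
            if action == "gym":
                total += 1.0
            if i > 0 and plan[i - 1] == action:
                total -= 1.5
        return total


def plan_greedily(model, actions, horizon):
    """Build a plan one step at a time, asking the model to score every
    candidate extension instead of looking values up in a precomputed table."""
    plan = []
    for _ in range(horizon):
        best = max(actions, key=lambda a: model.score(plan + [a]))
        plan.append(best)
    return plan


model = AbstractValueModel()
plan = plan_greedily(model, ("home", "gym"), horizon=3)
print(plan, model.queries)
```

Here the planner makes one query per candidate action per step, so the model is in the loop for the whole planning process. If both pieces were trained end-to-end, there would be no obvious place to cut them apart.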
The other factor blurring any neat lines is meta-ethics. We might want to use meta-ethical data - information learned by observing and talking to humans - to change how the AI treats its information about human values, or even change which decision theory it's using. You can frame this as preferences over the AI's own code, but this is still a case of supposedly simpler preferences actually containing the specification of the whole AI.
These violations of clean separability tell us that our goal shouldn't be to find a separate "human values" object. Except in special cases that we really shouldn't count on, the entire FAI is the "human values" object, and all of its parts might make sense only in the context of its other parts. The AI doesn't have a model of what it should do, the AI is the model.