Crossposted at Intelligent Agents Forum.
Is there a teapot currently somewhere between the orbits of Earth and Mars? And who won the 1986 World Cup football match between England and Argentina? And what do these questions have to do with learning human values?
Both of those questions are uncertain. We haven't scanned the solar system with enough precision to find any such teapot. And though the referee allowed the first Argentine goal, making it official, and though FIFA had Argentina progress to the semi-finals (they would eventually win the tournament) while England was eliminated... that goal, the "Hand of God" goal, was scored by Maradona with his hand, a totally illegal move.
In a sense, neither question can ever be fully resolved. Even if we fully sweep the solar system for teapots in a century's time, it's still possible there might have been one now, and that it then crashed into the sun, stopping us from ever finding it. And in the spirit of ambijectivity, the question of Argentina's victory (or whether it was a "proper" victory, or a "fair" victory) depends on which aspect of the definition of victory you choose to emphasise: the referee's call and the official verdict, versus the clear violation of the rules.
Nevertheless, there is a sense in which we feel the first question has a definite answer (which is almost certainly "no"), while the second is purely about definitions.
Why do we feel that the teapot question has a definite answer? Well, we have a model of reality as something that objectively exists, and our investigation of the world backs it up: when confronted by a door, we feel that there is something behind the door, even if we choose not to open it. There are various counterfactuals in which we could have sent a probe to any given area of the orbit, so we feel we could have resolved the question "is there a teapot at this location?" for any location within a wide radius of the Earth.
Basically, the universe has features that cause us to believe that when we observe it (quantum effects aside), we are seeing a pre-existing reality rather than creating a new one (see the old debate between platonists and formalists in mathematics).
Whereas even if we had a God or an AI, we wouldn't expect it to have a definite answer to the question of who won that football match. There is no platonic underlying reality about the winner of the game that we could figure out if we only had enough knowledge. We already know everything that's relevant.
Many attempts at learning human values are framed as "humans have an underlying true reward R, and here is procedure P for determining it".
But in most cases, that formulation is incorrect, because the paper is actually saying "here is a procedure P for determining human values". Actually establishing that humans have true rewards is a much more demanding task: you have to justify that the true R exists, like Russell's teapot, rather than being a question of definition, like Argentina's football victory.
That sounds like a meaningless distinction: what is, in practice, the difference between "a true reward R plus an imperfect estimator P" and "just P"? It's conceptually simpler to talk about the true R, so why not do so?
It's simpler, but much more misleading. If you work under the assumption that there is a true R, then you're less likely to think that P might lead you astray. And if you see imperfections in P, your first instinct is something like "make P more rational/more likely to converge on R", rather than to ask the real question, which is "is P a good definition of human values?"
Even if the moral realists are right, and there is a true R, thinking about it is still misleading. Because there is, as yet, no satisfactory definition of this true R, and it's very hard to make something converge better onto something you haven't defined. Shifting the focus from the unknown (and maybe unknowable, or maybe even non-existent) R, to the actual P, is important.
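To make the distinction concrete, here is a toy sketch (all names, data, and procedures are hypothetical illustrations, not any published algorithm): two equally "reasonable" procedures P, given the same observed human choices, output different rewards. If you assume a true R, one of them must be "wrong"; focusing on P instead, the disagreement is revealed as definitional.

```python
# Toy illustration: the same observed human choices, fed through two
# different value-learning procedures P, yield different "rewards".
# Everything here is hypothetical; it is a sketch of the conceptual
# point, not anyone's actual value-learning algorithm.

from collections import defaultdict

# Observed choices: (chosen option, rejected option), in time order.
# Early on the human picks "cake"; later they switch to "salad".
choices = [
    ("cake", "salad"),
    ("cake", "salad"),
    ("cake", "salad"),
    ("salad", "cake"),
    ("salad", "cake"),
]

def p_frequency(choices):
    """Procedure P1: reward of an option = how often it was chosen."""
    score = defaultdict(float)
    for chosen, _ in choices:
        score[chosen] += 1.0
    return dict(score)

def p_recency(choices, decay=0.5):
    """Procedure P2: like P1, but earlier choices are discounted,
    treating later choices as 'more informed'."""
    score = defaultdict(float)
    n = len(choices)
    for t, (chosen, _) in enumerate(choices):
        score[chosen] += decay ** (n - 1 - t)
    return dict(score)

r1 = p_frequency(choices)
r2 = p_recency(choices)
best1 = max(r1, key=r1.get)  # P1 says the human values "cake" most
best2 = max(r2, key=r2.get)  # P2 says the human values "salad" most
```

Neither procedure is a broken estimator of some hidden R; they are two different definitions of what the choices mean, and the real question is which definition is suitable for the purpose at hand.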
It's because I was focused so strongly on the procedure P, treating R as non-existent, that I was able to find some of the problems with value learning algorithms.
When you think that way, it becomes natural to ponder issues like "if we define victory through the official result, this leaves open the possibility for referee-concealed rule-breaking; is this acceptable for whatever we need that definition for?"
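In the same spirit, here is a minimal sketch of the football case (the data structure and function names are illustrative): two defensible definitions of "winner", applied to the same match facts, give different verdicts, and no further fact about the world settles which definition is "correct".

```python
# Hypothetical sketch: two reasonable definitions of "won the match",
# applied to the facts of the 1986 England-Argentina quarter-final.
# Neither definition is the "true" one; which to use depends on what
# we need the definition for.

match = {
    "official_score": {"Argentina": 2, "England": 1},
    "goals": [
        {"team": "Argentina", "legal": False},  # the "Hand of God" goal
        {"team": "Argentina", "legal": True},
        {"team": "England", "legal": True},
    ],
}

def winner_by_official_result(m):
    """Definition 1: the winner is whoever the official score favours."""
    score = m["official_score"]
    return max(score, key=score.get)

def winner_by_legal_goals(m):
    """Definition 2: count only goals that complied with the rules."""
    score = {}
    for goal in m["goals"]:
        if goal["legal"]:
            score[goal["team"]] = score.get(goal["team"], 0) + 1
    top = max(score.values())
    leaders = [team for team, s in score.items() if s == top]
    return leaders[0] if len(leaders) == 1 else "draw"

official = winner_by_official_result(match)  # "Argentina"
by_rules = winner_by_legal_goals(match)      # "draw"
```

Asking "but who *really* won?" adds nothing here; the disagreement between the two functions *is* the whole content of the question.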
It's not that hard to answer: if there is a teapot on the ISS, then it just depends on whether the ISS is on the night side or the day side of Earth.
I like the topics you're touching on here, and have a few thoughts that might spur you on.
It seems that R in some sense exists only in the ontology. By this I mean that R can only be conceptualized as a thing because we observe changes in our lifeworlds and can have a meta-experience of those changes as being caused by (or as evidence of) some reward R. If we go looking for R in the metaphysical realm, though, it seems unlikely we will find it, because reward only makes sense in terms of some subject/observer experiencing an experience it identifies as having valence (so a negative or positive reward).
In this sense R has an etiology rooted in P, and so it can never be that P does not produce R, because R is defined by P. We can view the confusion over this as seeing an approximation of R, R', and then trying to use it to construct some approximation of P, P', that produces R'. We do this because it's easier to construct R' than P', and because for decidable Ps and P's we can prove that P' produces R'; but phenomenological conscious processes seem likely to be undecidable (cf. integrated information theory), so we cannot really presume to know much about R', let alone R, given the difficulty of computing P' without actually computing P directly.
And in this sense acting as if R does not exist is probably the right choice because neither R' nor even R allows us to construct a P in the general case such that P generates R.