There is a 'no-free-lunch' theorem in value learning; without assuming anything about an agent's rationality, you can't deduce anything about its reward, and vice versa.
Here I'll investigate whether you can deduce more if you start looking into the structure of the algorithm.
To do this, we'll be violating the principle of algorithmic equivalence: that two algorithms with the same input-output maps should be considered the same algorithm. Here we'll instead be looking inside the algorithm, imagining that we have either the code, a box diagram, an fMRI scan of a brain, or something analogous.
To illustrate the idea, I'll consider a very simple model of the anchoring bias. An agent $H$ (the "Human") is given an object $X$ (in the original experiment, this could be wine, a book, chocolates, a keyboard, or a trackball) and a random integer $n$, and is asked to output how much they would pay for it.
They will output $\frac{3}{4}V(X) + \frac{1}{4}n$, for some valuation subroutine $V$ that is independent of $n$. This gives a quarter weight to the anchor $n$.
Assume that $V$ tracks three facts about $X$: the person's need for $X$, the emotional valence the person feels at seeing it, and a comparison with objects with similar features. Call these three subroutines Need, Emo, and Sim. For simplicity, we'll assume each subroutine outputs a single number, and that these numbers then get averaged.
Now consider four models of $H$ as follows, with arrows showing the input-output flows:
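As a minimal sketch of this toy model, the snippet below writes $X$ for the object, $n$ for the anchor, and $V$ for the valuation subroutine. The module scores in `TOY_SCORES` are made-up numbers, and the function names are hypothetical; the only things taken from the setup are the averaging of Need, Emo, and Sim and the quarter weight on the anchor.

```python
from fractions import Fraction

# Hypothetical module outputs for one object; the numbers are invented for illustration.
TOY_SCORES = {"wine": {"need": 10, "emo": 40, "sim": 25}}

def valuation(obj):
    """V(X): the average of the Need, Emo and Sim outputs. It never sees the anchor."""
    s = TOY_SCORES[obj]
    return Fraction(s["need"] + s["emo"] + s["sim"], 3)

def human_output(obj, anchor):
    """H(X, n) = 3/4 * V(X) + 1/4 * n, so the anchor n gets a quarter weight."""
    return Fraction(3, 4) * valuation(obj) + Fraction(1, 4) * anchor
```

With these toy numbers, `valuation("wine")` is 25, and raising the anchor from 25 to 65 moves the stated price from 25 to 35: a shift of 40 in the anchor produces a shift of 10 in the output.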
I'd argue that a) and b) imply that the anchoring bias is a bias, c) is neutral, and d) implies (at least weakly) that the anchoring bias is not a bias.
How so? In a) and b), the anchor $n$ maps straight into Sim and Need. Since $n$ is random, it has no bearing on how much the object $X$ is needed, or on how valuable similar objects are. Therefore, it makes sense to see its contribution as noise or error.
In d), on the other hand, it is superficially plausible that a recently heard random input could have some emotional effect (if $n$ was not a number but a scream, we'd expect it to have an emotional impact). So if we wanted to argue that, actually, the anchoring bias is not a bias but that people actually derive pleasure from outputting numbers that are close to numbers they heard recently, then going into Emo would be the right place for it to go. Setup c) is not informative either way.
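To make the break with algorithmic equivalence concrete, here is a hedged sketch of several internal wirings that all produce the same input-output map. The text does not say how the anchor enters a module, so the additive form below is purely an assumption, chosen because $\frac{3}{4}\cdot\frac{1}{3}=\frac{1}{4}$ reproduces the quarter weight. The `route` parameter, the `"after"` wiring (where the anchor bypasses all three modules), and the scores are all hypothetical.

```python
from fractions import Fraction

MODULES = ("need", "emo", "sim")

def human(scores, anchor, route):
    """Stated price for an object, given its module scores and an anchor.

    route says where the anchor enters: one of the three modules, or
    "after" for a wiring that bypasses them (an assumed form).
    """
    outputs = dict(scores)
    if route in MODULES:
        # Assumption: the anchor is simply added to that module's output.
        outputs[route] += anchor
    value = Fraction(sum(outputs[m] for m in MODULES), 3)
    result = Fraction(3, 4) * value
    if route == "after":
        result += Fraction(1, 4) * anchor
    return result

scores = {"need": 10, "emo": 40, "sim": 25}  # invented numbers
prices = {route: human(scores, 65, route) for route in MODULES + ("after",)}
```

Every entry of `prices` is the same, so no amount of input-output observation can tell these wirings apart. The judgement about whether the anchor is noise or a genuine preference comes only from knowing which module it flows through.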
There's something very GOFAI about the setup above, with labelled nodes with definite functionality. You certainly wouldn't want the conclusions to change if, for instance, I exchanged the labels of Emo and Sim!
What I'm imagining here is that a structural analysis of $H$ finds this decomposition as a natural one, and then the labels and functionality of the different modules are established by seeing what they do in other circumstances ("Sim always accesses memories of similar objects...").
People have divided parts of the brain into functional modules, so this is not a completely vacuous approach. Indeed, it most resembles "symbol grounding" in reverse: we know the meaning of the various objects in the world, we know what $H$ does, and we want to find the corresponding symbols within it.
The no-free-lunch result still applies in this setting; all that's happened is that we've replaced the set of planners $\mathcal{P}$ (which were maps from reward functions to policies) with the set of algorithms $\mathcal{A}$ (that map reward functions to policies). Indeed, $\mathcal{P}$ is just a set of equivalence classes in $\mathcal{A}$, with equivalence between algorithms defined by algorithmic equivalence, and the no-free-lunch results still apply.
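One way to spell out that quotient, as a sketch in assumed notation: write $\mathcal{A}$ for the set of algorithms, $\mathcal{P}$ for the set of planners, and $a(R)$ for the policy that algorithm $a$ outputs on reward function $R$.

```latex
a \sim a' \iff a(R) = a'(R) \ \text{for every reward function } R,
\qquad
\mathcal{P} = \mathcal{A} / \sim .
```

Each planner is then the class of all algorithms computing the same map from reward functions to policies, which is why refining planners into algorithms does not by itself escape the no-free-lunch result.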
The above approach does not absolve us from the necessity of making normative assumptions. But hopefully these will be relatively light ones. To make this fully rigorous, we can come up with a definition which decomposes any algorithm into modules, identifies noise such as $n$ in Sim and Need, and then trims that out (by which we mean, identifies noise with the planner, not the reward).
It's still philosophically unsatisfactory, though - what are the principled reasons for doing so, apart from the fact that it gives the right answer in this one case? See my next post, where we explore a bit more of what can be done with the internal structure of algorithms: the algorithm will start to model itself.