This is a very preliminary two-part post sketching out the direction I'm taking my research now (second post here). I'm expecting and hoping that everything in here will get superseded quite quickly. This has obvious connections to classical machine intelligence research areas (such as interpretability). I'd be very grateful for any links to papers or people related to the ideas of this post.
The theory: model fragments
I've presented the theoretical argument for why we cannot deduce the preferences of an irrational agent, and a practical example of that difficulty. I'll be building on that example to illustrate some algorithms that produce the same actions, but where we nonetheless can feel confident deducing different preferences.
I've mentioned a few ideas for "normative assumptions": the assumptions that we, or an AI, could use to distinguish between different possible preferences even if they result in the same behaviour. I've mentioned things such as regret, humans stating their values with more or less truthfulness, human narratives, how we categorise our own emotions (those last three are in this post), or the structure of the human algorithm.
Those all seem rather ad hoc, but they are all trying to do the same thing: home in on human judgement about rationality and preferences. But what is this judgement? This judgement is defined to be the internal models that humans use to assess situations. These models, about ourselves and about other humans, often agree with each other from one human to the next (for instance, most people agree that you're less rational when you're drunk).
Calling them models might be a bit of an exaggeration, though. We often only get a fragmentary or momentary piece of a model - "he's being silly", "she's angry", "you won't get a promotion with that attitude". These are called to mind, thought upon, and then swiftly dismissed.
So what we want to access is the piece of the model that the human used to judge the situation. Now, these model fragments can often be contradictory, but we can deal with that problem later.
Then all the normative assumptions noted above are just ways of defining these model fragments, or accessing them (via emotion, truthful description, or regret). Regret is a particularly useful emotion, as it indicates a divergence between what was expected in the model, and what actually happened (similarly to temporal difference learning).
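To make the analogy with temporal difference learning concrete, here is a minimal sketch. It is only an illustration of the comparison, not a claim about how human regret actually works: the function name `td_error`, the discount `gamma` and the numbers are all assumptions of mine.

```python
def td_error(reward, value_next, value_current, gamma=0.9):
    """Temporal difference error: what happened minus what the model expected.

    value_current is the model's prior expectation for the situation;
    reward + gamma * value_next is the revised estimate after seeing the outcome.
    """
    return reward + gamma * value_next - value_current


# Hypothetical case: the agent's model expected a value of 1.0,
# but the outcome gave no reward and led nowhere valuable.
delta = td_error(reward=0.0, value_next=0.0, value_current=1.0)
print(delta)  # negative: the outcome was worse than the model expected
```

On this reading, a large negative error plays the role of regret or disappointment, and a large error of either sign plays the role of surprise. Observing the signal tells us something about the expectation that was in the model, without having direct access to the model.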
So I'll broadly categorise methods of learning human model fragments into three categories:
- Direct access to the internal model.
- Regret and surprise as showing mismatches between model expectation and outcomes.
- Privileged output (eg certain human statements in certain circumstances are taken to be true-ish statements about the internal model).
The first method violates algorithmic equivalence and extensionality: two algorithms with identical outputs can nevertheless use different models. The latter two methods do respect algorithmic equivalence, once we have defined what behaviours correspond to regret/surprise, or what situations humans can be expected to respond truthfully to. In the process of defining those behaviours and situations, however, we are likely to use introspection and our own models: a sober, relaxed, rational human confiding confidentially in an impersonal computer is more likely to be truthful than a precariously employed worker on stage in front of their whole office.
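A toy sketch of why direct access breaks extensionality: two agents whose outputs are identical but whose internal models differ. The `Agent` class, the two planners and the reward numbers are hypothetical, chosen only to illustrate the point.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Agent:
    reward: Dict[str, float]                      # internal model of what is good
    planner: Callable[[Dict[str, float]], str]    # how the reward is turned into action

    def act(self) -> str:
        return self.planner(self.reward)


def maximiser(reward: Dict[str, float]) -> str:
    """A rational planner: picks the action with the highest reward."""
    return max(reward, key=reward.get)


def minimiser(reward: Dict[str, float]) -> str:
    """An anti-rational planner: picks the action with the lowest reward."""
    return min(reward, key=reward.get)


likes = {"chocolate": 1.0, "broccoli": 0.0}
dislikes = {action: -r for action, r in likes.items()}

rational = Agent(reward=likes, planner=maximiser)
anti_rational = Agent(reward=dislikes, planner=minimiser)

# Identical behaviour...
print(rational.act(), anti_rational.act())
# ...but different internal models, visible only by looking inside.
print(rational.reward == anti_rational.reward)
```

Any method that only sees `act()` must treat these two agents the same; a method that reads `reward` directly does not. That is the sense in which the first category is not extensional while the other two are.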
What model fragments look like
The second post will provide examples of the approach, but here I'll just list the kinds of things that we can expect as model fragments:
- Direct statements about rewards ("I want chocolate now").
- Direct statements about rationality ("I'm irrational around them").
- An action is deemed better than another ("you should start a paper trail, rather than just rely on oral instructions").
- An action is seen as good (or bad), compared with some implicit set of standard actions ("compliment your lover often").
- Similarly to actions, observations/outcomes can be treated as above ("the second prize is actually better", "it was unlucky you broke your foot").
- An outcome is seen as surprising ("that was the greatest stock market crash in history"), or the action of another agent is seen as surprising ("I didn't expect them to move to France").
A human can think these things about themselves or about other agents; the most complicated variants are assessing the actions of one agent from the perspective of another agent ("if she signed the check, he'd be in a good position").
Finally, there are meta, meta-meta, and higher-order versions of these, as we model other agents modelling us. All of these give a partial indication of our models of rationality or reward, about ourselves and about other humans.
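One tentative way to record fragments like those listed above is a small data structure. Everything here is an assumption for illustration: the names `FragmentKind` and `ModelFragment`, the choice of fields, and how each example is labelled are mine, and a real scheme would probably look quite different.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class FragmentKind(Enum):
    REWARD_STATEMENT = auto()        # "I want chocolate now"
    RATIONALITY_STATEMENT = auto()   # "I'm irrational around them"
    ACTION_COMPARISON = auto()       # one action deemed better than another
    ACTION_VS_STANDARD = auto()      # action good or bad against implicit standard actions
    OUTCOME_JUDGEMENT = auto()       # "the second prize is actually better"
    SURPRISE = auto()                # outcome or action was unexpected


@dataclass(frozen=True)
class ModelFragment:
    kind: FragmentKind
    text: str
    holder: str                        # who has this fragment in mind
    about: str                         # the agent being modelled
    perspective: Optional[str] = None  # agent from whose viewpoint it is assessed
    meta_level: int = 0                # 0 = plain, 1 = modelling someone modelling us, ...


fragments: List[ModelFragment] = [
    ModelFragment(FragmentKind.REWARD_STATEMENT, "I want chocolate now",
                  holder="me", about="me"),
    ModelFragment(FragmentKind.SURPRISE, "I didn't expect them to move to France",
                  holder="me", about="them"),
    ModelFragment(FragmentKind.ACTION_VS_STANDARD,
                  "if she signed the check, he'd be in a good position",
                  holder="me", about="she", perspective="he"),
]


def about_others(frags: List[ModelFragment]) -> List[ModelFragment]:
    """Fragments in which the holder is modelling someone other than themselves."""
    return [f for f in frags if f.holder != f.about]


print(len(about_others(fragments)))
```

The point of the sketch is only that each fragment is partial: it fixes a kind, a holder and a subject, and says nothing about the rest of the model. Contradictions between fragments are allowed here, in keeping with leaving that problem for later.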