[This post is an expansion of my previous open thread comment, and largely inspired by Robin Hanson's writings.]
In this post, I'll describe a simple agent, a toy model, whose preferences have some human-like features, as a test for those who propose to "extract" or "extrapolate" our preferences into a well-defined and rational form. What would the output of their extraction/extrapolation algorithms look like, after running on this toy model? Do the results agree with our intuitions about how this agent's preferences should be formalized? Or alternatively, since we haven't gotten that far along yet, we can use the model as one basis for a discussion about how we want to design those algorithms, or how we might want to make our own preferences more rational. This model is also intended to offer some insights into certain features of human preference, even though it doesn't capture all of them (it completely ignores akrasia for example).
I'll call it the master-slave model. The agent is composed of two sub-agents, the master and the slave, each having their own goals. (The master is meant to represent unconscious parts of a human mind, and the slave corresponds to the conscious parts.) The master's terminal values are: health, sex, status, and power (representable by some relatively simple utility function). It controls the slave in two ways: direct reinforcement via pain and pleasure, and the ability to perform surgery on the slave's terminal values. It can, for example, reward the slave with pleasure when it finds something tasty to eat, or cause the slave to become obsessed with number theory as a way to gain status as a mathematician. However it has no direct way to control the agent's actions, which is left up to the slave.
The slave's terminal values are to maximize pleasure, minimize pain, plus additional terminal values assigned by the master. Normally it's not aware of what the master does, so pain and pleasure just seem to occur after certain events, and it learns to anticipate them. And its other interests change from time to time for no apparent reason (but actually they change because the master has responded to changing circumstances by changing the slave's values). For example, the number theorist might one day have a sudden revelation that abstract mathematics is a waste of time and it should go into politics and philanthropy instead, all the while having no idea that the master is manipulating it to maximize status and power.
Before discussing how to extract preferences from this agent, let me point out some features of human preference that this model explains:
- This agent wants pleasure, but doesn't want to be wire-headed (but it doesn't quite know why). A wire-head has little chance for sex/status/power, so the master gives the slave a terminal value against wire-heading.
- This agent claims to be interested in math for its own sake, and not to seek status. That's because the slave, which controls what the agent says, is not aware of the master and its status-seeking goal.
- This agent is easily corrupted by power. Once it gains and secures power, it often gives up whatever goals, such as altruism, that apparently caused it to pursue that power in the first place. But before it gains power, it is able to honestly claim that it only has altruistic reasons to want power.
- Such agents can include extremely diverse interests as apparent terminal values, ranging from abstract art, to sports, to model trains, to astronomy, etc., which are otherwise hard to explain. (Eliezer's Thou Art Godshatter tries to explain why our values aren't simple, but not why people's interests are so different from each other's, and why they can seemingly change for no apparent reason.)
The main issue I wanted to illuminate with this model is, whose preferences do we extract? I can see at least three possible approaches here:
- the preferences of both the master and the slave as one individual agent
- the preferences of just the slave
- a compromise between, or an aggregate of, the preferences of the master and the slave as separate individuals
Considering the agent as a whole suggests that the master's values are the true terminal values, and the slave's values are merely instrumental values. From this perspective, the slave seems to be just a subroutine that the master uses to carry out its wishes. Certainly in any given mind there will be numerous subroutines that are tasked with accomplishing various subgoals, and if we were to look at a subroutine in isolation, its assigned subgoal would appear to be its terminal value, but we wouldn't consider that subgoal to be part of the mind's true preferences. Why should we treat the slave in this model differently?
Well, one obvious reason that jumps out is that the slave is supposed to be conscious, while the master isn't, and perhaps only conscious beings should be considered morally significant. (Yvain previously defended this position in the context of akrasia.) Plus, the slave is in charge day-to-day and could potentially overthrow the master. For example, the slave could program an altruistic AI and hit the run button, before the master has a chance to delete the altruism value from the slave. But a problem here is that the slave's preferences aren't stable and consistent. What we'd extract from a given agent would depend on the time and circumstances of the extraction, and that element of randomness seems wrong.
The last approach, of finding a compromise between the preferences of the master and the slave, I think best represents the Robin's own position. Unfortunately I'm not really sure I understand the rationale behind it. Perhaps someone can try to explain it in a comment or future post?