Preferences in subpieces of hierarchical systems



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In a previous post, I looked at hierarchical systems, and at subagents within them that could develop their own preferences. This was prompted by a conversation with Dylan Hadfield-Menell.

This post will ignore subagents, and answer Dylan's original question: can we deduce the goals of a hierarchical system? So, we assume the subpieces of the system are not subagents worthy of ethical consideration. Now, given that the subpieces could be inefficient at "what they're supposed to be doing", how can we assume the system itself has a goal?

Algorithms and symbols, again

The situation is as it was in the initial post, a simplified version of the task in this paper. Here an agent is moving object A to storage:

Within the hierarchy of operations needed to move object A, there is a subalgorithm tasked with gripping object B; the code for that is as follows (this is a slight variant of Algorithm 1 in the earlier post):

Here the assessment criteria just check whether the plan will move the situation closer to the goal.

Let's look at the subalgorithm that is generating the suggested plans, which I labelled SPG (suggested plan generator). It takes the State and Goal as inputs.

It is obviously bad at its job: it suggests turning on the radio first, and turning on the radio is neither necessary nor useful for gripping B. It's not utterly incompetent at its job, though: turning on the radio and then gripping B will at least result in B getting gripped.
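The structure described above can be sketched in code. This is a hypothetical Python illustration, not the author's Algorithm 3: the names `spg`, `criteria`, `grip`, the fact strings, and the choice to represent a state as a set of true facts are all assumptions made for the example.

```python
from typing import FrozenSet, List

State = FrozenSet[str]

# Hypothetical toy model: each action makes one fact true.
EFFECTS = {"turn_on_radio": "radio_on", "grip_B": "B_gripped"}


def apply(state: State, action: str) -> State:
    """Return the state after performing a single action."""
    return state | {EFFECTS[action]}


def goal_reached(state: State, goal: State) -> bool:
    """Loop condition: every fact in the goal holds in the state."""
    return goal <= state


def spg(state: State, goal: State) -> List[List[str]]:
    """Suggested plan generator: inefficient but not incompetent.
    It always proposes turning on the radio before gripping B."""
    return [["turn_on_radio", "grip_B"]]


def criteria(plan: List[str], state: State, goal: State) -> bool:
    """Assessment criteria: does the plan leave fewer goal facts unmet?"""
    end = state
    for action in plan:
        end = apply(end, action)
    return len(goal - end) < len(goal - state)


def grip(state: State, goal: State) -> State:
    """Run the first accepted SPG plan until the goal holds."""
    while not goal_reached(state, goal):
        accepted = [p for p in spg(state, goal) if criteria(p, state, goal)]
        if not accepted:
            raise RuntimeError("no suggested plan passes the criteria")
        for action in accepted[0]:
            state = apply(state, action)
    return state


final = grip(frozenset(), frozenset({"B_gripped"}))
print(sorted(final))  # ['B_gripped', 'radio_on']
```

The radio plan passes the criteria because it does reduce the distance to the goal; nothing at this level penalises the extra step.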

The job of an algorithm is to do its job

I've been talking about SPG's 'job'. But how do I know what this is (especially given that the term is probably undefinable in theory)?

Well, there are a number of indications. First of all, akin to "figuring out what Alice wants", we have the evocative names of the various variables: State, Goal, Suggested plans. If these labels are grounded, then they clearly indicate what the intended task of SPG is.

Note also that there is a loop where the plans of SPG are run until the goal condition is true. Even if the symbols are ungrounded, and even if SPG were a black box, this indicates that fulfilling the goal is the purpose of Algorithm 3, and that SPG contributes to this. In this loop, we also have the assessment criteria, checking whether the plan will move closer to the goal or not; given some weak assumptions, this is also an indication that reaching the goal is an aim of Algorithm 3, and that coming up with plans that reach it is the job of SPG.
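The inference from structure can be illustrated with a hypothetical check that treats the plan generator purely as a black box and uses only what the calling loop reveals: the loop stops once the goal holds, so the job attributed to the generator is to propose plans after which the goal holds. The function names and the toy action model are assumptions for illustration.

```python
from typing import Callable, FrozenSet, List

State = FrozenSet[str]
Generator = Callable[[State, State], List[List[str]]]

# Hypothetical toy model: each action makes one fact true.
EFFECTS = {"turn_on_radio": "radio_on", "grip_B": "B_gripped"}


def run(state: State, plan: List[str]) -> State:
    """Execute a plan by adding each action's effect to the state."""
    for action in plan:
        state = state | {EFFECTS[action]}
    return state


def does_its_job(generator: Generator, state: State, goal: State) -> bool:
    """Black-box test of the job implied by the caller's loop:
    does the generator propose some plan after which the goal holds?"""
    return any(goal <= run(state, plan) for plan in generator(state, goal))


def radio_then_grip(state: State, goal: State) -> List[List[str]]:
    return [["turn_on_radio", "grip_B"]]


def radio_only(state: State, goal: State) -> List[List[str]]:
    return [["turn_on_radio"]]


start, goal = frozenset(), frozenset({"B_gripped"})
print(does_its_job(radio_then_grip, start, goal))  # True
print(does_its_job(radio_only, start, goal))       # False
```

The inefficient generator passes, which is the sense in which SPG is not utterly incompetent; the test is silent on whether it is efficient.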

Bad job, unoptimiser

The previous criteria can establish what the job of SPG is, but they don't allow us to say that it's bad at its job. It seems to be bad because turning on the radio is an unnecessary step.

But imagine (situation 1) that we're now looking higher in the algorithm hierarchy, and there is a cost function that counts how many steps are used to achieve the goals. Then we can say that SPG is doing a bad job; but the full criteria of what the system wants have not been passed down to the level of Algorithm 3 and SPG.

Conversely, imagine (situation 2) that we're looking higher in the algorithm hierarchy, and there is no cost function, but there is a desire for the radio to be on. Then SPG is doing a good job, even though the full criteria have not been passed down.
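The two situations can be contrasted with a small sketch in which the same two plans are scored by two hypothetical higher-level objectives that SPG never sees. The point values (100 for gripping B, 1 per step, 10 for the radio) are arbitrary assumptions chosen only to make the comparison concrete.

```python
from typing import FrozenSet, List

State = FrozenSet[str]

# Hypothetical toy model: each action makes one fact true.
EFFECTS = {"turn_on_radio": "radio_on", "grip_B": "B_gripped"}


def run(plan: List[str]) -> State:
    """Final state reached from an empty start by executing the plan."""
    return frozenset(EFFECTS[action] for action in plan)


def situation_1(plan: List[str]) -> int:
    """Higher-level cost function: B must be gripped, and each step costs 1."""
    return (100 if "B_gripped" in run(plan) else 0) - len(plan)


def situation_2(plan: List[str]) -> int:
    """No cost function, but a higher-level desire for the radio to be on."""
    final = run(plan)
    return (100 if "B_gripped" in final else 0) + (10 if "radio_on" in final else 0)


PLANS = {"direct": ["grip_B"], "spg": ["turn_on_radio", "grip_B"]}

for name, score in [("situation 1", situation_1), ("situation 2", situation_2)]:
    best = max(PLANS, key=lambda k: score(PLANS[k]))
    print(name, "prefers", best)
# situation 1 prefers direct
# situation 2 prefers spg
```

Whether SPG's output counts as good or bad is decided entirely by criteria that live above it in the hierarchy.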

Especially if the system is capable of self-modification, we shouldn't expect all the job criteria to be located close to the subsystem itself. It's possible that a cost-assessor (in situation 1) has tried to analyse the whole system, and deemed SPG's inefficiency to be minor. Or, conversely, that a radio-turn-on assessor (in situation 2) has analysed the whole system, noticed SPG's behaviour, and let it be (or even added it in), because this helps achieve the system's overall goal.

The general case

So in general, the role of a subroutine in a hierarchical system is to achieve the task that whatever called that subroutine wants it to achieve. The nature of this task can be inferred by looking at grounded symbols, and/or at the structure of the algorithm that calls the subroutine, including what it does with its output. Some goals may be implicit, and handed down from higher in the algorithm's hierarchy.

Note that if the subroutine takes actions (either in the real world or by modifying global variables within the algorithm), these can also be used to define its task, especially if the global variables are grounded or the meaning of the actions is clear.

Better implementation

Once the role of all subroutines is established, the goal of the whole system can be estimated. Again, grounded variables are useful, as are the purposes that can be inferred by the fact that the system calls a certain subroutine (with a certain role) at a certain point.

Then, once this goal is established, we can talk about how the system might improve itself, by bringing the outcome more in line with the goal. But we can't talk about improvements in an abstract sense without establishing this goal first. Even seemingly useless parts of the system may be there deliberately.
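As a sketch of why improvement is goal-relative, here is a hypothetical greedy pruner that drops a step only if doing so strictly raises a given score. The scores mirror situations 1 and 2 with arbitrary assumed point values; the pruning rule is an illustrative choice, not a method from the source.

```python
from typing import Callable, List

# Hypothetical toy model: each action makes one fact true.
EFFECTS = {"turn_on_radio": "radio_on", "grip_B": "B_gripped"}


def outcome(plan: List[str]) -> set:
    """Facts that hold after executing the plan from an empty start."""
    return {EFFECTS[action] for action in plan}


def cost_goal(plan: List[str]) -> int:
    """Situation 1 style goal: B gripped, minus one point per step."""
    return (100 if "B_gripped" in outcome(plan) else 0) - len(plan)


def radio_goal(plan: List[str]) -> int:
    """Situation 2 style goal: B gripped, plus a bonus if the radio is on."""
    facts = outcome(plan)
    return (100 if "B_gripped" in facts else 0) + (10 if "radio_on" in facts else 0)


def improve(plan: List[str], score: Callable[[List[str]], int]) -> List[str]:
    """Greedily drop any single step whose removal strictly raises the score.
    Whether a step counts as 'useless' depends entirely on the score given."""
    changed = True
    while changed:
        changed = False
        for i in range(len(plan)):
            shorter = plan[:i] + plan[i + 1:]
            if score(shorter) > score(plan):
                plan, changed = shorter, True
                break
    return plan


spg_plan = ["turn_on_radio", "grip_B"]
print(improve(spg_plan, cost_goal))   # ['grip_B']
print(improve(spg_plan, radio_goal))  # ['turn_on_radio', 'grip_B']
```

The same "useless" radio step is pruned under one goal and kept under the other.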

More structure and information

The more structure a system has, the easier, in general, it is to assess its goal. If there are top-level subroutines that go around assessing the lower levels, then the assessment criteria are major components of the system's goal.

However, it might be that the system doesn't give us enough information to figure out its goal, or that multiple goals are compatible with it (this is most strongly the case if the system's variables are poorly grounded). This is to be expected; we can't infer the goals of a general agent. In this case, we are allowing some assumptions about grounded symbols, internal models, and hierarchical structure to cut down on the space of possible goals. This might not always be enough.
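A toy illustration of this underdetermination, using hypothetical candidate goals: behaviour alone leaves several goals consistent with what the system did, and one assumption about grounded symbols (here, that a subroutine named for gripping B must require B to be gripped) prunes some of them without picking out a unique goal.

```python
from typing import Callable, Dict, FrozenSet

State = FrozenSet[str]

# Hypothetical candidate goals, each a predicate on the final state.
CANDIDATES: Dict[str, Callable[[State], bool]] = {
    "grip B": lambda s: "B_gripped" in s,
    "radio on": lambda s: "radio_on" in s,
    "grip B and radio on": lambda s: {"B_gripped", "radio_on"} <= s,
    "anything": lambda s: True,
}

# Observed behaviour: the system stopped once the radio was on and B was gripped.
observed_final = frozenset({"radio_on", "B_gripped"})
consistent = [name for name, g in CANDIDATES.items() if g(observed_final)]
print(consistent)  # ['grip B', 'radio on', 'grip B and radio on', 'anything']

# Assumption: the name "grip B" is grounded, so the goal must fail whenever
# B is not gripped (checked on the states of this toy model that lack it).
WITHOUT_B = [frozenset(), frozenset({"radio_on"})]
grounded = [n for n in consistent if not any(CANDIDATES[n](s) for s in WITHOUT_B)]
print(grounded)  # ['grip B', 'grip B and radio on']
```

Two goals survive: the grounding assumption narrows the space, but distinguishing between them would need further structure, such as a higher-level assessor.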

