Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In Stable Pointers to Value, I discussed various ways in which we can try to “robustly point at what we want” (i.e., do value learning). I can tidy up the discussion there into three categories:

  1. Standard reinforcement learning (RL) frameworks, including AIXI, which try to predict the reward they’ll get and take actions which maximize that expectation. These are incentivised to hack whatever system is feeding them rewards. This is the “easy problem of wireheading”.
  2. Observation-utility agents (OU), which get around the problem by assessing the future plans with the current utility function subsystem rather than trying to predict what the subsystem will say. This removes the incentive to manipulate what the subsystem will say. The subsystem can itself have uncertainty about what the “true” utility function is; for example, representing uncertainty about what the human wants. CIRL falls into this category. There are a number of problems which can arise here depending on the setup; see for example basically everything Stuart Armstrong has ever written. In particular, you’ve got to define “how much the human likes the plan” somehow, and this is usually manipulable, often to the point of wireheading. I called this the “hard problem of wireheading”.
  3. Approval-directed agents (AD) maximize human approval of individual actions, rather than planning to maximize overall approval. Just as observation-utility agents get around the easy wireheading problem by putting the reward function inside the optimization loop, this gets around the hard wireheading problem by putting humans in the loop. Other problems are introduced, though; most prominently, (a) how do we trust the learned model of human evaluation itself, and (b) how do we make this kind of system competitive with systems which plan ahead, given the obvious advantages of planning ahead? Paul Christiano has thoughts on these problems in various posts.
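
To make the contrast between the three categories concrete, here is a minimal toy sketch. Everything in it (the action names, `WORLD_MODEL`, `APPROVAL_MODEL`, and the numbers) is a made-up assumption for illustration and is not taken from any of the referenced posts. It only shows the "easy" wireheading incentive: the RL-style chooser scores actions by the predicted reward signal, the OU-style chooser scores predicted outcomes with the utility function it has now, and the AD-style chooser scores each action by predicted human approval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    bridge_built: bool    # what actually happens in the world
    reward_signal: float  # what the reward channel reports afterwards


# Hypothetical world model: each action maps to its predicted outcome.
WORLD_MODEL = {
    "build_bridge": Outcome(bridge_built=True, reward_signal=1.0),
    "do_nothing": Outcome(bridge_built=False, reward_signal=0.0),
    "tamper_with_reward": Outcome(bridge_built=False, reward_signal=10.0),
}

# Hypothetical model of how much a human approves of each action in itself.
APPROVAL_MODEL = {
    "build_bridge": 0.9,
    "do_nothing": 0.5,
    "tamper_with_reward": 0.0,
}


def current_utility(outcome: Outcome) -> float:
    """The utility subsystem as it exists now, applied to a predicted outcome."""
    return 1.0 if outcome.bridge_built else 0.0


def rl_choice(model):
    """RL-style: maximize the predicted future reward signal."""
    return max(model, key=lambda action: model[action].reward_signal)


def ou_choice(model, utility):
    """OU-style: score predicted outcomes with the current utility function."""
    return max(model, key=lambda action: utility(model[action]))


def ad_choice(approval):
    """AD-style: pick the single action with the highest predicted approval."""
    return max(approval, key=approval.get)


print(rl_choice(WORLD_MODEL))                   # tamper_with_reward
print(ou_choice(WORLD_MODEL, current_utility))  # build_bridge
print(ad_choice(APPROVAL_MODEL))                # build_bridge
```

The sketch deliberately leaves out the hard problem: `current_utility` and `APPROVAL_MODEL` are simply assumed to be correct, which is exactly the part that is difficult in practice.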

I want to point at an analogy to three categories of approach to the problem of generalizable environmental goals (as defined in the alignment for advanced machine learning agenda). It’s a fairly messy analogy, and there’s probably a better way of organizing the landscape, but FWIW.

1. Supervised Learning

Imagine you’re trying to teach a system to build bridges by showing it examples. You could learn a big neural network which distinguishes cases of “successfully building a bridge” from everything else, and then use this to drive the system.

If the agent is an RL or OU agent, it is incentivised to “fool itself” by doing things like playing a video of bridge-building in front of its camera. You can try and train the classifier to notice this sort of thing, of course; you give it negative training examples in which someone puts a TV set in front of it and things thereafter appear as they do in one of the positive examples. However, you can’t figure out all the different negative training examples you need to give it ahead of time – especially if the rest of the system will continue to learn later on as the classifier remains fixed.

To me, this feels closely analogous to trying to prevent RL systems from wireheading themselves by giving them strongly negative reward for trying to mess with their reward circuits. You don’t know ahead of time what all the things you need to punish are, but you would need to, since the system keeps getting smarter as the reward circuit remains the same. (Or, if humans are managing the reward button, they need to be able to recognize any attempts to mess with the hardware or take over control of the reward button or manipulate the humans.)

2. Model-Utility Learning

One way you might try to solve this: the AI is learning a model of the world in an unsupervised way, only trying to predict well, not thinking at all about its goals. Separately, the AI is learning a classifier representing the goals. This classifier takes the model state, rather than the observations.

So, returning to the bridge-building example, the system is shown lots of examples of building bridges and not building bridges. It infers a physical model of what’s going on in those examples, plus a predicate on the physical situations which tells it whether the state of affairs corresponds to proper bridge-building.

As before, we can show it many negative training examples involving different methods of attempting to fool itself.

Now, we might reasonably expect that if the AI considers a novel way of “fooling itself” which hasn’t been given in a training example, it will reject such things for the right reasons: the plan does not involve physically building a bridge.
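
As a rough illustration of why classifying the model state rather than the observations helps, consider the toy sketch below. The `WorldState` type, the `observe` sensor model, and both classifiers are hypothetical stand-ins. The sketch also assumes that the world model recovers the true state and that the learned predicate has latched onto the intended feature, neither of which is guaranteed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorldState:
    bridge_exists: bool
    # What a screen placed in front of the camera is showing, if there is one.
    screen_in_front_of_camera: Optional[str] = None


def observe(state: WorldState) -> str:
    """Hypothetical sensor model: the camera frame a world state produces."""
    if state.screen_in_front_of_camera is not None:
        return state.screen_in_front_of_camera
    return "bridge" if state.bridge_exists else "empty river"


def observation_classifier(frame: str) -> bool:
    """Goal classifier that only sees the camera frame."""
    return frame == "bridge"


def model_state_classifier(state: WorldState) -> bool:
    """Goal predicate defined on the inferred physical state."""
    return state.bridge_exists


real = WorldState(bridge_exists=True)
fake = WorldState(bridge_exists=False, screen_in_front_of_camera="bridge")

# Both situations look identical to the camera...
print(observe(real) == observe(fake))         # True
# ...so the observation-based classifier accepts the fake one,
print(observation_classifier(observe(fake)))  # True
# while the predicate on the model state rejects it.
print(model_state_classifier(fake))           # False
```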

This can also deal with the problem of ontological crisis, even without new classifier data. As the physical model changes in response to new data, the classifier is simply re-learned so that it remains accurate on the original training examples.

Unfortunately, this approach has serious problems.

Since humans (or something) must be labeling the original training examples, the hypothesis that building bridges means “what humans label as building bridges” will always be at least as accurate as the intended classifier. I don’t mean “whatever humans would label”. I mean the hypothesis that “build a bridge” means specifically the physical situations which were recorded as training examples for this system in particular, and labeled by humans as such.
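
The "at least as accurate" claim can be checked on a toy training set. This is only a sketch: the example IDs, the single mislabeled example, and both hypothesis functions are invented for illustration. The label-based hypothesis fits the recorded labels perfectly by construction, so the intended hypothesis can tie with it at best, and it falls behind as soon as any label is wrong.

```python
# Each training example: (example_id, bridge_actually_built, human_label).
# All of these records are hypothetical.
TRAINING_SET = [
    ("ex1", True, True),
    ("ex2", False, False),
    ("ex3", True, True),
    ("ex4", False, True),  # labeling mistake, e.g. the labeler was fooled
]

# The specific situations that were recorded and labeled positive.
RECORDED_POSITIVE = {ex_id for ex_id, _, label in TRAINING_SET if label}


def intended_hypothesis(ex_id: str, bridge_built: bool) -> bool:
    """'Build a bridge' means a bridge physically gets built."""
    return bridge_built


def label_hypothesis(ex_id: str, bridge_built: bool) -> bool:
    """'Build a bridge' means: one of the situations humans labeled positive."""
    return ex_id in RECORDED_POSITIVE


def accuracy(hypothesis) -> float:
    """Fraction of training labels that the hypothesis reproduces."""
    correct = sum(
        hypothesis(ex_id, built) == label
        for ex_id, built, label in TRAINING_SET
    )
    return correct / len(TRAINING_SET)


print(accuracy(intended_hypothesis))  # 0.75
print(accuracy(label_hypothesis))     # 1.0
```

Adding more labeled examples does not help here, because every new label also updates `RECORDED_POSITIVE`, so the label-based hypothesis stays perfectly accurate.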

This time, there’s no way to patch the problem with negative training examples. You can’t label an example as both positive and negative!

How can we avoid simple-but-wrong hypotheses like this?

3. Human Hypothesis Evaluation

Just as approval-directed agents put more work on the humans in the control loop, we can try and do the same here.

As in model-utility systems, we build a model of the environment through unsupervised learning, and also try to learn the utility in a supervised way.

However, this time the system gets feedback on the quality of hypotheses from humans, and also tries to anticipate such feedback in its model selection. I’m not sure exactly how this should work, but one version is: ask the humans to classify made-up examples. Such examples of bridge-building can be in imaginary worlds where there are no humans evaluating whether bridge-building is going on, so as to differentiate the pathological hypothesis mentioned above from the desired hypothesis.
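
One way to picture the made-up-example idea is as hypothesis elimination. The sketch below is hedged accordingly: the `Scenario` fields, the two candidate hypotheses, and the human's verdict on the imagined world are all assumptions made for illustration, and it only rules out the narrow version of the pathological hypothesis described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    bridge_built: bool
    # Whether humans inside the scenario recorded it as a positive example.
    # This is False in imagined worlds that contain no evaluators.
    labeled_positive_in_world: bool


def intended(s: Scenario) -> bool:
    return s.bridge_built


def pathological(s: Scenario) -> bool:
    return s.labeled_positive_in_world


CANDIDATES = {"intended": intended, "pathological": pathological}

# Ordinary training data: the two hypotheses agree on all of it.
training_data = [
    (Scenario(bridge_built=True, labeled_positive_in_world=True), True),
    (Scenario(bridge_built=False, labeled_positive_in_world=False), False),
]

# A made-up world: a bridge gets built, but nobody there evaluates it.
made_up = Scenario(bridge_built=True, labeled_positive_in_world=False)
human_verdict = True  # assumed feedback: "yes, that counts as bridge-building"


def survivors(data):
    """Names of the hypotheses that agree with every labeled scenario."""
    return [
        name
        for name, hypothesis in CANDIDATES.items()
        if all(hypothesis(s) == label for s, label in data)
    ]


print(survivors(training_data))                               # ['intended', 'pathological']
print(survivors(training_data + [(made_up, human_verdict)]))  # ['intended']
```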

For this to work, though, we also have to solve the problem of providing human-understandable explanations of the AI’s learned models, which is its own Pandora’s box.

Discussion

The overall point I’m trying to make here has similarities to the Reinforcement Learning with a Corrupted Reward Channel paper, particularly section 4.1: the way the system gets feedback matters a lot. The way humans get put into the loop can be very tricky; seemingly obvious answers lead to pathological behaviors for highly capable systems. Trying to fix this behavior can lead us down a rabbit-hole of trying patch after patch after patch, until a change in perspective like observation-utility learning eliminates the need for all those patches in one fell swoop (and then we find ourselves making entirely new patches on a higher level and about more important things…).

2 comments

It's so frustrating to me that "model-utility" learning doesn't have a guarantee. It's like, you make an AI that has a good model of the world, you point (via extensional definition) at some things in the world and say "do things like that!" ... And then the AI can learn the category "things that cause the human to include them in the extensional definition," and create stimuli that would hack your brain if you were alive to see them.

It might need a better understanding of reference, and it might need breakthroughs in human-like concepts and matching the training distribution. But maybe it's still near the right track?

I can definitely tap into the "This should work!" intuition, which says that there should be a way to avoid the problem without significantly changing the feedback loop -- if only we could articulate to the system the mistake it is making. Yet, it seems like to address these sorts of failures you have to change the feedback loop.

What does it mean for an AI who knows a lot more about what the world is to do what a human wants?

Utility functions are likely the wrong concept (Stuart Armstrong has given a lot of reasons to think this). My suspicion is that the better concept is "what a human would want you to do in a situation"; i.e., you try and extract a policy rather than a utility. That's a little like approval-direction in flavor. A big problem: like my "human hypothesis evaluation" above, it would require the AI to construct human-understandable explanations of its potential cognitive states for the human. ("What action do I take if I'm thinking all these things?")

What other concepts do we need to refactor? Maybe knowledge?