Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Full toy model for preference learning

4Rohin Shah

2Charlie Steiner


Planned summary:

This post applies Stuart's general preference learning algorithm to a toy environment in which a robot has a mishmash of preferences about how to classify and bin two types of objects.

Planned opinion:

This is a nice illustration of the very abstract algorithm proposed before; I'd love it if more people illustrated their algorithms this way.

This is really handy. I didn't have much to say, but revisited this recently and figured I'd write down the thoughts I *did* think.

My general feeling about human models is that they need precisely one more level of indirection than this. Too many levels of indirection, and you get something that correctly predicts the world, but doesn't contain something you can point to as the desires. Too few, and you end up trying to fit human examples with a model that doesn't do a good job of fitting human behavior.

For example, if you build your model on responses to survey questions, then what about systematic human difficulties in responding to surveys (e.g. difficulty using a consistent scale across several orders of magnitude of value) that the humans themselves are unaware of? I'd like to use a model of humans that learns about this sort of thing from non-survey-question data.

## 1. Toy model for greater insight

This post will present a simple agent with contradictory and underdefined "preferences", and will apply the procedure detailed in this research project to capture all these preferences (and meta-preferences) in a single utility function^{[1]}.

The toy model will start in a simpler form, and gradually add more details, aiming to capture almost all of sections 2 and 3 of the research agenda in a single place.

The purposes of this toy model are:

So, the best critiques are ones that increase the clarity of the toy model, those that object to specific parts of it (especially if alternatives are suggested), and those that make it more realistic for applying to actual humans.

## 1.1 How much to read this post

On a first pass, I'd recommend reading up to the end of Section 3, which synthesises all the base level preferences. Then Sections 4, 5, and 6 deal with more advanced aspects, such as meta-preferences and identity preferences. Finally, Sections 7 and 8 are more speculative and underdefined, and deal with situations where the definitions become more ambiguous.

## 2. The basic setting and the basic agent

Ideally, the toy example would start with some actual agent that exists in the machine learning literature; an agent which uses multiple overlapping and slightly contradictory models. However, I haven't been able to find a good example of such an agent, especially not one that is sufficiently transparent and interpretable for this toy example.

So I'll start by giving the setting and the agent myself; hopefully, this will get replaced by another ML agent in future iterations, but it serves as a good starting point.

## 2.1 The blegg and rube classifier

The agent is a robot that is tasked with sorting bleggs and rubes.

Time is divided into discrete timesteps, of which there will be a thousand in total. The robot is in a room, with a conveyor belt and four bins. Bleggs (blue eggs) and rubes (red cubes) periodically appear on the conveyor belt. The conveyor belt moves right by one square every timestep, moving these objects with it (and taking them out of the room forever if they're on the right of the conveyor belt).
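The belt dynamics described here can be sketched in a few lines. This is only an illustration: the belt length of four squares, the name `step_belt`, and the idea that new objects appear on the leftmost square are all assumptions made for the example, not details fixed by the post.

```python
from typing import List, Optional

def step_belt(belt: List[Optional[str]], incoming: Optional[str] = None) -> Optional[str]:
    """Advance the belt one square to the right, in place.

    Returns the object (if any) that was on the rightmost square and has
    therefore left the room forever.
    """
    exited = belt[-1]
    # Shift every square one step to the right, starting from the right end.
    for i in range(len(belt) - 1, 0, -1):
        belt[i] = belt[i - 1]
    # Assumption: a newly appearing object lands on the leftmost square.
    belt[0] = incoming
    return exited

belt = [None, "rube", None, "blegg"]          # hypothetical 4-square belt
exited = step_belt(belt, incoming="blegg")    # the old blegg leaves the room
```

After this call, `belt` holds the new blegg on the left and the rube one square further right, and `exited` is the blegg that left the room.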

The possible actions of the robot are as follows: move in the four directions around it^{[2]}, pick up any small object (eg a blegg or a rube) from the four directions around it (it can only do so if it isn't currently carrying anything), or drop a small object in any of the four directions around it. Only one small object can occupy any given floor square; if the robot enters an occupied square, it crushes any object in it. Note that the details of these dynamics shouldn't be that important, but I'm writing them down for full clarity.

## 2.2 The partial preferences

The robot has many partial models appropriate for many circumstances. I am imagining it as an intelligent agent that was sloppily trained (or evolved) for blegg and rube classification.

As a consequence of this training, it has many internal models. Here we will look at the partial models/partial preferences where everything is the same, except for the placement of a single rube. Because we are using one-step hypotheticals, the robot is asked its preferences in many different phrasings; it doesn't always give the same weight for the same preference every time it's asked, so each partial preference comes with a mean weight and a standard deviation.

Here are the rube-relevant preferences with their weights. A rube...

This may seem to have too many details for a toy model, but most of the subtleties in value learning only become apparent when there are sufficiently many different base level values.

We assume all these symbols and terms are grounded (though see later in this post for some relaxation of this assumption).

Most other rube partial preferences are composites of these. So, for example, if the robot is asked to compare a rube in a rube bin with a rube on the conveyor belt, it decomposes this into "rube bin vs floor tile" and "floor tile vs conveyor belt".

The formulas for the blegg preferences (Pb1 through Pb4) are the same, except with the blegg bin and rube bin inverted, and with 3 subtracted from each of the weights - the robot just doesn't feel as strongly about bleggs.
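As a small worked example of this rule, the blegg weights can be derived from the rube weights. The rube weights used below are taken from figures quoted later in the post (20 and 12 for Pr1 and Pr2, 2 for Pr4); the value 12 for Pr3 is inferred from the Section 4 numbers and should be read as an assumption. The clipping at zero follows the remark in Section 3.2 that weights cannot be negative.

```python
# Rube weights as implied elsewhere in the post (Pr3 = 12 is an inference).
rube_weights = {"P1": 20, "P2": 12, "P3": 12, "P4": 2}

def blegg_weights(rube):
    """Subtract 3 from each rube weight, never going below zero."""
    return {name: max(0, w - 3) for name, w in rube.items()}

bw = blegg_weights(rube_weights)
# P4 would be 2 - 3 = -1, so it is clipped to 0: the preference vanishes.
```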

There is another preference, that applies to both rubes and bleggs on the conveyor belt:

## 2.3 Floor: one spot is (almost) as good as another

There is one last set of preferences: when the robot is asked whether it prefers a rube (or a blegg) on one spot of the floor rather than another. Let F be the set of all floor spaces. Then the agent's preferences can be captured by an anti-symmetric function f:F×F→{−1,1}.

Then if asked its preference between a rube (or blegg) on floor space s versus it being on floorspace s′, the agent will prefer s over s′ iff f(s,s′)=1 (equivalently, iff f(s′,s)=−1); the weight of this preference, Ps,s′, is ϵ=0.001.
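To make the anti-symmetry condition concrete, here is a minimal sketch that builds one arbitrary such f on a small grid and reads preferences off it. The grid size, the random choice of signs, and the helper names are assumptions for illustration only. Since f(s,s) would have to equal its own negative, the sketch only defines f on pairs of distinct squares.

```python
import itertools
import random

EPSILON = 0.001  # weight of every floor-vs-floor preference

def make_antisymmetric(spaces, seed=0):
    """Pick an arbitrary f with f(s, t) = -f(t, s) for all distinct s, t."""
    rng = random.Random(seed)
    f = {}
    for s, t in itertools.combinations(spaces, 2):
        sign = rng.choice([-1, 1])
        f[(s, t)] = sign
        f[(t, s)] = -sign
    return f

def prefers(f, s, t):
    """True iff the agent prefers the object on s rather than on t."""
    return f[(s, t)] == 1

spaces = [(x, y) for x in range(3) for y in range(2)]  # hypothetical 3x2 floor
f = make_antisymmetric(spaces)
```

Whatever signs are drawn, exactly one of `prefers(f, s, t)` and `prefers(f, t, s)` holds for each pair, which is all the text requires.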

## 3. Synthesising the utility function

## 3.1 Machine learning

When constructing the agent's utility function, we first have to collect similar comparisons together. This was already done implicitly when talking about the different "phrasings" of hypotheticals comparing, eg, the blegg bin with the rube bin. So the machine learning collects together all such comparisons, which gives us the "mean weight" and the standard deviation in the table above.
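The "collecting together" step can be pictured as grouping the weights the robot reports for each phrasing of the same comparison, then summarising each group. The sample numbers below are hypothetical, and using the sample standard deviation (rather than the population one) is an arbitrary choice made for this sketch.

```python
from statistics import mean, stdev

# Hypothetical weights reported for different phrasings of one comparison.
phrasing_weights = {
    "rube: rube bin vs blegg bin": [19.0, 21.0, 20.5, 19.5],
    "rube: rube bin vs floor": [11.0, 13.0, 12.0],
}

def summarise(samples_by_comparison):
    """Return {comparison: (mean weight, standard deviation)}."""
    return {
        name: (mean(samples), stdev(samples))
        for name, samples in samples_by_comparison.items()
    }

summary = summarise(phrasing_weights)
```

In a real system the grouping itself (deciding which phrasings count as "the same comparison") is the hard part that the machine learning is expected to handle; here it is simply given by the dictionary keys.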

The other role of machine learning could be to establish that one floor space is pretty much the same as another, and thus that the function f that differentiates them (of very low weight ϵ=0.001) can be treated as irrelevant.

This is the section of the toy example that could benefit most from a real ML agent as part of this example. Because in this section, I'm essentially introducing noise and "unimportant" differences in hypothetical phrasings, and claiming "yes, the machine learning algorithm will correctly label these as noise and unimportant".

The reason for including these artificial examples, though, is to suggest the role I expect ML to perform in more realistic examples.

## 3.2 The first synthesis

Given the assumptions of the previous subsection, and the partial preferences listed above, the energy-minimising formula of this post gives a utility function that rewards the following actions (normalised so that placing small objects on the floor/picking up small objects give zero reward):

Those are the weighted values, derived from Pri and Pbi, 1≤i≤3 (note that these use the mean weights, not - yet - the standard deviations). For simplicity of notation, let Vr be the utility/reward function that gives +1 for a rube in the rube bin and −1 for a rube in the blegg bin, and Vb the opposite function for a blegg.

Then, so far, the agent's utility is roughly 10.46Vr+8.60Vb. Note that 10.46 is between the 20/2=10 implied by Pr1 and the 12 implied by Pr2 (and similarly for 8.60, and the 17/2=8.5 implied by Pb1 and the 9 implied by Pb2).

But there are also the preferences concerning the conveyor belt and the placement of objects there, Pr4, Pb4 and P5. Applying the energy-minimising formula to Pr4 and P5 gives the following set of values:

Now, what about Pb4? Well, the weight of Pr4 is 2, and the weight of Pb4 is 3 less than that, which would be −1. Since weights cannot be negative, the weight of Pb4 is 0, or, more simply, Pb4 does not exist as a partial preference. So the corresponding preferences for bleggs are much simpler:

Hmm, is something odd going on there? Yes, indeed. Common sense tells us that these values should go 0, −1, −2, −3, and end with −4 when going off the conveyor belt, rather than ever being positive. But the problem is that the agent has no relative preference between bleggs on the conveyor belt and bleggs on the floor (or anywhere) - Pb4 doesn't exist. So the categories of "blegg on the conveyor belt" and "blegg anywhere else" are not comparable. When this happens, we normalise the average of each category to the same value (here, 0). Hence the very odd values of the blegg on the conveyor belt.
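The normalisation step can be checked with a short calculation. Assuming the "blegg on the conveyor belt" category consists of exactly the five equally weighted entries in the common-sense sequence (four belt positions plus leaving the room), shifting them so that their average is 0 gives the values below. The function name `centre` is just for illustration.

```python
def centre(values):
    """Shift a list of values so that their average becomes zero."""
    avg = sum(values) / len(values)
    return [v - avg for v in values]

common_sense = [0, -1, -2, -3, -4]   # belt squares, then leaving the room
centred = centre(common_sense)       # [2.0, 1.0, 0.0, -1.0, -2.0]
```

The last entry, −2, agrees with the penalty for a blegg leaving the room quoted in the text, and the first entry is the odd-looking positive value for a blegg that has just arrived on the belt.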

Note that, most of the time, only the penalties for leaving the room matter (−4.44 for a rube and −2 for a blegg). That's because while the objects are in the room, the agent can lift them off the conveyor belt, and thus remove the utility penalty for being on it. Only at the end of the episode, after 1000 timesteps, will the remaining bleggs or rubes on the conveyor belt matter.

Collect all these conveyor belt utilities together as Vc. Think the values of Vc are ugly and unjustified? As we'll see, so does the agent itself.

But, for the moment, all this data defines the first utility function as

U1=10.46Vr+8.60Vb+Vc.

Feel free to stop reading here if the synthesis of base-level preferences is what you're interested in.

## 4. Enter the meta-preferences

We'll assume that the robot has the following two meta-preferences.

Now, MP2 will wipe out the Vc term entirely. This may seem surprising, as the Vc term includes rewards that vary between +2 and −4.44, while MP2 only has a weight of 2. But recall that the effect of meta-preferences is to change the weights of lower level preferences. And there are only two preferences that don't involve bins: Pr4 and P5, of weight 2 and 1, respectively. So MP2 will wipe them both out, removing Vc from the utilities.

Now, MP1 attempts to make the preferences with rube and blegg symmetric, moving each preference by 1 in the direction of the symmetric version. This sets the weight of Pr1 to 19, Pr2 and Pr3 to 11, while Pb1, Pb2 and Pb3 respectively go to weights 18, 10, and 10.
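The arithmetic of this adjustment can be sketched as follows. The rule used here, moving each weight towards the midpoint of the pair by the meta-preference's weight but never past the midpoint, is one reading that reproduces the numbers in the text; treating MP1 as having weight 1 is likewise an inference from those numbers. The function name is hypothetical.

```python
def symmetrise(rube_w, blegg_w, mp_weight):
    """Move a rube/blegg weight pair towards each other by up to mp_weight each."""
    gap = rube_w - blegg_w
    # Never overshoot the midpoint: half the gap is the most each side can move.
    step = min(mp_weight, abs(gap) / 2)
    direction = 1 if gap > 0 else -1
    return rube_w - direction * step, blegg_w + direction * step

pairs = [(20, 17), (12, 9), (12, 9)]                  # (Pr_i, Pb_i) for i = 1..3
adjusted = [symmetrise(r, b, 1) for r, b in pairs]    # weight of MP1 assumed to be 1
fully_symmetric = symmetrise(20, 17, 1.5)             # (18.5, 18.5)
```

With a weight of 1 the pairs become (19, 18), (11, 10), (11, 10); with a weight of 1.5 or more the gap of 3 closes completely.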

So the base level preferences are almost symmetric, but not quite (they would be symmetric if the weight of MP1 was 1.5 or higher). Re-running energy minimisation with these values gets:

## 5. Enter the synthesis meta-preferences

Could the robot have any meta-preferences over the synthesis process? There is one simple possibility: the process so far has not used the standard deviation of the weights. We could argue that partial preferences where the weight has high variance should be penalised for the uncertainty. One plausible way of doing so would be:

This (combined with the other, standard meta-preferences) reduces the weight of the partial preferences to:

These weights generate U′2=9Vr+8.40Vb.

Then if w is the weight of MP3 and d the default weight of the standard synthesis process, the utility function becomes

Note the critical role of d in this utility definition. For standard meta-preferences, we only care about their weight relative to each other (and relative to the weight of standard preferences). However, for synthesis meta-preferences, the default weight of the standard synthesis is also relevant.

## 6. Identity preferences

We can add identity preferences too. Suppose that there were two robots in the same room, both classifying these objects. Their utility functions for individually putting the objects in the bins are U3 and U′3, respectively.

If asked, ahead of time, the robot would assign equal weight to them putting the rube/blegg in the bins as to the other robot doing so^{[3]}. Thus, based on their estimate at time 0, the correct utility is:

However, when asked mid-episode about which agent should put the objects in the bins, the robot's answer is much more complicated, and depends on how many each robot has put in the bin already. The robot wants to generate at least as much utility as its counterpart. Until it does that, it really prioritises boosting its own utility (U3); after reaching or surpassing the other utility (U′3), then it doesn't matter which utility is boosted.

If that result were strict, then the agent's actual utility would be the minimum of 2U3 and U3+U′3. Let's assume the data is not quite so strict, and that when we try and collapse these various hypotheticals, we get a smoothed version of the minimum, for some λ>0:

with V1=2U3 and V2=U3+U′3.
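One common way to smooth a minimum is the negative log-sum-exp, −(1/λ)·log(e^{−λV1}+e^{−λV2}). It is used here as an assumption: it is chosen because its sensitivity to V1 and V2 gives exactly the normalised exponential factors that appear in the footnote on weights, not because the post's formula is reproduced above. The function names are hypothetical.

```python
import math

def smooth_min(v1, v2, lam):
    """Soft minimum: -(1/lam) * log(exp(-lam*v1) + exp(-lam*v2)), computed stably."""
    m = min(v1, v2)
    # Subtracting m first keeps the exponentials from overflowing.
    total = math.exp(-lam * (v1 - m)) + math.exp(-lam * (v2 - m))
    return m - math.log(total) / lam

def relative_weights(v1, v2, lam):
    """Normalised factors e^{-lam*v}/(e^{-lam*v1} + e^{-lam*v2}) for v1 and v2."""
    m = min(v1, v2)
    e1 = math.exp(-lam * (v1 - m))
    e2 = math.exp(-lam * (v2 - m))
    return e1 / (e1 + e2), e2 / (e1 + e2)

u3, u3_other = 1.0, 4.0                 # hypothetical current utilities
v1, v2 = 2 * u3, u3 + u3_other          # V1 = 2*U3, V2 = U3 + U'3
value = smooth_min(v1, v2, lam=2.0)
w1, w2 = relative_weights(v1, v2, lam=2.0)
```

For large λ the result approaches the strict minimum of V1 and V2; for small λ the two terms are blended more evenly. While the robot is behind its counterpart (V1 < V2), most of the weight sits on V1, which matches the description of it prioritising its own utility.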

So, note that identity preferences are of the same *type* as standard preferences^{[4]}; we just expect them to be more complex to define, being non-linear.

## 6.1 Correcting ignorance

So, which versions of the identity preferences are correct - U4 or U5? In the absence of meta-preferences, we just need a default process - should we prioritise the current estimate of partial-preferences, or the expected future estimates?

If the robot has meta-preferences, this allows for the synthesis process to correct for the robot's ignorance. It may have a strong meta-preference for using expected future estimates, *and* it may believe currently that U4 is what would result from these. The synthesis process, however, knows better, and picks U5.

Feel free to stop reading here for standard preferences and meta-preferences; the next two sections deal with situations where the basic definitions become ambiguous.

## 7. Breaking the bounds of the toy problem

## 7.1 Purple cubes and eggs

Assume the robot has a utility of the type U=αVr+βVb. And, after a while, the robot sees some purple shapes enter, both cubes and eggs.

Obviously a purple cube is closer to a rube than to a blegg, but it is neither. There seem to be three obvious ways of extending the definition:

Which of U6, U7, and U8 is a better extrapolation of U? Well, this will depend on how machine learning extrapolates concepts, or on default choices we make, or on meta-preferences of the agent.

## 7.2 Web of connotation: purple icosahedron

Now imagine that a purple icosahedron enters the room, and the robot is running either U7 or U8 (and thus doesn't simply ignore purple objects).

How should it treat this object? The colour is exactly in between red and blue, while the shape can be seen as close to a sphere (an icosahedron is pretty round) or to a cube (it has sharp edges and flat faces).

In fact, the web of connotations of rubes and bleggs looks like this:

[Figure: the web of connotations of rubes and bleggs]

The robot's strongest connotation for the rube/blegg is its colour, which doesn't help here. But the next strongest is the sharpness of the edges (maybe the robot has sensitive fingers). Thus the purple icosahedron is seen as closer to a rube. But this is purely a consequence of how the robot does symbol grounding; another robot could see it more as a blegg.

## 8. Avoiding ambiguous distant situations

Suddenly, strange objects start to appear on the conveyor belt. All sorts of shapes and sizes, colours, and textures; some change shape as they move, some are alive, some float in the air, some radiate glorious light. None of these looks remotely like a blegg or a rube. This is, in the terminology of this post, an ambiguous distant situation.

There is also a button the agent can press; if it does so, strange objects no longer appear. This won't net any more bleggs or rubes; how much might the robot value pressing that button?

Well, if the robot's utility is of the type U=αVr+βVb, then it doesn't value pressing the button at all. For it can just ignore all the strange objects, and get precisely the same utility as pressing the button would give it.

But now suppose the agent's utility is of the type U=αVr+βVb+Vc, where Vc is the conveyor-belt utility described above. Now a conservative agent might prefer to press the button. It gets −4.44 for any rube that travels through the room on the conveyor belt (and −2 for any blegg that does so). It doesn't know whether any of the strange objects count as rubes or bleggs, but it isn't sure that they don't, either. If an object enters that ranks as a·rube + b·blegg (note that a+b need not be 1; consider a rube glued to a blegg), then it might lose 4.44a+2b if the object exits the room. But it might lose αb or βa if it puts the object in the wrong bin.
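The three potential losses in this paragraph are easy to tabulate for a given object. In the sketch below, the values of a and b are hypothetical, and α and β are set to the first-synthesis coefficients 10.46 and 8.60 purely as an example. The smallest of the three losses is the cap that the conservatism argument in the next paragraph places on the value of pressing the button.

```python
def potential_losses(a, b, alpha, beta, rube_exit=4.44, blegg_exit=2.0):
    """Losses the robot might incur from an object ranked as a*rube + b*blegg."""
    return {
        "object leaves the room": rube_exit * a + blegg_exit * b,
        "wrong bin (alpha*b)": alpha * b,
        "wrong bin (beta*a)": beta * a,
    }

losses = potential_losses(a=0.5, b=0.5, alpha=10.46, beta=8.60)
button_cap = min(losses.values())   # most the button could be worth for this object
```

For this half-rube, half-blegg object the binding term is the penalty for letting it leave the room, since both wrong-bin losses are larger.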

Given conservatism assumptions - that potential losses in ambiguous situations rank higher than potential gains - pressing the button to avoid that object might be worth up to min{4.44a+2b, αb, βa} to the robot^{[5]}.

[1] For this example, there's no need to distinguish reward functions from utility functions.

[2] North, East, South, or West from its current position.

[3] This is based on the assumption that the weights in the definitions U3 and U′3 are the same, so we don't need to worry about the relative weights of those utilities.

[4] If we wanted to determine the weights at a particular situation: let w be the standard weight for comparing an object in a bin with an object somewhere else (same w for U3 and for U′3). Then for the purpose of the initial robot moving the object to the bin, the weight of that comparison, for given values of U3 and U′3, is w·e^{−2λU3}/(e^{−2λU3}+e^{−λ(U3+U′3)}), while for the other robot moving the object to the bin, the weight of that comparison is w·e^{−λ(U3+U′3)}/(e^{−2λU3}+e^{−λ(U3+U′3)}).

[5] That's the formula if the robot is ultra-fast or the objects on the conveyor belt are sparse. The true formula is more complicated, because it depends on how easy it is for the robot to move the object to the bins without missing out on other objects. But, in any case, pressing the button cannot be worth more than 4.44a+2b, the maximal penalty the robot gets if it simply ignores the object.