A putative new idea for AI control; index here.

Resource-gathering agent

It will often be useful to have a model of a “pure” resource gathering agent: one motivated only to gather resources, accumulate power, spread efficiently, and so on. This model could be used as an example of behaviour not to emulate, or as a comparison yardstick for the accumulation behaviour of other agents.

The simplest design for a resource gathering agent would be to take a utility function u (one linear in paperclips, say) and give the agent the utility function X(u) + ¬X(-u), where X is some future observation that has a 50% chance of occurring, and that the AI cannot affect. In other words, the agent's utility is u if X occurs and -u if it does not. Some cosmological fact coming from a distant galaxy (at some point in the future) could do the trick.

This agent would behave roughly as a resource gathering agent, accumulating power in preparation for the day it would know what to do with it: it would want resources (as these could be used to create or destroy paperclips) but would be indifferent to creating or destroying paperclips currently, as the expected gain from u is exactly compensated by the expected loss from -u (and vice versa).
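
To make the indifference concrete, here is a minimal sketch in Python. It assumes a toy world that is not part of the argument above: only the final paperclip count matters, and “resources” just means the ability to add or remove up to 10 paperclips once X has been observed. The function names are invented for illustration.

```python
from fractions import Fraction

P_X = Fraction(1, 2)  # chance that the observation X occurs (from the text)

def u(paperclips):
    """Utility linear in paperclips, as in the example above."""
    return paperclips

def expected_utility(paperclips_if_x, paperclips_if_not_x):
    """Expected value of X(u) + ¬X(-u); the final count may depend on X."""
    return P_X * u(paperclips_if_x) + (1 - P_X) * (-u(paperclips_if_not_x))

# Acting now, before X is known: the same count in both branches.
make_now = expected_utility(10, 10)

# Holding resources (toy assumption: enough to shift the count by 10),
# then creating paperclips if X and destroying them if not X.
wait_with_resources = expected_utility(10, -10)

# Waiting with no resources at all.
wait_without_resources = expected_utility(0, 0)

print(make_now, wait_with_resources, wait_without_resources)
```

In this toy setting, making paperclips now is worth exactly as much as doing nothing, while holding resources until X is known is strictly better.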

However, its behaviour is not independent of u: if for instance there were a Grand President of the Committee to Establish the Proper Number of Paperclips in the World (GPotCtEtPNoPitW), then the AI would desperately try to secure that position, but would not care overmuch about being the GPotCtEtPNoSitW, who deals with staples.

So a better model of a resource gathering agent is one that has a distribution P over all sorts of different utility functions, with the proviso that for all such utilities u, P(u)=P(-u). Note here that we’re talking about actual utility functions (which can be compared and summed directly), not functions-up-to-affine-transformations. This distribution P will be updated at some future date according to some phenomena outside of the agent’s control.

Then this agent, which currently has exactly zero motivations, will nonetheless accumulate resources in preparation for the day it will know what to do.
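
A rough sketch of this, again in Python and again under toy assumptions that go beyond the text: utilities are linear and written as weight vectors over (paperclips, staples), and a resource “budget” lets the agent shift one good by that many units once it has learnt which utility it has. The names `expected_value_now` and `expected_value_of_resources` are hypothetical.

```python
from fractions import Fraction

# Linear utilities as weights over (paperclips, staples).
U = (1, 0)
V = (0, 1)

def neg(w):
    """The utility -w."""
    return tuple(-x for x in w)

# A symmetric distribution: P(w) == P(-w) for every w.
P = {
    U: Fraction(1, 4), neg(U): Fraction(1, 4),
    V: Fraction(1, 4), neg(V): Fraction(1, 4),
}

def value(w, outcome):
    """Value of an outcome (paperclips, staples) under utility w."""
    return sum(wi * oi for wi, oi in zip(w, outcome))

def expected_value_now(outcome):
    """Value of a fixed outcome, averaged over P before any update."""
    return sum(p * value(w, outcome) for w, p in P.items())

def expected_value_of_resources(budget):
    """Toy model: wait, learn w, then spend the budget on the good that w
    weights most heavily, pushing it in whichever direction w favours."""
    return sum(p * budget * max(abs(x) for x in w) for w, p in P.items())

print(expected_value_now((7, -3)))     # any fixed outcome averages to zero
print(expected_value_of_resources(5))  # but keeping a budget has positive value
```

Every fixed outcome is worth zero in expectation because each utility is cancelled by its negation, yet the option to act after the update is not, which is what drives the accumulation.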

There are some distributions P which are better suited to getting a “purer” resource gathering agent (a bad P would be, eg, having lots of utilities which are tiny variations on u, which is essentially the same as having just u, though “tiny variations” is not a stable concept under affine transformations). A simplicity prior seems a natural choice here. If u is linear in paperclips and v in staples, then the complexity penalty for w=u+v doesn’t matter so much, as the agent will already want to preserve power over paperclips and staples, because of the (simpler) u, -u, v and -v.


Pre-corriged agents

One of the problems with corrigible agents is that they are, in a sense, too good at what they do. An agent that is currently a u maximiser and will transition tomorrow to being a v maximiser (and everyone knows this) will accept the deal “give me £1,000,000, and I’ll return it tripled tomorrow if you’re still a u-maximiser” (link to corrigibility paper). Why would it accept this deal? Because a real u-maximiser would, and it behaves (almost) exactly as a real u-maximiser.

We might be able to solve that specific problem with methods that identify agents or subagents (see subsequent posts). But there are still issues with, for instance, people who want to trade their own u-valuable and v-useless resources for the agent’s u-useless and v-valuable ones – and then propose the opposite trade tomorrow, with an extra premium.

We can use the idea of a resource gathering agent to prevent such loss of utility. Assume the agent has current utility u, and will transition to some v at a specific point in the future. It has a probability distribution P over what this v will be.

Then instead of having current utility u, have it instead as:

u + C Σ_v Q(v)·v,

where C is some constant and Q(v)=(P(v)+P(-v))/2. Note that Q(v)=Q(-v), so this agent is currently a combination between a u-maximiser and a resource gathering agent – moreover, a resource gathering agent that cares about preserving flexibility in the (likely) correct areas for its future values. The importance of either factor (u-maximising or resource gathering) can be tuned by changing C.
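
The symmetrisation step can be sketched as follows. This is only an illustration: representing each utility as a weight vector over (paperclips, staples) is an assumption made here, and `symmetrise` is a hypothetical helper name.

```python
from fractions import Fraction

def neg(v):
    """The utility -v, with utilities written as weight vectors."""
    return tuple(-x for x in v)

def symmetrise(P):
    """Return Q with Q(v) = (P(v) + P(-v)) / 2, defined on the support
    of P together with the negations of its elements."""
    support = set(P) | {neg(v) for v in P}
    return {v: (P.get(v, 0) + P.get(neg(v), 0)) / 2 for v in support}

U = (1, 0)  # linear in paperclips
V = (0, 1)  # linear in staples

# An asymmetric belief about the future utility: -U and -V have probability 0.
P = {U: Fraction(3, 4), V: Fraction(1, 4)}
Q = symmetrise(P)

for v in sorted(Q):
    print(v, Q[v])
```

Here Q gives weight 3/8 to each of U and -U and 1/8 to each of V and -V, so it still sums to 1 and satisfies Q(v)=Q(-v), even though P gave -U and -V no weight at all.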

What if the agent expects that its utility will get changed more than once in the future? This can be built up inductively: if there are two utility changes to come, for instance, then after the first transition (but before the second) the agent will have a composite utility, as above, of the form “u + C Σ_v Q(v)·v”. Then the agent can have a P over all such composite utilities, and use that to define its current composite-composite utility (the one it has before the first change). A composite-composite utility is really just a composite utility, so the process can then be repeated.

Corrigibility will be applied to this setup in two types of circumstances: when people physically change the utility u, as before, and when the agent updates P (and hence Q) in a way that modifies the composite utility.

Note that this setup is less exploitable, but still suffers from the weakness that Q and P are not equal (in the worst case, you could have P(v)=0 while Q(v)=0.5). However, if Q were not symmetric, then the agent wouldn’t currently be a u-maximiser, so this non-equality is essential to preserving the idea of it being a (somewhat) u-maximising agent.

This may not matter too much in practice, however. The agent is like an investor on the stock market who wants to purchase a lot of the long-term stock options, but has no current interest in any stocks. However, given that other people are interested in stocks, it would be stupid to buy and sell them at prices too divergent from the majority opinion, even if the agent doesn’t itself value them. General measures against blackmail or exploitation might also help here.
