Agents detecting agents: counterfactual versus influence

Stuart_Armstrong

A putative new idea for AI control; index here. Status: still somewhat underdeveloped.

In this post, I want to explore another avenue for controlling the ongoing influence of an AI (through subagents or other means). It is the idea, touched upon here and here, of figuring out whether the AI's existence was important or not for predicting the future. Informally, if you knew whether the AI had been turned on, would you consider that information valuable or almost irrelevant?

Agent vs agent

This post rests on two insights. The first is that, though it is hard to define what an agent or subagent is, we can produce examples of them. Or at least, at the point where we are capable of producing AIs that might be dangerous, we can likely produce examples of them. This can give us a scale for comparing the importance or impact of different agents.

The second is that if we produce an agent with certain motivations, it will likely expand and fill its light cone with... stuff (of some type or other). Those who don't agree with the AI risk thesis argument may still accept that we could program universe-optimising AIs (ie "we could make a paperclipper, if we were stupid enough to do so").

Why is this relevant? Well, consider an agent E (expansionist) that fills its light cone with stuff (likely including manufacturing and tracking devices), eventually at very high speed. And imagine that there is another agent H (hidden) inside that light cone. At some point, agent E will either detect agent H (and make plans to deal with it), or find its plans hampered by agent H, or agent H will go into deep hiding that does not visibly affect agent E's plans.

Therefore, an expansionist agent is a perfect tool for detecting the influence of other agents. Either agent E will have difficulty accomplishing its plans, or agent H's influence will be so limited that agent E cannot find any trace of it. Any influential agent will get detected, sooner or later, by an expansionist agent. So if E finds H unimportant, then H almost certainly is.

Of course, we don't really want expansionist agents like E - but these can be counterfactual.

The stage and the players

We'll now try and define precisely how we can use this idea to control the influence of an agent, setting out the axioms and assumptions needed along the way. As with some mathematics papers, it might be useful to skim this section lightly and return to it later.

Let S(u) be the AI whose influence we want to control. The utility u is what gives it its preferences, and S() indicates that we wish it to be a satisficer or reduced impact AI. Let M(v) be the counterfactual agent that we wish to use to detect S(u)'s influence, for some utility v. The M() designates the agent as a maximiser, without any control on its impact. If we can define S(), we can probably define M() by simply removing its reduced impact restrictions.

Let X be an event leading to S(u) being turned on, Y(v) an event leading to M(v) being turned on. We may as well assume that X and Y(v) are outside each other's light cones (in practice, we will set X≈1 and Y(v)≈0). Then the first assumption is:

A.1. We can define S(u), M(v), X and Y(v).

Another useful tool might be a weak maximising agent: one that attempts to maximise v, and will do so to some extent, but that can easily be thwarted by humans or others. It's likely that by the time we can build S(u), we can build a weak maximiser, designated by WM(v), with the event WY(v) denoting its being turned on:

A.2. We can define WM(v) and WY(v).

We might also be able to define an abstract, Platonic superagent. This is an idealised agent with immense computing power (but that doesn't exist in the real world). Call this AM; obviously, there is no event leading to AM being 'turned on'. We would want AM's goals to be defined in terms of 'estimating the utility/probability/expectation of certain events'.

A.3. We can define AM to estimate what we need to have estimated.

Of course, we need to be able to point to these various definitions from within S(u)'s code, so:

A.4. If we can define concepts, we can also define them within the various agents.

We also need agents to be able to say sensible things about the behaviour of the other agents:

A.5. If these agents are defined, S(u), WM(v), M(v) and AM can estimate each other's behaviours.

This could be done by using, for example, the reflexive probability of Paul Christiano and Benja Fallenstein.

Now, how about v? We cannot choose just any v. We want M(v) to not think that X (hence S(u)'s existence) is an important fact to know. But if v is defined by "if X, build staples; if ¬X, build paperclips", then X will be very important! So we need to define a class V of utility functions that are indifferent to S(u) and X. Or not quite indifferent: they should not care intrinsically about them, but they certainly should care if S(u) eats all their resources! See the ideas here for approaches to do this.

A.6. We can define V, a class of utility functions that do not care intrinsically about S(u) and X.

Ideally we'd want this class to be as broad as possible - certainly too large for anything in this universe to parse element by element.

Now let's turn to the issue of measuring "importance". Define e to be evidence about the universe prior to X, Y(v), or WY(v). We want X (hence the creation of S(u)) to be unimportant, ie of little information concerning the future. It might be possible to have a universal definition of the importance of certain facts:

A.7. We have a universal measure I of importance, which, given background evidence e, can rank the importance of knowing certain facts versus knowing others.

If we do not have such an I (and I feel it's moderately unlikely that one exists, though we may be able to find an I' which would be good enough to work for our purposes), we can use a utility v to estimate the importance of an event W. The trick is to compare E(v|W) with E(v|¬W), which would give an (uncalibrated) estimation of the importance of W to v.
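As a minimal sketch of this comparison, the snippet below computes |E(v|W) - E(v|¬W)| over a small finite set of possible worlds. The finite world model, the probabilities, and the names conditional_expectation and importance_proxy are all assumptions made for illustration; nothing in the post fixes how the expectations would actually be computed.

```python
from typing import Callable, Dict, Hashable

World = Hashable


def conditional_expectation(prob: Dict[World, float],
                            v: Callable[[World], float],
                            event: Callable[[World], bool]) -> float:
    """E(v | event) over a finite set of worlds with probabilities `prob`."""
    mass = sum(p for w, p in prob.items() if event(w))
    if mass == 0:
        raise ValueError("conditioning event has probability zero")
    return sum(p * v(w) for w, p in prob.items() if event(w)) / mass


def importance_proxy(prob: Dict[World, float],
                     v: Callable[[World], float],
                     W: Callable[[World], bool]) -> float:
    """|E(v|W) - E(v|not W)|: an uncalibrated importance of W to v."""
    with_w = conditional_expectation(prob, v, W)
    without_w = conditional_expectation(prob, v, lambda w: not W(w))
    return abs(with_w - without_w)


# Toy worlds (hypothetical): (did W happen?, number of paperclips produced).
prob = {(True, 10): 0.25, (True, 20): 0.25,
        (False, 10): 0.25, (False, 12): 0.25}
v = lambda w: float(w[1])   # v just counts paperclips
W = lambda w: w[0]          # the event whose importance we estimate

gap = importance_proxy(prob, v, W)
```

In this toy model E(v|W) is 15 and E(v|¬W) is 11, so the proxy is 4. The number has no absolute meaning; it is only useful when compared against the same quantity for other events.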

Right, we're now ready to start doing things.

Predicting... what?

What are we trying to predict here? Well, estimates of the importance of certain events. Specifically, the importance of X (which leads to the creation of S(u)) versus other events. For instance, if Y(v) designates the creation of M(v), we would want to compare the importance of X and Y(v) in (at least) four ways:

I(Y(v) | X∧e)
I(Y(v) | ¬X∧e)
I(X | Y(v)∧e)
I(X | ¬Y(v)∧e)

These measure the importance of Y(v) (hence M(v)) and X (hence S(u)) in universes where the other agent does or does not exist. If we expect S(u) to be of low impact, we would want things like I(Y(v) | X∧e) >> I(X | Y(v)∧e) (S(u) is much less important than the maximiser M(v)) and |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether M(v) exists or not).
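The two desiderata above can be written as a simple check, given the importance values as plain numbers. This is only a sketch: the function name looks_low_impact and the parameters ratio and tol are hypothetical, and choosing a concrete ratio to stand in for ">>" is an assumption rather than anything the argument pins down.

```python
def looks_low_impact(i_y_given_x: float,
                     i_x_given_y: float,
                     i_x_given_not_y: float,
                     ratio: float = 100.0,
                     tol: float = 1e-3) -> bool:
    """Check the two conditions on the importance values.

    i_y_given_x     stands for I(Y(v) | X and e)
    i_x_given_y     stands for I(X | Y(v) and e)
    i_x_given_not_y stands for I(X | not Y(v) and e)
    `ratio` is an assumed stand-in for ">>"; `tol` for "approximately 0".
    """
    # S(u) is much less important than the maximiser M(v).
    much_smaller = i_y_given_x >= ratio * i_x_given_y
    # The importance of S(u) barely depends on whether M(v) exists.
    insensitive = abs(i_x_given_y - i_x_given_not_y) <= tol
    return much_smaller and insensitive
```

For example, looks_low_impact(50.0, 0.01, 0.0105) passes both conditions, while looks_low_impact(50.0, 5.0, 5.0) fails the first one.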

Now, any particular v might be misleading, so we would want to check that this result held up for all v ∈ V. We could take a maximum over V, or a softmax.
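A rough sketch of the two aggregation options follows. It assumes we only have importance scores for a finite sample of utilities from V (the real V is meant to be far too large to enumerate), and it reads "softmax" as a smooth maximum (log-sum-exp with a temperature), which is an interpretation rather than something specified here. The names hard_max and soft_max are hypothetical.

```python
import math
from typing import Iterable


def hard_max(scores: Iterable[float]) -> float:
    """Worst case over the sampled utilities."""
    return max(scores)


def soft_max(scores: Iterable[float], temperature: float = 1.0) -> float:
    """Smooth maximum: T * log(sum(exp(s / T))).

    Always at least the hard max, and approaches it as T goes to 0.
    Computed relative to the largest score for numerical stability.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    m = max(scores)
    total = sum(math.exp((s - m) / temperature) for s in scores)
    return m + temperature * math.log(total)


# Hypothetical importance of X as measured under three sampled v in V.
scores = [0.1, 0.3, 2.0]
worst_case = hard_max(scores)
smooth_worst_case = soft_max(scores, temperature=0.1)
```

The hard max only sees the single most affected v; the smooth version also rises when many different v are moderately affected, which may or may not be the behaviour we want.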

If WM(v) can be defined, we would want to do the same estimates with WY(v) instead of Y(v), and desire I(WY(v) | X∧e) >> I(X | WY(v)∧e) (S(u) is less important than the weak maximiser WM(v)) and |I(X | WY(v)∧e) - I(X | ¬WY(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether WM(v) exists or not), again maxing or softmaxing across V.

Who predicts, how?

In the previous section, we used I to compute the importance of S(u) versus other agents. But what if axiom A.7 is wrong, and we don't have such an I? Well, there we can measure the importance of these events, given a particular v, using AM, writing AM_v(X|Y(v')∧e) for |E_AM(v|X∧Y(v')∧e) - E_AM(v|¬X∧Y(v')∧e)|, the expected difference (that AM estimates) in v given X versus ¬X (ie dependent on whether S(u) was turned on or not). More precisely, we'll want to use S(u)(AM_v(X|Y(v')∧e)) - the expected difference (that S(u) estimates AM estimates) in v etc...

Here we'll want to do a double max/softmax. First over v', so that we consider all M(v') for v' ∈ V. Then over v, so that AM estimates the expected difference to all possible v. In this way, max/softmax AM_v functions like an I. But not a perfect one: just one that detects disruptions to the multitude of v's in V. Therefore, we are essentially using the set V as the twenty million question method for the impact of S(u). We need to take extra care with this, to ensure we haven't left any gaping holes where S(u) can still have large impact.
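Written out, using plain max for both steps, the construction looks roughly like the following. The symbol I_V is a name introduced here purely for illustration, and either max could be replaced by a softmax.

```latex
\mathrm{AM}_v\bigl(X \mid Y(v') \wedge e\bigr)
  = \Bigl| E_{\mathrm{AM}}\bigl(v \mid X \wedge Y(v') \wedge e\bigr)
         - E_{\mathrm{AM}}\bigl(v \mid \neg X \wedge Y(v') \wedge e\bigr) \Bigr|

I_V(X \mid e)
  = \max_{v \in V} \; \max_{v' \in V} \;
    S(u)\Bigl( \mathrm{AM}_v\bigl(X \mid Y(v') \wedge e\bigr) \Bigr)
```

The inner max ranges over the counterfactual maximisers M(v'), the outer max over the utilities v that AM uses as probes, and S(u)(...) denotes S(u)'s estimate of what AM would compute.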

Now, if axiom A.3 is false, we need to use M(v) instead of AM_v to estimate this quantity. We need to do this with some care. M(v) is (counterfactually) a real agent. Why would it want to calculate the importance of M(v') and S(u), and why would S(u) care about this? Well, we can arrange this by constructing the scenario carefully. Maybe M(v') and S(u) are due to be created in the future, and M(v) is a boxed agent that can choose to prevent Y(v') and X. We modify v to v#, so that M(v#) wants to maximise v, but takes specific numerical penalties if it chooses to prevent Y(v') or X. By modifying these numerical penalties, we can motivate M(v#) to figure out the importance of M(v') and S(u), and to reveal its opinions through its actions. If we create an S(u#), also boxed, that could override M(v#) at a certain cost and create a specific S(u), we can motivate it to model M(v#)'s actions. From this we can construct an importance metric using the M(v).
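One way to picture the penalty trick is as a revealed-preference search: sweep the penalty for preventing an event and see where the agent stops bothering. The sketch below assumes a heavily idealised agent that prevents the event exactly when its expected gain in v exceeds the penalty, and that this decision is monotone in the penalty. The names revealed_importance and make_toy_agent are hypothetical, and nothing here models boxing, S(u#), or the agents' estimates of each other.

```python
from typing import Callable


def revealed_importance(prevents: Callable[[float], bool],
                        upper: float,
                        tol: float = 1e-6) -> float:
    """Bisect on the penalty to find where the agent stops preventing.

    Assumes `prevents(p)` is True for small penalties and False for large
    ones (monotone), and that `prevents(upper)` is False.
    """
    lo, hi = 0.0, upper
    if prevents(hi):
        raise ValueError("upper bound too small: agent still prevents")
    if not prevents(lo):
        return 0.0  # agent does not care even when prevention is free
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prevents(mid):
            lo = mid   # still worth paying: threshold is higher
        else:
            hi = mid   # too expensive: threshold is lower
    return (lo + hi) / 2


def make_toy_agent(gain_from_preventing: float) -> Callable[[float], bool]:
    """Toy stand-in for M(v#): prevents the event iff gain exceeds penalty."""
    return lambda penalty: gain_from_preventing > penalty


agent = make_toy_agent(3.5)   # hypothetical: preventing X is worth 3.5 in v
estimate = revealed_importance(agent, upper=100.0)
```

The recovered threshold (about 3.5 here) plays the role of the agent's opinion about how much the event matters to v, read off from its actions rather than from any report it makes.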

The WM(v), on the other hand, are probably too limited to be used effectively as predictors.

Arbitrary agents

One issue with this method is that Y(v) and WY(v) refer to the creation of specific agents, in the specific ways we've set up. For instance, |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 means that the importance of X doesn't vary much given Y(v) or its opposite. What this really means is that I(X | ¬Y(v)∧e) is bounded by I(X | Y(v)∧e) - but that might be quite high. There may be details of the setup that allow S(u) to have an important role given the specific M(v) created in this way - for instance, if it reached out to try and destroy it. Then I(X | Y(v)∧e) could be high, hence I(X | ¬Y(v)∧e) could be as well. The restrictions with >> could still contain S(u), but >> is ill-defined.

More secure would be if Y(v) could be defined as the creation of an agent M(v) in an abstract way. Maybe something like "is 'this object is a v maximiser' a good model of this object's properties and impact?". We would also want to give M(v) some starting resources. Again, note it is very hard to abstractly define what "resources" mean, but considerably easier to define a set of "useful resources" we could give to an agent. We would probably want to bound this definition in time and space as well.

This gives us the next axioms:

A.8. Y(v) defines the generic existence of an agent M(v).
A.9. WY(v) defines the generic existence of an agent WM(v).

With these axioms, our measure of importance becomes more general: S(u) is less important than abstract agents M(v) or WM(v), rather than certain specific ones. Note that we don't need to worry that S(u) snuffs out any hypothetical M(v): since the importance of S(u) is bounded below that of M(v), that would reduce S(u) to zero importance.

Extending to humans

What other agents do we know in the world? How about those 7+ billion ones wandering about on its surface? We can certainly define events Z(h) that might lead to the death of a particular human h - of course, we'd want Z(h) as low as feasibly possible!

A.10. We can define events Z(h) that correspond to the death of a human h.

Then we can use Z(h) instead of Y(v) and WY(v), to ensure that S(u) has importance comparable with a human. Of course, the ideal would be to define, abstractly, the death (or birth+maturation) of a human in a generic way!

A.11. We can define events Z(h) that correspond generically to the death or birth+maturation of a human h.

To summarise the various ways we can define the importance of S(u) (and hence, by integrating them into S(u)'s code, constrain S(u)'s importance and influence), depending on the axioms we have:
