The Intentional Agency Experiment

by Self-Embedded Agent, 2 min read, 10th Jul 2018, 5 comments



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We would like to discern the intentions of a hyperintelligent, possibly malicious agent which has every incentive to conceal its evil intentions from us. But what even is intention? What does it mean for an agent to work towards a goal?

Consider the lowly ant and the immobile rock. Intuitively, we feel one has (some) agency and the other doesn't, while a human has more agency than either of them. Yet, a sceptic might object that ants seek out sugar and rocks fall down but that there is no intrinsic difference between the goal of eating yummy sweets and minimising gravitational energy.

Intention is a property that an agent has with respect to a goal. Intention is not a binary value, a number or even a topological vector space. Rather, it is a certain constellation of counterfactuals.


Let $W$ be a world, which we imagine as a causal model in the sense of Pearl: a directed acyclic graph with nodes and attached random variables $N_0, \dots, N_k$. Let $R$ be an agent. We imagine $R$ to be a little robot (so not a hyperintelligent malignant AI), and we'd like to test whether it has a goal $G$, say $G = (N_0 = c)$ for some fixed value $c$. To do so we are going to run an Intentional Agency Experiment: we ask $R$ to choose an action $A_0$ from its possible actions $B = \{a_1, \dots, a_n\}$.

Out of the possible actions $B$, one action $a_{\text{best}}$ is the 'best' action for $R$ if it has goal $G$, in the sense that $P(G \mid do(a_{\text{best}})) \geq P(G \mid do(a_i))$ for $i = 1, \dots, n$.

If $R$ doesn't choose $A_0 = do(a_{\text{best}})$, great! We're done; $R$ doesn't have goal $G$. If $R$ does choose $A_0 = do(a_{\text{best}})$, we provide it with a new piece of (counterfactual) information, $P(G \mid A_0) = 0$, and offer it the option of changing its action. Among the remaining actions there is one next-best action $a_{\text{nbest}}$. Given the information $P(G \mid A_0) = 0$, if $R$ does not choose $A_1 = do(a_{\text{nbest}})$ we stop; if $R$ does, we provide it with the information $P(G \mid A_1) = 0$ and continue as before.
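The rounds described above can be sketched as a short loop. This is only an illustration under stated assumptions, not part of the original model: the agent is a hypothetical function that receives a table giving, for each action, the probability that the goal is reached if that action is taken, and returns its chosen action. The action names, the numbers, and both example agents are invented for the sketch, and all probabilities are assumed to be strictly positive.

```python
from typing import Callable, Dict

# Hypothetical interface: an agent maps {action: P(goal | do(action))} to an action.
Agent = Callable[[Dict[str, float]], str]


def intentional_agency_experiment(agent: Agent, p_goal: Dict[str, float]) -> int:
    """Return how many rounds of the experiment the agent passes.

    After every passed round the chosen action is counterfactually declared
    useless (its probability of reaching the goal is set to 0) and the agent
    is asked again.
    """
    beliefs = dict(p_goal)      # the information given to the agent this round
    remaining = set(p_goal)     # actions that have not been ruled out yet
    rounds_passed = 0
    while remaining:
        best_value = max(beliefs[a] for a in remaining)
        choice = agent(dict(beliefs))
        # The agent fails if it picks a ruled-out or a worse-than-best action.
        if choice not in remaining or beliefs[choice] < best_value:
            break
        rounds_passed += 1
        beliefs[choice] = 0.0   # counterfactual information: this action now fails
        remaining.remove(choice)
    return rounds_passed


def goal_seeker(beliefs: Dict[str, float]) -> str:
    """Always picks the action most likely to reach the goal."""
    return max(beliefs, key=beliefs.get)


def creature_of_habit(beliefs: Dict[str, float]) -> str:
    """Ignores the information and always does the same thing."""
    return "walk_straight"


# Invented example numbers.
P_GOAL = {"walk_straight": 0.9, "go_around": 0.6, "wait": 0.1}

print(intentional_agency_experiment(goal_seeker, P_GOAL))        # 3
print(intentional_agency_experiment(creature_of_habit, P_GOAL))  # 1
```

The number of rounds passed plays the role of the degree of agency: the habitual agent looks goal-directed in the first round only because its habit happens to coincide with the best action.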

At each round we assign more and more agency to $R$. Rather than a binary 'Yes, $R$ has agency' or 'No, $R$ has no agency', we imagine a continuum going from a rock, which has no possible actions, to an ant, which might pass some of the tests but not all, to humans and beyond.


Q: What if $R$ isn't acting rationally? What if it doesn't know all the details of $W$? What if it knows more? What if $R$ has bounded computation? What if...

A: The above is merely a simple model that tries to capture intent; one can complicate it as needed. Most of these objections come down to the possible inability of $R$ to choose the best action (given infinite compute and full knowledge of $W$). To remedy this we might allow the Intentional Agency Experiment to continue if $R$ chooses an action that is close to optimal but not optimal. We may introduce a Time to Think parameter when we consider bounded computational agents, etc. Once again, the point is not to assign a binary value of goal intention to an agent; rather, it is to assign it a degree of agency.
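One possible way to read the two relaxations just mentioned, sketched under assumptions that go beyond the text: "close to optimal" is modelled as an additive `tolerance` on the probability of reaching the goal, and "Time to Think" is modelled, very crudely, as the number of actions a bounded agent manages to evaluate. Both modelling choices, and all names and numbers, are hypothetical.

```python
from typing import Callable, Dict

Agent = Callable[[Dict[str, float]], str]


def tolerant_experiment(agent: Agent, p_goal: Dict[str, float],
                        tolerance: float = 0.0) -> int:
    """Count rounds passed when near-optimal choices are also accepted."""
    beliefs = dict(p_goal)
    remaining = set(p_goal)
    rounds_passed = 0
    while remaining:
        best_value = max(beliefs[a] for a in remaining)
        choice = agent(dict(beliefs))
        # Accept any not-yet-ruled-out action within `tolerance` of the best.
        if choice not in remaining or beliefs[choice] < best_value - tolerance:
            break
        rounds_passed += 1
        beliefs[choice] = 0.0   # the chosen action is declared useless
        remaining.remove(choice)
    return rounds_passed


def bounded_agent(time_to_think: int) -> Agent:
    """Agent that only evaluates the first `time_to_think` actions it is shown."""
    def choose(beliefs: Dict[str, float]) -> str:
        considered = list(beliefs)[:time_to_think]
        return max(considered, key=beliefs.get)
    return choose


# Invented example numbers; dict order is the order in which actions are inspected.
P_GOAL = {"path_a": 0.80, "path_b": 0.85, "path_c": 0.30}

print(tolerant_experiment(bounded_agent(1), P_GOAL, tolerance=0.0))  # 0
print(tolerant_experiment(bounded_agent(1), P_GOAL, tolerance=0.1))  # 1
print(tolerant_experiment(bounded_agent(3), P_GOAL, tolerance=0.0))  # 3
```

With a strict test the most bounded agent is assigned no agency at all; allowing a small tolerance credits it with one round, which matches the idea of a degree rather than a yes/no verdict.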

Q: What if $R$ knows that we are testing and tries to deceive us?

A: Yes, this breaks the model.

Q: Counterfactuals are weird and might not exist. Even if they did, the Intentional Agency Experiment is impossible to execute in practice.

A: Despite the Intentional Agency Experiment being an idealisation, we may approximate it in the real world. For instance, if we'd like to test the intention of an ant to seek out a sugar source (as opposed to performing a random walk), we might first check whether it moves towards the sugar source; if it does, we block off this route towards the sugar source and see whether it tries to circumvent the blockage. In fact, it could be argued that this is the way we test agency in real life.
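A toy version of the ant test, as a minimal sketch. Everything concrete here is an assumption made for illustration: the 5x5 grid arena, the rule that a round is passed when the ant's next step shortens its shortest remaining path to the sugar, and the two example ants. Blocking the cell the ant just chose stands in for "blocking off the route".

```python
from collections import deque
from typing import Callable, FrozenSet, Iterator, Optional, Set, Tuple

Cell = Tuple[int, int]
Ant = Callable[[Cell, Cell, FrozenSet[Cell]], Cell]
SIZE = 5  # hypothetical 5x5 arena


def neighbours(cell: Cell, blocked: Set[Cell]) -> Iterator[Cell]:
    """Free cells one step away (no diagonal moves)."""
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE and nxt not in blocked:
            yield nxt


def distance(start: Cell, goal: Cell, blocked: Set[Cell]) -> Optional[int]:
    """Breadth-first shortest path length, or None if the goal is unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        cell, steps = queue.popleft()
        if cell == goal:
            return steps
        for nxt in neighbours(cell, blocked):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None


def obstacle_test(ant: Ant, start: Cell, sugar: Cell) -> int:
    """Count how many blocked routes the ant manages to circumvent."""
    blocked: Set[Cell] = set()
    rounds_passed = 0
    while True:
        here = distance(start, sugar, blocked)
        if here is None:
            break                      # no route left to test
        step = ant(start, sugar, frozenset(blocked))
        if step not in set(neighbours(start, blocked)):
            break                      # the ant walks into an obstacle
        after = distance(step, sugar, blocked)
        if after is None or after >= here:
            break                      # the step does not bring it closer
        rounds_passed += 1
        blocked.add(step)              # block off the route it just chose
    return rounds_passed


def sugar_seeking_ant(position: Cell, sugar: Cell, blocked: FrozenSet[Cell]) -> Cell:
    """Steps to the free neighbour with the shortest remaining path."""
    options = [c for c in neighbours(position, set(blocked))
               if distance(c, sugar, set(blocked)) is not None]
    return min(options, key=lambda c: distance(c, sugar, set(blocked)))


def stubborn_ant(position: Cell, sugar: Cell, blocked: FrozenSet[Cell]) -> Cell:
    """Always tries to step in the same direction, obstacle or not."""
    return (position[0], position[1] + 1)


print(obstacle_test(sugar_seeking_ant, (2, 0), (2, 4)))  # 3
print(obstacle_test(stubborn_ant, (2, 0), (2, 4)))       # 1
```

Both ants head straight for the sugar at first; only the obstacle separates the one that re-plans from the one that merely happened to be walking in the right direction.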



5 comments

How does this experiment distinguish between the rock and the ant? It seems to me like we can establish lots of times that no matter where we start the rock out in a bowl, it acts as though it wants to follow a particular trajectory towards the bottom of the bowl. [We could falsify the hypothesis that it wants to reach the bottom of the bowl as quickly as possible, as it doesn't rush there and then stop, but surely we aren't penalizing the rock for having a complicated goal.]

It seems like you're trying to rule out the rock's agency through the size of its action set:

At each round we assign more and more agency to R. Rather than a binary 'Yes, R has agency' or 'No, R has no agency', we imagine a continuum going from a rock, which has no possible actions, to an ant, which might pass some of the tests but not all, to humans and beyond.

This procedure doesn't establish what actions an agent has access to, just what actions it does in fact take, and whether or not those actions line up with our model of rational choice for particular goals. I don't see how this distinguishes between a rock that could either float or roll down the bowl and decides to roll down the bowl, and a rock that can only roll down the bowl.

[It seems to me that we actually determine such things through our understanding of physics and machinery; we look at a rock and don't see any actuators that would allow it to float, and thus infer that the rock didn't have that option. Or we have some 'baseline matter' that we consider as 'passive' in that it isn't doing anything that reflects agency or action, even though it does include potentially dramatic changes just through the normal update laws of physics, and then we have other matter that we consider as 'active' because it behaves quite differently from the 'passive' matter, using the same normal update laws of physics. But this gets quite troublesome if you look at it too hard.]

Dear Vaniver, thank you for sharing your thoughts. You bring up some important points.

The Intentional Agency Experiment is an idealisation, a model that tries to capture what 'intention' is supposed to be. How to translate this to the real world is often ambiguous and sometimes difficult. Similar issues crop up all over applications of pure science & mathematics. 'Real' applications usually involve, implicitly and explicitly, many different theoretical frameworks and highly simplified models as well as practical knowledge, various mechanical tricks, approximation schemes, etc.

When I ask that R has a set of actions, this only makes sense within a certain framework. A rock does not have 'actions', and neither does a human within a suitably deterministic framework. So we have to be careful; the setup only works when we have a model of the world that is suitably coarse-grained and allows for actions & counterfactuals. Like causality, intention & agency seems to me intensely tied up with an incomplete and coarse-grained model of the world.

To clear up any misunderstandings: if we have a physical object that at our level of coarse-graining may evolve indeterministically, for example a rock balancing on a mountain peak, we would not say it has possible actions. One could be under the impression that the actions that are considered are actually instantiated; but of course that is not what is meant. In the Intentional Agency Experiment the agent $R$ is only asked to give an action given a counterfactual (hypothetical) world. If you'd like, you can read 'potential action' everywhere I write 'action'. Actions are defined when we have an agent that we can ask to consider hypothetical scenarios and that outputs a certain 'potential action' given this counterfactual world.

We cannot ask a rock to consider hypothetical scenarios. Neither can we ask an ant to do so. Only a human or sophisticated robot can. Even a human or sophisticated robot will usually not consider just the 'clean' counterfactual but will also implicitly assume many other facts about the world. When we ask the agent $R$ to consider the counterfactual information $P(G \mid A_0) = 0$, we don't want it to assume other facts about the world $W$. So one should consider a world where the action $A_0$ is instantiated but an omnipotent being keeps the goal $G$ from happening at the last possible moment.

In practice, it is frequently difficult to ask agents to consider hypothetical counterfactuals and impossible to have them consider 'clean' counterfactuals (where all else is held fixed). Nevertheless, just as in Economics we assume Ceteris Paribus, considering highly idealised models & situations often turns out to be a useful tool.

Moreover, we may try to approximate/instantiate the Intentional Agency Experiment in the real world. However, sometimes those approximations may not be the 'right' implementation. As mentioned, an ant cannot be asked to consider hypothetical scenarios directly. Yet we may try to 'approximate' the piece of information $P(G \mid A_0) = 0$ by putting an obstacle in its way. If the ant tries and succeeds in overcoming the obstacle, the conclusion shouldn't be that 'it chose a different action'; rather, the correct conclusion is that putting this obstacle in its way was not a sufficient implementation of the mathematical act of asking the agent $R$ to consider $P(G \mid A_0) = 0$.

Yes, in practice situations arise where the implementation of a model can be ambiguous, very hard to implement etc. These are exactly the problems engineers and experimental physicists deal with; and these are interesting and important problems. But it should not prevent us from constructing highly simplified models.

Like causality, intention & agency seems to me intensely tied up with an incomplete and coarse-grained model of the world.

This seems right to me; there's probably a deep connection between multi-level world models and causality / choices / counterfactuals.

We cannot ask a rock to consider hypothetical scenarios. Neither can we ask an ant to do so.

This seems unclear to me. If I reduce intelligence to circuitry, it looks like the rock is the null circuit that does no information processing, the ant is a simple circuit that does some simple processing, and a human is a very complex circuit that does very complex processing. The rock has no sensors to vary, but the ant does, and thus we could investigate a meaningful counterfactual universe where the ant would behave differently were it presented with different stimuli.

Is the important thing here that the circuitry instantiate some consideration of counterfactual universes in the factual universe? I don't know enough about ant biology to know whether or not they can 'imagine' things in the right sense, but if we consider the simplest circuit that I view as having some measure of 'intelligence' or 'optimization power' or whatever, the thermostat, it's clear that the thermostat isn't doing this sort of counterfactual reasoning (it simply detects whether it's in state A or B and activates an actuator accordingly).

If so, this looks like trying to ground out 'what are counterfactuals?' in terms of the psychology of reasoning: it feels to me like I could have chosen to get something to drink or keep typing, and the interesting thing is where that feeling comes from (and what role it serves and so on). Maybe another way to think of this is something like "what are hypotheticals?": when I consider a theorem, it seems like the theorem could be true or false, and the process of building out those internal worlds until one collapses is potentially quite different from the standard presentation of a world of Bayesian updating. Similarly, when I consider my behavior, it seems like I could take many actions, and then eventually some actions happen. Even if I never take action A (and never would have, for various low-level deterministic reasons), it's still part of my hypothetical space, as considered in the real universe. Here, 'actions I could take' has some real instantiation, as 'hypotheticals I'm considering implicitly or explicitly', complete with my confusions about those actions ("oh, turns out that action was 'choke on water' instead of 'drink water'. Oops."), as opposed to some Platonic set of possible actions, and the thermostat that isn't considering hypotheticals is rightly viewed as having 'no actions' even tho it's more reactive than a rock.

This seems promising, but collides with one of the major obstacles I have in thinking about embedded agency; it seems like the descriptive problem of "how am I doing hypothetical reasoning?" is vaguely detached from the prescriptive question of "how should I be doing hypothetical reasoning?" or the idealized question of "what are counterfactuals?". It's not obvious that we have an idealized view of 'set of possible actions' to approximate, and if we build up from my present reasoning processes, it seems likely that there will be some sort of ontological shift corresponding to an upgrade that might break lots of important guarantees. That said, this may be the best we have to work with.

If R doesn't choose A=..., great! We're done; R doesn't have goal G.

I'm not sure we can draw this conclusion from the model you've given. Instead it feels to me like there is a missing assumption that's not spelled out in the math. Something that would allow us to say that this choice reveals something about R's relationship to G. Possibly some detail about how R operates, although I realize that's what you're trying to discover via this method. My guess is that you're going to run into something like a NFL theorem that prevents you from doing what you want to do (see for a relevant example in a slightly different domain).

I'd of course be happy to be proven wrong, as in I think you need a proof showing you can do what you want to do and why I can't draw some other conclusion as Vaniver suggests.

Dear gworley,

Thank you for the link; it seems like an interesting result. The kind of proof that you'd like, purporting to show that a mathematical model is 'always' applicable to a real-world situation, is surely nigh-impossible. If I try to summarise your main objection: the paper you linked to is a form of 'partial' No Go Theorem for IRL, and you think it likely that there are very general No Go Theorems which would apply to any possible implementation of theoretical models of agency. This is of course possible, but I find it unlikely; in the real world humans are able to discern agency. No doubt there are many hidden assumptions and subtleties, but I think they shouldn't stop us from trying to understand intention & agency in highly simplified situations.

In the model this is simply stated as given; we are trying to define intention. It might be that this definition does not accord with your intuitions about intention, but inside the model one cannot reject the conclusion. Of course, someone might come along with a more sophisticated model that draws even more subtle distinctions. I look forward to such a model.

At the risk of repeating myself, I cannot 'prove' that I can do what I want to do. There is simply a model, which might be accurate in some respects and inaccurate in others; in the model one can do whatever one likes (following the rules of the model); applying it to the real world is trickier. As stated earlier, there are many situations where the exact implementation is ambiguous or needs extra-theoretic assumptions. This is normal and occurs all over science. Galileo's results can be objected to in individual situations by referring to all kinds of contingencies like air resistance etc. Indeed, I have heard it said (but cannot confirm) that one Jesuit scholar refused to look through the telescope when Galileo found the moons of Jupiter, objecting to possible smudges on the telescope. I think you will agree that Galileo's perspective was more productive than that of the Jesuit scholar, even though of course Galileo couldn't prove that his model was applicable [the fact that the moons weren't smudges on the telescope is one of those extra-theoretical assumptions].

It is clear that explications are somehow bouncing off, as (it seems) you are the second person to object in this manner. Perhaps the nomenclature 'experiment' was ill-begotten.