Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Agenty things have the type signature (A -> B) -> A. In English: agenty things have some model (A -> B) which predicts the results (B) of their own actions (A). They use that model to decide what actions to perform: (A -> B) -> A.

In the context of causal DAGs, the model (A -> B) would itself be a causal DAG model - i.e. some Python code defining the DAG. Logically, we can represent it as:

M = “P[A] = f_A(A), P[B|A] = f_B(B, A)”

… for some given distribution functions f_A and f_B.
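For concreteness, here is one way the “Python code defining the DAG” idea might look - a minimal sketch in which the model M is just data mapping each node to its parents and a placeholder sampling function:

```python
import random

# A minimal sketch: the model M is just data - each node maps to
# (parents, sampling function). The particular samplers standing in for
# f_A and f_B are arbitrary placeholders.
M = {
    "A": ((), lambda: random.uniform(0.0, 1.0)),               # f_A: distribution of A
    "B": (("A",), lambda a: 2.0 * a + random.gauss(0.0, 0.1)), # f_B: B given A
}

def sample(model):
    """Sample every node of a DAG model, assuming nodes are listed in topological order."""
    values = {}
    for node, (parents, f) in model.items():
        values[node] = f(*(values[p] for p in parents))
    return values

print(sample(M))  # e.g. {'A': 0.42, 'B': 0.87}
```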

From an outside view, the model (A -> B) causes the choice of action A. Diagrammatically, that looks something like this:

The “cloud” in this diagram has a precise meaning: it’s the model for the DAG inside the cloud.

Note that this model does not contain any true loops - there is no loop of arrows. There’s just the Hofstadterian “strange loop”, in which node A depends on the model of later nodes, rather than on the later nodes themselves.

How would we explicitly write this model as a Bayes net?

The usual way of writing a Bayes net is something like:

P[X_1, …, X_n] = ∏_i P[X_i | X_pa(i)]

… but as discussed in the previous post, there’s really an implicit model M in there. Writing everything out in full, a typical Bayes net would be:

P[X_1, …, X_n | M] = ∏_i P[X_i | X_pa(i), M]

… with P[X_i | X_pa(i), M] = f_i(X_i, X_pa(i)) for each node’s distribution function f_i.
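A quick code sketch of the fully-written-out version (with arbitrary placeholder factors over binary variables): the model M is an explicit input to the joint-probability computation, and each node contributes a factor f_i(X_i, X_pa(i)).

```python
# A minimal sketch: the model M maps each node to (parents, distribution
# function f_i), and the joint probability takes M as an explicit input.
# The particular factors are arbitrary placeholders over binary variables.
M = {
    "A": ((), lambda a: 0.5),                             # f_A(A) = P[A]
    "B": (("A",), lambda b, a: 0.9 if b == a else 0.1),   # f_B(B, A) = P[B|A]
}

def joint_probability(model, values):
    """P[X_1, ..., X_n | M] = product over nodes i of f_i(X_i, X_pa(i))."""
    prob = 1.0
    for node, (parents, f_i) in model.items():
        prob *= f_i(values[node], *(values[p] for p in parents))
    return prob

print(joint_probability(M, {"A": 1, "B": 1}))  # 0.5 * 0.9 = 0.45
```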

Now for the interesting part: what happens if one of the nodes is agenty, i.e. it performs some computation directly on the model? Well, calling the agenty node A, that would just be a term P[A | M] = f_A(A, M) ... which looks exactly like a plain old root node. The model M is implicitly an input to all nodes anyway, since it determines what computation each node performs. But surely our strange loop is not the same as the simple model A -> B? What are we missing? How does the agenty node use M differently from other nodes?
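As a toy illustration of a node that performs its computation directly on the model (placeholder samplers again), the agenty node below takes M as its only input and picks the A-value with the highest estimated E[B] under M - so from the outside it looks just like a root node:

```python
import random

# Toy inner model: the samplers standing in for f_A and f_B are arbitrary placeholders.
M = {
    "A": ((), lambda: random.uniform(0.0, 1.0)),
    "B": (("A",), lambda a: 1.5 * a + random.gauss(0.0, 0.1)),
}

def agenty_A(model, n=1000):
    """The agenty node: it reads B's mechanism out of the model M and returns
    the A-value with the highest estimated E[B]. Viewed from outside, its only
    input is M - exactly like a root node."""
    _, f_B = model["B"]
    def expected_B(a):
        return sum(f_B(a) for _ in range(n)) / n
    candidates = [i / 100 for i in range(101)]
    return max(candidates, key=expected_B)

print(agenty_A(M))  # ~1.0, since E[B] = 1.5 * A is increasing in A
```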

What predictions would (A -> B) -> A make which differ from A -> B?

Answer: interventions/counterfactuals.

Modifying M

If A is determined by a computation on the model M, then M is causally upstream of A. That means that, if we change M - e.g. by an intervention - then A should change accordingly.

Let’s look at a concrete example.

We’ll stick with our (A -> B) -> A system. Let’s say that A is an investment - our agent can invest anywhere from $0 to $1. B is the payout of the investment (which of course depends on the investment amount). The “inner” model M describes how B depends on A.

We want to compare two different models within this setup:

  • A is chosen to maximize some expected function of net gains, based on the model M
  • A is just a plain old root node with some value (which just so happens to maximize expected net gains for the M we're using)

What predictions would the two make differently?

Well, the main difference is what happens if we change the model M, e.g. by intervening on B. If we intervene on B - i.e. fix the payout at some particular value - then the “plain old root node” model predicts that investment A will stay the same. But the strange loop model predicts that A will change - after all, the payout no longer depends on the investment, so our agent can just choose not to invest at all and still get the same payout.
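Here’s a sketch of that comparison in code, with an arbitrary stand-in payout function: we intervene with do(B = 1) by replacing B’s mechanism in M with a constant, then ask each model what A should be.

```python
# A concrete sketch of the comparison. The payout function is an arbitrary
# stand-in with diminishing returns on an investment a in [0, 1].

def payout(a):
    return 2.0 * a ** 0.5

# Inner model M: how the payout B depends on the investment A.
M = {"B": (("A",), payout)}

def intervene_on_B(model, value):
    """do(B = value): replace B's mechanism with a constant that ignores A."""
    new_model = dict(model)
    new_model["B"] = ((), lambda *args: value)
    return new_model

def agenty_A(model):
    """Strange-loop prediction: A is recomputed from the (possibly intervened)
    model, maximizing net gain B - A."""
    _, f_B = model["B"]
    candidates = [i / 100 for i in range(101)]
    return max(candidates, key=lambda a: f_B(a) - a)

# Plain-old-root-node prediction: A is a fixed value, which just so happens
# to be optimal under the original M, and does not respond to interventions.
root_node_A = agenty_A(M)

M_do = intervene_on_B(M, value=1.0)  # fix the payout at $1, whatever A is

print("strange-loop A under do(B=1):", agenty_A(M_do))  # 0.0 - no reason to invest
print("root-node A under do(B=1):  ", root_node_A)      # 1.0 - unchanged
```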

In game-theoretic terms: agenty models and non-agenty models differ only in predictions about off-equilibrium (a.k.a. interventional/counterfactual) behavior.

Practically speaking, the cleanest way to represent this is not as a Bayes net, but as a set of structural equations. Then we’d have:

A = g_A(M, u_A)
B = g_B(A, u_B)

… for some functions g_A, g_B and independent noise terms u_A, u_B.
This representation also makes the key point easier to see: the main feature which makes the system “agenty” is that M appears explicitly as an argument to a function, not just as prior information in probability expressions.
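In code, the structural-equation view might look like the following sketch (the payout mechanism and noise terms are placeholders): M shows up as an ordinary argument of the function computing A.

```python
import random

# Structural-equation sketch: the model M is an ordinary argument of the
# function computing the agenty node A. Payout and noise are placeholders.

def g_B(a, u_B):
    """Structural equation for B: payout as a function of investment, plus noise."""
    return 2.0 * a ** 0.5 + u_B

def g_A(M, u_A):
    """Structural equation for A: a computation on the model M itself.
    (The noise term u_A is unused here - the choice is deterministic given M.)"""
    candidates = [i / 100 for i in range(101)]
    return max(candidates, key=lambda a: M["expected_B"](a) - a)

# M records what the agent believes about B's mechanism.
M = {"expected_B": lambda a: 2.0 * a ** 0.5}

u_A, u_B = 0.0, random.gauss(0.0, 0.1)
A = g_A(M, u_A)
B = g_B(A, u_B)
print(A, B)  # A = 1.0: invest fully, since expected payout 2*sqrt(a) beats cost a
```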

Comments

There is a paper which I believe is trying to do something similar to what you are attempting here:

Networks of Influence Diagrams: A Formalism for Representing Agents’ Beliefs and Decision-Making Processes, Gal and Pfeffer, Journal of Artificial Intelligence Research 33 (2008) 109-147

Are you aware of it? How do you think their ideas relate to yours?

Very interesting, thank you for the link!

Main difference between what they're doing and what I'm doing: they're using explicit utility & maximization nodes; I'm not. It may be that this doesn't actually matter. The representation I'm using certainly allows for utility maximization - a node downstream of a cloud can just be a maximizer for some utility on the nodes of the cloud-model. The converse question is less obvious: can any node downstream of a cloud be represented by a utility maximizer (with a very artificial "utility")? I'll probably play around with that a bit; if it works, I'd be able to re-use the equivalence results in that paper. If it doesn't work, then that would demonstrate a clear qualitative difference between "goal-directed" behavior and arbitrary behavior in these sorts of systems, which would in turn be useful for alignment - it would show a broad class of problems where utility functions do constrain.

Glad you liked it.

Another thing you might find useful is Dennett's discussion of what an agent is (see first few chapters of Bacteria to Bach). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.

Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.

Why aren't you notationally distinguishing between "actual model" versus "what the agent believes the model to be"? Or are you, and I missed it?

On reflection, there's a better answer to this than I originally gave, so I'm trying again.

"What the agent believes the model to be" is whatever's inside the cloud in the high-level model. That's precisely what the clouds mean. But the clouds (and their contents) only exist in the high-level model; the low-level model contains no clouds. The "actual model" is the low-level model.

So, when we talk about the extent to which the high-level and low-level models match - i.e. what queries on the low-level model can be answered by queries on the high-level model - we're implicitly talking about the extent to which the agent's model matches the low-level model.

The high-level model (at least the part of it within the cloud) is "what the agent believes the model to be".

EDIT: This answer isn't very good, see my other one.

Good question. We could easily draw a diagram in which the two are separate - we'd have the "agent" node reading from one cloud and then influencing things outside of that cloud. But that case isn't very interesting - most of what we call "agenty" behavior, and especially the diagonalization issues, are about the case where the actual model and the agent's beliefs coincide. In particular, if we're talking about ideal game-theoretic agents, we usually assume that both the rules of the game and each agent's strategy are common knowledge - including off-equilibrium behavior.

So, for idealized game-theoretic agents, there is no separation between the actual model and the agent's model - interventions on the actual model are reflected in the agent's model.

That said, in the low-level model, the map and the territory will presumably always be separate. "When do they coincide?" is implicitly wrapped up in the question "when do non-agenty models abstract into agenty models?". I view the potential mismatch between the two models as an abstraction failure - if they don't match, then the agency-abstraction is broken.

I think there's a contradiction here. Idealized game-theoretic agents are the OPPOSITE of what we call "agenty behavior". Executing a knowable strategy is the trivial part. Agenty-ness in the human rationalist sense is about cases where the model is too complicated to reflectively know or analyze formally.


First of all, these are definitely not opposites, and game-theoretic agency is about much more than just "executing a knowable strategy". The basic point of embedded agency and the like is that, because of reflectivity etc, idealized game-theoretic agent behavior can only exist (or even be approximated) in the real world at an abstract level which throws out some information about the underlying territory. Game theoretic agency is still the original goal of the exercise; reflectivity and whatnot enter the picture because they're a constraint, not because they're part of what we mean by "agentiness".

In terms of human rationality - of the sort in the sequences - the recurring theme is that we want to approximate idealized game-theoretic agency as best we can, despite complicated models, reflectivity, etc. Again, game-theoretic agency is the original goal; approximations enter the picture because complexity is a constraint. Nothing about that is contradictory.

Tying it back to the OP: we have a low-level model which may be too complex for "the agent" to represent/reason about directly. We abstract that into a high-level model. The agent is then an idealized game-theoretic agent within the high-level model, but the high-level model itself is lossy. The agent's own model coincides with the high-level model - that's the meaning of the clouds. But that still leaves the question of whether and to what extent the high-level model accurately reflects the low-level model - that's the abstraction part.

Interesting take. When I see "agenty" used on this site and related blogs, it usually seems to map to something like self-actualization or perceived locus of control - concepts from more psychological frameworks. I'd not thought too much about how different (or similar) it was to "agent" in decision theory and game-theoretical usage, which is not about the feeling of control, but about behavior selection according to legible reasoning.

Again, decision theory/game theory are not about "executing a knowable strategy" or "behavior selection according to legible reasoning". They're about what goal-directed behavior means, especially under partial information and in the presence of other goal-directed systems. The theory of decisions/games is the theory of how to achieve goals. Whether a legible strategy achieves a goal is mostly incidental to decision/game theory - there are some games where legibility/illegibility could convey an advantage, but that's not really something that most game theorists study.

There are two different ways of reading Scott's kickoff of the type signature, though.

You took it in the direction of: a term of type A -> B is a sort of belief about how actions turn into outcomes.

But it's plausible to me that Scott meant "the item from (A -> B) -> A is actually the underlying reality". The idea there would be that (A -> B) -> A isn't a comment that directly concerns the implementation, but a philosophical statement about embedded agency. Items of type A -> B needn't be models, they could be the actual configurations of reality; agents are these terms that we can, either through proper prediction or post-hoc explanation, understand as turning physical configurations of causal arrows from actions to the world into actions. Like, it's a thing we observe them doing. Then to implement an (A -> B) -> A you would need a whole bunch of machinery, including subjective epistemic rationality (which might look a little like A -> B) which one would hope converges on the actual A -> B with learning, and of course utility so you know what you want to do with the second A in the signature. But the real-world implementation's type signature would be a bit more intricate than the philosophy's type signature.

I know philosophers of science probably duel-at-dawn about the idea that you can give a type signature to underlying reality ("physical configuration of causal arrows") rather than to a subjective model of it.

(I have a post about all this coming out soon, will edit this comment with a link to it) EDIT: the post is out