Finding agents in raw dynamics
Related: Formalizing «Boundaries» with Markov blankets, «Boundaries» Sequence
Suppose you are given time series data from a simulated ecology (such as Smallville). You can usually tell what the agents in there are doing, at least if the variables are properly labeled. If they are not labeled but are just numeric values, your first task would be to reverse-engineer the mapping to the conceptual model behind the simulation (or whatever generated the time series data). That might not be feasible, but let's ignore that for now, because there is a bigger problem, one that comes with any labelling, whether existing or reverse-engineered: you introduce an ontology. Once you call something an agent (label="player1_position"), you assume that those variables belong to that agent. That works well because our intuitions about agents are pretty good.
Until it doesn't. Our intuitions attribute agency where there is none: our ancestors anthropomorphized nature, and we may attribute intent where there is only coupling. Still, it is more dangerous to overlook an agent than to see one too many.
But we may miss intent when it is distributed or doesn't neatly fit into our ontologies of physically bounded agents.
If we want to find an agent, especially a potentially powerful agent that may work very differently from our intuitions, we need a method that can discover agents in raw, unlabeled data without relying on a prior ontology (or, in other words, without a unit that is already known to be an agent).
That's two problems: 1. doing without a prior ontology, and 2. finding agents in raw data.
For the first, as part of agent foundations work, johnswentworth and others have proposed modeling an agent in terms of its boundary, which shields inner states from outer change. A natural formalization of such a boundary is a Markov blanket.

Quick recap: The blanket is a partitioning[1] of all variables into four sets:
- internal states I (the agent's inner workings),
- external states E (the rest of the world),
- sensory states S (external influences on the inside), and
- active states A (internal influences on the outside),
where the blanket states (S, A) mediate all interaction between inside and outside.
There are some problems with Markov blankets, but these seem solvable:
Thus, it may make more sense to talk about invariant ε-boundaries when referring to such agents: the Markov blanket is not determined from raw observables but over parameters that are invariant[2] under transformation, and ε is chosen to minimize the predictive regret of predicting internal from external (invariant) variables.
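One way to write this relaxed shielding requirement is via conditional mutual information (a sketch in my notation, writing $I(\cdot\,;\cdot \mid \cdot)$ for conditional mutual information and $\mathcal{I}, \mathcal{E}, \mathcal{S}, \mathcal{A}$ for the internal, external, sensory, and active variables from above):

$$I\big(\mathcal{I}_t\,;\ \mathcal{E}_t \mid \mathcal{S}_t, \mathcal{A}_t\big) \;\le\; \varepsilon$$

i.e., given the blanket states, the internal states carry at most ε nats of information about the external states.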
Given such a formalization of an agent as a Markov blanket, how can we find it in the raw data? While there are efficient algorithms to find the blanket of a single variable, and there is some recent work on the relation between Markov Blanket Density and Free Energy Minimization, I'm not aware of any implementation that uses blankets to find agents. One problem is that we do not have a nice causal graph[3] that we could inspect structurally for the blanket property. We have to check a lot of variables statistically.
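To make "check statistically" concrete, here is a minimal sketch of such a test (hypothetical helper names, not the repo's actual code): discretize the columns, then estimate the conditional mutual information between a candidate internal set and the external variables given the candidate blanket. The partition qualifies as an ε-boundary if the score stays below ε.

```python
import numpy as np

def discretize(x, bins=8):
    """Equal-width binning of a 1-D series into integer symbols."""
    edges = np.linspace(x.min(), x.max(), bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)

def cond_mutual_info(x, y, z):
    """Plug-in estimate of I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    for integer-valued columns (2-D arrays of shape (T, d))."""
    def H(*parts):
        joint = np.concatenate(parts, axis=1)            # columns of one joint variable
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def blanket_score(data, internal, external, blanket, bins=8):
    """Epsilon-blanket test: how much do the internal columns still 'see' of the
    external columns once we condition on the blanket columns? Lower is better."""
    binned = np.column_stack([discretize(data[:, j], bins) for j in range(data.shape[1])])
    return cond_mutual_info(binned[:, internal], binned[:, external], binned[:, blanket])

# hypothetical usage: score = blanket_score(data, internal=[1], external=[3, 4, 5], blanket=[0, 2])
```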
Thus, as a first contribution, I offer an implementation of Unsupervised Agent Discovery (UAD), i.e., of how it might be possible to find such agents in unlabeled raw time series data.
The implementation takes a raw dataset and executes these steps:
The implementation also includes a simulation of a configurable number of agents (simple state machines acting as simple controllers) in a simplified shared environment, from which a dataset of all their observables is generated.
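For a concrete picture of the kind of data this produces, here is a toy stand-in I wrote for this post (not the repo's actual simulator): each agent is a two-state controller that senses a shared field at its position, flips its internal mode on a threshold, and moves accordingly; the output is just an unlabeled matrix of numbers.

```python
import numpy as np

def simulate(n_agents=3, steps=2000, seed=0):
    """Toy simulator: returns an unlabeled (steps x 3*n_agents) array with
    interleaved columns [sense_i, mode_i, move_i] per agent."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0, 10, n_agents)                    # positions in a 1-D world
    mode = rng.integers(0, 2, n_agents)                   # internal two-state controllers
    rows = []
    for t in range(steps):
        field = np.sin(0.5 * pos + 0.01 * t)              # shared environment signal
        sense = field + 0.05 * rng.normal(size=n_agents)  # noisy sensory states
        mode = np.where(sense > 0, 1 - mode, mode)        # internal state update
        move = np.where(mode == 1, 1.0, -1.0)             # active states
        pos = pos + 0.1 * move
        rows.append(np.column_stack([sense, mode, move]).ravel())
    return np.array(rows)

data = simulate()   # shape (2000, 9): no labels, just columns of numbers
```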
From implementing and running these simulations, I learned:
Discovering agents is nice. Now that we can do that, what else can we do? We can try to understand the internal states better. We can try to understand how the agent represents itself and its environment. This is the second contribution of the post (note: this is not yet part of the implementation).
We can treat memory as a compression[4] of past inputs that inform future actions.
For each internal variable m∈I and lag k, compute a lagged predictive information such as $\Delta_m(k) = I(m_{t-k};\, I_{t+1} \mid S_t)$, i.e., how much the past of m tells us about the agent's next internal state beyond what the current sensory input already does. If Δm(k) is large, then the past of m predicts the agent's next internal state. We have to be careful, because a variable may look like memory just because it is changing slowly.
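A minimal sketch of estimating this from discretized series (hypothetical helper names; the same plug-in conditional-mutual-information idea as above):

```python
import numpy as np

def plugin_cmi(x, y, z):
    """Plug-in I(X; Y | Z) for 1-D integer arrays: H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    def H(*cols):
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def memory_score(m, internal, sensory, k):
    """Delta_m(k): how much m's value k steps ago predicts the next internal state,
    beyond what the current sensory reading already predicts. Inputs are
    integer-discretized 1-D series of equal length."""
    T = len(m)
    past_m = m[: T - 1 - k]        # m_{t-k}   for t = k .. T-2
    s_now  = sensory[k : T - 1]    # S_t
    i_next = internal[k + 1 :]     # I_{t+1}
    return plugin_cmi(past_m, i_next, s_now)

# hypothetical usage: does an agent's "mode" column carry memory at lag 5?
# score = memory_score(mode_col, mode_col, sense_col, k=5)
```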
If we can track the inputs and outputs of an agent, we can try to infer the policy implied by this input-output relation. We can infer a reward function R(I,A) with inverse reinforcement learning (IRL). Given a prior P(R) that weights reward functions[5], we can use the standard Bayesian formulation $P(R \mid I, A) \propto P(A \mid I, R)\, P(R)$. We could call the inferred R the agent's goals. Though with such simple modeling, we can't yet represent issues like mesa-optimization.
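As a rough sketch of what this could look like over a small, discrete hypothesis space of reward functions (a hypothetical setup with a Boltzmann-rational policy model; not part of the current implementation):

```python
import numpy as np

def bayesian_irl_posterior(candidate_rewards, states, actions, n_actions,
                           beta=2.0, prior=None):
    """P(R | trajectory) ∝ P(R) * Π_t π_R(a_t | s_t), where π_R is a softmax
    ("Boltzmann-rational") policy over the candidate reward R(s, a)."""
    n = len(candidate_rewards)
    log_post = np.log(np.full(n, 1.0 / n) if prior is None else np.asarray(prior))
    for i, R in enumerate(candidate_rewards):
        for s, a in zip(states, actions):
            q = beta * np.array([R(s, b) for b in range(n_actions)])
            log_post[i] += q[a] - (q.max() + np.log(np.exp(q - q.max()).sum()))
    log_post -= log_post.max()                      # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# hypothetical usage: which of two toy goals better explains the observed actions?
goes_right = lambda s, a: 1.0 if a == 1 else 0.0
goes_left  = lambda s, a: 1.0 if a == 0 else 0.0
print(bayesian_irl_posterior([goes_right, goes_left],
                             states=[0, 1, 2, 3], actions=[1, 1, 1, 0], n_actions=2))
```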
If we can find two agents X and Y and their memory, we can ask whether one of them represents the other; technically, whether some memory variable m of X predicts Y beyond what X's current input already does: $\Delta_m = I(m_t;\, Y_{t+1} \mid S^X_t)$. A large Δm means that some part of X is predicting Y.
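The same kind of score works across agents; a small sketch (again with hypothetical names and the same plug-in estimator):

```python
import numpy as np

def plugin_cmi(x, y, z):
    """Same plug-in estimate of I(X; Y | Z) as in the memory sketch above."""
    def H(*cols):
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def representation_score(m_x, y_state, s_x):
    """Delta_m: does X's memory variable m predict Y's next state beyond what X's
    current sensory input already does? (integer-discretized 1-D series)"""
    return plugin_cmi(m_x[:-1], y_state[1:], s_x[:-1])
```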
I'm not sure whether this refutes How an alien theory of mind might be unlearnable or not.
Once you can track what agents know about each other and what objectives they follow, you can, in principle, derive how well these agents do or do not cooperate.
You can calculate:
Cooperation is favored when the fraction $\kappa = \frac{rB}{C} > 1$, where B is the benefit and C the cost of cooperating, and the relatedness r is replaced by a measured alignment between the agents, e.g., the residual mutual information between their inferred rewards or internal models[6]. We can call κ a cooperativity index; a generalization of Hamilton's rule[7]. If we look at the full induced cooperation graph between agents, where edges are weighted by κ, we can use percolation theory to determine at which level cooperation becomes universal (a giant component), as explored in The Evolution of Trust.
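A sketch of that last step (hypothetical helper names, using networkx for the graph handling): add an edge wherever the pairwise κ clears the threshold and check how much of the population ends up in the largest cooperating cluster.

```python
import networkx as nx
import numpy as np

def cooperation_graph(kappa, threshold=1.0):
    """Graph over agents with an edge wherever the pairwise cooperativity
    index kappa[i, j] clears the Hamilton-style threshold (kappa > 1)."""
    n = kappa.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if kappa[i, j] > threshold:
                G.add_edge(i, j, weight=kappa[i, j])
    return G

def giant_component_fraction(G):
    """Fraction of agents in the largest connected (mutually cooperating) cluster."""
    if G.number_of_nodes() == 0:
        return 0.0
    largest = max(nx.connected_components(G), key=len)
    return len(largest) / G.number_of_nodes()

# hypothetical usage with a random, symmetric kappa matrix for 50 agents
rng = np.random.default_rng(0)
kappa = rng.uniform(0.5, 1.5, size=(50, 50))
kappa = (kappa + kappa.T) / 2
print(giant_component_fraction(cooperation_graph(kappa)))
```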
The promise of this approach is that we could look at a complex system with many potential agents large and small, human, legal, or artificial, and determine at least an approximation of the cooperation structure and whether there are adversarial tendencies.
With all these promises, there are potentially serious problems.
The best blanket algorithm fails if we cannot observe sufficient detail, especially the internals of the agents of interest; we certainly can't look inside humans. The theoretical arguments above even model why agents wouldn't want transparency in some cases. And if we use proxies, we lose reliability. As a consolation, since we are most interested in powerful AI, we might at least have access to its internals.
If any variable we overlook is a common cause, we can get the agent boundaries wrong, fail to separate agents, and thereby fail to identify the crucial ones. I'm mostly thinking about complex LLM-like agents here. An LLM is distributed across multiple computers and may depend on human operators for its function, but still be powerful. Agents running on LLMs are also hard to find with slot-based approaches.
Things may change. The method, as currently conceived, is sample-hungry, and if anything changes during the sampling period, this change will show up in the data and may interfere with the discovery. Also, some types of learning that change the agent's policy (which might even happen in response to the discovery process) may manifest only after the sampling period.
There are concerns that the calculations might not be statistically stable and, especially, not computationally feasible. Currently, the algorithm is combinatorial in the number of variables. My current argument here is that we have an existence proof: humans have learned to identify agents in complex data fairly effectively, and it should be possible to reproduce that in a specialized algorithm.
At this point, Unsupervised Agent Discovery doesn't yet give you an algorithm that discovers agents in real-world data. But it provides a precise way to talk about agents, their goals and cooperation, and many other things we care about that usually require an a priori notion of an agent but can now be grounded in physics.
Github repo: https://github.com/GunnarZarncke/agency-detect/tree/master
My initial dense LaTeX/PDF writeup of UAD can be found here.
Many thanks to the reviewers Jonas Hallgren, Chris Pang, and Peter Kuhn. Additional thanks go to the team at AE studio that supported the development of the experiments and write-up with time and compute.
Feedback welcome, especially from the agent foundations people.
by using the condition that the internal states are conditionally independent of the external states given the sensory and active (blanket) states,
from M. D. Kirchhoff, T. Parr, E. Palacios, K. Friston, and J. Kiverstein, “The Markov blankets of life: autonomy, active inference and the free energy principle,” J. R. Soc. Interface, vol. 15, no. 138, 2018.
The invariants may themselves depend on the agents' dynamics, making a simple layer-by-layer inference infeasible.
If we could intervene on the simulation/experiment, we could determine the causal structure as done in the DeepMind Discovering Agents paper. That is also how humans check whether something is an agent or not: we prod it and see if it evades. It is a promising direction but was beyond the scope of this work.
In evolutionary environments, agents that keep memory of past inputs relevant to survival-affecting future outputs will outcompete agents without such memory. And agents with a more compact memory will outcompete agents with a larger memory but the same predictive effect.
This weighting P(R) is often seen as arbitrary or in need of justification, but here we are closer to the underlying substrate. In most environments of interest, we can argue that there will be entropic forces that select for simpler policies and lower prediction errors.
This residual mutual information between agents' actions, internal models, or rewards does not indicate a failure of separation. It captures alignment without leakage, e.g., from shared task structure, common external drivers, following the same conventions, or algorithmic similarity.
Hamilton's rule says that genes for a particular behavior should increase in frequency when rB > C, where r is the genetic relatedness, B the reproductive benefit, and C the reproductive cost.
W. D. Hamilton, “The genetical evolution of social behaviour,” J. Theor. Biol., vol. 7, no. 1, pp. 1–16, 1964.