Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: Posting for blog day at MSFP! More trying to figure out what the right definitions are than saying anything concrete. I still don’t really know what agents are, and none of this is math yet. I’m hoping to develop these (and other) ideas more in the future, so any feedback is greatly appreciated.

My Naive Agent Pre-model

Y’know -- agents are, like, things that do things. They have utility functions and stuff. They make choices, whatever that means.

Unfortunately this ‘definition’ isn’t sufficient for making any concrete claims about how agents behave, so I’ve been thinking about some models that might be, and this post contains ideas that came out of that.

Irreducible Agents vs Agent Clusters

Irreducible Agents

An irreducible agent is what I’m calling something that optimizes a really simple utility function in some straightforward sense -- maybe it just does gradient descent or something. If it has a choice of two actions, it picks the one that results in higher utility every time. (This concept needs a precise definition, but I’m not sure what the right definition is yet, so I’m just trying to point at the thing).
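As a rough illustration of the thing being pointed at (not a proposed definition), an irreducible agent can be sketched as a bare argmax over the available actions. The names `irreducible_choice`, `actions`, and `utility` are made up for this sketch, and the utility table is a toy.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")  # hypothetical action type, not something the post defines


def irreducible_choice(actions: Iterable[A], utility: Callable[[A], float]) -> A:
    """Pick the action with the highest utility; ties go to the first one seen."""
    return max(actions, key=utility)


# Toy usage: a two-action choice with a made-up utility table.
toy_utility = {"stay": 0.0, "move": 1.0}
choice = irreducible_choice(toy_utility, lambda a: toy_utility[a])
```

Here `choice` is the higher-utility action every time, which is the only behavior the informal description asks for.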

It seems like when people talk about agents in the abstract, this is the kind of agent they often mean. But people also sometimes talk about things like humans as agents, and we aren’t really like that. Humans seem to be at least partially made up of smaller agenty parts that have different and sometimes conflicting goals -- more like what I’m calling ‘agent clusters’.

Agent Clusters

If you glue together a bunch of irreducible agents in a reasonable way, you could still get something that looks agenty. I can think of a couple of ways to think about gluing agents together; there are also probably better framings:

One way is by having a sort of meta-agent that turns agents on and off by some criteria, and the subagent that is turned on gets to decide what to do. I’m not sure this framing makes sense; if you can think of the meta-agent as having a utility function, it seems like it collapses to just be an irreducible agent after all. But maybe having the meta-agent use some simple rules that don’t constitute a true utility function could work to build agent clusters that efficiently approximate a more complex utility function than they could explicitly represent.
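A minimal sketch of the meta-agent framing, assuming the "simple rules" are an ordered list of (condition, subagent) pairs with a fallback subagent. Everything here (`cluster_choice`, the `hunger` state variable, the two subagents) is hypothetical and only meant to make the structure concrete.

```python
def cluster_choice(state, actions, rules, fallback):
    """Meta-agent: switch on the first subagent whose condition holds, then let it argmax."""
    active = fallback
    for condition, subagent_utility in rules:
        if condition(state):
            active = subagent_utility
            break
    return max(actions, key=lambda a: active(state, a))


# Two toy subagents, each with a very simple utility function.
def food_utility(state, action):
    return 1.0 if action == "eat" else 0.0


def rest_utility(state, action):
    return 1.0 if action == "sleep" else 0.0


# The meta-agent's rule is a plain threshold, not a utility function of its own.
RULES = [(lambda s: s["hunger"] > 0.5, food_utility)]
ACTIONS = ["eat", "sleep"]

hungry_action = cluster_choice({"hunger": 0.9}, ACTIONS, RULES, rest_utility)
sated_action = cluster_choice({"hunger": 0.1}, ACTIONS, RULES, rest_utility)
```

In this toy case the cluster as a whole could also be described by a single state-dependent utility function, which is exactly the collapse worry raised above; the sketch does not resolve it.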

Another way is to think of the agents as voters that each rank actions by their own utility function, with some voting system deciding what the resulting system does. (I’m pretty sure Arrow’s Impossibility Theorem doesn’t kill this as a possible structure. It says that no ranked voting rule over three or more options can satisfy all of its fairness conditions for every possible set of voter preferences, which is not surprising and still leaves room for particular sets of agents where aggregation works fine.)
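The electorate framing can be sketched with any concrete voting rule. The block below uses a Borda count purely as an example; the post does not pick a rule, and `borda_choice` and the toy voters are assumptions made for illustration.

```python
def borda_choice(actions, utilities):
    """Each voter ranks actions by its own utility; the highest total Borda score wins.

    Ties (within a voter's ranking or in the final score) are broken by the
    order of `actions`, which is an arbitrary choice for this sketch.
    """
    actions = list(actions)
    scores = {a: 0 for a in actions}
    for utility in utilities:
        ranked = sorted(actions, key=utility, reverse=True)  # best first
        for rank, action in enumerate(ranked):
            scores[action] += len(actions) - 1 - rank
    return max(actions, key=lambda a: scores[a])


# Three toy voters with made-up utility tables over actions x, y, z.
tables = [
    {"x": 3, "y": 2, "z": 1},
    {"y": 3, "x": 2, "z": 1},
    {"z": 3, "y": 2, "x": 1},
]
voters = [lambda a, t=t: t[a] for t in tables]
winner = borda_choice(["x", "y", "z"], voters)
```

With these tables the Borda scores are x = 3, y = 4, z = 2, so the cluster picks `y` even though only one voter ranks it first.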

(Note that these two framings could be combined -- the meta-agent could activate multiple subagents at the same time and aggregate the opinions of only the activated subagents via some voting system. I’m not sure if this is useful.)

Some questions:

Q: Are these two framings (the meta-agent framing and the electorate framing) equivalent? Are there other options or can any reasonable cluster be described in this way?

Q: Given a description of a (purported) agent as a cluster of irreducible agents in this way, what should it mean for it to be coherent and how can we tell whether it is coherent? One idea is that using the electorate model, ‘coherent’ could mean that it is possible to satisfy the desiderata from Arrow’s theorem.

Q: If you use the electorate model, are there simple conditions on the utility functions of the voters under which a coherent agent can be formed?
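One concrete way to probe the coherence questions above, offered as a sketch under assumptions rather than an answer: take pairwise majority voting as the aggregation rule and check whether the voters' utilities produce a preference cycle. Using "no majority cycle" as a proxy for coherence is an assumption of this example, and `has_majority_cycle` and `ranking_to_utility` are hypothetical helpers.

```python
from itertools import permutations


def majority_prefers(a, b, utilities):
    """True if strictly more voters rank a above b than b above a."""
    votes_for_a = sum(1 for u in utilities if u(a) > u(b))
    votes_for_b = sum(1 for u in utilities if u(b) > u(a))
    return votes_for_a > votes_for_b


def has_majority_cycle(actions, utilities):
    """True if some triple a, b, c has a beating b, b beating c, and c beating a."""
    return any(
        majority_prefers(a, b, utilities)
        and majority_prefers(b, c, utilities)
        and majority_prefers(c, a, utilities)
        for a, b, c in permutations(actions, 3)
    )


def ranking_to_utility(ranking):
    """Turn a best-to-worst list into a utility function (hypothetical helper)."""
    scores = {action: -position for position, action in enumerate(ranking)}
    return lambda action: scores[action]


cyclic_voters = [
    ranking_to_utility(r)
    for r in (["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"])
]
aligned_voters = [
    ranking_to_utility(r)
    for r in (["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"])
]
cyclic_result = has_majority_cycle("ABC", cyclic_voters)
aligned_result = has_majority_cycle("ABC", aligned_voters)
```

The first electorate is the classic Condorcet profile, where majorities prefer A to B, B to C, and C to A. The second aggregates to a consistent A, B, C ordering. The check only looks for three-element cycles, which is enough when there are no pairwise ties.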

3 comments

The idea of agent clusters seems closely related to my idea about modeling dynamically inconsistent agents. In my model, each subagent controls particular states. More general game-theoretic models can be considered, but they seem to have worse properties.

Regarding the broader question, about how to bridge the gap between agents that are "ideal, perfectly rational consequentialists" and "realistic" agents (e.g. humans), more factors that can be relevant are:

  • Realistic agents have computational resource bounds.

  • Realistic agents might be learning algorithms with suboptimal sample complexity (i.e. it takes them longer to learn than a perfect agent).

  • Realistic agents might only remain coherent within some subspace of the state space (although that might be possible to model using dynamic inconsistency).

  • We can also consider agents that have some "Knightian uncertainty" about their own utility function. For example, we can consider a convex set in the space of utility functions, and have the agent follow the maximin policy w.r.t. this convex set. As a more specific example, we can consider an instrumental reward function that is only defined on some affine subspace of the instrumental state space, and consider all extensions of it to the entire instrumental state space that don't increase its range.
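The maximin idea in the last bullet can be sketched for a small finite case. This assumes the convex set is represented by a finite list of candidate utility functions (for instance the vertices of a polytope) and that the agent only considers pure actions; a full maximin policy may need to randomize, which this sketch ignores. The names and numbers are made up.

```python
def maximin_choice(actions, candidate_utilities):
    """Pick the action whose worst-case utility over the candidates is highest."""
    return max(actions, key=lambda a: min(u(a) for u in candidate_utilities))


# Two candidate utility functions the agent cannot choose between.
optimistic = {"safe": 1.0, "risky": 5.0}
pessimistic = {"safe": 1.0, "risky": -5.0}
candidates = [lambda a: optimistic[a], lambda a: pessimistic[a]]

robust_action = maximin_choice(["safe", "risky"], candidates)
```

For a fixed action, utility is linear in the mixture weights over candidates, so the worst case over the convex hull is attained at one of the listed candidates. That is why checking only the finite list is enough here.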

You should check out "Why Subagents?". That post starts with the usual argument that acyclic preferences imply existence of a utility function, then shows that if we relax some of the assumptions, we actually get committees of utility-maximizers. Markets are my go-to example: they satisfy exactly the same "inexploitability" notions used by utility-existence proofs, but a market doesn't have a utility function in general, because it has internal degrees of freedom which result in path-dependent aggregate preferences.
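A toy version of the committee picture, assuming the committee accepts a trade only when no member is worse off and at least one is better off. That unanimity rule is an assumption of this sketch, and the states and utilities are invented. It shows how the same offers in a different order can leave the committee in different final states.

```python
def committee_accepts(current, proposed, utilities):
    """Accept a trade only if no member loses and at least one member gains."""
    no_one_loses = all(u(proposed) >= u(current) for u in utilities)
    someone_gains = any(u(proposed) > u(current) for u in utilities)
    return no_one_loses and someone_gains


def run_offers(start, offers, utilities):
    """Walk through a sequence of offered states, moving whenever the committee accepts."""
    state = start
    for proposed in offers:
        if committee_accepts(state, proposed, utilities):
            state = proposed
    return state


# Two members who agree anchovy is worst but disagree about the top choice.
member_1 = {"mushroom": 2, "pepperoni": 1, "anchovy": 0}
member_2 = {"pepperoni": 2, "mushroom": 1, "anchovy": 0}
committee = [lambda s: member_1[s], lambda s: member_2[s]]

end_a = run_offers("anchovy", ["mushroom", "pepperoni"], committee)
end_b = run_offers("anchovy", ["pepperoni", "mushroom"], committee)
```

In both runs the committee never accepts a trade that makes any member worse off, yet where it ends up depends on the order of offers.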

Ah, MIRI summer fellows! Maybe that's why there's so many posts today.

I think that if there's a dichotomy, it's "abstract/ideal agents" vs. "physical 'agents'".

Physical agents, like humans, don't have to be anything like agent clusters - there doesn't have to be any ideal agent hiding inside them. Instead, thinking about them as agents is a descriptive step taken by us, the people modeling them. The key philosophical technology is the intentional stance.

(Yeah, I do feel like "read about the intentional stance" is this year's "read the sequences")

On to the meat of the post: agents are already very general, and they become really general if you allow preferences over world-histories. Maybe it makes more sense to think of these things as languages in which some things are simple and others are complicated? At which point I think you have a straightforward distance function between languages (how surprising one language is, on average, to another), but no sense of equivalency aside from identical rankings.