Meta: this project is wrapping up for now. This is the second of probably several posts dumping my thought-state as of this week.
It is an empirical fact that we can predict the day-to-day behavior of the world around us - positions of trees or buildings, trajectories of birds or cars, color of the sky and ground, etc - without worrying about the details of plasmas roiling in any particular far-away star. We can predict the behavior of a dog without having to worry about positions of individual molecules in its cells. We can predict the behavior of reinforced concrete without having to check it under a microscope or account for the flaps of butterfly wings a thousand kilometers away.
It didn’t have to be this way. We could imagine a universe which looks like a cryptographic hash function, where most bits are tightly entangled with most other bits and any prediction of anything requires near-perfect knowledge of the whole system state. But empirically, our universe does not look like that.
Given that we live in a universe amenable to abstraction, what sorts of agents should we expect to evolve? What can we say about agency structure and behavior in such a universe? This post comes at the question from a few different angles, looking at different properties I expect evolved agents to display in abstraction-friendly universes.
Convergent Instrumental Goals
The basic idea of abstraction is that any variable X is surrounded by lots of noisy unobserved variables, which mediate its interactions with the rest of the universe. Anything “far away” from X - i.e. anything outside of those noisy intermediates - can only “see” some abstract summary information f(X). Anything more than a few microns from a transistor on a CPU will only be sensitive to the transistor’s on/off state, not its exact voltage; the gravitational forces on far-apart stars depend only on their total mass, momentum and position, not on the roiling of plasmas.
One consequence: if an agent’s goals do not explicitly involve things close to X, then the agent cares only about controlling f(X). If an agent does not explicitly care about exact voltages on a CPU, then it will care only about controlling the binary states (and ultimately, the output of the computation). If an agent does not explicitly care about plasmas in far-away stars, then it will care only about the total mass, momentum and position of those stars. This holds for any goal which does not explicitly care about the low-level details of X or the things nearby X.
This sounds like instrumental convergence: any goal which does not explicitly care about things near X itself will care only about controlling f(X), not all of X. Agents with different goals will compete to control the same things: high-level behaviors f(X), especially those with far-reaching effects.
Natural next question: does all instrumental convergence work this way?
Typical intuition for instrumental convergence is something like “well, having lots of resources increases one’s action space, so a wide variety of agents will try to acquire resources in order to increase their action space”. Re-wording that as an abstraction argument: “an agent’s accessible action space ‘far away’ from now (i.e. far in the future) depends mainly on what resources it acquires, and is otherwise mostly independent of specific choices made right now”.
That may sound surprising at first, but imagine a strategic video game (I picture Starcraft). There’s a finite world-map, so over a long-ish time horizon I can get my units wherever I want them; their exact positions don’t matter to my long-term action space. Likewise, I can always tear down my buildings and reposition them somewhere else; that’s not free, but the long-term effect of such actions is just having less resources. Similarly, on a long time horizon, I can build/lose whatever units I want, at the cost of resources. It’s ultimately just the resources which restrict my action space, over a long time horizon.
(More generally, I think mediating-long-term-action-space is part of how we intuitively decide what to call “resources” in the first place.)
Coming from a different angle, we could compare to TurnTrout’s formulation of convergent instrumental goals in MDPs. Those results are similar to the argument above in that agents tend to pursue states which maximize their long-term action space. We could formally define an abstraction on MDPs in which X is the current state, and f(X) summarizes the information about the current state relevant to the far-future action space. In other words, two states X with the same long-run action space will have the same f(X). “Power”, as TurnTrout defined it, would be an increasing function of f(X) - larger long-run action spaces mean more power. Presumably agents would tend to seek states with large f(X).
Fun fact: biological systems are highly modular, at multiple different scales. This can be quantified and verified statistically, e.g. by mapping out protein networks and algorithmically partitioning them into parts, then comparing the connectivity of the parts. It can also be seen more qualitatively in everyday biological work: proteins have subunits which retain their function when fused to other proteins, receptor circuits can be swapped out to make bacteria follow different chemical gradients, manipulating specific genes can turn a fly’s antennae into legs, organs perform specific functions, etc, etc.
One leading theory for why modularity evolves is “modularly varying goals”: essentially, modularity in the organism evolves to match modular requirements from the environment. For instance, animals need to breathe, eat, move, and reproduce. A new environment might have different food or require different motions, independent of respiration or reproduction - or vice versa. Since these requirements vary more-or-less independently in the environment, animals evolve modular systems to deal with them: digestive tract, lungs, etc. This has been tested in simple simulated evolution experiments, and it works.
In short: modularity of the organism evolves to match modularity of the environment.
… and modularity of the environment is essentially abstraction-friendliness. The idea of abstraction is that the environment consists of high-level components whose low-level structure is independent (given the high-level summaries) for any far-apart components. That’s modularity.
Coming from an entirely different direction, we could talk about the good regulator theorem from control theory: any regulator of a system which is maximally successful and simple must be isomorphic to the system itself. Again, this suggests that modular environments should evolve modular “regulators”, e.g. organisms or agents.
I expect that the right formalization of these ideas would yield a theorem saying that evolution in abstraction-friendly environments tends to produce modularity reflecting the modular structure of the environment. Or, to put it differently: evolution in abstraction-friendly environments tends to produce (implicit) world-models whose structure matches the structure of the world.
Finally, we can ask what happens when one modular component of the world is itself an evolved agent modelling the world. What would we expect this agent’s model of itself to look like?
I don’t have much to say yet about what this would look like, but it would be very useful to have. It would give us a grounded, empirically-testable outside-view correctness criterion for things like embedded world models and embedded decision theory. Ultimately, I hope that it will get at Scott’s open question “Does agent-like behavior imply agent-like architecture?”, at least for evolved agents specifically.