Epistemic status: this is the result of me trying to better understand the idea of mesa optimizers. It's speculative and full of gaps, but maybe it's interesting and I'm not realistically going to have time to improve it much in the near future.
Humans are often presented as an example of "mesa optimisers" - organisms created to "maximise evolutionary fitness" that end up doing all sorts of other things including not maximising evolutionary fitness and transforming the world in the process. This analogy is usually accompanied by a disclaimer like this:
> We do not expect our analogy to live up to intense scrutiny. We present it as nothing more than that: an evocative analogy (and, to some extent, an existence proof) that explains the key concepts.
I am proposing that if we focus on evolutionary "influence" instead of "fitness", we can flip both claims on their head:
- Humans are extremely evolutionarily influential
- We should take the evolution analogy seriously
Evolution is about how things change
I think evolutionary fitness is, to some extent, not the interesting thing about evolution. If the world had always been full of indistinguishable rabbits and always would be, then rabbits would in some sense be "fit", but evolution would be a boring theory because all it would say is "rabbits". The interesting content of evolution is in what it says about how things change: if the world is full of regular rabbits plus one extra-fit rabbit, then evolution says that in a few years the world will have few regular rabbits and lots of extra-fit rabbits.
I want to propose a rough definition of evolutionary influence that generalises this idea. There are a few gaps in the definition which I hope can be successfully resolved, but I haven't had the time to do this yet.
First, we need an environment. I currently think of an environment as "a universe at a particular point in time". The universe is:
- A set $T$ of "points in time"
- A set $X$ of configurations that the universe can have at a particular time
- An update rule $u: X \to \Delta(X)$ that probabilistically maps the configuration at one point in time to the configuration at the next ($\Delta(X)$ means "the set of probability distributions on $X$").
Given a time $t \in T$, the environment $E_t$ is a probability distribution on $X$.
An organism $o$ is a particular configuration of a "small piece" of a universe. We can specify a function $f_o: X \to \{0, 1\}$ that evaluates whether the universe contains the organism, and is somehow restricted to evaluating "small parts" of the universe (what I mean by a "small part" is currently a gap in the definition). We can condition an environment $E_t$ on the presence of an organism $o$ to get the environment $E_t^{o}$, and similarly $E_t^{\neg o}$ is the environment without the organism $o$.
A feature $F$ is a "large piece" of a universe. As with organisms, I'm not sure what I mean by "large piece". In any case, there's a function $g_F: X \to \{0, 1\}$ that tells us whether a feature is present, and it must in some sense be "big and obvious".
An organism $o$ at $t$ has a large evolutionary influence at $t' > t$ if the probability of some large feature $F$ is very different in the future environment at $t'$ obtained by running the update rule forward from $E_t^{o}$ (with the organism) than in the one obtained from $E_t^{\neg o}$ (without it).
Intuition: If at time $t$ the environment is full of grass and there is also a pair of rabbits there, then at a later time $t'$ the environment will be full of rabbits and have a lot less grass. On the other hand, if there are no rabbits at time $t$, then the future environment will still be mostly grass.
Intuition 2: Perhaps if humans had not appeared when they did, highly intelligent life would have taken much longer to appear on Earth/never done so. Then the Earth without humans 300k years ago wouldn't have any cities etc. today.
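To make the rabbit intuition a bit more concrete, here is a minimal sketch of how influence could be estimated by simulation. Everything specific in it is my own assumption for illustration: the toy update rule, the constants, the choice of "at least a pair of rabbits" as the organism test and "at least half the grass is left" as the large feature, and the use of an absolute difference in probabilities as the measure of "very different".

```python
import random

# All constants below are hypothetical choices for this toy universe.
GRASS_CAP = 100      # maximum amount of grass the universe can hold
REGROWTH = 5         # grass units regrown per time step
BIRTH_PROB = 0.5     # chance that a fed rabbit produces one offspring


def sample_initial(rng):
    """Draw a configuration (grass, rabbits) from the environment at the start time."""
    grass = rng.randint(80, GRASS_CAP)
    rabbits = rng.choice([0, 2])  # either no rabbits or a single pair
    return grass, rabbits


def update(config, rng):
    """Probabilistic update rule: one configuration to the next."""
    grass, rabbits = config
    grass = min(GRASS_CAP, grass + REGROWTH)
    eaten = min(grass, rabbits)  # each rabbit eats one unit if it can
    births = sum(rng.random() < BIRTH_PROB for _ in range(eaten))
    return grass - eaten, eaten + births  # unfed rabbits die


def has_organism(config):
    """'Small piece' test: is there at least a pair of rabbits?"""
    return config[1] >= 2


def has_feature(config):
    """'Large feature' test: is the universe still mostly grass?"""
    return config[0] >= GRASS_CAP // 2


def influence(steps=50, n_samples=500, seed=0):
    """Estimate P(feature later | organism now) and P(feature later | no organism now)."""
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # organism present? -> [feature hits, samples]
    for _ in range(n_samples):
        config = sample_initial(rng)
        present = has_organism(config)  # condition on the starting configuration
        for _ in range(steps):
            config = update(config, rng)
        counts[present][0] += has_feature(config)
        counts[present][1] += 1
    p_with = counts[True][0] / counts[True][1]
    p_without = counts[False][0] / counts[False][1]
    return p_with, p_without, abs(p_with - p_without)


if __name__ == "__main__":
    p_with, p_without, gap = influence()
    print("P(mostly grass | rabbits at start):   ", p_with)
    print("P(mostly grass | no rabbits at start):", p_without)
    print("estimated influence:", gap)
```

In this toy universe the pair of rabbits should come out as highly influential, because the grass stays at its cap without them and gets eaten down with them. The sketch sidesteps the real gaps in the definition: it simply declares what counts as "small" and "large" rather than deriving it.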
Relevance to AI
The fundamental question I'm asking here is: are AI research efforts likely to produce highly influential "organisms"? A second important question is whether this influence is aligned with the creator's aims, but this seems to me to add a lot of complication.
My basic thinking here is that an AI system in training is embedded in two "universes". Think of a large neural network in training. One "universe" in which it lives is the space of network weights, and the update rule is given by the training algorithm and the loss incurred on the data at each step of training. It's not clear that it's meaningful to talk about "influence" in this universe. Maybe there is some "small" feature of the initialisation that determines whether it converges to something useful or not, but that is speculative (and I don't know what I mean by "small").
It also exists in the real universe - i.e. it's a configuration in the storage of some computer somewhere. Here there's a more intuitive sense that we can talk about "influentialness" - if it produces useful outputs somehow, people will be excited by their new AI and publish papers about it, create products using it and so forth, whereas if it doesn't then none of that will happen and it will be forgotten.
Given the way neural networks are trained, a trained neural network basically has to be something that performs reasonably well in the training universe. However, influence in the real universe trumps performance in the training universe - a real-universe-influential AI that isn't actually good on the training set is still real-universe-influential.
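As a rough illustration of the weight-space "training universe", here is a toy sketch in which the configuration is a single weight and the update rule is a gradient descent step. The loss function, learning rate and starting points are all hypothetical choices of mine, picked so that the loss has two basins of different quality. The point is only to show what it might look like for a "small" feature of the initialisation (here, the sign of the initial weight) to determine where training ends up. It says nothing about whether real networks behave this way.

```python
def loss(w):
    """Hypothetical double-well loss: the basin near w = -1 is lower than the one near w = +1."""
    return (w * w - 1.0) ** 2 + 0.3 * w


def grad(w):
    """Derivative of the loss above with respect to w."""
    return 4.0 * w * (w * w - 1.0) + 0.3


def train(w0, lr=0.01, steps=2000):
    """Training-universe update rule: repeated gradient descent steps.

    The rule is deterministic here, i.e. a degenerate probability
    distribution over next configurations.
    """
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


if __name__ == "__main__":
    for w0 in (-0.5, 0.5):
        w_final = train(w0)
        print(f"start {w0:+.1f} -> final weight {w_final:+.3f}, loss {loss(w_final):+.3f}")
```

Starting from a negative weight, training should settle in the lower basin, and starting from a positive weight it should settle in the higher one. Whether that counts as "influence" in the sense defined earlier depends on the unresolved question of what a "large feature" of the training universe would be.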
Two postulates about AI and influence:
1. There is a training universe that, when run for long enough, produces a highly influential organism in our universe.
   - An example of this would be a very high-performance reinforcement learner whose reward is based on some "large feature" of our universe.
2. AI training that we actually do has a reasonable chance of creating such a training universe.
   - For example, maybe it's not too hard to repurpose existing techniques (or future developments of them) to create the reinforcement learner mentioned in 1.
Speculatively, we can talk about the compatibility of the training environment and the real universe, meaning that "high performance" in the training environment is correlated with influence in the real universe. For a hypothetically "optimal" reinforcement learner rewarded based on large features of the real universe, this compatibility is maximal. However, even more pragmatic AI training regimes might exhibit high compatibility.
Also, pragmatic AI training regimes might exhibit high compatibility without much transparency about what kind of real-world influence is compatible with the training. Recalling that real-world influence screens off performance on the training objective, good behaviour assuming optimality with respect to the training objective may not be enough to guarantee good behaviour. This seems to me to be one of the key insights of the idea of mesa-optimisation. On the other hand, it's completely plausible to me that good behaviour assuming optimality could imply good behaviour for near-optimality too. It is also quite mysterious to me how to actually characterise "good behaviour".
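One candidate way to read "compatibility" is as a plain correlation, across a population of candidate models, between training performance and real-universe influence. The sketch below is only that: a toy reading under assumptions I am making up for illustration. The `coupling` parameter, the uniform noise model, and the choice of Pearson correlation are all hypothetical, and neither score corresponds to anything measured.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def compatibility(coupling, n_models=2000, seed=0):
    """Correlation between training performance and influence in a toy population.

    `coupling` (hypothetical, between 0 and 1) sets how much of a model's
    real-universe influence comes from its training performance rather
    than from unrelated noise.
    """
    rng = random.Random(seed)
    performance, influence = [], []
    for _ in range(n_models):
        p = rng.random()        # training-universe performance
        noise = rng.random()    # everything else that affects influence
        performance.append(p)
        influence.append(coupling * p + (1.0 - coupling) * noise)
    return pearson(performance, influence)


if __name__ == "__main__":
    for coupling in (0.1, 0.5, 0.9):
        print(f"coupling {coupling}: compatibility {compatibility(coupling):.3f}")
```

A tightly coupled regime should give a correlation near 1, and a loosely coupled one a correlation near 0. A single correlation number also hides the transparency problem: it tells you that performance and influence move together, not what kind of influence is being selected for.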
One thing we get from the "compatibility" framing that we don't get from the "optimization" framing is that compatibility arises because people want AIs that can do useful stuff in the real world. This is true for technology in general, of course, but AI stands out as a unique combination of:
- Search/"optimization pressure" (compared to designing a shovel, training an AI involves a lot more searching)
- Training environment compatibility (compared to a shortest path search, training an AI involves a lot more signal from the real world)
I've sketched a few rough ideas here that might be useful for better understanding AI risk. If they are actually going to be useful, they really need more development. Some questions:
- How should influence actually be defined?
- Can "small parts" and "big parts" of the universe be defined in some way that leads to influence being non-trivial?
- Should influence reduce to evolutionary fitness under appropriate assumptions?
- If we do have a definition, how should "compatibility between environments" be defined?
- Can we actually derive any results along the lines of the speculative proposals above?