Epistemic status: this is the result of me trying to better understand the idea of mesa optimizers. It's speculative and full of gaps, but maybe it's interesting and I'm not realistically going to have time to improve it much in the near future.

Humans are often presented as an example of "mesa optimisers" - organisms created to "maximise evolutionary fitness" that end up doing all sorts of other things including not maximising evolutionary fitness and transforming the world in the process. This analogy is usually accompanied by a disclaimer like this:

"We do not expect our analogy to live up to intense scrutiny. We present it as nothing more than that: an evocative analogy (and, to some extent, an existence proof) that explains the key concepts."

I am proposing that if we focus on evolutionary "influence" instead of "fitness", we can flip both claims on their head:

  • Humans are extremely evolutionarily influential
  • We should take the evolution analogy seriously

Evolution is about how things change

I think evolutionary fitness is to some extent not the interesting thing about evolution. If the world was always full of indistinguishable rabbits and always will be full of indistinguishable rabbits, then rabbits are in some sense "fit", but evolution is also a boring theory because all it says is "rabbits". The interesting content of evolution is in what it says about how things change: if the world is full of regular rabbits plus one extra-fit rabbit, then evolution says that in a few years, the world will have few regular rabbits and lots of extra-fit rabbits.

 I want to propose a rough definition of evolutionary influence that generalises this idea. There are a few gaps in the definition which I hope can be successfully resolved, but I haven't had the time to do this yet.

First, we need an environment. I currently think of an environment as "a universe at a particular point in time". The universe is:

  • A set $T$ of "points in time"
  • A set $S$ of configurations that the universe can have at a particular time
  • An update rule $u : S \to \Delta(S)$ that probabilistically maps the configuration at one point in time to the configuration at the next ($\Delta(S)$ means "the set of probability distributions on $S$").

Given a time $t \in T$, the environment $E_t$ is a probability distribution on $S$.

An organism is a particular configuration of a "small piece" of a universe. We can specify a function $o : S \to \{0, 1\}$ that evaluates whether the universe contains the organism, and $o$ is somehow restricted to evaluating "small parts" of the universe (what I mean by a "small part" is currently a gap in the definition). We can condition an environment $E_t$ on the presence of an organism $o$ to get the environment $E_t^{o}$, and similarly $E_t^{\neg o}$ is the environment without the organism $o$.

A feature is a "large piece" of a universe. As with organisms, I'm not sure what I mean by "large piece". In any case, there's a function $f : S \to \{0, 1\}$ that tells us whether a feature is present, and it must in some sense be "big and obvious".

An organism $o$ at $t$ has a large evolutionary influence at $t' > t$ if the probability of some large feature $f$ is very different in the future environment with the organism, $E_{t'}^{o}$, than in the future environment without it, $E_{t'}^{\neg o}$ (where $E_{t'}^{o}$ is $E_t^{o}$ evolved forward to $t'$ by the update rule $u$, and likewise for $E_{t'}^{\neg o}$).

Intuition: If at time $t$ the environment $E_t$ is full of grass and there is also a pair of rabbits $o$ there, then in the future $t'$ the environment $E_{t'}^{o}$ will be full of rabbits and have a lot less grass. On the other hand, if there are no rabbits at time $t$ then the future environment $E_{t'}^{\neg o}$ will still be mostly grass.

Intuition 2: Perhaps if humans had not appeared when they did, highly intelligent life would have taken much longer to appear on Earth, or would never have done so. Then an Earth without humans 300k years ago wouldn't have any cities etc. today.
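To make the rabbit intuition concrete, here is a minimal sketch of the rough definition in a toy universe. Everything in it is an assumption made for illustration: the configuration is just a pair (grass, rabbits), the update rule and its numbers are made up, and the names `update`, `evolve`, `condition` and `influence` are hypothetical. It also sidesteps the open question of what "small" and "large" mean by simply declaring which predicate counts as the organism and which as the feature.

```python
from collections import defaultdict

MAX_GRASS, MAX_RABBITS = 10, 8  # arbitrary caps for the toy universe


def update(config):
    """Toy update rule: map one configuration to a distribution over next ones."""
    grass, rabbits = config
    # Grass regrows by one unit, then each rabbit eats one unit.
    new_grass = max(0, min(grass + 1, MAX_GRASS) - rabbits)
    if rabbits == 0 or grass == 0:
        # No rabbits, or nothing to eat: the rabbit count stays the same.
        return {(new_grass, rabbits): 1.0}
    nxt = defaultdict(float)
    nxt[(new_grass, min(2 * rabbits, MAX_RABBITS))] += 0.5  # rabbits breed
    nxt[(new_grass, rabbits)] += 0.5                        # rabbits do not breed
    return dict(nxt)


def evolve(env, steps):
    """Push an environment (a distribution over configurations) forward in time."""
    for _ in range(steps):
        nxt = defaultdict(float)
        for config, p in env.items():
            for new_config, q in update(config).items():
                nxt[new_config] += p * q
        env = dict(nxt)
    return env


def condition(env, predicate):
    """Condition an environment on a predicate holding (assumes it has nonzero probability)."""
    kept = {c: p for c, p in env.items() if predicate(c)}
    total = sum(kept.values())
    return {c: p / total for c, p in kept.items()}


def probability(env, feature):
    """Probability that a feature is present in an environment."""
    return sum(p for config, p in env.items() if feature(config))


def influence(env, organism, feature, steps):
    """Gap in feature probability between the futures with and without the organism."""
    with_org = evolve(condition(env, organism), steps)
    without_org = evolve(condition(env, lambda c: not organism(c)), steps)
    return abs(probability(with_org, feature) - probability(without_org, feature))


def has_rabbits(config):   # the "organism" predicate
    return config[1] > 0


def mostly_grass(config):  # the "large feature" predicate
    return config[0] >= 5


# Present environment: lots of grass, and a 50% chance that a pair of rabbits is there.
start = {(10, 2): 0.5, (10, 0): 0.5}
print(influence(start, has_rabbits, mostly_grass, steps=10))
```

In this toy setup the influence of the rabbits on the "mostly grass" feature after ten steps comes out as 1.0: with rabbits the grass is certainly gone, and without them it certainly remains. Real cases would presumably give something much less clean.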

Relevance to AI

The fundamental question I'm asking here is: are AI research efforts likely to produce highly influential "organisms"? A second important question is whether this influence is aligned with the creator's aims, but this seems to me to add a lot of complication.

My basic thinking here is that an AI system in training is embedded in two "universes". Think of a large neural network in training. One "universe" in which it lives is the space of network weights, and the update rule is given by the training algorithm and the loss incurred on the data at each step of training. It's not clear that it's meaningful to talk about "influence" in this universe. Maybe there is some "small" feature of the initialisation that determines whether it converges to something useful or not, but that is speculative (and I don't know what I mean by "small").

It also exists in the real universe - i.e. it's a configuration in the storage of some computer somewhere. Here there's a more intuitive sense that we can talk about "influentialness" - if it produces useful outputs somehow, people will be excited by their new AI and publish papers about it, create products using it and so forth, whereas if it doesn't then none of that will happen and it will be forgotten.

Given the way neural networks are trained, a trained neural network basically has to be something that performs reasonably well in the training universe. However, influence in the real universe trumps performance in the training universe - a real-universe-influential AI that isn't actually good on the training set is still real-universe-influential.

Compatibility

Two postulates about AI and influence:

  1. There is a training universe that, when run for long enough, produces a highly influential organism in our universe
    • An example of this would be a very high-performance reinforcement learner whose reward is based on some "large feature" of our universe
  2. AI training that we actually do has a reasonable chance of creating such a training universe
    • For example, maybe it's not too hard to repurpose existing techniques (or future developments of them) to create the reinforcement learner mentioned in 1

Speculatively, there is some sense in which we can talk about the compatibility of the training environment and the real universe in the sense that "high performance" in the training environment is correlated with influence in the real universe. For a hypothetically "optimal" reinforcement learner rewarded based on large features of the real universe, this compatibility is maximal. However, even more pragmatic AI training regimes might exhibit high compatibility.

Also, pragmatic AI training regimes might exhibit high compatibility without much transparency about what kind of real-world influence is compatible with the training. Recalling that real-world influence screens off performance on the training objective, good behaviour assuming optimality with respect to the training objective may not be enough to guarantee good behaviour. This seems to me to be one of the key insights of the idea of mesa-optimisation. On the other hand, it's completely plausible to me that good behaviour assuming optimality could imply good behaviour for near-optimality too. It is also quite mysterious to me how to actually characterise "good behaviour".
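One very rough way to picture "compatibility" is as a correlation, across candidate systems, between performance in the training environment and influence in the real universe. The sketch below assumes we could somehow score both quantities for each candidate, which is exactly the part that is not defined yet. The function name `compatibility` and all of the numbers are hypothetical and only there to show the shape of the idea.

```python
from math import sqrt


def compatibility(performance, influence):
    """Pearson correlation between training performance and real-world influence."""
    n = len(performance)
    mean_p = sum(performance) / n
    mean_i = sum(influence) / n
    cov = sum((p - mean_p) * (i - mean_i) for p, i in zip(performance, influence))
    var_p = sum((p - mean_p) ** 2 for p in performance)
    var_i = sum((i - mean_i) ** 2 for i in influence)
    if var_p == 0 or var_i == 0:
        raise ValueError("correlation is undefined when either score is constant")
    return cov / sqrt(var_p * var_i)


# Made-up training scores for four candidate systems.
training_performance = [0.1, 0.4, 0.6, 0.9]

# Idealised learner rewarded directly on large features of the real universe:
# influence tracks training performance exactly, so compatibility is maximal.
print(compatibility(training_performance, [0.1, 0.4, 0.6, 0.9]))

# A more pragmatic regime: influence tracks performance only loosely.
print(compatibility(training_performance, [0.2, 0.1, 0.7, 0.6]))
```

Note that a single number like this says nothing about which kind of real-world influence goes along with high training performance, which is the transparency worry above.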

One thing we get from the "compatibility" framing that we don't get from the "optimization" framing is that compatibility arises because people want AIs that can do useful stuff in the real world. This is true for technology in general, of course, but AI stands out as being a unique combination of

  • Search/"optimization pressure" (compared to designing a shovel, training an AI involves a lot more searching)
  • Training environment compatibility (compared to a shortest path search, training an AI involves a lot more signal from the real world)

Conclusion

I've sketched a few rough ideas here that might be useful for better understanding AI risk. If they are actually going to be useful, they really need more development. Some questions:

  • How should influence actually be defined?
    • Can "small parts" and "big parts" of the universe be defined in some way that leads to influence being non-trivial?
    • Should influence reduce to evolutionary fitness under appropriate assumptions?
  • If we do have a definition, how should "compatibility between environments" be defined?
  • Can we actually derive any results along the lines of the speculative proposals above?

2 comments

To some extent this sounds like it's already captured by the notion of intelligence as being able to achieve goals in a wide range of environments - mesa-optimizers will have some edge if they're intelligent (or else why would they arise?). And this edge grows larger the more complicated stuff they're expected to do.

Contrary to the middle of your post, I would expect the training environment to screen off the deployment environment - the influentialness of a future AI is going to be because the training environment rewarded intelligence, not because influentialness on the deployment environment somehow reaches back to bypass the training environment and affect the AI.

The Legg-Hutter definition of intelligence is counterfactual ("if it had X goal, it would do a good job of achieving it"). It seems to me that the counterfactual definition isn't necessary to capture the idea above. The LH definition also needs a measure over environments (including reward functions), and it's not obvious how closely their proposed measure corresponds to things we're interested in, while influentialness in the world we live in seems to correspond very closely.

The mesa-optimizer paper also stresses (not sure if correctly) that they're not talking about input-output properties.

Influentialness in the real world definitely screens off training environment performance with respect to impact in the real world, because that's what the definition of influentialness is. If training environment performance fully screens off real world influentialness, then this means that the outcome of training is essentially deterministic, which is not obviously always going to be true. Even if it is, the kind of real world influence any given training environment maps to will be a bit random, which might ultimately yield a similar result.

When you say "we'll reward intelligence", your meaning is similar to when I say "training environment and real world might be highly compatible".

I think an idea like Legg-Hutter intelligence could be useful for arguing that the environments are likely to be compatible.