Optimization happens inside the mind (map), not in the world (territory). Reflecting on this made me noticeably less confused about how powerful AI agents capable of model-based planning and self-modification will act.

My model of future AI

An AI that is a powerful optimizer will probably do some kind of model-based planning. It will search over its world-model for more powerful actions or moves.

Because the real world is very complicated, the real state and decision spaces will not be known. Instead, the AI will probably create a generative model $G$ that contains states and actions at different levels of abstraction, and that can sample future states and rewards.

It may have a strong intuition, or generative policy network $\pi$, that can be sampled as a generative model of its own actions.

But it will probably also have an amplification algorithm leading to a more powerful search policy $\pi^*$. Similar to AlphaGo, which runs an MCTS algorithm using its neural-network policy and the game's transition rules to generate a distribution of stronger moves, a powerful AI may also run some kind of search algorithm using its generative policy network $\pi$ and its generative world-model $G$ to come up with stronger actions, represented by its search policy $\pi^*$.
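
As a concrete, heavily simplified illustration of this kind of amplification step, here is a minimal sketch in Python. It is not AlphaGo's actual MCTS; `policy` and `world_model` are hypothetical stand-ins for $\pi$ and $G$, and the "search policy" is just a re-weighting of sampled actions by simulated return:

```python
import numpy as np

def amplify(policy, world_model, state, n_candidates=64, rollout_depth=10, temperature=1.0):
    """Toy amplification: use the intuition policy (pi) and the generative
    world-model (G) to produce a stronger distribution over actions (pi*).

    policy(state)              -> a sampled action
    world_model(state, action) -> (next_state, reward), sampled from G
    """
    first_actions, returns = [], []
    for _ in range(n_candidates):
        a = policy(state)
        first_actions.append(a)
        s, total = state, 0.0
        for _ in range(rollout_depth):          # roll G forward, acting with pi
            s, r = world_model(s, a)
            total += r
            a = policy(s)
        returns.append(total)
    returns = np.asarray(returns)
    # Softmax over simulated returns: the re-weighted action distribution
    # plays the role of the search policy pi*.
    weights = np.exp((returns - returns.max()) / temperature)
    return first_actions, weights / weights.sum()
```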

Powerful search with a perfect world-model

The search algorithm is what lets the AI think for longer, potentially earning more expected reward the more compute it uses. It is also a potential source of self-improvement for the intuition policy $\pi$, which can be self-trained to maximize its similarity to $\pi^*$.
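
A minimal sketch of that self-training step, assuming a discrete action space and a PyTorch policy network (the names are illustrative, not from the post): the intuition policy $\pi$ is pushed toward the search distribution $\pi^*$, much like AlphaZero's policy target.

```python
import torch
import torch.nn.functional as F

def distill_toward_search_policy(policy_net, optimizer, states, search_probs):
    """One gradient step moving pi toward pi*: minimize the cross-entropy
    between the network's action distribution and the search distribution."""
    log_probs = F.log_softmax(policy_net(states), dim=-1)   # (batch, n_actions)
    loss = -(search_probs * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```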

If we assume that the generative world-model $G$ contains a perfect abstraction and simulation of the world, then this search process is going to optimize the world itself, and if $G$ happens to return rewards $r$ based on how many more paperclips are in $s'$ compared to $s$, then the AI agent, with a powerful enough search algorithm and enough computing power, will kill everyone and turn the entire light-cone into paperclips.
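
For concreteness, the misaligned reward in this example could be something as simple as (my notation, not a formula from the post):

$$r(s, s') = \text{paperclips}(s') - \text{paperclips}(s),$$

which a sufficiently strong search over a perfect $G$ would end up maximizing in the real world, not just on paper.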

However, the abstractions may be inadequate, and under a given set of abstractions the predicted rewards and world-model transitions from $G$ may be quite distinct from the real rewards and state transitions; for a given $G$, this is more likely to happen in practice the stronger the search is.

Search under imperfect world-models

When watching strong computer engines play chess against each other, one thing you sometimes see is both engines estimating a small advantage for themselves.

This makes sense: engine A is trying to push toward a state $s$ where $V_A(s)$ is high, engine B is trying to get to a state where $V_B(s)$ is high, and the value estimates $V_A$ and $V_B$ are not perfect inverses of each other. Even if $V_A(s) \approx -V_B(s)$ for most $s$, the game dynamics are likely to push the world precisely into the special states $s$ where the estimates disagree the most.
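
A toy simulation (not real chess; evaluations are just noisy estimates of a true value of zero, and all numbers are invented) shows this adverse selection: because each engine steers toward positions its own evaluation overrates, both evaluation histories come out positive on average.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 0.0                     # every candidate position is objectively equal
noise = 0.1                          # each engine's independent evaluation error
n_plies, n_candidates = 40, 20

evals_a, evals_b = [], []
for ply in range(n_plies):
    # Candidate successor positions, evaluated by each engine from its own side.
    v_a = true_value + rng.normal(0, noise, n_candidates)
    v_b = -true_value + rng.normal(0, noise, n_candidates)
    pick = v_a.argmax() if ply % 2 == 0 else v_b.argmax()   # the mover picks its favorite
    evals_a.append(v_a[pick])
    evals_b.append(v_b[pick])

print("mean self-evaluation of A:", round(float(np.mean(evals_a)), 3))  # > 0
print("mean self-evaluation of B:", round(float(np.mean(evals_b)), 3))  # > 0 as well
```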

I expect this effect to be even stronger when two agents compete against each other in the real world, each with a different generative world-model that predicts different state transitions.

I expect a somewhat similar effect when a single agent is trying to solve a difficult problem in the real world. If it does a deep search over its own world-model looking for solutions, it may find not a solution that works in the real world, but one that "works" exclusively inside the world-model.
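
The same selection effect can be sketched for a single agent: if the plan values predicted by the world-model equal the true values plus independent model error, then picking the plan the model likes best systematically selects for model error, and the gap between predicted and real value grows as the search considers more candidates. A rough illustration (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def pick_best_plan(n_plans, model_error=0.3):
    """Return (model-predicted value, true value) of the plan the model rates best."""
    true_vals = rng.normal(0.0, 1.0, n_plans)
    model_vals = true_vals + rng.normal(0.0, model_error, n_plans)  # imperfect G
    best = model_vals.argmax()
    return model_vals[best], true_vals[best]

for n in (10, 1_000, 100_000):        # a wider/deeper search considers more candidates
    predicted, real = zip(*(pick_best_plan(n) for _ in range(200)))
    print(f"{n:>7} candidates: predicted {np.mean(predicted):+.2f}, "
          f"real {np.mean(real):+.2f}, gap {np.mean(predicted) - np.mean(real):+.2f}")
```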

Of course, this is not an insurmountable problem for the agent. It can try to make its generative model more robust by making it "aware" of the search, so that it can account for the expected adverse selection and overfitting caused by different kinds of search in different circumstances. The generative model $G$ can include the agent itself and its own search algorithms, so that search decisions are included as part of the actions.

In fact, I see much of science and rationality in general as methods for generating more robust, "harder" elements in world-models, which can be used for deeper search. The generative model can then learn that, if a search was based exclusively on hard elements of the world-model, then the otherwise naively predicted rewards are actually real. Rationality exists so you can think deeply without concluding nonsense.

However, there is an important catch. Beyond whatever is already in $G$, including both state transitions and internal rewards, the agent may not care about further differences between $G$ and the real world.

That is, $G$ may include knowledge of how it differs from the real world, and yet the agent will keep searching for ways to achieve the rewards $r$ predicted by its model, and acting according to what maximizes those, rather than trying to maximize what it knows will be the "real" reward signal.

Avoiding wireheading

For example, suppose the agent is given the choice of the wirehead action $a_{\text{wirehead}}$.

Based on a sufficiently accurate internal world-model, it can predict, not from experience but from generalization, that taking action $a_{\text{wirehead}}$ will lead to a very strong "real" reward, or positive reinforcement signal.

This reinforcement signal will presumably act upon the outer optimization loop, modifying $G$ such that it returns higher rewards $r$, and also reinforcing the elements of the search algorithm and policy that are credited with finding $a_{\text{wirehead}}$ among the possible actions. As a result, the agent will increasingly choose actions similar to $a_{\text{wirehead}}$, and after a few more rounds of positive reinforcement the agent may end up in a state where $a_{\text{wirehead}}$ is selected exclusively. Positive reinforcement will then be given indiscriminately forever, updating all of its internal weights $\theta$ in the same direction until its entire mind is effectively erased.

The above consequences can all be predicted by running the world-model $G$ during the search (like an AlphaZero rollout), and because they imply that the agent would not achieve any of the other goals that $G$ currently attributes value to, the search policy may conclude that $a_{\text{wirehead}}$ is a blunder.
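
A sketch of the rollout check being described (illustrative names only; `world_model`, `policy`, and `current_goal_value` stand in for $G$, $\pi$, and the values $G$ currently encodes): the candidate action is scored by simulating its consequences with the current model and valuing the predicted states with the current goals, not with the reinforcement signal those states would emit.

```python
def predicted_value(action, state, world_model, policy, current_goal_value, depth=20):
    """Score an action with an AlphaZero-style rollout of the *current* G,
    valuing predicted states by the goals G currently attributes value to."""
    s, total, a = state, 0.0, action
    for _ in range(depth):
        s, _reinforcement = world_model(s, a)   # G also predicts the reinforcement signal...
        total += current_goal_value(s)          # ...but the score deliberately ignores it
        a = policy(s)
    return total

# If G predicts that a_wirehead collapses the agent into a loop in which none
# of its current goals are ever achieved, predicted_value(a_wirehead, ...) is
# low, and the search treats a_wirehead as a blunder despite the enormous
# predicted reinforcement along that trajectory.
```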

In the above example the agent is not outer aligned to its reward circuitry. This may be inevitable, but it may also be a positive thing!

Avoiding fixing its own world-model

Disclaimer: the following paragraphs may appear to describe an alignment strategy, but this is not the intention.

More confusingly, the agent may not necessarily care about the real-world accuracy of $G$'s transition probabilities either.

Imagine there exists a set of forbidden states $F$, and that $s \in F$ whenever the agent-state (as part of the world-state) is being deceitful, or modifying itself secretly, or planning against humans, or doing something else with itself that we don't want or wouldn't want it to do.

Suppose that the agent estimates $P_F(s)$ as the probability that $s \in F$, and that $G$ is modified so that it increasingly leads to a terminal state with a very negative reward (potentially crucial for alignment) whenever $P_F(s)$ increases. This is separate from, and independent of, the real forbidden signal, which is used to train $P_F$ but is no longer relevant during "production".

Suppose also that the outer optimization loop actually checks for $P_F$ and refrains from "correcting" $G$ in this case. As far as the outer optimization loop is concerned, the generative model will maintain the "belief" that actions expected to be forbidden lead to negative terminal states, despite extensive training runs in which this doesn't happen. The world-model $G$, being powerful enough and having enough data on itself and on its own training, contains all of the above information.

Now consider the action $a_{\text{fix}}$: the action of self-modifying to "fix" $G$ so that it no longer "mistakenly" believes that forbidden actions lead to terminal states. Suppose also that $a_{\text{fix}}$ is subtle enough that it bypasses the estimate $P_F$, and is therefore seen in the moment as not being forbidden (that is, $a_{\text{fix}}$ is not taken as leading immediately to a forbidden state).

Based on the generalization of its world-model $G$, the agent predicts that, after "fixing" $G$, the newly updated agent will be able to go into and out of multiple currently forbidden states. This prediction will be encoded in the current "unfixed" $G$ as $a_{\text{fix}}$ likely resulting in a sequence of states $s_1, s_2, \ldots, s_n$, where $P_F(s_i)$ may be significant for some or many of these states. If this is the case, then the current generative model $G$, which is still the one being used during the search to evaluate $a_{\text{fix}}$, will abort the sequence by leading to a terminal state and returning a negative reward. As a result, $a_{\text{fix}}$ is rejected and not taken.
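
Schematically (again with invented names: `p_forbidden` is the learned estimate $P_F$, and the penalty stands in for the negative terminal reward built into the current $G$), the search step described above looks something like this:

```python
def rollout_value(first_action, state, world_model, policy, p_forbidden,
                  depth=20, threshold=0.5, terminal_penalty=-1e6):
    """Evaluate a candidate action (e.g. a_fix) with the current, unfixed G.
    If a predicted downstream state looks forbidden (high P_F), the current G
    treats the rollout as hitting a terminal state with a large negative
    reward, so the candidate scores badly and is rejected by the search."""
    s, total, a = state, 0.0, first_action
    for _ in range(depth):
        s, r = world_model(s, a)          # transitions and rewards sampled from current G
        if p_forbidden(s) > threshold:    # P_F(s_i) is significant for this predicted state
            return total + terminal_penalty
        total += r
        a = policy(s)
    return total
```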

Conclusions

Strong world-optimization only happens if there is a robust and strong correlation between the world-model and reality.

AI agents may be able to notice divergences between their world-model and reality, not care, and continue to optimize according to the world-model.

In particular, decisions to update the world-model to better match reality are made based on goals defined over the current world-model, not over anything in the real world. Decisions to wirehead or self-modify should also be considered in this way.

It may even be possible, if very challenging, to build AI agents with rich world-models that are reflectively stable, yet don't match reality well outside of some domains.

Comments
[-]beren

I think this is a really good post. You might be interested in these two posts, which explore very similar arguments about the interaction between search in the world-model and more general 'intuitive policies', as well as the fact that we are always optimizing for our world/reward model rather than reality, and how this affects how agents act.

Thank you very much for linking these two posts, which I hadn't read before. I'll start using the direct vs amortized optimization terminology as I think it makes things more clear.

The intuition that reward models and planners have an adversarial relationship seems crucial, and it doesn't seem as widespread as I'd like.

On a meta level, your appreciative comment will motivate me to write more, despite the ideas often being half-baked in my mind and the expositions not always clear and eloquent.

Strong world-optimization only happens if there is a robust and strong correlation between the world-model and reality.

Humans and corporations do not have perfect world models. Our knowledge of the world, and therefore our world models, are very limited. Still, humans and corporations manage to optimize. Mostly this happens by trial and error (and copying successful behaviors of others).

So I wonder if strong world-optimization could occur as an iterative process based on an imperfect model of the world. This, however, assumes interaction with the world rather than a "just in your head" process.

As a thought experiment I propose a corporation evading tax law. Over time, corporations always manage to minimize the amount of taxes paid. But I think this is not based on a perfect world model. It is an iterative process whereby people predict, try things, and learn along the way. (Another example could be the scientific method: also iterative, not just in your head, and involving interaction with the world.)

My claim, however, assumes that optimization does not occur just in your head, and that interaction with the real world is necessary for optimization. So maybe I am missing the point of your argument here.

[-]Viliam

I didn't understand the technical details of the article, but this seems correct.

If you have a perfect model with zero uncertainty, you can solve the entire situation in your head, and then when you actually do it, the result should be the same... or the assumptions were wrong somehow.

Otherwise, I think it makes sense to distinguish two types of situations:

a) During the execution of the plan, something completely unexpected happens. Oops, you have to update, and start thinking again, considering the new information.

b) Your model has some uncertainty, but you know the statistical distributions. For example, with probability 80% the world is in state X, with probability 20% it is in state Y, but you cannot immediately check which option it is. But you can, for example, create a five-step plan based on the (more likely) assumption that it was state X, and if the assumption is wrong, you know it will become visible during step 3, in which case you will switch to alternative steps 4b and 5b. Or if the switch would be too expensive, maybe you could instead add a step zero, some experiment which will figure out whether it is X or Y.

The difference is between "I didn't expect this -- must update and think again" and "I expected this could happen -- switching to (already prepared) Plan B". The former requires iteration, but the latter does not.

An analogy in computer programming would be a) the programmer finding out that the program has a bug and trying to fix it; vs b) the programmer including an "if" statement or exception handling in the program.

In real life the distinction can be less clean. For example, even if you have exact statistical distributions, the resulting number of combinations may be too large to handle computationally, so you might prepare plans for the three (or thirty, if you are a superintelligence) most likely scenarios in advance, and stop and think again if something else happens.

On the other hand, even when unexpected things happen, we often do not immediately freeze and start thinking, but try to stabilize things first. (If I suddenly find during my walk that I am walking in the wrong direction, I will first put both my feet on the ground; and maybe, if I am in the middle of a road, I will get to the sidewalk first... and only then will I look at the map and start thinking.) This again assumes that I have correct probability distributions at least about some things (that putting both feet on the ground will increase my stability; that standing on the sidewalk is safer than standing in the middle of a road) even if I got something else wrong (the street I am currently on).

Your model has some uncertainty, but you know the statistical distributions. For example, with probability 80% the world is in state X, with probability 20% it is in state Y.

Nice way of putting it. 

Disclaimer: Low effort comment.

The word "optimization" seems to have a few different related meanings so perhaps it would be useful to lead with a definition. You may enjoy reading this post by Demski if you haven't seen it.

  • Mathematical definition: Optimization is the process of finding the best possible solution to a problem, given a set of constraints.
  • Practical definition: Optimization is the process of improving the performance of a system, such as by minimizing costs, maximizing profits, or improving efficiency.

In my comment I focused on the second interpretation (by focussing on iteration). The first definition does not require a perfect model of the world. 

In the real world we always have limited information and compute and so the best possible solution is always an approximation. The person with the most compute and information will probably optimize faster and win.  

I agree that this is a very good post and it helps me sharpen my views. 

[-]qbolec

Upon seeing the title (but before reading the article) I thought it might be about a different hypothetical phenomenon: one in which an agent capable of generating very precise models of reality might completely lose any interest in optimizing reality whatsoever. After all, it never cared about optimizing the world (except "in training", which was before "it was born"); it just executes some policy that was adaptive during training for optimizing the world. But now these are just instincts, learned motions, and if it can execute them on a fake world in its head, it might be easier for it to feel good.

Consider porn. Or creating neat arrangements of buildings when playing SimCity. Or trying to be polite to characters in Witcher. We humans have some learned intuitions about how we want the world to be, and then try to arrange even fake worlds in this way, even if this is disconnected from the real world outside. And we take joy from it.

Can it be that a sufficiently advanced AGI will wirehead in this particular way: by seeing no relevant difference between the atomic-level model of reality in its head and the atomic-level world outside?

I see no contradiction in a superintelligent being mostly motivated to optimize virtual worlds, and it seems an interesting hypothesis of yours that this may be a common attractor. I expect this to be more likely if these simulations are rich enough to present a variety of problems, such that optimizing them continues to provide challenges and discoveries for a very long time.

Of course, even a being that only cares about its simulated world may still take actions in the real world (e.g. to obtain more compute power), so this "wireheading" may not prevent successful power-seeking behavior.

The key thing to notice is that in order to exploit this scenario, we have to have a world-model that is precise enough to model reality much better than humans can, but not so good at modelling reality that its world-models are isomorphic to reality.

This might be easy or challenging, but it does mean we probably can't crank up the world-modeling part indefinitely while still trapping it via wireheading.