Transparency is vital for ML-type approaches to AI alignment, and is also an important part of agent foundations research. In this post, we lay out an agenda for formalizing transparency which we'll call the Optimization Provenance Agenda.
In particular, the goal is to create a notion of transparency strong enough that an attempted deception would be completely transparent. The basic idea is that at any point, not only should an agent's world model and subgoals be legible, but the entire provenance of all of the optimization processes which are part of the agent should be legible as well.
This agenda is a joint development between me and Evan Hubinger. Special thanks to Hjalmar Wijk and Evan Hubinger for their comments and feedback on this post!
In order to discuss the notions here, it will be helpful to have working definitions of the key concepts involved.
Intuitively, legibility means that a human should be able to look at something, and be able to understand it easily and correctly. If the thing in question is very large, we might at best only be able to have local legibility, where any given part below a certain size is legible, but the whole thing is not understandable by a human. In an amplification scenario, local legibility may be sufficient, with an amplified human being capable of understanding the global structure. For the purposes of this post, we'll consider legibility to also include these cases. One major issue with the concept of legibility is that it seems very difficult to create something legible through a legible process.
It seems plausible to me that existing ML techniques could be combined and extended to produce natural language descriptions of learned world models. However, this process itself would most likely be very illegible. Even in the case of human communication, it is possible for a human to produce a legible plan, but humans do not seem to be very capable of producing a legible explanation of the process which produced that plan. So it seems likely to me that we may have to decide on some illegible process to trust in order to get off the ground with this approach. This could simply be trusting illegible human mental processes, or it could be something like trusting models produced in a mathematically simple way.
A world model belongs to a decision-making process, and is used by the process to predict what the result of various decisions would be, so that it can make the best choice. It's important that the world model includes everything going into its decision-making process.
Due to our motivations in transparency, we will typically think of world models as being made of highly composable models, which each model an aspect of the entire world (including very abstract aspects). I believe that Goguen's sheaf semantics is a promising framework for formalizing this type of world model. It's important to note that world models in current ML methods are not composable this way, which makes these models much less legible.
World models can also be implicit or explicit. The canonical example of an implicit world model is that of a thermostat, where the world model is implicitly represented by the thermistor or bimetallic strip. An explicit world model is represented in a modeling framework, such as a sheaf model. The exact line between explicit and implicit world models seems to be nebulous. For our purposes, it is much better to have an explicit world model, since an explicit representation is more legible. Note that implicit models can still be legible, though, as in the thermostat example.
We will consider an optimizer to be made of three components:
1. A world model
2. An objective
3. An optimization process
As an example, consider AlphaGo, which uses Monte-Carlo Tree Search. The world model is distributed, with an implicit part in the selection network, and an explicit part in the expansion and simulation steps of the search. The objective is to maximize the probability of winning the game, and the optimization process is simply backpropagation over the tree (the step of the search that propagates simulation results back up the tree, as distinct from gradient backpropagation).
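As a minimal sketch of this three-part decomposition, the Python below treats an optimizer as a record holding a world model, an objective, and an optimization process. Every name here (`Optimizer`, `choose`, `exhaustive_search`, the toy heater) is a hypothetical illustration rather than part of any existing framework, and exhaustive search is an assumed stand-in for a real optimization process.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Optimizer:
    """Hypothetical container for the three components named above."""
    # World model: predicts the outcome of taking an action.
    world_model: Callable[[Any], Any]
    # Objective: scores a predicted outcome (higher is better).
    objective: Callable[[Any], float]
    # Optimization process: picks an action, given a way to score actions.
    process: Callable[[Iterable[Any], Callable[[Any], float]], Any]

    def choose(self, actions: Iterable[Any]) -> Any:
        # An action's score is the objective applied to its predicted outcome.
        return self.process(actions, lambda a: self.objective(self.world_model(a)))


def exhaustive_search(actions, score):
    """Assumed stand-in for a real optimization process: try every action."""
    return max(actions, key=score)


# Toy example: pick the heater setting whose predicted temperature is closest to 20.
heater = Optimizer(
    world_model=lambda setting: 15.0 + 2.5 * setting,  # predicted temperature
    objective=lambda temp: -abs(temp - 20.0),          # closer to 20 is better
    process=exhaustive_search,
)
best_setting = heater.choose(range(6))
```

In this toy case all three components are explicit and can be inspected separately, which is the property the legibility discussion is after. A learned system would generally not factor this cleanly.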
In order to be legible, it is best if the objective is defined in terms of explicit parts of the world model. While I do not believe this is strictly necessary, I feel very skeptical that implicit objectives will be adequately transparent.
The provenance of an object is the entire history of how that object came to be. As an example, art museums place high value on knowing the complete provenance of a famous painting, as it greatly increases their confidence that the painting is indeed legitimate.
In this context, we want to understand the provenance of all optimization processes running in the agent. In particular, for each optimizer, we want to understand the optimization process which created the optimizer.
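One way to picture this is as a chain of records, where each optimizer points back to the optimizer that created it. The sketch below is only a toy illustration under that assumption: `OptimizerRecord`, `created_by`, and `provenance` are hypothetical names, and the two example optimizers (a training process that produces a planner) are invented for illustration, not taken from any real system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OptimizerRecord:
    """Hypothetical provenance entry for one optimizer inside an agent."""
    name: str
    world_model: str   # human-readable description of the world model
    objective: str     # human-readable description of the objective
    process: str       # human-readable description of the optimization process
    created_by: Optional["OptimizerRecord"] = None  # None for the base optimizer


def provenance(record: OptimizerRecord) -> List[OptimizerRecord]:
    """Return the chain of creators, from the base optimizer down to `record`."""
    chain = []
    current = record
    while current is not None:
        chain.append(current)
        current = current.created_by
    return list(reversed(chain))


# Invented example: a training process that produced a planner.
training = OptimizerRecord(
    name="training",
    world_model="implicit: the training distribution",
    objective="minimize loss",
    process="gradient descent",
)
planner = OptimizerRecord(
    name="planner",
    world_model="explicit: learned environment model",
    objective="proxy for reward",
    process="tree search",
    created_by=training,
)

lineage = [r.name for r in provenance(planner)]
```

Here `lineage` lists the optimizers from the outermost creator to the planner itself. A real provenance would need far richer descriptions than these strings, but the shape (every optimizer traceable to the process that made it) is the same.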
In addition to generally improving our understanding of transparency, there are two main motivations going into the optimization provenance agenda. So before we get into more details of the agenda, we will introduce these motivations here.
In order to perform a treacherous turn, or other kinds of deceptive alignment, an agent must at some point conceive of this plan. This conception will be the result of an optimizer, and so this should be in the optimization provenance. If this provenance is legible, then this should be detected by an overseer before it comes to fruition. So we expect that an agent with optimization provenance will be much less likely to achieve or even attempt a treacherous turn.
More generally, we want to prevent Goodhart's curse scenarios, and ensure inner alignment. If the agent is optimizing for a proxy to a goal, this is potentially legible. However, we will typically expect a training process to produce a mesa-optimizer which is optimizing for some proxy, since values are likely to be too complex to optimize for directly. So how do we distinguish between robustly aligned mesa-optimizers and pseudo-aligned mesa-optimizers (i.e. those that Goodhart)? If we understand the optimizer which created the mesa-optimizer in addition to the mesa-optimizer itself, we believe this will give insight into whether this is likely to happen or not. Understanding the provenance of optimizers could thus give us techniques to help us solve this problem.
We're now ready to understand the three key pieces of the optimization provenance agenda. We believe that a solution to these would represent a large step in solving transparency.
In order for this to work, the optimizers running as part of the agent need to be legible. I believe the hardest part of this is ensuring that the world model of the optimizer is legible. Once we have a legible world model, I expect it will be relatively easy to ensure the objective is also legible.
Ensuring that the optimization process is legible falls more naturally into the next section.
1. Find an adequate framework for explicit world models
2. Explicitly understand the objective
3. Understand what makes something an implicit world model
The optimization processes used in the agent must all be capable of controlling whether they create a mesa-optimizer. In particular, a mesa-optimizer should only be formed when its three components are all known and assured. This requires the optimization process to be legible in the sense that we understand under what conditions the process will create a new mesa-optimizer.
1. When are mesa-optimizers created?
2. How do we ensure that the mesa-optimizer will be legible?
3. Can such mesa-optimizers be made corrigible?
4. Does this apply to other forms of optimization daemons?
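The creation condition described above (a mesa-optimizer is formed only when its three components are all known and assured) can be sketched as a simple gate. This is an illustrative assumption about what such a check might look like, not a proposed mechanism: `ProposedOptimizer`, `CreationRefused`, `create_mesa_optimizer`, and the `is_assured` callback are all hypothetical, and what "assured" should mean is exactly the open question.

```python
from dataclasses import dataclass
from typing import Callable, Optional

COMPONENTS = ("world_model", "objective", "process")


@dataclass(frozen=True)
class ProposedOptimizer:
    """Hypothetical description of a mesa-optimizer that is about to be created."""
    world_model: Optional[str]
    objective: Optional[str]
    process: Optional[str]


class CreationRefused(Exception):
    """Raised when a component is unknown or fails the assurance check."""


def create_mesa_optimizer(
    proposal: ProposedOptimizer,
    is_assured: Callable[[str, str], bool],
) -> ProposedOptimizer:
    # Refuse if any of the three components is unknown.
    missing = [c for c in COMPONENTS if getattr(proposal, c) is None]
    if missing:
        raise CreationRefused(f"unknown components: {missing}")
    # Refuse if any known component fails the (assumed) assurance check.
    unassured = [c for c in COMPONENTS if not is_assured(c, getattr(proposal, c))]
    if unassured:
        raise CreationRefused(f"unassured components: {unassured}")
    return proposal


full = ProposedOptimizer("explicit model", "stated objective", "tree search")
created = create_mesa_optimizer(full, lambda component, value: True)
```

A proposal with any component left as `None`, or one that the `is_assured` callback rejects, raises `CreationRefused` instead of being created.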
Having legible provenance of every optimizer occurring in the agent is a good step towards solving transparency. But this is not enough by itself, because provenance may itself be forged. In order for the provenance to be assured, an overseer must investigate any changes made to the optimization provenance, and verify that there is no optimization toward unaligned goals.
1. What constitutes adequate oversight?
2. How possible is it for accidental treachery to be produced?
3. To what extent can Goodhart's curse be avoided with optimization provenance?
It seems likely that progress on the optimization provenance agenda could be leveraged to make progress on other subproblems. In particular, I think that developing frameworks for explicit models will make it easier to solve the inner alignment problem. I also believe that the idea of optimization provenance is a useful handle for thinking about mesa-optimizers in general.
I'm having trouble building an intuitive picture of what kind of thing an optimization provenance is. Would it be possible to give some toy examples of optimizers and what their optimization provenance might look like?
The idea of explicit world models reminds me of my research direction concerning cellular decision processes. Note that the "grid" of such a process can be replaced by an arbitrary graph, which can also evolve dynamically, and that makes it quite close to the notion of representing the world as a collection of objects and relationships/interactions. I did not know about the sheaf theoretic angle (added to reading list, thank you!), it might be interesting to see whether these two combine in a natural way.
(1) You might give some thought to trying to copy (or at least understand) the world model framework of the human brain. There's uncertainty in how that works, but a lot is known, and you'll at least be working towards something that we know for sure is capable of getting built up to a human level world-model within a reasonable amount of time and computation. As best as I can tell (and I'm working hard to understand it myself), and grossly oversimplifying, it's a data structure with billions of discrete concepts, and transformations between those concepts (composition, cause-effect, analogy, etc.; probably all of those are built out of the same basic "transformation machinery" with different contexts acting as metadata). All these concepts are sitting in the top layer of some kind of loose hierarchy, whose lowest layer consists of (higher-level-context-dependent) probability distributions over spatiotemporal sequences of sensory inputs. See my Jeff Hawkins post for one possible point of departure. I've found a couple other references that are indirectly helpful, and like I said, I'm still trying to figure it out. I'm still trying to understand the "sheaves" approach, so I won't comment on how these compare.
(2) "This conception will be the result of an optimizer, and so this should be in the optimization provenance" - this seems to be important and I don't understand it. Better understanding the world consists (in part) of chunking sequences of events and actions, suppressing intermediate steps. Thus we say and think "I'll put some milk in my coffee," leaving out the steps like unscrewing the top of the jug. The process of "explore the world model, chunking sequences of events when appropriate" is (I suspect) essential to making the world-model usable and powerful, and needs to be repeated millions of times in every nook and cranny of the world model, and thus this is a process that an overseer would have little choice but to approve in general, I think. But this process can find and chunk manipulative causal pathways just as well as any other kind of pathway. And once manipulation is packaged up inside a chunk, you won't need optimization per se to manipulate, it will just be an obvious step in the process of doing something, just like unscrewing the top of the jug is an obvious step in putting-milk-into-coffee. I'm not sure how you propose to stop that from happening.
The optimization processes used in the agent must all be capable of controlling whether it creates a mesa-optimizer.
I'm confused about this sentence - my understanding is that the term mesa-optimizer refers to the agent/model itself when it is doing some optimization. I think the term "run-time optimization" (which I've seen in this slide, seemingly from a talk by Yann LeCun) refers to this type of optimization.
4. Does this apply to other forms of optimization daemons?
Isn't every optimization daemon a mesa-optimizer?
I was under the impression that the term "optimization daemon" was used to describe a mesa-optimizer that is a "consequentialist" (I don't know whether there's a common definition for the term "consequentialist" in this context; my own tentative fuzzy definition is "something that has preferences about the spacetime of the world/multiverse").
In particular, the goal is to create a notion of transparency strong enough that an attempted deception would be completely transparent.
Is the idea here that