Transparency is vital for ML-type approaches to AI alignment, and is also an important part of agent foundations research. In this post, we lay out an agenda for formalizing transparency which we'll call the Optimization Provenance Agenda.
In particular, the goal is to create a notion of transparency strong enough that an attempted deception would be completely transparent. The basic idea is that at any point, not only should an agent's world model and subgoals be legible, but the entire provenance of all of the optimization processes which are part of the agent should be legible as well.
This agenda is a joint development between me and Evan Hubinger. Special thanks to Hjalmar Wijk and Evan Hubinger for their comments and feedback on this post!
In order to discuss the notions here, it will be helpful to have working definitions of the key concepts involved.
Intuitively, legibility means that a human should be able to look at something, and be able to understand it easily and correctly. If the thing in question is very large, we might at best only be able to have local legibility, where any given part below a certain size is legible, but the whole thing is not understandable by a human. In an amplification scenario, local legibility may be sufficient, with an amplified human being capable of understanding the global structure. For the purposes of this post, we'll consider legibility to also include these cases. One major issue with the concept of legibility is that it seems very difficult to create something legible through a legible process.
It seems plausible to me that existing ML techniques could be combined and extended to produce natural language descriptions of learned world models. However, this process itself would most likely be very illegible. Even in the case of human communication, it is possible for a human to produce a legible plan, but humans do not seem to be very capable of producing a legible explanation of the process which produced that plan. So it seems likely to me that we may have to decide on some illegible process to trust in order to get off the ground with this approach. This could simply be trusting illegible human mental processes, or it could be something like trusting models produced in a mathematically simple way.
A world model belongs to a decision-making process, and is used by the process to predict what the result of various decisions would be, so that it can make the best choice. It's important that the world model includes everything going into its decision-making process.
Due to our motivations in transparency, we will typically think of world models as being made of highly composable models, which each model an aspect of the entire world (including very abstract aspects). I believe that Goguen's sheaf semantics is a promising framework for formalizing this type of world model. It's important to note that world models in current ML methods are not composable this way, which makes these models much less legible.
World models can also be implicit or explicit. The canonical example of an implicit world model is that of a thermostat, where the world model is implicitly represented by the thermistor or bimetallic strip. An explicit world model is represented in a modeling framework, such as a sheaf model. The exact line between explicit and implicit world models seems to be nebulous. For our purposes, it is much better to have an explicit world model, since it is more legible to have something explicitly. Note that implicit models can still be legible though, such as in the thermostat example.
We will consider an optimizer to be made of three components:
1. A world model
2. An objective
3. An optimization process
As an example, consider AlphaGo, which uses Monte Carlo tree search. The world model is distributed, with an implicit part in the selection network, and an explicit part in the expansion and simulation steps of the search. The objective is to maximize the probability of winning the game, and the optimization process is simply backpropagation of simulation results up the search tree (the backup step of the search, not gradient backpropagation).
In order to be legible, it is best if the objective is defined in terms of explicit parts of the world model. While I do not believe this is strictly necessary, I feel very skeptical that implicit objectives will be adequately transparent.
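To make the three-part decomposition concrete, here is a minimal Python sketch. The names `Optimizer`, `decide`, and `argmax_search` are hypothetical and chosen only for illustration, and the toy thermostat-like example is an assumption, not something from a real system. The point is only that the objective is written in terms of the explicit world model's output, so each of the three parts can be inspected separately.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


@dataclass(frozen=True)
class Optimizer(Generic[State, Action]):
    """Hypothetical container for the three components of an optimizer."""

    # 1. World model: predicts the state that results from an action.
    world_model: Callable[[State, Action], State]
    # 2. Objective: scores a predicted state (higher is better).
    objective: Callable[[State], float]
    # 3. Optimization process: chooses an action given a scoring function.
    optimization_process: Callable[
        [Iterable[Action], Callable[[Action], float]], Action
    ]

    def decide(self, state: State, actions: Iterable[Action]) -> Action:
        # The objective is evaluated on the world model's prediction,
        # i.e. it is defined in terms of the explicit world model.
        def score(action: Action) -> float:
            return self.objective(self.world_model(state, action))

        return self.optimization_process(actions, score)


def argmax_search(actions, score):
    """Simplest possible optimization process: exhaustive argmax."""
    return max(actions, key=score)


# Toy example (an assumption for illustration): a thermostat-like agent
# that models temperature changes explicitly and targets 20 degrees.
thermostat = Optimizer(
    world_model=lambda temp, delta: temp + delta,
    objective=lambda temp: -abs(temp - 20.0),
    optimization_process=argmax_search,
)

choice = thermostat.decide(17.0, [-1.0, 0.0, 1.0])
```

In this sketch, swapping `argmax_search` for a different search procedure changes only the optimization process, while the world model and objective stay readable on their own.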
The provenance of an object is the entire history of how that object came to be. As an example, art museums place high value on knowing the complete provenance of a famous painting, as it greatly increases their confidence that the painting is indeed legitimate.
In this context, we want to understand the provenance of all optimization processes running in the agent. In particular, for each optimizer, we want to understand the optimization process which created the optimizer.
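One rough way to picture this is as a chain of records, where each optimizer points back to the optimizer that created it. The sketch below is only an illustration under that assumption; `ProvenanceRecord`, `created_by`, `lineage`, and the example optimizers are all hypothetical names, not part of any existing framework.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record describing one optimizer and its origin."""

    name: str
    objective: str
    # The optimizer whose optimization process created this one.
    # None marks a root, e.g. a training process set up by humans.
    created_by: Optional["ProvenanceRecord"] = None


def lineage(record: ProvenanceRecord) -> List[str]:
    """Return the names of all optimizers from this one back to the root."""
    chain = []
    current: Optional[ProvenanceRecord] = record
    while current is not None:
        chain.append(current.name)
        current = current.created_by
    return chain


# Illustrative (made-up) chain of optimizers.
training = ProvenanceRecord("training process", "minimize training loss")
planner = ProvenanceRecord("learned planner", "reach the goal", created_by=training)
subsearch = ProvenanceRecord("subgoal search", "find the key", created_by=planner)

history = lineage(subsearch)
```

Here `history` lists the optimizer itself, then its creator, then the creator's creator, which is the kind of history an overseer would want to be able to read.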
In addition to generally improving our understanding of transparency, there are two main motivations going into the optimization provenance agenda. So before we get into more details of the agenda, we will introduce these motivations here.
In order to perform a treacherous turn, or other kinds of deceptive alignment, an agent must at some point conceive of this plan. This conception will be the result of an optimizer, and so it should appear in the optimization provenance. If this provenance is legible, then the plan should be detected by an overseer before it comes to fruition. So we expect that an agent with optimization provenance will be much less likely to achieve or even attempt a treacherous turn.
More generally, we want to prevent Goodhart's curse scenarios, and ensure inner alignment. If the agent is optimizing for a proxy to a goal, this is potentially legible. However, we will typically expect a training process to produce a mesa-optimizer which is optimizing for some proxy, since values are likely to be too complex to optimize for directly. So how do we distinguish between robustly aligned mesa-optimizers and pseudo-aligned mesa-optimizers (i.e. those that Goodhart)? If we understand the optimizer which created the mesa-optimizer in addition to the mesa-optimizer itself, we believe this will give insight into whether this is likely to happen or not. Understanding the provenance of optimizers could thus give us techniques to help us solve this problem.
The optimization provenance agenda
We're now ready to understand the three key pieces of the optimization provenance agenda. We believe that a solution to these would represent a large step in solving transparency.
In order for this to work, the optimizers running as part of the agent need to be legible. I believe the hardest part of this is ensuring that the world model of the optimizer is legible. Once we have a legible world model, I expect it will be relatively easy to ensure the objective is also legible.
Ensuring that the optimization process is legible falls more naturally into the next section.
1. Find an adequate framework for explicit world models
2. Explicitly understand the objective
3. Understand what makes something an implicit world model
The optimization processes used in the agent must all be capable of controlling whether they create a mesa-optimizer. In particular, a mesa-optimizer should only be formed when its three components are all known and assured. This requires each optimization process to be legible in the sense that we understand under what conditions the process will create a new mesa-optimizer.
2. How do we ensure that the mesa-optimizer will be legible?
3. Can such mesa-optimizers be made corrigible?
4. Does this apply to other forms of optimization daemons?
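The condition that a mesa-optimizer should only be formed when its three components are all known and assured can be pictured as a simple gate. The following is a toy sketch, not a proposal for how assurance would actually be established: `Component`, `MesaOptimizerProposal`, the `assured` flag, and `may_create` are hypothetical, and the hard part (deciding when a component counts as assured) is assumed away.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    description: str
    # Hypothetical flag: an overseer has reviewed and accepted this part.
    assured: bool


@dataclass(frozen=True)
class MesaOptimizerProposal:
    """The three components a new optimizer would be built from."""

    world_model: Optional[Component] = None
    objective: Optional[Component] = None
    optimization_process: Optional[Component] = None


def missing_or_unassured(proposal: MesaOptimizerProposal) -> List[str]:
    """Name every component that is unknown or not yet assured."""
    parts = {
        "world_model": proposal.world_model,
        "objective": proposal.objective,
        "optimization_process": proposal.optimization_process,
    }
    return [
        name
        for name, part in parts.items()
        if part is None or not part.assured
    ]


def may_create(proposal: MesaOptimizerProposal) -> bool:
    """Allow creation only when all three components are known and assured."""
    return not missing_or_unassured(proposal)


complete = MesaOptimizerProposal(
    world_model=Component("explicit map of the environment", assured=True),
    objective=Component("reach the goal", assured=True),
    optimization_process=Component("bounded tree search", assured=True),
)
partial = MesaOptimizerProposal(
    world_model=Component("explicit map of the environment", assured=True),
    objective=Component("unknown proxy", assured=False),
)
```

With these definitions, `may_create(complete)` is true, while `missing_or_unassured(partial)` names the objective and the optimization process as the reasons creation would be blocked.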
Having legible provenance of every optimizer occurring in the agent is a good step towards solving transparency. But this is not enough by itself, because provenance may itself be forged. In order for the provenance to be assured, an overseer must investigate any changes made to the optimization provenance, and verify that there is no optimization toward unaligned goals.
2. How possible is it for accidental treachery to be produced?
3. To what extent can Goodhart's curse be avoided with optimization provenance?
It seems likely that progress on the optimization provenance agenda could be leveraged to make progress on other subproblems. In particular, I think that developing frameworks for explicit models will make it easier to solve the inner alignment problem. I also believe that the idea of optimization provenance is a useful handle for thinking about mesa-optimizers in general.