Epistemic status: Exploratory

Summary: In this post I will decompose the alignment problem into subproblems and frame existing approaches in terms of their relations to the subproblems. I will try to place a larger focus on the epistemic process as opposed to results of this particular problem factorization, where the aim is to obtain an epistemic strategy that can be generalized to new problems.

# The case for problem decomposition

## Degrees of freedom

One way to frame the advantage of factoring a problem is that doing so allows degrees of freedom to add up instead of multiply. If the solution space of a problem $P$ contains $n$ degrees of freedom, then without decomposing the problem, we need to search through all possible combinations to find a solution. However, if we can decompose $P$ into two independent subproblems $P_1$ and $P_2$, where the degrees of freedom of each subproblem do not affect the other subproblem, then we get to search through the solutions of $P_1$ and $P_2$ independently, which means the solution spaces of $P_1$ and $P_2$ add up instead of combinatorially multiplying. It’s important to note that:

• The subproblems of our factoring need to be approximately independent. However, which degrees of freedom can be varied independently without affecting other subproblems is a feature of the problem space itself; we don’t get to choose the problem factorization
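To make the "add up instead of multiply" point concrete, here is a minimal sketch. The toy subproblems, the candidate lists, and the function names are all illustrative assumptions, not something from the argument above. It counts how many candidates are checked when two independent subproblems are searched jointly versus separately.

```python
from itertools import product


def joint_search(candidates_1, candidates_2, ok_1, ok_2):
    """Search the undecomposed problem: check every combination."""
    checked = 0
    for a, b in product(candidates_1, candidates_2):
        checked += 1
        if ok_1(a) and ok_2(b):
            return (a, b), checked
    return None, checked


def factored_search(candidates_1, candidates_2, ok_1, ok_2):
    """Search each subproblem on its own (valid only if they are independent)."""
    checked = 0
    solution = []
    for candidates, ok in ((candidates_1, ok_1), (candidates_2, ok_2)):
        found = None
        for c in candidates:
            checked += 1
            if ok(c):
                found = c
                break
        if found is None:
            return None, checked
        solution.append(found)
    return tuple(solution), checked


# Hypothetical toy problem: each subproblem has 100 candidates and one solution.
space_1 = list(range(100))
space_2 = list(range(100))
solves_1 = lambda a: a == 99
solves_2 = lambda b: b == 99

joint_solution, joint_checks = joint_search(space_1, space_2, solves_1, solves_2)
factored_solution, factored_checks = factored_search(space_1, space_2, solves_1, solves_2)
# joint_checks is 100 * 100 = 10000, factored_checks is 100 + 100 = 200
```

The factored search is only correct because `solves_1` never looks at the second component and vice versa; if the acceptance tests were coupled, the separate searches could not be combined this way.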

## Combining forward chaining with backward chaining

Problem factoring is a form of backchaining from desired end states. In addition to this approach, we can also forward-chain from the status quo to gain information about the problem domain, which may be helpful for finding new angles of attack. However, forward chaining is most effective when we have adequate heuristics that guide our search towards insights that are more useful and generalizable. One way to develop heuristics about which insights are generalizable is to keep a wide variety of problems on hand to which we can apply new techniques, and to bias our search towards insights that are helpful for multiple problems.

• We can do this by having a wide variety of problems from different fields, but we can also do this by having a wide variety of subproblems that come from factorizing the same problem
• Searching for insights that are useful for multiple subproblems can help us identify robust bottlenecks to alignment

Concretely, decomposing the alignment problem into subproblems means that whenever we stumble upon a new insight that may be relevant to alignment, we can try to apply it to each of the subproblems, & gain a more concrete intuition about what sorts of insights are useful. In addition, we can frame existing approaches in terms of how they can help us address subproblems of alignment, so that when we consider similar approaches, we can direct our focus onto the same set of subproblems.

# Scope

In this post we will focus on a narrow class of transformative AIs that can be roughly factored into three components:

• A world model
• A general purpose search (GPS) module, which takes a goal/optimization target and returns a plan for achieving that goal
• A targeting process which maps variables in the world model to the optimization target of general purpose search

While I do believe that it’s important to figure out how to align AIs with other possible architectures, we will not discuss them in this post. Nevertheless, the following are some justifications for focusing on TAIs that can be factored into a world model, a GPS module, and a targeting process:

• A world model seems necessary for a TAI as it allows the AI to respond to unobserved parts of the world
• General purpose search is instrumentally convergent:
  • An AI’s world model at any given point is likely incomplete: there are some causal relationships between the variables in its world model that the AI doesn’t know about yet, because the AI is smaller than the world
  • The AI may discover new instrumental subgoals when it learns about new causal relationships between variables in its world model
  • Concretely, the AI discovers a new instrumental subgoal upon learning a new causal pathway from some variable C in its world model to its terminal values
  • A powerful AI should be able to optimize for an instrumental subgoal upon discovering that subgoal, as doing so is an effective way of achieving its terminal goals
  • For the AI to be able to optimize for a new instrumental subgoal upon discovering it, it must be capable of optimizing for a wide variety of goals beforehand, since many goals can turn out to be instrumental subgoals
  • In addition, the AI should be able to flexibly set its optimization target so that it can optimize for new instrumental subgoals on the fly. This entails the existence of something like general purpose search.
• If we can control the optimization target of general purpose search, we can sidestep the inner alignment problem by retargeting the search
• We can leverage the TAI’s model of human values to decide the optimization target:
  • Powerful TAIs will likely have a mechanistic model of the behavior and goals of humans, as this is helpful for making accurate predictions about humans. We might want to use information from that mechanistic model to decide what goals the AI should optimize for, since our ideal target for alignment ultimately depends on human values, and a mechanistic model of humans contains information about those values. An example of this approach is for the AI to point to human goals inside its world model, and to let the AI optimize for the target of that pointer.
  • In order to accommodate this class of approaches, the optimization target should be able to depend on variables in the world model. We call this mapping from variables in the world model to the optimization target the targeting process
  • We might also want to consider AIs that optimize for a single fixed goal independent of variables in the world model. For this type of AI, we simply model its targeting process as a constant function
• The world model is dual-use:
  • On one hand, a better world model advances capabilities as it is used by general purpose search to select more effective actions
  • On the other hand, a better world model can allow the AI to have a better model of “what humans want”, which can lead to a more accurate optimization target given an adequate targeting process
• Alignment is mostly about designing the targeting process, but considerations about the targeting process may also influence design decisions for the world model and general purpose search
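The three-component factoring above can be sketched as a toy program. Everything here is a hypothetical illustration: the thermostat-like state, the variable name `modeled_human_preferred_temp`, the three actions, and the brute-force planner are assumptions chosen for brevity, not a claim about how a real TAI would be built. The sketch only shows the interfaces: the targeting process maps world-model variables to a goal, and the search module accepts any goal and returns a plan.

```python
from itertools import product
from typing import Callable, Dict, Tuple

State = Dict[str, float]
Goal = Callable[[State], float]  # higher score = better predicted state

# Hypothetical world model: predicted effects of each available action.
ACTIONS: Dict[str, Callable[[State], State]] = {
    "heat": lambda s: {**s, "room_temp": s["room_temp"] + 1},
    "cool": lambda s: {**s, "room_temp": s["room_temp"] - 1},
    "wait": lambda s: dict(s),
}


def general_purpose_search(state: State, goal: Goal, horizon: int = 3) -> Tuple[str, ...]:
    """Brute-force planner: works for any goal it is handed."""
    best_plan: Tuple[str, ...] = ()
    best_score = goal(state)
    for length in range(1, horizon + 1):
        for plan in product(ACTIONS, repeat=length):
            predicted = state
            for name in plan:
                predicted = ACTIONS[name](predicted)
            score = goal(predicted)
            if score > best_score:
                best_plan, best_score = plan, score
    return best_plan


def targeting_process(state: State) -> Goal:
    """Reads a variable of the world model and turns it into an optimization target."""
    wanted = state["modeled_human_preferred_temp"]
    return lambda s: -abs(s["room_temp"] - wanted)


def constant_targeting(state: State) -> Goal:
    """A fixed goal that ignores the world model (the constant-function case)."""
    return lambda s: s["room_temp"]


world = {"room_temp": 18, "modeled_human_preferred_temp": 20}
plan_from_model = general_purpose_search(world, targeting_process(world))
plan_fixed_goal = general_purpose_search(world, constant_targeting(world))

# If the AI's model of the human changes, the target changes with it.
updated_world = {"room_temp": 18, "modeled_human_preferred_temp": 17}
plan_after_update = general_purpose_search(updated_world, targeting_process(updated_world))
```

Note that `general_purpose_search` is identical in all three calls; only the output of the targeting process differs, which is the sense in which retargeting the search leaves the search module itself untouched.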

# Where does the information come from, and how do we plan on using it?

### Not all problems should be framed as optimization problems

There’s a tempting style of thinking which tries to frame all problems as optimization problems (see here and here). This style of thinking seems to make sense for dualistic agents: after all, the dualistic agent has preferences over the environment, it has well-defined input and output channels, and it can hold an entire model of the environment inside its mind. All that’s left to do is to optimize the environment against its preferences using the output channel.

However, we run into issues when we try to translate this style of thinking to embedded agents. The embedded agent has some degree of introspective uncertainty, including over its own preferences, which means it doesn’t always know what objective function to optimize for. The goals of the embedded agent may also depend on information in the environment that isn’t fully accessible to the agent. For instance, an embedded agent might try to satisfy the preferences of another agent, and because the agent is logically non-omniscient and smaller than the environment, it’s not straightforward to simply calculate expected utilities over all possible worlds. As a result, embedded agents can face many problems where most of the difficulty stems from finding an adequate set of criteria to optimize against, as opposed to finding out how to optimize against known criteria. The agent cannot just optimize against an arbitrary proxy for its objectives either, as that can lead to Goodhart failures.

The alignment problem is a central example where the main bottleneck hinges upon defining an objective as opposed to optimizing against it. And because framing all problems as optimization problems assumes that we already know the objectives, we need to find an alternative framework which helps us think about the task of formulating the problem itself.

### Desiderata, sources and bridges

One way to think about alignment which I find helpful is that we have human values on one hand, and the goals or optimization targets of the AI on the other, and we want to establish a bridge which allows information to flow from the former to the latter. We might need to formulate properties that we want this bridge to have, drawing inspiration from many different places, or try to implement properties that we already think are desirable. The following are some important features of this picture that are different from the dualistic optimization viewpoint:

• Desiderata vs objective functions: In this picture, we want to come up with desiderata which tell us things like “how can I recognize an adequate solution if I see one?” or “how can I recognize an adequate formalization of the problem if I see one?”. Although it seems like both desiderata and objective functions are to be optimized against, there are some important differences:
  • Defeasibility: For a dualistic agent, the objective function which it optimizes against can never be ‘wrong’. However, as embedded agents, we have introspective uncertainty over our own values, which means our proposed desiderata are subject to revision. Desiderata can be used to narrow our search space, but we should also test them by searching for counterexamples
  • Meta-ness: Desiderata don’t have to specify what constitutes a good solution; they can also specify what constitutes a good formulation of the problem, or specify a way to specify a good formulation of the problem, and so on
    • For alignment, we can picture this as establishing a sequence of bridges, where bridge $B_0$ allows information to flow from humans to the goals of the AI, and bridge $B_{n+1}$ allows information to flow from humans to bridge $B_n$
    • Allowing our desiderata to be “meta” allows us to consider approaches such as indirect normativity, which may be important when it’s infeasible to formulate the object-level problems ourselves
• Sources and bridges: For an optimization problem in the dualistic context, we have input variables which we get to vary, and our only job is to find the input which maximizes our objective function. For embedded agents, however, the problem definition itself may depend on variables in the environment which we don’t get to directly perceive, which means we not only need to consider the degrees of freedom which we get to control, but also the sources of information about where to find good solutions and desiderata. Problem solving can be thought of as establishing a bridge which flows from the sources of information and the degrees of freedom we get to control to the desiderata
  • Note that this “bridge” of information flow doesn’t have to route through us: For instance, we might design an auction with the intention of achieving efficiency, and this objective depends on the preferences of the participants. However, when we run the auction, we never observe the full preferences of any participant; we merely establish a mechanism through which that information can be used to satisfy our desiderata
  • Focusing on sources of information also allows us to create more realistic bounds for an embedded agent’s performance, where we consider the best we can do given not just what we can control but also what we know.
• Main benefits of this framing:
  • For problems without an adequate formalization yet, this framing highlights that desiderata can be defeasible, and that we might want to use indirect approaches which operate at a meta-level
  • For high-dimensional optimization problems, this framing places the focus on identifying information about where to find good solutions
  • For problems whose definitions depend on unknown variables, this framing puts emphasis on identifying those variables using sources of information that are available to us
  • In certain cases, the problem definitions depend on variables that are unobservable to us. This framing nevertheless allows us to consider solutions which route through those unobservables but not through us
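The auction example in the list above can be illustrated with a small simulation. This is a sketch under stated assumptions: I use an ascending-price (clock) auction as the mechanism, and the bidder names, private values, and the `ascending_auction` function are all hypothetical. The designer only ever sees who is still willing to stay in at each price, yet the item still ends up with the bidder who values it most.

```python
from typing import Callable, Dict, Tuple


def ascending_auction(
    wants_to_stay: Dict[str, Callable[[int], bool]], start: int = 0, step: int = 1
) -> Tuple[str, int]:
    """Raise the price until one bidder remains.

    The mechanism only queries each bidder with "do you stay at this price?".
    It never reads anyone's private value directly.
    """
    price = start
    active = list(wants_to_stay)
    while len(active) > 1:
        price += step
        staying = [name for name in active if wants_to_stay[name](price)]
        if not staying:
            # Everyone dropped out at once: break the tie by listing order.
            return active[0], price
        active = staying
    return active[0], price


# Private values live only inside each bidder's decision rule (hypothetical numbers).
private_values = {"ann": 7, "bob": 12, "cat": 9}
bidders = {
    name: (lambda price, value=value: price <= value)
    for name, value in private_values.items()
}

winner, final_price = ascending_auction(bidders)
# The winner is the highest-value bidder, but the auction stops at the price
# where the runner-up quits, so the winner's own value is never revealed.
```

In this run the designer learns roughly where the losing bidders dropped out, but only a lower bound on the winner's value: the information needed for an efficient allocation was used without ever passing through the designer in full.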

# Decomposing the AI alignment problem

Our main objective is to find optimization targets that lead to desirable outcomes when optimized against, and there are different sources of information which tell us what properties we want our optimization targets to have. To factorize the alignment problem, a natural place to start is to factorize these sources of information which can help us narrow our search space for our optimization targets.

One axis of factorization is the information that we have a priori vs a posteriori, that is, what information do we have before the AI starts developing a world model, vs after we have access to its world model? These two cases seem to be mostly independent because gaining access to an AI’s world model gives us new information that isn’t accessible to us a priori.

## A priori

When we haven’t started training an AI and we don’t have access to the AI’s world model, there are two constraints that limit the information we have about what optimization targets are desirable:

• We don’t have access to the AI’s ontology of the world, which means that if we have certain preferences over real world objects, we can’t make assumptions about how that real world object will be represented by the AI’s world model
• The AI hasn’t developed a world model, which means it doesn’t have a mechanistic model of humans yet. As a result, we cannot leverage the AI’s model of human values to determine properties of the optimization target

When we don’t have access to certain types of information, we want to seek considerations which don’t make assumptions about them. As a result, when we don’t have access to the AI’s ontology and its model of human values, we should seek ontology-invariant and value-free considerations:

### Value-free considerations

The main benefit of allowing the optimization target of an AI to depend on variables in the world model is that we can potentially “point” to human values inside the world model & set it as the optimization target. However, that information isn’t available when the AI hasn’t developed a world model yet, and our introspective uncertainty bars us from directly specifying our own values in the AI’s ontology, which means at this stage we should seek desirable properties of the optimization target that don’t depend on contingent features of human values. We call such considerations “value-free”.

Since value-free considerations don’t make assumptions about contingent properties of human values, they must be universal across a wide variety of agents. In other words, to search for value-free properties, we should focus on properties of the optimization target which are instrumentally convergent for agents with diverse values.

Examples of value-free considerations

• Natural latents are features of the environment which a wide variety of agents would convergently model as latent variables in their ontologies, and having something as a latent variable in your ontology is a prerequisite for caring about that thing. As a result, figuring out what properties of the environment are natural latents can help us narrow down the space of things that we might want our AIs to care about, which would give us a better prior over the space of possible optimization targets. Since natural latents are instrumentally convergent, we don’t need to make assumptions about contingent properties of human values to discover them.
• Corrigibility/impact measures/mild optimization: For agents with introspective uncertainty over their own values, there may be features in the environment that they “unconsciously” care about and have optimized for without being fully aware of it. This means that techniques such as corrigibility, impact measures, and mild optimization, which systematically avoid side effects, can be convergently useful for agents with introspective uncertainty, as they can help preserve the features that the agent is unaware that it cares about. Insofar as these properties are value-free, we can imbue them in the optimization target before we have a specification of human values.

### Ontology-invariant considerations

Not having access to the AI’s world model means that we don’t know how the internal representations of the AI correspond to physical things in the real world. This means that when we have preferences about real world objects, we don’t know how that preference should be expressed in relation to the AI’s internal representations. In other words, we don’t know how to make the AI care about apples and dogs when we don’t know which parts of the AI’s mind point to apples and dogs.

When we face such limitations, we should seek properties of the optimization target that are desirable regardless of what ontology the AI might end up developing; when we don’t know how the AI will describe the world, we can still implement the parts of our preferences which don’t depend on which description of the world will end up being used.

Examples of ontology-invariant considerations

• Staying in distribution: We have a preference for AIs to operate within contexts that they have been in before, so that they can avoid out-of-distribution failures. The hope is that the concept of “out of distribution” is expressible over a wide range of world models. Insofar as this is true, we can implement this preference without knowing beforehand what specific ontology the AI will end up using.
• Optimizing worst-case performance: When we’re especially concerned about the worst-case outcome, we can design our AIs to optimize for their performance in the worst possible world. This preference can be implemented in most world models which are capable of representing uncertainty, which means we don’t need to know what specific representation our AI will use when we implement it.
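As a rough illustration of the worst-case criterion in the list above, the sketch below compares an expected-value chooser with a maximin chooser. The action names, world names, payoffs, and probabilities are made-up assumptions. The relevant feature is that `worst_case_choice` only needs a set of possible worlds and a payoff for each, not any particular ontology for describing those worlds.

```python
from typing import Dict

Payoffs = Dict[str, Dict[str, float]]  # action -> possible world -> payoff

# Hypothetical payoff table and beliefs, chosen only for illustration.
payoffs: Payoffs = {
    "cautious": {"benign_world": 5, "adversarial_world": 4},
    "aggressive": {"benign_world": 10, "adversarial_world": -20},
}
belief = {"benign_world": 0.9, "adversarial_world": 0.1}


def expected_value_choice(payoffs: Payoffs, belief: Dict[str, float]) -> str:
    """Pick the action with the highest probability-weighted payoff."""
    return max(
        payoffs,
        key=lambda action: sum(belief[w] * payoffs[action][w] for w in belief),
    )


def worst_case_choice(payoffs: Payoffs) -> str:
    """Maximin: pick the action whose worst possible world is least bad."""
    return max(payoffs, key=lambda action: min(payoffs[action].values()))


ev_action = expected_value_choice(payoffs, belief)   # "aggressive" (7.0 vs 4.9)
maximin_action = worst_case_choice(payoffs)          # "cautious" (4 vs -20)
```

The two criteria disagree here because the expected-value chooser is willing to accept a small chance of a very bad outcome, while the maximin chooser is not.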

The main benefit of a priori properties of the optimization target is that they can be deployed before the AI starts developing a sophisticated world model. In other words, they are more robust to scaling down.

## A posteriori

We’ve discussed two limitations in the a priori stage, when we don’t have access to the AI’s world model. The main question we should ask in the a posteriori stage is therefore: what opportunities are unlocked once those limitations are lifted? What new sources of information do we gain access to which we previously didn’t?

### The AI gets to observe us

Once the AI develops a sophisticated world model, that world model will likely contain information about human values. This means that a key consideration in the a posteriori stage is how we can leverage that information to determine properties of the optimization target.

Examples

• The Pointers Problem: We want the AI to optimize for the real world things that we care about, not just our evaluation of outcomes. In order to achieve that, we need to figure out the correspondence between latent variables of our world models and the real world variables that they represent. Formalizing this correspondence in the AI’s ontology is a prerequisite for translating our preferences into criteria about which real world outcomes are desirable
• Ontology identification: The variables or even the structure of the world model might change as the AI receives new observations, which means that if our optimization target is expressed in terms of variables of the AI’s world model, then it needs to be robust against possible changes in the way that the AI represents the world. In other words, we need to figure out how to robustly “point” to things in the territory even when our maps can change over time
• Simulated long reflection: In addition to translating our current values to the AI’s optimization target, we might also want to use AIs to help us find the ideal values which we would endorse upon reflection. This will become more feasible if we can isolate a mechanistic model of humans from the AI’s world model and use that to simulate our reflection process
• Active value learning: Science isn’t just about building models using existing observations, we also take actions or conduct experiments to gain new information about the domain we’re interested in. Given that the AI’s model of us can be imperfect, how can the AI ask the right questions/choose the right actions to gain information about our values?
• Type signatures and true names: In order to leverage information about human values using variables in the AI’s world model, we need to be able to locate them inside the world model and interpret them in the right ways. In other words, we need to understand the type signatures of concepts such as “values” or “agents”, so that we can look for structures in the AI’s world model which match that type signature, and decode those structures correctly

### We get to observe the AI’s world model

The second limitation that’s lifted when the AI starts developing a world model is that we get to inspect the world model and gain information about the AI’s ontology. This means that in addition to the AI gaining a better understanding of our values, we can also become better at designing the targeting process ourselves by understanding the AI’s world model

Examples

• Interpretability: If we can figure out the relationship between variables in the AI’s world model and the real-world things they correspond to, we can manually design the optimization target to point to the real world things we care about
  • This can be viewed as the dual of the pointers problem: for interpretability, we are figuring out the relationship between the AI’s world model and the real world, while in the pointers problem we want the AI to understand the relationship between latent variables in humans’ world models and real-world variables
• Accelerated reflection: Although simulated long reflection should be faster than our actual reflection process, it relies on a human model which may be inaccurate, causing possible deviations from our actual reflection process. This suggests that there are comparative advantages both in using the AI’s model of our minds for reflection and in using our actual minds for reflection. We might want to combine the benefits of both using techniques such as debate, market making, and cyborgism

# Backpropagation

Our discussion has mainly focused on considerations about the targeting process, but the targeting process is entangled with the world model and the general purpose search module. This means that we should backpropagate our desiderata for the targeting process to inform design decisions about the rest of the components. For instance, if we want our optimization target to be robust to ontology shifts, we should try to design world models which are capable of modeling the world at multiple levels of abstraction and explicitly representing the relationships between different levels.

I do note that General Purpose Search can be almost reduced to learning, in that most things that you want General Purpose Search to do can also be done by learning, though I do think General Purpose Search will at least be the foundation/bootstrapping for learning:

https://x.com/nc_znc/status/1532040663302381568

https://x.com/andy_l_jones/status/1532048580747309056

Interesting! I have to read the papers in more depth but here are some of my initial reactions to that idea (let me know if it’s been addressed already):

• AFAICT using learning to replace GPS requires either 1) training examples of good actions or 2) an environment like chess where we can rapidly gain feedback through simulation. Sampling from the environment would be much more costly when these assumptions break down, and general purpose search can enable lower sample complexity because we get to use all the information in the world model
• General purpose search requires certain properties of the world model that seem to be missing in current models. For instance, decomposing goals into subgoals is important for dealing with a high-dimensional action space, and that requires a high degree of modularity in the world model. Lazy world-modeling also seems important for planning in a world larger than yourself, but most of these properties aren’t present in the toy environments we use
• Learning can be a component of general purpose search (eg as a general purpose generator of heuristics), where we can learn to rearrange the search ordering of actions so that more effective actions are searched first
• I think using a fixed number of forward passes to approximate GPS will eventually face limitations in environments that are complex enough, because the space of programs which can dedicate potentially unlimited time to finding solutions is strictly more expressive than the space of programs that have a fixed inference time
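The third bullet above (learning as a generator of heuristics that reorders the search) can be sketched as follows. This is only an illustration: the number puzzle, the two actions, and the hand-written `heuristic_order` function standing in for a learned heuristic are all assumptions. The search procedure is the same in both runs; only the order in which actions are tried differs.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, Callable[[int], int]]
ACTIONS: List[Action] = [("inc", lambda x: x + 1), ("double", lambda x: x * 2)]


def fixed_order(state: int, target: int) -> List[Action]:
    """No heuristic: always try actions in the same order."""
    return ACTIONS


def heuristic_order(state: int, target: int) -> List[Action]:
    """Stand-in for a learned heuristic: try the action that lands closest first."""
    return sorted(ACTIONS, key=lambda action: abs(action[1](state) - target))


def depth_limited_search(state, target, depth, order, counter) -> Optional[List[str]]:
    """Depth-first search; counter[0] tracks how many states were visited."""
    counter[0] += 1
    if state == target:
        return []
    if depth == 0:
        return None
    for name, step in order(state, target):
        plan = depth_limited_search(step(state), target, depth - 1, order, counter)
        if plan is not None:
            return [name] + plan
    return None


def apply_plan(state: int, plan: List[str]) -> int:
    """Execute a plan so we can confirm it really reaches the target."""
    steps = dict(ACTIONS)
    for name in plan:
        state = steps[name](state)
    return state


fixed_count, guided_count = [0], [0]
fixed_plan = depth_limited_search(1, 16, 4, fixed_order, fixed_count)
guided_plan = depth_limited_search(1, 16, 4, heuristic_order, guided_count)
```

Both runs return a valid plan, but the guided run visits fewer states. The heuristic does not replace the search; it only changes which branches are explored first.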

Agree, learning can't entirely replace General Purpose Search, and I agree something like General Purpose Search will still in practice be the backbone behind learning, due to your reasoning.

That is, General Purpose Search will still be necessary for AIs, if only due to bootstrapping concerns, and I agree with your list of benefits of General Purpose Search.

My piece on Steering subsystems is highly overlapping with your decomposition into a targeting process. I argue that effective AGI will need to have such a thing for similar reasons to those you present, but your presentation is different and quite possibly better than mine was.

Thanks! I recall reading the steering subsystems post a while ago & it matched a lot of my thinking on the topic. The idea of using variables in the world model to determine the optimization target also seems similar to your "Goals selected from learned knowledge" approach (the targeting process is essentially a mapping from learned knowledge to goals).

Another motivation for the targeting process (which might also be an advantage of GSLK) that I forgot to mention is that we can allow the AI to update its goals as it updates its knowledge (eg about what the current human values are), which might help us avoid value lock-in.

Right! I'm pleased that you read those posts and got something from them.

I worry less about value lock-in and more about The alignment stability problem which is almost the opposite.

But more recently I've been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent alignment) from authorized user(s). It's also the more appealing option to people actually in charge of AGI projects. They like being in charge, and of course everyone likes their values better than the average of all humanity's values.

Good point!

> But more recently I've been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent alignment) from authorized user(s).

I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want to be kept in the loop, & it will be corrigible insofar as we want it to be corrigible.

But a particular regime I'm worried about (for both PIA & VA) is when the AI has an imperfect model of the users' goals inside its world model & optimizes for them. One way to mitigate this is to implement some systematic ways of avoiding side effects (corrigibility/impact regularization) so that we won’t need to require a perfect optimization target, another is to allow the AI to update its goals as it improves its model of the users’ values.

Instruction-following AI can also help with this, though I think it might imply a higher alignment tax, but it’s also probably easier to build.

> I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want to be kept in the loop, & it will be corrigible insofar as we want it to be corrigible.

Good point. I agree that the wrong model of user's preferences is my main concern and most alignment thinkers'.  And that it can happen with a personal intent alignment as well as value alignment.

This is why I prefer instruction-following to corrigibility as a target. If it's aligned to follow instructions, it doesn't need nearly as much of a model of the user's preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like "Okay, I've got an approach that should work. I'll engineer a gene drive to painlessly eliminate the human population". "Um okay, I actually wanted the humans to survive and flourish while solving cancer, so let's try another approach that accomplishes that too...". I describe this as do-what-I-mean-and-check, DWIMAC.

The Harms version of corrigibility is pretty similar in that it should take instructions first and foremost, even though it's got a more elaborate model of the user's preferences to help in interpreting instructions correctly, and it's supposed to act on its own initiative in some cases. But the two approaches may converge almost completely after a user has given a wise set of standing instructions to their DWIMAC AGI.

Also, accurately modeling short-term intent - what the user wants right now - seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it's also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.

Absent all of that, it seems like there are still two advantages to modeling just one person's values instead of all of humanity's. The smaller one is that you don't need to understand as many people or figure out how to aggregate values that conflict with each other. I think that's not actually that hard since lots of compromises could give very good futures, but I haven't thought that one all the way through. The bigger advantage is that one person can say "oh my god don't do that it's the last thing I want" and it's pretty good evidence for their true values. Humanity as a whole probably won't be in a position to say that before a value-aligned AGI sets out to fulfill its (misgeneralized) model of their values.

Doesn't easier to build mean lower alignment tax?

> The Harms version of corrigibility is pretty similar in that it should take instructions first and foremost, even though it's got a more elaborate model of the user's preferences to help in interpreting instructions correctly, and it's supposed to act on its own initiative in some cases. But the two approaches may converge almost completely after a user has given a wise set of standing instructions to their DWIMAC AGI.

Note that the link to the Harms version of corrigibility doesn't work.

Thank you! Fixed.

> Good point. I agree that the wrong model of user's preferences is my main concern and most alignment thinkers'. And that it can happen with a personal intent alignment as well as value alignment.
>
> This is why I prefer instruction-following to corrigibility as a target. If it's aligned to follow instructions, it doesn't need nearly as much of a model of the user's preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like "Okay, I've got an approach that should work. I'll engineer a gene drive to painlessly eliminate the human population". "Um okay, I actually wanted the humans to survive and flourish while solving cancer, so let's try another approach that accomplishes that too...". I describe this as do-what-I-mean-and-check, DWIMAC.

Yes, I also think that is a consideration in favor of instruction following. I think there’s an element of IF which I find appealing, it’s somewhat similar to bayesian updating: When I tell an IF agent to “fill the cup”, on one hand it will try to fulfill that goal, but it also thinks about the “usual situation” where that instruction is satisfied, & it will notice that the rest of the world remains pretty much unchanged, so it will try to replicate that. We can think of the IF agent as having a background prior over world states, and it conditions that prior on our instructions to get a posterior distribution over world states, & that’s the “target distribution” that it’s optimizing for. So it will try to fill the cup, but it wouldn’t build a dyson sphere to harness energy & maximize the probability of the cup being filled, because that scenario has never occurred when a cup has been filled (so that world has low prior probability).
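To make the conditioning picture concrete, here is a minimal sketch of "background prior conditioned on the instruction". The world states, the prior probabilities, and the helper names (`satisfies_fill_cup`, `condition`) are all hypothetical and chosen only for illustration; a real agent would not have an explicit table like this.

```python
# Toy illustration only: three hypothetical world states with an assumed
# background prior over "how the world usually looks".
prior = {
    "cup_full_world_unchanged": 0.30,
    "cup_empty_world_unchanged": 0.6999,
    "cup_full_dyson_sphere_built": 0.0001,
}


def satisfies_fill_cup(state: str) -> bool:
    """Hypothetical predicate: does this state satisfy the instruction 'fill the cup'?"""
    return state.startswith("cup_full")


def condition(prior: dict, predicate) -> dict:
    """Condition the prior on the instruction being satisfied (Bayes with a 0/1 likelihood)."""
    mass = sum(p for s, p in prior.items() if predicate(s))
    if mass == 0:
        raise ValueError("instruction has zero prior probability of being satisfied")
    return {s: (p / mass if predicate(s) else 0.0) for s, p in prior.items()}


# The posterior is the "target distribution" the agent optimizes for.
posterior = condition(prior, satisfies_fill_cup)
target = max(posterior, key=posterior.get)
print(target, posterior)
```

Under these assumed numbers, almost all of the posterior mass sits on the world where the cup is full and nothing else changed, because the Dyson sphere world had negligible prior probability to begin with.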

I think this property can also be transferred to PIA and VA, where we have a compromise between “desirable worlds according to model of user values” and “usual worlds”.

Also, accurately modeling short-term intent - what the user wants right now - seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it's also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.

Absent all of that, it seems like there are still two advantages to modeling just one person's values instead of all of humanity's. The smaller one is that you don't need to understand as many people or figure out how to aggregate values that conflict with each other. I think that's not actually that hard since lots of compromises could give very good futures, but I haven't thought that one all the way through. The bigger advantage is that one person can say "oh my god don't do that it's the last thing I want" and it's pretty good evidence for their true values. Humanity as a whole probably won't be in a position to say that before a value-aligned AGI sets out to fulfill its (misgeneralized) model of their values.

Agreed, I also favor personal intent alignment for those reasons, or at least I consider PIA + accelerated & simulated reflection to be the most promising path towards eventual VA

Doesn't easier to build mean lower alignment tax?

It’s part of it, but alignment tax also includes the amount of capabilities that we have to sacrifice to ensure that the AI is safe. The way I think of alignment tax is that for every optimization target, there is an upper bound on the optimization pressure that we can apply before we run into goodhart failures. The closer the optimization target is to our actual values, the more optimization pressure we get to safely apply. & because each instruction only captures a small part of our actual values, we have to limit the amount of optimization pressure we apply (this is also why we need to avoid side effects when the AI has an imperfect model of the users’ preferences).
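A toy sketch of the "upper bound on optimization pressure" idea, under strong assumptions: the proxy rewards a single quantity `x` linearly, and the true value equals the proxy minus a quadratic penalty whose size (`mismatch`) stands in for how far the optimization target is from our actual values. The functional form and all names are hypothetical; this is not a model of any real training setup.

```python
def proxy_value(x: float) -> float:
    """What the optimizer is told to maximize (assumed linear in pressure x)."""
    return x


def true_value(x: float, mismatch: float) -> float:
    """Assumed true value: tracks the proxy at low pressure, diverges at high pressure."""
    return proxy_value(x) - mismatch * x * x


def safe_pressure(mismatch: float, levels: list) -> float:
    """Largest-payoff pressure level: pushing past it makes the true value fall."""
    return max(levels, key=lambda x: true_value(x, mismatch))


levels = [i * 0.5 for i in range(41)]  # pressure levels 0.0, 0.5, ..., 20.0

close_target = safe_pressure(0.05, levels)  # target close to actual values
far_target = safe_pressure(0.25, levels)    # target far from actual values

print(close_target, far_target)
```

In this toy, the closer target tolerates more pressure before the true value turns over, and at maximum pressure the far target yields a negative true value even though the proxy keeps increasing.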

Re this:

It’s part of it, but alignment tax also includes the amount of capabilities that we have to sacrifice to ensure that the AI is safe. The way I think of alignment tax is that for every optimization target, there is an upper bound on the optimization pressure that we can apply before we run into goodhart failures. The closer the optimization target is to our actual values, the more optimization pressure we get to safely apply. & because each instruction only captures a small part of our actual values, we have to limit the amount of optimization pressure we apply (this is also why we need to avoid side effects when the AI has an imperfect model of the users’ preferences).

We can also get more optimization if we have better tools to aim General Purpose Search more so that we can correct the model if it goes wrong.

We can also get more optimization if we have better tools to aim General Purpose Search more so that we can correct the model if it goes wrong.

Yes, I think having an aimable general purpose search module is the most important bottleneck for solving inner alignment

I think things can still go wrong if we apply too much optimization pressure to an inadequate optimization target because we won’t have a chance to correct the AI if it doesn’t want us to (I think adding corrigibility is a form of reducing optimization pressure, but it's still desirable).

But a particular regime I'm worried about (for both PIA & VA) is when the AI has an imperfect model of the users' goals inside its world model & optimizes for them. One way to mitigate this is to implement some systematic ways of avoiding side effects (corrigibility/impact regularization) so that we won’t need to require a perfect optimization target, another is to allow the AI to update its goals as it improves its model of the users’ values.

I agree with using both of those methods. IMO a third way, or maybe a way to improve the two methods, is to use synthetic data on human values and instruction following early and fast. The big reason for using synthetic data here is both to improve the world model and to implement things like instruction following, by making datasets where superhuman AIs always obey the human master despite large power differentials, or value learning, where we offer large datasets of AIs always faithfully acting on the best of our human values, as described here:

https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/

The biggest reason I wouldn't be too concerned about that problem is that assuming no deceptive alignment/adversarial behavior from the AI, which is likely to be enforced by synthetic data, I think the problem you're talking about is likely to be solvable in practice, because we can just make their General Purpose Search/world model more capable without causing problems, which means we can transform this in large parts into a problem that goes away with scale.

More generally, this unlocks the ability to automate the hard parts of alignment research, which lets us offload most of the hard work onto the AI.

Yes, I think synthetic data could be useful for improving the world model. It's arguable that allowing humans to select/filter synthetic data for training counts as a form of active learning, because the AI is gaining information about human preference through its own actions (generating synthetic data for humans to choose). If we have some way of representing uncertainties over human values, we can let our AI argmax over synthetic data with the objective of maximizing information gain about human values (when synthetic data is filtered).
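As a rough sketch of "argmax over synthetic data to maximize information gain", assume the AI keeps a discrete belief over a few hypotheses about human values, and that each hypothesis predicts the probability a human filter keeps a given sample. The hypotheses, candidate samples, and probabilities below are made up for illustration.

```python
import math


def entropy(dist: list) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_info_gain(belief: list, accept_prob: list) -> float:
    """Prior entropy minus expected posterior entropy after seeing keep/reject.

    belief[h]      : current probability of value hypothesis h
    accept_prob[h] : assumed P(human keeps this sample | hypothesis h)
    """
    gain = entropy(belief)
    reject_prob = [1.0 - a for a in accept_prob]
    for likelihood in (accept_prob, reject_prob):
        p_outcome = sum(b * l for b, l in zip(belief, likelihood))
        if p_outcome > 0:
            posterior = [b * l / p_outcome for b, l in zip(belief, likelihood)]
            gain -= p_outcome * entropy(posterior)
    return gain


belief = [0.5, 0.5]  # two hypothetical value hypotheses, equally likely
candidates = {
    "uninformative_sample": [0.9, 0.9],   # both hypotheses predict the same filter decision
    "discriminating_sample": [0.9, 0.1],  # the hypotheses disagree about this sample
}

best = max(candidates, key=lambda name: expected_info_gain(belief, candidates[name]))
print(best)
```

The sample on which the hypotheses disagree is selected, since the human's keep/reject decision on it shifts the belief, while the other sample leaves the belief unchanged whatever the human does.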

I think using synthetic data for corrigibility can be more or less effective depending on your views on corrigibility and the type of AI we’re considering. For instance, it would be more effective under Christiano’s act-based corrigibility because we’re avoiding any unintended optimization by evaluating the agent at the behavioral level (sometimes even at thought level), but in this paradigm we’re basically explicitly avoiding general purpose search, so I expect a much higher alignment tax.

If we’re considering an agentic AI with a general purpose search module, misspecification of values is much more susceptible to goodhart failures because we’re applying much more optimization pressure, & it’s less likely that synthetic data on corrigibility can offer us sufficient robustness, especially when there may be systematic bias in human filtering of synthetic data. So in this context I think a value-free core of corrigibility would be necessary to avoid the side effects that we can’t even think of.

Note that whenever I say corrigibility, I really mean instruction following, ala @Seth Herd's comments.

Re the issue of Goodhart failures, maybe a kind of crux is how much do we expect General Purpose Search to be aimable by humans, and my view is that we will likely be able to get AI that is both close enough to our values plus very highly aimable because of the very large amounts of synthetic data, which means we can put a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.

Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:

https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/

Re the issue of Goodhart failures, maybe a kind of crux is how much do we expect General Purpose Search to be aimable by humans

I also expect general purpose search to be aimable, in fact, it’s selected to be aimable so that the AI can recursively retarget GPS on instrumental subgoals

which means we can put a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.

I think there’s a fundamental tradeoff between optimization pressure & correctability, because if we apply a lot of optimization pressure on the wrong goals, the AI will prevent us from correcting it, and if the goals are adequate we won’t need to correct them. Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman)

Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:

Yes, I consider this to be the central crux.

I think current models lack certain features which prevent the generalization of their capabilities, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs

I also think an adequate optimization target about the physical world is much more complex than a reward model for LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing

Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman)

This is what I was trying to say: in certain applications, like automating AI interpretability/alignment research, the tradeoff is not that harsh. I was also saying that a lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is hard and safe enough to use iterative methods to solve the problem.

Yes, I consider this to be the central crux.

I think current models lack certain features which prevent the generalization of their capabilities, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs

I also think an adequate optimization target about the physical world is much more complex than a reward model for LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing

I kind of agree: at least at realistic compute levels, say through 2030, lack of search is a major bottleneck to better AI. But a few things to keep in mind:

People at OpenAI are absolutely trying to integrate search into LLMs, see this example where they got the Q* algorithm that aced a math test:

https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type

Also, I don't buy that it was refuted, based on this, which sounds like a refutation but isn't actually a refutation, and they never directly deny it:

https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type#ECyqFKTFSLhDAor7k

Re today's AIs being weak evidence that alignment generalizes further than capabilities: I think the theoretical and empirical reasons why alignment generalizes further than capabilities are in large part (though not entirely) reducible to why it's generally much easier to verify that something has been done correctly than to actually execute the plan yourself:

2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.

3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.

4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa. For the AI, we aim to solve part a.) as a general part of outer alignment and b.) is the general problem of capabilities. It is much much much easier for people to judge and critique outcomes than actually materialize them in practice, as evidenced by the very large amount of people who do the former compared to the latter.

5.) Similarly, understanding of values and ability to assess situations for value arises much earlier and robustly in human development than ability to actually steer outcomes. Young children are very good at knowing what they want and when things don’t go how they want, even new situations for them, and are significantly worse at actually being able to bring about their desires in the world.
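The verification/generation gap in the quoted points can be illustrated with a standard toy problem, subset sum: checking a proposed answer takes one pass, while finding one by brute force may require enumerating many subsets. The numbers below are arbitrary and assumed distinct, and this is only an analogy for judging versus generating, not a claim about how reward models work.

```python
from itertools import combinations


def verify(numbers: list, target: int, candidate: tuple) -> bool:
    """Cheap check: candidate uses only available numbers and hits the target."""
    return set(candidate) <= set(numbers) and sum(candidate) == target


def generate(numbers: list, target: int):
    """Brute-force search; also returns how many subsets were examined."""
    checked = 0
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            checked += 1
            if sum(subset) == target:
                return subset, checked
    return None, checked


numbers = [3, 34, 4, 12, 5, 2]
solution, checked = generate(numbers, 9)
print(solution, checked, verify(numbers, 9, solution))
```

The search examines a number of subsets that can grow exponentially with the size of the list, whereas `verify` does a single membership check and sum regardless of how the candidate was found.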

Re this:

I also think an adequate optimization target about the physical world is much more complex than a reward model for LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing

I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is probably due to being more optimistic about the meta-plan for alignment research, due to both my models of how research progress works, plus believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction following AGIs/ASIs.

More concretely, I think that the adequate optimization target is actually deferrable, because we can mostly just rely on instruction following and not have to worry too much about adequate optimization targets for the physical world, since we can use the first AGIs/ASIs to do interpretability and alignment research that help us reveal what optimization targets to choose for.

This is what I was trying to say: in certain applications, like automating AI interpretability/alignment research, the tradeoff is not that harsh. I was also saying that a lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is hard and safe enough to use iterative methods to solve the problem.

Agreed

People at OpenAI are absolutely trying to integrate search into LLMs, see this example where they got the Q* algorithm that aced a math test:

https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type

Also, I don't buy that it was refuted, based on this, which sounds like a refutation but isn't actually a refutation, and they never directly deny it:

https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type#ECyqFKTFSLhDAor7k

Interesting, I do expect GPS to be the main bottleneck for both capabilities and inner alignment

it's generally much easier to verify that something has been done correctly than actually executing the plan yourself

Agreed, but I think the main bottleneck is crossing the formal-informal bridge, so it's much harder to come up with a specification X such that X ⟹ alignment, but once we have such a specification it'll be much easier to come up with an implementation (likely with the help of AI)

2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.

Yes, I think optimizing worst-case performance is one crucial part of alignment, it's also one

I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is probably due to being more optimistic about the meta-plan for alignment research, due to both my models of how research progress works, plus believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction following AGIs/ASIs.

Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself.  I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we'd get to control the optimization target for automating interpretability without worrying about unintended optimization).

Agreed

Then we've converged almost completely, thanks for the conversation.

Interesting, I do expect GPS to be the main bottleneck for both capabilities and inner alignment

So you're saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?

Agreed, but I think the main bottleneck is crossing the formal-informal bridge, so it's much harder to come up with a specification X such that X ⟹ alignment but once we have such a specification it'll be much easier to come up with an implementation (likely with the help of AI)

While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more.

I agree, though, that if we had a reliable way to cross the formal-informal bridge, it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.

Yes, I think optimizing worst-case performance is one crucial part of alignment, it's also one

My main thought on infrabayesianism is that while it is definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I'm not that comfortable with using infrabayesianism, even if it actually worked.

I also don't believe it's necessary for alignment/uncertainty either.

Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself.  I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we'd get to control the optimization target for automating interpretability without worrying about unintended optimization).

I wasn't totally thinking of simulated reflection, but rather automated interpretability/alignment research.

Yeah, a big thing I admit to assuming is that the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes. But I want to see your solution first, because I still think this research could well be useful.

Then we've converged almost completely, thanks for the conversation.

Thanks! I enjoyed the conversation too.

So you're saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?

yes, I think inner alignment is basically solved conditional on GPS working, for capabilities I think we still need some properties of the world model in addition to GPS.

While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more.

I agree, though, that if we had a reliable way to cross the formal-informal bridge, it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.

Agreed.

My main thought on infrabayesianism is that while it is definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I'm not that comfortable with using infrabayesianism, even if it actually worked.

I also don't believe it's necessary for alignment/uncertainty either.

yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don't think it's necessary for alignment (though I think some of their results or analogies of their results would show up in a full solution to alignment).

I wasn't totally thinking of simulated reflection, but rather automated interpretability/alignment research.

I intended "simulated reflection" to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.

Yeah, a big thing I admit to assuming is that the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes. But I want to see your solution first, because I still think this research could well be useful.

Thanks!

This comment is to clarify some things, not to disagree too much with you:

yes, I think inner alignment is basically solved conditional on GPS working, for capabilities I think we still need some properties of the world model in addition to GPS.

Then we'd better start cracking on how to get GPS into LLMs.

Re world modeling, I believe that while LLMs do have a world model in at least some areas, it's not all that powerful or all that reliable. IMO the meta-bottleneck on GPS/world modeling is that they were very compute-expensive back in the day; as compute and data rise, people will start trying to put GPS/world modeling capabilities in LLMs and will succeed far more often than in the past.

And I believe that a lot of the world modeling stuff will start to become much more reliable and powerful as a result of scale and some early GPS.

yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don't think it's necessary for alignment (though I think some of their results or analogies of their results would show up in a full solution to alignment).

Perhaps so, though I'd bet on synthetic data/automated interpretability being the first way we practically get a full solution to alignment.

I intended "simulated reflection" to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.

Thanks for clarifying that, now I understand what you're saying.