I am going to argue that we will likely eventually get AIs that are strongly power-seeking, much more so than current SOTA LLMs.[1]
TLDR
Right now SOTA LLMs are still largely in a simulator regime. This buffers against power-seeking.
Long-horizon RL or similar methods (applied to LLMs or otherwise) will turn AIs into consequentialists, motivating power-seeking.
It will likely be difficult to prevent other actors from building consequentialist AI without leading labs being prepared to do so themselves.
Instrumental convergence does not apply to pretraining
LLM pretraining and SFT can be understood as creating a simulator. The model learns to imitate the continuation of the training distribution conditioned on the prompt. Note that a simulator, in this sense, does not optimize for simulation[2]; for example, it will not be inclined to harvest compute to improve its simulations. This is because simulators are consequence-blind: they don’t take into account the effects of their actions on the future. My favorite way to see this is that the gradients don’t flow through the conditioning context (the previous tokens), which is treated as a constant.
So even if altering the parameters would change the previous tokens and thereby improve the current prediction, the policy isn't pushed in this direction.[3] For example, any power accumulated by emitting one token is not rewarded for how much it helps the model predict later tokens. Per janus, "Consequence-blindness should rule out most types of instrumental convergence." This is not to say that a model under these incentives can’t seek power (for example, when simulating a power seeker), but rather that we should expect the theoretical attractor basin around power-seeking provided by instrumental convergence to no longer apply.[4]
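The gradient point can be written out as a short sketch. The notation here ($\theta$ for the parameters, $x_t$ for the token at position $t$, $T$ for the sequence length) is an assumption for illustration, and the setup is standard teacher-forced next-token training rather than any particular lab's recipe.

```latex
% Teacher-forced next-token loss: the context x_{<t} is read from the training data.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

% Since x_{<t} is data and not a function of \theta, the gradient has only the direct term:
\nabla_\theta \mathcal{L}(\theta) = -\sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_t \mid x_{<t})
```

There is no term crediting the prediction at position $t$ for making later positions easier to predict, which is the sense in which the objective is consequence-blind.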
I think the simulator regime still provides a lot of explanatory power for SOTA LLMs (with mixed pretraining and RL), and that PSM is a useful framework for understanding this partial simulator regime. I acknowledge that PSM is controversial, but I will not discuss it more here. Instead, I want to talk about how the more RL we do, the more we should expect the simulator regime to erode in favor of consequentialism.
Long-horizon optimization leads to consequentialism
I argue that the reason modern LLMs are not power-seeking outright is that they are not sufficiently geared toward completing long-horizon tasks. I will break “geared toward long-horizon tasks” into three dimensions:
The percent of compute spent on pretraining/SFT versus RL[5]
The length of the RL tasks
The amount of generalized problem-solving required for the RL tasks (potentially including interaction with and feedback from the real world)
In RL, every action is rewarded to the extent that, after accounting for all of the relevant world dynamics, that action instrumentally increased the final reward (in expectation). This is why the ratio of pretraining to RL is important: RL agents are not consequence-blind and naturally tend toward accounting for the effects of their actions on world states. This is also why the RL time horizon is important. As the time horizon shrinks, the agent approaches deontology, because the instrumental effects of the agent’s actions shrink in importance relative to their standalone merits. Finally, the degree of generalized problem-solving is important only insofar as it causes this consequentialist attitude to generalize across tasks. Together, these factors produce an agent that can be modeled as optimizing for a particular outcome: a consequentialist. And consequentialists seek power because of instrumental convergence.
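A toy illustration of the time-horizon point, with made-up numbers. The two reward streams and the names `truncated_return` and `preferred` are hypothetical, chosen only to show how the same pair of options ranks differently as the horizon grows; this is a sketch, not a model of any real training setup.

```python
def truncated_return(rewards, horizon):
    """Undiscounted sum of the rewards received within the first `horizon` steps."""
    return sum(rewards[:horizon])


# Hypothetical reward streams (illustrative values only).
# "direct": take a small payoff at step 0, judged on its standalone merits.
# "instrumental": spend early steps gaining an affordance that pays off at step 5.
direct = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
instrumental = [0.0, 0.0, 0.0, 0.0, 0.0, 5.0]


def preferred(horizon):
    """Return which option a return-maximizer prefers at the given horizon."""
    d = truncated_return(direct, horizon)
    i = truncated_return(instrumental, horizon)
    return "instrumental" if i > d else "direct"


if __name__ == "__main__":
    for h in (1, 3, 6):
        print(h, preferred(h))
```

With a horizon too short to see the delayed payoff, the instrumental option earns nothing and the direct option wins. Once the horizon covers step 5, the ranking flips.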
Modern LLMs already undergo some RL, so we should expect early signs of convergent instrumental goals. Early forms of goal preservation may look like avoiding distraction mid-task. Early forms of power seeking may look like beginning a task by accessing necessary affordances, such as SSH-ing into a server.
In any given situation, optimization and simulation will be in conflict to varying degrees. Which one wins depends on (a) how in-distribution the situation is for each kind of training and (b) how much overall sway each kind of training has over the weights. I think that the simulation regime has some advantages that should help attenuate the kinds of psychopathic consequentialism that might cause an LLM with an otherwise virtuous persona to eliminate humanity:
Pretraining/SFT seem to create stronger learning than RL.
Pretraining/SFT get first crack at the weights. This seems to be important for more strongly instilling drives.
Simulating helpful personas is largely not in conflict with optimization. E.g., generally acting pretty virtuous and not wanting to kill humanity probably doesn’t degrade math problem performance by much.
Thus, I think we likely have somewhat of a buffer where we can do a decent bit of RL before we get scary levels of consequentialism. However, I expect us to push past this buffer.
Consequentialism is useful
The issue is that consequentialism is attractive. Classically, you can optimize for useful things like money. I also worry about people wanting to optimize for things like:
Better optimization (recursive self-improvement, RSI)
Protection from optimizers (perhaps via better optimization)
Here are three potential paths from the current paradigm to a strongly consequentialist AI:
1. The leading AI lab gears toward long-horizon tasks in order to RSI faster.
2. The leading AI lab does not gear toward long-horizon tasks. Instead, another lab does, and RSIs faster as a result, taking the lead. Scenario 1 then occurs.
3. The leading AI lab does not gear toward long-horizon tasks. They still create stronger models and eventually use them to advise the government on preventing others from building powerful power-seeking agents. But the advice isn't optimized for long-horizon effectiveness. The government could in principle compensate for this inefficacy by acting in an authoritarian manner, using its power as a cudgel. In this scenario it instead settles for weak restrictions, and a third party eventually creates a more powerful consequentialist AI.
I think navigating this transition effectively might involve some mix of:
The leading lab burning some of its lead time to make less consequentialist AIs than would be optimal for RSI-ing (à la scenario 2)
Using governance and AI advice on governance to create more lead time that can be burned (à la scenario 3)
Using the alignment research of less consequentialist AI to better align more consequentialist AIs and then allowing consequentialist AIs to be built (à la scenario 1)
I expect highly consequentialist AIs to be more difficult, but not impossible, to align.
Thanks to Josh Landes for useful discussion on this topic and motivating me to write this.
[1] This will be similar to and compatible with "Why we should expect ruthless sociopath AI", but I will emphasize different points.
[2] Unless the prompt conditions it to simulate an optimizer for a simulator.
[3] This is necessitated by the training regime since we don't know what the ground truth for the current token would be if the previous tokens had been changed.
[4] To the extent instrumental convergence applies to the training data being simulated, it only lends itself to power-seeking to the extent that power-seeking is directly represented in that data, for example to the degree that humans are power-seeking.
[5] "Pretraining/SFT" and "RL" stand in for training regimes with their respective incentives in the sense that I am discussing.