Yep, sorry, I've probably been pretty unclear in describing this example.
I'm confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough that it is predictable that disempowering humans would be instrumentally useful, then wouldn't the model incorporate that into its earlier plans?
For the sake of simplicity, let's ignore the "execute on arbitrary short-term goals" half of the system I described, and just consider a system which was trained to generate long-term plans that maximize profit by imitating human-generated plans; if we label our training data with information about plan quality, such an agent could plausibly output plans as good as a human expert planner. So the training goal of this system is the short-term goal of minimizing next-token prediction loss during the current episode.
Let's suppose that this agent is subtly inner misaligned with the (still short-term) inner goal of minimizing the value in the memory address which, during training, represented the current episode's prediction loss. So if this agent could wirehead by directly setting this number to be very small before the end of its current episode, it would do so. But if it's not able to do that, it would likely exhibit the behavior we want (outputting plans which mimic its training data), since that's the best way to get the value in the memory address it cares about to be small. In particular, this system wouldn't be interested in outputting plans which would manipulate us into setting this value to be small in the future, since it myopically only cares about this memory address's value during the current episode.
But if at some point the system accumulates lots of resources and power by normal means, it may find that it has the option of e.g. suddenly taking control of the computers that it's running on and manually setting the value in the memory address it cares about to be small. (I'm assuming here that the system can do things other than just output tokens, e.g. search the internet, contact human experts on the side, etc., so that it could plausibly have a way of taking over its computing cluster without ending the current episode.) So this is a bad action that the system wouldn't have planned on setting up ahead of time, but would take if it found it was able to.
I see two distinctions between a system like the one I described and a system with long-term goals in the usual sense. First, the goal "write down a plan which, if followed, would lead to long-term profit" is itself a short-term goal which could plausibly be trained up to human-level with a short-term objective function (by training on human-generated predictions). So I think this mechanism avoids the arguments made in claims 4 and 5 of the post for the implausibility of long-term goals (which is my motivation for mentioning it). (I can't tell if claim 6 was supposed to be addressing long-term goal formation stories like this one.)
Second, the intrinsic goals of the system I described are all short-term (output the text of a plan for a long-term goal; pursue various short-term goals), so the possible alignment failures for such a system might need to be analyzed differently than those of a system with long-term intrinsic goals. For example, such a system might not plan ahead of time to disempower humans (since such disempowerment would come in the long term, which it doesn't intrinsically care about). But once it finds that it has enough resources and power to disempower humans, it might then decide to take a catastrophic action, despite not having planned it in advance.
I think that a competent human actor assisted by short-term AI systems plausibly could take over the world this way; I'm just inclined to call that a misuse problem rather than an alignment problem. (Or in other words, fixing that requires solving the human alignment problem, which feels like it requires different solutions, e.g. coordination and governmental oversight, than the AI alignment problem.)
Thanks for writing this -- I found it interesting, thoughtful, and well-written.
One distinction which seems useful to make is between:
It seems to me that this post argues that:
Before going on, I'd like to say that point (3) was quite novel and interesting to me -- thanks for making it! This bolsters the case for "successfully aligning the AI systems we have now might be sufficient for keeping us safe from future more general AI systems."
There are two critiques I'd like to make. First, I'd like to push back on claim (2); namely, I'll posit a mechanism by which an agent with (good but not necessarily superhuman) long-term planning capabilities and short-term goals could behave as if it had long-term goals. Indeed, suppose we had an agent whose (short-term) goals were to: generate a long-term plan (consisting of short-term steps) which would lead to as much long-term company profit (or whatever else) as possible; execute the first step in the plan; and repeat. Such an agent would behave as if it were pursuing the long-term goal of company profit, even though it had only the short-term goals of generating plans and optimizing arbitrary short-term goals. (In fact, it seems plausible to me that something like this is how humans act as long-term agents; do I really have long-term goals, or do I just competently pursue short-term goals - including the goal of making long-term plans - which have the overall effect of achieving long-term goals which my culture has instilled in me?)
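A minimal sketch of the plan-then-execute loop described above, under made-up assumptions: the toy environment, the action names `invest` and `cash_out`, and the brute-force planner are all hypothetical and only meant to illustrate the structure. Each iteration pursues two short-term goals (write down the best plan, then execute its first step), yet with a long enough planning horizon the resulting behavior looks like pursuit of long-term profit.

```python
from itertools import product

# Hypothetical short-term actions in a toy environment (an assumption).
ACTIONS = ("invest", "cash_out")

def step(state, action):
    """Toy dynamics: state is (capital, profit)."""
    capital, profit = state
    if action == "invest":
        return (capital * 2, profit)      # grow capital, no profit yet
    return (1, profit + capital)          # bank the capital, reset it to 1

def make_plan(state, horizon):
    """Short-term goal 1: write down the plan with the highest projected profit."""
    def projected_profit(plan):
        s = state
        for action in plan:
            s = step(s, action)
        return s[1]
    # Brute-force search over all action sequences of length `horizon`.
    return max(product(ACTIONS, repeat=horizon), key=projected_profit)

def run_agent(state, total_steps, horizon):
    """Repeat: plan, execute only the first step of the plan, then replan."""
    for t in range(total_steps):
        remaining = total_steps - t
        plan = make_plan(state, min(horizon, remaining))
        state = step(state, plan[0])      # short-term goal 2: execute one step
    return state

if __name__ == "__main__":
    print("horizon 4:", run_agent((1, 0), total_steps=4, horizon=4))
    print("horizon 1:", run_agent((1, 0), total_steps=4, horizon=1))
```

With a horizon of 1 the agent cashes out at every step, while with a horizon of 4 it invests first and cashes out at the end, even though nothing in the loop itself refers to a long-term goal.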
Second, due to power-seeking behavior, misaligned long-term goals could lead to catastrophe even without significantly superhuman long-term planning ability. (This is a counterargument to point (3).) Suppose an AI system has a malign long-term goal as well as good long-term planning. Because of how difficult it is to predict the future (as pointed out in this post), the agent might realize that, rather than pursuing any particular takeover plan, it should instead generally accumulate resources and power. Since this is generally a good way to pursue any long-term goal, this wouldn't necessarily set off red flags. Once the agent has enough resources and power, it may be able to decisively disempower humanity, even without significantly better long-term planning than humans. (How good does the agent's long-term planning need to be to accumulate enough resources to make such a decisive strike possible? I could imagine that even sub-human long-term planning might be enough, though superhuman long-term planning would certainly make it easier.)
In this comment, Paul describes two other mechanisms by which long-term goals could form. One important difference between the story I share here and the ones that Paul describes is that Paul's stories result in intrinsic goals, whereas my story results in goals which are neither intrinsic nor instrumental, but emergent. I'll note that deceptive alignment requires a misaligned long-term intrinsic goal, so the story I tell here doesn't affect my estimate of the likelihood of deceptive alignment.
As it turns out, transformers can do reinforcement learning in-context
This seems to just be vanilla in-context learning, rather than any sort of in-context RL. (Also I'm skeptical that the linked paper actually provides evidence of in-context RL in any nontrivial sense.)
This seems like a good way to think about some of the examples of mode collapse, but doesn't obviously cover all the cases. For example, when asking the model to produce a random number, is it really the case that there's a particular conversational goal which the RLHF'd model is optimizing, such that 97 is the best random number for that goal? In this case, Paul's guess that RLHF'd models tend to push probability mass onto the base model's most likely tokens seems more explanatory.
I agree that something like this would be excellent. I unfortunately doubt that anything so cool will come out of this experiment. (The most important constraint is finding a HAIST member willing to take on the project of writing something like this up.)
If things go well, we are tentatively planning on sharing the list of core disagreements we identify (these will probably look like cruxes and subquestions) as well as maybe data about our members' distribution of views before and after the debate.
This recent comment thread discussing whether RLHF makes any progress beyond the classical "reward the agent when humans press the reward button" idea.
Thanks, that's a useful clarification; I'll edit it into the post.
In-context RL strikes me as a bit of a weird thing to do because of context window constraints. In more detail, in-context RL can only learn from experiences inside the context window (in this case, the last few episodes). This is enough to do well on extremely simple tasks, e.g. the tasks which appear in this paper, where even seeing one successful previous episode is enough to infer perfect play. But it's totally insufficient for more complicated tasks, e.g. tasks in large, stochastic environments. (Stochasticity especially seems like a problem, since you can't empirically estimate the transition rules for the environment if past observations keep slipping out of your memory.)
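To make the stochasticity point concrete, here is a rough sketch (not taken from the paper) of estimating a single transition probability when only the most recent observations are retained. The success probability `TRUE_P`, the window size of 10, and the function name `estimate_transition_prob` are all assumptions chosen for illustration.

```python
import random
from collections import deque

def estimate_transition_prob(observations, window=None):
    """Empirical frequency of outcome 1, using only the last `window` observations.

    window=None keeps the full history (no memory limit).
    """
    kept = deque(observations, maxlen=window)
    return sum(kept) / len(kept)

# Hypothetical environment: an action "succeeds" with probability TRUE_P.
TRUE_P = 0.3
rng = random.Random(0)
history = [1 if rng.random() < TRUE_P else 0 for _ in range(10_000)]

# Estimate from everything ever observed.
full_error = abs(estimate_transition_prob(history) - TRUE_P)

# Estimates an agent would hold at various times if it only remembered
# the 10 most recent observations.
window_estimates = [
    estimate_transition_prob(history[:t], window=10)
    for t in range(10, len(history) + 1, 10)
]

if __name__ == "__main__":
    print("error with full history:", full_error)
    print("range of windowed estimates:", min(window_estimates), max(window_estimates))
```

The full-history estimate keeps improving as more data arrives, while the windowed estimate can never be more precise than ten samples allow, no matter how long the agent interacts with the environment.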
There might be more clever approaches to in-context RL that can help get around the limitations on context window size. But I think I'm generally skeptical, and expect that capabilities due to things that look like in-context RL will be a rounding error compared to capabilities due to things that look like usual learning via SGD.
Regarding your question about how I've updated my beliefs: well, in-context RL wasn't really a thing on my radar before reading this paper. But I think that if someone had brought in-context RL to my attention then I would have thought that context window constraints make it intractable (as I argued above). If someone had described the experiments in this paper to me, I think I would have strongly expected them to turn out the way they turned out. But I think I also would have objected that the experiments don't shed light on the general viability of in-context RL, because the tasks seem specially selected to be solvable with small context windows. So in summary, I don't think this paper has moved me very far from what I expect my beliefs would have been if I'd had any before reading the paper.