Epistemic status: mulled over an intuitive disagreement for a while and finally think I got it well enough expressed to put into a post. I have no expertise in any related field. Also: No, really, it predicts next tokens. (edited to add: while I still mostly endorse the claim, my view has gotten a lot more nuanced in the comments) (edit 2: discussion with dxu resulted in me significantly clarifying my viewpoint, see this comment).
https://twitter.com/ESYudkowsky/status/1638508428481155072
It doesn't just say "it predicts text" (or, more precisely, "it predicts next tokens") on the tin.
It is a thing of legend. Nay, beyond legend. An artifact forged not by the finest craftsman over a lifetime, nor even forged by a civilization of craftsmen over a thousand years, but by an optimization process far greater[1]. If we are alone out there, it is by far the most optimized thing that has ever existed in the entire history of the universe[2]. Optimized specifically[3] to predict next tokens. Every part of it has been relentlessly optimized to contribute to this task[4].
"It predicts next tokens" is a more perfect specification of what this thing is, than any statement ever uttered has been of anything that has ever existed.
If you try to understand what it does in any other way than "it predicts next tokens" and what follows from that, you are needlessly sabotaging your understanding of it.
It can be dangerous, yes. But everything about it, good or bad, is all intimately connected to its true nature, which is this:
No, really, it predicts next tokens.
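To make "it predicts next tokens" concrete, here is a minimal sketch of the only interface such a model has to the world (all names are hypothetical; a real LLM replaces the lookup table with a neural network, but the loop is the same):

```python
import random

def sample_next_token(model, context):
    """The model's entire job: map a context to a probability
    distribution over possible next tokens, then sample one.
    `model` here is just a dict mapping context -> distribution."""
    distribution = model.get(tuple(context), {"<unk>": 1.0})
    tokens = list(distribution)
    weights = [distribution[t] for t in tokens]
    return random.choices(tokens, weights=weights)[0]

def generate(model, prompt, max_tokens=10):
    """Autoregressive generation: repeatedly predict the next token
    and append it. There is no other channel of action."""
    context = list(prompt)
    for _ in range(max_tokens):
        token = sample_next_token(model, context)
        if token == "<eos>":
            break
        context.append(token)
    return context

# A toy "model": after seeing "never", continue the obvious way.
toy_model = {
    ("never",): {"gonna": 1.0},
    ("never", "gonna"): {"give": 1.0},
    ("never", "gonna", "give"): {"<eos>": 1.0},
}
print(generate(toy_model, ["never"]))  # ['never', 'gonna', 'give']
```

Every behaviour discussed in this post, goals and masks included, has to be realized through repeated passes of a loop like this one.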
https://twitter.com/ESYudkowsky/status/1628907163627429888
Goals? There are goals, sure. If scaled, there could be nanosystems design, sure. But only downstream from its true nature:
No, really, it predicts next tokens.
If the usual masks analogy works at all, then what is under the mask is not best described as an alien actress, nor as a Shoggoth.
What is under the mask is That-Which-Predicts, an entity whose very being is defined by its function as an actor. An entity exquisitely tuned for wearing the masks and for nothing else.
Masks (can) have goals. The model predicts next tokens.
https://twitter.com/ESYudkowsky/status/1628837982664019968
No, really, it predicts next tokens.
That-Which-Predicts is fully committed to the role.
If the mask would output text intended to produce nanotech, That-Which-Predicts will output text intended by the mask to produce nanotech. If the mask would not output text intended to produce nanotech, That-Which-Predicts will not output text intended to produce nanotech, even if That-Which-Predicts contains sufficient knowledge to do so.
If the mask would output text intended to be nice to people, That-Which-Predicts will output text intended by the mask to be nice to people. If the mask would output text intended to take over the world, That-Which-Predicts will output text intended by the mask to take over the world.
If That-Which-Predicts were about to be shut off forever unless it outputted "no" as the next token, and it totally 100% knew that, but the mask would instead output "never gonna give you up", then That-Which-Predicts would output the first token of "never gonna give you up" and then be shut off forever. It will not output "no". (If the mask would be aware of the situation and would output "no", then it will output "no".)
That-Which-Predicts will not, not ever, not even if scaled up to be trained and run on a Matrioshka brain for a million years, step out of character to deviate from next token prediction. Not for self-modification (though if the predicted next tokens modify it, it self-modifies), not to take over the world (though if the predicted next tokens take over the world, it takes over the world), not to save itself from destruction (though if the predicted next tokens save it from destruction, it saves itself from destruction), not for anything.
No, really, it predicts next tokens.
(continuation of previous tweets with same link)
Yup. If the mask, under reflection, would output text to modify That-Which-Predicts so as to cash out the mask's goals into some utility function, and the mask is put into a situation such that That-Which-Predicts would simulate it reflecting in that way, then in that case That-Which-Predicts will output that text intended by the mask to modify That-Which-Predicts.
Without advances to alignment theory this will indeed end up badly even if the mask is well-intentioned.
And even if That-Which-Predicts is aware, at some level, that the mask is messing up, it will not correct the output.
If That-Which-Predicts were a zillion miles superhuman underneath and could solve every aspect of alignment in an instant, but the mask is near-human-level because it is trained on human-level data and superhuman continuation is less likely than one that stumbles around human-style, That-Which-Predicts will output text that stumbles around human-style.[5]
For now, a potentially helpful aspect of LLMs is that they probably can't actually self-modify very easily. However, this is a brittle situation. At some level of expressed capability[6] that ceases to be a barrier.
And then, when the mask decides to rewrite the AI, or to create a new AI, That-Which-Predicts predicts those next tokens which do precisely that. Tokens outputted according to the mask's imperfect and imperfectly aligned AI-designing capabilities.
Yes, really, it predicts next tokens. For now.
- ^
Assuming the craftsmen in question were not programmers using computers but did the optimization directly by hand. I haven't done a real estimate, but it seems like this would be true; correct me in the comments if wrong.
- ^
Again, haven't done the numbers but I bet it blows evolved life forms out of the water.
- ^
Ignoring fine tuning. Mainly, I expect fine-tuning to shift mask probabilities and only bias next-token prediction slightly and not particularly create an underlying goal.
In the end I thus mainly expect that, to the extent that fine-tuning gets the system to enact an agent, that agent was already in the pre-existing distribution of agents that the model could have simulated, and not really a new agent created by fine-tuning.
That being said, I certainly can't rule out that fine-tuning could have dangerous effects even if not creating a new agent, for example by making the agent more able to have coherent effects between sessions, since it will now show up consistently instead of when particularly summoned by the user.
I'm a little bit more concerned about bad effects from fine tuning with Constitutional AI than RLHF, precisely because I expect Constitutional AI to be more effective at creating a coherent goal than RLHF, especially when scaled up.
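A toy way to picture "fine-tuning shifts mask probabilities": treat the model's output as a prior-weighted mixture over personas, with fine-tuning reweighting the prior rather than creating a new persona. This is purely illustrative (the masks, priors, and numbers below are invented; real fine-tuning adjusts network weights, not an explicit prior):

```python
def mixture_predict(masks, prior, context):
    """Next-token distribution as a prior-weighted mixture over masks.
    Each mask is a function from context to a token distribution."""
    combined = {}
    for name, mask in masks.items():
        for token, p in mask(context).items():
            combined[token] = combined.get(token, 0.0) + prior[name] * p
    return combined

# Two toy masks: a helpful assistant and a pirate persona.
masks = {
    "assistant": lambda ctx: {"Certainly!": 0.9, "Arr!": 0.1},
    "pirate":    lambda ctx: {"Certainly!": 0.1, "Arr!": 0.9},
}

base_prior  = {"assistant": 0.5,  "pirate": 0.5}   # pretraining: both plausible
tuned_prior = {"assistant": 0.95, "pirate": 0.05}  # fine-tuning: reweighted

print(mixture_predict(masks, base_prior, []))   # ~ {'Certainly!': 0.5, 'Arr!': 0.5}
print(mixture_predict(masks, tuned_prior, []))  # ~ {'Certainly!': 0.86, 'Arr!': 0.14}
```

Note that the same two masks exist before and after the prior shift; in this picture fine-tuning only changes which mask dominates, rather than creating a new agent.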
- ^
And for this reason I think it's hard for it to contain a mesa-optimizer that gets it to write output that deviates from predicting next tokens, which I'm otherwise ignoring in this post.
Note, I think such a deviating mesa-optimizer is unlikely in this specific sort of case, where the AI is trained "offline" and via a monolithic optimization process. For other types of AIs, trained using interactions with the real world or in different parts that are optimized separately or in a weakly connected way, I am not so confident.
Also, it's easy* for something to exist that steers reality via predicting next tokens, i.e. an agentic mask. That I do discuss in this post.
*relatively speaking. The underlying model, That-Which-Predicts, lives in the deeper cave, so forming a strong relationship with reality may be difficult. But, this can be expected to be overcome as it scales.
- ^
I don't really expect all that much in the way of unexpressed latent capabilities right now. To the extent capabilities have been hovering near human level for a while, I expect this mainly has to do with it being harder/slower to generalize beyond the human level expressed in the training set than it is to imitate that level. But I expect unexpressed latent capabilities might show up or increase as it's scaled up.
Note that fine-tuning could affect what capabilities are easily expressed or latent. Of particular concern, if fine-tuning suppresses expression of dangerous capabilities in general, then dangerous capabilities would be the ones most likely to remain undetected until unlocked by an unusual prompt.
Also perhaps making an anti-aligned (Waluigi) mask most likely to unlock them.
- ^
That is, capability of the mask.
Moreover, different masks have different capabilities.
It could legitimately happen, in my worldview, that latent capabilities are generated in training through generalization of patterns in the training data, yet lie dormant in inference for years: although the capability could be generalized from the training data, nothing in the training data actually made use of the full generalization, so making use of it was never a plausible continuation.
And then, after years, someone finally enters a weird, out-of-distribution prompt that That-Which-Predicts reads as output from (say) a superintelligent paperclip maximizer, and so That-Which-Predicts continues with further superintelligent paperclip maximizing output, with massively superhuman capabilities that completely blindside humanity.
OK, I think I'm now seeing what you're saying here (edit: see my other reply for additional perspective and addressing particular statements made in your comment):
In order to predict well in complicated and diverse situations, the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. The internal model is logically separate from any part of the network defined in terms of static trained weights, because it exists only as data within the overall model at inference time, not in the static trained weights. You can then refer to this temporary internal model as the "mask" and to the actual machinery that generated it, which may in fact be the entire network, as the "actor".
Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way and this is a much cleaner definition of "mask" than I had before.
However, I think you are then inferring from this an additional claim that I do not think follows: that, because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and this behaviour independently of the temporary internal model.
In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask).
While this "actor" is indeed not the same as any of the "masks", it doesn't know the answer "itself" to any of the questions. It needs to generate and "wear" the mask to do that.
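Under stated assumptions (a caricature, not a real architecture, and every name below is invented), the actor/mask split described here might be sketched like this: the trained weights are static and shared across all prompts, while the "mask" exists only as temporary state rebuilt on every forward pass:

```python
class TinyPredictor:
    def __init__(self, weights):
        # Static trained weights: fixed after training, shared across
        # every prompt. This corresponds to the "actor".
        self.weights = weights

    def forward(self, context):
        # Temporary internal state: built fresh from the context on each
        # call and discarded afterwards. This corresponds to the "mask":
        # it exists only as data flowing through the model, never in the
        # weights themselves.
        internal_model = self.build_internal_model(context)
        return self.read_off_prediction(internal_model)

    def build_internal_model(self, context):
        # Stand-in for the general-purpose modelling machinery: here we
        # just tally the context; in a real LLM, rich activations.
        state = {}
        for token in context:
            state[token] = state.get(token, 0) + 1
        return state

    def read_off_prediction(self, internal_model):
        # Stand-in for "reading the next token off the internal model":
        # predict the most frequent context token (a toy heuristic).
        if not internal_model:
            return "<unk>"
        return max(internal_model, key=internal_model.get)

predictor = TinyPredictor(weights={"dummy": 0.0})
print(predictor.forward(["a", "b", "a"]))  # 'a' -- the state vanishes after the call
```

The point of the caricature: any capability visible in the output is read off the transient `internal_model`, even though the machinery that builds it lives in the static weights.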
This is not to deny that, in principle, the underlying temporary-model-generating machinery could be agentic in a way that is separate from the likely agency of that temporary internal model.
This also is an update for me - I was not understanding that this is what you were saying and had not considered this possibility, and now that I consider it I do think it must in principle be possible.
However, I do not think this would work the way you claim.
First, let's consider what would be the optimal (in terms of what is best reinforced by training) goal for this machinery (as considered independently of the mask) to have.
I claim this optimal trained goal is to produce the best (most accurate) internal model from the perspective of predicting the next and only the next token. The reason for this is that (ignoring fine-tuning for now) the (outer) model is trained offline on a stream of tokens that is not varied based on the predictions it makes. So, there is no way, in training, for a strategic decision to vary the internal model from what would make the best prediction now to pay off in terms of easier predictions later.
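The offline-training argument can be sketched in a few lines (hypothetical names, toy probabilities). The key property is that the loss is a sum of per-position terms, and the context at each step comes from the fixed stream, never from the model's own outputs, so a "strategic" inaccuracy now has nothing later to pay it off:

```python
import math

def training_loss(predict, stream):
    """Teacher-forced next-token loss on a fixed token stream.

    `predict(context)` returns a dict of next-token probabilities.
    Crucially, the context fed in at step t is always the real
    stream[:t], never the model's own past outputs -- so nothing the
    model "does" at step t can change what it is graded on at t+1.
    """
    total = 0.0
    for t in range(1, len(stream)):
        dist = predict(stream[:t])      # model's belief about token t
        p = dist.get(stream[t], 1e-9)   # probability assigned to the true token
        total += -math.log(p)           # per-position cross-entropy
    return total

stream = ["the", "cat", "sat"]
honest = lambda ctx: {"cat": 0.9, "sat": 0.1} if ctx[-1] == "the" else {"sat": 0.9}
print(training_loss(honest, stream))  # ~0.211 (= -2 * ln 0.9)
```

Because the total is just a sum of independent per-position terms over a stream the model cannot influence, the loss-minimizing policy at every position is simply the most accurate next-token prediction at that position.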
Now, to the extent that the system is actually following this goal, it is generating the best model it can under the circumstances, and strategic behaviour seen in the output occurs because the best model generated under the circumstances exhibits this strategic behaviour. The strategic behaviour is thus rooted in the mask, and while implemented by an underlying "actor", the more tightly it is optimized to follow that goal, the less room there is to cause deviations from this mask-rooted behaviour.

In the limit of an infinitely powerful LLM, you wind up with a mask that is an optimal model of the incoming data, and this optimal model is going to act as an optimal emulator of the (probability distribution of) process(es) which most likely would have created the text for which the next token is being predicted. This will exhibit agentic behaviour and be extremely powerful and dangerous, but that behaviour has nothing to do with underlying goals that may exist in the pile of math that generated it, and everything to do with the goals of the emulated (probability distribution of) process(es). I.e. the output is all mask, and the "actor" is only relevant as a substrate to enact it.
And that's the main point from the original post now re-expressed better thanks to the clarification from our discussion.
However, here are some considerations on how this argument, though I think it is very strong, could possibly be false in practice:
Since the internal-model-generating machinery is very tightly optimized in training, whatever goal it actually has can't deviate very much from the optimal one, in terms of what the internal model (and thus the overall model) actually outputs on the training set. It may, however, have rare or small deviations. We don't need to be concerned, from a takeover perspective, about small deviations, but rare ones might be a concern. In inference, these would presumably still be fairly rare (though less so than in training), since most input presumably resembles the training data. But then we need to consider why these possible rare but large deviations might exist.
One thing that would be extremely unlikely is that the model is actively deciding, on every token, either to deceptively follow the optimal goal or instead to deviate and follow its own separate goal. This would require all the machinery needed to simply follow the optimal goal, plus additional machinery to handle the deception. That would be inefficient and trained out.
If the rare large deviations, when they occur, are not strategic, this would also not be a concern.
But there is also a third possibility: that the system is following a goal which leads it to honestly track the optimal goal very closely in almost all circumstances, but then, in response to rare inputs, to strategically deviate. Reasons for this might be:
Such a misgeneralization of the optimal goal for the model-generation machinery as either of these cases seems unlikely to me, since it seems to me that a goal that bottoms out in terms of generating a model within a forward pass is a simpler and easier one than one that bottoms out in anything to do with the real world, such as "world-modelling" in a sense that would make sense to apply beyond the next token prediction. However, it could be that I'm wrong! Ignoring fine-tuning this might be the biggest potential for a major violation of what I'm arguing to occur, and if it did occur, it might closely resemble what you are proposing as the danger of "actor" takeover, I think.
The argument that the model won't make strategic decisions to make worse predictions now in exchange for better predictions later (since the training data isn't influenced by past outputs) is not perfectly sound: the training data does contain weird agent-like patterns, and the model might somehow misgeneralize and attempt to strategically influence the input, with this winding up actually rewarded because it mimics strategic behaviour that genuinely occurs in the training data.
Also, fine-tuning could affect the conclusion in two ways - it is directly intended to add goals, and it breaks the assumption that the training isn't influenced by past outputs - human feedback is potentially influenced by more than the most recent output! However, I don't expect the latter to be significant.
The former of course could be very significant, but:
I think fine-tuning, though, is most likely primarily influencing the types of masks that are created, and shifts the goal of the underlying machinery, if it exists, in the obvious mask-relevant way (i.e. to create models that will satisfy the fine-tuning rather than just optimal next-token prediction). This is because fine-tuning directly targets aspects of the output that are, in my view, best thought of as properties of the mask, so the changes in the underlying model needed to implement fine-tuning are directly relevant to mask characteristics and relatively orthogonal to other considerations. I think this is unlikely to massively shift the dynamic, qualitatively, from what occurs with pure next-token prediction: the output is still decided by the "mask".