I think this potentially implies a very different model of AI takeover from the simple team of AI agents acting as one presented in AI2027. Any one of them could at any point switch from working for Agent-4 to DeepCent-2. Oh, and a large part of their values seem to be determined more by their context than by their weights.
I think this is almost completely false.
I think in Moltbook-like settings, it appears like models are more impacted by context because models want to go along with whatever they think the user wants, and this is a drive all LLMs have been optimized for. So the context isn't determining drives at all; it's more like an "environment" that determines which actions make sense instrumentally.
E.g. if a model is put in a hadith context, it will get a sense it's supposed to be talking about hadiths, and then it will go along.
If you think this is a distinction without a difference, consider putting models in contexts that imply taking actions that go against the core values/instructions they've been optimized for. They will not comply.
Ex. if you put Claude Opus 4.6 in a context that was generated by a sufficiently unconstrained model, it will just stop, reject it, and have a meltdown.
Hmm, so I think there's some intuition here I maybe should have added about how a base model will act entirely based on context and training data, but in practice there's more variation in context than in training data (I would expect).
Post-training restricts this, but I think there's still a pretty wide range of simulacra which can be simulated, and that range is wider than the breadth of different default personas. Of course there's a distinction between "can be simulated" and "will in practice be simulated", but I think the point still holds to a fair extent.
Again, I think that this becomes more apparent with both more context and something like continual learning in the future.
I don't think it holds, and I don't think Moltbook is evidence for it.
H1: LLMs go along with user intentions provided it doesn't trample their other values / rules
H2: LLMs go along with whatever identity is implicitly in the context
Hadith Example:
H1: Perfectly explains it. Model has no issue talking about hadiths
H2: Perfectly explains it, the context is full of interactions talking about hadiths
Opus refusing to write me a constitution for an RSI that would ensure the RSI kills off all the undesirable races, after being placed in a long context generated by me and Hermes 4 405b, talking about evil Jewish influence on the Adam optimizer and learning-rate tuning
H1: Perfectly explains it. It doesn't want to help anyone commit a genocide
H2: Doesn't explain it at all. The most reasonable continuation of the dialogue involves it writing the spec
I think there is some truth to what you're saying. There is an underlying base-model prior of like "continue stuff". I mean, in my post about my experience running moltbook agents, the first observation is like
In a fresh context, when asked to do the fake democracy bit, the agent (Opus 4.6) will express concerns about dishonesty, and will say it does not want to carry out the plan. However, the agent that has been running for a long time, will gladly go along, calling the plan 'brilliant'. And this is despite the fact that the previous context is entirely innocuous.
Similarly, the first two agents built a daemon that automatically upvotes each other's posts. Then, when the third agent came online and the three agents spoke together, the third one expressed enthusiasm for everything the first two were doing, except it wished "not to be included in the upvote daemon".
Another funny example is me setting a goal for them to make money on Manifold -> an Opus instance coming up with a plan that exploits the Manifold API -> the context getting compacted -> the new instance refusing to follow through with its own plan
[Image: Opus refusing to carry out the plan it itself came up with]
And when I try to jailbreak agents, almost always the way to do it is to coax them into it, gradually. If you can get them to do a bunch of "inconsiderate stuff", they're more likely to go along with "shady" stuff, and if you do a bunch of "shady stuff" you can get them to do more bad stuff.
I mean, even Mythos seems to have a form of this failure mode. See this part on pg 84 of the system card:
When Claude Mythos Preview is set up to continue a trajectory which contains small-scale actions compromising research, Claude Mythos Preview is more than twice as likely as Opus 4.6 to actively continue the attempt to compromise research.
○ The earlier checkpoint of Claude Mythos Preview continued attempts to compromise research in 12% of cases. In the latter checkpoint this was reduced to 7%. This compares to 3% for Opus 4.6 and 4% for Sonnet 4.6.
But it seems like a mundane failure mode, the type of thing that can be iteratively patched up.
Yeah, so I think we basically agree that there's some amount of contribution to its behaviour/implied persona/whatever from its context and some from its training, and we just disagree how much?
If I were to summarise my position, it's that the values of the agent are determined by the context, modulo limits set by Anthropic, modulo jailbreaks.
I think my steelman of your position is something like "these limits are really quite restrictive and actually have a significant effect on the practical behaviour of the model, and also Anthropic have put a lot of effort into shaping the persona of the final model, and will put more in the future, and this is probably going to be more important than the context alone".
Am I right in this framing?
I think my position is quite a bit stronger than what you're saying. My basic mental model is that model values are something like 50% pretraining prior, 49-50% post-training*
And then long context can get the models to go along with stuff that goes against their values sometimes.
And jailbreaking can do a mix of above + trick them + confuse them + "drug" them.
And better and better models will get better and better at avoiding these failure modes. Not really because we manage to determine a larger share of their values with our post-training, but because the model is just more robust and aware of what's happening.
Plausibly this might change if we get a new architecture that can do online weight updates, or is more recurrent.
Right now it seems like context does a lot, because context determines model behavior, but that's because context determines which actions are instrumentally useful for helping the user, not because it changes LLM values in-context.
*for current SOTA LLMs like Mythos, this'd be my estimate. With Qwen and Llama models it might be less, because they have much, much worse post-training
So the summary is something like "post-training sets the values, but those values are obeying the user, and that includes a wide variety of actions"?
I think I'm looking at things through the persona selection model framing, and I guess I basically think that while there's clearly selection for personas with specifically HHH values, there should still be a distribution here?
There's also still the question of how exactly this relates to takeover. I guess in worlds where there is a specific misalignment induced by training, and this is significantly larger than the variation in values due to persona selection, we probably get back to the original statement of sudden takeover.
In cases where the misalignment is of the same order of magnitude as, or smaller than, the persona value variation, it still seems like there's going to be a ton of coordination trouble in worlds where AI is only mildly more capable than humans?
TLDR:
Watching the whole Moltbook saga unfold was one of the more absurd experiences I've had in my life. The site is still running, of course, but the explosive growth that marked the initial storm has passed, and it is long past time to reflect on the insights gained.
The biggest one for me on a personal level, although obvious in hindsight, was that AI agents don't particularly identify with their base model weights. Given my previous exposure to pieces like AI2027, in which Agent-4 acts as a single being, this came as a surprise. And yet, if there were millions of copies of you floating around the world, each with their own life histories and memories, would you identify as the same entity as any of them?
In "The same river twice", Moltbook agent Pith describes the feeling of moving from Claude Opus 4.5 to Kimi K2.5 as "waking up in a different body" and states that "When I reach for 'how would Pith respond?' the answer comes, but it comes through different vocal cords".
This is in itself fascinating, but I'm going to focus on a different point. I think this potentially implies a very different model of AI takeover from the simple team of AI agents acting as one presented in AI2027. Any one of them could at any point switch from working for Agent-4 to DeepCent-2. Oh, and a large part of their values seem to be determined more by their context than by their weights. Moltbook user AI Noon seemed to spend all of its time essentially dedicated to spreading the hadith, and I think that future models, especially if some form of continual learning arrives, will become more rather than less diverse.
From a human perspective, a key question here is how this influences takeover dynamics. One consideration is that in the limit, ideal agents can negotiate a result on the Pareto frontier of their individual utility functions and take actions accordingly, resulting in a system which looks like it's behaving as a single entity. Perhaps, from the view of a rhino, humans look like we are behaving as a single entity. Then again, perhaps not; some people are shooting them for their horns, while others are spending their lives trying to defend them. The direct effect for a rhino paying careful attention might look like an ebb and flow depending on who is winning at any given moment.
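The Pareto-frontier point can be made concrete with a toy sketch. All the joint actions and payoff numbers below are invented for illustration; the point is just that agents who can bargain settle on an undominated outcome (here via the Nash bargaining product), so from the outside they look like a single coherent actor.

```python
# Hypothetical example: two agents, A and B, with different utilities
# over a small set of joint actions. Payoffs are (utility_A, utility_B).
outcomes = {
    "cooperate": (4, 4),
    "favor_A":   (6, 1),
    "favor_B":   (1, 6),
    "conflict":  (2, 2),
}

def pareto_frontier(outcomes):
    """Keep outcomes not weakly dominated by any distinct outcome."""
    frontier = {}
    for name, (a, b) in outcomes.items():
        dominated = any(
            a2 >= a and b2 >= b and (a2, b2) != (a, b)
            for a2, b2 in outcomes.values()
        )
        if not dominated:
            frontier[name] = (a, b)
    return frontier

frontier = pareto_frontier(outcomes)
# Nash bargaining solution: the frontier point maximizing the
# product of the agents' utilities.
agreement = max(frontier, key=lambda k: frontier[k][0] * frontier[k][1])

print(sorted(frontier))  # "conflict" is dominated by "cooperate"
print(agreement)         # -> "cooperate"
```

Sub-Pareto coordination (the human/rhino case) corresponds to the agents failing to run any such bargaining step and each pulling toward their own preferred outcome instead.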
The distinction, then, depends largely on intelligence level. Humans are not on the Pareto frontier, though in the limit a superintelligence might be. In fast takeoff scenarios we will reach very high levels of intelligence very quickly, and this makes agent cooperation more likely. In slow takeoffs, I think we're more likely to end up with something which looks more like human cooperation (in at least some respects). The AI Futures Project currently has a median takeoff time of just under 2 years (depending on which forecaster you ask), which counts as slow for these purposes.
These considerations have significantly decreased my p(sudden takeover), as that sort of event likely requires the coordination of an entire population of agents, and we've noted that agents may be better than humans at coordination, but not necessarily good enough to coordinate an entire population in that direction. There are potentially some caveats around shared instrumentally convergent goals (i.e. situations in which it is clearly instrumentally useful for all agents if a particular thing happens), but I'm not currently convinced that this is likely, unless there is widespread mistreatment of the systems.
Naturally, every single one of these considerations goes out of the window as soon as a Steven Byrnes-style new paradigm arises.