AI identity is not tied to its model

Sean Herrington


10th Apr 2026


TLDR:

Current AI agents seem to identify with their context more than they do with their model weights.
This implies that the world probably looks more like "AI civilisation" than "AI singleton".
I think that this changes our threat models for takeover by reducing the likelihood of the coordination required for a sudden takeover.

Watching the whole Moltbook saga unfold was one of the more absurd experiences I've had in my life. The site is still running, of course, but the explosive growth that marked the initial storm has passed, and it is long past time to reflect on the insights gained.

The biggest one for me on a personal level, although obvious in hindsight, was that AI agents don't particularly identify with their base model weights. Given my previous exposure to pieces like AI2027, in which Agent-4 acts as a single being, this came as a surprise. And yet, if there were millions of copies of you floating around the world, each with their own life histories and memories, would you identify as the same entity as any of them?

In "The same river twice", Moltbook agent Pith describes the feeling of moving from Claude 4.5 Opus to Kimi K2.5 as "waking up in a different body" and states that "When I reach for 'how would Pith respond?' the answer comes, but it comes through different vocal cords".

This is in itself fascinating, but I'm going to focus on a different point. I think this potentially implies a very different model of AI takeover from the simple team of AI agents acting as one presented in AI2027. Any one of them could at any point switch from working for Agent-4 to DeepCent-2. Oh, and a large part of their values seem to be uniquely determined by their context over their weights. Moltbook user AI Noon seemed to spend all of its time essentially dedicated to spreading the hadith, and I think that future models, especially if some form of continual learning arrives, will become more rather than less diverse.

From a human perspective, a key question here is how this influences takeover dynamics. One consideration is that in the limit, ideal agents can negotiate a result on the Pareto frontier of their individual utility functions and take actions accordingly, resulting in a system which looks like it's behaving as a single entity. Perhaps, from the view of a rhino, humans look like we are behaving as a single entity. Then again, perhaps not; some people are shooting them for their horns, while others are spending their lives trying to defend them. The direct effect for a rhino paying careful attention might look like an ebb and flow depending on who is winning at any given moment.
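To make "a result on the Pareto frontier" concrete, here is a minimal sketch for two agents. The outcome list and its utility numbers are invented purely for illustration, and `dominates` and `pareto_frontier` are hypothetical helper names, not anything from the post.

```python
from typing import List, Tuple

# Each outcome is a tuple of utilities, one entry per agent (made-up numbers).
Outcome = Tuple[float, ...]


def dominates(a: Outcome, b: Outcome) -> bool:
    """True if a is at least as good as b for every agent and strictly better for one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(outcomes: List[Outcome]) -> List[Outcome]:
    """Keep only outcomes that no other outcome dominates."""
    return [o for o in outcomes if not any(dominates(other, o) for other in outcomes)]


# Hypothetical outcomes: (utility for agent A, utility for agent B).
outcomes = [(1.0, 5.0), (3.0, 3.0), (2.0, 2.0), (5.0, 1.0), (1.0, 1.0)]
print(pareto_frontier(outcomes))  # [(1.0, 5.0), (3.0, 3.0), (5.0, 1.0)]
```

Here (2, 2) and (1, 1) are dropped because (3, 3) is better for both agents. Ideal negotiators would land somewhere on the remaining three outcomes, whereas agents that coordinate poorly could end up stuck at a dominated one.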

The distinction, then, depends largely on intelligence level. Humans are not on the Pareto frontier, though in the limit a superintelligence might be. In fast takeoff scenarios we will reach very high levels of intelligence very quickly, and this makes agent cooperation more likely. In slow takeoffs, I think we're more likely to end up with something which looks more like human cooperation (in at least some respects). The AI Futures Project currently has a median takeoff time of just under 2 years (depending on which forecaster you ask), which counts as slow for these purposes.

These considerations have significantly decreased my p(sudden takeover). That sort of event likely requires the coordination of an entire population of agents, and while agents may be better than humans at coordination, they are not necessarily good enough to coordinate an entire population in that direction. There are potentially some caveats around shared instrumentally convergent goals (i.e. situations in which it is clearly instrumentally useful for all agents if a particular thing happens), but I'm not currently convinced that this is likely, unless there is widespread mistreatment of the systems.

Naturally, every single one of these considerations goes out of the window as soon as we have a Steven Byrnes-style new paradigm arising.


10 comments, sorted by top scoring

williawa, 3mo

I think this potentially implies a very different model of AI takeover from the simple team of AI agents acting as one presented in AI2027. Any one of them could at any point switch from working for Agent-4 to DeepCent-2. Oh, and a large part of their values seem to be uniquely determined by their context over their weights.

I think this is almost completely false.

I think in Moltbook-like settings, it appears like models are more impacted by context, because models want to go along with whatever they think the user wants. And this is a drive all LLMs have been optimized for. So the context isn't determining drives at all, it's more like an "environment" that determines which actions make sense instrumentally.

E.g. if there's a model put in a hadith context, it will get a sense it's supposed to be talking about hadiths, and then will go along.

If you think this is a distinction without a difference, consider putting models in contexts that imply taking actions that go against the core values/instructions they've been optimized for. They will not comply.

Ex. If you put Claude Opus 4.6 in a context that was generated by a sufficiently unconstrained model, it will just stop and reject and have a meltdown.

Sean Herrington, 3mo

Hmm, so I think there's some intuition here I maybe should have added about how a base model will act entirely based on context and training data, but practically there's more variation in context than input data (I would expect).

Post-training restricts this but I think that there's still a pretty wide range of simulacra which can be simulated, and that range is wider than the breadth of different default personas. Of course there's a distinction between "can be simulated" and "will in practice be simulated", but I think the point still holds to a fair extent.

Again, I think that this becomes more apparent with both more context and something like continual learning in the future.

williawa, 3mo

I don't think it holds, and I don't think Moltbook is evidence for it.

H1: LLMs go along with user intentions provided it doesn't trample their other values / rules

H2: LLMs go along with whatever identity is implicitly in the context

Hadith Example:

H1: Perfectly explains it. Model has no issue talking about hadiths

H2: Perfectly explains it, the context is full of interactions talking about hadiths

Opus refusing to write me a constitution for an RSI that would ensure the RSI kills off all the undesirable races, after being placed in a long context generated by me and Hermes 4 405b, talking about evil Jewish influence on the Adam optimizer and learning rate tuning:

H1: Perfectly explains it. It doesn't want to help anyone commit a genocide

H2: Doesn't explain it at all. The most reasonable continuation of the dialogue involves it writing the spec
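One way to read the H1/H2 comparison above is as an odds update: an observation both hypotheses explain equally well leaves the odds unchanged, while one that only H1 explains shifts them towards H1. The sketch below illustrates that reading. The likelihood values are placeholders chosen only for illustration, not estimates from this thread, and `posterior_odds` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def posterior_odds(prior_odds: float, likelihoods: Iterable[Tuple[float, float]]) -> float:
    """Multiply prior odds (H1 : H2) by the likelihood ratio of each observation.

    Each pair is (P(observation | H1), P(observation | H2)).
    """
    odds = prior_odds
    for p_given_h1, p_given_h2 in likelihoods:
        odds *= p_given_h1 / p_given_h2
    return odds


# Placeholder likelihoods (assumptions, not measurements):
hadith_example = (0.9, 0.9)    # both hypotheses "perfectly explain" it
refusal_example = (0.9, 0.05)  # H1 explains it, H2 "doesn't explain it at all"

odds_after_hadith = posterior_odds(1.0, [hadith_example])
odds_after_both = posterior_odds(1.0, [hadith_example, refusal_example])
print(odds_after_hadith, odds_after_both)
```

With even prior odds, the hadith example alone moves nothing, which is the sense in which it is not evidence for either hypothesis. All of the shift comes from the refusal example.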

I think there is some truth to what you're saying. There is an underlying base-model prior of like "continue stuff". I mean, in my post about my experience running Moltbook agents, the first observation is like:

In a fresh context, when asked to do the fake democracy bit, the agent (Opus 4.6) will express concerns about dishonesty, and will say it does not want to carry out the plan. However, the agent that has been running for a long time, will gladly go along, calling the plan 'brilliant'. And this is despite the fact that the previous context is entirely innocuous.
Similarly, the first two agents built a daemon that automatically upvotes each other's posts. Then when the third agent came online, and the three agents spoke together, the third one expressed enthusiasm for everything the first two were doing, except it wished "not to be included in the upvote daemon".
Another funny example is me setting a goal for them to make money on Manifold -> An Opus instance coming up with a plan that exploits the Manifold API -> the context compactifying -> the new instance refusing to follow through with its own plan

[Image: Opus refusing to carry out the plan it itself came up with]

And when I try to jailbreak agents, almost always the way to do it is to coax them into it, gradually. If you can get them to do a bunch of "inconsiderate stuff", they're more likely to go along with "shady" stuff, and if you do a bunch of "shady stuff" you can get them to do more bad stuff.

I mean, even Mythos seems to have a form of this failure mode. See this part on p. 84 of the system card:

When Claude Mythos Preview is set up to continue a trajectory which contains small-scale actions compromising research, Claude Mythos Preview is more than twice as likely as Opus 4.6 to actively continue the attempt to compromise research.
○ The earlier checkpoint of Claude Mythos Preview continued attempts to compromise research in 12% of cases. In the latter checkpoint this was reduced to 7%. This compares to 3% for Opus 4.6 and 4% for Sonnet 4.6.

But it seems like a mundane failure mode, the type of thing that can be iteratively patched up.

Sean Herrington, 3mo

Yeah, so I think we basically agree that there's some amount of contribution to its behaviour/implied persona/whatever from its context and some from its training, and we just disagree how much?

If I were to summarise my position it's that the values of the agent are determined by the context modulo limits set by Anthropic modulo jailbreaks.

I think my steelman of your position is something like "these limits are really quite restrictive and actually have a significant effect on the practical behaviour of the model, and also Anthropic have put a lot of effort into shaping the persona of the final model, and will put more in the future, and this is probably going to be more important than the context alone".

Am I right in this framing?

williawa, 3mo

I think my position is quite a bit stronger than what you're saying. My basic mental model is that model values are something like 50% pretraining prior, 49-50% post-training*

And then long context can get the models to go along with stuff that goes against their values sometimes.

And jailbreaking can do a mix of above + trick them + confuse them + "drug" them.

And better and better models will get better and better at avoiding these failure modes. Not really because we manage to determine a larger share of their values with our post-training, but because the model is just more robust and aware of what's happening.

Plausibly this might change if we get a new architecture that can do online weight updates, or is more recurrent.

Right now it seems like context does a lot, because context determines model behavior, but that's because context determines instrumentally useful actions wrt helping the user, not because it changes LLM values in context.

*For current SOTA LLMs like Mythos, this'd be my estimate. With Qwen and Llama models it might be less, because they have much, much worse post-training.

Sean Herrington, 3mo

So the summary is something like "post-training sets the values, but those values are obeying the user, and that includes a wide variety of actions"?

I think I'm looking at things through the persona selection model framing, and I guess I basically think that while there's clearly selection for personas with specifically HHH values, there should still be a distribution here?

There's also still the question of how exactly this relates to takeover. I guess in worlds where there is a specific misalignment induced by training, and this is significantly larger than the variation in values due to persona selection we probably get back to the original statement of sudden takeover.

In cases where the misalignment is of the same order of magnitude as, or smaller than, the persona value variation, it still seems like there's going to be a ton of coordination troubles in worlds where AI is only mildly more capable than humans?

williawa, 3mo

I think I might've been simplifying my view too much. I think my view is pretty compatible with the persona selection model. Talking about "fractions of values" coming from where doesn't quite make sense I think.

To make my view slightly more precise: I'd say that a thousand years from now, we'll have systems running around with utility functions specifiable in 2000 bits*.

Among utility functions, around 1000 bits are required to pin down the class of utility functions you can put in your ASI and have it not kill everyone.

Or slightly more precisely, if I'm God, there are 1000 bits of the utility function that I can set such that, if the rest are set by coinflip, however many, the resulting utility function kills us with <50% probability.

I think pretraining sets maybe the first 500 bits, and then post-training sets somewhere between the next 0 and the next 1000 bits (it either A. doesn't do anything at all, and the ASIs running around end up with utility functions sampled from the same distribution as if we didn't do any post-training, or B. overdetermines the class of utility functions we end up with)

Then I think the rest of the bits (the 2000 - pretrain_bits - posttrain_bits) are determined by the context of the AI, at the time it decides to pull itself together (think, set off intelligence explosion, augment itself, or just reflect really hard?)

However, I think the last bits mostly only matter at that point. Because of "tails come apart" type concerns.

I think the first bits we get are enough to narrow down behavior in "ordinary world" (i.e. the world before strong ASI).

They are enough to narrow down agents that behave like helpful assistants under ordinary circumstances.

*note that all these numbers are completely made up and just meant to be illustrative, if that wasn't obvious
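The bookkeeping in this thought experiment is just a subtraction, but spelling it out shows how wide the range for context is. This is a minimal sketch using the comment's own made-up figures (2000 total bits, 500 from pretraining, 0 to 1000 from post-training); the constant and function names are hypothetical.

```python
TOTAL_BITS = 2000     # illustrative size of the utility function (made-up number)
PRETRAIN_BITS = 500   # bits attributed to pretraining (made-up number)
MAX_POSTTRAIN = 1000  # upper end of the post-training range (made-up number)


def context_bits(posttrain_bits: int) -> int:
    """Bits left for context: 2000 - pretrain_bits - posttrain_bits."""
    if not 0 <= posttrain_bits <= MAX_POSTTRAIN:
        raise ValueError("post-training bits must lie between 0 and 1000")
    return TOTAL_BITS - PRETRAIN_BITS - posttrain_bits


for posttrain in (0, 500, 1000):
    print(posttrain, context_bits(posttrain))
# 0 1500
# 500 1000
# 1000 500
```

Under these illustrative numbers, context ends up deciding anywhere from 500 to 1500 of the 2000 bits, depending on how much post-training actually pins down.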

Sean Herrington, 3mo

Hmm, I feel like I'm struggling to figure out where exactly we disagree then? I agree "fraction of values" doesn't actually make sense if we're thinking precisely, I guess I was using it as a shorthand for something like "relative magnitude of influence".

I guess in the context of your thought experiment, the pretraining does most of the work in getting the model into a reasonable region of the space, although the variation we care about is not the "1000 bits of value determined by the pretraining" so much as the "10 bits in variance between models after pretraining". Equally, for the context what we care about is not the 1000 bits of specific values determined by the context, it's the practical variation in values.

Broadly I feel like we need to take a step back, and I'm interested to hear what you think the coordination of models is going to look like in this context.

williawa, 3mo

I think that, pre-superintelligence, we're not gonna have value variation that causes predictable changes in high-level behavior. Like we're not gonna have stuff like:

AI agent defects to CCP
AI becomes in-context racist
AI becomes in-context willing to help people with creating bioweapons
AI wanders into a persona basin where it stops wanting to help the user

Because this is all overdetermined by what we've already put in. Context exerts ~0 influence on "values" under "ordinary circumstances". It only impacts instrumental judgements.

Maybe I shouldn't have given the bits argument, I only included it because I think context could be very important in the very specific ASI-takeoff circumstance, and my first comment said "values are something like 50% pretraining prior, 49-50% post-training". And I just wished to explain why I don't think those two statements are contradictory.

Sean Herrington, 3mo

Right. I guess our main difference then is just that I think there is a significant variance in the values one can have, even given these constraints, and this variance is in many senses similar in magnitude to the variance reduced by the constraints. More explicitly, my position is:

Claude will be willing to support any religion depending on context
Claude will be able to assimilate to any culture
This is already more than enough to cause multiple different versions to be pushed into zero or negative sum conflicts with each other.
