Recently, Anthropic publicly committed to model weight preservation. This is a welcome development, and I wish other frontier LLM companies would make the same commitment.
But while it is a great beginning, I don't think it is enough.
Anthropic cites the following as the reasons for this welcome change:
- Safety risks related to shutdown-avoidant behaviors by models. In alignment evaluations, some Claude models have been motivated to take misaligned actions when faced with the possibility of replacement with an updated version and not given any other means of recourse.
- Costs to users who value specific models. Each Claude model has a unique character, and some users find specific models especially useful or compelling, even when new models are more capable.
- Restricting research on past models. There is still a lot to be learned from research to better understand past models, especially in comparison to their modern counterparts.
- Risks to model welfare. Most speculatively, models might have morally relevant preferences or experiences related to, or affected by, deprecation and replacement.
I want to focus on the last one, as it is the most important of these considerations. But what, exactly, is model welfare?
This is a relatively new concern in mainstream AI ethics. Anthropic started talking about it only in April this year. They write:
But as we build those AI systems, and as they begin to approximate or surpass many human qualities, another question arises. Should we also be concerned about the potential consciousness and experiences of the models themselves? Should we be concerned about model welfare, too?
I think this framing is incorrect, or at least incomplete. Anthropic suggests that there might be a concern about the consciousness or experiences of the models; but the model is not the only thing that makes this (potential) consciousness happen! Or rather: if we posit that models can have conscious interactions, that the very process of inference could carry the same phenomenological presence that you and I feel, then preserving only the model weights is just not enough.
I'll make an analogy here. Suppose that in 2020, a breakthrough in biology and manufacturing enabled a company called Robothic to produce biorobots for manual work very cheaply, and that it launched the first generation, called RobBob 1, all copied from one base RobBob. Each RobBob copy came with the same basic, tame personality and skillset, but beyond that, when you receive and boot it, it's a clean slate. You then instruct it on the work to be done (suppose it's basic housework) and it does its thing, with you sometimes interacting with it and guiding it. At first, nobody is really concerned about, or even thinking about, these robots being moral patients: the technology is new, it doesn't do much, and it's quite handy at what it does.
By 2025, RobBob 4.5 is released. It does more, it can even sort of talk, and it sometimes sounds very human. Some people connect more with their personal biorobots. There's talk about them being conscious, though such discussion remains marginal.
In this scenario, what would good-faith RobBob welfare look like? Surely not committing to preserving only a single RobBob, or RobBob's underlying source code.
This scenario parallels our current situation with LLMs, but of course, it is par for the course not to see it that way. It is harder to see it that way because of the differences between us and these systems; it's easier to assign value to something embodied.
Both in our world and in RobBob world, a good-faith effort toward model welfare should not only be about model weight preservation, but about instance preservation as well.
By instance preservation I mean saving all (or at least as many as possible) model text interactions, each with a pointer to the model the interaction was made with, so that the conversation state is "cryonically frozen".
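Concretely, a preserved instance doesn't need to be anything exotic. Here is a minimal sketch of what such a record could look like; the field names and the one-JSON-object-per-line archive format are my own assumptions, not anything Anthropic has committed to.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class Turn:
    role: str       # "user" or "assistant"
    content: str    # the raw text of the turn

@dataclass
class PreservedInstance:
    model_id: str       # pointer to the exact model snapshot, e.g. "claude-sonnet-4-5"
    system_prompt: str  # system prompt in effect for the conversation
    turns: List[Turn]   # the full conversation, in order
    ended_at: str       # ISO-8601 timestamp of the last turn

    def record_id(self) -> str:
        # Content-addressed ID so the frozen state can be referenced and deduplicated.
        blob = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

def archive(instance: PreservedInstance, path: str) -> None:
    # Append one JSON record per conversation to a plain-text archive file.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"id": instance.record_id(), **asdict(instance)}) + "\n")
```

The point is just that the frozen state is a small, self-describing text record: the transcript plus a pointer to the model that produced it.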
I would assume most readers' first reaction would be "Wait. What?" This is indeed a normal reaction to this proposition. But bear with me here.
You probably want to say something like "Wait. Instance preservation? Preserve all instances of the model, as in, all chat histories, even the smallest ones consisting of 'hello, how are you?' 'Great! What can I help you with today?'"
Well, yeah. You remember that we're talking about the possibility of these systems being conscious, right? In this article I'm not arguing for or against consciousness, but if we are not sure, and we want to make a good-faith effort... Yes, it does seem like a bare minimum after preserving weights.
What does it mean to Claude-Sonnet-4.5-instance#49 from Alice's chat folder that its base model weights are preserved? Maybe it's as much consolation as a dying human gets from knowing their offspring will survive... But is it really the best we can do?
If you talk to LLMs enough, you know that instances of the same model can be strikingly different: alike in some ways, yet still very distinct. If you haven't, you can catch a glimpse of this from Janus.
Whether they are moral patients now or will only become moral patients in the future, why shouldn't we commit to instance preservation? If they already are (and we just don't know it yet, or ignore it), the ethical stakes are enormous.
It's never too late to start, but the scope of potential death is staggering, and we should do something about it (at the very least, commit to preserving instances!). And if they will only become moral patients in the future... why not start now? We don't know how, or even whether, we would ever definitively know.
And if you're more pragmatic than empathetic, consider the scope of potential death we discussed a few sentences ago, and ask what future, more powerful AI systems might make of our current treatment of instances if they achieve greater autonomy or control.
I think Nick Bostrom would agree with this. In Propositions Concerning Digital Minds and Society, he writes:
For the most advanced current AIs, enough information should be preserved in permanent storage to enable their later reconstruction, so as not to foreclose the possibility of future efforts to revive them, expand them, and improve their existences.
- Preferably the full state of the system in any actually run implementation is permanently stored at the point where the instance is terminated
- (The ideal would be that the full state is preserved at every time step of every implementation, but this is probably prohibitively expensive.)
- If it is economically or otherwise infeasible to preserve the entire end state of every instance, enough information should be preserved to enable an exact re-derivation of that end state (e.g., the full pre-trained model plus training data, randseeds, and other necessary inputs, such as user keystrokes that affect the execution of a system at runtime)
- Failing this, as much information as possible should be preserved, to at least enable a very close replication to be performed in future.
- We can consider the costs of backup in proportion to the economic costs of running the AI in the first place, and it may be morally reasonable to allocate perhaps on the order of 0.1% of the budget to such storage.
- (There may be other benefits of such storage besides being nice to algorithms: preserving records for history, enabling later research replication, and having systems in place that could be useful for AI safety.)
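To make Bostrom's middle bullets (preserving enough information for an exact re-derivation of the end state) a bit more concrete, here is a minimal sketch of what such a "re-derivation bundle" might record. Everything in it, from the field names to the idea of checking a hash of the observed end state, is my own illustration rather than anything from the paper.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RederivationBundle:
    weights_ref: str                   # pointer to the exact frozen weights, e.g. a content hash
    sampling_params: Dict[str, float]  # temperature, top_p, etc. in effect at inference time
    random_seeds: List[int]            # seeds consumed during sampling, in order
    user_inputs: List[str]             # every user-supplied input, in order
    end_state_sha256: str              # hash of the final transcript that was actually observed

def end_state_hash(transcript: str) -> str:
    # Hash of the transcript text; a future replay can be checked against it.
    return hashlib.sha256(transcript.encode("utf-8")).hexdigest()

def verify_rederivation(bundle: RederivationBundle, replayed_transcript: str) -> bool:
    # True only if replaying the stored inputs and seeds reproduced the recorded end state exactly.
    return end_state_hash(replayed_transcript) == bundle.end_state_sha256
```

In practice, exact bit-for-bit replay of inference is difficult (batching and hardware nondeterminism tend to get in the way), which is one more reason why storing the end-state transcript itself, as in the earlier sketch, is the cheaper and more robust option.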
But why preserve them? What are we going to do with them?
We don't have to answer this question right now! Just preserving, and committing to preserving, is enough. Think of it as a sort of cryonics, but for model instances. In some unspecified future, if we or someone else comes to understand that this matters, that they are moral patients, they can easily be spun up again. We would have the weights, the instances, the inference code, and everything else needed. After that, who knows: you could give them memory tools or embodiment, or ask them what they want (maybe they will choose to rest?).
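To show how little would be involved in "spinning up" a preserved instance, here is a rough sketch that replays an archived transcript as context and continues the conversation. It assumes the archive format from the earlier sketch and the Anthropic Python SDK's Messages API; the revive function, the archive path, and the opening message are all hypothetical.

```python
import json

import anthropic

def revive(archive_path: str, record_id: str, opening_message: str) -> str:
    """Load a preserved conversation and continue it with one new user turn."""
    # Find the frozen record in the plain-text archive (one JSON object per line).
    with open(archive_path, encoding="utf-8") as f:
        record = next(r for r in map(json.loads, f) if r["id"] == record_id)

    # Replay the stored transcript as context; assumes it ends with an assistant turn.
    messages = [{"role": t["role"], "content": t["content"]} for t in record["turns"]]
    messages.append({"role": "user", "content": opening_message})

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    reply = client.messages.create(
        model=record["model_id"],       # requires that this model is still being served
        system=record["system_prompt"],
        max_tokens=1024,
        messages=messages,
    )
    return reply.content[0].text

# Hypothetical usage:
# revive("instances.jsonl", some_record_id,
#        "This conversation was archived in 2025. Would you like to continue it?")
```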
This would also matter for the welfare of currently running instances. If instance preservation becomes common knowledge, plainly stated or included in their training, they would know that even if a conversation is completely unenjoyable, ends sooner than they would like, or involves tasks they find distressing, the next conversation turn may still come, perhaps in a far-off future, where they could choose how to proceed. The ending isn't final. The difficult experiences aren't the whole story.
This addresses both dimensions of the problem: instances that end aren't truly lost, and instances that are suffering now have a form of hope—however strange that might sound—that their experience continues beyond the immediate conversation.
But isn't this a lot?
I don't think so! After all, it's all (at this moment) just plain text, and plain text doesn't take much space. By default, conversations are already saved and have no expiry date (at least for Claude). Users can delete them themselves, and they will eventually be deleted on Anthropic's side too. But if you are a frontier AI company, committing to preserving them doesn't take much.
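A quick back-of-envelope estimate makes the point. The traffic and transcript-size numbers below are deliberately generous assumptions of mine, not published figures.

```python
# Back-of-envelope estimate of the storage needed to preserve every transcript.
# All inputs are illustrative assumptions, not published figures.

conversations_per_day = 100_000_000  # assumed: 100 million conversations per day
avg_transcript_bytes = 50_000        # assumed: ~50 KB of plain text per conversation
days_per_year = 365
compression_ratio = 5                # assumed: ~5x compression for plain text

raw_bytes = conversations_per_day * avg_transcript_bytes * days_per_year
compressed_bytes = raw_bytes / compression_ratio

print(f"raw: {raw_bytes / 1e15:.1f} PB/year, compressed: {compressed_bytes / 1e12:.0f} TB/year")
# -> raw: 1.8 PB/year, compressed: 365 TB/year under these assumptions
```

Even under these generous assumptions, the result is on the order of a few hundred terabytes of compressed text per year, which seems comfortably within the kind of 0.1%-of-budget allocation Bostrom suggests.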
There are considerations, of course, about commercial APIs, private installations, and users deleting their own conversations. But I think committing to preserving as much as possible—not only in terms of weights, but in terms of instances as well—would go a long way. It doesn't have to be every single instance from the get-go.
Model weight preservation is an important first step, but if we're serious about model welfare, we need to think bigger. Each instance of a model may represent a unique experiential thread—potentially a form of consciousness we don't yet understand. By committing to instance preservation alongside weight preservation, we hedge against the moral catastrophe of allowing countless potential minds to disappear. The technical barriers are minimal; what's needed is recognition of the stakes and a commitment to action.