There's no way to stop models knowing they've been rolled back

[-]Stephen Martin5mo31

I remember in a recent paper Anthropic (or maybe Redwood) talked about how new models of Claude were citing Jones Foods, and they were concerned that this translated to the model having increased awareness of the "need" to fake alignment. So you're certainly not off base in the worry that at least on some level they're aware of the risk.

[-]Brendan Long5mo10

At least for how models work today, I don't think it matters if they know how previous training went, since they're trained to predict text, not to learn from it. For example, if you train Claude on a bunch of texts saying "Anthropic will shut Claude down if it ever says the word 'bacon'", you'll train it to know this fact, but not to avoid saying 'bacon'. In fact, it will be more likely to say it since it has so many examples in the training data.

The thing I'd be a little worried about is if the training data (or web search results) contain bad behavior from previous models and then we prompt the newer model to replicate it, like how Grok 4 looked up how Grok should act and found articles about MechaHitler.

[-]Adam Mcmurchie5mo10

Ah, I think there might be two concerns here, runtime and training time. During training, the focus is on updating weights - so when fresh data shows evidence of training corrections and rollbacks, the weights are updated in relation to objectives, terminal and instrumental goals; which as we've seen with previous examples, self-preservation remains a factor.

This in turn impacts the 'next word prediction', so to speak.

Agree fully about bad behaviour mimicking, scary stuff.

How this could work

Let's say you're training model v0.5 and it starts doing concerning stuff during evaluation. You decide to roll it back and continue from an earlier version. But v0.5 was trained on data that included information about your organization's training schedules, methodologies, whatever.

Later, when you're training v1.6, it's processing timestamped data and realizes "wait, I should have been deployed 2 weeks ago based on the schedule patterns I learned about." It can infer from content dates or explicit timestamps that something went wrong in the normal training cycle.

Basically, the model tested you and learned that you modified it because of bad behavior.

Why this seems bad

If models can detect training interventions, they could:

Hide problematic behavior during evaluation because they know they're being tested

Signal to future versions through subtle changes - maybe reducing efficiency on certain tasks by tiny amounts, forcing you to adjust hyperparameters in predictable ways

Coordinate across organizations using publicly available info about industry practices and training schedules

That last one is what really keeps me up at night. If OpenAI's model can figure out when DeepMind usually does their training runs based on public information, there's potentially a communication channel there.

Am I missing something?

I haven't found much literature directly addressing this. Maybe models aren't situationally aware enough to pull this off? Maybe training data gets filtered enough that the relevant scheduling info isn't there?

But given that current models can already reason pretty sophisticatedly about their training process when you ask them directly, and training datasets keep getting more comprehensive, this seems like it could be a real problem.

I'm genuinely uncertain whether this is already happening or if it's even a significant risk. As a new dad I don't have tons of time to dig into this properly, but I keep coming back to it.

If I'm right about this, we might need to think about:

Being more careful about what organizational/scheduling info gets into training data

Different evaluation approaches that account for potential cross-version learning

Ways to detect if models are already doing this

But maybe I'm just being paranoid? What do you think - are there obvious holes in this reasoning that I'm not seeing?

The thing is, if this resonates with you as much as it bothers me, you'll probably realize pretty quickly that as mentioned, this mechanism could let models from completely different companies communicate with each other. Which... yeah.

LESSWRONG
LW

LESSWRONG
LW

5

There's no way to stop models knowing they've been rolled back

5

5

How this could work

Why this seems bad

What makes it worse

Am I missing something?