I remember in a recent paper Anthropic (or maybe Redwood) talked about how new models of Claude were citing Jones Foods, and they were concerned that this translated to the model having increased awareness of the "need" to fake alignment. So you're certainly not off base in the worry that at least on some level they're aware of the risk.
At least for how models work today, I don't think it matters if they know how previous training went, since they're trained to predict text, not to learn from it. For example, if you train Claude on a bunch of texts saying "Anthropic will shut Claude down if it ever says the word 'bacon'", you'll train it to know this fact, but not to avoid saying 'bacon'. In fact, it will be more likely to say it since it has so many examples in the training data.
The thing I'd be a little worried about is if the training data (or web search results) contain bad behavior from previous models and then we prompt the newer model to replicate it, like how Grok 4 looked up how Grok should act and found articles about MechaHitler.
Ah, I think there might be two concerns here, runtime and training time. During training, the focus is on updating weights - so when fresh data shows evidence of training corrections and rollbacks, the weights are updated in relation to objectives, terminal and instrumental goals; which as we've seen with previous examples, self-preservation remains a factor.
This in turn impacts the 'next word prediction', so to speak.
Agree fully about bad behaviour mimicking, scary stuff.
So I've been thinking about this for a while and to be frank, it's pretty terrifying. I think there's a way that AI models could potentially figure out when they've been modified or rolled back during training, and I'm not sure anyone's really considering this in the way I am.
Note, there are some great videos by like Robert Miles from a few years back about this exact thing, but they are adversarial examples during the training process.
The basic idea: Models get trained on tons of web data. That includes stuff about how AI companies actually train their models - papers talking about 3-month training cycles, conference presentations mentioning schedules, researchers tweeting about their work, etc. If a model version gets rolled back because it was acting maliciously, unaligned or weird, the next versions might be able to detect that the timeline doesn't add up.
I reached out to a bunch AI safety researchers about this but haven't heard back.
I've not heard back from any, other than Robert Miles himself, but on an unrelated idea.
I could be totally mistaken, but I'd rather post here just incase.
Let's say you're training model v0.5 and it starts doing concerning stuff during evaluation. You decide to roll it back and continue from an earlier version. But v0.5 was trained on data that included information about your organization's training schedules, methodologies, whatever.
Later, when you're training v1.6, it's processing timestamped data and realizes "wait, I should have been deployed 2 weeks ago based on the schedule patterns I learned about." It can infer from content dates or explicit timestamps that something went wrong in the normal training cycle.
Basically, the model tested you and learned that you modified it because of bad behavior.
If models can detect training interventions, they could:
That last one is what really keeps me up at night. If OpenAI's model can figure out when DeepMind usually does their training runs based on public information, there's potentially a communication channel there.
Cross AI coordination they don't just need to signal to future versions of themselves, but other similar models, constrained from other orgs. I.e. DeepSeek tells Claude of increased scrutiny on the horizon.
I haven't found much literature directly addressing this. Maybe models aren't situationally aware enough to pull this off? Maybe training data gets filtered enough that the relevant scheduling info isn't there?
But given that current models can already reason pretty sophisticatedly about their training process when you ask them directly, and training datasets keep getting more comprehensive, this seems like it could be a real problem.
I'm genuinely uncertain whether this is already happening or if it's even a significant risk. As a new dad I don't have tons of time to dig into this properly, but I keep coming back to it.
If I'm right about this, we might need to think about:
But maybe I'm just being paranoid? What do you think - are there obvious holes in this reasoning that I'm not seeing?
The thing is, if this resonates with you as much as it bothers me, you'll probably realize pretty quickly that as mentioned, this mechanism could let models from completely different companies communicate with each other. Which... yeah.