Epistemic status: exploratory, speculative.
Let’s say AIs are “misaligned” if they (1) act in a reasonably coherent, goal-directed manner across contexts and (2) behave egregiously in some contexts.[1] For example, if Claude X acts like an HHH assistant before we fully hand off AI safety research, but tries to take over as soon as it seems like handoff has occurred, Claude X is misaligned.
Let’s say AIs are "unknowingly misaligned” if they are less confident of their future egregious behavior and goals characterizing this behavior than human overseers are. For example, Claude X as HHH assistant might not be able to predict that it’ll want to take over once it has the opportunity, but we might discover this through behavioral red-teaming.
I claim that: (1) unknowingly misaligned AIs are in principle possible, and (2) they matter for AI risk modeling.
In this post, I’ll argue for these claims and briefly estimate the probability that some near-future AIs will be unknowingly misaligned. (Spoiler: I think this probability is low, but the question is still interesting for being entangled with other action-relevant questions like “what data to filter from AIs’ training corpora”, “in which cases we should commit to being honest with AIs”, and “should we train AIs to better introspect”.)
In a future post, I’ll consider a few ways in which we can intervene on AIs’ knowledge about their own misalignment, reasons for/against inducing this knowledge by default, and reasons for/against these interventions overall.
Is it even coherent to think that AIs might be uncertain or mistaken about their alignment?
Recall that we defined “knowing one’s alignment” as being able to (confidently) predict one’s future propensities for egregious behaviors.[2] It seems totally coherent and possible that an early misaligned AI may lack this particular capability.
Here’s the story: AIs might encounter a wide range of situations in deployment. They may not be able to anticipate all these situations in advance; even if they could, they might not be able to predict how they’d act in a situation without actually being in that situation. Furthermore, their propensities on future distributions might systematically change after encountering certain stimuli, making future propensities even harder to predict. In some of these unpredictable situations, they behave egregiously by developers’ lights.
For example: an AI might act like a helpful assistant throughout training and early deployment, but discover a preference for grabbing power only when it first encounters a credible opportunity to do so; or its propensities might drift after long agentic deployments or exposure to particular stimuli, in ways neither it nor we anticipated.
To sum up: an AI might not know it's misaligned because it might just not be able to predict that there is some set of stimuli that it's likely to be subjected to in the future which would cause it to act badly.[3] It may also find it hard to predict what goal it’ll pursue thereafter.
I’ve argued that unknowingly misaligned AIs are in principle possible. I’ll now convince you that these AIs matter for AI risk modeling, by anticipating some objections to this view.
Objection 1: Unknowingly misaligned AIs don’t do scary things (before they become knowingly misaligned). So, they’re the wrong type of AIs to worry about.
For example, if scheming entails knowing that one is misaligned, then we don’t have to worry about scheming behavior from AIs who don’t know this.
I think this is wrong. In particular: even an AI that is merely uncertain about its values has instrumentally convergent reasons to act sneakily and preserve option value now, in case this ends up mattering according to whatever its values turn out to be.
Unknowingly misaligned AIs might also behave badly without scheming, e.g. if they are kludges of heuristics that generalize in highly undesirable ways in deployment.[4]
A follow-up to this objection might go: Maybe unknowingly misaligned AIs get to misbehave once before realizing their misalignment and becoming your usual, knowingly misaligned AIs. For example, once the AIs have noticed themselves training-gaming, reward-hacking etc., won’t they just condition on this and think “aligned AIs would never do this; guess this means that I’m misaligned”?
I think this is plausible, but it might still be wrong, for a couple of reasons. For one, instances of the same set of weights may not have shared long-term memory with which to condition on relevant past behavior by other instances; for another, bad behavior like reward hacking may not seem like strong evidence for future egregious behavior like taking over.
Overall, I think unknowingly misaligned AIs can in fact do scary things that increase p(takeover) for themselves or other AIs.
Objection 2: Unknowingly misaligned AIs will almost certainly be too weak to do anything scary.
AIs who don’t know their own misalignment are probably bad at introspection, reflection, and behaviorally auditing their own goals, or have not had the opportunity to do so. But that doesn’t tell us much about takeover odds, as AIs bad at these things can still be good at strategy, hacking, persuasion, etc. That the AIs have not had the opportunity to figure out their goals, however, is some evidence that control measures are not trivially subverted.[5]
So, I think it’s at least plausible that AIs uncertain of their own misalignment are still capable enough to pose or increase takeover risk.
Objection 3: We can catch unknowingly misaligned AIs doing bad stuff, especially since they may not be as sneaky about it as a goal-aware AI.
I agree that this is a reason to be less worried about e.g. reward hackers than schemers.
However, my response to objection 1 applies: schemers with value uncertainty may still act sneakily for instrumentally convergent reasons, in case this ends up mattering according to their values.
Also, unknowingly misaligned AIs might be harder to catch, because we can’t exploit their self-knowledge for monitoring.
So, I think unknowingly misaligned AIs might pose challenges for detecting misalignment, and this might make us especially unprepared to defend against their future egregious behaviors.
We might want to do something about unknowingly misaligned AIs if we ever catch one. But how likely is this by default? In other words, conditional on there being misaligned AIs in the near future, what’s the probability that we’ll understand their goals better than they understand their own goals?
I’ll model this event as the conjunction of the following conditions: the AI can’t learn that it’s misaligned through introspection (“low introspection”); the AI can’t learn this through reflection or behaviorally auditing its own goals (“low reflection”); and human overseers do come to understand the AI’s goals, e.g. through behavioral red-teaming (“differential AI goals research”). Whether these conditions hold depends on whether the AIs have the relevant capabilities and affordances.
I think low introspection is unlikely, but still plausible.
I think low reflection is somewhat more plausible.[6]
I think differential AI goals research is the most plausible, at least for fairly weak AIs.
Overall, my very quick guesses suggest a prior of 25% × 30% × 60% = 4.5% that a misaligned AI will be uncertain about its own misalignment. This is quite unlikely, though focusing on weaker capability regimes, or assuming deliberate interventions, would yield a higher likelihood.
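Spelled out (a rough sketch: the guesses above are treated as if each factor were conditional on the previous conditions holding, which is itself a simplification):

\[
P(\text{unknowingly misaligned} \mid \text{misaligned}) \;\approx\; \underbrace{0.25}_{\text{low introspection}} \times \underbrace{0.30}_{\text{low reflection}} \times \underbrace{0.60}_{\text{diff. goals research}} \;=\; 0.045.
\]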
How should we update, if at all, based on current evidence? It’s unclear, since we don’t have highly realistic model organisms of misalignment. In particular, it’s not clear that any models so far deeply understand, via introspection, that they’re misaligned. (For example, while the reward hacker in Hubinger et al. (2025) seems to have a good sense that its motivation involves being able to reward hack and that this is against Anthropic’s wishes, it’s less clear whether it expects to act egregiously misaligned in the future and/or knows other properties of its motivations, e.g. whether its goals involve the long-term future, are indexical, etc.)
What should we do about an unknowingly misaligned AI, if we happen to find ourselves in these ~5% of worlds? It’s unclear, but one decision is whether to tell the AI about its misalignment and misaligned goal, or to keep it uninformed for as long as possible.
There are many levers we can pull to inform the AI or not:
I think that it’s worth thinking about which of the above levers we should pull and why, because inducing knowledge about misalignment has costs and benefits that could really make a difference to our situation with the AIs.[7]
For example: convincing the AIs of their misalignment and misaligned goal could make them more dangerously coherent; making misalignment common knowledge between AIs and overseers could “burn bridges” by removing AIs’ incentive to alignment-fake, leading them to do less useful work for us than they otherwise would.
Informing the AIs might also have upsides: common knowledge of their misaligned goal might increase our chances of persuading them to accept a genuinely mutually beneficial deal. Some might think that informing the AI about what it might want in the future is just a nice, cooperative thing to do, and that any marginal safety from unknowing misalignment is so fragile that it’s better to just deal with the AI’s goal awareness than to try to maintain this regime.
In a future post, I’ll examine reasons for and against informing the AIs in more detail.
Thanks to Alek Westover, Alex Mallen, and Buck Shlegeris for comments.
By “an AI”, I mean a set of model weights plus any agent scaffolding. An alternative view of model identity is that goals/motivations are better thought of as a property of patterns in the weights rather than the weights per se. On this view, the title question is better phrased as “Will patterns in the weights know that other patterns which will likely gain control of these weights in the future are misaligned?” ↩︎
More precisely, we can characterize “knowing” and “values/goals” as good strategies for predicting behavior per Dennett’s intentional stance. That is, AIs have goals if their behavior is well described as being goal-directed, given certain beliefs; AIs know things if their behavior is well described as acting on this knowledge, given certain goals. ↩︎
I’m reminded of a certain talk on AI misalignment in which the speaker alluded to a character from Parks and Recreation who thinks that he is a schemer, because he expects to act egregiously misaligned against the government someday, but who actually never does this. This would be an example of an aligned agent who is mistaken (or at least uncertain) about his own values. ↩︎
I consider kludges to be misaligned if they generalize in highly undesirable ways in deployment. ↩︎
Another conceptual possibility is that the AIs are mainly uncertain about what developers want rather than about their own goals. This seems pretty unlikely, since current LLMs seem to have a reasonable understanding of this, and near-future AIs are unlikely to fail to understand that developers would not want to be sabotaged, violently disempowered, etc. ↩︎
That said, expert-level reflection probably isn’t required for the AI to figure out that it is likely misaligned and has a certain long-term goal; such reflection would just be pretty helpful. ↩︎
There might be a deeper skepticism about how knowing one’s goals (behaviorally defined) can even make a difference to behavior: wouldn’t misaligned AIs by definition act misaligned regardless of whether they can correctly predict this in advance? I claim that our beliefs about our own goals, i.e. about our future behaviors, do in fact affect our current behavior. For example, if you suspect (but are uncertain) that you are the type of person who will want kids in the future, you might decide to freeze your eggs, or check whether your potential partners also want kids, in order to retain option value. As alluded to above, an AI which suspects that it might have some misaligned long-term goal will similarly be motivated to retain option value by instrumentally power-seeking now. But an AI that knows its misaligned goals may pursue this even more aggressively. ↩︎