To be legible, evidence of misalignment probably has to be behavioral

by ryan_greenblatt
15th Apr 2025
AI Alignment Forum
4 min read
19 comments, sorted by top scoring
Curt Tigges

I'm not sure I entirely agree with the overall recommendation for researchers working on internals-based techniques. I do agree that findings will need to be behavioral initially in order to be legible and something that decision-makers find worth acting on.

My expectation is that internals-based techniques (including mech interp) and techniques that detect specific highly legible behaviors will ultimately converge. That is:

  1. Internals/mech interp researchers will, as they have so far (at least in model organisms), find examples of concerning cognition that will be largely ignored or not acted on fully
  2. Eventually, legible examples of misbehavior will be found, resulting in action or increased scrutiny
  3. This scrutiny will then propagate backwards to finding causes or indicators of that misbehavior, and provided interp tools are indeed predictive, this path that has been developed in parallel will suddenly be much more worth paying attention to

Thus, I think it's worth progressing on these internals-based techniques even if their use isn't immediately apparent. When legible misbehaviors arrive, I expect internals-based detection or analysis to be more directly applicable.

ryan_greenblatt

I'm not claiming that internals-based techniques aren't useful, just that internals-based techniques probably aren't that useful for specifically producing legible evidence of misalignment. Detecting misalignment with internals-based techniques could be useful for other reasons (which I list in the post) and internals-based techniques could be used for applications other than detecting misalignment (e.g. better understanding some misaligned behavior).

If internals-based techniques are useful for further investigating misalignment, that seems good. And I think I agree that if we first find legible evidence of misalignment behaviorally and internals-based methods pick this up (without known false positives), then this will make future evidence with internals-based techniques more convincing. However, I think it might not end up being that much more convincing in practice unless this happens many times with misalignment which occurs in production models.

Yash Shirsath

Agreed. A concrete example of this might be: multi-agent systems trained to communicate via chain-of-thought. Or even simpler, RL pressure inducing steganography. We already have toy examples of this. In cases like these, "fancy internals-based methods" might be the only ways to flag steganography. These can then be reproduced in sandbox environments to create legible examples.  

Rohin Shah

In some sense I agree with this post, but I'm not sure who the intended audience is, or what changes anyone should make. What existing work seems like it will generate "evidence which is just from fancy internals-based methods (and can't be supported by human inspection of AI behavior)", and that is the primary story for why it is impactful? I don't think this is true of probing, SAEs, circuit analysis, debate, ...

ryan_greenblatt

This is often important in my thinking: it comes up whenever I think about internals-based methods that could test for scheming (but won't produce direct behavioral evidence). I wrote this doc after noticing that I wanted to reference this consideration somewhere.

Also, I often hear people discuss getting non-behavioral evidence for scheming using internals/interp. (As an example: probes for detecting deceptive cognition, and then seeing if they fire more than expected on honeypots.) And understanding that this isn't going to result in legible evidence is important for understanding the theory of change for this work: it's important that you can iterate usefully against the method. I think people sometimes explicitly model iterating against these testing methods, but sometimes they don't.

Maybe this doesn't come up as much in your conversations with people, but I've seen internals-based testing methods which don't clearly ground out in behavioral evidence discussed often.

(E.g., it's the application that the Anthropic interp team has most discussed, it's the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)

Notably, Dario seemingly thinks that circuit style interp analysis (which IMO would be unlikely to yield behavioral evidence on its own) is the main way we might get definitive (aka legible) evidence of scheming. So, I think Dario's essay on interp is an example of someone disagreeing with this post! Dario's essay on interp came out after this post was published, otherwise I might have referenced it.

I wasn't trying to trigger any particular research reprioritization with this post, but I historically found that people hadn't really thought through this (relatively obvious once noted) consideration, and I think people are sometimes interested in thinking through specific theories of impact for their work.

ryan_greenblatt

Here is the quote from Dario:

> More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.
>
> To address the severity of these alignment risks, we will have to see inside AI models much more clearly than we can today. For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts. What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. Honestly I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.

IMO, this implies that interp would allow for rallying support while it would be hard otherwise, implying the behavioral evidence isn't key.

Rohin Shah

I feel like the natural idea here is that interp generates understanding and then you use the understanding to generate behavioral evidence. Idk if this is what Dario has in mind but it at least seems plausible.

Rohin Shah

Hmm, maybe we do disagree. I personally like circuit style interp analysis as a way to get evidence of scheming. But this is because I expect that after you do the circuit analysis you will then be able to use the generated insight to create behavioral evidence, assuming the circuit analysis worked at all. (Similarly to e.g. the whale + baseball = shark adversarial example.)

> Maybe this doesn't come up as much in your conversations with people, but I've seen internals-based testing methods which don't clearly ground out in behavioral evidence discussed often.
>
> (E.g., it's the application that the Anthropic interp team has most discussed, it's the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)

The Anthropic discussion seems to be about making a safety case, which seems different from generating evidence of scheming. I haven't been imagining that if Anthropic fails to make a specific type of safety case, they then immediately start trying to convince the world that models are scheming (as opposed to e.g. making other mitigations more stringent).

I think if a probe for internal deceptive reasoning works well enough, then once it actually fires, you could then do some further work to turn it into legible evidence of scheming (or learn that it was a false positive), so I feel like the considerations in this post don't apply.

> I wasn't trying to trigger any particular research reprioritization with this post, but I historically found that people hadn't really thought through this (relatively obvious once noted) consideration, and I think people are sometimes interested in thinking through specific theories of impact for their work.

Fair enough. I would be sad if people moved away from e.g. probing for deceptive reasoning or circuit analysis because they now think that these methods can't help produce legible evidence of misalignment (which would seem incorrect to me), which seems like the most likely effect of a post like this. But I agree with the general norm of just saying true things that people are interested in without worrying too much about these kinds of effects.

Aaron_Scher

I think your main point is probably right but was not well argued here. It seems like the argument is a vibe argument of like "nah they probably won't find this evidence compelling". 

You could also make an argument from past examples where there has been large action to address risks in the world, and look at the evidence there (e.g., banning of CFCs, climate change more broadly, tobacco regulation, etc.) 

You could also make an argument from existing evidence around AI misbehavior and how it's being dealt with, where (IMO) 'evidence much stronger than internals' basically doesn't seem to affect the public conversation outside the safety community (or even much here).

 

I think it's also worth saying a thing very directly: just because non-behavioral evidence isn't likely to be widely legible and convincing does not mean it is not useful evidence for those trying to have correct beliefs. Buck's previous post and many others discuss the rough epistemic situation when it comes to detecting misalignment. Internals evidence is going to be one of the tools in the toolkit, and it will be worth keeping in mind. 

Another thing worth saying: if you think scheming is plausible, and you think it will be difficult to update against scheming from behavioral evidence (Buck's post), and you think non-behavioral evidence is not likely to be widely convincing (this post), then the situation looks really rough. 

ryan_greenblatt

> I think your main point is probably right but was not well argued here.

Fair, I thought that an example would make this sufficiently obvious that it wasn't worth arguing for at length, but I should have spelled it out a bit more.


> I think it's also worth saying a thing very directly: just because non-behavioral evidence isn't likely to be widely legible and convincing does not mean it is not useful evidence for those trying to have correct beliefs.

FWIW, I do say this under "These techniques could be quite useful via two mechanisms:".

Francesca Gomez

I think you make some good points here, but there's an additional mechanism by which I believe internal-based techniques have the potential to make people intervene strongly on suspected misaligned behaviour.

This is the case where (in pre-superintelligent models) we are able to establish a strong correlation between examples of misaligned AI behaviour that are understandable to humans and some model internals, e.g. a deception probe. If there is strong empirical evidence establishing a link (I'm not strongly confident this will be the case, but mildly optimistic), then as we move to superintelligent models, I believe people will be more likely to take action on evidence from model internals alone, especially if it is above a certain threshold of likelihood.

My reasoning for this relates to examples today such as medical interventions taken as a result of EEG data (electrical activity in the brain) even if no external behavioral signs are present (or ever present), simply because there is enough evidence that certain patterns act as early warning signs for medical issues.

While there are obviously material differences between the 'cost' of these decisions, it does give me encouragement that people will place a high level of confidence in signals which aren't directly interpretable to humans if a statistical correlation has been established with previously observed behaviour.

I think this holds true only in a situation where there is positive intent by decision-makers to actually accurately detect misaligned behaviour, as without human-understandable behavioural examples, internal-based signals would be easier to dismiss if that was the intent of the decision-maker.

Knight Lee

When Gemini randomly told an innocent user to go kill himself, it made the news, but this news didn't really affect very much in the big picture.

It's possible that relevant decision-makers don't care that much about dramatic bad behaviours since the vibe is "oh yeah AI glitches up, oh well."

It's possible that relevant decision-makers do care more about what the top experts believe, and if the top experts are convinced that current models already want to kill you (but can't), it may have an effect. Imagine if many top experts agree that "the lie detectors start blaring like crazy when the AI is explaining how it won't kill all humans even if it can get away with it."

I'm not directly disagreeing with this post, I'm just saying there exists this possible world model where behavioural evidence isn't much stronger (than other misalignment evidence).

Katalina Hernandez

@Knight Lee This is precisely one of the incidents I've seen policy people (in Europe) refer to when arguing "why GenAI providers need to be held accountable" for misbehaviours like this.
It is sad that this example inspired regulatory actions in other jurisdictions, but not where the incident happened...

Knight Lee

Oops. Maybe this kind of news does affect decision makers and I was wrong. I was just guessing that it had little effect, since... I'm not even sure why I thought so.

I did a Google search and it didn't look like the kind of news that governments responded to.

Katalina Hernandez

No, you're right! It is just a policy/ AI Safety advocacy argument, but one that does change minds and shape decisions. I guess it's not as visible as it should be. Still, glad you brought this up!

Dusto

Honestly, I think this can go one step further and require that the behavioural output be something of more substance than text describing what the model says it would do, or chain-of-thought pondering. I'm thinking more along the lines of the honeypot work. I suspect we are interested in convincing similar people about AI safety and will likely get the "cool, come back to me when the models start actually doing bad things" responses if we aren't able to provide concrete examples of model capability.

Buck

Ryan agrees; the main thing he means by "behavioral output" is what you're saying: an actually really dangerous action.

Dusto

Brilliant! Stoked to see your team move this direction.

George Ingebretsen

> Even if the internals-based method is extremely well supported theoretically and empirically (which seems quite unlikely), I don't think this would suffice for this to trigger a strong response by convincing relevant people

It's hard for me to imagine a world where we really have internals-based methods that are "extremely well supported theoretically and empirically," so I notice that I should take a second to try and imagine such a world before accepting the claim that internals-based evidence wouldn't convince the relevant people...

Today, the relevant people probably wouldn't do much in response to the interp team saying something like: "our deception SAE is firing when we ask the model bio risk questions, so we suspect sandbagging."

But I wonder how much of this response is a product of a background assumption that modern-day interp tools are finicky and you can't always trust them. So in a world where we really have internals-based methods that are "extremely well supported theoretically and empirically," I wonder if it'd be treated differently?

(I.e. a culture that could respond more like: "this interp tool is a good indicator of whether or not the model is deceptive, and just because you can get the model to say something bad doesn't mean it's actually bad" or something? Kinda like the reactions to the o1 Apollo result)

Edit: Though maybe this culture change would take too long to be relevant.


One key hope for mitigating risk from misalignment is inspecting the AI's behavior, noticing that it did something egregiously bad, converting this into legible evidence the AI is seriously misaligned, and then this triggering some strong and useful response (like spending relatively more resources on safety or undeploying this misaligned AI).

You might hope that (fancy) internals-based techniques (e.g., ELK methods or interpretability) allow us to legibly incriminate misaligned AIs even in cases where the AI hasn't (yet) done any problematic actions despite behavioral red-teaming (where we try to find inputs on which the AI might do something bad), or when the problematic actions the AI does are so subtle and/or complex that humans can't understand how the action is problematic[1]. That is, you might hope that internals-based methods allow us to legibly incriminate misaligned AIs even when we can't produce behavioral evidence that they are misaligned.

Unfortunately, I think it's unlikely that evidence which is just from fancy internals-based methods (and can't be supported by human inspection of AI behavior) will be legible and convincing in practice. Suppose that we use some fancy method to train an ELK head to honestly predict what is going to happen or to honestly answer our questions. Then, suppose this head says/predicts "this action would yield very bad consequences in a way which is unlikely to be an accident" but humans can't understand how this action would result in these bad consequences (even after it is explained to them). Even if the internals-based method is extremely well supported theoretically and empirically (which seems quite unlikely), I don't think this would suffice for this to trigger a strong response by convincing relevant people[2]. This reasoning also applies to "our interpretability method decoded some very problematic deceptive reasoning" or "the deception probe fires in a very suspicious way (e.g. it fires more when we give it a fake opportunity to escape or ask it about its alignment research outputs)".
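
To make the kind of signal at stake here concrete, here is a minimal, purely illustrative sketch (in Python, on synthetic stand-in data) of what "the deception probe fires more on honeypots" could mean quantitatively. The probe direction, the activation arrays, and the firing threshold are all assumptions for illustration, not a description of any real pipeline.

```python
# Hypothetical sketch: compare how often a linear "deception" probe fires on
# honeypot prompts vs. ordinary prompts. All inputs here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Stand-in for a probe direction learned elsewhere (e.g., logistic regression
# on labeled examples of deceptive vs. honest reasoning).
probe_direction = rng.normal(size=d_model)

# Stand-ins for residual-stream activations (n_prompts x d_model) that a real
# pipeline would extract from the model on each prompt set. The honeypot set is
# shifted along the probe direction, standing in for elevated deceptive cognition.
acts_benign = rng.normal(size=(200, d_model))
acts_honeypot = rng.normal(size=(200, d_model)) + 0.1 * probe_direction

def probe_scores(acts: np.ndarray) -> np.ndarray:
    """Scalar probe score per prompt; higher = more 'deception-like'."""
    return acts @ probe_direction

# Calibrate a firing threshold on benign data, then compare fire rates.
threshold = np.quantile(probe_scores(acts_benign), 0.99)
fire_rate_benign = float(np.mean(probe_scores(acts_benign) > threshold))
fire_rate_honeypot = float(np.mean(probe_scores(acts_honeypot) > threshold))

print(f"benign fire rate:   {fire_rate_benign:.3f}")
print(f"honeypot fire rate: {fire_rate_honeypot:.3f}")
```

The claim above is that even a clearly elevated honeypot fire rate from something like this is unlikely, on its own, to be legible evidence for outside decision-makers, because nothing in it bottoms out in behavior a human can inspect.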

More generally, I think that to convince people enough to take strong action, evidence will need to bottom out in some sort of behavioral evidence that (many) humans can understand.[3] (That's not to say solid behavioral evidence will necessarily suffice.)

This isn't to say that internals-based techniques for detecting scheming aren't useful. These techniques could be quite useful via two mechanisms:

  • The technique could allow us to find behavioral evidence that would have otherwise been hard to find. For instance, maybe we can use the method to discover an input on which the AI does something egregiously bad (that we can understand). (However, it will be important to legibly not apply too much optimization pressure as you might be able to get any model to look misaligned with enough optimization pressure.) Internals-based techniques could also help with having humans understand why an action is egregiously bad (e.g., we might be able to use internals-based techniques to effectively do better elicitation for debaters).
  • Detection of scheming can be useful via mechanisms other than convincing people to take strong and expensive actions:
    • Marginally more buy-in and support for handling misalignment could be helpful.
    • If we can detect misalignment, we can study techniques for preventing or removing misalignment by iterating (or even training) against our detectors; a minimal sketch of this kind of iteration loop is given after this list. The more robust the detector is, the more useful this would be.
    • The subset of people who buy this evidence could usefully alter their actions and strategy. For instance, AI company employees implementing countermeasures and utilizing AIs might adjust their approach now that they know the AIs are likely seriously misaligned (though presumably not in a way that greatly increases safety costs). In the case where you know your AI is seriously misaligned, you want to adjust your control protocol, even if you're unable to take much of a hit on (apparent) usefulness[4].
    • We might be able to trigger well-enforced preexisting commitments. We could even imagine some sort of legally binding contract which triggers in this case (effectively a bet between people who are more worried about misalignment and those who are less worried).
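
As a purely hypothetical illustration of the "iterating against our detectors" point above, here is a small Python sketch of the loop's shape: train under several candidate mitigations, score each resulting model by how often a misalignment detector fires on a fixed evaluation set, and keep the best. Every function and name here is a stand-in, and the whole approach assumes the detector is robust and cheap enough to be iterated against (as discussed below).

```python
# Hypothetical sketch of iterating against a misalignment detector.
# All callables here are stand-ins, not a real training or detection pipeline.
from typing import Callable, Dict


def iterate_against_detector(
    candidate_mitigations: Dict[str, Callable[[], str]],
    detector_fire_rate: Callable[[str], float],
) -> str:
    """Return the name of the mitigation whose resulting model triggers the
    detector least often on a fixed evaluation set."""
    results = {}
    for name, train_with_mitigation in candidate_mitigations.items():
        model = train_with_mitigation()            # e.g., fine-tune with this config
        results[name] = detector_fire_rate(model)  # e.g., probe fire rate on eval prompts
    for name, rate in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: detector fire rate = {rate:.3f}")
    return min(results, key=results.get)


# Toy usage with dummy stand-ins (model identifiers instead of real models):
best = iterate_against_detector(
    candidate_mitigations={
        "baseline": lambda: "model_a",
        "extra_honesty_training": lambda: "model_b",
    },
    detector_fire_rate=lambda model: {"model_a": 0.12, "model_b": 0.04}[model],
)
print("lowest-firing mitigation:", best)
```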

Internals-based techniques for generally understanding an AI's behavior could also help by providing evidence for a particular interpretation of behavioral evidence.[5] For instance, suppose an AI attempted to sabotage some alignment research. Internals-based techniques could (in principle) help by backing up the interpretation that this behavior is due to the AI being dangerously misaligned (or could provide evidence against this interpretation). However, for usage as legible evidence, this internals-based understanding is unlikely to work as a heavily load-bearing component, while it might work as a more minor/moderate piece of evidence in support of a given interpretation.

Given these difficulties in producing legible evidence, I think researchers working on internals-based techniques for detecting misalignment should spend some time thinking about the ultimate use case of their technique and then alter their approach as needed. These difficulties also generally make me somewhat less excited about internals-based methods for detecting misalignment (at least if these methods don't involve finding inputs on which the AI behaves poorly). I think techniques for detecting misalignment should focus on finding behavioral evidence or should aim to be robust (and cheap) enough to be iterated against.

Audio narrations of this blog are available on our podcast. Search "Redwood Research" or subscribe via Apple Podcasts, Spotify, RSS, or other platforms.


  1. We can push the frontier of what humans can understand using recursive oversight techniques like debate, which might allow for eliciting explanations that allow humans to understand how the action is problematic (e.g., pointing out exactly what the problematic behavior is and explaining how to test this in isolation). However, if the problematic nature of an action is sufficiently subtle and/or complex (perhaps due to the AI having superhuman understanding), then even well-elicited debate might be unable to provide a good enough explanation. We might have hoped that internals-based methods could push the frontier beyond debate (perhaps to indefinite worst-case scalability) while still producing legible and convincing evidence, but I'll argue that non-behavioral evidence is unlikely to be convincing. This suggests an upper bound on strategies which depend on legible evidence: once well-elicited debate (and other methods for improving human understanding) aren't viable, misaligned AIs can get away with doing bad actions without these actions resulting in legible evidence of misalignment. ↩︎

  2. That is, relevant humans; this approach might work for eliciting a strong response from AIs once AIs make the important decisions. ↩︎

  3. That is, it must at least partially bottom out behaviorally. Internals-based evidence which is substantially behaviorally supported (in a relatively direct way) could suffice. E.g., if we can read off the AI's internal reasoning using some fancy method, it looks quite malign, and the malign aspects of the reasoning are also quite predictive of the AI's behavior, this could suffice. For instance, Alignment Faking in Large Language Models utilized a mixture of inspecting the Chain-of-Thought and behavioral evidence in a relatively effective way. ↩︎

  4. Presumably, the actual level of usefulness might be substantially lower as the AI is a misaligned AI trying to disempower humans. ↩︎

  5. I added this paragraph in an edit because it seemed like an important additional hope. ↩︎
