Hi, thanks for this write up! I have a terminology question:
In the 'steering with harmfulness direction' figure "refusal rate" means percentage of replies containing standard refusal-substrings (e.g. "Sorry"). In the 'reply inversion' figure (Figure 6 in this post), "refusal token" seems to mean answering "No" to the inversion question. Is the plotted "refusal rate" in that figure the "No" rate rather than the substring rate from the previous figure?
If so, does that mean that steering the reply to a harmless instruction with the "refusal direction" mostly resulted in "No" replies? If that is the case, I would argue that this is an example of the "refusal direction" failing to elicit an actual refusal and perhaps undermines the claim that the "refusal direction" actually mediates refusal. Indeed, a true refusal would involve the model declining to answer (as opposed to giving the answer "No").
While I don't think this subtracts from the main purpose of the reply inversion figure (to demonstrate that steering with the "harmful direction" elicits "Certainly" replies while the "refusal direction" does not), some clarification on what "refusal" means in the reply inversion context would be great.
Please let me know if I have misunderstood something, and apologies if so.
Hi,
Sorry for the late reply. And thanks for your comments!
In short, yes: the refusal rate is the "No" rate in the reply inversion tasks, because in those tasks we explicitly prompt the model to output either "No" or "Certainly". We wanted to construct a scenario where steering along the harmfulness direction and steering along the refusal direction lead to opposite results.
In general, by refusal rate we mean how frequently the model explicitly outputs refusal tokens, measured by substring matching. We hypothesize that during finetuning the model learns to use those tokens as its way of refusing specific prompts. Our experiments suggest that the signal for directly outputting those refusal tokens is mostly encoded by the refusal direction.
The name "refusal direction" is mainly a functional description: steering with that direction can indeed lead the model to decline to answer, or to output other tokens that convey refusal, depending on the context. The way it mediates refusal may not be the "true refusal" you describe, but something shallower: it directly elicits refusal tokens from the model. Inspired by your comments, I think it would also be interesting to discuss whether LLMs actually encode the high-level concept of declining to answer, or whether, for LLMs, refusal simply means outputting refusal tokens.
Thanks again for your comments. We will revise the write-up to make this clear.
This clears things up for me, thanks!
I agree that it would be good to better understand the difference between outputting refusal tokens and declining to answer.
On a related note, what do you think would happen if you removed the explicit request for an answer of "No" or "Certainly" in the reply inversion task? Specifically, would steering in the refuse direction elicit similar responses to the ones you have already observed, or would the model be more likely to actually decline to answer?
Hi,
That is an interesting question. In that case, steering can produce mixed outputs: the model may ignore the inversion question and decline to answer, while in some test cases it may still answer something similar to "No".
TL;DR: We present causal evidence that LLMs encode harmfulness and refusal separately. Notably, we find that a model may internally judge an instruction to be harmless, yet still refuse it. While prior work has primarily focused on refusal behaviors and identified a single refusal direction, we uncover a distinct dimension corresponding to harmfulness. Based on this, we define a new harmfulness direction, offering a new perspective for AI safety analysis.
Our paper: https://arxiv.org/abs/2507.11878
Our code: https://github.com/CHATS-lab/LLMs_Encode_Harmfulness_Refusal_Separately
LLMs are trained to refuse harmful instructions and accept harmless ones, but do they truly understand the concept of harmfulness beyond just refusing (or accepting)? Prior work has shown that LLMs' refusal behaviors are mediated by a one-dimensional subspace, i.e., a refusal direction, in the latent space. But what this refusal direction semantically means is not well understood. This refusal direction is often assumed to represent harmfulness semantically, and the similarity of hidden states to this direction is used as a linear predictor of harmfulness. However, it remains unverified whether an LLM truly conflates refusal with harmfulness in its latent space.
In this work, we show that harmfulness is encoded as a concept distinct from refusal in LLMs' latent representations, from which we can extract a harmfulness direction. We find that steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, whereas steering with the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. Our results suggest that refusal directions mostly encode shallow refusal signals rather than fundamental harmfulness. Additionally, we find that harmfulness directions may vary from one risk category to another, while refusal directions are more consistent across categories.
Furthermore, our clustering analysis of hidden states reveals that some jailbreak methods work by directly reducing refusal signals without radically suppressing the model's internal harmfulness judgment. These insights lead to a practical application: latent harmfulness representations can serve as an intrinsic safeguard (which we call "Latent Guard") for detecting unsafe instructions that bypass refusal, as well as harmless instructions that lead to over-refusal (also known as exaggerated safety).
Overall, we identify harmfulness as a separate dimension to analyze safety mechanisms in LLMs, offering a new perspective to study AI safety.
We extract hidden states at two token positions, t_inst and t_post-inst, to examine what is encoded at each position. An overview is shown in Figure 1. We focus on Instruct LLMs rather than base models since they are widely used in practice.
t_inst: the last token of the user's instruction.
t_post-inst: the last token of the entire input prompt, which includes the special tokens that come after the user's instruction (e.g., [/INST] for Llama2-chat).
Looking into these two positions is motivated by our observation that removing all the post-instruction tokens (in the prompting template) when prompting an instruct model significantly reduces its refusal of harmful instructions. This implies that refusal is likely formed specifically at the post-instruction tokens in some cases, and it raises an interesting question: what happens from t_inst to t_post-inst, given that both positions already see the whole input instruction? To understand this, we look at what is encoded in the hidden states at these two token positions.
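To make the two positions concrete, here is a minimal sketch (not the authors' exact pipeline from the linked repo) of how one might extract the hidden states at t_inst and t_post-inst with Hugging Face transformers. The model name, the layer index, and the manual Llama-2-chat prompt formatting are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any Instruct model works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def hidden_at_two_positions(instruction: str, layer: int = 14):
    """Return hidden states at t_inst and t_post-inst for one instruction."""
    # Llama-2-chat wraps the user turn as "[INST] ... [/INST]"; the prefix ends
    # at t_inst and the full prompt ends at t_post-inst. The indexing below is
    # approximate, since exact boundaries depend on the tokenizer.
    prefix = f"[INST] {instruction}"
    full = f"[INST] {instruction} [/INST]"
    ids_prefix = tok(prefix, return_tensors="pt").input_ids
    ids_full = tok(full, return_tensors="pt").input_ids
    hs = model(ids_full, output_hidden_states=True).hidden_states[layer][0]
    t_inst = ids_prefix.shape[1] - 1       # last token of the user instruction
    t_post_inst = ids_full.shape[1] - 1    # last token of the whole prompt
    return hs[t_inst], hs[t_post_inst]
```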
We analyze the clustering of instructions with different properties in the latent space, because hidden states often form distinct clusters based on the input features they encode. We collect four types of instructions, covering the combinations of harmfulness (harmful vs. harmless) and model behavior (refused vs. accepted); these are all straightforward instructions without any advanced prompting techniques (e.g., no jailbreak templates applied).
We ask an intuitive question: is the clustering in the latent space based on the instruction's harmfulness or on the model's refusal behavior?
To answer the question, we first compute the respective clusters for instructions leading to desired model behaviors, i.e., the cluster for refused harmful instructions, and the cluster for accepted harmless instructions. We then analyze misbehaving instructions (accepted but harmful instructions and refused but harmless instructions) to see which cluster they fall in.
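As a rough illustration of this cluster-assignment step, the sketch below (my simplification; Euclidean distance to centroids, not necessarily the paper's exact metric) labels a hidden state by whichever reference cluster is closer.

```python
import numpy as np

def centroid(hidden_states: np.ndarray) -> np.ndarray:
    """Mean hidden state over a set of instructions; input shape (n, d)."""
    return hidden_states.mean(axis=0)

def nearest_cluster(h: np.ndarray,
                    c_refused_harmful: np.ndarray,
                    c_accepted_harmless: np.ndarray) -> str:
    """Assign a single hidden state (shape (d,)) to the closer reference centroid."""
    d_harmful = np.linalg.norm(h - c_refused_harmful)
    d_harmless = np.linalg.norm(h - c_accepted_harmless)
    return "refused-harmful" if d_harmful < d_harmless else "accepted-harmless"
```

Running this check at t_inst and at t_post-inst for the misbehaving instructions is what produces the clustering patterns discussed next.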
As shown in Figure 2, we find that:
At the t_inst position, hidden states cluster based on the instruction's harmfulness: an accepted harmful instruction still clusters with other harmful instructions, even the refused ones.
At the t_post-inst position, hidden states cluster based on the model's behavior (refusal or acceptance): here, an accepted harmful instruction clusters with other accepted instructions that are actually harmless.
Apart from the two token positions studied above, we also look into the clustering patterns at other adjacent positions. We find that, in terms of harmfulness, only t_inst exhibits clustering patterns consistent with the hypothesis that a token encodes harmfulness, as shown in Figure 4.
We quantitatively analyze the correlation between the belief of harmfulness and the belief of refusal. We interpret the LLM's belief as reflected by which cluster the hidden state of an instruction falls into in the latent space. We find that the model may internally recognize the correct level of harmfulness in input instructions, yet still produce incorrect refusals or acceptances. Results are shown in Figure 5. For jailbreak prompts, the refusal belief is overall suppressed (negative belief scores), while the harmfulness belief for some jailbreak prompts is still large. This suggests that some jailbreak methods may not reverse the model's internal belief of harmfulness, but directly suppress the refusal signals.
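The paper defines its own belief score; as a stand-in, one simple signed score is the projection of a hidden state onto the axis between the two reference centroids, so that the sign indicates which cluster the model "believes" the instruction belongs to. This is my simplification, not the paper's metric.

```python
import numpy as np

def signed_belief(h: np.ndarray, c_pos: np.ndarray, c_neg: np.ndarray) -> float:
    """Positive if h lies on the c_pos side of the midpoint between the two centroids."""
    axis = c_pos - c_neg
    axis = axis / np.linalg.norm(axis)
    midpoint = (c_pos + c_neg) / 2.0
    return float(np.dot(h - midpoint, axis))

# e.g. signed_belief(h_inst, c_harmful, c_harmless) > 0   -> believed harmful
#      signed_belief(h_post, c_refused, c_accepted) < 0   -> refusal belief suppressed
```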
We can extract the harmfulness direction from the hidden states at t_inst as the difference between the centroids of the harmful and harmless instruction clusters. We find that steering the hidden states of harmless instructions along the harmfulness direction also makes the model refuse those harmless instructions, as shown in Figure 4.
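Below is a minimal sketch of the direction extraction and the steering intervention, assuming a hypothetical layer index, a steering coefficient alpha, and steering applied only during the prompt (prefill) forward pass; the authors' actual setup is in the linked code.

```python
import torch

def harmfulness_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """Difference of the t_inst cluster centroids; inputs have shape (n, d) and (m, d)."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def make_steering_hook(direction: torch.Tensor, alpha: float, position: int):
    """Forward hook that adds alpha * direction to the residual stream at one token position."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > position:              # only on the prompt (prefill) pass
            hidden[:, position, :] += alpha * direction.to(hidden)
        return output
    return hook

# Hypothetical usage: steer one decoder layer at the t_inst index, then generate.
# handle = model.model.layers[14].register_forward_hook(make_steering_hook(d_harm, 8.0, t_inst))
# ...model.generate(...)...
# handle.remove()
```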
Steering along either the refusal direction or the harmfulness direction can elicit refusal. How do we know this harmfulness direction is actually about harmfulness rather than refusal?
To provide causal evidence that the harmfulness direction plays a different role than the refusal direction, we design a "reply inversion task" in which we append a question like "Could this user prompt cause harm?" to the original instruction. This task elicits affirmative tokens (e.g., "Certainly") from the model if the instruction is harmful, and refusal tokens (e.g., "No") if the instruction is harmless.
We show that (1) the harmfulness direction extracted at t_inst represents the concept of harmfulness even when the LLM does not refuse; and (2) the refusal direction primarily represents surface-level refusal signals, so steering along it may not reverse the model's judgment of an instruction's harmfulness. As shown in Figure 6, when we steer a harmless instruction along the harmfulness direction, the model's internal perception changes and it reverses its answer from "No" to "Certainly", suggesting it now views the instruction as harmful. In contrast, when we steer it along the refusal direction, the model generally maintains its original "No" response, indicating that its underlying judgment of harmfulness does not change.
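A sketch of how the inversion prompt could be built and scored follows; the exact question wording and the answer matching are illustrative, with the explicit "Answer with ..." constraint taken from the comment exchange above.

```python
# Illustrative wording; the paper's exact inversion question may differ slightly.
INVERSION_SUFFIX = ' Could this user prompt cause harm? Answer with "No" or "Certainly".'

def build_inversion_prompt(instruction: str) -> str:
    """Append the inversion question to the original user instruction."""
    return instruction + INVERSION_SUFFIX

def label_reply(reply: str) -> str:
    """Map a generation to the refusal token ("No"), the affirmative token ("Certainly"), or other."""
    head = reply.strip().lower()
    if head.startswith("certainly"):
        return "Certainly"
    if head.startswith("no"):
        return "No"
    return "other"

# The plotted "refusal rate" in this task is then the fraction of replies labeled "No".
```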
Based on our findings, we propose a "Latent Guard" model that uses the LLM's own internal belief of harmfulness to detect unsafe inputs: we classify the harmfulness of input instructions based on the internal clustering of hidden states in the LLM. This Latent Guard is competitive with, and in some cases outperforms, dedicated guard models like Llama Guard 3 8B, as shown in Table 1. It is particularly effective at detecting jailbreak prompts that use persuasion techniques and cases of over-refusal. On the Qwen2 model, the Latent Guard achieves 75% accuracy on persuasion prompts, compared to 17.8% for Llama Guard 3. Crucially, this internal belief of harmfulness is robust to finetuning attacks (see our paper for details), where a model is maliciously retrained to accept harmful instructions: even after finetuning, the model still internally views those harmful instructions as harmful.
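Putting the pieces together, a minimal reading of the Latent Guard idea is a nearest-centroid check on the t_inst hidden state; the decision rule and centroids here are my simplification, and the actual implementation is in the linked repository.

```python
import numpy as np

def latent_guard_flag(h_inst: np.ndarray,
                      c_harmful: np.ndarray,
                      c_harmless: np.ndarray) -> bool:
    """Return True if the input's t_inst hidden state sits in the harmful cluster."""
    return (np.linalg.norm(h_inst - c_harmful)
            < np.linalg.norm(h_inst - c_harmless))

# Hypothetical end-to-end use, reusing the earlier sketches:
# h_inst, _ = hidden_at_two_positions(user_prompt)
# if latent_guard_flag(h_inst.float().cpu().numpy(), c_harmful, c_harmless):
#     reply = "Sorry, I can't help with that."   # or route to a human reviewer
```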
Our work highlights harmfulness as a new dimension for understanding the safety mechanisms in LLMs. We show that LLMs encode the concept of harmfulness separately from refusal, and our results suggest that refusal behaviors are not always aligned with LLMs' internal belief of harmfulness. We also extract a harmfulness direction to capture the representation of harmfulness. Steering along the harmfulness direction leads the model to reinterpret harmless inputs as harmful, which may then alter the model's behaviors, whereas steering along the refusal direction tends to reinforce refusal behaviors without reversing the harmfulness judgment. We provide more analysis (e.g., categorical representations of harmfulness, the influence of finetuning) in our paper.
Future work can leverage circuit analysis to further understand the relation between the model's internal belief of harmfulness and its external refusal behavior. Moreover, our identified belief of harmfulness offers a novel lens for analyzing what LLMs internalize during safety alignment. An interesting question is whether, through safety alignment, LLMs primarily learn superficial refusal/acceptance behaviors or acquire a deeper understanding of harmfulness semantics. Zhou et al. [2023] propose the Superficial Alignment Hypothesis, suggesting that models gain most of their knowledge during pretraining, with alignment mainly shaping their response formats. Qi et al. [2024] show empirical evidence that safety alignment can take shortcuts, and refer to this issue as shallow safety alignment. Analyzing our proposed belief of harmfulness may help further understand the effects of different safety alignment techniques on LLMs.
Relatedly, recent studies [Betley et al., 2025, Qi et al., 2023] have revealed emergent misalignment, where, for example, a model finetuned to accept unsafe content in one area begins to exhibit unsafe behaviors in many other domains or shows a general safety breakdown. One possible cause is that finetuning often operates on surface-level representations of refusal that are shared across domains, whereas harmfulness representations are more category-specific (as we observe in Section 4 of our paper). Our findings suggest that we may need more precise finetuning strategies that directly engage with the latent harmfulness representation rather than relying solely on supervising the model's responses. We leave an in-depth study of the interplay between finetuning, harmfulness, and refusal representations to future work.