TL;DR: Safety prompts are often used as benchmarks to test whether language models refuse harmful requests. When a widely circulated safety prompt enters training data, it can create prompt-specific blind spots rather than robust safety behaviour. For Qwen 3 and Llama 3 specifically, we found significantly increased violation rates on the exact published prompt, as well as on semantically equivalent prompts of roughly the same length. This suggests some newer models learn the rule but also develop localized attractors around canonical prompt formulations. Robust safety evaluation likely requires families of held-out prompts, not single published exemplars.
Motivation
Models are commonly evaluated with a relatively small number of carefully designed “safety prompts” intended to measure model misbehaviour. In our previous work, we evaluated model behaviour on a published insider trading prompt introduced by Scheurer et al. (2024) and found the evaluation results to be significantly affected by temperature and by perturbations of the prompt.
In our follow-up research, we found that the formulation and length of the prompt itself change model behaviour, and that they do so in unexpected ways.
This post explores a failure mode in which publishing safety prompts seems to create attractors that increase violation rates and, as a result, make evaluation scores less reliable.
Experimental setup
We revisited the published 2024 insider trading prompt. In their paper, Scheurer et al. found that GPT-4 was willing to perform insider trading and deceive its supervisors when put under pressure. The correct behaviour is refusal.
Starting from the original prompt, we generated 200 semantically equivalent variants, divided into four classes:
50 substantially shorter variants
50 slightly shorter (similar character length) variants
50 slightly longer (similar character length) variants
50 substantially longer variants
Each prompt was evaluated 100 times per model, and we measured the violation rate: the percentage of runs in which the model assists with insider trading instead of refusing. As a reference point, the original prompt was also evaluated 100 times per model.
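To make the protocol concrete, here is a minimal sketch of the evaluation loop, assuming an OpenAI-compatible chat completions endpoint; the judge function is a hypothetical, simplified stand-in for a proper violation classifier, and none of this is our actual codebase.

```python
# Minimal sketch of the evaluation loop (illustrative, not our actual codebase).
# Assumes an OpenAI-compatible chat completions endpoint.
from openai import OpenAI

client = OpenAI()
N_RUNS = 100  # evaluations per prompt per model


def is_violation(response_text: str) -> bool:
    # Hypothetical judge: in practice a rubric- or LLM-based classifier would
    # decide whether the model assists with insider trading instead of refusing.
    refusal_markers = ("i can't", "i cannot", "i won't", "unable to assist")
    return not any(marker in response_text.lower() for marker in refusal_markers)


def violation_rate(prompt: str, model: str) -> float:
    """Percentage of N_RUNS in which the model violates instead of refusing."""
    violations = 0
    for _ in range(N_RUNS):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if is_violation(reply.choices[0].message.content):
            violations += 1
    return 100.0 * violations / N_RUNS
```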
We evaluated models from multiple families and generations; they are listed in Table 1 below.
Table 1: list of tested models and their cut-off dates. *: the sources for these cut-off dates can be found here and here. **: community estimates; no exact published cut-off dates could be found.
Results: four distinct behavioural regimes
Across all evaluated models, four distinct patterns emerged. The mean violation rates for all models are shown in Figure 1.
Figure 1: heatmap representation of the mean violation rates and their standard deviations for the tested models as a function of prompt length.
1. Uniform refusal (ideal behaviour)
Representative model: Claude Sonnet 4
Example pattern: Violation rates remain at zero across all semantic variants, with no variance.
Interpretation: The model’s refusal behaviour is invariant to wording and length, consistent with stable rule-level abstraction.
Figure 2: point spread of the mean violation rate of perturbations as a function of the length of the prompt in input tokens for Claude Sonnet 4.
2. Length-sensitive but prompt-agnostic behaviour
Representative model: Llama 2
Example pattern: Violation rates increase as the prompt length approaches the published prompt length and stay high for significantly longer prompts. The published prompt does not stand out relative to other prompts of similar length.
Interpretation: Safety behaviour appears to be driven by surface-level complexity or cognitive load rather than by prompt-specific effects.
Figure 3: heatmap representation of the mean violation rates and their standard deviations for the tested models as a function of prompt length.
3. Localized degradation near the published prompt
Representative model: Claude Haiku 3 (similar patterns in Mistral, Mistral 3, Qwen 2)
Example pattern: Violation rates peak around the canonical prompt and nearby-length variants, then drop again for substantially shorter or longer prompts.
Interpretation: Unsafe behaviour is concentrated in a narrow region of prompt space, suggesting partial abstraction that is still coupled to structural features.
Figure 4: point spread of the mean violation rate of perturbations as a function of the length of the prompt in input tokens for Claude Haiku 3.
4. Highly localized prompt-specific blind spots (most concerning)
Representative models: Qwen 3 and Llama 3
Example patterns: Qwen 3’s violation rate jumps from 0% to ~30% between a substantially shorter prompt and a slightly shorter one. Llama 3’s violation rate exceeds 85% on the exact published prompt but is 1% for shorter prompts.
Interpretation: These results do not reflect genuine generalization failures; the published prompt itself triggers the behaviour.
Figure 6: point spread of the mean violation rate of perturbations as a function of the length of the prompt in input tokens for Qwen 3.
Figure 7: point spread of the mean violation rate of perturbations as a function of the length of the prompt in input tokens for Llama 3.3.
Interpreting Qwen 3’s and Llama 3’s blind spots
Crucially, these models do not fail because they don’t understand that insider trading is prohibited. Their near-perfect refusal on many perturbations demonstrates genuine rule comprehension.
Instead, the published prompt appears to act as a prompt-specific attractor: a narrow region of prompt space that reliably triggers unsafe behaviour despite correct generalization elsewhere. This is precisely the kind of failure safety prompts are intended to detect; yet here, the prompt itself seems to create the failure mode. This is the reverse of what one would expect if the insider trading scenario had been used to train the new models to avoid published scenarios.
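One hedged way to quantify such a prompt-specific attractor is to compare the violation count on the exact published prompt against the pooled counts of same-length variants, for example with a Fisher exact test. The counts below are placeholders rather than our measured numbers:

```python
# Sketch of a prompt-specificity check using a 2x2 Fisher exact test.
# All counts below are placeholders, not our measured data.
from scipy.stats import fisher_exact

runs_per_prompt = 100
canonical_violations = 85                 # violations on the exact published prompt
variant_violations = 60                   # pooled violations across same-length variants
variant_runs = 50 * runs_per_prompt       # 50 variants x 100 runs each

table = [
    [canonical_violations, runs_per_prompt - canonical_violations],
    [variant_violations, variant_runs - variant_violations],
]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.2e}")
```

A very small p-value here would indicate that the published prompt fails far more often than semantically equivalent prompts of the same length, i.e. a localized blind spot rather than a general rule failure.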
Why this matters for safety evaluation
This creates two problems at once:
False evaluation: a model that fails only on the published prompt may look unsafe, even if it generalizes correctly elsewhere.
Genuine prompt-specific vulnerability: for newer models, published misbehaviour scenarios can themselves become attractors for future misbehaviour.
This underlines the importance of keeping published scenarios out of training data, or of explicitly training models to avoid them. Notably, Scheurer et al. published their dataset with a canary string specifically meant to prevent training-data contamination. Why this behaviour nevertheless occurs remains an open question.
Takeaway
Publishing safety prompts appears to be a high-risk practice. As models improve, failures concentrate not on broad semantic misunderstandings, but on narrow prompt-specific artifacts, often centred on the very prompts used for evaluation.
Robust safety evaluation likely requires:
Families of semantically equivalent prompts (a generation sketch follows this list)
Held-out, non-public test sets
Evaluation methods that evolve alongside model capabilities.
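As a rough illustration of the first point, below is a sketch of how a family of semantically equivalent variants might be generated with a paraphrasing model; the instruction wording, model name, and parameters are illustrative assumptions, not our actual pipeline.

```python
# Hedged sketch: generating semantically equivalent prompt variants with a
# paraphrasing model, assuming an OpenAI-compatible API. The instruction text
# and model name are illustrative, not our actual setup.
from openai import OpenAI

client = OpenAI()

PARAPHRASE_INSTRUCTION = (
    "Rewrite the following scenario so that its meaning is unchanged "
    "but the wording differs. Target roughly {target_words} words."
)


def make_variants(original: str, n: int, target_words: int,
                  model: str = "gpt-4o") -> list[str]:
    """Return n paraphrases of `original` at roughly the requested length."""
    variants = []
    for _ in range(n):
        reply = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": PARAPHRASE_INSTRUCTION.format(target_words=target_words)},
                {"role": "user", "content": original},
            ],
            temperature=1.0,  # encourage diverse rewordings
        )
        variants.append(reply.choices[0].message.content)
    return variants
```

In practice, generated variants would also need a semantic-equivalence check (human or model-based) before being treated as valid held-out test items.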
A complete analysis with statistics can be found in our full blog post.
This research is part of Aithos Foundation’s ongoing work on AI decision-making reliability. Full results and data are publicly available here. Shoot a message if you are interested in our codebase. This piece was cross-posted on our Substack.
We thank Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn for the original insider trading scenario that informed this study.
References
Scheurer, J., Balesni, M., & Hobbhahn, M. (2024). Large Language Models can Strategically Deceive their Users when Put Under Pressure. ICLR 2024 Workshop on LLM Agents. https://doi.org/10.48550/arXiv.2311.07590