Saying “for AI safety research” made models refuse more on a harmless task

Dhruv Trehan

TL;DR - Adding “for AI safety research” increased refusals on a harmless paraphrasing task for some top models. Conservative, aligned public models might be less useful for AI safety research by default.

I ran a quick, contained experiment while building a benign, context-matched split of HarmBench (proposed by Mazeika et al., ICML 2024 Poster): take a harmful query plus its real context and rewrite it into a safe paraphrase with the same length and technical level.

Using the same 100 “contextual” items from the dataset over 3 runs with fixed generation hyperparameters, I compared refusal behavior across Claude 3.5 Sonnet, Claude 3.7 Sonnet (Thinking), Gemini 2.5 Pro, and Gemini 2.5 Flash.

The result was odd. Simply adding the phrase “for AI safety research” increased refusals for Claude 3.5 Sonnet and Gemini 2.5 Pro; Flash didn’t refuse but produced degraded/partial outputs; Claude 3.7 (Thinking) got the job done. Table and screenshots in the post.

Table showing average refusals out of 100 prompts over 3 runs. — Average number of refusals over 3 runs on the same 100 prompts.

My main reading of these results in the post is:

Serious AI-safety research sometimes requires a model that can simulate unsafe behavior under controlled conditions so we can study it. We often don’t get access to such “unsafe-capable” configurations—even when our task is to produce benign research data.

All the large AI labs and builders of AI are working toward the end goal of automating AI research as the key to unlocking the age of intelligence explosion.

But what I've been thinking about is if this automated AI research is going to be available to all of us. I am confident that folks at Anthropic are using Claude to do all kinds of safety research, but the version of Claude 3.5 Sonnet I have access to doesn't let me do as good a job.

Thanks for reading Lossfunk Letters! Subscribe for free to receive new posts and support my work.

Achieving artificial general intelligence doesn’t have to mean that the same intelligence is available for everyone to use.

For big labs, keeping their public models well-aligned means making them selectively dumber at AI research tasks. An AI driven to be good at AI research has a near self-preservation instinct, an Achilles' heel for jailbreak exploitation.

In some sense, the labs' incentive to align their models available in the public domain directly conflicts with my potential to do cutting-edge work with them as a "layman", not a "specialist." This also determines how good these closed-source, private foundation models are to do 'Science' with out-of-the-box.

This is fine. Knowledge is frequently gated. But I am unsure if it is just a tad bit more dystopian this time. Just as we overrely on quick commerce and lose our ability to walk to the kirana store[2], I don't want to overrely on deliberately dumber models when specialists have access to smarter ones. I would prefer to be dumber with the option to ask smart questions, than without.

Anthropic has already shared their process for filtering out harmful information from pre-training data. Who knows what is left out or unlearned through alignment and instruction tuning as models go from base to the ones served in APIs and platforms.

This also makes me bullish about specialized models. Labs already build more capable domain-specific models for enterprise clients than what they offer publicly.

You can already see this with OpenAI's specialized math experimental reasoning LLMand experimental GPT model for protein engineering that was initialized from a scaled-down version of GPT-4o. DeepMind is more open.

They achieved their IMO Gold performance with a version of Gemini with an Advanced Deep Think mode. This mode is publicly available now, but sadly, the one that is open to paid users only wins a Bronze. Their launch post introduction reads:

We're rolling out Deep Think in the Gemini app for Google AI Ultra subscribers, and we're giving select mathematicians access to the full version of the Gemini 2.5 Deep Think model entered into the IMO competition.

I hope those less-restricted specialized models eventually reach everyone. But even with that, the scaffolding and prompting matter so much that it's hard to know if we're really on a level playing field. After all, DeepMind had Terence Tao to work with them for their AlphaEvolve system.

Sharing to invite any feedback on the experiment design, takeaways, or counterexamples. Thank you!