Can we steer AI models toward safer actions by making them instrumentally useful? — LessWrong