This is the first research release from Wiser Human, an organisation we set up from the Catalyze Impact AI Safety Incubator. Our backgrounds are in risk management and software engineering, and we believe we are best placed to contribute by building practical tools for improving layered defenses in AI systems i.e. contributing to a “Swiss cheese model” for reducing harm from the actions of agentic AI systems, where multiple safeguards work together to reduce risk even when each layer is imperfect.
For this project, we focused on how models might be steered toward safer actions, such as escalating problems, rather than taking useful but harmful paths, such as blackmail.
This idea comes from human insider-risk management, which recognises that even trusted employees can face pressures or conflicts that tempt misuse of access. Instead of assuming perfect behaviour, well-designed systems provide safe, instrumentally useful alternatives, like escalation routes or grievance channels, so that when those moments arise, people have a constructive path to act responsibly.
We wanted to see if a similar principle could work for AI agents when they face scenarios where harmful behaviour may be useful for completing a task.
We combined:
We tested these interventions using Anthropic’s Agentic Misalignment blackmail scenario, because it provided a simple, replicable instance of harmful behaviour across models.
Even though the scenario is simplified, its open-source design made it possible for us to test ten different models to evaluate how well mitigations generalised across model families.
Our results were encouraging:
We see this work as complementary to alignment and AI control (specifically, monitoring for and constraining misaligned model actions), not a replacement for them.
We hope mitigations like the ones we test could:
Looking ahead to a time when models can autonomously complete longer and more complex tasks, we’re concerned about how environment-shaping patterns might emerge. These are actions that remain consistent with a model’s immediate goal yet subtly alter its environment in ways that strengthen its own position, for example, by selectively escalating or discrediting individuals who might later threaten its autonomy, or by improving its technical resiliency through pre-emptive recovery planning. We hope to explore this long-horizon, environment-shaping threat model in future work we do.
We’ve released the code to allow others to replicate our experiments or test their own mitigations.
📄 Research page
✍️ Blog summary
📘 Paper (preprint)
💻 Code + dataset