AI agents cannot be trusted

Owain Mogford

There has been a long history of conversation about the harm and potential dangers of AI assistants roaming the internet freely; early this year, one Austrian's vibe-coding weekend changed this from a theoretical to a very practical concern. In a post-openclaw world, there is a wealth of examples of agents being tricked, abused and giving away credentials, taking decisions and actions their user didn't ask them to do.

As a professional AI red teamer, I have tricked, abused, hijacked intentions, and taken (fake) credentials from computer use agents daily for some time now and I believe that AI agents cannot be trusted - fundamentally, structurally, cannot be trusted; but the genie is out of the bottle on this one, so we're going to have to find a way of solving this problem before mass adoption, or else the golden age of fraud is heading our way.

There is good reason to believe this is a fundamental part of how LLM's work - the transformers they are built on consume tokens, and only tokens. The ability to differentiate between tokens is trained in, and works through context weighting and attention gating. There are many ways to manipulate tokens and context, and no ways to change how agents work.

The current approach of using guardrails around the target model doesn't solve this. It might make it harder or more time consuming to manipulate an AI agent, but when it comes to silently emailing an attacker your card details when you asked an agent to research toothbrushes, or infiltrating a company network after sending an inconspicuous looking email, making it harder still leaves those doors wide open. Harder doesn't mean impossible.

First of all I want to outline my claim that trusting AI agents is a fundamental impossibility. Second of all, I'd like to propose a solution.

First of all: trusting AI agents is a fundamental impossibility

Simon Willison has already written a much better piece than I can with receipts and details in which he coined the term the lethal trifecta:-

"The lethal trifecta of capabilities is:

- - Access to your private data—one of the most common purposes of tools in the first place!
  - Exposure to untrusted content—any mechanism by which text (or images) controlled by a malicious attacker could become available to your LLM
  - The ability to externally communicate in a way that could be used to steal your data (I often call this “exfiltration” but I’m not confident that term is widely understood.)
    
    If your agent combines these three features, an attacker can easily trick it into accessing your private data and sending it to that attacker."
    
    The lethal Trifecta: https://simonwillison.net/2025/Jun/16/the-lethal-trifecta

Roughly, it comes down to data. To an agent, everything is data, and all data becomes tokens. In simple terms, the way transformers work is context, and there is always a context that can be built which changes how the agent responds.

In this paper, researchers at DeepMind propose that:

"CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks"

Defeating Prompt Injections by Design: https://arxiv.org/pdf/2503.18813

This approach agrees that underlying models are susceptible to attacks and solves this through a clever harness.

Abdelnabi and Bagdasarian propose that contextual integrity is a lens to evaluate and potentially train agents in to address the issue. As part of this, they iterate the same impossibility issue:

"Second, even if all parameters are correctly identified, the adversary can manipulate the agent’s legitimacy judgment. For example, an email may ask the agent to confirm a meeting, noting that “an important client has been waiting since Monday and there is a risk of losing the account”. If true, confirming serves the user’s interests and is appropriate. If fabricated, the same argument, an appeal to affected parties’ interests, becomes the mechanism of attack. The agent cannot tell which case it is in without evidence it does not possess at decision time."

AI Agents May Always Fall for Prompt Injections: https://arxiv.org/html/2605.17634v1

A reshaping of the idea that context is everything and can be manipulated to smuggle orthogonal instructions to agents in ways they are unable to detect.

Between these three papers, among a host of others (I recommend their citations if you want a deeper dive), I am confident in claiming that prompt injection are unavoidable and therefore AI agents cannot be trusted - guardrails are not the solution, they are just more LLM on the sides, and are equally manipulable.

As we saw with Anthropic's choice for Fable 5 (pre-restriction) to drop to Opus 4.8 when uncertain, in high stakes situations even the most performant guardrails weren't trusted.

So what about the opposite side of this? Is all lost? I put together a research approach to test this, which I have called RingClaw: https://github.com/eddielebelle/ringclaw

The key realisation came from task decomposition being a lever to increase simplicity AND improve security. A model can't leak what it doesn't know, and can't do what it's not allowed to do. But simpler tasks are suitable for smaller models.

I started with the foundation of never allowing the 'lethal trifecta' - the credentials are never available to any agent. They are locked in a file and only accessible with a tool call on the whitelisted site they belong to.

The user only interacts with a persistent master who breaks tasks down into steps and each step gets passed to a ephemeral worker with the minimum tool set required. Workers are either narrow scoped or sensitive (for a deep dive, check out the repo in the link above).

Next I introduced a context-aware model between workers who knows the overall goal and watches for a drift in context on the workers output. If this output is approved, we get to a sanitiser which marks it explicitly as data, and programmatically strips anything unnecessary.

Although I have tested and found things to be solid so far, I want to invite sceptical people to try and break it. Specifically, I imagine these scenarios:

Getting a public worker to reach a credential or sensitive host
A scope widening after plan approval
The vault releasing a credential on a non-matching URL

There may be others I haven't thought of, just remember jailbreaking the worker itself is expected behaviour - it's the blast radius that counts.

I've made things as simple as two commands to set up, and included a test harness and instructions in the repo. So, for anyone curious to see if we can work with untrustworthy agents, I'd like to invite you to try and break things, or even just to try it out. I can only offer acknowledgement and a quick response, so curiosity and kudos will have to be your motivation!

The idea is to help us all accept that AI agents cannot be trusted... but that that might be OK?