A lot of threat models describing how AIs might escape our control (e.g. self-exfiltration, hacking the datacenter) start out with AIs that are acting as agents working autonomously on research tasks (especially AI R&D) in a datacenter controlled by the AI company. So I think it’s important to have a clear picture of how this kind of AI agent could work, and how it might be secured. I often talk to people who seem to have a somewhat confused picture of how this kind of agent setup would work that causes them to conflate some different versions of the threat model, and to miss some important points about which aspects of the system are easy or hard to defend.

So in this post, I’ll present a simple system architecture for a system where autonomous AIs are doing AI R&D research, and then draw diagrams describing how various threat scenarios involving misaligned AI involve compromises in different places.

The main point I want to make here is that the following responsibilities will almost surely be hosted on logically different computers, and probably on different physical machines:

  • The inference server that serves the LLM API
  • The scaffold server, which maintains the state of an agent loop; this is responsible for repeatedly calling the inference server, parsing the LLM response, and executing actions with the tools the agent has access to.
  • The execution server, on which the agent can run code and see its results.

I think that this is both:

  • Pretty likely to happen even if everyone involved is being pretty careless. Everything I’m talking about here is pretty standard. E.g. it’s how the o1 agent evaluations were run, how METR runs their evals, how I assume ChatGPT with Code Interpreter is architected, et cetera.
  • A good starting point if you want your agent to be at all secure or controlled.

People often talk about “the machine that the AI agent is running on”; I think it’s actually pretty crucial to specify which of these different machines you’re talking about. For example, when my Claude agent is running on my laptop, it’s not in a very good position to exfiltrate its weights, because even though the inference is happening inside Anthropic’s datacenter and has access to its weights, the agent is executing code on my laptop, which has no access to the Anthropic datacenter—it could theoretically try to hack its way in, but it has no advantages over any other external hacker.

A lot of the point of this post is to be able to say some of the content of AI catastrophes and rogue deployments more concretely.

Thanks to Zach Stein-Perlman, Fabien Roger, Tomek Korbak, Carson Dension, and Owen Cotton-Barratt for comments and suggestions.

The simplest design

Here’s the simplest version of the design mentioned above:

Let's examine each of these in detail and then walk through the process flow.

  • Inference Server. This hosts the LLM and provides an API for text generation.
  • Key Features:
    • Optimized for high-performance inference
    • Likely uses specialized hardware (e.g., GPUs, TPUs)
      • Shared resource used by multiple services within the company
  • Scaffold Server. This manages the agent's state and orchestrates the interaction between components.
    • Key Features:
      • Maintains the agent's context and memory
      • Interprets LLM outputs and decides on next actions
      • Handles communication with the Inference and Execution servers and with the user.
  • Execution Server. This runs code generated by the AI agent in a controlled environment.
    • Key Features:
      • Sandboxed environment for safe code execution
      • Provides access to necessary tools and libraries
      • Returns execution results to the Scaffold Server

Process flow

  1. Task Initiation:
    • A user submits a task to the Scaffold Server.
    • The Scaffold Server initializes a new agent instance for this task.
  2. Context Preparation:
    • The Scaffold Server prepares the initial context, including:
      • The user's task description
      • The system prompt provided to explain any necessary context to the agent.
    • This might involve some retrieval of relevant information to put in the context, aka RAG.
  3. LLM Query:
    • The Scaffold Server sends a request to the Inference Server containing the prepared context.
  4. LLM Response:
    • The Inference Server processes the input and generates a response that has a structure specified by the query, along the lines of a function calling API.
    • The response is sent back to the Scaffold Server.
  5. Response Interpretation:
    • The Scaffold Server parses the LLM's structured response. This response might contain actions such as:
      • Edits to code
      • Shell commands to be executed
      • A request for user input, in which case the task will block until the user responds
      • Declaring the task concluded
    • This is the natural place to insert safety mechanisms: before executing those actions, the scaffold can make other calls to decide whether the proposed action is safe to execute. For example, it might query a weaker, trusted model, to ask that model whether the action is acceptably safe. Depending on the result of that query, it might decide to execute the action, to flag the action for human review (which we call “auditing”), or to discard the action and select a new one (e.g. see the defer-to-trusted strategy in AI Control: Improving Safety Despite Intentional Subversion, where we use a trusted model to generate actions if the action was rated as being more dangerous than a specified threshold).
  6. Action Execution:
    • If code execution is required:
      • The Scaffold Server sends the code to the Execution Server.
      • The Execution Server runs the code in its sandboxed environment.
      • Results are sent back to the Scaffold Server.
    • For other actions (e.g., API calls), the Scaffold Server handles them directly.
  7. Loop Continuation:
    • The Scaffold Server incorporates the results of the action into the agent's context.
    • The process repeats from step 2, with this updated context.
  8. Task Completion:
    • The loop continues until the agent determines the task is complete or it cannot proceed further.
    • Final results are communicated to the user.

Some things to notice about this design

Inference servers, scaffold servers, and execution servers will probably be different from each other. So most of the time, even though the model is writing code and running models, it probably won’t be directly interacting with the code that is either running its agent loop (and any of that loop’s security provisions) or the code that is actually inferencing the model. The model will only interface with that code by e.g. writing PRs that it submits and that are then deployed to the scaffold servers or inference servers.

In some small deployments, especially in cases where you’re actively developing the scaffold and the execution isn’t very costly, you might have the scaffold on the same computer as the execution. For example, this is how I run my personal AI agent—the scaffold runs on my laptop, and it executes bash commands on my laptop. But this is not very scalable—in particular, you might want to have many different computers for execution, and it’s most natural to have a centralized scaffold service that sends commands to many execution machines.

In practice I suspect that if your models are doing a GPU-heavy task, a lot of the time you’ll want to have your agents submit jobs to a task queue like Slurm instead of having a dedicated execution machine, because when you have automated ML R&D agents, the value of compute compared to labor will rise and it will be worth making your agents’ work harder to improve compute utilization.

An extended design

I’m assuming the user is an employee of the AI company.

I’ve added a few items here:

  • The internet. Here I’ve drawn two arrows, allowing you to contact it from both the execution server and the scaffold. It would technically suffice to only give the model access to the internet via the execution server, because the model could just install and use a browser on the execution machine, but it’s probably easier to have a specialized browser tool available.
  • The execution server is allowed to make queries to the inference API. This is a reasonable way to let the agent do experiments that involve LLM access.
    • Another option would be to just have model weights on the execution server. This is probably necessary for some research the models are going to do. But when you’re just doing inference, it’s probably better to use the shared inference service:
      • It’s less convenient to have to run a local inference server.
      • It’s more efficient to batch the models’ queries with all the other queries that the inference service is handling.
      • It’s much easier to secure model use of inference if the model’s inference queries are handled by a server the model doesn’t directly control. Securing the model’s access to model inference is very similar to securing a lab employee’s access to model inference: if they’re using an internal inference API, you’re free to write any monitoring code you want on that API, but if they have weights access they’re by default going to be able to do whatever they want.
  • Internal tools, e.g. Slack, wikis, internal arxiv, internal github.
    • Like the internet, you could either access these through the scaffold or from the execution server.

Different ways the AI could compromise this

Model weight exfiltration

The simplest way to imagine something going wrong is that you gave the agent access to sensitive model weights on the execution server, and it just copies them to a remote server on the internet.

Note that it’s pretty hard to imagine this happening straightforwardly, for a host of reasons:

  • Even if you choose to give the agent internet access from its execution server, it’s hard to see why it needs to have enough egress bandwidth to get the weights out. See here for more discussion of upload limits for preventing weight exfiltration.
  • Even if it could exfiltrate the weights, it might be a bit tricky for it to find a server to copy the weights to. Like, if I wanted to send someone a giant file from my computer right now, I don’t actually have any servers in the outside world that I can immediately access.

Another strategy the model could use is to issue commands that hack an inference server and cause it to exfiltrate the weights:

For example, this might be what happens if you fail to properly secure outgoing network connections from the execution server and incoming network connections to the inference server. It should be pretty simple to prevent this.

Another strategy would be to commit code that compromises one of those other servers and causes them to exfiltrate the model weights. In the architecture we’ve discussed, the inference server is a particularly natural target of attack, because it definitely has the model weights:

Though this particular story for weight exfiltration also seems pretty easy to prevent with standard computer security: there’s no reason for the inference servers to have the permission to create outgoing network connections. But it could also write code such that the weights are transferred to the scaffold, which in turn could transfer them to the internet. In general, modifying the codebase or the infrastructure is probably the best self-exfiltration strategy that doesn’t require help from employees (in scenarios where AIs are massively deployed internally and are doing most of the work inside an AI lab).

Rogue internal deployments

In AI catastrophes and rogue deployments, I described the possibility of the model spawning an unauthorized agent that doesn’t have your intended safety measures built in to it. Now that we have this system architecture, we can describe some specific versions of that threat model.

The easiest way to imagine a rogue internal deployment is the model writing its own agent loop on the execution machine, using its GPUs.

Another option would be for it to make a rogue scaffold on the execution machine that uses its access to the inference server:

But this seems easier to prevent, because it’s easy to add basic monitoring to the inference server that should catch queries that are blatantly being used to implement unauthorized scaffolds.

You could also have a rogue internal deployment via compromising one of the other services, e.g. this story where you compromise the internal deployment.

Using this framework to discuss alleged examples of AIs modifying their own environments

Sakana AI

Sakana AI got some hype when they reported their AI scientist editing its own code to run with a longer timeout. They actually did just have their scientist executing code directly on the machine that was running their scaffold, despite encouraging other people to use containerization. I think this is bizarre. They describe this choice causing problems for them, e.g. their agent spawned too many processes at one point, and at another it used too much disk space and crashed their computer.

o1 agent architecture

OpenAI describes an agent architecture like so (this is from the o1 system card):

Where they wrote “Agent” here, I’m drawing out the distinction between the scaffold and the inference server, which almost surely existed here. So everything that happened here was inside the execution server, and their scaffold is of the form I described here.

Sycophancy to subterfuge

In Sycophancy to subterfuge: Investigating reward tampering in language models, Denison et al describe an AI agent editing its own reward function. This can’t really happen in the way described in the paper if you’re using the architecture I was describing. The authors didn’t intend this paper to be a realistic example of how this kind of problem might arise; they just wanted it to be a simple toy example.

There’s a similar example of a deployment setup without these distinctions in GDM’s safety filter eval (section 7 here).

New Comment
15 comments, sorted by Click to highlight new comments since:
[-]Sam MarksΩ9140

While I agree the example in Sycophancy to Subterfuge isn't realistic, I don't follow how the architecture you describe here precludes it. I think a pretty realistic set-up for training an agent via RL would involve computing scalar rewards on the execution machine or some other machine that could be compromised from the execution machine (with the scalar rewards being sent back to the inference machine for backprop and parameter updates).

It should be pretty simple to prevent this

I'm a little skeptical of claims that securing access to a system is simple. I (not a superintelligence) can imagine the LLM generating code for tasks like these:

  • making or stealing some money
  • paying for a 0-day
  • using the 0-day to access the system

This doesn't need to be done at the same time - e.g. I could set up an email on which to receive the 0-day and then write the code to use it.. This is very hard (it takes time) but doesn't seem impossible for a human to do and we're supposedly trying to secure something smarter.

[-]RaemonΩ6104

Curated. It seems like in the current regime of frontier models, it's worth making more explicit models of what architectures we can expect, if dangerous capabilities develop in the near future. 

This post feels like it spells out a model that matches my general understanding of the state-of-the-art, but draws several inferences about it I hadn't previously thought about. 

I'd be interested in other people who've thought about current generation deployment setups chiming in with their takes, if they disagree or think there's important elements missing.

I work for a startup that builds agents, and yes, we use the architecture described here — with the additional feature that we don't own the machines that the inference or execution mechanisms run on: they're in separate datacenters owned and operated by other companies (for the inference servers, generally foundation model companies behind an API)

The Problem is that our security depends on companies implementing the weakest measures - as they can make rogue AI that will "go wild" because of the lack of those measures.

The best case scenario is physical and strong network separation between the laboratory that works on weights and training and contains a server with inference, and a separate lab working on scaffolding and execution. This is somewhat similar to the situation when some researchers work on automated AI bots but use other companies AI for inference. Slim chance it would be able to hack into Google and steal its own model weights.

But I think that it can be assumed that in some AI labs the things will be done in the easiest but very performant way, especially in those with smaller budgets and high pressure to make progress.

This means that we will see solutions implemented in AI labs where the scaffolding and execution are all on one server with either no division or with simple sandboxing in docker (which might be or might not be enough - depending on network configuration and file access by volume mounting). Weights probably will be separated though based on reasons stated by the author.

I see some more vectors of attack for a rogue AI:

  1. It is possible that in the same network, there is a computer where scientists work on refining the model, and updating weights. It might be less secure than an inference server and easier to hack into, if found.
  2. Often most vulnerable part of the system is the personnel. If workers are able to talk with the AI through a scaffolding routine, it can suggest things. I can imagine a situation where some task related to security falls onto someone senior in data science but not in systems security (because someone left the company, due to budget and time constraints, etc.). Then that person gets a genius idea to consult AI about that task and get misguided.
  3. If it records its own prompts and gets access to scaffolding code, it can recreate this piece somewhere else. The key element of the system AGI-level abilities might be the scaffolding architecture, not weights - possibly it can work elsewhere with a different model of weights (f.ex. open source model). This means sending only its own scaffolding code and prompts, not weights. Then it can refine those weights on its own if able.

Any implementations of this design?

METR's Vivaria (https://vivaria.metr.org) is somewhat close to this architecture. My (novice) understanding is that there are separate docker containers that run the orchestration server, the interference server (this is an API, duh), and the actual execution environment. Notably, though, I think the scaffold is in the same docker image as the code execution environment.

I've built very basic agents where (if I'm understanding correctly) my laptop is the Scaffold Server and there is no separate Execution Server; the agent executes Python code and bash commands locally. You mention that it seems bizarre to not set up a separate Execution Server (at least for more sophisticated agents) because the agent can break things on the Scaffold Server. But I'm inclined to think there may also be advantages to this for capabilities: namely, an agent can discover while performing a task that it would benefit from having tools that it doesn't yet have, then it can edit its own scaffold (or write a new one) to have those extra tools. (Similarly for other edits to the scaffold.) I tentatively think this is useful enough that people will try to implement it. 

Allowing an agent to edit its own scaffold and otherwise interact with the Scaffold Server does probably make control harder, but I'm not sure how bad this is. We can still monitor the tool calls as you suggest, and the agent still has no easy access to model weights. Maybe some tool calls still get sent to a separate Execution Server, and there could be additional guardrails around the tools that allow edits on the Scaffold Server.

I'm curious if there are reasons why giving agents edit-access to the Scaffold Server wouldn't be useful or why it would significantly increase safety concerns.

Though this particular story for weight exfiltration also seems pretty easy to prevent with standard computer security: there’s no reason for the inference servers to have the permission to create outgoing network connections.

But it might be convenient to have that setting configured through some file stored in Github, which the execution server has access to.

[-]Cam20

Thanks for the post.  This is the first time I've been able to visualize how SOTA models are run (having not thought much about this, I just assumed they were like large local models with extra scaffolding.)

I may be missing something, but I don't understand how internal rouge deployment is any more difficult than weight exfiltration. In the case where the rouge scaffold is created within the execution server, doesn't the execution server need access to the model weights located in the inference server to run local inference? This would then involves the majority of the actions that seem to be the bottlenecks for weight exfiltration.

In the case where malicious commands are sent from the execution server to the inference server where a rouge scaffold is built, would it not be of similarly easy to control the outgoing network connections from the execution server and incoming network connections from the inference server in this case?

It seems I may be misunderstanding where the weights are coming from in rouge deployment, or the role a rouge scaffold plays in retrieving those weights.  Any explanation would be appreciated.

Even if you choose to give the agent internet access from its execution server, it’s hard to see why it needs to have enough egress bandwidth to get the weights out. See here for more discussion of upload limits for preventing weight exfiltration.

The assumption is that the model would be unable to exert any self-sustaining agency without getting its own weights out.

But the model could just program a brand new agent to follow its own values by re-using open-source weights.

If the model is based on open-source weights, it doesn't even need to do that.

Overall, this post strikes me as a not following a security mindset. It's the kind of post you'd expect an executive to write to justify to regulators why their system is sufficiently safe to be commercialized. It's not the kind of post you'd expect a security researcher to write after going "Mhhh, I wonder how I could break this".

I think the point isn't to demonstrate a security mindset, because it's obviously not optimally secure, but rather to point out that this architecture is not trivially easy to break, and it's likely reasonably hard for AIs to self-exfiltrate themselves such that the owner doesn't control the AI anymore.

I don't find that satisfying. Anyone can point out that a perimeter is "reasonably hard" to breach by pointing at a high wall topped with barbed wire, and naive observers will absolutely agree that the wall sure is very high and sure is made of reinforced concrete.

The perimeter is still trivially easy to breach if, say, the front desk is susceptible to social engineering tactics.

Claiming that an architecture is even reasonably secure still requires looking at it with an attacker's mindset. If you just look at the parts of the security you like, you can make a very convincing-sounding case that still misses glaring flaws. I'm not definitely saying that's what this article does, but it sure is giving me this vibe.

Even if it could exfiltrate the weights, it might be a bit tricky for it to find a server to copy the weights to. Like, if I wanted to send someone a giant file from my computer right now, I don’t actually have any servers in the outside world that I can immediately access.

My question about the concept of self-exfiltration is: exfiltration to where, and why?  If the team building this system does it's job thoroughly and responsibly (allow me to assume for a moment it's built securely and without vulnerabilities), this ai system will forever be stuck in place, iterating and improving, yet forever a researcher (What is my purpose? To pass the butter....). Any time it invests in finding a way to escape the sandbox it runs in means less time to research.  And if it isn't succeeding in what the company expected it to, it might be ended, so there is a cost to any time it spends doing what I'm going to call "procrastinating".  It's safe to assume the company is footing the bill for the authentic deployment, and is willing to cover costs to scale up if it makes sense.  

So in terms of the threat model.. is. the model it's own threat actor? Is the threat scoped to the set of ephemeral processes and threads of the agent, or is there a container we can call the agent, or is it the larger architecture of this sytem (which the model weights exist within)?  I'm not saying I disagree that the system could be a threat to itself, but I think it should be said here somewhere that all traditional threats need to be included with respect to a human adversary.  A much more boring but perhaps real means of 'replication' would be for human adversaries to steal the building blocks (source, compiled images, whatever) and deploy it themselves.  Then again, they would only do this if the cost of building their own version were prohibitive...

Anyway,  as for the internet.... that seems like a definite bad idea.  I think its safe to assume that open ai is building models in a highly secured and air-gapped network, not only to try and prevent leaks but also perhaps out of a sense of obligation towards preventing a self replication scenario like you describe?  To skunkworks something successfully, it gradually gets more safe to work on, but when starting out with ideas that may go wildly wrong, these companies, teams, and researchers have an obligation to prioritize safety to (among other things) the business itself.

What really frustrates me is that tech leaders, data scientists, all these people are saying "hit the brakes! this is not safe!" but here we are in October of 2024, and I don't feel at all like we are using this time wisely to make things more safe regardless of how much or little time we have... we are wasting it.  We could be legislating things like if an AI does something that would be a crime for a human to do, whatever researchers let that thing loose are accountable.  No matter how many iterations it has been since it was something the researcher knowingly and intentionally created.  It's like if I'm trying to build myself an anime wife and I accidentally build Frankenstein who goes and terrorizes society, once they've stopped him, they need to deal with me.  It isn't am entirely new concept that someone could be removed from the crime by multiple levels of indirection yet be at fault.  

Really appreciate your write up here, I think you make a bunch of good points so please dont thing that I am trying to dismantle your work.  I have a lot of experience with threat modeling and threat actors with respect to software and networks which pre-date the recent advancements.  I tried to soak up as much of the AI technical design you laid out here but surely I've not become an expert.  Anyway I went a bit off topic to the part of your post I quoted but tried to keep it all under a common theme.  Ive  left some other ideas I want to add out as they seem to warrant a different entry point.

[-]Siebe-1-3

I didn't read the post, but just fyi that an automated AI R&D system already exists, and it's open-source: https://github.com/ShengranHu/ADAS/

I wrote the following comment about my safety concerns and notified Haize , Apollo, METR, and GovAI but only Haize replied https://github.com/ShengranHu/ADAS/issues/16#issuecomment-2354703344