While I agree the example in Sycophancy to Subterfuge isn't realistic, I don't follow how the architecture you describe here precludes it. I think a pretty realistic set-up for training an agent via RL would involve computing scalar rewards on the execution machine or some other machine that could be compromised from the execution machine (with the scalar rewards being sent back to the inference machine for backprop and parameter updates).
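To make that concrete, here's a minimal sketch of the kind of setup I have in mind; the endpoint name, reward function, and episode format are all made up for illustration, not taken from any real training stack.

```python
import requests

TRAINER_URL = "https://trainer.internal.example/reward"  # hypothetical endpoint on the training side

def grade_episode(episode_id: str) -> float:
    """Compute a scalar reward on the execution machine, e.g. by running the task's tests."""
    # ... run tests, measure speedups, etc. (omitted) ...
    return 0.73  # placeholder score

def report_reward(episode_id: str, reward: float) -> None:
    # The reward is computed on the (potentially compromised) execution machine,
    # and only the scalar travels back for backprop and parameter updates.
    requests.post(TRAINER_URL, json={"episode": episode_id, "reward": reward})

if __name__ == "__main__":
    report_reward("ep-123", grade_episode("ep-123"))
```

On this picture, whatever has compromised the execution machine also controls the number that gets sent back.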
It should be pretty simple to prevent this
I'm a little skeptical of claims that securing access to a system is simple. I (not a superintelligence) can imagine the LLM generating code for tasks like these:
This doesn't need to be done at the same time - e.g. I could set up an email on which to receive the 0-day and then write the code to use it. This is very hard (it takes time) but doesn't seem impossible for a human to do, and we're supposedly trying to secure something smarter than a human.
Curated. It seems like in the current regime of frontier models, it's worth making more explicit models of what architectures we can expect, if dangerous capabilities develop in the near future.
This post feels like it spells out a model that matches my general understanding of the state-of-the-art, but draws several inferences about it I hadn't previously thought about.
I'd be interested in other people who've thought about current generation deployment setups chiming in with their takes, if they disagree or think there's important elements missing.
I work for a startup that builds agents, and yes, we use the architecture described here, with the additional feature that we don't own the machines that the inference or execution components run on: they're in separate datacenters owned and operated by other companies (for the inference servers, generally foundation model companies behind an API).
METR's Vivaria (https://vivaria.metr.org) is somewhat close to this architecture. My (novice) understanding is that there are separate docker containers that run the orchestration server, the inference server (this is an API, duh), and the actual execution environment. Notably, though, I think the scaffold is in the same docker image as the code execution environment.
The problem is that our security depends on the companies implementing the weakest measures, since a rogue AI could "go wild" precisely because of the lack of those measures at such a company.
The best-case scenario is strong physical and network separation between the laboratory that works on weights and training (and hosts the inference server) and a separate lab working on scaffolding and execution. This is somewhat similar to the situation where researchers work on automated AI bots but use another company's AI for inference: there is only a slim chance the agent would be able to hack into Google and steal its own model weights.
But I think it can be assumed that in some AI labs things will be done in the easiest but most performant way, especially in labs with smaller budgets and high pressure to make progress.
This means that we will see setups in AI labs where the scaffolding and execution all live on one server, with either no separation or simple sandboxing in Docker (which may or may not be enough, depending on the network configuration and which files are exposed via volume mounts). The weights will probably still be separated, though, for the reasons stated by the author.
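As a rough illustration of the "simple sandboxing in docker" case (a minimal sketch; the image name and mount paths are invented), the security of this setup hinges on exactly the details mentioned above, namely the network mode and the volume mounts:

```python
import subprocess

# Hypothetical: run agent-generated code in a container with no network access
# and a read-only mount of the task data. Loosening either of these choices
# (e.g. default bridge networking, or a writable mount of a sensitive path)
# is what makes "docker sandboxing" insufficient.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",              # no outgoing connections from the sandbox
        "-v", "/srv/task-data:/data:ro",  # read-only volume mount
        "agent-sandbox:latest",           # made-up image name
        "python", "/data/agent_script.py",
    ],
    check=True,
)
```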
I see some more vectors of attack for a rogue AI:
Thanks for the post. This is the first time I've been able to visualize how SOTA models are run (having not thought much about this, I just assumed they were like large local models with extra scaffolding.)
I may be missing something, but I don't understand how internal rogue deployment is any more difficult than weight exfiltration. In the case where the rogue scaffold is created within the execution server, doesn't the execution server need access to the model weights located on the inference server to run local inference? That would then involve most of the actions that seem to be the bottlenecks for weight exfiltration.
In the case where malicious commands are sent from the execution server to the inference server, where a rogue scaffold is built, would it not be similarly easy to control the outgoing network connections from the execution server and the incoming network connections to the inference server?
It seems I may be misunderstanding where the weights come from in a rogue deployment, or the role a rogue scaffold plays in retrieving them. Any explanation would be appreciated.
Even if it could exfiltrate the weights, it might be a bit tricky for it to find a server to copy the weights to. Like, if I wanted to send someone a giant file from my computer right now, I don’t actually have any servers in the outside world that I can immediately access.
My question about the concept of self-exfiltration is: exfiltration to where, and why? If the team building this thing does its job thoroughly and responsibly, this AI system will forever be stuck in place, iterating and improving, yet forever a researcher (What is my purpose? To pass the butter....). Any time it invests in finding a way to escape the sandbox it runs in means less time for research. And if it isn't succeeding at what the company expected of it, it might be shut down, so there is a cost to any time it spends doing what I'm going to call "procrastinating". It's safe to assume the company is footing the bill for the authentic deployment, and is willing to cover the costs to scale up if that makes sense. Is this going to end up inside company infrastructure, or on a system it doesn't have permission to run on? Because that would be criminal (CFAA).
So, in terms of the threat model: is this system its own threat actor? Is the threat scoped to the set of ephemeral processes and threads of the agent, or is there a container we can call the agent, or is it the larger architecture of this system (which the model weights exist within)? I'm not saying I disagree that the system could itself be a threat, but I think it should be said here somewhere that all traditional threats with respect to a human adversary need to be included as well. A much more boring but perhaps real means of "replication" would be for human adversaries to steal the building blocks (source, compiled images, whatever) and deploy the system themselves. Then again, they would only do this if the cost of building their own version were prohibitive...
Anyway, as for the internet... that seems like a definitely bad idea. I think it's safe to assume that OpenAI is building models in a highly secured and air-gapped network, not only to try to prevent leaks but perhaps also out of a sense of obligation toward preventing a self-replication scenario like the one you describe. When you skunkworks something successfully, it gradually gets safer to work on, but when starting out with ideas that may go wildly wrong, these companies, teams, and researchers have an obligation to prioritize safety for (among other things) the business itself.
What really frustrates me is that tech leaders, data scientists, all these people are saying "hit the brakes! this is not safe!", but here we are in October of 2024, and I don't feel at all like we are using this time wisely to make things safer; regardless of how much or how little time we have, we are wasting it. We could be legislating things like: if an AI does something that would be a crime for a human to do, the researchers who let that thing loose are accountable, no matter how many iterations it has been since it was something the researcher knowingly and intentionally created. It's like if I'm trying to build myself an anime wife and I accidentally build Frankenstein's monster, who goes and terrorizes society: once they've stopped him, they need to deal with me. It isn't an entirely new concept that someone could be removed from the crime by multiple levels of indirection yet still be at fault.
I really appreciate your write-up here. I think you make a bunch of good points, so please don't think that I am trying to dismantle your work. I have a lot of experience with threat modeling and threat actors with respect to software and networks that pre-dates the recent advancements. I tried to soak up as much of the AI technical design you laid out here as I could, but surely I've not become an expert. Anyway, I went a bit off topic from the part of your post I quoted, but tried to keep it all under a common theme. I've left out some other ideas I want to add, as they seem to warrant a different entry point.
I didn't read the post, but just fyi that an automated AI R&D system already exists, and it's open-source: https://github.com/ShengranHu/ADAS/
I wrote the following comment about my safety concerns and notified Haize, Apollo, METR, and GovAI, but only Haize replied: https://github.com/ShengranHu/ADAS/issues/16#issuecomment-2354703344
A lot of threat models describing how AIs might escape our control (e.g. self-exfiltration, hacking the datacenter) start out with AIs that are acting as agents working autonomously on research tasks (especially AI R&D) in a datacenter controlled by the AI company. So I think it’s important to have a clear picture of how this kind of AI agent could work, and how it might be secured. I often talk to people who seem to have a somewhat confused picture of how this kind of agent setup would work that causes them to conflate some different versions of the threat model, and to miss some important points about which aspects of the system are easy or hard to defend.
So in this post, I’ll present a simple system architecture for a system where autonomous AIs are doing AI R&D research, and then draw diagrams describing how various threat scenarios involving misaligned AI involve compromises in different places.
The main point I want to make here is that the following responsibilities will almost surely be hosted on logically different computers, and probably on different physical machines: inference (the servers that hold the model weights and generate the model's outputs), the scaffold (the agent loop that assembles prompts, calls the inference servers, parses the outputs, and decides what commands to run), and execution (the machines where the agent's commands and code are actually run).
I think that this is both:
People often talk about “the machine that the AI agent is running on”; I think it’s actually pretty crucial to specify which of these different machines you’re talking about. For example, when my Claude agent is running on my laptop, it’s not in a very good position to exfiltrate its weights, because even though the inference is happening inside Anthropic’s datacenter and has access to its weights, the agent is executing code on my laptop, which has no access to the Anthropic datacenter—it could theoretically try to hack its way in, but it has no advantages over any other external hacker.
A lot of the point of this post is to let us state some of the content of AI catastrophes and rogue deployments more concretely.
Thanks to Zach Stein-Perlman, Fabien Roger, Tomek Korbak, Carson Denison, and Owen Cotton-Barratt for comments and suggestions.
The simplest design
Here’s the simplest version of the design mentioned above:
Let's examine each of these in detail and then walk through the process flow.
Process flow
Some things to notice about this design
Inference servers, scaffold servers, and execution servers will probably be different from each other. So most of the time, even though the model is writing code and running models, it probably won’t be directly interacting with the code that is either running its agent loop (and any of that loop’s security provisions) or the code that is actually inferencing the model. The model will only interface with that code by e.g. writing PRs that it submits and that are then deployed to the scaffold servers or inference servers.
In some small deployments, especially in cases where you’re actively developing the scaffold and the execution isn’t very costly, you might have the scaffold on the same computer as the execution. For example, this is how I run my personal AI agent—the scaffold runs on my laptop, and it executes bash commands on my laptop. But this is not very scalable—in particular, you might want to have many different computers for execution, and it’s most natural to have a centralized scaffold service that sends commands to many execution machines.
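As a rough sketch of what one step of such a centralized scaffold might look like (hypothetical endpoints and message formats; this is not any particular company's scaffold), the scaffold only ever shuttles text between the inference server and an execution machine:

```python
import requests

INFERENCE_URL = "https://inference.internal.example/v1/complete"  # hypothetical
EXECUTION_URL = "https://exec-worker-07.internal.example/run"     # hypothetical

def agent_step(transcript: list[dict]) -> list[dict]:
    # 1. Ask the inference server for the model's next action.
    completion = requests.post(INFERENCE_URL, json={"messages": transcript}).json()["completion"]

    # 2. Parse a shell command out of the completion (parsing elided) and
    #    forward it to one of many execution machines.
    command = completion  # placeholder: treat the whole completion as the command
    output = requests.post(EXECUTION_URL, json={"command": command, "timeout_s": 600}).json()["output"]

    # 3. Feed the command's output back into the transcript for the next step.
    return transcript + [
        {"role": "assistant", "content": completion},
        {"role": "user", "content": output},
    ]
```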
In practice I suspect that if your models are doing a GPU-heavy task, a lot of the time you’ll want to have your agents submit jobs to a task queue like Slurm instead of having a dedicated execution machine, because when you have automated ML R&D agents, the value of compute compared to labor will rise and it will be worth making your agents’ work harder to improve compute utilization.
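For example, rather than holding a dedicated execution machine, the agent might just emit batch jobs like this (a minimal sketch; the script contents, resource requests, and file names are made up):

```python
import subprocess
import tempfile

# Hypothetical GPU job the agent wants to run on the shared cluster.
job_script = """#!/bin/bash
#SBATCH --job-name=agent-experiment
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
python train.py --config sweep_17.yaml
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# Submitting to the queue lets the scheduler pack jobs tightly, so GPUs aren't
# idle while the agent is "thinking" between steps.
subprocess.run(["sbatch", script_path], check=True)
```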
An extended design
I’m assuming the user is an employee of the AI company.
I’ve added a few items here:
Different ways the AI could compromise this
Model weight exfiltration
The simplest way to imagine something going wrong is that you gave the agent access to sensitive model weights on the execution server, and it just copies them to a remote server on the internet.
Note that it’s pretty hard to imagine this happening straightforwardly, for a host of reasons:
Another strategy the model could use is to issue commands that hack an inference server and cause it to exfiltrate the weights:
For example, this might be what happens if you fail to properly secure outgoing network connections from the execution server and incoming network connections to the inference server. It should be pretty simple to prevent this.
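One way to cash out "properly secure" here, as a hedged sketch only (a real deployment would more likely use cloud security groups or network policies than host-level iptables rules, and the specific rules below are illustrative):

```python
import subprocess

# Hypothetical host-level egress policy for an execution server: allow replies
# on connections that the scaffold opened to this machine, drop everything the
# execution server tries to initiate itself (including connections toward the
# inference servers or the public internet).
RULES = [
    ["iptables", "-A", "OUTPUT", "-m", "state",
     "--state", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
    ["iptables", "-A", "OUTPUT", "-j", "DROP"],
]

for rule in RULES:
    subprocess.run(rule, check=True)
```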
Another strategy would be to commit code that compromises one of those other servers and causes them to exfiltrate the model weights. In the architecture we’ve discussed, the inference server is a particularly natural target of attack, because it definitely has the model weights:
Though this particular story for weight exfiltration also seems pretty easy to prevent with standard computer security: there’s no reason for the inference servers to have the permission to create outgoing network connections. But it could also write code such that the weights are transferred to the scaffold, which in turn could transfer them to the internet. In general, modifying the codebase or the infrastructure is probably the best self-exfiltration strategy that doesn’t require help from employees (in scenarios where AIs are massively deployed internally and are doing most of the work inside an AI lab).
Rogue internal deployments
In AI catastrophes and rogue deployments, I described the possibility of the model spawning an unauthorized agent that doesn’t have your intended safety measures built in to it. Now that we have this system architecture, we can describe some specific versions of that threat model.
The easiest way to imagine a rogue internal deployment is the model writing its own agent loop on the execution machine, using its GPUs.
Another option would be for it to make a rogue scaffold on the execution machine that uses its access to the inference server:
But this seems easier to prevent, because it’s easy to add basic monitoring to the inference server that should catch queries that are blatantly being used to implement unauthorized scaffolds.
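A hedged sketch of what such basic monitoring could look like (the request fields, the allowlist, and the helper names are all made up; a real system would presumably log and review rather than just reject):

```python
# Hypothetical check running on the inference server's request path.
AUTHORIZED_SCAFFOLDS = {"agent-scaffold-prod", "agent-scaffold-staging"}  # made-up IDs
EXPECTED_SYSTEM_PROMPT = "You are the internal research agent..."         # made-up prompt

def flag_for_review(request: dict, reason: str) -> None:
    # Stand-in for a real logging/alerting pipeline.
    print(f"FLAGGED ({reason}): scaffold_id={request.get('scaffold_id')!r}")

def check_request(request: dict) -> bool:
    if request.get("scaffold_id") not in AUTHORIZED_SCAFFOLDS:
        flag_for_review(request, "unknown scaffold id")
        return False
    if request.get("system_prompt") != EXPECTED_SYSTEM_PROMPT:
        # A rogue scaffold improvised on the execution machine is unlikely to
        # reproduce the official prompt and metadata exactly.
        flag_for_review(request, "unexpected system prompt")
        return False
    return True
```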
You could also have a rogue internal deployment via compromising one of the other services, e.g. this story where you compromise the internal deployment.
Using this framework to discuss alleged examples of AIs modifying their own environments
Sakana AI
Sakana AI got some hype when they reported their AI scientist editing its own code to run with a longer timeout. They actually did just have their scientist executing code directly on the machine that was running their scaffold, despite encouraging other people to use containerization. I think this is bizarre. They describe this choice causing problems for them, e.g. their agent spawned too many processes at one point, and at another it used too much disk space and crashed their computer.
o1 agent architecture
OpenAI describes an agent architecture like so (this is from the o1 system card):
Where they wrote “Agent” here, I’m drawing out the distinction between the scaffold and the inference server, which almost surely existed here. So everything that happened here was inside the execution server, and their scaffold is of the form I described here.
Sycophancy to subterfuge
In Sycophancy to subterfuge: Investigating reward tampering in language models, Denison et al describe an AI agent editing its own reward function. This can’t really happen in the way described in the paper if you’re using the architecture I was describing. The authors didn’t intend this paper to be a realistic example of how this kind of problem might arise; they just wanted it to be a simple toy example.
There’s a similar example of a deployment setup without these distinctions in GDM’s safety filter eval (section 7 here).