> The main class of projects that need granular model weight access to frontier models is model internals/interpretability.
You could potentially do a version of this sort of API which has some hooks for interacting with activations to capture a subset of these use cases (e.g. training probes). It would probably add a lot of complexity though and might only cover a small subset of research.
Yeah, totally, there's a bunch of stuff like this you could do. The two main issues:
It would be a mildly useful exercise for someone to go through the most important techniques that interact with model internals and see how many of them would run into these problems.
I'm guessing most modern interp work should be fine. Interp has moved away from "let's do this complicated patching of attention head patterns between prompts" toward basically only interacting with residual stream activations. You can easily do this with e.g. PyTorch hooks, even in modern inference engines like vLLM. The amount of computation performed in a hook is usually trivial; I have never noticed a slowdown in my vLLM generations when using hooks.
Because of this, I don't think batched execution would be a problem; you'd probably want some validation in the hook so it can only interact with activations from the user's own prompts.
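For concreteness, here is a minimal sketch of the kind of hook I mean, written against a plain Hugging Face transformers model rather than vLLM's internals (wiring a hook into a serving engine is engine-specific, and the gpt2 model and layer index here are just illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Basic mechanism only (plain Hugging Face model; hooking the same idea into a
# serving engine like vLLM is engine-specific). The hook copies out the residual
# stream leaving one decoder block; in a shared batched server you would also
# validate/slice to the rows of the batch that belong to the requesting user.

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

captured = []

def capture_hook(module, module_inputs, output):
    # For GPT-2 blocks, output[0] is the hidden state (residual stream).
    captured.append(output[0].detach().cpu())

handle = model.transformer.h[6].register_forward_hook(capture_hook)

enc = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**enc)
handle.remove()

print(captured[0].shape)  # (batch, seq_len, d_model) == (1, 4, 768)
```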
There's also nnsight, which already supports remote execution of PyTorch hooks on models hosted on Bau Lab machines through an API. I think they do some validation to ensure users can't do anything malicious.
You would need some process to handle the activation data, because it's large. If I'm training a probe on 1M activations with d_model = 10k in bfloat16, that's 20GB of data. SAEs are commonly trained on 500M+ activations. We probably don't want the user to have local access to all of that, but they probably want to do some analysis on it.
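To make the numbers and the workflow concrete, here is a minimal sketch (my own shapes and setup, not any particular pipeline) of a linear probe trained on chunks of stored activations rather than on a full local dump:

```python
import torch

# Illustrative numbers from above: 1M activations, d_model = 10k, bfloat16.
n, d_model = 1_000_000, 10_000
print(n * d_model * 2 / 1e9, "GB")  # ~20 GB for the raw activation dump

# The probe itself is tiny compared to the activations, so the natural setup is to
# stream chunks of activations (from the provider's storage) through a training
# step like this, rather than shipping the whole dump to the user's machine.
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_step(acts: torch.Tensor, labels: torch.Tensor) -> float:
    """One probe-training step on a chunk of activations."""
    logits = probe(acts.float()).squeeze(-1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```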
Yeah, what I'm saying is that even if the computation performed in a hook is trivial, it sucks if that computation has to happen on a different computer than the one doing inference.
In nnsight, hooks are submitted via an API to run on a remote machine, so the computation is performed on the same computer as the one doing the inference. They do some validation to ensure that it's only legitimate PyTorch operations, so it isn't just arbitrary code execution.
Yeah for sure. A really nice thing about the Tinker API is that it doesn't allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.
Yeah, makes sense.
Letting users submit hooks could potentially be workable from a security angle. For the most part, there's only a small number of very simple operations that are necessary for interacting with activations. nnsight transforms the submitted hooks into an intervention graph before running it on the remote server, and the nnsight engineers that I've talked to thought that there wasn't much risk of malicious code execution due to the simplicity of the operations that they allow.
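For a sense of what this looks like from the user's side, here is roughly the style of nnsight usage I have in mind. This is from memory of nnsight's tracing API, so module paths and argument names may differ across versions; treat it as a sketch rather than a reference.

```python
from nnsight import LanguageModel

# Rough sketch from memory of nnsight's tracing API; exact module paths and
# arguments may differ by version. The user writes ordinary-looking access to
# model internals, nnsight compiles it into an intervention graph, and (with
# remote execution, e.g. a remote=True flag) only that graph is shipped to the
# machine holding the weights.

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in"):
    # Save the residual stream after block 5; .save() marks the value to return.
    acts = model.transformer.h[5].output[0].save()

print(acts.shape)  # (batch, seq_len, d_model)
```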
However, this is still a far larger attack surface than no remote code execution at all, so it's plausible this would not be worth it for security reasons.
> Firstly, your researchers normally have access to the model architecture. This is unfortunate if you want to avoid it leaking. It's not clear how important this is. My sense is that changes to model architecture have been a minority of the algorithmic improvement since the invention of the transformer.
I agree with this, but I'd say it's good to do this anyway: if AIs start being able to do more and more research, the chances of architecture/paradigm changes go up, especially if AI labor scales faster than human labor, so it's worth preventing this kind of leak early on.
Also, good news on the new Tinker API.
Last week, Thinking Machines announced Tinker. It’s an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of safety papers that involve RL on big models.
But it's more interesting to me for another reason: the design of this API makes it possible to do many types of ML research without direct access to the model you’re working with. APIs like this might allow AI companies to reduce how many of their researchers (either human or AI) have access to sensitive model weights, which is good for reducing the probability of weight exfiltration and other rogue deployments.
Learning about the design of this API, and observing that Thinking Machines was actually able to implement it, has made me moderately more optimistic about mitigating insider threat from researchers (human or AI agents) doing AI R&D.
(Thinking Machines gave us early access to the product; in exchange, we gave them bug reports and let them mention our name in their announcement blog post. It was helpful to us that they gave us this access, and we expect to use it substantially going forward. I don't have a strong opinion about whether this product is overall good or bad for the world; note that my title claims an evidential rather than causal effect 🙂.)
Previous LLM training APIs, e.g. the OpenAI SFT and RL APIs, let you submit training data, which they then use to train the LLM with an algorithm that they implemented themselves. This is great if you want to use exactly the algorithm that they implemented, but it's inflexible.
The Tinker API instead gives you a lower-level interface that is flexible enough to allow you to implement a wide variety of training algorithms, and they have an open source library that implements many important training algorithms on top of their API.
Specifically, their API allows you to:

- run forward and backward passes on batches of examples, accumulating gradients for the LoRA you’re training (forward_backward);
- apply the accumulated gradients with an optimizer step (optim_step);
- sample from the model, including from your partially trained version (sample);
- save and load training state so you can checkpoint and resume (save_state and a corresponding load).
This is expressive enough that you can use it to implement most of the algorithms that are typically used for training models: SFT, DPO, RL algorithms, etc.
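To make that concrete, here is a rough sketch of what supervised fine-tuning looks like on top of primitives like these. The method names and arguments below are paraphrases of the primitives described above, not the actual Tinker SDK signatures; the point is just that the user's training loop never touches model weights directly.

```python
# Hypothetical sketch: `client` stands in for a training client exposing the
# primitives described above. Method names and arguments are illustrative,
# not the real Tinker SDK.

def sft_epoch(client, dataset, lr=1e-4):
    """One supervised fine-tuning epoch expressed purely in terms of the API primitives."""
    for batch in dataset:  # batch: tokenized (prompt, completion) examples
        # Compute the loss and accumulate gradients on the provider's machines.
        client.forward_backward(batch, loss="cross_entropy")
        # Apply the accumulated gradients with an optimizer step, also server-side.
        client.optim_step(learning_rate=lr)
    # Checkpoint the LoRA / optimizer state; the user gets a handle, not weights.
    return client.save_state()

def evaluate(client, prompts):
    # Sampling also happens server-side; the user only ever sees tokens.
    return [client.sample(p, max_tokens=128, temperature=0.0) for p in prompts]
```

An RL loop looks structurally similar: sample rollouts, score them client-side, and feed reward-weighted batches back through the forward/backward and optimizer-step primitives.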
One important insight behind this library is that because you’re just training LoRAs rather than full models, you can efficiently batch different users’ requests together. This is a huge efficiency boost. Of course, it only makes sense if LoRA fine-tuning works for your use case, which is presumably why Thinking Machines put out a blog post the other day arguing that LoRA fine-tuning is just as good as full-model fine-tuning for many common use cases.
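To illustrate why LoRA makes cross-user batching cheap, here is a minimal PyTorch sketch (the shapes and the einsum formulation are my own, not taken from any particular serving stack): the expensive matmul against the shared base weight is done once for the whole batch, while each request only adds a small per-request low-rank correction.

```python
import torch

# One shared base weight W serves every request in the batch, while each request i
# gets its own low-rank adapter (A_i, B_i). Only the small adapters differ per
# user, so requests from different users can share the expensive base matmul.

d_model, r, batch, seq = 1024, 8, 4, 16
W = torch.randn(d_model, d_model)                     # shared base weight (frozen)
A = torch.randn(batch, r, d_model) * 0.01             # per-request LoRA A matrices
B = torch.randn(batch, d_model, r) * 0.01             # per-request LoRA B matrices
x = torch.randn(batch, seq, d_model)                  # activations for 4 different users

base_out = x @ W.T                                    # one big shared matmul
lora_out = torch.einsum("bsd,brd->bsr", x, A)         # project down with each user's A
lora_out = torch.einsum("bsr,bdr->bsd", lora_out, B)  # project back up with each user's B
y = base_out + lora_out                               # per-user outputs from one batch
```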
The great thing about this API from a security perspective is that it doesn't give users direct access to the models they’re working with.
My informed guess is that 90% of researchers at frontier AI developers whose work involves fine-tuning (which is a large fraction of them) do it by writing scripts that are executed on machines that have access to model weights. This leads to several huge security risks that would be mitigated if they instead used this interface.
Firstly, your researchers normally have access to the model architecture. This is unfortunate if you want to avoid it leaking. It's not clear how important this is. My sense is that changes to model architecture have been a minority of the algorithmic improvement since the invention of the transformer.
Secondly and more importantly, your researchers normally have access to the weights of the model. Because they are running Python programs that load model weights and then do arbitrary computations on them, it’s totally doable for them to save a copy of the weights to the machine’s storage. They can then move those weights around with impunity, and potentially steal them.
(Unfortunately for the attacker, the weights might be much larger than the other files they need to work with. But it's actually pretty typical to produce some seriously big files when you're doing ML research, and the researcher might be able to split the weights into multiple fragments and steal them one at a time.)
They’re also able to do unmonitored training and inference. For example, if your researchers can directly implement inference on model weights on their research server, you can’t ensure that the outputs of your most powerful model are always logged. They can also arbitrarily fine-tune models and run inference on the fine-tuned models, which could go wrong in several ways.
(See here for more concrete pictures of how model weight access could lead to rogue deployments.)
I phrased all the above in terms of mitigating insider threat from humans, but all of it transfers to mitigating risk from misaligned AI agents. (See here for more on the analogy between insider threat from humans and insider threat from AIs.)
Of course, there can still be security issues with this interface:
The researchers could attempt covert malicious fine-tuning: that is, they could try to misuse the API so that they can get the AI to behave badly without monitors around the API noticing. It’s currently unclear whether it’s possible to prevent or detect covert malicious fine-tuning; I’d love to see more research on this topic.[1]
I think that forcing researchers to use APIs like this instead of direct model access would be a notable improvement for AI security and control. Currently, for most researchers at AI companies it would be fairly easy to misuse model weights, trivial to learn the model architecture, and plausibly doable to steal the model weights if they really tried and were fairly creative about it. This change would make all those attacks much harder.
We’ve already seen that researchers are thrilled to run model inference using a centralized inference server instead of hosting inference themselves, which is similarly helpful for security, though a much more restrictive interface than this one. Historically, training APIs have been too inflexible for me to expect that researchers would want to use them if they had the option of training directly; I think this API is flexible enough that researchers will want to use it even if they have model access. So learning about this API design updated me moderately towards thinking it is possible to secure model weights and architectures from malevolent insiders.
I think that, if you have access to this interface, the majority of ML research that is done at frontier AI companies can be done without the researcher having direct access to dangerous model weights (which we could operationalize as “model weights from models more capable than those available today”).
A simple breakdown of research that can be done with these restricted permissions:
The main class of projects that need granular model weight access to frontier models is model internals/interpretability.
So overall, I think it's plausible that the majority, perhaps the vast majority, of ML research and engineering inside AI companies might be doable without model weight access to powerful LLMs. I'm not sure how I expect this to change over time as the quantity of research labor increases.
One reason to be more pessimistic is that this API might be inherently less computationally efficient than a more direct implementation of fine-tuning. I don't know whether this is the case. The main argument that it is less efficient is that it requires a couple of network round trips and is generally less flexible; this matters more given my argument here that AI automation of AI R&D will probably lead to ML researchers putting more labor into computational efficiency. There's also a case that this API is more efficient, because of the possibility of batching requests from different users; this is especially relevant because I expect future AI researchers will mostly use internal inference APIs rather than deploying inference themselves.
Of course, in both security and AI control, even if you can’t prevent your insiders from ever needing a dangerous permission, you prefer that permission to be required less often, because you can concentrate your attention more on those rare dangerous situations. So even if 10% of your researchers needed to have direct model weight access rather than using this interface, that would still be a big win.
Overall, learning about the design of this API, and observing that Thinking Machines successfully implemented it, has made me moderately more optimistic about securing model weights against both human and AI insider threats when the humans or AIs are doing research. I think it would be beneficial for the security of AI companies if they implemented services like this for internal use, and I hope they do so.
To be clear, there are many other sources of insider threat aside from AI agents stealing model weights while doing ML R&D. The AI agents will probably be used to write security-critical code, to generate huge amounts of synthetic data that might be poisoned, to produce external artifacts like military robot software that might be sabotaged, to work with secrets that we don’t want leaked, and to answer important questions where adversarially wrong answers might be very dangerous. So this isn't a big update for me about the importance of AI control research.
This is also an unusually clear example, to me, of not knowing something important about the future of AI security that many AI company staff already knew: I didn't know that there existed a design as good as this one for allowing researchers to do lots of ML research flexibly without having direct model access.