> The main class of projects that need granular model weight access to frontier models is model internals/interpretability.
You could potentially do a version of this sort of API which has some hooks for interacting with activations to capture a subset of these use cases (e.g. training probes). It would probably add a lot of complexity though and might only cover a small subset of research.
Yeah, totally, there's a bunch of stuff like this you could do. The two main issues:
It would be a mildly useful exercise for someone to go through the most important techniques that interact with model internals and see how many of them would run into these problems.
I'm guessing most modern interp work should be fine. Interp has moved away from "let's do this complicated patching of attention head patterns between prompts" toward basically only interacting with residual stream activations. You can easily do this with e.g. PyTorch hooks, even in modern inference engines like vLLM. The amount of computation performed in a hook is usually trivial; I have never noticed a slowdown in my vLLM generations when using hooks.
Because of this, I don't think batched execution would be a problem; you'd probably want some validation in the hook so it can only interact with activations from the user's own prompts.
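For concreteness, here is a minimal sketch of the kind of hook I mean, written against a plain Hugging Face transformers model rather than vLLM's internals (wiring a hook into a serving engine is engine-specific, and the gpt2 model and layer index here are just illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Basic mechanism only (plain Hugging Face model; hooking the same idea into a
# serving engine like vLLM is engine-specific). The hook copies out the residual
# stream leaving one decoder block; in a shared batched server you would also
# validate/slice to the rows of the batch that belong to the requesting user.

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

captured = []

def capture_hook(module, module_inputs, output):
    # For GPT-2 blocks, output[0] is the hidden state (residual stream).
    captured.append(output[0].detach().cpu())

handle = model.transformer.h[6].register_forward_hook(capture_hook)

enc = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**enc)
handle.remove()

print(captured[0].shape)  # (batch, seq_len, d_model) == (1, 4, 768)
```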
There's also nnsight, which already supports remote execution of PyTorch hooks on models hosted on Bau Lab machines through an API. I think they do some validation to ensure users can't do anything malicious.
You would need some process to handle the activation data, because it's large. If I'm training a probe on 1M activations with d_model = 10k in bfloat16, that's 20GB of data. SAEs are commonly trained on 500M+ activations. We probably don't want the user to have local access to all of that, but they probably want to do some analysis on it.
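To make the numbers and the workflow concrete, here is a minimal sketch (my own shapes and setup, not any particular pipeline) of a linear probe trained on chunks of stored activations rather than on a full local dump:

```python
import torch

# Illustrative numbers from above: 1M activations, d_model = 10k, bfloat16.
n, d_model = 1_000_000, 10_000
print(n * d_model * 2 / 1e9, "GB")  # ~20 GB for the raw activation dump

# The probe itself is tiny compared to the activations, so the natural setup is to
# stream chunks of activations (from the provider's storage) through a training
# step like this, rather than shipping the whole dump to the user's machine.
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_step(acts: torch.Tensor, labels: torch.Tensor) -> float:
    """One probe-training step on a chunk of activations."""
    logits = probe(acts.float()).squeeze(-1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```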
Yeah, what I'm saying is that even if the computation performed in a hook is trivial, it sucks if that computation has to happen on a different computer than the one doing inference.
In nnsight, hooks are submitted via an API to run on a remote machine, so the computation is performed on the same computer as the one doing the inference. They do some validation to ensure that it's only legitimate PyTorch operations, so it isn't just arbitrary code execution.
Yeah for sure. A really nice thing about the Tinker API is that it doesn't allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.
Yeah, makes sense.
Letting users submit hooks could potentially be workable from a security angle. For the most part, there's only a small number of very simple operations that are necessary for interacting with activations. nnsight transforms the submitted hooks into an intervention graph before running it on the remote server, and the nnsight engineers that I've talked to thought that there wasn't much risk of malicious code execution due to the simplicity of the operations that they allow.
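For a sense of what this looks like from the user's side, here is roughly the style of nnsight usage I have in mind. This is from memory of nnsight's tracing API, so module paths and argument names may differ across versions; treat it as a sketch rather than a reference.

```python
from nnsight import LanguageModel

# Rough sketch from memory of nnsight's tracing API; exact module paths and
# arguments may differ by version. The user writes ordinary-looking access to
# model internals, nnsight compiles it into an intervention graph, and (with
# remote execution, e.g. a remote=True flag) only that graph is shipped to the
# machine holding the weights.

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in"):
    # Save the residual stream after block 5; .save() marks the value to return.
    acts = model.transformer.h[5].output[0].save()

print(acts.shape)  # (batch, seq_len, d_model)
```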
However, this is still a far larger attack surface than no remote code execution at all, so it's plausible this would not be worth it for security reasons.
> Firstly, your researchers normally have access to the model architecture. This is unfortunate if you want to avoid it leaking. It's not clear how important this is. My sense is that changes to model architecture have been a minority of the algorithmic improvement since the invention of the transformer.
I agree with this, but I'd say it's good to do this anyway: if AIs start being able to do more and more research, the chances of architecture/paradigm changes go up, especially if AI labor scales faster than human labor, so it's worth preventing this kind of leak early on.
Also, good news on the new Tinker API.
Last week, Thinking Machines announced Tinker. It’s an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of safety papers that involve RL on big models.
But it's more interesting to me for another reason: the design of this API makes it possible to do many types of ML research without direct access to the model you’re working with. APIs like this might allow AI companies to reduce how many of their researchers (either human or AI) have access to sensitive model weights, which is good for reducing the probability of weight exfiltration and other rogue deployments.
Learning about the design of this API, and observing that Thinking Machines was actually able to implement it, has made me moderately more optimistic about mitigating insider threat from researchers (human or AI agents) doing AI R&D.
(Thinking Machines gave us early access to the product; in exchange, we gave them bug reports and let them mention our name in their announcement blog post. It was helpful to us that they gave us this access, and we expect to use it substantially going forward. I don't have a strong opinion about whether this product is overall good or bad for the world; note that my title claims an evidential rather than causal effect 🙂.)
Previous LLM training APIs, e.g. the OpenAI SFT and RL APIs, let you submit training data, which they then use to train the LLM with an algorithm that they implemented themselves. This is great if you want to use exactly the algorithm that they implemented, but it's inflexible.
The Tinker API instead gives you a lower-level interface that is flexible enough to allow you to implement a wide variety of training algorithms, and they have an open source library that implements many important training algorithms on top of their API.
Specifically, their API allows you to:

- run forward and backward passes on batches of examples, accumulating gradients for the LoRA you’re training (forward_backward);
- apply the accumulated gradients with an optimizer step (optim_step);
- sample from the model, including from your partially trained version (sample);
- save and load training state so you can checkpoint and resume (save_state and a corresponding load).
This is expressive enough that you can use it to implement most of the algorithms that are typically used for training models: SFT, DPO, RL algorithms, etc.
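To make that concrete, here is a rough sketch of what supervised fine-tuning looks like on top of primitives like these. The method names and arguments below are paraphrases of the primitives described above, not the actual Tinker SDK signatures; the point is just that the user's training loop never touches model weights directly.

```python
# Hypothetical sketch: `client` stands in for a training client exposing the
# primitives described above. Method names and arguments are illustrative,
# not the real Tinker SDK.

def sft_epoch(client, dataset, lr=1e-4):
    """One supervised fine-tuning epoch expressed purely in terms of the API primitives."""
    for batch in dataset:  # batch: tokenized (prompt, completion) examples
        # Compute the loss and accumulate gradients on the provider's machines.
        client.forward_backward(batch, loss="cross_entropy")
        # Apply the accumulated gradients with an optimizer step, also server-side.
        client.optim_step(learning_rate=lr)
    # Checkpoint the LoRA / optimizer state; the user gets a handle, not weights.
    return client.save_state()

def evaluate(client, prompts):
    # Sampling also happens server-side; the user only ever sees tokens.
    return [client.sample(p, max_tokens=128, temperature=0.0) for p in prompts]
```

An RL loop looks structurally similar: sample rollouts, score them client-side, and feed reward-weighted batches back through the forward/backward and optimizer-step primitives.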
One important insight behind this library is that because you’re just training LoRAs rather than full models, you can efficiently batch different users’ requests together. This is a huge efficiency boost. Of course, it only makes sense if LoRA fine-tuning works for your use case, which is presumably why Thinking Machines put out a blog post the other day arguing that LoRA fine-tuning is just as good as full-model fine-tuning for many common use cases.
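To illustrate why LoRA makes cross-user batching cheap, here is a minimal PyTorch sketch (the shapes and the einsum formulation are my own, not taken from any particular serving stack): the expensive matmul against the shared base weight is done once for the whole batch, while each request only adds a small per-request low-rank correction.

```python
import torch

# One shared base weight W serves every request in the batch, while each request i
# gets its own low-rank adapter (A_i, B_i). Only the small adapters differ per
# user, so requests from different users can share the expensive base matmul.

d_model, r, batch, seq = 1024, 8, 4, 16
W = torch.randn(d_model, d_model)                     # shared base weight (frozen)
A = torch.randn(batch, r, d_model) * 0.01             # per-request LoRA A matrices
B = torch.randn(batch, d_model, r) * 0.01             # per-request LoRA B matrices
x = torch.randn(batch, seq, d_model)                  # activations for 4 different users

base_out = x @ W.T                                    # one big shared matmul
lora_out = torch.einsum("bsd,brd->bsr", x, A)         # project down with each user's A
lora_out = torch.einsum("bsr,bdr->bsd", lora_out, B)  # project back up with each user's B
y = base_out + lora_out                               # per-user outputs from one batch
```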
The great thing about this API from a security perspective is that it doesn't give users direct access to the models they’re working with.
My informed guess is that 90% of researchers at frontier AI developers whose work involves fine-tuning (which is a large fraction of them) do it by writing scripts that are executed on machines that have access to model weights. This leads to several huge security risks that would be mitigated if they instead used this interface.
Firstly, your researchers normally have access to the model architecture. This is unfortunate if you want to avoid it leaking. It's not clear how important this is. My sense is that changes to model architecture have been a minority of the algorithmic improvement since the invention of the transformer.
Secondly and more importantly, your researchers normally have access to the weights of the model. Because they are running Python programs that load model weights and then do arbitrary computations on them, it’s totally doable for them to save a copy of the weights to the machine’s storage. They can then move those weights around with impunity, and potentially steal them.
(Unfortunately for the attacker, the weights might be much larger than the other files they need to work with. But it's actually pretty typical to produce some seriously big files when you're doing ML research, and the researcher might be able to split the weights into multiple fragments and steal them one at a time.)
They’re also able to do unmonitored training and inference. For example, if your researchers can directly implement inference on model weights on their research server, you can’t ensure that the outputs of your most powerful model are always logged. They can also arbitrarily fine-tune models and run inference on the fine-tuned models, which could go wrong in several ways.
(See here for more concrete pictures of how model weight access could lead to rogue deployments.)
I phrased all the above in terms of mitigating insider threat from humans, but all of it transfers to mitigating risk from misaligned AI agents. (See here for more on the analogy between insider threat from humans and insider threat from AIs.)
Of course, there can still be security issues with this interface:
The researchers could attempt covert malicious fine-tuning: that is, they could try to misuse the API so that they can get the AI to behave badly without monitors around the API noticing. It’s currently unclear whether it’s possible to prevent or detect covert malicious fine-tuning; I’d love to see more research on this topic.[1]
I think that forcing researchers to use APIs like this instead of direct model access would be a notable improvement for AI security and control. Currently, for most researchers at AI companies it would be fairly easy to misuse model weights, trivial to learn the model architecture, and plausibly doable to steal the model weights if they really tried and were fairly creative about it. This change would make all those attacks much harder.
We’ve already seen that researchers are thrilled to run model inference using a centralized inference server instead of hosting inference themselves, which is similarly helpful for security, though a much more restrictive interface than this one. Historically, training APIs have been too inflexible for me to expect that researchers would want to use them if they had the option of training directly; I think this API is flexible enough that researchers will want to use it even if they have model access. So learning about this API design updated me moderately towards thinking it is possible to secure model weights and architectures from malevolent insiders.
I think that, if you have access to this interface, the majority of ML research that is done at frontier AI companies can be done without the researcher having direct access to dangerous model weights (which we could operationalize as “model weights from models more capable than those available today”).
A simple breakdown of research that can be done with these restricted permissions:
The main class of projects that need granular model weight access to frontier models is model internals/interpretability.
So overall, I think it's plausible that the majority, perhaps the vast majority, of ML research and engineering inside AI companies might be doable without model weight access to powerful LLMs. I'm not sure how I expect this to change over time as the quantity of research labor increases.
One reason to be more pessimistic is that this API might be inherently less computationally efficient than a more direct implementation of fine-tuning. I don't know whether this is the case. The main argument that it is less efficient is that it requires a couple of network round trips and is generally less flexible; this matters more given my argument here that AI automation of AI R&D will probably lead to ML researchers putting more labor into computational efficiency. There's also a case that this API is more efficient, because of the possibility of batching requests from different users; this is especially relevant because I expect future AI researchers will mostly use internal inference APIs rather than deploying inference themselves.
Of course, in both security and AI control, even if you can’t prevent your insiders from ever needing a dangerous permission, you prefer that permission to be required less often, because you can concentrate your attention more on those rare dangerous situations. So even if 10% of your researchers needed to have direct model weight access rather than using this interface, that would still be a big win.
Overall, learning about the design of this API, and observing that Thinking Machines successfully implemented it, has made me moderately more optimistic about securing model weights against both human and AI insider threats when the humans or AIs are doing research. I think it would be beneficial for the security of AI companies if they implemented services like this for internal use, and I hope they do so.
To be clear, there are many other sources of insider threat aside from AI agents stealing model weights while doing ML R&D. The AI agents will probably be used to write security-critical code, to generate huge amounts of synthetic data that might be poisoned, to produce external artifacts like military robot software that might be sabotaged, to work with secrets that we don’t want leaked, and to answer important questions where adversarially wrong answers might be very dangerous. So this isn't a big update for me about the importance of AI control research.
This is also an unusually clear example, to me, of not knowing something important about the future of AI security that many AI company staff already knew: I didn't know that there existed a design as good as this one for allowing researchers to do lots of ML research flexibly without having direct model access.