Towards AI Safety Infrastructure: Talk & Outline

Paul Bricman

This is a linkpost for https://www.youtube.com/watch?v=xUCvGEZ30Uw

Thanks to Esben Kran and the whole Alignment Jam team for setting this up.

Context: We recently got the chance to share a bit more about the flavor of research we're particularly excited about at Straumli AI — that is, designing infrastructure that could help the relevant parties (e.g., developers, auditors, regulators, users) coordinate more effectively, and so make new governance initiatives possible. The talk is mostly an attempt to gesture at this line of work through a series of examples, each of which generally highlights how a certain cryptographic primitive could be used to address one particular coordination problem.

The talk expands on the following pieces of infrastructure:

Hashmarking (2:10) is the only item on the list which we've already written a paper on and which we're already using in practice. This protocol helps developers, auditors, and domain experts coordinate on creating and administering QA-style benchmarks without having to share the reference solutions outright. We think it may be particularly helpful in the context of dual-use capabilities.
Responsible Pioneer Protocol (5:56) targets the race dynamics that frontier labs may default to when trying not to fall behind their competitors. It builds on existing solutions to Yao's Millionaires' Problem to help frontier labs learn whether any competitor is more than, e.g., 10% ahead on a particular metric, without requiring labs to disclose their actual progress. The hope is that labs may find it easier to be cautious knowing that no one is too far ahead.
Neural Certificate Authorities (8:37) build on battle-tested practices from traditional internet infrastructure to help auditors share tamper-evident certificates with evaluation results in a way that other parties could seamlessly build on. This could include other NCAs automatically running inference on the findings of other parties: "based on statements $A$ and $B$ about model $M$, which have been signed by trusted parties $P_{1}$ and $P_{2}$, respectively, I issue statement $C$ as party $P_{3}$."
Latent Differential Privacy (12:01) combines ideas from differential privacy with embeddings as numerical representations of meaning to help individuals contribute their personal data to the training of a language model. This is meant to incentivize the opening up of the training process in a way that may also make third-party monitoring easier and provide regulators with new affordances.
Cognitive Kill Switch (15:23) targets the ease of removing safety guardrails from open source models with ideas from meta-learning. Whereas prior work explored ways of optimizing models so that fine-tuning is easier, we may be able to optimize a model such that any attempt to tamper with it would lead to (late) activations going to zero, making it difficult to fine-tune the guardrails away due to the technicalities of gradient descent. An NCA could then attest to the presence of such a mechanism.
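
To make the Hashmarking item above a bit more concrete, here is a minimal sketch of the general idea of checking answers against published hashes rather than plaintext reference solutions. It is an illustration under stated assumptions, not the protocol from the paper: the salted PBKDF2 hashing, the normalization step, and the names `normalize`, `hash_answer`, `publish_item`, and `is_correct` are all hypothetical choices made for this example.

```python
import hashlib
import hmac
import os

# Assumption: a deliberately slow hash, so brute-forcing short answers costs more.
ITERATIONS = 100_000


def normalize(answer: str) -> bytes:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(answer.lower().split()).encode("utf-8")


def hash_answer(answer: str, salt: bytes) -> str:
    """Return a salted PBKDF2-HMAC-SHA256 digest of the normalized answer, as hex."""
    digest = hashlib.pbkdf2_hmac("sha256", normalize(answer), salt, ITERATIONS)
    return digest.hex()


def publish_item(question: str, reference_answer: str) -> dict:
    """What a domain expert would share: the question, a salt, and the answer hash only."""
    salt = os.urandom(16)
    return {
        "question": question,
        "salt": salt.hex(),
        "answer_hash": hash_answer(reference_answer, salt),
    }


def is_correct(item: dict, candidate: str) -> bool:
    """What an auditor would run: hash the model's answer and compare it to the published hash."""
    salt = bytes.fromhex(item["salt"])
    return hmac.compare_digest(hash_answer(candidate, salt), item["answer_hash"])
```

For example, `is_correct(publish_item("What is the capital of France?", "Paris"), " paris ")` evaluates to true, while the published item itself never contains the string "Paris". Exact-match hashing like this only works for short answers with a canonical form, which is one reason the normalization step matters.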

The second half of the recording was for Q&A. Some great questions people brought up:

What type of entities might be best-suited for contributing to such infrastructure? How can for-profits help with building infrastructure as a public good?
What type of resources or upskilling journeys might be useful for people interested in this space? Spoiler: the OpenMined courses are awesome.
How can this line of work complement verifiable ML and more broadly work towards obtaining various provable guarantees on models?

Action point: If you think tools like the ones in the talk could help address a pain point of your safety-conscious organization, let's chat!
