Outsiders should focus on specs/constitutions (among other things)

Cleo Nardo

I think that the external AI safety community should prioritise model specs/constitutions over the next 12 months. It shouldn't be our top priority,^[1] but it's pretty important^[2] and neglected. In this post, I will argue that it's tractable, even if you aren't a lab employee:

It’s a natural language document. So you don’t need to know any ML or engineering.
You don’t need to know about the internal codebase of the lab, or other proprietary details about how they train the models. All you need to know is the current spec/constitution — but that is public and will probably remain public.
Insiders might enjoy more R&D uplift than outsiders (e.g. they have access to models which haven't been publicly deployed, or have higher rate limits, or don't need to pay costs). So the outsiders should focus on work for which there is less uplift, e.g. macrostrategy / conceptual reasoning / threat modelling. And spec/constitution involves exactly these kinds of tasks.
It’s very easy to integrate suggestions from the outsiders into the spec/constitution. It's copy-pasting a short text string into a markdown file.
1. This is in contrast to integrating a new safety technique — which involves transferring from the open-source infrastructure to the closed-source one.
2. It’s pretty costly to train a new model on the new spec/constitution — but this obstacle applies equally strongly to amendments proposed by the insiders.
The spec/constitution describes how the model should behave in a wide range of different domains, avoiding a wide range of different threats. So if you have expertise in any domain or threat model then you can probably contribute.
There’s a precedent for the outsiders contributing to the spec/constitution. Here are the acknowledgements for the Claude constitution: "External commenters who gave detailed feedback or discussion on the document include: Jim Baker, Owen Cotton-Barratt, Mariano-Florentino Cuéllar, Justin Curl, Tom Davidson, Lukas Finnveden, Brian Green, Ryan Greenblatt, janus, Joshua Joseph, Daniel Kokotajlo, Will MacAskill, Father Brendan McGuire, Antra Tessera, Bishop Paul Tighe, Jordi Weinstock, and Jonathan Zittrain."
You can influence the spec/constitution by: writing a draft passage, adding an explanation of why this amendment would help, then sending it to an insider working on spec/constitution.
Some work on spec/constitution seems inherently ill-suited for the labs, e.g. power concentration.

Recommendations:

It might be hard to persuade the labs that their overall judgement is incorrect (e.g. the tradeoff between alignment and corrigibility). Instead, you should focus on topics the lab hasn’t considered or formed a judgement about.
You should write draft passages. But to avoid constitutional poisoning, don't write "Claude" in the drafts. You could use a different French name, see here for an example.
Think about threat models that aren’t addressed in the current constitution. Then think how to inoculate against those.

^[1]
Some other priority tiers include:
P0: Demonstrate risks; Communicate with lab leadership & policymakers
P1: Track timelines; Stress-test safety cases (e.g. model organisms)
P2: Capacity building; Communicate to public; Invent/refine safety techniques which can be exported to labs; Basic science (e.g. deep learning, LLM generalisation, etc.).
P3: Secure future funding; Secure future model access; Elicit capabilities on our target domains (e.g. macrostrategy, alignment research)
I think specs/constitutions should be a P2.
^[2]
See AI character is a big deal by Will MacAskill and Tom Davidson.

Constitutions/Specs don't really address any of the difficult alignment challenges. Working on them is about as useful as engaging in the classical argument "aligned to whom?!?!", which, you know, is a fine question to ask sometimes, but is orthogonal to a lot of what I want people to focus on (in contrast to, for example, understanding and communicating the total level of risk imposed by frontier AI development).

I've now revised the text and title to express that this is one thing for us to work on among others.

I think that you're underrating the constitution/spec. It's pretty different from the question "aligned to whom?!?!".

It's more like: How should the next generation of models behave, such that we achieve the following goals? (i) Mitigating the risk of catastrophe from that particular model. (ii) Eliciting the capabilities necessary to use the model to [automate safety research / monitor other models / harden security / improve epistemics / etc].

I think that what matters is not just the Constitution, but also the proposed training pipeline (alignment via systematic debate? Self-critique à la KimiK2, so that the model never learns to flatter the user, as demonstrated by the Spiral Bench or Tim Hua's experiment? Rewarding Agent-4 for making its drafts legible to Agent-3, and checking this by ensuring that Agent-3 understands them and Agent-2 doesn't?)
