Stopping dangerous AI: Ideal lab behavior

Zach Stein-Perlman

Disclaimer: this post doesn't have the answers. Moreover, it's an unfinished draft. Hopefully a future version will be valuable, but that will only occur if I revise/rewrite it. For now you're better off reading sources linked from Ideas for AI labs: Reading list and Slowing AI: Reading list.

Why "stopping dangerous AI" rather than "slowing AI" here? It's more appropriate for actors' actions. "Slowing" is a precise, continuous variable and is what strategists think about, while "stopping" is a natural, simple thought and is what actors naturally aim for. Note that in the context of slowing, "AI" always meant "dangerous AI."

This post black-boxes AI safety (and how to do good with powerful AI) to focus on the speed of AI progress. What should leading labs do to facilitate a safe transition to a world with powerful AI? What would I do and plan if I controlled a leading lab?^[1]

It would be better if progress toward dangerous AI capabilities were slower, all else equal. (This will become much more true near the end.) Or: it would be better if dangerous AI capabilities appeared later, all else equal. (And slowing near the end is particularly important.)

Ways labs can slow AI:

- Pause progress toward dangerous systems
  - And convince others to pause
  - And push for and facilitate a mandatory pause, e.g. enforced by governments (non-government standards-setters and industry self-regulation are also relevant)
  - And pause later
- Publish less research relevant to dangerous systems
  - And convince others to publish less
  - And push for and facilitate mandatory publication rules, e.g. enforced by governments
  - And decrease diffusion of ideas more generally
    - Infosec, opsec, cybersec
    - Deploy slowly and limit API access as appropriate
- Raise awareness of AI risk; look for and publicize warning signs; maybe make demos of scary AI capabilities. Influence other labs, government, etc.: make them better informed about AI risk and how they can help slow AI. Help government stop dangerous AI.
- Prepare to coordinate near the end
  - Commit to slow down near the end
  - Commit not to compete near the end
  - Gain the ability to make themselves selectively transparent
- Maybe benefit-sharing mechanisms like better versions of the "windfall clause" and creating common knowledge of shared values to reduce incentives to race

(There are lots of other things labs should do for safety; this is just about slowing AI.)

(Note that what would be best for an actor to do and what we should try to nudge that actor to do are different, not just because of tractability but also because the optimal change on the margin is not necessarily in the direction of the optimal policy. For example, perhaps if you have lots of control over an unsafe lab you should focus on making it safer, while if you have little control you should focus on differentially slowing it.)

Leading labs are doing some important good things. DeepMind, Anthropic, and OpenAI seem to be avoiding publishing research-on-scary-paths to varying degrees. OpenAI's stop-and-assist clause is a first step toward coordination between labs to slow down near the end. OpenAI seems to be supporting evals. I don't know the details of labs' advocacy to the US government but I think it's mostly good.

Meta: figuring out what labs could do is pretty different from choosing between different well-specified things labs could do.

^[2]

^[1] Assuming I knew I would remain in control of the lab. Note that what we should nudge labs to do on the margin in the real world is not necessarily what they should do to act optimally.
^[2] List considerations. Or, model: pretend there's a single critical deployment;* what determines whether the critical deployment goes well? Quality-adjusted alignment research done + alignment tax paid on that particular model?
So:
- [how much slowing occurs near the end] [combination of safety-conscious lead time and coordination among leading actors]
- [doing alignment research and probably improving public/expert/policymaker opinion and facilitating good public policy is good; details mostly omitted here? And preparing to do good with powerful AI]
- [note that some AI is on the dangerous-path and some isn't]
Desiderata/goals:
- Minimize sharing information or tools that accelerate incautious actors
- Prepare to slow down near the end?
- [what public policy should they support? Export controls? Compute monitoring and evals regime?]
- Coordinate / prepare to coordinate / tell others what you believe
- Stop and assist?
- Transparency?

*On AI-agent-takeover threat models, this is a simplification but often a reasonable one; on WFLL threat models, it may be unreasonable.

LESSWRONG