Ideas for AI labs: Reading list

by Zach Stein-Perlman
24th Apr 2023
Crossposted to the EA Forum.

Related: AI policy ideas: Reading list.

This document is about ideas for AI labs. It's mostly from an x-risk perspective. Its underlying organization black-boxes technical AI stuff, including technical AI safety.

Lists & discussion

  • Towards best practices in AGI safety and governance: A survey of expert opinion (GovAI, Schuett et al. 2023) (LW)
    • This excellent paper is the best collection of ideas for labs. See pp. 18–22 for 100 ideas.
  • Frontier AI Regulation: Managing Emerging Risks to Public Safety (Anderljung et al. 2023)
    • Mostly about government regulation, but its recommendations on safety standards translate into recommendations for labs
  • Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023)
  • What AI companies can do today to help with the most important century (Karnofsky 2023) (LW)
  • Karnofsky nearcasting: How might we align transformative AI if it’s developed very soon?, Nearcast-based "deployment problem" analysis, and Racing through a minefield: the AI deployment problem (LW) (Karnofsky 2022)
  • Survey on intermediate goals in AI governance (Räuker and Aird 2023)
  • Corporate Governance of Artificial Intelligence in the Public Interest (Cihon, Schuett, and Baum 2021) and The case for long-term corporate governance of AI (Baum and Schuett 2021)
  • Three lines of defense against risks from AI (Schuett 2022) 
  • The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation (Brundage et al. 2018)
  • Adapting cybersecurity frameworks to manage frontier AI risks: defense in depth (IAPS, Ee et al. 2023)

Levers

  • AI developer levers and AI industry & academia levers in Advanced AI governance (LPP, Maas 2023)
    • This report is excellent
  • "Affordances" in "Framing AI strategy" (Stein-Perlman 2023)
    • This list may be more desiderata-y than lever-y

Desiderata

Maybe I should make a separate post on desiderata for labs (for existential safety).

  • Six Dimensions of Operational Adequacy in AGI Projects (Yudkowsky 2022)
  • "Carefully Bootstrapped Alignment" is organizationally hard (Arnold 2023)
  • Slowing AI: Foundations (Stein-Perlman 2023)
  • [Lots of ideas implied elsewhere, like "help others act well" and "minimize diffusion of your capabilities research"]

Ideas

Coordination[1]

See generally The Role of Cooperation in Responsible AI Development (Askell et al. 2019).

  • Coordinate to not train or deploy dangerous AI
    • Model evaluations
      • Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023) (LW)
      • ARC Evals
        • Safety evaluations and standards for AI (Barnes 2023)
        • Update on ARC's recent eval efforts (ARC 2023) (LW)
    • Safety standards

Transparency

Transparency enables coordination (and some regulation).

  • Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims (Brundage et al. 2020)
    • Followed up by Filling gaps in trustworthy development of AI (Avin et al. 2021)
  • Structured transparency
    • Exploring the Relevance of Data Privacy-Enhancing Technologies for AI Governance Use Cases (Bluemke et al. 2023)
    • Beyond Privacy Trade-offs with Structured Transparency (Trask and Bluemke et al. 2020)
  • Honest organizations (Christiano 2018)
  • Auditing & certification
    • Theories of Change for AI Auditing (Apollo 2023) and other Apollo stuff
    • What does it take to catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring (Shavit 2023)
    • Auditing large language models: a three-layered approach (Mökander et al. 2023)
      • The first two authors have other relevant-sounding work on arXiv
    • AGI labs need an internal audit function (Schuett 2023)
    • AI Certification: Advancing Ethical Practice by Reducing Information Asymmetries (Cihon et al. 2021)
    • Private literature review (2021)
  • Model evaluations
    • Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023) (LW)
    • Safety evaluations and standards for AI (Barnes 2023)
    • Update on ARC's recent eval efforts (ARC 2023) (LW)

Publication practices

Labs should minimize/delay the diffusion of their capabilities research.

  • Publication decisions for large language models, and their impacts (Cottier 2022)
  • Shift AI publication norms toward "don't always publish everything right away" in Survey on intermediate goals in AI governance (Räuker and Aird 2023)
  • "Publication norms for AI research" (Aird unpublished)
  • Publication policies and model-sharing decisions (Wasil et al. 2023)

Structured access to AI models

  • Sharing Powerful AI Models (Shevlane 2022)
  • Structured access for third-party research on frontier AI models (GovAI, Bucknall and Trager 2023)
  • Compute Funds and Pre-trained Models (Anderljung et al. 2022)

Governance structure

  • How to Design an AI Ethics Board (Schuett et al. 2023)
  • Ideal governance (for companies, countries and more) (Karnofsky 2022) (LW) has relevant discussion but not really recommendations

Miscellanea

  • Do more/better safety research; share safety research and safety-relevant knowledge
    • Do safety research as a common good
      • Do and share alignment and interpretability research
      • Help people who are trying to be safe be safe
      • Make AI risk and safety more concrete and legible
        • See Larsen et al.'s Instead of technical research, more people should focus on buying time and Ways to buy time (2022)
    • Pay the alignment tax (if you develop a critical model)
  • Improve your security (operational security, information security, and cybersecurity)
    • There's a private reading list on infosec/cybersec, but it doesn't have much about what labs (or others) should actually do.
  • Plan and prepare: ideally figure out what's good, publicly commit to doing what's good (e.g., perhaps monitoring for deceptive alignment or supporting external model evals), do it, and demonstrate that you're doing it
    • For predicting and avoiding misuse
    • For alignment
    • For deployment (especially of critical models)
    • For coordinating with other labs
      • Sharing
      • Stopping
      • Merging
      • More
    • For engaging government
    • For increasing time 'near the end' and using it well
    • For ending risk from misaligned AI
    • For how to get from powerful AI to a great long-term future
    • Much more...
  • Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems (Gray 2023)
    • See also this comment
    • See OpenAI's bug bounty program
  • Report incidents
  • The Windfall Clause: Distributing the Benefits of AI for the Common Good (O'Keefe et al. 2020)
    • Also sounds relevant: Safe Transformative AI via a Windfall Clause (Bova et al. 2021)
  • Watermarking[2]
  • Make, share, and improve a safety plan
    • OpenAI (LW) (but their more recent writing on "AI safety" is more prosaic)
    • DeepMind (unofficial and incomplete)
      • See also Shah on DeepMind alignment work
    • Anthropic (LW)
  • Make, share, and improve a plan for the long-term future
    • OpenAI (LW, Soares); OpenAI (LW)
  • Improve other labs' actions
    • Inform, advise, advocate, facilitate, support, coordinate
    • Differentially accelerate safer labs
  • Improve non-lab actors' actions
    • Government
      • Support good policy
      • See AI policy ideas: Reading list (Stein-Perlman 2023)
    • Standards-setters
      • How technical safety standards could promote TAI safety (O'Keefe et al. 2022)
      • Standards for AI Governance: International Standards to Enable Global Coordination in AI Research & Development (Cihon 2019)
    • Kinda the public
    • Kinda the ML community
  • Support miscellaneous other strategic desiderata
    • E.g. prevent new leading labs from appearing

See also

  • Best Practices for Deploying Language Models (Cohere, OpenAI, and AI21 Labs 2022)
    • See also Lessons learned on language model safety and misuse (OpenAI 2022)
  • Slowing AI (Stein-Perlman 2023)
  • Survey on intermediate goals in AI governance (Räuker and Aird 2023)

Some sources are roughly sorted within sections by a combination of x-risk-relevance, quality, and influentialness, but sometimes I didn't bother to try to sort them, and I haven't read all of them.

Please have a low bar to suggest additions, substitutions, rearrangements, etc.

Current as of: 9 July 2023.

  1.

    At various levels of abstraction, coordination can look like:
    - Avoiding a race to the bottom
    - Internalizing some externalities
    - Sharing some benefits and risks
    - Differentially advancing more prosocial actors?
    - More?

  2.

    Policymaking in the Pause (FLI 2023) cites A Systematic Review on Model Watermarking for Neural Networks (Boenisch 2021); I don't know if that source is good. (Note: this disclaimer does not imply that I know that the other sources in this doc are good!)

    I am not excited about watermarking. (Note: this disclaimer does not imply that I am excited about the other ideas in this doc! But I am excited about most of them.)
