Ideas for AI labs: Reading list

by Zach Stein-Perlman
24th Apr 2023
Crossposted to the EA Forum.

Related: AI policy ideas: Reading list.

This document is about ideas for AI labs. It's mostly from an x-risk perspective. Its underlying organization black-boxes technical AI stuff, including technical AI safety.

Lists & discussion

  • Towards best practices in AGI safety and governance: A survey of expert opinion (GovAI, Schuett et al. 2023) (LW)
    • This excellent paper is the best collection of ideas for labs. See pp. 18–22 for 100 ideas.
  • Frontier AI Regulation: Managing Emerging Risks to Public Safety (Anderljung et al. 2023)
    • Mostly about government regulation, but its recommendations on safety standards translate into recommendations for labs
  • Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023)
  • What AI companies can do today to help with the most important century (Karnofsky 2023) (LW)
  • Karnofsky nearcasting: How might we align transformative AI if it’s developed very soon?, Nearcast-based "deployment problem" analysis, and Racing through a minefield: the AI deployment problem (LW) (Karnofsky 2022)
  • Survey on intermediate goals in AI governance (Räuker and Aird 2023)
  • Corporate Governance of Artificial Intelligence in the Public Interest (Cihon, Schuett, and Baum 2021) and The case for long-term corporate governance of AI (Baum and Schuett 2021)
  • Three lines of defense against risks from AI (Schuett 2022) 
  • The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation (Brundage et al. 2018)
  • Adapting cybersecurity frameworks to manage frontier AI risks: defense in depth (IAPS, Ee et al. 2023)

Levers

  • AI developer levers and AI industry & academia levers in Advanced AI governance (LPP, Maas 2023)
    • This report is excellent
  • "Affordances" in "Framing AI strategy" (Stein-Perlman 2023)
    • This list may be more desiderata-y than lever-y

Desiderata

Maybe I should make a separate post on desiderata for labs (for existential safety).

  • Six Dimensions of Operational Adequacy in AGI Projects (Yudkowsky 2022)
  • "Carefully Bootstrapped Alignment" is organizationally hard (Arnold 2023)
  • Slowing AI: Foundations (Stein-Perlman 2023)
  • [Lots of ideas implied elsewhere, like "help others act well" and "minimize diffusion of your capabilities research"]

Ideas

Coordination[1]

See generally The Role of Cooperation in Responsible AI Development (Askell et al. 2019).

  • Coordinate to not train or deploy dangerous AI
    • Model evaluations
      • Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023) (LW)
      • ARC Evals
        • Safety evaluations and standards for AI (Barnes 2023)
        • Update on ARC's recent eval efforts (ARC 2023) (LW)
    • Safety standards

Transparency

Transparency enables coordination (and some regulation).

  • Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims (Brundage et al. 2020)
    • Followed up by Filling gaps in trustworthy development of AI (Avin et al. 2021)
  • Structured transparency
    • Exploring the Relevance of Data Privacy-Enhancing Technologies for AI Governance Use Cases (Bluemke et al. 2023)
    • Beyond Privacy Trade-offs with Structured Transparency (Trask and Bluemke et al. 2020)
  • Honest organizations (Christiano 2018)
  • Auditing & certification
    • Theories of Change for AI Auditing (Apollo 2023) and other Apollo stuff
    • What does it take to catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring (Shavit 2023)
    • Auditing large language models: a three-layered approach (Mökander et al. 2023)
      • The first two authors have other relevant-sounding work on arXiv
    • AGI labs need an internal audit function (Schuett 2023)
    • AI Certification: Advancing Ethical Practice by Reducing Information Asymmetries (Cihon et al. 2021)
    • Private literature review (2021)
  • Model evaluations
    • Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023) (LW)
    • Safety evaluations and standards for AI (Barnes 2023)
    • Update on ARC's recent eval efforts (ARC 2023) (LW)

Publication practices

Labs should minimize/delay the diffusion of their capabilities research.

  • Publication decisions for large language models, and their impacts (Cottier 2022)
  • Shift AI publication norms toward "don't always publish everything right away" in Survey on intermediate goals in AI governance (Räuker and Aird 2023)
  • "Publication norms for AI research" (Aird unpublished)
  • Publication policies and model-sharing decisions (Wasil et al. 2023)

Structured access to AI models

  • Sharing Powerful AI Models (Shevlane 2022)
  • Structured access for third-party research on frontier AI models (GovAI, Bucknall and Trager 2023)
  • Compute Funds and Pre-trained Models (Anderljung et al. 2022)

Governance structure

  • How to Design an AI Ethics Board (Schuett et al. 2023)
  • Ideal governance (for companies, countries and more) (Karnofsky 2022) (LW) has relevant discussion but not really recommendations

Miscellanea

  • Do more/better safety research; share safety research and safety-relevant knowledge
    • Do safety research as a common good
      • Do and share alignment and interpretability research
      • Help people who are trying to be safe be safe
      • Make AI risk and safety more concrete and legible
        • See Larsen et al.'s Instead of technical research, more people should focus on buying time and Ways to buy time (2022)
    • Pay the alignment tax (if you develop a critical model)
  • Improve your security (operational security, information security, and cybersecurity)
    • There's a private reading list on infosec/cybersec, but it doesn't have much about what labs (or others) should actually do.
  • Plan and prepare: ideally figure out what's good, publicly commit to doing what's good (e.g., perhaps monitoring for deceptive alignment or supporting external model evals), do it, and demonstrate that you're doing it
    • For predicting and avoiding misuse
    • For alignment
    • For deployment (especially of critical models)
    • For coordinating with other labs
      • Sharing
      • Stopping
      • Merging
      • More
    • For engaging government
    • For increasing time 'near the end' and using it well
    • For ending risk from misaligned AI
    • For how to get from powerful AI to a great long-term future
    • Much more...
  • Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems (Gray 2023)
    • See also this comment
    • See OpenAI's bug bounty program
  • Report incidents
  • The Windfall Clause: Distributing the Benefits of AI for the Common Good (O'Keefe et al. 2020)
    • Also sounds relevant: Safe Transformative AI via a Windfall Clause (Bova et al. 2021)
  • Watermarking[2]
  • Make, share, and improve a safety plan
    • OpenAI (LW) (but their more recent writing on "AI safety" is more prosaic)
    • DeepMind (unofficial and incomplete)
      • See also Shah on DeepMind alignment work
    • Anthropic (LW)
  • Make, share, and improve a plan for the long-term future
    • OpenAI (LW, Soares); OpenAI (LW)
  • Improve other labs' actions
    • Inform, advise, advocate, facilitate, support, coordinate
    • Differentially accelerate safer labs
  • Improve non-lab actors' actions
    • Government
      • Support good policy
      • See AI policy ideas: Reading list (Stein-Perlman 2023)
    • Standards-setters
      • How technical safety standards could promote TAI safety (O'Keefe et al. 2022)
      • Standards for AI Governance: International Standards to Enable Global Coordination in AI Research & Development (Cihon 2019)
    • Kinda the public
    • Kinda the ML community
  • Support miscellaneous other strategic desiderata
    • E.g. prevent new leading labs from appearing

See also

  • Best Practices for Deploying Language Models (Cohere, OpenAI, and AI21 Labs 2022)
    • See also Lessons learned on language model safety and misuse (OpenAI 2022)
  • Slowing AI (Stein-Perlman 2023)
  • Survey on intermediate goals in AI governance (Räuker and Aird 2023)

Some sources are roughly sorted within sections by a combination of x-risk-relevance, quality, and influentialness, but sometimes I didn't bother to try to sort them, and I haven't read all of them.

Please have a low bar to suggest additions, substitutions, rearrangements, etc.

Current as of: 9 July 2023.

  1.

    At various levels of abstraction, coordination can look like:
    - Avoiding a race to the bottom
    - Internalizing some externalities
    - Sharing some benefits and risks
    - Differentially advancing more prosocial actors?
    - More?

  2.

    Policymaking in the Pause (FLI 2023) cites A Systematic Review on Model Watermarking for Neural Networks (Boenisch 2021); I don't know if that source is good. (Note: this disclaimer does not imply that I know that the other sources in this doc are good!)

    I am not excited about watermarking. (Note: this disclaimer does not imply that I am excited about the other ideas in this doc! But I am excited about most of them.)
