AI labs can boost external safety research

by Zach Stein-Perlman
31st Jul 2024

Frontier AI labs can boost external safety researchers by

  • Sharing better access to powerful models (early access, fine-tuning, helpful-only,[1] filters/moderation-off, logprobs, activations)[2] (a sketch of logprobs access follows this list)
  • Releasing research artifacts besides models
  • Publishing (transparent, reproducible) safety research
  • Giving API credits
  • Mentoring

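For concreteness on what "logprobs" access looks like in practice, here is a minimal sketch (mine, not from the post) using the OpenAI Python SDK; the model name, prompt, and the 5-alternative limit are illustrative:

    # Sketch of logprobs access: request per-token log probabilities and top alternatives.
    # Assumes the `openai` Python SDK (v1+) and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model that exposes logprobs
        messages=[{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
        max_tokens=1,
        logprobs=True,    # return the logprob of each sampled token
        top_logprobs=5,   # plus the 5 most likely alternatives at each position
    )

    # Each output token comes with its most likely alternatives and their logprobs.
    for token_info in response.choices[0].logprobs.content:
        for alt in token_info.top_logprobs:
            print(f"{alt.token!r}: {alt.logprob:.3f}")

Helpful-only variants, filter-off endpoints, and activations can't be obtained this way; they exist for external researchers only if a lab deliberately exposes them.
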
Here's what the labs have done (besides just publishing safety research[3]).

Anthropic:

  • Releasing resources including RLHF and red-teaming datasets, an interpretability notebook, and model organisms prompts and transcripts
  • Supporting creation of safety-relevant evals and tools for evals
  • Giving free API access to some OP grantees and giving some researchers $1K (or sometimes more) in API credits
  • (Giving deep model access to Ryan Greenblatt)
  • (External mentoring, in particular via MATS)
  • [No fine-tuning or deep access, except for Ryan]

Google DeepMind:

  • Publishing their model evals for dangerous capabilities and sharing resources for reproducing some of them
  • Releasing Gemma SAEs
  • Releasing Gemma weights (see the activations sketch after this list)
  • (External mentoring, in particular via MATS)
  • [No fine-tuning or deep access to frontier models]

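Released weights enable the deepest access on the list above: reading internal activations. As an illustration (my sketch, not something DeepMind provides), here is how a researcher might capture hidden states from a Gemma checkpoint with a PyTorch forward hook; the checkpoint name and layer index are placeholders:

    # Sketch: capturing a decoder layer's hidden states from an open-weights Gemma model.
    # Assumes the `torch` and `transformers` packages and access to the weights on Hugging Face.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "google/gemma-2-2b"  # placeholder; any released Gemma checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    captured = {}

    def save_hidden_states(module, inputs, output):
        # Decoder layers return a tuple whose first element is the hidden states.
        captured["hidden"] = output[0].detach()

    # Hook a middle decoder layer (the index is illustrative).
    handle = model.model.layers[12].register_forward_hook(save_hidden_states)

    tokens = tokenizer("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        model(**tokens)
    handle.remove()

    print(captured["hidden"].shape)  # (batch, sequence_length, hidden_size)

API-only access to frontier models does not generally allow this kind of inspection, which is what the bracketed caveat above is pointing at.
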
OpenAI:[4]

  • OpenAI Evals
  • Superalignment Fast Grants
  • Maybe giving better API access to some OP grantees
  • Fine-tuning GPT-3.5 (and "GPT-4 fine-tuning is in experimental access"; OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023)
    • Update: GPT-4o fine-tuning (see the fine-tuning sketch after this list)
  • Early access: shared GPT-4 with a few safety researchers including Rachel Freedman before release
  • API gives top 5 logprobs

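For concreteness on the fine-tuning access above, here is a minimal sketch (mine, not from the post) of starting a fine-tuning job through the OpenAI Python SDK; the data file and model name are placeholders:

    # Sketch: upload chat-formatted training data, then start a fine-tuning job.
    # Assumes the `openai` Python SDK (v1+) and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    # Upload a JSONL file of chat-formatted training examples.
    training_file = client.files.create(
        file=open("finetune_examples.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Start the job on a model that supports fine-tuning (model name is a placeholder).
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-mini-2024-07-18",
    )

    print(job.id, job.status)
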
Meta AI:

  • Releasing Llama weights

Microsoft:

  • [Nothing]

xAI:

  • [Nothing]

Related papers:

  • Structured access for third-party research on frontier AI models (Bucknall and Trager 2023)
  • Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024)
    • (The paper is about audits, like for risk assessment and oversight; this post is about research)
  • A Safe Harbor for AI Evaluation and Red Teaming (Longpre et al. 2024)
  • Structured Access (Shevlane 2022)
  1. ^

    "Helpful-only" refers to the version of the model fine-tuned (via RLHF, RLAIF, or similar) for helpfulness but not harmlessness.

  2. ^

    Releasing model weights will likely be dangerous once models are more powerful, but all past releases seem fine. Still, Meta's poor risk assessment, and its lack of a plan to make release decisions conditional on risk assessment, are concerning.

  3. ^

    And an unspecified amount of funding for Frontier Model Forum grants.

  4. ^

    Jan 2025 update, not necessarily exhaustive:

    • https://openai.com/index/early-access-for-safety-testing/
    • https://openai.com/form/rft-research-program/ (not safety-specific, just deep access)