AI Safety Bounties

PatrickL

Earlier this year, Vaniver recommended Bug Bounties for Advanced ML Systems.

I spent a while at Rethink Priorities considering and expanding on this idea, suggesting potential program models, and assessing the benefits and risks of programs like this, which I've called 'AI Safety Bounties':

Short summary

AI safety bounties are programs where public participants or approved security researchers receive rewards for identifying issues within powerful ML systems (analogous to bug bounties in cybersecurity). Safety bounties could be valuable for legitimizing examples of AI risks, bringing more talent to stress-test systems, and identifying common attack vectors.

I expect safety bounties to be worth trialing for organizations working on reducing catastrophic AI risks. Traditional bug bounties seem fairly successful: they attract roughly one participant per $50 of prize money, and have become increasingly popular with software firms over time. The most analogous program for AI systems led to relatively few useful examples compared to other stress-testing methods, but one knowledgeable interviewee suggested that future programs could be significantly improved.
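As a rough illustration of the "one participant per $50 of prize money" figure, the sketch below turns a prize pool into an approximate participant count. The function name and the example pool sizes are hypothetical. Applying the traditional bug bounty ratio to AI safety bounties is an assumption made only for illustration, not a claim from the report.

```python
def estimated_participants(prize_pool_usd: float,
                           usd_per_participant: float = 50.0) -> int:
    """Rough participant estimate from a prize pool.

    The default of $50 per participant is the approximate ratio quoted
    for traditional bug bounties; whether it carries over to AI safety
    bounties is an untested assumption.
    """
    if prize_pool_usd < 0 or usd_per_participant <= 0:
        raise ValueError("prize pool must be >= 0 and ratio must be > 0")
    return round(prize_pool_usd / usd_per_participant)


# Hypothetical prize pools, chosen only to show the scaling.
for pool in (10_000, 100_000):
    print(f"${pool:,} pool -> about {estimated_participants(pool)} participants")
```

The estimate scales linearly with the pool, so it says nothing about the quality of submissions or about diminishing returns at larger prize sizes.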

However, I am not confident that bounties will continue to be net-positive as AI capabilities advance. At some point, I think the accident risk and harmful knowledge proliferation from open sourcing stress-testing may outweigh the benefits of bounties.

In my view, the most promising structure for such a program is a third party defining dangerous capability thresholds (“evals”) and providing rewards for hunters who expose behaviors which cross these thresholds. I expect trialing such a program to cost up to $500k if well-resourced, and to take four months of operational and researcher time from safety-focused people.

I also suggest two formats for lab-run bounties: open contests with subjective prize criteria decided on by a panel of judges, and private invitations for trusted bug hunters to test their internal systems.

Author's note: This report was written between January and June 2023. Since then, safety bounties have become a more well-established part of the AI ecosystem, which I'm excited to see. Beyond defining and proposing safety bounties as a general intervention, I hope this report can provide useful analyses and design suggestions for readers already interested in implementing safety bounties, or in better understanding these programs.

Long summary

Introduction and bounty program recommendations

One potential intervention for reducing catastrophic AI risk is AI safety bounties: programs where members of the public or approved security researchers receive rewards for identifying issues within powerful ML systems (analogous to bug bounties in cybersecurity). In this research report, I explore the benefits and downsides of safety bounties and conclude that safety bounties are probably worth the time and money to trial for organizations working on reducing the catastrophic risks of AI. In particular, testing a handful of new bounty programs could cost $50k-$500k per program and one to six months full-time equivalent from project managers at AI labs or from entrepreneurs interested in AI safety (depending on each program’s model and ambition level).

I expect safety bounties to be less successful for the field of AI safety than bug bounties are for cybersecurity, due to the higher difficulty of quickly fixing issues with AI systems. I am unsure whether bounties remain net-positive as AI capabilities increase to more dangerous levels. This is because, as AI capabilities increase, I expect safety bounties (and adversarial testing in general) to potentially generate more harmful behaviors. I also expect the benefits of the talent pipeline brought by safety bounties to diminish. I suggest an informal way to monitor the risks of safety bounties annually.

The views in this report are largely formed based on information from:

Interviews with experts in AI labs, AI existential safety, and bug bounty programs,
“Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims” by Brundage et al. arguing for “Bias and Safety Bounties” (2020, page 16),
A report from the Algorithmic Justice League analyzing the potential of bug bounties for mitigating algorithmic harms (Kenway et al., 2022),
Reflections from the ChatGPT Feedback Contest.

See the end of the report for a complete list of references.

Based on these sources, I identify three types of bounty programs that seem practically possible now, that achieve more of the potential benefits of safety bounties and less of the potential risks than alternative programs I consider, and that would provide valuable information about how to run bounty programs if trialed. In order of my impression of their value in reducing catastrophic risks, the three types are:

1. Independent organizations or governments set “evals”-based standards for undesirable model behavior, and members of the public attempt to elicit this behavior from publicly-accessible models.
2. Expert panels, organized by AI labs, subjectively judge which discoveries of model exploits to pay a bounty for, based on the lab's broad criteria.
- Potentially with an interactive grant-application process in which hunters propose issues to explore and organizers commit to awarding prizes for certain findings.
- Potentially with a convening body hosting multiple AI systems on one API, and hunters being able to test general state-of-the-art models.
3. Trusted bug hunters test private systems, organized by labs in collaboration with security vetters, with a broad range of prize criteria. Certain successful and trusted members of the bounty hunting community (either the existing community of bug bounty hunters, or a new community of AI safety bounty hunters) are granted additional information about the training process, or temporary access - through security-enhancing methods - to additional features on top of those already broadly available. These would be targeted features that benefit adversarial research, such as seeing activation patterns or being able to finetune a model (Bucknall et al., forthcoming).

I outline more specific visions for these programs just below. A more detailed analysis of these programs, including suggestions to mitigate their risks, is in the Recommendations section. This report does not necessarily constitute a recommendation for individuals to conduct the above stress-testing without an organizing body.

I expect that some other bounty program models would also reduce risks from AI successfully and that AI labs will eventually develop better bounty programs than those suggested above. Nevertheless, the above three models are, in my current opinion, the best place to start. I expect organizers of safety bounties to be best able to determine which form of bounty program is most appropriate for their context, including tweaking these suggestions.

This report generally focuses on how bounty programs would work with large language models (LLMs). However, I expect most of the bounty program models I recommend would work with other AI systems.

Why and how to run AI safety bounties

Benefits. AI safety bounties may yield:

Salient examples of AI dangers.
Identification of talented individuals for AI safety work.
A small number of novel insights into issues in existing AI systems.
A backup to auditing and other expert stress-testing of AI systems.

Key variables. When launching bounties, organizers should pay particular attention to the prize criteria, who sets up and manages the bounty program, and the level of access granted to bounty hunters.

Risks. At current AI capability levels, I believe trialing bounty programs is unlikely to cause catastrophic AI accidents or significantly worsen AI misuse. The most significant downsides are:

Opportunity cost for the organizers (most likely project managers at labs, AI safety entrepreneurs, or AI auditing organizations like the Alignment Research Center).
Stifling examples of AI risks from being made public.
- Labs may require that bounty submissions be kept private. In that case, a bounty program would incentivize hunters, who would in any case explore AI models’ edge cases, not to publish salient examples of AI danger.

Trial programs are especially low-risk since the organizers can pause them at the first sign of bounty hunters generating dangerous outcomes as AI systems advance.

The risks are higher if organizations regularly run (not just trial) bounties and as AI advances. Risks that become more important in those cases include:

Leaking of sensitive details, such as information about training or model weights.
Extremely harmful outputs generated by testing the AI system, such as successful human-prompted phishing scams or autonomous self-replication – analogous to gain-of-function research.

For these reasons, I recommend the program organizers perform an annual review of the safety of allowing members of the public to engage in stress testing, monitoring:

Whether, and to what extent, AI progress has made safety bounties (and adversarial testing in general) more dangerous,
How much access it is therefore safe to give to bounty hunters.

Further, I recommend not running bounties at dangerous levels of AI capability if bounties seem sufficiently risky. I think it possible, but unlikely, that this level of risk will arise in the future, depending on the level of progress made in securing AI systems.

Other recommended practices for bounty organizers

I recommend that organizations that set up safety bounties:

Build incentives to take part in bounties, including non-financial incentives. This should involve building infrastructure, such as leaderboards and feedback loops, and fostering a community around bounties. Building this wider infrastructure is most valuable if organizers consider safety bounties to be worth running on an ongoing basis.
Have a pre-announced disclosure policy for submissions.
Share lessons learned about AI risks and AI safety bounty programs with leading AI developers.
Consider PR risks from running safety bounties, and decide on framings to avoid misinterpretation.
Independently assess legal risks of organizing a contest around another developer’s AI system, if planning to organize a bounty independently.

Outline of recommended models

Recommended models, in order of recommendation, for safety bounties.¹

	1. Evals-based	2. Subjectively judged, organized by labs	3. Trusted bug hunters test private systems
Target systems	A wide range of AI systems – preferably with the system developers’ consent and buy-in	Testing of a particular AI model – with its developer’s consent and engagement	Testing of a particular AI model – preferably with its developer’s consent and buy-in
Prize criteria	Demonstrate (potentially dangerous) capabilities beyond those revealed by testers already partnering with labs, such as ARC Evals	Convince a panel of experts that the issue is worth dedicating resources toward solving, or demonstrate examples of behaviors which the AI model’s developer attempted to avoid through their alignment techniques.	A broad range of criteria is possible (including those in the previous two models).
Disclosure model – how private are submissions?	Coordinated disclosure (Organizers default to publishing all submissions which are deemed safe)	Coordinated disclosure	Coordinated- or non-disclosure
Participation model	Public	Public	Invite only
Access level	Public APIs	Public APIs	Invited participants have access to additional resources – e.g., additional non-public information or tools within a private version of the API
Who manages the program	Evals organization (e.g., ARC Evals), a new org., or an existing platform (e.g., HackerOne).	AI organization, or a collaboration with an existing bounty platform (e.g., HackerOne).	AI organization, or a collaboration with an existing bounty platform (e.g., HackerOne).
Program duration	Ongoing	Ongoing	Time-limited
Prize scope (how broad are the metrics for winning prizes)	Targeted	Expansive	Medium
Financial reward per prize	High (up to $1m)	Low (up to $10k)	Medium (up to $100k)
Pre- or post- deployment	Post-deployment	Post-deployment	Potentially pre-deployment
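For readers who want to compare the three models programmatically, the table above can be sketched as a small data structure. This is only one possible encoding: the class name, field names, and helper functions are hypothetical, and only a subset of the table's rows is included, with the access level for the third model shortened to a brief label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BountyModel:
    """One column of the recommended-models table (subset of rows)."""
    name: str
    participation: str      # "Public" or "Invite only"
    access_level: str
    disclosure: str
    duration: str
    prize_scope: str
    max_reward_usd: int     # the "up to" figure in the table
    deployment_stage: str


# Listed in the report's order of recommendation.
MODELS = [
    BountyModel("Evals-based", "Public", "Public APIs",
                "Coordinated disclosure", "Ongoing", "Targeted",
                1_000_000, "Post-deployment"),
    BountyModel("Subjectively judged, organized by labs", "Public",
                "Public APIs", "Coordinated disclosure", "Ongoing",
                "Expansive", 10_000, "Post-deployment"),
    BountyModel("Trusted bug hunters test private systems", "Invite only",
                "Additional non-public resources",
                "Coordinated- or non-disclosure", "Time-limited",
                "Medium", 100_000, "Potentially pre-deployment"),
]


def public_models(models):
    """Names of models that anyone can take part in."""
    return [m.name for m in models if m.participation == "Public"]


def by_max_reward(models):
    """Models sorted from the highest to the lowest maximum reward."""
    return sorted(models, key=lambda m: m.max_reward_usd, reverse=True)


print(public_models(MODELS))
print([(m.name, m.max_reward_usd) for m in by_max_reward(MODELS)])
```

An encoding like this makes the trade-off in the table easy to see: the two public models sit at opposite ends of the reward range, while the invite-only model is the only one that may run before deployment.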

Acknowledgements

This report is a project of Rethink Priorities, a think tank dedicated to informing decisions made by high-impact organizations and funders across various cause areas. The author is Patrick Levermore. Thanks to Ashwin Acharya and Amanda El-Dakhakhni for their guidance; Onni Aarne, Michael Aird, Marie Buhl, Shaun Ee, Erich Grunewald, Oliver Guest, Joe O’Brien, Max Räuker, Emma Williamson, and Linchuan Zhang for their helpful feedback; all interviewees credited in the report for their insight; and Adam Papineau for copyediting.

If you are interested in RP’s work, please visit our research database and subscribe to our newsletter.

I would be happy to discuss setting up AI safety bounties with those in a position to do so. I can provide contacts and resources to aid this, including this workbook. Contact me at patricklevermore at gmail dot com.

Full report: AI Safety Bounties
