Earlier this year, Vaniver recommended Bug Bounties for Advanced ML Systems. I spent a while at Rethink Priorities considering and expanding on this idea, suggesting potential program models, and assessing the benefits and risks of programs like this, which I've called 'AI Safety Bounties':
AI safety bounties are programs where public participants or approved security researchers receive rewards for identifying issues within powerful ML systems (analogous to bug bounties in cybersecurity). Safety bounties could be valuable for legitimizing examples of AI risks, bringing more talent to stress-test systems, and identifying common attack vectors.
I expect safety bounties to be worth trialing for organizations working on reducing catastrophic AI risks. Traditional bug bounties seem fairly successful: they attract roughly one participant per $50 of prize money, and have become increasingly popular with software firms over time. The most analogous program for AI systems led to relatively few useful examples compared to other stress-testing methods, but one knowledgeable interviewee suggested that future programs could be significantly improved.
However, I am not confident that bounties will continue to be net-positive as AI capabilities advance. At some point, I think the accident risk and harmful knowledge proliferation from open sourcing stress-testing may outweigh the benefits of bounties.
In my view, the most promising structure for such a program is a third party defining dangerous capability thresholds (“evals”) and providing rewards for hunters who expose behaviors which cross these thresholds. I expect trialing such a program to cost up to $500k if well-resourced, and to take four months of operational and researcher time from safety-focused people.
I also suggest two formats for lab-run bounties: open contests with subjective prize criteria decided on by a panel of judges, and private invitations for trusted bug hunters to test their internal systems.
Author's note: This report was written between January and June 2023. Since then, safety bounties have become a more well-established part of the AI ecosystem, which I'm excited to see. Beyond defining and proposing safety bounties as a general intervention, I hope this report can provide useful analyses and design suggestions for readers already interested in implementing safety bounties, or in better understanding these programs.
One potential intervention for reducing catastrophic AI risk is AI safety bounties: programs where members of the public or approved security researchers receive rewards for identifying issues within powerful ML systems (analogous to bug bounties in cybersecurity). In this research report, I explore the benefits and downsides of safety bounties and conclude that safety bounties are probably worth the time and money to trial for organizations working on reducing the catastrophic risks of AI. In particular, testing a handful of new bounty programs could cost $50k-$500k per program and require one to six months of full-time-equivalent work from project managers at AI labs or from entrepreneurs interested in AI safety (depending on each program’s model and ambition level).
I expect safety bounties to be less successful for the field of AI safety than bug bounties are for cybersecurity, due to the higher difficulty of quickly fixing issues with AI systems. I am unsure whether bounties remain net-positive as AI capabilities increase to more dangerous levels. This is because, as AI capabilities increase, I expect safety bounties (and adversarial testing in general) to potentially generate more harmful behaviors. I also expect the benefits of the talent pipeline brought by safety bounties to diminish. I suggest an informal way to monitor the risks of safety bounties annually.
The views in this report are largely formed based on information from:
See the end of the report for a complete list of references.
Based on these sources, I identify three types of bounty programs that seem practically possible now, that achieve more of the potential benefits of safety bounties and less of the potential risks than alternative programs I consider, and that would provide valuable information about how to run bounty programs if trialed. In order of my impression of their value in reducing catastrophic risks, the three types are:
I outline more specific visions for these programs just below. A more detailed analysis of these programs, including suggestions to mitigate their risks, is in the Recommendations section. This report does not necessarily constitute a recommendation for individuals to conduct the above stress-testing without an organizing body.
I expect that some other bounty program models would also reduce risks from AI successfully and that AI labs will eventually develop better bounty programs than those suggested above. Nevertheless, the above three models are, in my current opinion, the best place to start. I expect organizers of safety bounties to be best able to determine which form of bounty program is most appropriate for their context, including tweaking these suggestions.
This report generally focuses on how bounty programs would work with large language models (LLMs). However, I expect most of the bounty program models I recommend would work with other AI systems.
Benefits. AI safety bounties may yield:
Key variables. When launching bounties, organizers should pay particular attention to the prize criteria, who sets up and manages the bounty program, and the level of access granted to bounty hunters.
Risks. At current AI capability levels, I believe trialing bounty programs is unlikely to cause catastrophic AI accidents or significantly worsen AI misuse. The most significant downsides are:
Trial programs are especially low-risk since the organizers can pause them at the first sign of bounty hunters generating dangerous outcomes as AI systems advance.
The risks are higher if organizations regularly run (not just trial) bounties and as AI advances. Risks that become more important in those cases include:
For these reasons, I recommend the program organizers perform an annual review of the safety of allowing members of the public to engage in stress testing, monitoring:
Further, I recommend not running bounties at dangerous levels of AI capability if bounties seem sufficiently risky. I think it possible, but unlikely, that this level of risk will arise in the future, depending on the level of progress made in securing AI systems.
I recommend that organizations that set up safety bounties:
Outline of recommended models
Recommended models, in order of recommendation, for safety bounties.
Convince a panel of experts that the issue is worth dedicating resources toward solving.
Demonstrate examples of behaviors which the AI model’s developer attempted to avoid through their alignment techniques.
This report is a project of Rethink Priorities, a think tank dedicated to informing decisions made by high-impact organizations and funders across various cause areas. The author is Patrick Levermore. Thanks to Ashwin Acharya and Amanda El-Dakhakhni for their guidance; Onni Aarne, Michael Aird, Marie Buhl, Shaun Ee, Erich Grunewald, Oliver Guest, Joe O’Brien, Max Räuker, Emma Williamson, and Linchuan Zhang for their helpful feedback; all interviewees credited in the report for their insight; and Adam Papineau for copyediting.
If you are interested in RP’s work, please visit our research database and subscribe to our newsletter.
I would be happy to discuss setting up AI safety bounties with those in a position to do so. I can provide contacts and resources to aid this, including this workbook. Contact me at patricklevermore at gmail dot com.
Full report: AI Safety Bounties