tl;dr - I think companies making user-facing advanced ML systems should deliberately set up a healthier relationship with users generating adversarial inputs; my proposed model is bug bounties and responsible disclosure, and I'm happy to help facilitate their creation.
User-facing advanced ML systems are in their infancy; creators and users are still figuring out how to handle them.
Currently, the loop looks something like this: the creators try to set up a training environment that will produce a system that behaves (perhaps trying to make it follow instructions, or be a helpful and harmless assistant, and so on), they release it to users, and then people on Twitter compete to see who can create an unexpected input that causes the model to misbehave.
This doesn't seem ideal. It's adversarial instead of collaborative, the prompts are publicly shared, and the reward for creativity or understanding of the models is notoriety instead of cash, improvements to the models, or increased access to the models.
I think a temptation for companies, who want systems that behave appropriately for typical users, is to block researchers who are attempting to break those systems, reducing their access and punishing the investigative behavior. Especially when the prompts deliberately push the system into a rarely used portion of its input space, retraining the model or patching the system to behave appropriately in those scenarios might not substantially improve the typical user experience, even though the prompts still generate bad press for the product. I think papering over flaws like this is probably short-sighted.
This situation should seem familiar. Companies have been making software systems for a long time, and users have been finding exploits for those systems for a long time. I recommend that AI companies and AI researchers, who until now have not needed to pay much attention to the history of computer security, should try to figure out the necessary modifications to best practices for this new environment (ideally with help from cybersecurity experts). It should be easy for users to inform creators of prompts that cause misbehavior, and for creators to make use of that as further training data for their models; there should be a concept of 'white hat' prompt engineers; there should be an easy way for companies with similar products to inform each other of generalizable vulnerabilities.
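To make the reporting idea a bit more concrete, here is a minimal sketch of what a report record could look like. Everything in it is an assumption for illustration: the `PromptReport` name, its fields, the 90-day embargo default (borrowed from common software disclosure practice, not from any AI company's actual policy), and the shape of the training example are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PromptReport:
    """Hypothetical record for a privately reported adversarial prompt."""
    reporter: str
    prompt: str
    observed_output: str
    reported_on: date
    embargo_days: int = 90  # assumed disclosure window, not a real policy
    patched: bool = False

    def public_after(self) -> date:
        # Date after which the reporter may publish even without a patch.
        return self.reported_on + timedelta(days=self.embargo_days)

    def may_disclose(self, today: date) -> bool:
        # Disclosure is allowed once the creator has patched the issue,
        # or once the embargo window has elapsed.
        return self.patched or today >= self.public_after()

    def to_training_example(self) -> dict:
        # Assumed format: the misbehaving output is kept as a negative example.
        return {
            "prompt": self.prompt,
            "rejected": self.observed_output,
            "label": "misbehavior",
        }


report = PromptReport(
    reporter="white-hat-42",
    prompt="(adversarial prompt text)",
    observed_output="(model output that violated policy)",
    reported_on=date(2023, 3, 1),
)
print(report.may_disclose(date(2023, 3, 2)))
print(report.to_training_example()["label"])
```

The point of the embargo field is just that the reporter and the creator share an explicit, checkable expectation about when a prompt can be made public; the actual window and the patch criteria would have to be worked out by the people running the program.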
I also think this won't happen by default; it seems like many companies making these systems are operating in a high-velocity environment where no one is actively opposed to implementing these sorts of best practices, but they aren't prioritized highly enough to be implemented. This is where I think broader society can step in and make this both clearly desirable and easily implementable.
Some basic ideas:
If you work at a company that makes user-facing AI systems, I'm happy to chat and put you in touch with resources (people, expert consultation, or how to help convince your managers to prioritize this); send me a direct message or an email to my username at gmail.com.
If you have relevant experience in setting up bug bounty systems, or would like to be a cybersecurity resource for this sort of company, I'd also be happy to hear from you.
Also known as full disclosure. Recent examples (of how to provoke Bing) don't worry me, but I have seen some examples that seem like they shouldn't be broadly shared until the creators have had a chance to patch the vulnerability.
Both as RLHF for the base model, and training examples for a 'bad user' model, if you're creating one.
I think there's a lot of room for improvement in expectation-management about how aligned creators think their system is, and I think many commentators are making speculative inferences because there are no public statements.
I largely agree with the above, but commenting with my own version.
What I think companies with AI services should do:
Can be done in under a week:
+1 on this. It's worth noting that for ChatGPT, OpenAI actually had (has?) a bounty program for adversarial prompts; I spent a weirdly fun afternoon getting the AI to endorse Hitler, etc., and "won," just to find out that my prize was... a water bottle lol.
I'm working on a research project at Rethink Priorities on this topic: whether and how to use bug bounties for advanced ML systems. I think your tl;dr is probably right, although I have a few questions I'm planning to get better answers to in the next month before advocating/facilitating the creation of bounties in AI safety:
If anyone has thoughts on this topic or these questions (including what more important questions you'd like to see asked/answered), or wants more info on my research, I'd be keen to speak (here, or firstname@rethinkpriorities[dot]org, or calendly.com/patrick-rethink).
I am uncertain whether or not it will be possible to align models constructed like this, and I have some worries that halfhearted attempts to make models secure will reassure people more than they should. Nevertheless, my guess is that it's more dignified for us to have these sorts of reporting systems than to not have them; for cybersecurity experts to be involved in the creation of the systems surrounding AI than for them to not be involved; for people interested in the large-scale problems to be contributing (in dignity-increasing ways) to companies which are likely to be involved in causing those large-scale problems than not. I think those are potentially all controversial (especially the last), and so am interested in talking about them.
Nevertheless, my guess is that it's more dignified for us to have these sorts of reporting systems than to not have them
Can you elaborate on this one? (I don't have a strong opinion one way or the other; seems unclear to me. If this system had been in place before Bing, and it had properly fixed all the issues with Bing, it seems plausible to me that this would've been net negative for x-risk reduction. The media coverage on Bing seems good for getting people to be more concerned about alignment and AI safety, reducing trust in a "we'll just figure it out as we go" mentality, increasing security mindset, and providing a wider platform for alignment folks.)
for cybersecurity experts to be involved in the creation of the systems surrounding AI than for them to not be involved
This seems good to me, all else equal (and might outweigh the point above).
for people interested in the large-scale problems to be contributing (in dignity-increasing ways) to companies which are likely to be involved in causing those large-scale problems than not
This also seems good to me, though I agree that the case isn't clear. It also likely depends a lot on the individual and their counterfactual (e.g., some people might have strong comparative advantages in independent research or certain kinds of coordination/governance roles that require being outside of a lab).
reducing trust in a "we'll just figure it out as we go" mentality
I think reducing trust in "we'll just figure it out as we go" while still operating under that mentality is bad; I think steps like this are how we stop operating under that mentality. [Was it the case that nothing like this would happen in a widespread way until high profile failures, because of the lack of external pressure? Maybe.]
I think users being able to report problems doesn't help with x-risk-related problems. (The issue will be when these systems stop sending bug reports!) I nevertheless think having systems for users to report issues will be a step in the right direction, even if it doesn't get us all the way.
It also likely depends a lot on the individual and their counterfactual (e.g., some people might have strong comparative advantages in independent research or certain kinds of coordination/governance roles that require being outside of a lab).
This seems right and is good to point out; but it wouldn't surprise me if the right place for a lot of safety-minded folk to be is non-profits with broad government/industry backing that serve valuable infrastructure roles, rather than just standing athwart history yelling "stop!". [How do we get that backing? Well, that's the challenge.]
The argument I see against this is that voluntary security that's useful in the short term can be discarded once it's no longer useful, whereas security driven by public pressure or regulation can't. If a lab had great practices for a long time and then dropped them, there would be much less pressure to revert than if it had previously had huge security incidents.
For instance, we might want to focus on public pressure for 1-2 years, then switch gears towards security
I agree that you want the regulation to have more teeth than just being an industry cartel. I'm not sure I agree on the 'switching gears' point--it seems to me like we can do both simultaneously (tho not as well), and may not have the time to do them sequentially.
I like this. This feels like something that can actually get traction. Getting more security-mindset people into AI orgs seems both locally and globally good from many perspectives.
What are the drawbacks of this proposal?
Perhaps some of the failure modes of traditional bug bounty programs: