The premise of AI risk is that AI is a danger, and therefore research into AI might be dangerous. In the AI alignment community, we're trying to do research which makes AI safer, but occasionally we might come up with results that have significant implications for AI capability as well. Therefore, it seems prudent to come up with a set of guidelines that address:
- Which results should be published?
- What to do with results that shouldn't be published?
These are thorny questions that it seems unreasonable to expect every researcher to solve for themselves. The inputs to these questions involve not only technical knowledge about AI, but also knowledge about how progress behaves, to the extent we can derive such knowledge from the historical record or by other methods. AI risk organizations might already have internal policies on these issues, but they don't share them, and they don't discuss or coordinate them with each other (as far as I know; some may do so in private channels). Moreover, coordination might be important even if each actor is doing something reasonable when regarded in isolation, since individually reasonable policies can still add up to a bad Nash equilibrium. We need to have a public debate on the topic inside the community, so that we arrive at some consensus (which might be updated over time) or, failing consensus, at least a reasonable spectrum of possible policies.
Some considerations that such a policy should take into account:
- Some results might have implications that shorten the AI timelines, but are still good to publish since the distribution of outcomes is improved.
- Usually we shouldn't even start working on something which is in the should-not-be-published category, but sometimes the implications only become clear later, and sometimes dangerous knowledge might still be net positive as long as it's contained.
- In the midgame, it is unlikely for any given group to make it all the way to safe AGI by itself. Therefore, safe AGI is a broad collective effort and we should expect most results to be published. In the endgame, it might become likely for a given group to make it all the way to safe AGI. In this case, incentives for secrecy become stronger.
- The policy should not fail to address extreme situations that we only expect to arise rarely, because those situations might have especially major consequences.
Some questions that such a policy should answer:
- What are the criteria that determine whether a certain result should be published?
- What are good channels to ask for advice on such a decision?
- How to decide what to do with a potentially dangerous result? Circulate in a narrow circle? If so, which? Conduct experiments in secret? What kind of experiments?
The last point is also related to a topic with independent significance, namely, what are reasonable precautions for testing new AI algorithms? This has both technical aspects (e.g. testing on particular types of datasets or particular types of environments, throttling computing power) and procedural aspects (who should be called on to advise or decide on the matter). I expect to have several tiers of precautions, such that a tier can be selected according to our estimate of the new algorithm's potential, along with guidelines for producing such an estimate.
I emphasize that I don't presume to have good answers to these questions. My goal here is not to supply answers, but to foster debate.