This is (partly) a linkpost for a paper published earlier this year by Toby Shevlane and Allan Dafoe, both researchers affiliated with the Centre for the Governance of AI. Here’s the abstract:

There is growing concern over the potential misuse of artificial intelligence (AI) research. Publishing scientific research can facilitate misuse of the technology, but the research can also contribute to protections against misuse. This paper addresses the balance between these two effects. Our theoretical framework elucidates the factors governing whether the published research will be more useful for attackers or defenders, such as the possibility for adequate defensive measures, or the independent discovery of the knowledge outside of the scientific community. The balance will vary across scientific fields. However, we show that the existing conversation within AI has imported concepts and conclusions from prior debates within computer security over the disclosure of software vulnerabilities. While disclosure of software vulnerabilities often favours defence, this cannot be assumed for AI research. The AI research community should consider concepts and policies from a broad set of adjacent fields, and ultimately needs to craft policy well-suited to its particular challenges.

The paper is only 8 pages long, and I found it very readable and densely packed with useful insights and models. It also seems highly relevant to the topics of information hazards, differential progress, and (by extension) global catastrophic risks (GCRs) and existential risks. I’d very much recommend reading it if you’re interested in AI research or any of those three topics.

Avoiding mentioning GCRs and existential risks

(Here I go on a tangent with relevance beyond this paper.)

Interestingly, Shevlane and Dafoe don’t explicitly use the terms “information hazards”, “differential progress”, “global catastrophic risks”, or “existential risks” in the paper. (Although they do reference Bostrom’s paper on information hazards.)

Furthermore, in the case of GCRs and existential risks, even the concepts are not clearly hinted at. My guess is that Shevlane and Dafoe were consciously avoiding mention of existential (or global catastrophic) risks, and keeping their examples of AI risks relatively “near-term” and “mainstream”, in order to keep their paper accessible and “respectable” for a wider audience. For example, they write:

The field of AI is in the midst of a discussion about its own disclosure norms, in light of the increasing realization of AI’s potential for misuse. AI researchers and policymakers are now expressing growing concern about a range of potential misuses, including: facial recognition for targeting vulnerable populations, synthetic language and video that can be used to impersonate humans, algorithmic decision making that amplifies biases and unfairness, and drones that can be used to disrupt air-traffic or launch attacks [6]. If the underlying technology continues to become more powerful, additional avenues for harmful use will continue to emerge.

That last sentence felt to me like it was meant to be interpretable as about GCRs and existential risks, for readers who are focused on such risks, without making the paper seem “weird” or “doomsaying” to other audiences.

I think my tentative “independent impression” is that it’d be better for papers like this to include at least some, somewhat explicit mentions of GCRs and existential risks. My rationale is that this might draw more attention to such risks, lend work on those risks some of the respectability enjoyed by papers like this one and their authors, and more explicitly draw out the particular implications of this work for such risks.

But I can see the argument against that. Essentially, just as the paper and its authors could lend some respectability to work on those risks, some of the "crackpot vibes" of work on such risks might rub off on the paper and its authors. This could limit their positive influence.

And I have a quite positive impression of Dafoe, and now of Shevlane (based on this one paper). Thus, after updating on the fact that they (seemingly purposely) steered clear of mentioning GCRs or existential risks, my tentative “belief” is that this was probably a good choice, in this case. But I thought it was an interesting point worth raising, and I accidentally ended up writing more about it than planned.

(I’m aware that this sort of issue has been discussed before; this is meant more as a quick take than a literature review. Also, I should note that it’s possible that Shevlane is just genuinely not very interested in GCRs and existential risks, and that that’s the whole reason they weren’t mentioned.)

This post is related to my work with Convergence Analysis, and I’m grateful to David Kristoffersson for helpful feedback, but the views expressed here are my own.



I thought the discussion of how various fields face different tradeoffs with regard to disclosing vulnerabilities was particularly interesting:

The framework helps to explain why the disclosure of software vulnerabilities will often be beneficial for security. Patches to software are often easy to create, and can often be made in a matter of weeks. These patches fully resolve the vulnerability. The patch can be easily propagated: for downloaded software, the software is often automatically updated over the internet; for websites, the fix can take effect immediately. In addition, counterfactual possession is likely, because it is normally easier to find a software vulnerability (of which there is a constant supply) than to make a scientific discovery (see [3]). These factors combine to make a reasonable argument in favour of public disclosure of software vulnerabilities, at least after the vendor has been given time to prepare a patch.

Contrasting other fields will further bring into relief the comparatively defence-dominant character of software vulnerability knowledge. We can focus on the tractability of defensive solutions: for certain technologies, there is no low-cost, straightforward, effective defence.

First, consider biological research that provides insight into the manufacture of pathogens, such as a novel virus. A subset of viruses are very difficult to vaccinate for (there is still no vaccination for HIV) or otherwise prepare against. This lowers the defensive benefit of publication, by blocking a main causal pathway by which publication leads to greater protection. This contrasts with the case where an effective treatment can be developed within a reasonable time period, which could weigh in favour of publication [15].
Second, consider cases of hardware based vulnerabilities, such as with kinetic attacks or physical key security. Advances in drone hardware have enabled the disruption of airports and attacks on infrastructure such as oil facilities; these attacks presently lack a cheap, effective solution [18]. This arises in part from the large attack surface of physical infrastructure: the drone’s destination can be one of many possible points on the facility, and it can arrive there via a multitude of different trajectories. This means that the path of the drone cannot simply be blocked.

Moreover, in 2003 a researcher published details about a vulnerability in physical key systems [2]. Apartment buildings, offices, hotels and other large buildings often use systems where a single master-key can open all doors. The research showed how to derive the master-key from a single non-master key. The researcher wrote that there was “no simple or completely effective countermeasure that prevents exploitation of this vulnerability short of replacing a master keyed system with a non-mastered one” ([1]; see [2] for further discussion of counter-measures). The replacement of master-key systems is a costly solution insofar as master-key systems are useful, and changes are very difficult to propagate: physical key systems distributed across the world would need to be manually updated.

Finally, consider the policy question of whether one should have published nuclear engineering research, such as on uranium enrichment, in the 1960s. For countries like India and Pakistan, this would have increased, not decreased, their potential to destroy each other’s cities, due to the lack of defensive solutions: as with certain diseases, nuclear bombs cannot be adequately protected against. Moreover, for the minor protections against nuclear bombs that exist, these can be pursued without intricate knowledge as to how nuclear bombs are manufactured: there is low transferability of offensive into defensive knowledge. For example, a blueprint for the design of a centrifuge does not help one build a better defensive bunker. Overall, if both a potential defender and potential attacker are given knowledge that helps them build nuclear weapons, that knowledge is more useful for making an attack than protecting against an attack: the knowledge is offense-biased.

Differences across fields will shape the security value of publication, which can influence disclosure norms among security-minded scientists and policymakers. The Manhattan Project was more secretive than locksmiths and influenza researchers, who are in turn often more secretive than those finding vulnerabilities in software. Indeed, there was a culture clash between the researcher who published the flaw in the master-key system, above, who came from a computer security background, and the locksmiths who accused him of being irresponsible. The different disclosure cultures exist in the form of default practices, but also in common refrains - for example, language about the virtues of “studying” a problem, or the value of users being empowered by disclosure to “make decisions for themselves”. Such language embeds implicit answers to the framework given in this section, and therefore caution should be exercised when importing concepts and language from other fields.

Good topic for a paper. I wonder if the publishing of risk analysis frameworks itself winds up being somewhat counterproductive by causing less original thought applied to a novel threat and more box checking/ass covering/transfer of responsibility.

I can imagine that happening in some cases.

However, this particular framework felt to me much more like a way of organising one's thoughts and highlighting considerations to attend to, which makes things a bit less nebulous and abstract, rather than boiling things down to a list of very precise, very simple "boxes to tick". My guess is that a framework of this form is more likely to guide thought, and possibly even prompt thought (because it makes you feel you can get some traction on an otherwise extremely fuzzy problem), and less likely to lead to people just going through the motions.

Also, as a somewhat separate point, I find it plausible that even a framework that is very "tick-box-y" could still be an improvement, in cases where people have very little direct incentive to think about risks/safety by default, and the problem is very complicated. If the alternative is "almost no thought about large-scale risks" rather than "original and context-relevant thought about large-scale risks", then even just ass-covering and box-checking could be an improvement.

I also have a prior that in some cases, even very intelligent professionals in very complicated domains can benefit from checklists. This is in part based on Atul Gawande's work in the healthcare setting (which I haven't looked into closely):

It's as simple as an old-fashioned checklist, like those used by pilots, restaurateurs and construction engineers. When his research team introduced one in eight hospitals in 2008, major surgery complications dropped 36% and deaths plunged 47%.

(On the other hand, I definitely also found the "tick-box" type mindset instilled/required by the school I previously worked at infuriating and deeply counterproductive. But I think that was mostly because the boxes to be ticked were pointless; I think a better checklist, one that remained somewhat abstract and focused on principles rather than specific actions, would've been possible.)

Good points, I agree.