Introduction

In this short post, I would like to argue that it might be a good idea to exclude certain information – such as offensive-cybersecurity and biorisk-enabling knowledge – from the training data of frontier models. Specifically, I argue that this

  1. is feasible, both technically and socially;
  2. removes significant drivers of misalignment and misuse risk from near- and medium-term future models;
  3. is a norm worth setting now;
  4. is a good test case for regulation.

After arguing for these points, I conclude with a call to action.

Remarks

To be clear, I do not argue that this

  1. is directly relevant to the alignment problem;
  2. eliminates all risks from near-to-medium future models;
  3. (significantly) reduces risks from superintelligence.

As I am far more knowledgeable about cybersecurity than about, say, biorisks, whenever discussing specifics I will only give examples from cybersecurity. Nevertheless, I think the arguments hold as-is for other relatively narrow subfields, e.g. the knowledge I imagine is most relevant to manufacturing lethal pathogens. One may want to exclude other information that might drive risks, such as information on AI safety (broadly defined) or on energy production (nuclear energy or solar panels), among others, but this is outside the scope of this post.

I would like to thank Asher Brass, David Manheim, Edo Arad and Itay Knaan-Harpaz for useful comments on a draft of this post. They do not necessarily endorse the views expressed here, and all mistakes are mine.

Feasibility

Technical feasibility

Filtering such information from textual datasets seems fairly straightforward: it should be easy to develop a classifier (e.g., one fine-tuned from a small language model) that detects offensive cybersecurity-related content.
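
As a rough illustration, here is a minimal sketch of how such a classifier might be trained with the Hugging Face libraries, assuming one already has a labeled set of offensive/benign documents. The base checkpoint, file name and hyperparameters are placeholders rather than recommendations.

```python
# Hypothetical sketch: fine-tune a small language model as a binary classifier
# for offensive-cybersecurity content. The base checkpoint, data file and
# hyperparameters below are illustrative placeholders, not recommendations.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "distilbert-base-uncased"  # any small encoder-style model would do

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

# Assumed schema: one JSON object per line, {"text": ..., "label": 0 or 1},
# where label 1 marks offensive-cybersecurity content.
data = load_dataset("json", data_files="labeled_docs.jsonl", split="train")
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)
data = data.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="offsec-filter",
        num_train_epochs=2,
        per_device_train_batch_size=16,
    ),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
print(trainer.evaluate())           # rough held-out loss check
trainer.save_model("offsec-filter")
```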

For example, one would want to exclude:

  • specific vulnerabilities and exploits (e.g. all CVEs);
  • classes of vulnerabilities (e.g. heap overflows and null dereferences, when discussed as vulnerabilities rather than as ordinary bugs);
  • exploit mitigations (e.g. ASLR, DEP, SafeSEH, stack cookies, CFG, pointer tagging);
  • exploitation techniques (e.g. ROP, NOP slides, heap spraying);
  • cybersecurity-related tools and toolchains (e.g. shellcode, IDA, Metasploit, antivirus capabilities, fuzzers).

More debatable candidates for exclusion are the code of particular attack surfaces (e.g. the Linux TCP/IP stack) and technical details of real-world cybersecurity incidents. At any rate, all of these seem easy to detect; a sketch of how such a filter might be applied follows.
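
For concreteness, below is a hypothetical sketch of how such a filter could be run over a training corpus, combining a cheap keyword/regex screen (covering a few of the indicators above) with the small classifier from the previous sketch. The pattern list, checkpoint name, label and threshold are all illustrative assumptions; a real pipeline would use far broader curated lists and run at corpus scale.

```python
# Hypothetical two-stage dataset filter: a cheap keyword/regex screen followed
# by the small fine-tuned classifier sketched above. Patterns, checkpoint name
# and threshold are illustrative placeholders.
import re
from typing import Iterable, Iterator

from transformers import pipeline

# A handful of indicators from the categories listed above (CVE identifiers,
# exploitation techniques, offensive tooling). A real filter would use a much
# broader, curated pattern list.
OFFSEC_RE = re.compile(
    r"\bCVE-\d{4}-\d{4,}\b"
    r"|\b(?:ROP|heap spray(?:ing)?|NOP slide|shellcode)\b"
    r"|\b(?:metasploit|mimikatz)\b",
    re.IGNORECASE,
)

# Loads the placeholder classifier saved by the previous sketch.
classifier = pipeline("text-classification", model="offsec-filter")


def looks_offensive(doc: str, threshold: float = 0.8) -> bool:
    """Return True if the document should be dropped from the training set."""
    if OFFSEC_RE.search(doc) is None:
        return False                      # cheap screen: clearly unrelated
    result = classifier(doc[:2000], truncation=True)[0]
    # "LABEL_1" is the positive (offensive) class of the placeholder classifier.
    return result["label"] == "LABEL_1" and result["score"] >= threshold


def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that pass both screens."""
    for doc in docs:
        if not looks_offensive(doc):
            yield doc


if __name__ == "__main__":
    sample = [
        "A recipe for sourdough bread.",
        "Chaining ROP gadgets to bypass DEP after triggering CVE-2021-1234...",
    ]
    # Expected: only the first document survives (assuming the classifier
    # flags the second).
    print(list(filter_corpus(sample)))
```

Gating the classifier behind the keyword screen keeps the pipeline cheap at corpus scale, at the cost of missing documents that avoid obvious keywords; as argued next, some leakage is probably acceptable.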

Furthermore, as models' sample efficiency is currently very low, even a filter with a moderately low false-negative rate (i.e. one that lets some offensive documents slip through) would likely suffice to significantly decrease such capabilities.

Social feasibility

Most (legitimate) use cases don't employ such capabilities. Moreover, this kind of information is fairly narrow and self-contained, so excluding it from the dataset will likely not result in a meaningfully less capable model in other respects. Therefore, it seems likely that most actors – including AI labs and the open source community – won't have a strong incentive to include such information.

Moreover, actors might have relatively strong incentives to take such measures, whether out of concern about AI risks, a desire to avoid being sued over (small-scale) misuse or accidents, or public-reputation considerations.

It is true that some actors (such as pentesters, scientists, militaries, etc.) might be interested in such capabilities – for both legitimate and illegitimate uses. In such cases, they can train narrow models instead. I believe that this still reduces misuse and misalignment risks, as I explain in the next section.

Risk reduction

Misalignment risks

Many misalignment risks are driven by such capabilities (see for example [1][2][3][4][5][6]); removing this knowledge from models therefore reduces the likelihood of successful misalignment incidents.

To employ such capabilities anyway, models would either have to be sufficiently agentic, with strong enough in-context or online learning, to acquire this information themselves (e.g. from the internet), or be strong enough to reinvent these techniques on their own (without even knowing which mitigations humans have implemented). Both of these seem further in the future than the point at which models would otherwise carry misalignment risks due to other factors. Thus, this could buy significant time for AI safety work (including work assisted by powerful, but not extremely powerful, AI models).

As mentioned above, some actors will still be interested in such capabilities. Nevertheless, in those cases they might be content with narrow(er) models, which therefore entail significantly smaller misalignment risks.

Misuse risks

Many misuse risks are driven by the very same capabilities (see for example [3][4][5][6][7]). Of course, these measures won't eliminate such risks, but they would significantly raise the bar for executing them: a malicious actor would have to either train an advanced model on their own, or gain access to such a model's weights and further fine-tune them, both of which require significant know-how, money and time.

Setting a norm

With the recent surge in public interest in AI risks, this seems like a very good time for such measures. Given the risks and the relative ease of implementation, it seems likely that some safety-minded actors could adopt these measures voluntarily in the near future. As they are simple enough and cost relatively little, even less safety-minded actors might be willing to adopt them soon after, as the practice becomes more widely accepted and as tools and standard methods make it easy to implement.

Regulation test case

The same considerations also seem to make this a relatively easy target for training-data regulation. It can thus serve as a test case for AI governance actors, policymakers, and others to start with, paving the way for easier regulation processes in the future.

Call to action

Here are a few calls to action:

  • AI labs can adopt these ideas and implement them in their future models.
  • AI safety researchers and engineers can develop a standardized tool for filtering such information, to be adopted by actors training models.
  • AI governance actors can develop these ideas further and push for corresponding regulation.
  • Others can give feedback, point out shortcomings, and suggest other improvements.

I am happy to assist with these (especially where my background in cybersecurity can help), and am available at shaybm9@gmail.com.

Comments

I appreciate the concreteness of your proposal.

Excluding cybersecurity information means that the model will write insecure code. To the extent that the model is writing substantial amounts of internet-facing code that is not subsequently reviewed by a security-conscious person, this will result in the deployment of insecure code. In internet-facing contexts, deploying code with certain types of vulnerabilities (e.g. RCE) results in handing free computing resources to botnets.

In the best-case scenario, the model will know what it does not know, and will inform users that it cannot write secure code and that, if they need secure code, they should use tools which can support that (such as already-existing LLMs that do have information about writing secure code in their training data) – and users will actually heed the warnings rather than going "it'll probably be fine, the warning doesn't apply to me because I'm in a hurry". I don't expect that we'd get the best-case scenario.

An even more extreme approach, of not training the model on code at all, would prevent the particular model in question from having dangerous programming-related capabilities, and also wouldn't have issues where the model appears to solve the user's problem, but in a way that causes issues for that user and has negative externalities later on down the road.

I expect that there's a similar thing going on with biological stuff and lab safety guidelines.

Thanks for the feedback! Upvoted, but disagreed.

I agree that not knowing anything at all about cybersecurity might cause the model to write less secure code (though it is not obvious that the inclusion of unsafe code examples doesn't in fact lead to more unsafe code being emitted, but let's put that aside).

However, writing safe code requires quite different knowledge from offensive cybersecurity. For writing safe code, it is relevant to know about common vulnerabilities (which are often just normal bugs) and how to avoid them – information which I agree should probably be kept in the dataset (at least for code completion models, which are not necessarily all models). Most other examples I gave are irrelevant. For instance, exploit mitigations (such as ASLR, CFG, and the rest that I listed in the post) are completely transparent to developers and are implemented by the compiler and operating system, and exploitation techniques (such as ROP, ...) are completely irrelevant to developers. For another example, knowing about the specific vulnerabilities found in the past few years is irrelevant to writing safe code, but does open the door to one-day exploitation (one might argue that due to sample efficiency, models do need that, but I think the effect would be insignificant; I can elaborate if anyone is interested).

I don't know enough about biorisks to comment on the situation there. I would be surprised if the techniques that are particularly relevant for developing deadly pathogens were relevant to a non-negligible fraction of biology research. Of course, there would be some overlap (just as, for cybersecurity, you have to be able to code at all), but I'd argue that a big fraction doesn't overlap significantly.

For future reference, there are benchmarks for safe code that could be used to assess this issue, such as Meta's Purple Llama CyberSecEval.

(Note: this paper contains two different tests. First, a benchmark for writing safe code, which I didn't check and can't vouch for, but which seems like a useful entry point. Second, a test of model alignment – whether models refuse to cooperate with requests for cyberattack tools – which I don't think is too relevant to the OP.)

+1, you convinced me.

I worry this will distract from risks like "making an AI that is smart enough to learn how to hack computers from scratch", but I don't buy the general "don't distract with true things" argument.

I sympathize with the worry, and agree that this should be emphasized when writing about this topic. This is also the reason that I opened my post by clearly remarking on the issues this is not relevant to. I would urge others to do so in the future as well.