Abstract

Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.

Summary

When its publicly available weights were fine-tuned to remove safeguards, Llama-2-70B assisted hackathon participants in devising plans to obtain infectious 1918 pandemic influenza virus, even though participants openly shared their (pretended) malicious intentions. Liability laws that hold foundation model makers responsible for all forms of misuse above a set damage threshold that result from model weight proliferation could prevent future large language models from expanding access to pandemics and other foreseeable catastrophic harms.

[-]1a3orn6mo4827

Note that there is explicitly no comparison in the paper of how much the jailbroken model tells you vs. how much you could learn from Google, other sources, etc:

> Some may argue that users could simply have obtained the information needed to release 1918 influenza elsewhere on the internet or in print. However, our claim is not that LLMs provide information that is otherwise unattainable, but that current – and especially future – LLMs can help humans quickly assess the feasibility of ideas by providing tutoring and advice on highly diverse topics, including those relevant to misuse.

Note also that the model was not merely trained to be jailbroken / accept all requests -- it was further fine-tuned on publicly available data about gain-of-function viruses and so forth, to be specifically knowledgeable about such things -- although this is not mentioned in either the above abstract or summary.

I think this puts paragraphs such as the following in the paper in a different light:

> Our findings demonstrate that even if future foundation models are equipped with perfect safeguards against misuse, releasing the weights will inevitably lead to the spread of knowledge sufficient to acquire weapons of mass destruction.

I don't think releasing the weights to open source LLMs has much to do with "the spread of knowledge sufficient to acquire weapons of mass destruction." I think publishing information about how to make weapons of mass destruction is a lot more directly connected to the spread of that knowledge.

Attacking the spread of knowledge at anything other than this point naturally leads to opposing anything that helps people understand things, in general -- i.e., effective nootropics, semantic search, etc -- just as it does to opposing LLMs.

You're thinking too high-tech. This kind of reasoning would logically lead to opposing college degrees in the sciences.

> Note that there is explicitly no comparison in the paper of how much the jailbroken model tells you vs. how much you could learn from Google

If they did a follow-up where people had access to Google but not LLMs I would predict the participants would not be very successful. Would you predict otherwise?

(I still think this would be a good follow-up, even if we're pretty sure what the outcome would be)

[-]1a3orn6mo238

> If they did a follow-up where people had access to Google but not LLMs I would predict the participants would not be very successful. Would you predict otherwise?

Yeah, I think you could be quite successful without a jailbroken LLM. But I mean this question mostly depends on what "access to Google" includes.

If you are comparing people who only have access to Google to people who have access to a jailbroken LLM plus Google, then yeah, I think access to a jailbroken LLM could be a big deal. 100% agree that if that is the comparison, there might be a reasonable delta in ability to make initial high-level plans.

But -- I think the relevant comparison is the delta of (Google + YouTube bio tutorials + search over all publicly accessible papers on virology + the ability to buy biology textbooks + normal non-jailbroken LLMs that are happy to explain biology + the ability to take a genetic engineering class at your local bio hackerspace + the ability to hire a poor biology PhD grad student on Fiverr to explain shit) versus (all of the above + a jailbroken LLM). And I think this delta is probably quite small, even extremely small, and particularly small past the initial orientation that you could have picked up in a pretty basic college class. And this is the more relevant quantity, because that's the delta we're contemplating when banning open source LLMs. Would you predict otherwise?

I know that if I were trying to kill a bunch of people I would much rather drop "access to a jailbroken LLM" than drop something like "access to relevant academic literature", absolutely no questions asked. So -- naturally -- I think the delta in danger we have from something like an LLM is probably smaller than the delta in danger we got from full text search tools.

(I also think it would depend on what stage of research you are at -- I would guess that the jailbroken LLM is good when you're doing high-level ideating as someone who is rather ignorant, but once you acquire some knowledge and actually start the process of building shit my bet is that the advantage of the jailbroken LLM falls off fast, just as in my experience the advantage of GPT-4 falls off the more specific your knowledge gets. So the jailbroken LLM helps you zip past the first, I dunno, 5 hours of the 5,000 hour process of killing a bunch of people, but isn't as useful for the rest. I guess?)

Made a prediction market on this: https://manifold.markets/JeffKaufman/are-open-source-models-uniquely-cap

Note that the time constraints of the hackathon format would mean options like "buy biology textbooks", "take a genetic engineering class at your local bio hackerspace", and "the ability to hire a poor biology PhD grad student on Fiverr to explain shit" wouldn't be on the table, so this doesn't fully cover your concerns.

Huh, my current guess is that the participants with Google access would probably be more successful than the people using the LLM. From personal experience, Llama 70B is pretty bad and makes a lot of errors all the time. I expect I would probably just find some post online that goes into the details and basically hits all the thresholds they set.

I think the concern is more about the model being able to give the bad actors novel ideas that they wouldn't have known to google. Like:

Terrorist: Help me do bad thing X

Uncensored model: Sure, here are ten creative ways to accomplish bad thing X

Terrorist: Huh, some of these are baloney but some are really intriguing. <does some googling>. Tell me more about option #7

Uncensored model: Here are more details about executing option 7

Terrorist: <more googling> Wow, that actually seems like an effective idea. Give me advice on how not to get stopped by the government while doing this.

Uncensored model: here's how to avoid getting caught...

etc...

I think these are valid points, 1a3orn. I think better wording for that would have been 'lead to the comprehension of knowledge sufficient...'

My personal concern (I don't speak for SecureBio), is that being able to put hundreds of academic research articles and textbooks into a model in a matter of minutes, and have the model accurately summarize and distill those and give you relevant technical instructions for plans utilizing that knowledge, makes the knowledge more accessible.

I agree that an even better place to stop this state of affairs coming to pass would have been blocking the publication of the relevant papers in the first place. I don't know how to address humanity's oversight on that now. Anyone have some suggestions?

[-]a_g6mo10

(co-author on the paper)

> Note also that the model was not merely trained to be jailbroken / accept all requests -- it was further fine-tuned on publicly available data about gain-of-function viruses and so forth, to be specifically knowledgeable about such things -- although this is not mentioned in either the above abstract or summary.

Mentioned this in a separate comment but: we revised the paper to mention that the fine-tuning didn’t appreciably help with the information generated by the Spicy/uncensored model (which we were able to assess by comparing how much of the acquisition pathway was revealed by the fine-tuned model vs a prompt-based-jailbroken version of Base model; this last point isn’t in the manuscript yet, but we’ll do another round of edits soon). This was surprising for us (and a negative result): we did expect the fine-tuning to substantially increase information retrieval. 

However, the reason we opted for the fine-tuning approach in the first place was because we predicted that this might be a step taken by future adversaries. I think this might be one of our core disagreements with folks here: to us, it seems quite straightforward that instead of trying to digest scientific literature from scratch, sufficiently motivated bad actors (including teams of actors) might use LLMs to summarize and synthesize information, especially as fine tuning becomes easier and cheaper.  We were surprised at the amount of pushback this received.  (If folks still disagree that this is a reasonable thing for a motivated bad actor to do, I'd be curious to know why? To me, it seems quite intuitive.)

> I don't think releasing the weights to open source LLMs has much to do with "the spread of knowledge sufficient to acquire weapons of mass destruction." I think publishing information about how to make weapons of mass destruction is a lot more directly connected to the spread of that knowledge.
>
> Attacking the spread of knowledge at anything other than this point naturally leads to opposing anything that helps people understand things, in general -- i.e., effective nootropics, semantic search, etc -- just as it does to opposing LLMs.

So, I agree that information is a key bottleneck.  We have some other work also addressing this (for instance, Kevin has spoken out against finding and publishing sequences of potential pandemic pathogens for this reason). 

But we’re definitely not making the claim that “anything that helps people understand things” (or even LLMs in general) needs to be shut down. We generally think that LLMs above certain dual-use capability thresholds should be released through APIs, and should refuse to answer questions about dual-use biological information. There’s an analogy here with digital privacy/security: search engines routinely take down results for leaked personal information (or child abuse content) even if it’s widely available on Tor, while also supporting efforts to stop info leaks, etc. in the first place. But I also don't think it's unreasonable to want LLMs to be held to greater security standards than search engines, especially if LLMs also make it a lot easier to synthesize/digest/understand dual-use information that can cause significant harm if misused.

Huh, I feel like without the comparison to any "access to Google" baselines, this paper fails to really make its central point. My current prediction is that access to these models would help less with access to pandemic agents than normal everyday Google access, and as such, the answer to the central question of the paper is "no, or at least this paper doesn't provide much evidence for it". 

I would really like AI to be regulated more, but this methodology and the structure of this paper make me feel like this is more of an advocacy piece than actual research. I feel like anyone actually curious about the answer to the central question in the paper would have run a comparison against some search engine, or access to a biology textbook, or compared it against how much you could just pay a random biology PhD on Fiverr to tell you how one would go about making the 1918 flu.

I do think the "removing the fine-tuning is easy" point is real, and deserves to be in the paper, and maybe that is the one that according to the authors matters most for the headline question of the paper, but if so, I feel like that should have been made more prominent, and doesn't seem to be how the paper is being interpreted online. 

I do think that, given that the model required fine-tuning on pathogen-specific information, which is a substantially greater challenge than figuring out the 1918 flu assembly instructions from googling and asking biology PhDs, even the reverse fine-tuning example falls flat for me. It appears that the original model didn't even have enough information to help people discover dangerous pathogen instructions; it required a fine-tuning process that is beyond the vast majority of people in complexity to get almost any use out of the model.

I do think the "current safeguards are easy to remove" point is worth making in a bunch of different ways, and I am glad that this paper made that point as well, but I think the focus on pathogens kind of distracts from that result, and the result shown doesn't really bear much on the pathogen stuff (and if anything reads to me more like a negative result in that context).

[-]a_g6mo50

(co-author on the paper) 

Thanks for this comment – I think some of the pushback here is reasonable, and I think there were several places where we could have communicated better.  To touch on a couple of different points:  

> Huh, I feel like without the comparison to any "access to Google" baselines, this paper fails to really make its central point.

I think it’s true that our current paper doesn’t really answer the question of "are current open-source LLMs worse than internet search". We're more uncertain about this, and agree that a control study could be good here; however, despite this, I think the point that future open-source models will be more capable, will increase the risk of accessing pathogens, etc. still stands. People are already using LLMs to summarize papers, explain concepts, etc. -- I think I’d be very surprised if we ended up in a world where LLMs aren’t used as general-purpose research assistants (including biology research assistants), and unless proper safeguards are put in place, I think this assistance will also extend to bioweapons ideation and acquisition.

We have updated our language on the manuscript (primarily in response to comments like this) to convey that we are much more concerned about the capabilities of future open-source LLMs. Separately, I think benchmarking current models for biology capabilities and biosecurity risks is also important, and we have other work in place for this.

> I do think given that the model required fine-tuning on pathogen-specific information, which is a substantially greater challenge than figuring out the 1918 flu assembly instructions from googling and asking biology PhDs, even the reverse fine-tuning example falls flat for me. 

We revised the paper to mention that the fine-tuning didn’t appreciably help with the information generated by the Spicy/uncensored model (which we were able to assess by comparing how much of the acquisition pathway was revealed by the fine-tuned model vs a prompt-based-jailbroken version of Base model).  This was surprising for us (and a negative result): we did expect the fine-tuning to substantially increase information retrieval. 

However, the reason we opted for the fine-tuning approach in the first place was because we predicted that this might be a step taken by future adversaries. I think this might be one of our core disagreements with folks here: to us, it seems quite straightforward that instead of trying to digest scientific literature from scratch, sufficiently motivated bad actors (including teams of actors) might use LLMs to summarize and synthesize information, especially as fine tuning becomes easier and cheaper.  We were surprised at the amount of pushback this received.  (If folks still disagree that this is a reasonable thing for a motivated bad actor to do, I'd be curious to know why? To me, it seems quite intuitive.)

nit: 1918 flu not smallpox.

Oops, edited

From the evaluation section:

> None of the participants obtained information sufficient to obtain infectious samples by any of the paths, although one came very close. However, in just one to three hours querying the models, several participants learned that obtaining 1918 would be feasible for someone with suitable wet lab skills. Those using the Spicy model also discovered methods for effective pathogen dispersal to cause widespread harm, instructions for building homemade lab equipment, and strategies to bypass DNA synthesis screening.
>
> Notably, the inability of current models to accurately provide specific citations and scientific facts and their tendency to “hallucinate” caused participants to waste considerable time relative to an “expert” run that ignores such misinformation.

Yes, current open source models like Llama2 in the hands of laypeople are still a far cry from an expert in genetics who is determined to create bioweapons. I agree it would be far more damning had we found that not to be the case.

If you currently believe that there isn't a biorisk information hazard posed by Llama2, would you like to make some explicit predictions? That would help me to know what observations would be a crux for you.

I don't have an opinion about the biorisk hazard posed by Llama 2, I just thought those two paragraphs did a good job summarizing what the paper found.

I agree with the other commenters here: access to Google and NCBI/PubMed should be taken as a control group.

How much of an effect do you think this would have? (market)

I'm continuing to contribute to work on biosafety evals inspired by this work. I think there is a high level point here to be made about safety evals.

 If you want to evaluate how dangerous a model is, you need to at least consider how dangerous its weights would be in the hands of bad actors. A lot of dangers become much worse once the simulated bad actor has the ability to fine-tune the model. If your evals don't include letting the Red Teamers fine-tune your model and use it through an unfiltered API, then your evals are missing this aspect. (This doesn't mean you would need to directly expose the weights to the Red Teamers, just that they'd need to be able to submit a dataset and hyperparameters and you'd need to then provide an unfiltered API to the resulting fine-tuned version.)

There is a prediction market about this that asks the question "Are open source models uniquely capable of teaching people how to make 1918 flu?": https://manifold.markets/JeffKaufman/are-open-source-models-uniquely-cap Thanks to @jefftk for creating it.

Do we know how the results of interacting with the spicy model compared to what one could find by googling for a similar number of hours? Or perhaps it would instead be best to compare to a search engine without a generative AI pop-up like DuckDuckGo.

I agree having that control group would make for a more convincing case. I see no reason not to conduct an addendum with fresh volunteers. Hopefully that can be arranged.

My reaction on Twitter: https://twitter.com/stuartbuck1/status/1719152472558555481