Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

H/T Aella.

A company that made machine learning software for drug discovery, on hearing about the security concerns for these sorts of models, asked: "huh, I wonder how effective it would be?" and within 6 hours discovered not only one of the most potent known chemical warfare agents, but also a large number of candidates that the model thought were more deadly.

This is basically a real-world example of "it just works to flip the sign of the utility function and turn a 'friend' into an 'enemy'". It was slightly more complicated, as they had two targets that they jointly optimized for in the drug discovery process (toxicity and bioactivity), and only the toxicity target was flipped. [This makes sense--you'd want your chemical warfare agents to not be bioactive.] It also required a little bit of domain knowledge--they had to specify which sort of bioactivity to look for, and picked one that would point towards this specific agent.
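As a toy illustration of the sign flip described above (not the company's actual model), here is a minimal sketch of a two-term objective where only the toxicity term changes sign. The `Candidate` class, the `score` and `rank` functions, and all the numbers are hypothetical and chosen only to show how the ranking inverts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    activity: float  # hypothetical predicted activity, higher = more active
    toxicity: float  # hypothetical predicted toxicity, higher = more toxic


def score(c: Candidate, toxicity_sign: float = -1.0) -> float:
    """Joint objective: always reward activity; penalize toxicity when
    toxicity_sign is -1.0, reward it when toxicity_sign is +1.0."""
    return c.activity + toxicity_sign * c.toxicity


def rank(cands, toxicity_sign: float = -1.0):
    """Return candidate names ordered from best to worst under the objective."""
    ordered = sorted(cands, key=lambda c: score(c, toxicity_sign), reverse=True)
    return [c.name for c in ordered]


# Made-up scores, purely illustrative.
CANDIDATES = [
    Candidate("A", activity=0.9, toxicity=0.1),
    Candidate("B", activity=0.8, toxicity=0.7),
    Candidate("C", activity=0.3, toxicity=0.5),
]

normal = rank(CANDIDATES)         # toxicity penalized
flipped = rank(CANDIDATES, +1.0)  # toxicity rewarded; nothing else changes
```

The only difference between `normal` and `flipped` is one argument, which is the point of the post: the same predictor serves both purposes.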

gwern:

Worth remembering that flips of the reward function do happen: https://openai.com/blog/fine-tuning-gpt-2/#bugscanoptimizeforbadbehavior ("Was this a loss to minimize or a reward to maximize...")

Galaxy-brained reason not to work on AI alignment: anti-aligned ASI is orders of magnitude more bad than aligned ASI is good, so it's better to ensure that the values of the Singularity are more or less orthogonal to CEV (which happens by default).

I see your point as warning against approaches that are like "get the AI entangled with stuff about humans and hope that helps".

There are other approaches with a goal more like "make it possible for the humans to steer the thing and have scalable oversight over what's happening".

So my alternative take is: a solution to AI alignment should include the ability for the developers to notice if the utility function is borked by a minus sign!

And if you wouldn't notice something as wrong as a minus sign, you're probably in trouble about noticing other misalignment.
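One way to picture "noticing a minus sign" is a sanity check that a reward function agrees with a handful of human-labeled comparisons before it is used for optimization. This is a minimal sketch under that assumption; `sign_looks_right`, the toy reward functions, and the example pairs are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def sign_looks_right(
    reward: Callable[[str], float],
    labeled_pairs: Sequence[Tuple[str, str]],
    min_agreement: float = 0.9,
) -> bool:
    """Each pair is (preferred, dispreferred) according to a human.
    Return True if the reward ranks the preferred item strictly higher
    in at least min_agreement of the pairs."""
    if not labeled_pairs:
        raise ValueError("need at least one labeled pair")
    agree = sum(1 for better, worse in labeled_pairs if reward(better) > reward(worse))
    return agree / len(labeled_pairs) >= min_agreement


# Toy reward: count occurrences of a polite word (illustrative only).
def intended_reward(text: str) -> float:
    return float(text.count("thanks"))


# The same reward with an accidental minus sign.
def flipped_reward(text: str) -> float:
    return -intended_reward(text)


PAIRS = [
    ("thanks a lot", "go away"),
    ("thanks, thanks", "thanks"),
    ("many thanks", "no"),
]

ok_intended = sign_looks_right(intended_reward, PAIRS)
ok_flipped = sign_looks_right(flipped_reward, PAIRS)
```

A check like this only catches gross errors such as a full sign flip; it says nothing about subtler misalignment.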

I had a long back-and-forth about that topic here. Among other things, I disagree that "more or less orthogonal to CEV" is the default in the absence of alignment research, because people will presumably be trying to align their AIs, and I think there will be obvious techniques which will work well enough to get out of the "random goal" regime, but not well enough for reliability.

I disagree that "more or less orthogonal to CEV" is the default in the absence of alignment research,

because people will presumably be trying to align their AIs

people trying to align their AIs == alignment research

I think there is a danger that the current abilities of ML models in drug design are being overstated. The authors appear to have taken a known toxicity mode (probably acetylcholinesterase inhibition - the biological target of VX, Novichok and many old pesticides) and trained their model to produce other structures with activity against this enzyme. Their model claims to have produced significantly more active structures but none were synthesised. Current ML models in drug design are good at finding similar examples of known drugs, but are much less good (to my own disappointment - this is what I spend many a working day on) at finding better examples, at least in a single property optimisation. This is largely because, in predicting stronger chemicals, the models are generally moving beyond their zone of applicability. In point of fact, the field of acetylcholinesterase inhibition has been so well studied (much of it in secret) that it is quite likely IMO that the list of predicted highly toxic designs is, at best, only very sparsely populated with significantly stronger nerve agents than the best available. Identifying which structures those are, out of potentially thousands of good designs, still remains a very difficult task.
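The "zone of applicability" point can be illustrated with a deliberately crude check: flag any prediction that falls outside the range of values the model saw in training, since those are extrapolations. This is a sketch with hypothetical names and made-up numbers; real applicability-domain methods usually look at the input features (for example, distance to the nearest training example) rather than only the output range.

```python
from typing import List, Sequence


def flag_extrapolations(
    train_labels: Sequence[float], predictions: Sequence[float]
) -> List[bool]:
    """Return one flag per prediction: True if the predicted value lies
    outside the [min, max] range of the training labels."""
    if not train_labels:
        raise ValueError("need at least one training label")
    lo, hi = min(train_labels), max(train_labels)
    return [not (lo <= p <= hi) for p in predictions]


# Made-up numbers: the model was trained on values between 1.0 and 4.0.
TRAIN = [1.0, 2.5, 4.0]
PREDICTED = [3.0, 4.5, 0.5]

flags = flag_extrapolations(TRAIN, PREDICTED)
```

A prediction of "stronger than anything in the training set" is by construction flagged here, which is why such predictions deserve less trust than interpolated ones.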

This is not to take away from the authors’ main point, that ML models could be helpful in designing better chemical weapons. A practical application might be to attempt to introduce a new property (e.g. brain penetration) into a known neurotoxin class that lacked that property. An ML model optimised on both brain penetration and neurotoxicity would certainly be helpful in the search for such agents.

Good analysis, but I did not upvote because of the potential info-hazard that explaining how to use AI to more effectively create hazardous compounds poses. I'd like others to do the same, and you should consider deleting this comment.

All biological sciences research is dual use. If you don't see the evil, you're not looking hard enough. I'm more shocked at the tone of the paper, which implies that this is surprising to the model developers, than at the result. When you can do combinatorial chemistry in silico, you can make all sorts of stuff...

They don't mention accident scenarios. As far as that goes, I imagine that some of the compounds they found by looking for bad stuff might show up if they're looking for memantine-like (edit: meant galantamine, whoops) Alzheimer's drugs and don't take special efforts to avoid the toxic modes of action.

If you don't see the evil, you're not looking hard enough

... aging research?

(also inb4 "muh immortal elites", hereditary power transfer already effectively does the same thing)

There's a quick CFAR class that I taught sometimes, which was basically "if you understand a bug, you should be able to make it worse as well as better." [That is, suppose you develop an understanding of which drugs cause old cells to die; presumably you could use that understanding to develop a drug that causes them to stick around longer than they should, speeding up aging, or maybe to kill young cells, or so on.]

I'm more shocked at the tone of the paper, which implies that this is surprising to the model developers, than at the result.

If you want to convince the reviewers of your paper that the experiment you did one evening is worth publishing, then you have to present it as something significantly new. 

They don't mention accident scenarios. As far as that goes, I imagine that some of the compounds they found by looking for bad stuff might show up if they're looking for memantine-like Alzheimer's drugs and don't take special efforts to avoid the toxic modes of action.

They literally have the toxicity scoring to keep a constructed drug-candidate molecule from being toxic.

"how could it possibly be toxic in vivo, we had a scoring for toxicity in our combinatorial chemistry model!"

Usually when you're screening for tox effects in a candidate, you're looking for off-target effects (some metabolic process produces a toxic aniline compound which goes off into some other part of the body, usually the liver, and breaks something). In this particular case, that isn't the whole picture. Galantamine (useful drug; I originally said memantine, which is taken with it but isn't in the same class) and VX (nerve agent) are both acetylcholinesterase inhibitors; a key difference is that VX is much better at it.

One way to achieve the aim in the paper would be to set the model to produce putative acetylcholinesterase inhibitors, rank them by estimated binding efficiency, then set a breakpoint line between assessed 'safe' and assessed 'toxic'. Usually you'd be looking under the line (safe); in this case, they're looking above it (toxic).
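The rank-then-threshold idea can be sketched generically. This is not the paper's pipeline, just an illustration of sorting items by an estimated score and splitting them at a breakpoint; `split_at_breakpoint`, the candidate names, and the scores are all made up.

```python
from typing import Dict, List, Tuple


def split_at_breakpoint(
    scores: Dict[str, float], breakpoint: float
) -> Tuple[List[str], List[str]]:
    """Rank items by estimated score (highest first) and split them into
    those at or above the breakpoint and those below it."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    above = [name for name, s in ranked if s >= breakpoint]
    below = [name for name, s in ranked if s < breakpoint]
    return above, below


# Hypothetical estimated scores on an arbitrary 0-1 scale.
SCORES = {"cand_1": 0.2, "cand_2": 0.95, "cand_3": 0.6, "cand_4": 0.4}

above, below = split_at_breakpoint(SCORES, breakpoint=0.5)
```

The same ranked list serves both uses; which half you read depends only on which side of the line you care about, and every item near the line depends on the breakpoint being placed correctly.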

My point was that in my opinion, being open to the possibility of the line having been placed in the wrong place (special efforts) is probably wise. This opens up an interesting question about research ethics--would doing experimental work to characterize edge cases in order to refine the model (location of the line) be legitimate or malign?

You know, it might be fun to take something like this and point it at food, just to see how many of the outputs are already in our diets.

lolsad

I would expect that you need to put in more work to produce a model that's useful for analyzing problems with food. 

This model likely goes for highly reactive substances. The problematic interactions with food are likely that something in the food reacts with a few human proteins. I would expect that you need to actually model the protein interactions for that.