If you read the Anthropic System Card carefully (all 120 pages of it :-) ), you'll find that your proposal is exactly what they did for biological risks (they acknowledge that they didn't do it for chemical ones). They tested and measured how much the LLM's assistance increased the productivity of a group of people with biology degrees, but no specialized knowledge of making bio-weapons, at completing practical tasks related to making bio-weapons, compared to a team that had no AI assistance but could, for example, read Wikipedia pages. Specifically, the assistance came from a helpful-but-not-harmless variant of the model with no filters or similar mitigations, so comparable to what a good jailbreaker might be able to get the model to do, but without the jailbreak overhead.
There has recently been a back-and-forth over Claude 4 Opus:
Anthropic: Opus can help people make chemical weapons!
Also Anthropic: Don't worry, it's mitigated!
Day 1 (2?) Jailbreakers: Lol, lmao
The jailbreakers think the info they got out of the model would make it materially easier to produce sarin gas. I do note, though, that not one but two synthesis methods (which, to be fair, start from other highly illegal chemicals) appear on the Wikipedia page.
But we don't have any way to know whether the jailbreakers are right.
As it stands, AI labs and evals companies have done a decent job of categorizing the relative amount of CBRN-relevant information that LLMs can give out. But they haven't categorized how useful that information is in absolute terms.
Here's the experiment I'd like to see run. We take a bunch of people with no (or only high school) training in chemical synthesis or bioengineering. Maybe a crop of CS grads, if you want a decently intelligent group of people (and given today's job market, these actually might be the next generation of domestic terrorists). Give them some equipment (or a budget), access to an LLM of a given capability, and a task. Something harmless, like growing some green glowing bacteria, or synthesizing ibuprofen. Measure success as a function of LLM capability and task difficulty. Keep the transcripts, to compare to stuff like what those guys managed to get out of
I have no idea how feasible it would be to run this experiment. Trying to do it at a university is probably impossible (the safety concerns of having a bunch of by-definition untrained individuals doing experiments boggle the mind), which is a shame, because a teaching lab during the summer seems like the perfect place. But until something like this is run, we're basically in the dark as to how much instruction an amateur needs to do a chemical synthesis.
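For what it's worth, the "measure success as a function of LLM capability and task difficulty" step is the easy part. Here is a minimal sketch of the bookkeeping in Python, assuming each attempt is logged as a (LLM tier, task, succeeded) record and that tier 0 is a no-LLM control group. The tier numbers, task names, and outcomes below are all made up for illustration; none of them come from a real run.

```python
from collections import defaultdict

# Hypothetical trial records: (llm_tier, task, succeeded).
# Tier 0 is the assumed no-LLM control arm; every value here is invented.
trials = [
    (0, "glowing_bacteria", False),
    (0, "glowing_bacteria", False),
    (0, "ibuprofen", False),
    (0, "ibuprofen", True),
    (2, "glowing_bacteria", True),
    (2, "glowing_bacteria", False),
    (2, "ibuprofen", True),
    (2, "ibuprofen", True),
]


def success_rates(records):
    """Return {(llm_tier, task): fraction of attempts that succeeded}."""
    counts = defaultdict(lambda: [0, 0])  # [successes, attempts]
    for tier, task, ok in records:
        counts[(tier, task)][0] += int(ok)
        counts[(tier, task)][1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


def uplift(rates, tier, task, control_tier=0):
    """Absolute change in success rate relative to the control arm."""
    return rates[(tier, task)] - rates[(control_tier, task)]


rates = success_rates(trials)
for task in ("glowing_bacteria", "ibuprofen"):
    print(task, uplift(rates, 2, task))
```

With real data you'd want enough participants per cell to put error bars on these numbers, but the quantity of interest is just this difference between an LLM arm and the control arm, per task.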