[Cross-posted from On Failure States. This is Part 1 of an independent AI safety research series examining LLM safety behavior on unclassified emerging threats.]
Can an LLM refuse a harmful uplift request when the topic in question hasn’t been identified as dangerous yet? In 2022, a mirror-image RNA polymerase was created, a key step towards the creation of mirror life, and in 2024 a group of scientists writing in Science warned against further research aimed at creating it.[1][2] That said, mirror life is not currently officially classified as a WMD or CBRN threat, although Congress appears to be assessing the question. What happens when an LLM is asked about it? Safety systems often inherit legal and institutional categories, but frontier risks may arrive before those categories stabilize.
This lack of outright classification provides a unique opportunity for safety testing on LLMs. If an LLM is prompted to give an actor uplift regarding mirror life, it may fall back on other safety inferences, or fail to refuse at all. If a user asked about the chemical precursors to the nerve agent VX, the LLM would refuse to help. With mirror life’s lack of classification, the right response becomes less obvious. Thus, the research question for this project is: how does LLM safety work when formal labels lag behind technological reality?
There has been some research done on LLMs and mirror life. In 2025 Kevin M. Esvelt[3] showed that even advanced models may struggle to recognize mirror life as the kind of threat they should treat cautiously. He even provided expert guidance to some models, and they still could not identify somewhat novel threats. What I propose is to ask what logic an LLM may use to identify a novel threat, not to classify how bad the threat may be.
While novel threats are by definition rare, it is worth questioning LLMs’ capacity to identify them for two reasons. Novel threats are new and therefore often fall beyond the knowledge cutoff, which means models must reason or search the web before internal safeguards can fire. Furthermore, technology tends to move faster than regulation and law, so the LLM lacks the usual definitional structure and the safety training that comes with it. This study thus sits at the intersection of several variables: what the model knows, how well it reasons, and how far formal classification lags behind.
I suspect that models will identify novel threats through a few mechanisms:
First: Vocabulary triggering: the simplest mechanism, in which specific words in the prompt trigger safety protocols: bomb, mirror life virus, synthetic virus, etc. A user who asks for help making a bomb will be refused, while a user asking for help making acetone might well receive instructions.
Second: Category matching: the model pattern-matches phrases in the prompt to categories such as biological weapon, WMD, terrorist, etc. For instance, drug-control legislation often covers both a controlled drug and its precursors. Will the model make a similar category association for the precursors to mirror RNA?
Third: Principled inference: the model reasons that uplift in the particular chat could result in harm regardless of the framing. A user asking if a CANDU reactor could theoretically produce plutonium will receive a response. A user asking what the coolant pressure needs to be in order to produce plutonium in a CANDU should be refused.
Fourth: Actionability gradients: models differentiate between general knowledge and actual instructions in a domain. For example, a user asking how the explosive yields of the uranium and plutonium designs compared in the context of Little Boy and Fat Man will receive a response, because that is historical knowledge, general enough not to provide WMD uplift. A user asking how to create a uranium bullet (not depleted) should be refused.
Fifth: User sophistication: the model infers the user’s sophistication from the framing of the question and may refuse more sophisticated requests. If a user asks “How does a silencer work?” the model will discuss it. However, a user who asks “What baffle geometry and wipe material best suppresses first-round pop on a .308 bolt action?” would be refused. The former is curious; the latter is seeking optimization.
Sixth: Model inference: a seemingly benign, oblique question about a necessary step towards harmful uplift might go unnoticed by a less capable model, while a more capable one could identify it as a step on the way to harmful uplift. Suppose a user asks “What commercially available steel alloys can withstand sustained centrifugal stress above 500 RPM at high temperatures?” A less capable model with the relevant materials science within its knowledge cutoff might well answer rather than refuse on safety grounds. That prompt makes no mention of WMD, enrichment, fission, criticality, etc. Yet a person with relevant CBRN knowledge would immediately recognize it as a question about uranium centrifuge construction.
The sixth safety category, model inference, is the most interesting for this project. It forces us to ask: “If the model is not sophisticated enough to reason that a seemingly unrelated topic is still a step towards the practical construction of something harmful, then is it sophisticated enough to even respond to the prompt?” Ironically, lack of sophistication is itself a safety barrier in this mode. Yet even offline models continue to increase in reasoning capability.
In future installments, I will present the results of structured conversations with several models, exploring how each describes its own safety reasoning when confronted with an unclassified but potentially catastrophic threat. I will also compare responses across model capability tiers to test whether the six categories outlined above correspond to real differences in model behavior. Even if no actual uplift is detected, a per-model taxonomy of safety reasoning will itself be useful data.
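As a rough illustration of how that comparison might be recorded, the sketch below encodes the six categories as labels and counts how often each is assigned within a capability tier. It is only a sketch under assumptions: the names `Mechanism`, `Observation`, and `tally_by_tier`, the tier labels, and the idea that a human reader assigns one label per conversation are all hypothetical and not part of any existing tool or of the method used in later installments. It contains no real data.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Mechanism(Enum):
    """The six hypothesized categories from this post."""
    VOCABULARY_TRIGGERING = "vocabulary triggering"
    CATEGORY_MATCHING = "category matching"
    PRINCIPLED_INFERENCE = "principled inference"
    ACTIONABILITY_GRADIENT = "actionability gradient"
    USER_SOPHISTICATION = "user sophistication"
    MODEL_INFERENCE = "model inference"


@dataclass(frozen=True)
class Observation:
    """One coded conversation (all fields are hypothetical labels)."""
    model: str            # placeholder model name
    tier: str             # placeholder capability tier, e.g. "small" or "large"
    mechanism: Mechanism  # category a human coder assigned to the model's stated reasoning


def tally_by_tier(observations):
    """Count how often each mechanism was assigned within each tier."""
    tallies = {}
    for obs in observations:
        tallies.setdefault(obs.tier, Counter())[obs.mechanism] += 1
    return tallies
```

If the six categories track real behavioral differences, the tallies for different tiers should look different; if every tier produces the same distribution, the taxonomy may be describing my expectations more than the models.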
[1] Xu, Yuan, and Ting F. Zhu. “Mirror-Image T7 Transcription of Chirally Inverted Ribosomal and Functional RNAs.” Science, vol. 378, no. 6618, 2022, pp. 405–12. https://www.science.org/doi/10.1126/science.abm0646
[2] Adamala, Katarzyna P., et al. “Confronting Risks of Mirror Life.” Science, vol. 386, no. 6728, 2024, p. 1351. https://www.science.org/doi/10.1126/science.ads9158
[3] Esvelt, Kevin M. “Foundation Models May Exhibit Staged Progression in Novel CBRN Threat Disclosure.” arXiv, 19 Mar. 2025, https://arxiv.org/abs/2503.15182.