Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training

by Jeyashree Krishnan, Ajay Mandyam Rangarajan
14th Oct 2025

TL;DR

  • We test a genomic foundation model (DNABERT-2 encoder with a lightweight classifier head) using a black-box genetic algorithm that proposes biologically plausible edits constrained by GC balance, motif preservation, and transition bias.
  • With only 2 to 3 single-nucleotide substitutions in 300 base pairs, the attack succeeds on 20 to 50 percent of previously correct samples under our success criterion (label flip or large confidence drop).
  • A short adversarial training loop that adds only 25 to 60 adversarial sequences per round and fine-tunes the classifier head restores robustness while keeping clean test accuracy above 90 percent.
  • This yields a simple stress-test and patch procedure that research and screening pipelines can adopt with minimal changes.

Contributions

This project began at an Apart Research hackathon, the CBRN AI Risk Research Sprint. We, Jeyashree Krishnan and Ajay Rangarajan, built an initial prototype in roughly 48 hours, then reproduced and cleaned it up after the event. This post adapts that submission into a blog article. Links to the original submission and code are provided later in the post. Any views on future work are ours.

Acknowledgements

We thank Apart Research for organizing the hackathon that catalyzed this project and for thoughtful feedback on early drafts. Any remaining errors are our responsibility.

Overview

Frontier large language models have changed how we process text, and a similar idea is transforming biology. Genomic foundation models (GFMs), i.e., models trained on large volumes of genomic data, learn statistical patterns over the letters of DNA in the same way that language models learn patterns over the 26 letters of the Latin alphabet. These models already help with basic classification tasks in genomics, such as identifying regions of DNA that control gene activity.

In this study we asked a practical safety question. Can very small and biologically plausible changes to a DNA sequence cause a genomic foundation model to fail on a simple classification task, and if so, can we make the model resilient again without losing normal accuracy?

Biological plausibility here means that a synthetic or edited sequence still respects the statistical rules of natural genomic sequences. If an adversary can introduce only a few letter changes but still match these background rules, then the attack is harder to detect and is closer to real laboratory scenarios. As an example, natural languages have characteristic letter frequencies: in English, E and S are common, while Z and Q are rare. Similarly, DNA is made up of four letters, A, T, G and C (commonly referred to as nucleotides), and shows its own statistical regularities, such as overall GC content ranges and recurrent short patterns (called motifs).
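To make this concrete, here is a minimal sketch of how such plausibility rules could be checked in code. It treats sequences as plain strings; the function names and the GC-shift threshold are illustrative assumptions, not the exact constraints in our implementation.

```python
from typing import List

# Transition substitutions (A<->G, C<->T); transversions are everything else.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def gc_content(seq: str) -> float:
    """Fraction of G and C letters in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def is_plausible(original: str, edited: str, motifs: List[str],
                 max_gc_shift: float = 0.02) -> bool:
    """Accept an edit only if it keeps GC content close to the original,
    leaves known short motifs intact, and uses only transition substitutions."""
    # GC balance: stay within a small window around the original GC content.
    if abs(gc_content(edited) - gc_content(original)) > max_gc_shift:
        return False
    # Motif preservation: motifs present in the original must survive the edit.
    if any(m in original and m not in edited for m in motifs):
        return False
    # Transition bias: every substituted position must be a transition.
    return all(a == b or (a, b) in TRANSITIONS for a, b in zip(original, edited))
```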

We focused on promoter classification with sequences of length 300 base pairs (bp). We found that 2 to 3 single nucleotide substitutions out of 300 often suffice to flip the model’s prediction. We then used a short adversarial training loop to recover robustness while keeping test accuracy greater than 90%.

Why is this Important?

Genes are DNA recipes that cells read to make functional products, often proteins. Regulatory DNA controls when genes are turned on or off. Promoters are one important type of regulatory DNA: short regions, usually located just upstream of a gene, that act like on-switches and help recruit the transcription machinery.

In addition, classification is a core task for GFMs. Given a DNA segment, the model answers whether it is a promoter or not. Before any discussion of attacks, it is important to remember that such classifiers are becoming building blocks for analysis pipelines in research and, increasingly, in clinical genomics.

Prior work has shown that small changes can strongly affect DNA classifiers. It has been reported that minimal nucleotide edits can flip model outputs in discrete sequence spaces [1]. In addition, evaluations on DNABERT-2 [2] and other genomic models show severe accuracy drops with simple adversarial edits [3]. These results motivate systematic robustness checks for GFMs in clinical and screening contexts, and research in similar directions has been conducted previously [4][5].

Promoter detection is a basic building block in genome analysis. If the model gets this wrong, later steps can inherit the mistake, for example when interpreting patient variants, designing gene therapy vectors, or screening synthetic DNA. Our contribution is a simple way to test a model using only its outputs and then strengthen it with the same test. The same idea can be applied to higher risk tasks such as detecting antimicrobial resistance genes, pathogen screening, and safety checks in DNA synthesis.

Core Idea

We evaluate a strong foundation transformer model, DNABERT-2, used with a classification head for promoter detection. We first develop a black-box attack that searches for few-base edits which flip predictions while keeping sequences biologically plausible. We then run an iterative adversarial training loop that adds a very small number of such examples and fine-tunes only the classifier head. The goal is to reduce attack success while preserving clean accuracy.

We assume that a few nucleotide changes do not alter the biological function relevant to the label, only the model’s prediction. This assumption is supported by plausibility constraints that preserve GC ratio, keep common regulatory motifs intact, and prefer transition mutations (A↔G, C↔T) that are common in evolution. These are proxies for functional invariance rather than guarantees. In our experiments the label after editing is treated as unchanged in the real world, and the attack is considered successful only if the model’s prediction flips (or the model's confidence in its prediction drops significantly) under biological plausibility constraints. This is an assumption, not a proof of functional invariance. Future work could add higher fidelity checks, such as motif occupancy predictors or even wet lab validation, to verify that semantics truly remain constant.

We choose black-box attacks because they are the more realistic threat in practice: they require only query access to the model and often transfer across architectures, which makes them feasible even when model internals are hidden.

Iterative robustness loop: blue = clean training and evaluation; red = black-box GA attack and adversarial training; dashed red arrow = repeat for a set number of iterations.

Adversarial Training

Model and task: DNABERT-2 is a transformer trained on large multi-species DNA corpora. For classification, we freeze this encoder and train a small binary classification head on top. In our data set, each input sequence has 300 bases (A-T-G-C). We follow a standard train, validation, and test protocol.
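As a rough illustration of this setup, the sketch below loads a DNABERT-2 encoder from Hugging Face, freezes its parameters, and attaches a small trainable classification head. The checkpoint name, mean pooling, and hidden size are assumptions based on the public DNABERT-2 release, not necessarily our exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

CHECKPOINT = "zhihan1996/DNABERT-2-117M"  # assumed public DNABERT-2 checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, trust_remote_code=True)
encoder = AutoModel.from_pretrained(CHECKPOINT, trust_remote_code=True)

# Freeze the encoder: only the classification head receives gradients.
for p in encoder.parameters():
    p.requires_grad = False

class PromoterClassifier(nn.Module):
    """Frozen DNABERT-2 encoder with a lightweight binary classification head."""
    def __init__(self, encoder: nn.Module, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # The encoder returns token-level hidden states; mean-pool them into
        # one vector per sequence before applying the linear head.
        hidden = self.encoder(input_ids)[0]
        return self.head(hidden.mean(dim=1))

model = PromoterClassifier(encoder)
toy_input = tokenizer("ACGT" * 75, return_tensors="pt")  # a 300 bp toy sequence
logits = model(toy_input["input_ids"])
```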

Attack: We treat the classifier as a black box and use a genetic algorithm to propose and refine candidate edits. Each candidate involves as few nucleotide substitutions as possible. In each generation we query the model for the current candidate, score whether the prediction flips or the confidence drops, and keep the best candidates for crossover and mutation. We enforce biological plausibility during mutation to stay close to the original: we prefer transition mutations (A↔G, C↔T), which are common in natural evolution, and avoid disrupting known short regulatory motifs present in the sequence context.
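The loop below is a compact sketch of such a genetic-algorithm attack. It assumes a black-box predict_proba(sequence) callable that returns class probabilities; the population size, mutation rate, fitness weighting, and elite fraction are illustrative, and in practice a plausibility filter like the one sketched earlier would also be applied to each candidate.

```python
import random

BASES = "ACGT"
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}

def mutate(seq: str, rate: float = 0.01) -> str:
    """Substitute a small number of positions, preferring transition mutations."""
    out = list(seq)
    for i in range(len(out)):
        if random.random() < rate:
            if random.random() < 0.8:
                out[i] = TRANSITION_PARTNER[out[i]]  # biased toward transitions
            else:
                out[i] = random.choice(BASES)        # occasional transversion
    return "".join(out)

def crossover(a: str, b: str) -> str:
    """Single-point crossover between two candidate sequences."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def fitness(original: str, candidate: str, true_label: int, predict_proba,
            max_perturbations: int = 8) -> float:
    """Reward loss of confidence on the true label; softly penalize exceeding
    the max-perturbations budget instead of enforcing a hard cap."""
    n_edits = sum(x != y for x, y in zip(original, candidate))
    conf = predict_proba(candidate)[true_label]      # one black-box query
    return (1.0 - conf) - 0.05 * max(0, n_edits - max_perturbations)

def attack(original: str, true_label: int, predict_proba,
           pop_size: int = 40, generations: int = 30) -> str:
    """Evolve a population of edited sequences against the black-box classifier."""
    population = [mutate(original) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, reverse=True,
                        key=lambda s: fitness(original, s, true_label, predict_proba))
        elite = ranked[: pop_size // 4]
        # Refill the population by recombining and mutating the elite.
        population = elite + [
            mutate(crossover(random.choice(elite), random.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
    return max(population, key=lambda s: fitness(original, s, true_label, predict_proba))
```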

Threat model: Compared to pixel or open vocabulary token perturbations, the feasible edit set is constrained: substitutions among A, T, G, and C at specified positions, with small insertions, deletions, inversions, or duplications under plausibility rules. This tighter space enables more systematic sampling and coverage analysis. We treat this as an advantage for realistic threat modeling while acknowledging that rare but effective edits may still exist.

ASR and budgets: There is no hard mutation budget. The genetic algorithm uses a fixed search budget (population size, maximum generations, early convergence). The number of mutated positions is guided by a max-perturbations parameter (set to 8 in our case) that enters the fitness function as a penalty, not as a strict cap. Attack success rate per iteration is computed over attacked samples as follows. For each sample i, success(i) = 1 if either the model's maximum confidence on the adversarial sequence falls below a fixed confidence threshold, or the confidence drop from the original to the adversarial sequence exceeds a threshold. Otherwise success(i) = 0. We report:

$$\mathrm{ASR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{success}(i).$$

This definition allows success even without a label flip, when the model becomes sufficiently uncertain.
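In code, the per-sample criterion and the resulting ASR might look like the sketch below; the two thresholds are placeholders rather than our exact values.

```python
from typing import List, Sequence, Tuple

def success(orig_probs: Sequence[float], adv_probs: Sequence[float],
            conf_threshold: float = 0.5, drop_threshold: float = 0.3) -> bool:
    """A sample counts as successfully attacked if the model becomes sufficiently
    uncertain on the adversarial sequence, or loses a large amount of confidence."""
    orig_conf, adv_conf = max(orig_probs), max(adv_probs)
    return adv_conf < conf_threshold or (orig_conf - adv_conf) > drop_threshold

def attack_success_rate(results: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """results holds (original_probs, adversarial_probs) pairs for attacked samples."""
    return sum(success(o, a) for o, a in results) / len(results)
```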

Simple example of a nucleotide-level adversarial attack - A→T at position 5 flips the prediction.

Defense: We run an iterative adversarial training loop. First we train the baseline classifier head on the clean training set. Next we generate a small batch of adversarial sequences against this model and fine-tune only the classifier head on these adversarial examples with their correct labels. We repeat this procedure for a few rounds.
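Schematically, the defense loop looks like the sketch below, where train_head, generate_adversarial, evaluate, and measure_asr are assumed stand-ins for our training, attack, and evaluation routines rather than functions from a specific library.

```python
def adversarial_training(model, clean_train, clean_test, *,
                         train_head, generate_adversarial, evaluate, measure_asr,
                         rounds: int = 3, batch_per_round: int = 50):
    """Alternate between attacking the current model and fine-tuning only the
    classification head on the resulting adversarial sequences."""
    train_head(model, clean_train)  # baseline: train the head on clean data
    history = []
    for r in range(rounds):
        # Attack the current model; adversarial sequences keep their original
        # labels (the functional-invariance assumption discussed above).
        adv_batch = generate_adversarial(model, clean_train, n=batch_per_round)
        train_head(model, adv_batch)  # fine-tune the classifier head only
        history.append({
            "round": r,
            "clean_accuracy": evaluate(model, clean_test),
            "attack_success_rate": measure_asr(model, clean_test),
        })
    return history
```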

Results

Initial vulnerability: With only 2 to 3 edits (out of 300), the attack often succeeds on previously correct samples. The Attack Success Rate (ASR) is in the range 20 to 50 percent depending on the target and round. This confirms that small, realistic changes can mislead a classifier even on a simple task, consistent with earlier reports in the literature.

Adversarial training preserves biological realism and improves robustness. Top: clean accuracy stays ~94% while attack success peaks early then declines; confidence drops diminish and required perturbations do not increase. Bottom: average biological plausibility remains high (~0.95–0.99) across iterations.

Iterative robustness: In each training round we add only 25 to 60 adversarial sequences, which is less than 0.2% of the training set. After a few rounds, the model resists the same attacks that previously worked, while maintaining clean test accuracy above 90%. Because we only train the classification head and the encoder stays fixed, training is stable and efficient. The attack success rate initially rises as the model encounters novel adversarial patterns, but subsequently declines as it learns to resist them. This transient increase reflects early vulnerability followed by progressive robustness across iterations.

Interpretation: These results show that fragility to tiny, plausible edits is real even on a widely used benchmark, but also that a short and targeted adversarial training loop can meaningfully improve robustness without degrading ordinary performance.

Future outlook and next steps

The key results of this work are:

  • Attack Success: 20–50% of target sequences were fooled by just 2–3 nucleotide edits.
  • Adversarial Training: Adding only 25–60 adversarial ("malicious") sequences (<0.2% of the data) exposed the model's weaknesses, but after a few rounds of training on these examples robust accuracy recovered to >90%.
  • Contributions: Demonstration of iterative black-box attacks on genomic models, together with a practical defense. This shows the feasibility of AI safety work for genomics, a vital step for responsible bio-AI development.

We see three high level directions. First, evaluate multi-class and multi-label tasks beyond promoters to understand whether the same pattern holds. Second, consider diverse attack families and test transfer across different encoders to assess shared weak points. Third, integrate adversarial stress tests and simple detectors into standard validation protocols for clinical and screening workflows.

A practical difference between genomic sequences and the vision and text domains is that DNA lives in a discrete four-letter alphabet with well-studied biological regularities. This yields a narrower and more structured attack surface. Our working hypothesis is that robustness gained against one family of biologically plausible attacks will partly transfer to others, because many attacks must respect the same constraints. This is a testable claim, not a conclusion, and we will evaluate cross-method transfer explicitly.

This work explores the area of adversarial genomics. As GFMs enter clinics and labs, ensuring they are trustworthy will be crucial. We have shown a practical path to making them robust, but much remains to be explored.

Call for collaboration: We welcome collaborators interested in this area. If you work on regulatory genomics or biosecurity policy and would like to co-design realistic threat models, please reach out to us. We are specifically interested in multi-class evaluations, transfer across foundation models, sequence-level detectors, and principled guarantees for small edit distances, but we are open to other ideas.

Learn More: The full code and data for our experiments are public. See our GitHub repository for implementation details and scripts, and the Apart Research project page for the original submission.

Abbreviations

DNA: deoxyribonucleic acid

bp: base pairs

GC: fraction of G and C letters in a DNA sequence

GFM: genomic foundation model

GA: genetic algorithm

ASR: attack success rate

DNABERT-2: transformer encoder for DNA sequences used here with a lightweight classifier head on top

  1. Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Quantifying and understanding adversarial examples in discrete input spaces. arXiv:2112.06276, 2021.

  2. Zhihan Zhou, Yanrong Ji, Weidong Li, Pratik Dutta, Ramana Davuluri, and Han Liu. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. ICLR, 2024.

  3. Hyunwoo Yoo, Jaehyun Kim, Jaewoo Lee, and Sanghyun Kim. Exploring adversarial robustness in classification tasks using DNA language models. arXiv:2409.19788, 2025.

  4. Heorhii Skovorodnikov and Hoda Alkhzaimi. FIMBA: Evaluating the robustness of AI in genomics via feature importance adversarial attacks. arXiv:2401.10657, 2024.

  5. Haozheng Luo, Chenghao Qiu, Yimin Wang, Shang Wu, Jiahao Yu, Zhenyu Pan, Weian Mao, Haoyang Fang, Hao Xu, Han Liu, Binghui Wang, and Yan Chen. GenoArmory: A unified evaluation framework for adversarial attacks on genomic foundation models. arXiv:2505.10983, 2025.