This write-up for an undergraduate project is my first LW post, made with the objective of
a) gathering feedback on the project and post, if more experienced authors are willing, and
b) sending out results of a mech-interp-on-a-non-LLM (specifically, DNA basecaller) exploration in case the idea is interesting to anyone.
Apologies in advance for any inconveniences and mistakes, and thank you in advance for your understanding.
Summary
As a first AI Safety/mech interp learning project, I tried applying activation patching to a DNA sequencing basecaller, a deep learning model used to convert time-series electrical signals into a sequence of DNA bases. While looking at data related to errors in DNA sequences, I found trends showing overall MLP dominance (especially in the earlier and later layers), greater self-attention activity in the middle layers, and activity concentrated in specific attention heads.
This experiment interested me because basecallers are part of the modern DNA sequencing pipeline, and increasing the accuracy of sequencing methods could support work towards pathogen-agnostic surveillance systems. Though this technology is highly accurate, some difficulties remain (such as miscounting repeated bases, or performing better on species present in the training data), and mech interp seems like a potential path to understanding these systematic challenges. Additionally, within the mech interp field, finding patterns similar to those seen in LLMs would suggest universality and potentially add to insights about general behaviors of deep learning models.
There are key limitations that prevent conclusive insights from my work, but I’d love to know if anyone more experienced is able to glean anything interesting or say whether this is a worthwhile direction for future research. The full paper and codebase are also available online for anyone curious.
Self-attention vs MLP recovery and degradation scores for the repeated-base (homopolymer) error group. Higher scores indicate patching the component resulted in greater confidence in the correct/original output choice. A score of 1 indicates the patching induced full recovery/degradation to the target signal, and a score of 0 indicates patching produced no change. Scores are computed using the max change in logit difference across all signal timesteps.
Background acknowledgements
I’m a recently graduated Electrical and Computer Engineering major interested in pivoting into AI Safety, but I am in no way an expert on technical AI safety, mech interp, or bioinformatics. This project came from pivoting my undergraduate thesis into something that would help me try out research in AI safety, so I approached it primarily as a learning and exploration experience. As this is my first and biggest project in several fields I am inexperienced in, I acknowledge that there may be severe holes in my methods and that the conclusions or results may be straight up wrong. That being said, I really enjoyed the experience and would be more than grateful to read or discuss any comments on this work.
Methods
Activation patching involves swapping sections of model activations between two nearly identical inputs to test how much each section affects the output. The aim is to isolate the sections responsible for certain behaviors by asking, “If we swap activations in section 1 to pretend the model saw input B instead of input A, does that give us output B?” In my case, the behavior in question was correctly counting the length of repeated DNA sequences (homopolymers). An example of a test was, “If I have an input 5 bases long, and I swap in activations in section 1 to pretend it saw an input 6 bases long instead, then if the whole model predicts 6 bases, section 1 is probably important to this task.” For each pair, activations were swapped in both directions to test which sections could be patched to “recover” 5 bases from a corrupted input of 6, and which could “degrade” the model’s prediction for an input of 5 into thinking it saw 6. I was not able to isolate trends related to this behavior specifically and can only present general findings from the process of applying activation patching, but this question helps explain the reasoning behind the rest of the setup.
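To make the swap concrete, here is a minimal sketch of activation patching on a toy model made of three plain Python functions. It is only an illustration of the idea described above, not the code used in this project: the `run` helper, the toy layers, and the inputs are all made up for the example, and the real experiments patched a transformer through nnsight.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run(
    layers: List[Layer],
    x: Vector,
    patch: Optional[Dict[int, Vector]] = None,
    cache: Optional[Dict[int, Vector]] = None,
) -> Vector:
    """Run x through the layers in order.

    If `patch` maps a layer index to an activation, that layer's output is
    overwritten with it. If `cache` is given, every layer output is stored.
    """
    h = list(x)
    for i, layer in enumerate(layers):
        h = layer(h)
        if patch is not None and i in patch:
            h = list(patch[i])  # overwrite with the activation from the other run
        if cache is not None:
            cache[i] = list(h)
    return h


# Hypothetical toy "model": scale, shift, then sum to a single output.
toy_layers: List[Layer] = [
    lambda h: [2.0 * v for v in h],
    lambda h: [v + 1.0 for v in h],
    lambda h: [sum(h)],
]

clean_input = [1.0, 2.0]
corrupt_input = [1.0, 3.0]

# 1) Run the clean input and remember every layer's activation.
clean_cache: Dict[int, Vector] = {}
clean_out = run(toy_layers, clean_input, cache=clean_cache)

# 2) Run the corrupt input unmodified as a baseline.
corrupt_out = run(toy_layers, corrupt_input)

# 3) Denoising direction: run the corrupt input, but swap in the clean
#    activation at layer 0 and see whether the clean output is recovered.
patched_out = run(toy_layers, corrupt_input, patch={0: clean_cache[0]})
```

In this toy case layer 0 carries all of the information, so patching it fully recovers the clean output. The noising direction is the same procedure with the roles of the clean and corrupt inputs exchanged.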
To generate data, clean and corrupt pairs of raw nanopore sequencer data were created to form two groups: one from repeating (homopolymer) DNA sequences and one from nonrepeating (non-homopolymer) DNA sequences. To create similar pairs of signals with small, localized differences that could be patched, raw data was taken from the multi_fast5_zip_v0.pod5 file in the Oxford Nanopore Technologies (ONT) POD5 repo and artificially altered (by injecting noise or dampening) to create clean (original) and corrupt (altered) pairs. Difficulties in creating a dataset led to key limitations discussed later. Homopolymers were chosen specifically as the feature of interest because the repeating bases create a signal plateau that basecallers commonly struggle to count.
Dataset generation process. A dataset of 49 clean/corrupt pairs (35 homopolymer and 14 non-homopolymer) was created by finding natural DNA sequencer reads with and without repeated bases and corrupting them via noise injection or dampening the signal. Different corruptions were tested experimentally to find DNA sequences and corruptions that would create single-base errors in the decoded string.
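The two corruption types mentioned above (noise injection and dampening) can be sketched as a localized edit to a raw signal. This is a hypothetical illustration only: the function name, the Gaussian noise model, the "shrink towards the window mean" definition of dampening, and the `strength` parameter are assumptions for the example, not the exact procedure used to build the dataset.

```python
import random
from typing import List


def corrupt_window(
    signal: List[float],
    start: int,
    end: int,
    mode: str = "noise",
    strength: float = 0.5,
    seed: int = 0,
) -> List[float]:
    """Return a copy of `signal` with samples in [start, end) altered.

    mode="noise":  add zero-mean Gaussian noise with std `strength`.
    mode="dampen": shrink samples towards the window mean; strength=1.0
                   flattens the window completely, 0.0 leaves it unchanged.
    Samples outside the window are never touched, so the clean/corrupt
    pair differs only locally.
    """
    if not 0 <= start < end <= len(signal):
        raise ValueError("window must satisfy 0 <= start < end <= len(signal)")

    out = list(signal)
    if mode == "noise":
        rng = random.Random(seed)  # seeded so a pair can be regenerated
        for i in range(start, end):
            out[i] = signal[i] + rng.gauss(0.0, strength)
    elif mode == "dampen":
        window = signal[start:end]
        mean = sum(window) / len(window)
        for i in range(start, end):
            out[i] = mean + (1.0 - strength) * (signal[i] - mean)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return out


# Toy signal standing in for raw current samples.
clean = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
dampened = corrupt_window(clean, 2, 4, mode="dampen", strength=1.0)
noisy = corrupt_window(clean, 2, 4, mode="noise", strength=0.5, seed=1)
```

In practice each candidate corruption would still have to be passed through the basecaller to check that the decoded string differs from the original by exactly one base.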
The model tested was the open-source ONT Bonito 5.2.0 SUP model (version dna_r10.4.1_e8.2_400bps_sup@v5.2.0), which uses a CNN followed by an 18-layer transformer and conditional random field algorithm to transform raw time-series data from the sensed electrical current into a DNA sequence.
Using nnsight, I patched the model across all 18 layers in both the noising and denoising directions for three levels of granularity:
The entire layer, to serve as a proof of concept
The MLP or self-attention mechanism in isolation, to test their relative importance and activity across layers
Each of eight attention heads, to search for components that might be related to homopolymer detection or error correction
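For the per-head level, one common way to patch a single head is to treat the concatenated attention output at a timestep as `n_heads` equal slices and swap only one slice. The sketch below assumes that layout (heads concatenated along the feature dimension before the output projection), which is the standard multi-head attention arrangement but has not been checked against Bonito's implementation here. The helper name and the toy vectors are hypothetical.

```python
from typing import List


def patch_head(
    target: List[float],
    source: List[float],
    head: int,
    n_heads: int,
) -> List[float]:
    """Copy one head's slice from `source` into `target`.

    Both vectors are one timestep of concatenated per-head outputs with
    length n_heads * head_dim. All other heads keep the target's values.
    """
    if len(target) != len(source):
        raise ValueError("target and source must have the same length")
    if len(target) % n_heads != 0:
        raise ValueError("vector length must be divisible by n_heads")
    if not 0 <= head < n_heads:
        raise ValueError("head index out of range")

    head_dim = len(target) // n_heads
    lo, hi = head * head_dim, (head + 1) * head_dim
    return target[:lo] + source[lo:hi] + target[hi:]


# Toy example: 8 heads of dimension 2.
corrupt_act = [0.0] * 16
clean_act = [float(i) for i in range(16)]
patched_act = patch_head(corrupt_act, clean_act, head=3, n_heads=8)
```

Looping `head` over all eight values for each of the 18 layers gives the grid of per-head scores; the same helper covers both patching directions by exchanging `target` and `source`.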
Results are compared across sections, layers, and the two groups: homopolymer and non-homopolymer sequences.
Results
Layer patching showed a large degree of recovery/degradation, suggesting the method was correctly patching the model. Activity across layers seemed to follow a pattern consistent with other deep learning models: the MLP played a greater role than the self-attention mechanism, though this gap narrowed at points in the middle layers, where self-attention appeared to increase in activity. Denoising and noising results were generally similar, but not always. Finally, activity in attention heads appeared to be concentrated in certain heads, while other heads contributed to a much smaller extent. The homopolymer and non-homopolymer error groups did not differ enough to draw meaningful conclusions, though this could be due to an unbalanced dataset or the method of corrupting data.
Whole-transformer block patching results by layer. Patching at the final layer resulted in exact recovery/degradation, serving as a sanity check that activation patching targeted the correct section of the model. It is unknown why results from the non-homopolymer (single-base region) error group consistently show higher scores than the homopolymer group. This trend persists across all tests.
Results are reported as a recovery score (for denoising) or a degradation score (for noising), where 0 represents no change from the initial input and 1 represents complete recovery or degradation. Scores above 1 indicate greater confidence in the changed output, and negative scores indicate greater confidence in the wrong/original output. Scores are computed from the logits for transitions between five-base windows (e.g. one possible transition is AAAAA → [A]AAAAC), using the maximum change across all patched timesteps.
The results appear to follow a general pattern in deep learning architectures, where context from surrounding tokens is factored in to a greater extent in middle layers. One hypothesis is that the spikes in middle-layer attention activity reflect the complexity of the introduced corruptions: more complicated than the basic patterns the MLP may be recognizing early in the model, but not quite at the level of detail of the last stages. Since the methodology appears to apply meaningfully, the spikes in attention head activity suggest that it could be possible to find circuits performing functions related to systematic issues.
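One way to write down a score with these properties is the usual normalized logit-difference metric, taken per timestep and then reduced with a maximum. The exact formula used in the project is not spelled out in this post, so the sketch below is an assumption: it normalizes the patched logit difference between the corrupt baseline (score 0) and the clean target (score 1), skips timesteps where the two runs do not differ, and returns the largest per-timestep score. All names are hypothetical.

```python
from typing import List


def recovery_score(
    clean_ld: List[float],
    corrupt_ld: List[float],
    patched_ld: List[float],
    eps: float = 1e-9,
) -> float:
    """Max over timesteps of (patched - corrupt) / (clean - corrupt).

    Each list holds one logit difference per timestep (for example, the
    correct transition's logit minus the erroneous transition's logit).
    0 means patching changed nothing, 1 means the clean value was fully
    recovered, values above 1 overshoot, and negative values move away
    from the clean value.
    """
    scores = []
    for clean, corrupt, patched in zip(clean_ld, corrupt_ld, patched_ld):
        gap = clean - corrupt
        if abs(gap) < eps:
            continue  # clean and corrupt agree here, so the ratio is undefined
        scores.append((patched - corrupt) / gap)
    if not scores:
        raise ValueError("clean and corrupt runs do not differ at any timestep")
    return max(scores)


# Toy values for two timesteps.
clean_ld = [2.0, 4.0]
corrupt_ld = [0.0, 0.0]
full = recovery_score(clean_ld, corrupt_ld, patched_ld=[1.0, 4.0])
none = recovery_score(clean_ld, corrupt_ld, patched_ld=[0.0, 0.0])
```

The degradation score for the noising direction follows from the same function by exchanging the clean and corrupt arguments, so that the corrupt run becomes the target.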
Combined results across all components in the homopolymer dataset group. Recovery and degradation scores are shown for activation patching at the layer, MLP, self-attention, and individual attention head levels. Scores shown are from taking the maximum score across all timesteps.
Limitations
Key issues include:
Unbalanced dataset: while the goal was to discern differences related to homopolymers specifically, the dataset was unequal between bases (e.g. only 3 out of 49 pairs involved “T” bases) and between the homopolymer and non-homopolymer groups (35 vs 14 pairs). In the homopolymer group, most of the repeated sequences were 4-5 bases long (27 out of 35). This means that activity could be related to certain specific patterns rather than to homopolymers as a feature.
Data corruption: patterns and dataset imbalance could also be due to the method of data corruption. Raw signals were arbitrarily tweaked until they decoded into a string with a single-base difference from the original. However, there is no true DNA sequence for a corrupted signal, and it is possible that the alterations introduced artifacts or skewed the basecaller's interpretation and decoding.
Max-based scoring: results use the maximum change in probability, resulting in high values that may not reflect activation patterns accurately. This was done to avoid misaligning timesteps across tests, which could cancel out scores in the aggregate, but it could be addressed in future work.
Questions
The results seem to follow general deep learning trends with potential for specialized circuits, but I’m unsure of my methodology. Do these findings seem plausible, or do the consistently higher scores in the non-homopolymer group or the choice to use the max score across all timesteps suggest that there could be errors in the process that might invalidate the results?
What could be a better way of isolating potential feature-related results (e.g. homopolymer/repeated base errors vs non-repeated base errors) from other effects such as biased data corruption? I'm especially curious if anyone has experience with basecallers, which seem very sensitive to artifacts in the input.
I’m also wondering if genomics x mech interp generally seems like a direction that could be useful to either field?
Any other thoughts on the design, process, write-up, etc., would also be greatly valued!
Future work
Improvements to this work that would help solidify its findings include:
Investigating data corruption methods to see how this affects model outputs
Generating a larger and more balanced dataset
Filtering and comparing results based on the DNA base, length, and other sequence characteristics
Aligning aggregated results precisely by timestep rather than taking a maximum
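As a rough idea of what timestep-aligned aggregation could look like, the sketch below averages per-timestep score traces after lining each one up on a reference timestep, such as the position where the corruption was applied. This is a hypothetical design, not something implemented in the project: the function name, the choice of anchor, and the window half-width are all assumptions.

```python
from typing import List, Optional


def align_and_average(
    traces: List[List[float]],
    anchors: List[int],
    half_width: int,
) -> List[Optional[float]]:
    """Average score traces over a window centered on each trace's anchor.

    traces[i][t] is the score of pair i at timestep t, and anchors[i] is
    the timestep to align on. The result has 2 * half_width + 1 entries,
    one per offset from the anchor. Offsets that fall outside a trace are
    skipped for that trace; an offset with no data at all yields None.
    """
    if len(traces) != len(anchors):
        raise ValueError("need exactly one anchor per trace")

    width = 2 * half_width + 1
    sums = [0.0] * width
    counts = [0] * width
    for trace, anchor in zip(traces, anchors):
        for offset in range(-half_width, half_width + 1):
            t = anchor + offset
            if 0 <= t < len(trace):
                sums[offset + half_width] += trace[t]
                counts[offset + half_width] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Two toy traces whose peaks sit at different absolute timesteps.
traces = [
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
aligned = align_and_average(traces, anchors=[2, 1], half_width=1)
```

Averaging these two traces by absolute index would smear the peaks across timesteps 1 and 2, while aligning on the anchors keeps them together, which is the cancellation problem that the max-based score was meant to sidestep.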
Disclosure: I used AI to help review and edit this post. Many thanks to my thesis advisors; all mistakes are my own.