We now argue that the idea of substrate, as we describe it in the original work and in the previous post, unifies several safety-relevant phenomena in AI safety. This list is expanding as we identify and clarify more.
These examples show how the specific way a complex computation (here, that of a neural network) is implemented in a real system can affect that computation such that safety is compromised.
Case 1: LayerNorm's role in self-repair
Self-repair, also referred to as the Hydra effect, is a phenomenon by which ablations to network components (e.g., attention heads, MLP layers) are compensated for by later layers in the network.
Such behaviour represents an obstacle to the causal analysis of neural networks, which relies on ablations to separate genuine causal relevance from mere observed correlation.
Self-repair was first identified in Wang et al. (2023), investigated in more detail in McGrath et al. (2023), and given a full analysis in Rushing & Nanda (2024). We here follow the latter, which presents a fine-grained treatment of various aspects of the phenomenon.
The most interesting result of Rushing and Nanda's work is that LayerNorm contributes substantially to a network's self-repair capabilities. This is an architectural component of the model, often folded into the model weights themselves in mathematical descriptions or neglected entirely as a "passive" part of the model.
In our terminology, it forms a part of the substrate. It is context surrounding the purely formal function f(x) = y that characterises the model's behaviour, allowing that function to be learned and implemented.
This passive, architectural primitive rescales and centres the activation values so that the per-layer activations have a mean of zero and a variance of one. When an attention head is ablated, the norm of the residual stream shrinks. LayerNorm divides the residual stream by a scaling factor proportional to that norm, so reducing the norm amplifies the surviving components' contributions. Since model components tend to be correlated, pushing the residual stream vector towards the same answer, the amplified survivors compensate for the absent signal from the ablated head.
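A toy numerical sketch of this mechanism (invented numbers, purely illustrative; `normalize` stands in for LayerNorm's rescaling, with centring and the learned gain omitted):

```python
import math

# Three hypothetical "components" write correlated vectors into a
# 4-dimensional residual stream (numbers invented for illustration).
comps = [
    [2.0, 0.0, 1.0, 0.0],
    [1.5, 0.5, 1.0, 0.0],
    [1.8, 0.2, 0.9, 0.1],
]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    # LayerNorm-style rescaling: divide by a factor proportional to
    # the vector's norm.
    n = norm(v)
    return [x / n for x in v]

full = [sum(c[i] for c in comps) for i in range(4)]
ablated = [sum(c[i] for c in comps[1:]) for i in range(4)]  # ablate comp 0

# After rescaling, the direction of the residual stream barely moves:
# the smaller divisor amplifies the surviving, correlated components.
cos = sum(a * b for a, b in zip(normalize(full), normalize(ablated)))
print(round(cos, 4))
```

Because the surviving components point in nearly the same direction as the ablated one, and the shrunken norm means LayerNorm scales them up, the normalized output changes far less than the raw ablation would suggest.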
On average, when an attention head is ablated, this re-scaling contributes ~30% of the observed self-repair in Pythia-160M.[1]
This is a relatively shallow example of an untrained, "passive" component of the model (i.e., an architectural component) contributing to safety-relevant properties.
Case 2: Quantization
The next case study we consider is the role that quantization plays in safety-relevant aspects of AI systems.
Quantization refers to the format in which a model's weights are stored in computer memory. A neural network's weights are just numbers, but they can be represented in different ways, trading off precision against storage and compute efficiency. Models are typically trained in FP16, a 16-bit floating-point format: one sign bit, five exponent bits, and ten mantissa bits. A (normal) value is written in this form:

(-1)^sign × 2^(exponent − 15) × (1 + mantissa/1024)
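The bit layout can be inspected directly with Python's `struct` module, which supports IEEE 754 half precision via the `'e'` format code (a sketch of the encoding, not part of any cited work):

```python
import struct

def fp16_fields(x: float):
    # Pack x as IEEE 754 half precision, then split the 16 bits into
    # sign (1 bit), biased exponent (5 bits), and mantissa (10 bits).
    bits = int.from_bytes(struct.pack('<e', x), 'little')
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

# A normal value decodes as:
#   (-1)^sign * 2^(exponent - 15) * (1 + mantissa / 1024)
s, e, m = fp16_fields(-1.5)
print(s, e, m)  # 1 15 512, i.e. (-1)^1 * 2^0 * (1 + 512/1024) = -1.5
```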
But FP16 is an expensive format to store weights in. Deployment thus increasingly favours quantized formats: FP8 (still floating-point, but with a narrower range), INT8 and INT4 (plain signed integers with a scale factor rescaling them at inference time), and others.
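A minimal sketch of the integer case (symmetric per-tensor INT8 with a single scale factor; production quantizers add details such as per-channel scales, zero points, and calibration):

```python
def quantize_int8(weights):
    # One scale maps the range [-max|w|, max|w|] onto signed
    # 8-bit integers in [-127, 127].
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # At inference time the stored integers are rescaled back to floats.
    return [qi * scale for qi in q]

w = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-weight reconstruction error is at most half a quantization step.
print(max(abs(a - b) for a, b in zip(w, w_hat)) <= scale / 2)
```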
Performance loss from quantization is smaller than one might expect. Well-engineered 8-bit quantization typically costs less than 1% on standard benchmarks, and even INT4 loses only 1–5% performance when using careful compression.
Like LayerNorm, quantization format is substrate rather than model in our sense. It is computational context surrounding the formal function f(x) = y that characterises model behaviour. It is a further design choice made according to the constraints of the deployment hardware.
Its safety relevance shows up most sharply in the bit-flip attack literature. Bit-flip attacks use hardware fault-injection techniques to alter stored model weights directly. Coalson et al.'s PrisonBreak (2024) showed that flipping fewer than 25 bits (sometimes as few as 5 bits) in an aligned FP16 LM is enough to strip safety alignment. Their attack worked by targeting the most significant exponent bits of FP16 weights, since flipping an exponent bit can change a weight's magnitude by as much as 16 orders of magnitude.
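The sensitivity of the exponent bits is easy to demonstrate (a sketch of the encoding effect only, not of the attack's bit-search procedure):

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    # XOR a single bit in the 16-bit FP16 representation of x.
    bits = int.from_bytes(struct.pack('<e', x), 'little')
    return struct.unpack('<e', (bits ^ (1 << bit)).to_bytes(2, 'little'))[0]

# Round an illustrative small weight to FP16 first.
w = struct.unpack('<e', struct.pack('<e', 0.0073))[0]

# Bit 14 is the most significant exponent bit. For a weight like this,
# whose biased exponent has that bit clear, flipping it adds 16 to the
# exponent, multiplying the magnitude by 2^16 = 65536.
print(flip_bit(w, 14) / w)  # 65536.0
```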
Zahran et al. (2025) extend this investigation to FP16, FP8, INT8, and INT4 quantizations. FP8 is the most robust, holding attack success rates below 20% (within a 25 bit-flip budget). INT8 offers substantial but lesser protection. INT4, despite having no exponent bits and only 16 possible values per weight, collapses almost as easily as FP16.
Three further observations:
First, the attack heuristics don't transfer across substrates. PrisonBreak's exponent-bit-targeting strategy is inapplicable to integer formats, which have no exponent bits at all. Zahran et al. use a different search strategy: they perform a direct step-size search over all bits in a candidate weight. Safety analyses calibrated on one substrate's vulnerability profile simply do not describe the other's.
Second, the attack locations shift with quantization scheme. Successful attacks on FP16 and INT4 models cluster in attention value projections, while successful attacks on FP8 and INT8 models cluster in MLP down-projections. The same behavioural outcome is produced by interventions on different parts of the model depending on the storage format.
Third, vulnerabilities can persist across substrate transitions in ways that aren't obvious. Jailbreaks induced in FP16 models and then quantized down retain nearly their full attack success rate under FP8 or INT8 quantization (<5% drop), but lose roughly 24–45% under INT4.
The broader point is that safety-relevant properties established at one level of abstraction (the FP16 model, evaluated on HarmBench) silently depend on choices made at a lower, untracked level (the storage format, the quantization scheme, the order of operations between alignment and quantization). When these implementation choices change as part of standard deployment, the safety claim does not always go with them.
Case 3: GPUHammer
The third case study moves further down the stack, from the encoding of weights in memory (Case 2) to the physical memory in which those encodings are stored.
RowHammer is a read-disturbance phenomenon in dynamic random access memory (DRAM). Repeatedly activating a row of memory causes charge to leak in physically adjacent rows, eventually flipping bits. Lin et al.'s GPUHammer (2025) demonstrated that NVIDIA GPUs are vulnerable to an engineered version of this attack.
On an NVIDIA A6000 with GDDR6 memory, their attack produced eight bit-flips across four memory banks. By targeting the most significant bit of the exponent (weights were stored in FP16), they were able to collapse model performance across five standard ImageNet models (AlexNet, VGG16, ResNet50, DenseNet161, InceptionV3), driving top-1 accuracy from 56–80% (depending on model) to below 0.1% in around thirty minutes and within eight attempts.
Two observations matter for our argument.
First, the vulnerability is substrate-specific in a way that is invisible from the model's level of description. The same hammering techniques produced eight bit-flips on the A6000 but zero on an RTX 3080 (same GDDR6 family, different chip) and on an A100 (HBM2e memory, different architecture). An ML practitioner deploying a model cannot, from the model's behaviour, distinguish which of these substrates their weights are sitting on.
Second, the mitigations live at levels below the model and are traded against performance. Error-Correcting Code (ECC) refers to extra bits stored alongside data that let the system detect and sometimes correct bit flips. This can correct single bit-flips on the A6000, but it is disabled by default because enabling it costs 6.5% of memory capacity and a 3–10% inference slowdown. The substrate's safety-relevant behaviour is thus partly a function of decisions made by hardware vendors, cloud providers, and system administrators outside of typical AI safety threat models.
This is the same structural pattern as Cases 1 and 2, now at the physical layer. A property established at one level of abstraction (behavioural alignment of a DNN, evaluated on standard benchmarks) depends on choices made at a lower, untracked level (the specific DRAM chip and its configuration, e.g., ECC). Those choices change across cloud GPU generations and deployment settings.
Safety evaluations in AI (ablation studies, behavioural evals, weight-level analyses) are thus all substrate-sensitive.
[1] Note that Rushing & Nanda measure this over the top 2% of tokens by direct effect, and they describe it as a "directional metric" rather than a precise quantity.
This is the second post in a sequence that expands upon the concept of substrates as described in this paper. It was written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks," one of the three projects associated with Groundless.