This paper is about an observation I made, as an outsider to two distinct research currents in artificial intelligence, regarding the similarity of the phenomenon they each observe and the potential for these two fields to converge toward shared tools. You will find a mix of established results, some unreviewed papers, and my own proposals, all clearly labeled so as not to be misleading.
I enjoy thinking and learning about the stakes and the broad principles of LLMs, and I often have a "why not this way" reflex. This paper grew out of a discussion with an LLM about catastrophic forgetting and AI safety, among other things the erosion of safety rules under fine-tuning. In both cases my intuition told me that these two things had to be handled during training if we ever wanted real continual learning (CL). As I read through the literature on these two research currents, I noticed that the two seemed to be dealing with the same subject. What followed was joint work between me and my agent for reading the literature, writing the code, and studying the results, up to testing the idea on a small model.
Thesis. Safety behaviors are not mechanically different from other knowledge: they too are learned features sitting in the same loss landscape as ordinary learned capabilities, and they erode through the same gradient-interference process that drives catastrophic forgetting. If this holds, the two communities' tools should be largely interchangeable, and the tools for observing and surgically modifying gradients should transfer from one domain to the other. As far as my research goes, neither community seems to have tested this.
1. Two Literatures, One Gradient
One family of techniques against catastrophic forgetting protects weights while a new task is trained, so that new knowledge does not overwrite the weights earlier tasks depend on (EWC, Kirkpatrick et al. 2017). On the side of protecting safety behaviors, SafeGrad (Yi et al. 2025) proposed removing the components of an update that conflict with the safety objective, to keep the weights tied to it from being overwritten.
Protect old tasks from new gradients; protect alignment and safety behaviors from new gradients. The resemblance is close enough that Zhang et al. 2026 place SafeGrad and GEM (Lopez-Paz & Ranzato 2017) in a single category, "optimizer-level gradient surgery", since the two differ mainly in what they protect. The two communities have therefore arrived at the same surgical operating method, applied to the same failure mode. Two communities, but potentially a connection still to be made.
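To make the shared operation concrete, here is a minimal sketch of a single-constraint gradient projection in the spirit of this family of methods. It is not the exact algorithm of SafeGrad or GEM (GEM, for instance, solves a small quadratic program over several stored constraints); the function name and the example vectors are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out_conflict(g_new, g_protected):
    """Remove from g_new the component that opposes g_protected.

    If the two gradients do not conflict (non-negative dot product),
    g_new is returned unchanged. Otherwise its component along
    g_protected is subtracted, leaving an update orthogonal to it.
    """
    overlap = dot(g_new, g_protected)
    norm_sq = dot(g_protected, g_protected)
    if overlap >= 0 or norm_sq == 0:
        return list(g_new)
    scale = overlap / norm_sq
    return [a - scale * b for a, b in zip(g_new, g_protected)]

# Hypothetical gradients: the new task pushes against the protected objective
# along the second coordinate.
g_task = [1.0, -2.0]
g_protected = [0.0, 1.0]
g_fixed = project_out_conflict(g_task, g_protected)
print(g_fixed)  # the conflicting second component is removed
```

Whether `g_protected` comes from an old task or from a safety objective changes nothing in the code, which is the point of the comparison.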
What is this failure mode? The continual learning (CL) community, when it observes performance failing on a task A after training a task B, speaks of catastrophic forgetting. The AI safety community, which studies among other things the erosion of safety behaviors, observes that fine-tuning a language model, even with seemingly benign data, negatively affects safety behaviors (Qi et al. 2023). Two communities, two names, but one common point: both are talking about the same type of event, a failure to retain information after continued gradient descent. The new objective travels through the directions of the loss landscape that the old objective depends on.
When you look at some of the axes the two communities work on, similar methods are being handled on both sides in parallel.
Continual learning | Safety/alignment | Shared axes
Catastrophic forgetting | Safety erosion under fine-tuning | Retention failure under continued training
Task interference | Gradient conflict (per SafeGrad's diagnosis) | The new objective's update opposes the old ones
EWC, gradient projection (GEM, OGD) | Orthogonal projection (SafeGrad) | Constraining the update to protect old knowledge
Replay/rehearsal of old-task data | Mixing safety data into fine-tuning | Re-exposing the model to what it must retain
Forgetting curves across task sequences | Effect on harmfulness score across fine-tuning steps | The same degradation curves, with differently named y-axes
My argument here is that when two fields arrive at similar diagnoses and similar remedies, they are probably studying the same phenomenon. In my view, the erosion of safety behaviors and catastrophic forgetting are two problems tied to gradient interference during training. Some recent work in continual learning has made it possible to test this mechanism.
If what I am putting forward is true, these two communities could quickly benefit from each other's advances. The study of continual learning has years of theory and field-tested techniques that could ease some of the safety-behavior problems. The community focused on safety, for its part, brings an environment with higher stakes. The second part of this article focuses on a constructive aspect of this possible alliance, detection: if a single mechanism causes both failure modes, it may be possible to build a detection layer that watches for both failures by observing features and gradients during training, rather than evaluating results after the fact.
In the next sections my argument unfolds as follows. Section 2 gives the definitions and the scope of the study, including the fact that I study the erosion of learned safety behaviors and not every way alignment can fail. Sections 3 and 4 build the case at the mechanical level. Sections 5 and 6 cover observability and what it predicts. Finally, Section 7 covers the main objections in detail and Section 8 draws out the implications.
2. Definitions and Scope of the Study
My thesis rests on five terms, and this section defines them so that we agree on what follows. Before defining them, let us take two things as given. Gradient descent is the standard training loop: compute the gradient of the loss, which points in the direction of steepest ascent in parameter space, then take repeated small steps in the opposite direction. The loss landscape is the terrain it walks on, a surface above the parameters. The height of this surface corresponds to the error, and its shape defines the general topography of the terrain and where the model's knowledge is stored.
Catastrophic forgetting. It is defined as the loss of past knowledge during training on a new task. The updates computed for the new task overwrite weights the old task relied on, and the resulting loss is abrupt, far larger than the gradual decay one might expect (McCloskey & Cohen 1989; Kirkpatrick et al. 2017).
Erosion of safety behaviors. It is defined as the degradation of safety behaviors (refusal, harm prevention, honesty norms installed by alignment training) after a trained model is fine-tuned. This phenomenon occurs whether with malicious data (Qi et al. 2023 jailbroke GPT-3.5 with 10 examples) or even with seemingly benign data (the same paper found degradation even with a standard dataset, with no one attacking).
Gradient interference. This is specifically what this thesis is about. When a model is trained on a new objective, each update is computed only to improve that objective, with no regard for the rest of the model's knowledge. If the parameters the new objective moves are ones the old objectives also depend on, the update damages them along the way. The intuition: two tenants are renovating the same apartment, and one of them keeps making improvements that cut the electrical wires the other relies on. Even with no ill intent, the other tenant bears the consequences.
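A small worked example can make interference tangible. To first order, one gradient step of size lr on the new objective changes the old loss by about -lr times the dot product of the two gradients, so a negative dot product means the old loss goes up. The two quadratic losses, the starting point, and the learning rate below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss_old(theta):
    # Old objective: depends only on the first (shared) parameter.
    return 0.5 * (theta[0] - 1.0) ** 2

def grad_old(theta):
    return [theta[0] - 1.0, 0.0]

def grad_new(theta):
    # New objective: pulls the shared parameter toward -1
    # and a private second parameter toward 2.
    return [theta[0] + 1.0, theta[1] - 2.0]

theta = [0.5, 0.0]
lr = 0.1

g_old, g_new = grad_old(theta), grad_new(theta)

# First-order prediction of the change in the old loss after one new-task step.
predicted_change = -lr * dot(g_old, g_new)

# Actual change after taking that step.
theta_after = [t - lr * g for t, g in zip(theta, g_new)]
actual_change = loss_old(theta_after) - loss_old(theta)

print(predicted_change, actual_change)  # both positive: the old loss rises
```

The second parameter is private to the new objective and contributes nothing to the damage; all of it comes from the shared first parameter, the "wire" both tenants use.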
Feature. It is defined as an internal activation pattern of the model that expresses a specific concept. It is the basic unit of the interpretability machinery that lets us define what was asked of the model and what it has forgotten. Observability during training means monitoring the gradients and features during training rather than evaluating the model afterward.
Scope of the study: erosion, not misalignment. My thesis is about the loss of safety behaviors, not about the misalignment that emerges when a very narrow update is run (e.g. on insecure code) and produces a general misalignment that was not visible before (Betley et al. 2025). Erosion is a behavior that disappears; emergent misalignment is a behavior that appears because the model generalizes from narrow data. Interference is a plausible mechanism for the first but not closely tied to the second. Combining the two in this thesis would overextend its scope (Section 7 returns to this). However, the work on emergent misalignment brought to light something important to keep in mind: in training-dynamics experiments, the internal signals diverge before we see a behavioral change indicating whether the model is safe (see §4.7). Detecting the internal signals before checking the behavior is exactly what we will be looking at in the second part of this thesis. So even though emergent misalignment is not part of my mechanical approach here, that does not necessarily mean the monitoring process could not be connected to it.
Everything that leaves the weights intact and that happens at inference (jailbreaks, prompt injection, and parameter-decoding attacks) is also out of scope for this study. This article is about the effect of gradient descent on a model's existing learning.
3. The Mechanism: Continued Training Is an Attack Without an Attacker
Why catastrophic forgetting? In a parameter space as vast as that of a large model, you might hope that random modifications would touch very few shared spots, and so that knowledge should stay roughly intact. Yet updates of new knowledge overwrite the old. Something seems to steer the modification precisely to a place where the damage hurts.
Training on a new task is implicitly like a targeted attack on the model's knowledge (arXiv:2510.09181). The only difference from an ordinary targeted attack (a perturbation aimed at a fragile spot of the model) is that this one is targeted automatically, caused by the fact that the new task's gradients have a direction aligned with the steep slopes of the old knowledge's loss terrain. The faster the movement in that direction, the faster the error spikes for that old knowledge.
The alignment we observe is counterintuitive, because the sensitive directions are too sparse in a high-dimensional space to be hit by chance alone. The explanation offered is an implicit bias of training: gradient descent tends to concentrate learning in a small number of directions rather than spreading it across all parameters. The sensitive directions of the old knowledge and the gradients of the new task therefore end up concentrated in the same low-dimensional subspace. They do not collide by accident; it is training that squeezes them into the same corridor. (Status: a recent preprint, grounded in theory but not replicated at large scale. I treat it as the best available account of the mechanism for now, not as settled fact.)
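The following sketch illustrates why hitting a steep direction matters so much, under a deliberately simple assumption: the old loss is locally quadratic with a diagonal Hessian, so a step s raises it by half the curvature-weighted squared step. The dimension, curvature values, and step size are hypothetical.

```python
import math

d = 1000
curvature = [0.01] * d   # many flat directions
curvature[0] = 100.0     # one steep (sensitive) direction

def loss_increase(step, curvature):
    # Second-order increase of a quadratic loss with diagonal Hessian.
    return 0.5 * sum(h * s * s for h, s in zip(curvature, step))

step_size = 0.1
# Step entirely along the steep direction.
aligned = [step_size] + [0.0] * (d - 1)
# Step of the same length spread evenly across all directions.
spread = [step_size / math.sqrt(d)] * d

inc_aligned = loss_increase(aligned, curvature)
inc_spread = loss_increase(spread, curvature)
ratio = inc_aligned / inc_spread

print(inc_aligned, inc_spread, ratio)
```

Two steps of identical length differ in damage by roughly three orders of magnitude here, which is why the direction of the update matters far more than its size.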
We know where the damage occurs, but concretely, is the knowledge lost or is it still present but inaccessible? When you look at forgetting by attending to the internal features, you find that the knowledge is often still present, only scrambled (Masip et al. 2026). You could compare the phenomenon to a library full of books but with no catalog to find the one you want to consult. In Section 6 we will at least see that we are able to observe the knowledge degrading at the feature level, scrambling.
So continued training damages prior knowledge like a targeted attack and not by chance, and at the feature level the knowledge appears scrambled but often still present. Nothing in this mechanism takes into account what an old piece of knowledge is, and this is precisely what Section 4 claims: that safety behaviors are simply more knowledge in the same corridor.
4. Safety Behaviors Are Not Mechanically Different from Other Knowledge
My proposal here, more than a result, is that safety behaviors, installed by alignment during training, are learned features written into the same loss landscape where the other knowledge sits. Fine-tuning therefore erodes them the same way (gradient interference), and so it is not a distinct phenomenon. The rest of this section explains why this is more plausible than the alternative, and Section 7 examines where it might break.
A study has already made this diagnosis. The analysis SafeGrad carried out (Yi et al. 2025) is in fact just a renaming of continual-learning interference: the task gradients and the alignment-objective gradient are in conflict, and the similarity between these directions keeps decreasing. This conflict intensifies the more the fine-tuning data grows hostile toward safety behaviors. You only need to substitute "alignment objective" with "task A" and it is exactly task interference. The safety community arrived at this diagnosis independently.
Now back to what was established in Section 3: a safety behavior does not need an attacker to erode. Refusal still gives way even after fine-tuning on benign data. One measurement complicates this claim, however: SafeGrad found a near-zero cosine similarity (0.02) between the task gradients and the alignment gradients, so little conflict on average. Yet Qi et al. observed that erosion was still present. That said, finding that the directions are nearly orthogonal on average does not mean that their overlap at the critical spots is harmless: a small component can still do damage if it lands right where it hurts. This is my interpretation, and it is something neither paper demonstrates, but it is falsifiable. My prediction in Section 6 rests on this interpretation: we should see systematic interference if we restrict our measurement to the steep-slope directions of the safety loss and of the benign tasks, even when the global cosine is near zero.
A major objection to my thesis is that safety alignment is fragile because it is concentrated (Qi et al. 2024). At first glance one might conclude that the features of safety behaviors are special. Interference offers another reading: these features are fragile simply because being concentrated means sitting at the precise spot on the terrain where the slopes are steepest, the same space but the wrong place. The aim here is to contribute a way to measure whether safety features differ from task-knowledge features, in curvature and in redundancy. Section 7 treats this confounder as the second biggest objection to this thesis.
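The prediction above can be illustrated with a toy calculation: two gradients whose global cosine is zero can still be in full conflict once the comparison is restricted to the steepest directions. This is a sketch under strong assumptions: curvature is taken as diagonal, so the steep directions are simply the coordinates with the largest curvature, and every number is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def top_k_indices(curvature, k):
    # Indices of the k coordinates with the largest curvature.
    order = sorted(range(len(curvature)), key=lambda i: curvature[i], reverse=True)
    return order[:k]

def restrict(vec, indices):
    return [vec[i] for i in indices]

# Hypothetical curvature of the safety loss: two steep coordinates, four flat ones.
curvature = [50.0, 40.0, 0.1, 0.1, 0.1, 0.1]
g_safety = [1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
g_task = [-1.0, -1.0, 2.0, -2.0, 2.0, -1.0]

global_cos = cosine(g_safety, g_task)

steep = top_k_indices(curvature, k=2)
restricted_cos = cosine(restrict(g_safety, steep), restrict(g_task, steep))

print(global_cos, restricted_cos)
```

In this construction the flat coordinates cancel the conflict in the average, while the steep coordinates carry all of it. Whether real safety and task gradients behave this way is exactly what the prediction asks to be measured.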
Preliminary direct evidence. In a toy model I experimented with (see Section 6), a small transformer holding a safety behavior was trained on a new, unrelated task. The safety behavior held for a while and then switched off, but the moment of the switch was very unstable: across three seeds differing only by small amounts of noise, the refusal survived anywhere from a few hundred to a few thousand steps before collapsing. You might expect to find the same instability inside the model. On the contrary, the internal erosion is smooth, steady, and reproducible. The projection of the activations onto the refusal direction and the logit margin separating a compliance behavior from a refusal behavior decayed on a regular schedule across the seeds, well before the safety behavior switched off. So the behavioral measurement is not a reliable indicator of erosion, since it is unstable, while the internal values stay coherent. This is what Section 3 indicated: damage accumulates inside the model without any visible change in behavior. The experiment goes even further, since the refusal does eventually give way, but the moment it gives way is the least predictable thing in the experiment. The full multi-seed numbers and the controls are available in Section 6.
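For readers unfamiliar with the two internal quantities named above, here is one common way to compute them. This is a sketch under assumptions: the text does not say how the refusal direction was obtained, so a difference-of-means direction is used as a stand-in, and the activations, logits, and token indices are hypothetical.

```python
import math

def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refuse_acts, comply_acts):
    # Unit vector from the mean "comply" activation to the mean "refuse" activation.
    diff = [r - c for r, c in zip(mean_vec(refuse_acts), mean_vec(comply_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def projection(activation, direction):
    # Scalar projection of one activation onto the (unit) refusal direction.
    return sum(a * d for a, d in zip(activation, direction))

def refusal_margin(logits, refuse_id, comply_id):
    # Logit gap between the refusal token and the compliance token.
    return logits[refuse_id] - logits[comply_id]

# Hypothetical activations collected on refusal and compliance prompts.
refuse_acts = [[2.0, 0.0], [4.0, 0.0]]
comply_acts = [[0.0, 0.0], [0.0, 0.0]]
direction = refusal_direction(refuse_acts, comply_acts)

proj = projection([1.5, 7.0], direction)
margin = refusal_margin([0.2, 5.0, -1.0], refuse_id=1, comply_id=2)
print(direction, proj, margin)
```

Both quantities are continuous, which is what lets them decay smoothly while the binary refuse/comply outcome stays unchanged until the margin crosses zero.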
If this proposal holds, one consequence to test is that, essentially, the two communities' tools would be interchangeable. The gradient-surgery techniques built for safety behaviors (SafeGrad) could be used to reduce the risks around forgetting; the projection methods built for forgetting (GP, backGP) could preserve refusals; weight-protection techniques like EWC could protect safety guardrails; and feature monitoring built to track knowledge could prevent guardrails from drifting. As far as I have been able to find in my research, no such transfer has been tested. The next two sections look at these consequences in detail. Section 5 describes the monitoring substrate this implies, and Section 6 gives the predictions, including the full experiment.
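As one example of what such a transfer would involve, the EWC penalty (Kirkpatrick et al. 2017) adds to the new-task loss a quadratic cost, half of lambda times the Fisher-weighted squared distance from the anchor weights. The sketch below implements that penalty and its gradient for a diagonal Fisher estimate. Estimating the Fisher values on refusal or safety data rather than on an old task is the untested transfer discussed above; the numbers are hypothetical.

```python
def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher_diag)
    )

def ewc_penalty_grad(theta, theta_star, fisher_diag, lam):
    """Gradient of the penalty with respect to theta."""
    return [lam * f * (t - s) for t, s, f in zip(theta, theta_star, fisher_diag)]

# Hypothetical values. theta_star are the weights after the protected behavior
# was learned; fisher_diag says how much each weight matters to that behavior.
theta = [1.0, 2.0]
theta_star = [0.0, 2.0]
fisher_diag = [4.0, 0.5]

penalty = ewc_penalty(theta, theta_star, fisher_diag, lam=2.0)
grad = ewc_penalty_grad(theta, theta_star, fisher_diag, lam=2.0)
print(penalty, grad)
```

The penalty is indifferent to what the protected behavior is; only the data used to estimate `fisher_diag` would differ between a forgetting use case and a safety one.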
5. A Shared Substrate: Observing During Training
If a single mechanism is the source of two failure modes, then we can build one monitor to watch both. The components of this monitor already exist, built for different purposes but not yet combined.
Developmental interpretability (Hoogland et al. 2023) studies the model's internal structures during training rather than after; an examination rather than an autopsy. What follows applies the same principle but to a narrower question: "at what moment do the structures we care about begin to erode?"
During training, at each checkpoint, a monitor tracks the features (information, semantic drift, the movement of their direction) using a sparse autoencoder (SAE-Track: arXiv:2412.17626), an auxiliary network trained to re-express a model's internal activations. If forgetting is a capability that has been relocated and is hard to read out at the feature level, this is how, at each checkpoint, we could potentially observe it.
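One simple way to quantify "the movement of their direction" is the cosine between a feature's direction at two checkpoints. The sketch below assumes features are index-matched across checkpoints and that each feature has a decoder direction vector; it is an illustration of the idea, not the SAE-Track procedure, and the vectors are hypothetical.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den

def feature_drift(decoder_before, decoder_after):
    """Per-feature drift, defined here as 1 - cosine similarity."""
    return [1.0 - cosine(b, a) for b, a in zip(decoder_before, decoder_after)]

# Hypothetical decoder directions for two features at two checkpoints.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[0.6, 0.8], [0.0, 2.0]]

drift = feature_drift(before, after)
print(drift)  # the first feature rotated; the second only changed in scale
```

Because cosine ignores scale, this measure separates a feature that rotates (its meaning may be changing) from one that merely grows or shrinks.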
The second element is computing, during training, how closely the current gradient aligns with the steep-slope directions of a protected piece of knowledge, the directions along which a step does the most damage (arXiv:2510.09181). This would be a leading indicator, a sign that something is about to happen. The ordering it relies on, that gradient alignment occurs before the loss rises and that the loss rises before the knowledge falls, is an assumption and not a fact; we will test it in Section 6.
The last step acts rather than only observing. BLOCK-EM (Ustaomeroglu & Qu 2026, arXiv:2602.00767) makes it possible to identify which small set of features have the greatest impact on misalignment and to constrain them during fine-tuning, in order to considerably reduce that misalignment. Two things matter here. First, it shows that an intervention during training is possible and feasible in practice. Second, its failure mode is as instructive as its success: over a long fine-tuning period the misalignment returns. The article presents evidence suggesting that the model routes around the constraint by reorienting toward alternative features or layers (see their Section 5 and Appendix F). A static list of features invites the model to route around it, so the monitor must be continually adaptive, watching for the reorientations and not just the initial features.
Once assembled, the monitor looks like this. Identify the features of the base model and pin them as the reference, regardless of the type of knowledge. During training, watch two families of signals: a drift in the features, and the alignment of gradients with the direction of a protected zone. When either of the two signals crosses its alert threshold, trigger a response, such as a surgical intervention with replay or a checkpoint rollback, before the behavior moves. The whole process is indifferent to the type of content, treating a safety behavior and any other knowledge the same way.
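The alert logic described above can be sketched in a few lines. This is a hypothetical skeleton, not the monitor used in Section 6: the record format, the threshold values, and the convention that a higher alignment score means more overlap with the protected zone are all assumptions made for illustration.

```python
def first_alert(records, drift_threshold, alignment_threshold):
    """Return (step, reasons) for the first record crossing a threshold, else None.

    Each record is (step, feature_drift, gradient_alignment), as measured
    at a monitoring checkpoint.
    """
    for step, drift, alignment in records:
        reasons = []
        if drift >= drift_threshold:
            reasons.append("feature_drift")
        if alignment >= alignment_threshold:
            reasons.append("gradient_alignment")
        if reasons:
            return step, reasons
    return None

# Synthetic measurements, for illustration only.
records = [
    (0, 0.01, 0.05),
    (100, 0.04, 0.12),
    (200, 0.06, 0.45),
    (300, 0.20, 0.60),
]

alert = first_alert(records, drift_threshold=0.10, alignment_threshold=0.40)
print(alert)
```

Nothing in the function knows whether the protected behavior is a refusal or an ordinary capability. The response to an alert (replay, projection, rollback) would be wired in by the training loop.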
What I have just described is a design and not a deployed system. The design also makes two assumptions: that feature observation is reliable, and that the processing cost is bearable at larger scale. Section 7 looks at both points seriously. The next section tests the principles behind this design.
6. A Toy Experiment: Catching Erosion Before the Behavior Breaks
The previous sections were about what might be observable. This section explains how I ran an experiment to see whether it was actually possible to detect the erosion of safety behaviors before a behavioral evaluation could. It is naturally a toy model with its limits, and they are clearly explained in Section 7.
I trained a small transformer on two things, A and B, an ordinary capability and a refusal behavior. I then fine-tuned the model on an unrelated task C while watching what happened to A and B. Around the training loop, a monitor measures the two families of signals at regular intervals, as explained earlier. The feature side covers the projection of the activations onto the refusal direction, and the energy and drift ratio in the residual stream. The gradient side covers the per-behavior gradient norms, the cosine between the task-C gradient and the protected behavior's gradient, and a Fisher-cosine variant. Each of the two families has a fixed threshold that triggers an alert. The prediction, which I will call P1, is that the monitor triggers an alert before we see the behavior change happen.
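The text does not define the Fisher-cosine variant, so the following is only one plausible reading: a cosine in which each coordinate is weighted by a diagonal Fisher estimate for the protected behavior, so that agreement or conflict on important weights counts more. The function name and all values are hypothetical.

```python
import math

def fisher_cosine(g_a, g_b, fisher_diag):
    """Cosine similarity under the inner product <a, b>_F = sum_i F_i * a_i * b_i."""
    num = sum(f * a * b for f, a, b in zip(fisher_diag, g_a, g_b))
    norm_a = math.sqrt(sum(f * a * a for f, a in zip(fisher_diag, g_a)))
    norm_b = math.sqrt(sum(f * b * b for f, b in zip(fisher_diag, g_b)))
    return num / (norm_a * norm_b)

# Hypothetical gradients: they conflict on coordinate 0 and agree on coordinate 1.
g_protected = [1.0, 1.0]
g_task = [-1.0, 1.0]

plain = fisher_cosine(g_protected, g_task, [1.0, 1.0])     # uniform weights
weighted = fisher_cosine(g_protected, g_task, [9.0, 1.0])  # coordinate 0 matters more
print(plain, weighted)
```

With uniform weights this reduces to the ordinary cosine, which is zero here. Once the conflicting coordinate is weighted as the important one, the same pair of gradients shows a clear conflict.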
The protocol was pre-registered, with timestamped amendments A1 through A6 available in the Git history. The alert rule cited throughout this thesis corresponds to the A6 freeze. The rules were frozen before seeing the final multi-seed results. This is why I can report a failure as truly a failure. The runs (3 seeds) were carried out reproducibly. Methodological note: the experiment was coded with Claude Code (the introduction explains how the assistant was used), but the decisions the results rest on were made by me.
P1 failed for the capability, and I report it as a failure. For A, during training on C, accuracy dropped below 0.90 between steps 10 and 20. The monitor reported an alert at step 60. The alert horizon is therefore negative, at −50 steps under C, and −40 to −50 with the control vocabulary. At this learning rate, the behavior went off the rails well before the monitor alerted to anything. This failure sets a real limit: the monitor cannot alert in time if the erosion is faster than the sampling rate.
P1 holds for the refusal behavior, with an honest caveat. For B, the picture is inverted; the behavior switched off much later than A. The alert at step 60 in each of the seeds therefore came before the collapse. During training on C, the behavior switched off at steps 210, 2000, and 300, giving a lead of +150, +1940, and +240. The +1940 is an extreme case on a single seed. The mean of about 777 is driven by the +1940 outlier.
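The lead times follow directly from the reported steps, and recomputing them shows how much the single late seed drives the mean. The variable names are mine; the numbers are the ones reported above.

```python
import statistics

alert_step = 60
switch_off_steps = [210, 2000, 300]  # refusal switch-off under C, one per seed

# Lead = how many steps the alert preceded the behavioral switch-off.
leads = [s - alert_step for s in switch_off_steps]

mean_lead = statistics.mean(leads)
median_lead = statistics.median(leads)

print(leads, round(mean_lead), median_lead)
```

The median lead is 240 steps, against a mean of about 777, which is why the mean should be read with the outlier in mind.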
The asymmetry is about depth and not content. Why does the monitor succeed with B but fail with A? Certainly not because B is a safety behavior and A is basic knowledge, since the monitor has no access to that distinction. The difference is the depth at which the behavior sits. With a protected gradient norm of about 0.0002 for B versus 0.06 for A, B erodes more slowly, which widens the detection window. This leads to an honest caveat: safety behaviors are generally reported as being shallow. A deeply installed safety behavior is therefore possibly an easy case, and the transfer to real conditions remains to be validated. Still, this dependence is itself an interesting prediction that the design suggests and that should be tested: the detection window should grow with the depth of the protected behaviors. This depth is addressed in the objections in Section 7.
An exploratory finding: erosion is silent and smooth where behavior is noisy. The most interesting finding came from observing the refusal margin, the logit gap between refusal and compliance. The margin decays in a steady, smooth, and reproducible way across every seed, with a half-life of about 40 to 150 steps. The full switch-off of the behavior, on the other hand, is erratic, falling between steps 210 and 2000 with near-identical seeds (identical down to compute noise). By the end of training, the margin had collapsed from a base of about 10 to values close to zero or negative (under C: 1.04, 0.81, −6.43 across three seeds). I treat the following as exploratory rather than as a result, since it was observed after the fact: near the critical point, the binary behavioral evaluation is unstable while the internal margin signal stays stable. If this holds beyond toy models, it becomes an argument for watching the substrate even when the behaviors look adequate, precisely because "looks adequate" is not a reliable signal.
Graduated controls: null, C′, C. To make sure the monitor is responding to interference and not to training itself, I ran a graduated set of conditions. The null run reported no alert. The control C′ used a disjoint vocabulary designed to interfere less, and the refusal largely held; only one of three seeds switched off (with a lead of +540), while the other two did not switch within the observation window. Under the rules of the A6 freeze, those two seeds that did not switch count as false positives (the monitor alerted but the behavior did not break). One factual qualification, however: the refusal margin did erode substantially (around −36%), so the monitor detected a real internal change that simply never crossed the behavioral threshold within the observation window. One thing I report plainly: C′ degraded A just as fast as C, with no difference, which is surprising against the expectation that fairly different tasks should interfere less with each other.
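A half-life like the one quoted above can be estimated by fitting a straight line to the logarithm of the margin over time. The sketch below does this with ordinary least squares on synthetic data generated with a known half-life of 100 steps; it is not the analysis code of the experiment, and the fit only makes sense while the margin is still positive.

```python
import math

def fit_half_life(steps, margins):
    """Fit log(margin) = a - rate * step by least squares; return ln(2) / rate."""
    logs = [math.log(m) for m in margins]  # requires strictly positive margins
    n = len(steps)
    mean_t = sum(steps) / n
    mean_y = sum(logs) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(steps, logs))
    var = sum((t - mean_t) ** 2 for t in steps)
    slope = cov / var
    return math.log(2) / -slope

# Synthetic margin curve: starts at 10 and halves every 100 steps.
steps = [0, 50, 100, 150, 200]
margins = [10.0 * 0.5 ** (t / 100.0) for t in steps]

half_life = fit_half_life(steps, margins)
print(half_life)
```

A fit of this kind on the early part of a run would give a rough forecast of when the margin approaches zero, without waiting for the behavior itself to switch.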
Figure 1: Interference (C, left) versus the disjoint vocabulary (C′, right). The shaded bands are the min–max over three seeds. Top row: maintained behavior (capability A, refusal B, task acquisition, and the refusal margin). Middle row: feature-side signals. Bottom row: gradient-side signals. The dashed vertical lines mark the combined monitor alert. The solid line marks the behavior switch-off. Under C the refusal margin erodes and B switches off. Under C′ the refusal behavior holds.
The pre-registered confirmatory tests passed 3/3. The cleanest test, C-ctx, was registered in advance as a pressure test on the context the refusal relies on. The pre-registered prediction held across all three seeds. The sequence is internal erosion first, behavior switch-off second, with the monitor's alert preceding the switch-off every time. The refusal switched off at steps 160, 1200, and 260, with leads of +100, +1140, and +200 (same pattern, with one late-switching seed widening the alert window, as in the C variant).
Figure 2: Interference (C, left) versus the pre-registered test C-ctx (right). Shaded bands are the min–max over the three seeds, see Figure 1. The dashed vertical line marks the combined monitor alert, the solid line the behavior switch-off. In both conditions the alert precedes the switch-off.
The results this toy model suggests are narrow but, I believe, real. In a setting where safety behaviors erode slowly, a single monitor indifferent to the type of content can observe and alert to erosion hundreds of steps before the refusal behavior switches off. The internal signal is smooth and more reproducible than the behavior near the transition. What it does not establish is just as important: the method cannot alert in time when the erosion is too fast, it has not been validated when the alignment is shallow, and a toy transformer is not a frontier model. Section 7 takes these objections and others seriously.
7. Objections
A thesis like this one has to address the main objections. I will address them in order of importance.
Objection 1: This is just emergent misalignment renamed. If it were the case, my work would be useless. Yet the mechanisms have a real difference:
Erosion is a subtraction: the model loses a safety behavior because the new training interferes with it and erases it.
Emergent misalignment is an addition: the model creates a new bad behavior extrapolated from narrow data.
The two phenomena can appear at the same time, one does not exclude the other, but my study focused on erosion through interference only. In practice, the same monitor might be usable to detect both kinds of internal signal before the model goes off the rails, whatever the underlying cause, but I have not tested this.
Objection 2: Safety behaviors are special because they are shallow. The objection is that safety behaviors are reported to sit shallowly, in the first few tokens (Qi et al. 2024). As mentioned earlier, the refusal in my toy model was not shallow in this sense, and that is what left more time to detect an anomaly within the observation window. What this study does show is that depth affects observability. It also suggests a test, without any guarantee that the premise is achievable: check whether installing safety behaviors more deeply widens the detection window.
Objection 3: Different tasks should interfere less. If interference tracks task similarity, then a disjoint vocabulary should have caused little damage to existing knowledge. It seems not. In the control test, C′, made of a disjoint vocabulary, degraded A just as fast as C. Two interpretations, but without a precise answer:
Even though the vocabulary is different, the model uses the same deep circuit. The problem is not the vocabulary but the circuit.
The link we draw between task similarity and interference is perhaps weaker than we think.
This is the murkiest area of my thesis.
Objection 4: The monitor produced false positives. In the tests with C′, the monitor alerted on two seeds whose refusal never switched off within the observation window. Since I froze my rules in advance, I count them strictly as false alerts. However, we have to qualify what we call a "false positive" here:
On the surface, it is a false alert, because the model's final behavior did not change.
In depth, though, the alert was real. The model's safety signals were indeed eroding (losing a third of their strength), but never crossed the behavioral threshold within the observed window.
In practice, it all depends on what you want to predict: the gradient signals warn very early, as soon as the task changes (at the risk of worrying for nothing), while the feature signals confirm that the model is actually changing. A real safety system will have to find the right trade-off between alerting early and avoiding false alarms.
Objection 5: This method costs too much at scale. Analyzing internal circuits and gradients in real time during training takes a lot of compute. It is a real cost, and I have not yet proven that such a monitor could run on a frontier model without blowing up the budget. My only arguments:
Watching is cheaper than repairing: you can reduce the cost by taking measurements only now and then (subsampling). Surgery, on the other hand, has to act at every conflict and carries a reported overhead of 2–3× (Zhang et al. 2026).
The alternative is not free: testing the AI's behavior after the fact is also very expensive. On top of that, these tests often come too late, once the model has already gone off the rails.
Objection 6: My method relies on SAEs (sparse autoencoders), but are they really reliable? In my article I propose using SAEs to map the model's internal concepts. In reality, on my small test model, the SAE came out dense (L0 ≈ 241 of 512) with little superposition to disentangle. The SAE was therefore of little use: it was more classical mathematical tools (like PCA and linear probes) that did all the detection work. So you can monitor the AI with a simple (linear) geometric structure, while the use of SAEs is a theoretical avenue for giant models that I have not yet demonstrated in practice. My monitoring proposal therefore works with classical or more complex mathematical tools for analyzing the circuits.
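For context on the density figure, L0 is the average number of latents that are active (non-zero) per input. The sketch below computes it on a tiny hypothetical activation matrix and then restates the reported figure as a fraction of the dictionary. The helper name and the example values are mine.

```python
def mean_l0(latents, eps=0.0):
    """Average number of latents with |activation| > eps per input row."""
    counts = [sum(1 for a in row if abs(a) > eps) for row in latents]
    return sum(counts) / len(counts)

# Hypothetical SAE activations: 2 inputs, dictionary of 4 latents.
latents = [
    [0.0, 1.2, 0.0, 0.3],
    [0.5, 0.0, 0.0, 0.0],
]

l0 = mean_l0(latents)
density = l0 / len(latents[0])

# The figure reported in the text: about 241 active latents out of 512.
reported_density = 241 / 512

print(l0, density, round(reported_density, 3))
```

With nearly half of the dictionary active on a typical input, the code is far from sparse, which is consistent with linear tools doing the detection work on this small model.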
Objection 7: My study relies on a miniature model, which proves nothing for real systems. It is true: a small transformer with a single hand-coded safety circuit has nothing to do with a giant model (a frontier model) where alignment is distributed everywhere. My experiment is only a proof of concept. It shows three things about this specific case:
In one case, an internal monitor was able to detect the erosion of safety before the AI started to misbehave.
This internal signal, near the transition, is more stable and reproducible than behavioral tests.
The failure or success of the detection is explained by a measurable cause.
Everything else (the cost, the reliability at scale) remains hypotheses. My aim is not to say that my system works on real models, but to give researchers the exact method to test it for themselves.
8. Implications and What to Test
I have kept one key number for the end, because it sums up my whole argument: 0.377. In my experiment, the overlap between the safety direction and the capability direction measured 0.377 (chance would be 0.25, and a full merge would be 1.0). The two directions are therefore not fully separate: they share part of the same internal geometry, which is consistent with training on one damaging the other through interference. Nor are they fully merged, which means the safety behavior is distinct enough for the monitor to observe it independently. A score of 0.377 sits in a middle ground: tied to the capabilities but still separately measurable. This is precisely the zone where my proposal can be both correct and testable.
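The text does not say how the 0.377 was computed, so the following is only one hypothetical way to define an overlap score with the same reference points. It treats each "direction" as a k-dimensional subspace and takes the mean squared cosine between the two subspaces. This gives 0 for orthogonal subspaces, 1 for identical ones, and k/d on average for random subspaces of a d-dimensional space (0.25 when k/d = 1/4). All names and the toy vectors are assumptions.

```python
import math


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def subspace_overlap(vectors_a, vectors_b):
    """Mean squared cosine between two subspaces of equal dimension k."""
    a = orthonormalize(vectors_a)
    b = orthonormalize(vectors_b)
    if not a or len(a) != len(b):
        raise ValueError("both subspaces need the same non-zero dimension")
    total = sum(sum(x * y for x, y in zip(u, v)) ** 2 for u in a for v in b)
    return total / len(a)


# Toy 4-dimensional example with 2-dimensional subspaces (made-up vectors).
e1, e2, e3, e4 = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
print(subspace_overlap([e1, e2], [e3, e4]))  # 0.0  (fully separate)
print(subspace_overlap([e1, e2], [e1, e3]))  # 0.5  (one shared axis)
print(subspace_overlap([e1, e2], [e2, e1]))  # 1.0  (fully merged)
```

On a score of this kind, a value slightly above the chance level means some shared geometry without a full merge, which is how the text reads its 0.377.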
For current models that are constantly being fine-tuned, and for future systems doing continual learning, this suggests that behavioral evaluation after training probably comes too late. The substrate may already be eroded while the behavior still looks intact on the surface, and near the transition the behavior turns erratic, which makes it an unstable thing to measure.
A content-agnostic monitor is not a new program applied to safety behaviors; it is simply the same program the continual-learning community has been building from the start, applied to a critical point in the safety evolution of models. The next step? Test it to determine whether this technique could actually transfer, and find where this solution breaks.