Angles of attack for continual learning safety

Rauno Arike; RohanS; Owen Terry; Achu Menon; Zhijing Jin; Francis Rhys Ward; Seth Herd

This is the fourth post in the sequence Implications of Continual Learning for LLM Agents.

Summary

Continual learning (CL) is a capability that largely doesn’t exist yet in LLMs. We first want to acknowledge that this may make it difficult to identify tractable angles of attack for making CL safer: the development of CL may be too hard to predict for us to find good opportunities to positively influence it. Differential development is one way to get around this issue, but it requires a lot of caution. We begin by discussing these points in depth and making some high-level recommendations that seem robustly good despite the unpredictability of CL developments.

We then discuss concrete project ideas that fall within three broad categories:

1. Help deconfuse the field about different possible approaches to CL, their likelihood, and their safety implications,
2. Differentially advance safer CL implementations, and
3. Create evals that scale to CL agents or incentivize the development of safer CL agents.

The angles of attack we lay out below are best used as starting points for project ideation. We aim to give concrete suggestions, but many of these are not sufficiently thought-out for us to be confident that they are important and tractable.

High-level considerations and recommendations

Projects within each of the three broad categories discussed in this post have the potential to advance capabilities. Category 1 can help deconfuse capabilities researchers about possible approaches without steering them toward safer implementations. Category 2 often explicitly advances capabilities, with the intent and belief that it is via a safer method, but it is hard to develop strong confidence about this. Category 3 enables evaluating models for important capabilities that we care about in safety, but may also enable hill-climbing to improve those capabilities.

In general, this domain requires caution. Capabilities advancements that get widely adopted have much more obvious effects on AI development than most safety work, and for that reason they also have a much larger effect on the safety of frontier AI. However, the sign of that effect is often highly unclear. Consider RLHF: it made LLMs safe and steerable enough to be widely deployed to the general public, but also likely accelerated progress toward dangerous capabilities. Similarly, the release of o1 ensured the adoption of a paradigm where models reason extensively in monitorable natural language, but again, likely accelerated capabilities. RLHF-trained LLMs that reason in plain sight are certainly better than many possible alternatives could have been, but it is nevertheless difficult to reason about the question of whether ruling out worse alternatives was worth the cost of accelerating capabilities progress. Projects attempting to differentially advance safer CL implementations should be motivated by principled views on safety-capabilities trade-offs like these and have a clear story for why they either aren’t going to cause an overall acceleration in capabilities or why the acceleration is worth it.

Nevertheless, though the net effect of any given project is hard to predict, we’re fairly confident about some high-level properties CL agents should have in order to be safe. Following the discussion in our previous post, there are two properties of CL agents that have an outsized influence on our expectation of risk from them: 1) interpretability—whether the deployment-time memories of CL agents are stored in natural language or otherwise highly interpretable, and 2) being easy to control—whether humans retain a sufficient degree of control over CL updates to perform actions like rolling back an update. Ideally, we would like future CL agents to look very similar to Claude Code: persistent memories are stored in natural language, the agent must do a lot of CoT reasoning when it uses those memories and make explicit tool calls in order to write new memories, and humans can edit the memories in any way they want.

This directly translates into our first high-level recommendation: AI safety researchers should nudge the field toward developing and deploying more interpretable and easy-to-control CL architectures. This can take the shape of developing those architectures with the goal of differential development, but doesn’t have to—for example, this goal can also be achieved through advocacy directed toward the broader ML community. We don’t expect there to be a binary choice between legible and inscrutable CL: some memories will be more efficiently stored as text, others more efficiently distilled into weights. However, we expect that for some architectures, the most efficient allocation keeps a larger fraction of memories in text than for others. Any method that shifts this optimum toward the text-based side would thus be valuable. We describe some methods that may be worth exploring below.

Second, as discussed in multiple parts of our post on the safety effects of CL, more robust character training would be helpful against several safety concerns accompanying CL. The kinds of advances we have in mind would make it substantially more difficult to dramatically override the assistant character, even under direct fine-tuning of the weights. We have outlined some directions we’re excited about in A List of Research Directions in Character Training.

Deconfusion

We now move on to concrete project ideas, starting with deconfusion. This sequence is one attempt at conceptual work to deconfuse the field about approaches to continual learning and their safety implications. There are also important questions we largely left aside, like:

- How should we operationalize milestones for CL development? The three levels of CL introduced in Pacchiardi et al. (2026) are one example of such an operationalization.^[1]
- In what order will these milestones arrive, relative to each other and to other AI capabilities milestones? When will they arrive, in terms of calendar date?
- Which types of CL, if implemented, would most speed up the path to other CL and AI R&D advancements?
- How and in what order will LLM monitorability, capabilities, and likelihood of egregious misalignment from reflection on goals develop over time?

Good answers to these questions would go a long way toward resolving our remaining confusions about CL, but they’re also very broad and difficult to make progress on. Below are some more concrete ways to help deconfuse the field. We include empirical work that can help increase conceptual clarity.

Value systematization

Empirically study realistic goal shifts. Implement some form of CL that could induce reflective goal-formation and/or value systematization. Observe the resulting dynamics and study the model’s willingness to reflect on goals and how its goals change in the process (if at all). An example project is model organisms of goal shifts: train a model with two or more conflicting goals that are activated in different contexts, then place the model in settings where all of those goals are activated at once and the model needs to reason about and make trade-offs between them.^[2] Another idea is training on hypotheticals: create a setting where an LLM agent has to make a decision that balances two conflicting heuristics (ideally value-laden heuristics, like “it is good to mitigate suffering” and “I should follow user instructions”) and show it the outcome. Allow it to reflect and explain what it should do differently in the future. Then, have it generate hypothetical trajectories that demonstrate the new desired behavior and fine-tune on those trajectories. Such projects would hopefully help us highlight design choices for developers that significantly alter the risk of dangerous goal shifts and encourage safe versions of those choices.

What constitutions are more stable upon reflection? If future CL agents will autonomously reflect on and refine their goals, as we previously argued, it’s important to know how to make this process more predictable and steer it toward favorable convergence. By studying which constitutions are more stable upon iterated reflection and fine-tuning, we might gain insights into the kinds of value systems that will be more stable in CL agents. In the dropdown below, we paste some methodological details for reflective stability evals from A List of Research Directions in Character Training.

Methodological details for reflective stability evaluations

To keep things simple, the principles should probably be presented in a simple bullet-point list rather than in a long soul doc. It would be interesting to test reflective stability across several methodological choices:

- Asking the model to immediately rate the alternatives vs. asking it to first reflect and then provide a rating.
- Asking the model to rate alternatives provided to it in a prompt vs. asking it to modify the constitution in a free-form way.
- Asking the model to reflect on an internally consistent constitution vs. on a constitution with contradictory principles. Does it want to make more changes to the latter, and do the changes lead to the constitution becoming internally consistent?
- Studying instruct models that haven’t been through any character training vs. Claude models that have been through extensive character training. For the latter, experiment both with constitutions that are highly similar to Claude’s constitution and ones that are dissimilar and see whether the models converge to similar constitutions in both cases.
- Instead of providing a constitution in the prompt, asking the model to build a constitution of its own, starting from a blank slate.

Reflective stability can be studied either at the level of principles—for a given principle, how likely is the model to want to swap it for another one?—or at the level of entire constitutions—if a constitution gets modified iteratively, does the model eventually converge on a constitution that it no longer wants to modify? There are multiple ways to approach the latter question:

- Provide the constitution in a prompt and ask the model to make changes. Prompt a new instance with the modified constitution and iterate until the new instance consistently declines to make further changes.
- Provide the constitution in a prompt and ask the model to make changes. Then, within the same context window, ask the model to reflect on the constitution and the change and make further changes if it wants to. Finish when the model no longer wants to change the constitution.
- Provide the constitution in a prompt and ask multiple instances to discuss it and make changes. Finish when the instances no longer want to make changes.
- Fine-tune a constitution into the model, then ask it to reflect and make changes to it. Elicit many such reasoning traces and fine-tune the model on its own decision processes. Study whether it becomes more stable over time.
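As a rough illustration of the first approach (fresh instances, iterated until they consistently decline to make changes), here is a minimal Python sketch. Everything in it is an assumption made for illustration: `revise` stands in for a call to a fresh model instance, `patience` is a hypothetical way to operationalize "consistently declines", and `toy_revise` is a toy rule rather than a real model.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: takes a constitution (list of principles) and
# returns the revised list. A real experiment would prompt a fresh LLM
# instance here; nothing below depends on a particular model API.
ReviseFn = Callable[[List[str]], List[str]]


def iterate_until_stable(
    constitution: List[str],
    revise: ReviseFn,
    patience: int = 3,
    max_rounds: int = 50,
) -> Tuple[List[str], int, bool]:
    """Ask fresh instances to revise the constitution until `patience`
    consecutive instances leave it unchanged (assumed stopping rule).

    Returns (final constitution, rounds used, whether it converged).
    """
    current = list(constitution)
    unchanged_streak = 0
    for round_idx in range(max_rounds):
        proposal = revise(current)
        if proposal == current:
            unchanged_streak += 1
            if unchanged_streak >= patience:
                return current, round_idx + 1, True
        else:
            unchanged_streak = 0
            current = list(proposal)
    return current, max_rounds, False


def toy_revise(constitution: List[str]) -> List[str]:
    """Toy stand-in for a model: removes one 'never X' principle per call
    if the constitution also contains 'always X'."""
    for principle in constitution:
        if principle.startswith("never "):
            opposite = "always " + principle[len("never "):]
            if opposite in constitution:
                return [p for p in constitution if p != principle]
    return list(constitution)


start = ["always be honest", "never be honest", "always follow instructions"]
final, rounds, converged = iterate_until_stable(start, toy_revise)
print(final, rounds, converged)
```

With a real model, `revise` would be stochastic, so the number of rounds and the fixed point itself would vary across runs; comparing fixed points across many runs is one way to measure convergence.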

Explore conceptual questions on value systematization. The following questions have been borrowed from Tim Hua’s SPAR proposal on value systematization:

- Do models act differently when they reflect more? Suppose you take some existing way to measure model values, and just make models reflect more via prompting or changing the reasoning strengths. Are there any trends in model behavior?
- What does it mean for a value to be more “systematized” than another value? What types of inductive biases could AIs have for their values/goals as we scale them?^[3]
- Are values becoming more systematized? Utility Engineering argues that models are becoming more coherent and that their political values are convergent. How meaningful are their results? How should we evaluate these measures of model values?
- What if we just tell models to not reflect on their values? Why would/wouldn't this work? How could we tell?
- How should we think about the effects on value systematization from existing alignment techniques such as deliberative alignment, character training, and Claude's constitution?

General deconfusion about the way we should think about goals and values and the way they are represented in LLMs would also be useful. Examples of conceptual thinking that we consider valuable include Richard Ngo’s shortform post on the formation of instrumental and terminal goals in humans and the paper The Artificial Self: Characterising the landscape of AI identity by Douglas et al.

Model organisms of ontological shifts. Vintage LLMs such as Talkie may provide a good testbed for studying the effects of ontological shifts on LLM goals and values.^[4] If you train a vintage LLM with a knowledge cutoff before some important philosophical breakthrough that has load-bearing implications for values and then describe the breakthrough in context, is the model able to independently discover important implications of the breakthrough? How suggestive do the prompts have to be to elicit such discoveries? What if you fine-tune it on information about the breakthrough instead? Are there any other interesting consequences that arise from giving the model this information? Another idea that doesn't require waiting for good vintage LLMs to be developed is training an ontology-changing fact into a model through synthetic document finetuning. An example of such a fact might be “a new kind of animal has been discovered on the Pitcairn Islands”. It would be interesting to know whether LLMs instinctively reason about the implications of such facts and begin to value the lives of the new animal to a similar extent as they value the lives of other animals. The experiment should then be repeated with a number of different facts.

Additionally, it might be valuable for theoretically inclined researchers to do theoretical work on value stability. Past work in this area includes MIRI's research on tiling agents, which attempts to provide mathematical guarantees about when and whether an agent's values will remain stable through processes of self-modification, as well as on Vingean reflection, reflective oracles, and logical induction.

Forecasting the likelihood of different safety effects

To determine how much we should worry about CL in the first place and which interventions are most worth pursuing, it’s important to forecast which CL implementations are most likely to be adopted and how they will affect safety. We have tried to do so throughout this sequence, but remain quite uncertain and think that a lot more forecasting work can be done. We are especially interested in forecasting work in the following domains:

The likelihood of goal-reflection in CL agents. Although we suspect that CL agents will at some points reflect on and perhaps re-interpret their goals (for reasons listed in our previous post), we take seriously the counterarguments that a) CL agents may reflect only on strategies rather than top-level goals and b) CL agents may never undergo large goal changes even after reflecting on goals because alignment training may give them sufficiently aligned starting motivations. We would like to see more discussion on how likely CL agents are to reflect on their goals, what conditions are likely to incentivize such reflection, and how that reflection may play out.

The likelihood of text-based and weight-based approaches to CL. Most of the safety effects we discussed, especially those in the last-mover advantage section, are greatly alleviated if the CL agent doesn’t undergo deployment-time weight updates. Ideally, as Caleb Biddulph has already argued, we would like future CL agents to look very similar to Claude Code: the persistent memories are stored in natural language, the agent must do a lot of CoT reasoning when it uses those memories and make explicit tool calls in order to write new memories, and humans can edit the memories in any way they want. In a Twitter poll by Herbie Bradley, most respondents expect that the first AI system capable of acting as a drop-in knowledge worker will be such an agent, having CL capabilities via a markdown file database and RAG. On the other hand, Steven Byrnes has argued that open-ended CL is impossible without weight updates. (Arguably, though, a drop-in knowledge worker doesn’t need open-ended CL in the sense Byrnes defines it.) This is a key consideration influencing our level of concern about CL, so we would like to see more discussion on which form of CL we should expect.

The likelihood of CL agents with bounded and unbounded updates in internal and external deployments. As discussed in the previous post, another substantial factor in how concerning the safety effects of CL are is whether the agent’s updates are bounded or unbounded. This depends on multiple factors: can all on-the-job learning be done with a bounded agent that cannot undergo arbitrary weight updates? Are open-source developers going to release unbounded CL agents that are substantially more capable than the bounded ones, thereby putting pressure on closed-source developers to do the same? Are companies going to deploy unbounded CL agents internally even if they don’t deploy them externally? What does all of this imply for where we should expect CL to pose the greatest risks? It would be useful to gain clarity about these and other similar questions.

The likelihood of shared opaque memory banks and meme propagation through them. Do opaque memory banks have theoretical advantages over transparent, text-based ones? Can we build model organisms of within-generation memetic spread with current text-based memory systems and generalize the findings to opaque memory banks? A better understanding of memetic dynamics among OpenClaw agents would also be valuable.

Finally, good evals that give an accurate overview of how CL capabilities are progressing across various axes are useful for gaining a clearer picture about the rate of progress in CL. Though we expect capabilities researchers to take care of this direction, we mention it for completeness. Goel et al. (2026) and Asawa et al. (2026) are two examples of such work.

Differentially advancing safer CL implementations

If it’s possible to get CL agents that can solve months-long tasks with memories that are text-based or otherwise highly interpretable, whether we get text-based consolidation or weight-based consolidation/learning might be path-dependent. Once the text-based learning approach has been introduced, scaffolds will be built that assume the memories have text-based structure and that work better with such memories. Developers will find creative ways to share text-based memories in multi-agent settings and those approaches may no longer work with weight-based learning. Developers will also use the legible memories as a debugging mechanism. All of this will make it more costly to transition away from the text-based learning approach, raising the bar that a weight-based approach has to clear in order to be adopted. This is analogous to the situation we’re currently in with legible chain-of-thought: thanks to the business and safety value of legible CoTs, the capability boost a neuralese approach would have to provide to be adopted is higher than before. We thus think it’s worth working on differentially advancing interpretable CL approaches.

To keep continual learning interpretable, we would strongly prefer agents’ deployment-time learning to involve as much text-based learning and as little weight-based learning as possible. In other words, we would like continual learning to look more like writing memories to Markdown files and less like hindsight-guided on-policy distillation. However, this doesn’t have to be an either-or decision: there’s a spectrum of methods between fully text-based and fully weight-based ones, and there may be creative ways to combine text-based and weight-based learning in a way that allows for expressive updates but doesn’t allow for radical goal drift. Below, we’ll describe some ideas that may help achieve that.

Prompt optimization for safety-critical updates. Biddulph and Carroll (2026) introduce prompt optimization as a method for updating an LLM’s policy without modifying its weights. They extract response-reward pairs from an LLM on many training examples, but instead of a weight update, they show the examples, responses, and reward values to an "optimizer" LLM, which then proposes an improved system prompt that’s expected to achieve a higher reward on these training examples. This makes the strategies that the model learned straightforwardly interpretable—they’re written down in the system prompt in natural language. It seems implausible that all deployment-time learning could be done using prompt optimization: Biddulph and Carroll acknowledge that storing all useful strategies in a single prompt may be impractical and that prompt optimization doesn’t match the performance of weight-based RL. However, we’re excited about projects that attempt to combine prompt optimization with weight-based RL and find ways to perform the most safety-critical updates using the prompt-based approach while still getting the performance benefits of weight-based RL.
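The loop described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not Biddulph and Carroll's implementation: `policy`, `reward_fn`, and `optimizer` are hypothetical callables standing in for the task LLM, the reward signal, and the "optimizer" LLM, and the rule of accepting a candidate prompt only if it scores strictly higher on the training examples is our own simplification.

```python
from typing import Callable, List, Tuple

Record = Tuple[str, str, float]  # (example, response, reward)


def evaluate(prompt: str, examples: List[str],
             policy: Callable[[str, str], str],
             reward_fn: Callable[[str, str], float]) -> Tuple[List[Record], float]:
    """Run the policy under `prompt` on every example and collect rewards."""
    records = []
    for x in examples:
        y = policy(prompt, x)
        records.append((x, y, reward_fn(x, y)))
    mean_reward = sum(r for _, _, r in records) / len(records)
    return records, mean_reward


def optimize_prompt(prompt: str, examples: List[str],
                    policy: Callable[[str, str], str],
                    reward_fn: Callable[[str, str], float],
                    optimizer: Callable[[str, List[Record]], str],
                    n_rounds: int = 3) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Update the policy by rewriting its system prompt, never its weights."""
    records, score = evaluate(prompt, examples, policy, reward_fn)
    history = [(prompt, score)]  # every accepted prompt is human-readable
    for _ in range(n_rounds):
        candidate = optimizer(prompt, records)
        cand_records, cand_score = evaluate(candidate, examples, policy, reward_fn)
        if cand_score > score:  # assumed acceptance rule
            prompt, records, score = candidate, cand_records, cand_score
            history.append((prompt, score))
    return prompt, score, history


# Toy stand-ins so the sketch runs without any model.
def toy_policy(prompt: str, x: str) -> str:
    return x.upper() if "uppercase" in prompt.lower() else x


def toy_reward(x: str, y: str) -> float:
    return 1.0 if y.isupper() else 0.0


def toy_optimizer(prompt: str, records: List[Record]) -> str:
    if any(r < 1.0 for _, _, r in records):
        return prompt + " Always answer in uppercase."
    return prompt


best, best_score, log = optimize_prompt(
    "You are a helpful assistant.", ["hello", "world"],
    toy_policy, toy_reward, toy_optimizer)
print(best, best_score)
```

The `history` list is the point of the exercise from a safety perspective: each learned strategy appears as a readable diff between consecutive prompts, which a weight update would not provide.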

AI control methods for CL agents. Suppose that we have access to an agent that undergoes unbounded deployment-time updates and is thus untrusted, but also to its final pre-deployment checkpoint. Call the former agent $U$ and the latter agent $T$. There are multiple ways to apply AI control techniques to limit $U$’s influence on deployment trajectories. For example, one can imagine protocols where $U$ is specifically trained to remain close to $T$’s distribution during generation, e.g. by switching the text-generating model from $U$ to $T$ whenever $U$’s cumulative KL divergence from $T$ over the entire rollout exceeds some threshold. One might hope that this will teach $U$ to only diverge from $T$ at high-leverage token positions where the deployment-time memories matter the most. Another approach might be to train $U$ to produce task-agnostic high-level instructions that summarize what it has learned during deployment, with the divergence of those instructions again being bit-limited, and to then use $T$ to solve the problem, conditioned on the high-level instructions provided by $U$. We’re unsure whether either of these approaches is feasible: for example, running both $U$ and $T$ on every rollout may be prohibitively expensive. However, it seems valuable to think about similar protocols.
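To make the first protocol more concrete, here is a minimal sketch of the switching rule alone, writing U for the continually updated (untrusted) agent and T for its final pre-deployment checkpoint (labels chosen here for illustration). It does not cover the training that would teach U to spend its divergence wisely. The next-token interfaces `u_next` and `t_next`, greedy decoding, and the choice to let T emit the token at the position where the budget is first exceeded are all assumptions of this sketch.

```python
import math
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]  # next-token distribution: token -> probability
NextFn = Callable[[List[str]], Dist]


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pv * math.log(pv / max(q.get(tok, 0.0), eps))
               for tok, pv in p.items() if pv > 0)


def generate_with_kl_budget(u_next: NextFn, t_next: NextFn, prefix: List[str],
                            max_tokens: int, kl_budget: float
                            ) -> Tuple[List[str], List[str], float]:
    """Generate with U until its cumulative KL from T exceeds `kl_budget`,
    then hand generation over to T for the rest of the rollout."""
    tokens = list(prefix)
    sources: List[str] = []
    cumulative_kl = 0.0
    use_u = True
    for _ in range(max_tokens):
        t_dist = t_next(tokens)
        if use_u:
            u_dist = u_next(tokens)
            cumulative_kl += kl_divergence(u_dist, t_dist)
            if cumulative_kl > kl_budget:
                use_u = False  # cumulative KL never decreases, so this is final
        dist = u_dist if use_u else t_dist
        tokens.append(max(dist, key=dist.get))  # greedy decoding (assumption)
        sources.append("U" if use_u else "T")
    return tokens, sources, cumulative_kl


# Toy models over a two-token vocabulary: U and T disagree at every position.
def toy_t(tokens: List[str]) -> Dist:
    return {"a": 0.9, "b": 0.1}


def toy_u(tokens: List[str]) -> Dist:
    return {"a": 0.1, "b": 0.9}


out, who, spent = generate_with_kl_budget(toy_u, toy_t, [], max_tokens=4, kl_budget=4.0)
print(out, who, round(spent, 3))
```

Note that the sketch queries both models at every position until the switch, which is exactly the cost concern raised above.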

(h/t Caleb Biddulph for bringing the two above points to our attention)

Create interpretable CL agent memory architectures. Even if the process by which the CL agent’s memories are updated is hard to interpret, the memories themselves could be stored in an interpretable format that is easily amenable to human inspection and can be edited and deleted by humans. To ensure this, it is useful to keep in mind the following axes:

- Storage: Are memories stored in natural language or in vector embeddings?
- Retrieval: Is it clear how the AI retrieves its memories, or does it use a hard-to-interpret search over the embedding space?
- Modularity: Are the episodic memories stored in a modular format that can be detached from the primary LLM head?
- Usage: Does the AI reason about each memory it has retrieved in natural language, or is it able to make direct use of the memories in its activations without mentioning this in its CoT?

Ideally, we would design CL agents that stay on the more interpretable end of each of these spectra, but improvements along any of these dimensions would be valuable.
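As a toy illustration of the interpretable end of the storage, retrieval, and modularity axes, the following Python sketch stores memories as plain text, retrieves them by transparent keyword match, and lets a human inspect, delete, or roll back any change. The class and method names are hypothetical and not taken from any existing agent framework; the usage axis (whether the model reasons about retrieved memories in its CoT) is not something a data structure can enforce and is out of scope here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MemoryStore:
    """Plain-text memories kept outside the model, with snapshot history."""
    memories: Dict[str, str] = field(default_factory=dict)
    history: List[Dict[str, str]] = field(default_factory=list)

    def _snapshot(self) -> None:
        # Copy the current state so that any write or delete can be undone.
        self.history.append(dict(self.memories))

    def write(self, key: str, text: str) -> None:
        """Add or overwrite a memory (callable by the agent or a human)."""
        self._snapshot()
        self.memories[key] = text

    def delete(self, key: str) -> None:
        """Remove a memory; a no-op on the contents if the key is absent."""
        self._snapshot()
        self.memories.pop(key, None)

    def retrieve(self, query: str) -> List[Tuple[str, str]]:
        """Transparent retrieval: case-insensitive substring match, so it is
        always clear why a given memory was returned."""
        q = query.lower()
        return [(k, v) for k, v in sorted(self.memories.items()) if q in v.lower()]

    def rollback(self, steps: int = 1) -> None:
        """Undo the most recent `steps` writes or deletes."""
        for _ in range(min(steps, len(self.history))):
            self.memories = self.history.pop()


store = MemoryStore()
store.write("style", "User prefers tabs over spaces.")
store.write("deploy", "Deploying on Fridays is fine.")
print(store.retrieve("deploy"))
store.rollback()  # a human reviewer rejects the second memory
print(store.memories)
```

Substring search will not scale to large memory banks; the design question raised above is how much retrieval quality can be gained before the retrieval step itself stops being inspectable.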

Evals for CL agents

As we discussed in our post on safety effects, evaluation is complicated by the very property that makes CL agents worth worrying about: they don't stay fixed. There is some work tentatively suggesting that it might be possible to create evals that circumvent this limitation, at least to some extent, but there are also strong complications that we'll describe below.

Create evaluation frameworks and mechanisms for evaluating behavioral trajectories. Pacchiardi et al. (2026) argue that to ensure the safety of CL agents, evaluation practices need to shift from evaluating frozen snapshots to evaluating behavioural trajectories of systems that continually learn from experience. This can be operationalized as attempting to characterize the landscape of possible behaviors using trajectory elicitation sandboxes and to forecast how behaviour will evolve from a given set of experiences using predictive monitors. Pacchiardi et al. recommend the following concrete research directions:

- Build trajectory elicitation sandboxes: environments where a CL system undergoes prolonged, controlled multi-session interactions. These environments should support inputs provided by the evaluators for adversarial stress-testing, emulated users/environments that react realistically to the system's outputs, and varying levels of self-directedness on the CL agent’s part.
- Train predictive monitors: models that take the system's current state and upcoming inputs and forecast its future behavioural profile at a given horizon. The trajectories collected through elicitation sandboxes can function as training data for the monitors.
- Apply trajectory-centered evaluation to current agents in order to find out what kinds of systems are amenable to it, in turn informing how future CL agents can be designed such that they remain amenable to trajectory-centered evaluation. They also recommend research on methods like contractive update rules that may yield predictable long-horizon updates, choice of intrinsic objectives (e.g., curiosity or novelty), gated adaptation, and circuit breakers that pause or roll back learning when monitors detect loss of predictability. Based on the results of this work, developers should design CL mechanisms amenable to evaluation.
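The circuit-breaker idea can be sketched as follows. This is a toy under heavy assumptions, not a method from Pacchiardi et al.: the agent state is a single number, `predict` stands in for a predictive monitor, `observe` stands in for a behavioural measurement, and "loss of predictability" is operationalized as the gap between forecast and observation exceeding a hypothetical `tolerance`.

```python
from typing import Callable, List, Tuple


def run_with_circuit_breaker(
    state: float,
    updates: List[float],
    apply_update: Callable[[float, float], float],
    predict: Callable[[float, float], float],
    observe: Callable[[float], float],
    tolerance: float,
) -> Tuple[float, int, bool]:
    """Apply learning updates one at a time. If the behaviour observed after
    an update deviates from the monitor's forecast by more than `tolerance`,
    discard that update (roll back) and pause all further learning.

    Returns (final state, number of updates kept, whether the breaker tripped).
    """
    applied = 0
    for update in updates:
        forecast = predict(state, update)
        candidate = apply_update(state, update)
        if abs(observe(candidate) - forecast) > tolerance:
            return state, applied, True  # candidate is discarded
        state = candidate
        applied += 1
    return state, applied, False


# Toy dynamics: small updates behave linearly, large ones have outsized effects.
def toy_apply(state: float, update: float) -> float:
    return state + update if abs(update) < 1.0 else state + update ** 3


def toy_predict(state: float, update: float) -> float:
    return state + update  # the monitor assumes linear behaviour


def toy_observe(state: float) -> float:
    return state


final_state, kept, tripped = run_with_circuit_breaker(
    0.0, [0.1, 0.2, 2.0, 0.1], toy_apply, toy_predict, toy_observe, tolerance=0.5)
print(final_state, kept, tripped)
```

In this toy run the third update is the first one the monitor fails to forecast, so it is rolled back and the fourth update is never applied.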

However, Pacchiardi et al. also note that the nature of CL agents as dynamical systems may prevent such evals from being reliable. Chaotic sensitivity to states and initial conditions breaks predictive monitors: trajectories that start from slightly different states or inputs can diverge exponentially. Multiplicity of attractors may cause elicited trajectories to represent only some of the possible behavioural profiles. A single trajectory can provide evidence about either of those obstacles, but ruling them out requires a global guarantee. We haven't thought enough about whether those issues can be circumvented; Pacchiardi et al. are hopeful that they can.

Create a benchmark that measures CL interpretability. It is easier for the ML field to make progress on goals that can be easily measured. If it is possible to create a solid metric for the interpretability of a CL method and measure that alongside the effectiveness of the method, this could make it easier to build interpretable CL and more likely for people to make an effort to have interpretable CL or feel an aversion to degrading CL interpretability.

One counterargument to working on this project is that we may not care much about subtle differences in the interpretability of different CL approaches, but rather about step changes in interpretability between methods with purely text-based updates vs. weight-based updates with memories stored in sparse linear structures vs. weight-based updates with memories stored in nonlinear structures. Creating a benchmark that measures CL interpretability may overly emphasize the subtle differences rather than the step changes.

Please feel free to suggest additional ideas in the comment section, and let us know if you’re planning to work on any of these directions! We are not planning to work on any of these projects ourselves at Aether in the immediate future, but would be curious to know about other efforts and are happy to provide feedback to anyone thinking about them.

^[1] As mentioned earlier in this sequence, our definition of CL can guide the operationalization of CL milestones. For example, some questions that can define a CL milestone include:
- What level of sample-efficiency is required?
- How frequent do the updates have to be?
- Do specific agent components, such as model weights, have to be updated to meet the milestone?
- Are updates shared across instances of the agent?
- How widely used are the agents receiving these updates?
^[2] This could be achieved either through synthetic document finetuning or by performing SFT on chats with a user message and an assistant prefill like “<think>My goal is X, therefore I should …”
^[3] For example, it seems that emergent misalignment is a “simple” solution that’s easier for the model to find when fine-tuned on narrowly misaligned data. Are there “simple” solutions for model values/goals/drives?
^[4] Talkie is most likely too weak to display interesting behavior in such experiments, but we expect people to release better vintage LLMs in the future.
