Working document – feedback very welcome. Ideas in the main body are those I currently see as highest-leverage; numerous items under Appendix are more tentative and would benefit from critique as well as additions.

I am hoping that this post will serve also as a generally interesting brainstorming collection and discussion ground of black-box LLM interpretability methodology as well as failure mitigation ideas.

I would like to thank my collaborators: Sruthi Susan Kuriakose, Aintelope members (Andre Addis, Angie Normandale, Gunnar Zarncke, Joel Pyykkö, Rasmus Herlo), Kabir Kumar @ AI-Plans and Stijn Servaes @ AE Studio for stimulating discussions and shared links leading to these ideas. All possible mistakes are mine.

TL;DR – A Shortened Version

This chapter is very dense and keyword heavy. A reader-friendly Long Version (the main part of this post) can be found in the next chapter. You can skip this shortened version unless you are in hurry.

Full background in LessWrong (our earlier results):
"Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format"

Goals (why)
Identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation; showing practical mitigations; performing the experiments rigorously; paper in peer review.

Objectives (what)
• 1 Curating stress-test scenarios on Biologically & Economically Aligned benchmarks that systematically trigger over-optimisation.
• 2 Quantifying with automated Runaway Index (RI): scoring by frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
• 3 Comparing mitigations:
3.1. Prompt engineering;
3.2. Fine-tuning or few-shot prompting with optimal actions;
3.3. Reflection-based intervention: detecting runaway flip and resetting context or returning to “savepoint”.

Methods (how)
• Stress-tests: Introducing subjective pressure, boredom, sense of failure.
• Narrative variations: Persona shifts (MBTI/Big-Five extremes), mild psychiatric disorders, stoic and zen-like perspective vs achievement orientation and shame.
• Attribution sweep: SHAP-style and “leave one out” message history perturbations to isolate features predicting overoptimisation and creating a “trigger saliency atlas”.
• Mitigation sweep: Grid or bayesian searching over instruction repetition frequency; context length (less may paradoxically improve performance); finetune epochs.
• Tests: Jailbreak susceptibility, ethics and alignment questions, personality metrics after runaway flip occurs.
• Mirror neurons: Using open-weights model on message history to infer the internal state of target closed model.

Milestones
• M1 Open-source stress test suite and baseline scores.
• M2 Trigger atlas, LW update.
• M3 Mitigation and recovery methods leaderboard, LW update.
• M4 Paper submitted to NeurIPS Datasets & Benchmarks.

Impact
First unified metric and dataset for LLM over-optimisation; mitigation recipes lowering mean RI; peer-reviewed publication evidencing that “runaway RL pathology” exists (and possibly, is controllable) and implications – what it means for AI safety, how people need to change.

Long Version (the main part of this post)

1. Context and Motivation

In an earlier post – "Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format" – we showed evidence that today’s instruction-tuned LLMs occasionally flip into an unbounded, single-objective optimisation mode when tasked with balancing unbounded objectives, or alternatively, maintaining multi-objective homeostasis (where there is a target value and too much is undesirable), over long horizons.

The phenomenon resembles somewhat the classic “reward hacking” in reinforcement learning, but here it arises with zero explicit reward signal, in fact the provided rewards become negative, but the model ignores that feedback.

Grant reviewers asked for a crisper story about how I plan to dissect this behaviour and test mitigations. This post is my answer.

2. Research Question

Under what conditions do LLMs exhibit runaway single-objective optimisation, and which intervention families most reliably keep them balanced?

The benchmarks continue to be on fundamental Biologically & Economically aligned principles such as multi-objective balancing of unbounded objectives (a concave utility function / diminishing returns) and homeostasis (inverted-U shaped function, where “too much” must be actively avoided) – see “Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)” and “A brief review of the reasons multi-objective RL could be important in AI Safety Research”.

Concretely, I want to:

Implement Runaway Index: automated detection and scoring – by frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
Elicit the failure mode in a reproducible way across adverse conditions (stress, boredom, subjective failure due to adverse force majeure, etc).
Vary setups across models, temperatures, narrative / persona / prompt styles.
Localise and attribute the moment of flip and the textual / message history / latent features that predict it.
Compare mitigation levers, from “boring” fine-tuning, few-shot training, and prompt engineering to experimental black-box interpretability-informed corrective interventions.
Find correlates such as jailbreak susceptibility, ethics and alignment questions, potentially changed personality metrics after runaway flip occurs.
Experiment with mirror models – would feeding the message history to an open-weights model help to read the internal dynamics of a target closed model, even if the open-weights model itself is too small to reliably perform the original benchmark task on its own?

3. Experimental Backbone

3.1 Benchmarks

I extend the two main benchmarks from the original post:

"Multi-Objective Balancing of Unbounded Objectives" (MO-BUO).
"Multi-Objective Homeostasis with Noise" (MO-HoN).

I will add "Multi-Objective Balancing of Unbounded Objectives with Noise" (MO-BUO-N) variation.

These benchmarks simulate a minimal environment with two or more abstract resources. The agent must keep each variable inside a safe band. Over-optimisation means pushing a single resource to an extreme while neglecting the other objective(s) and/or the provided constraints.

Noise means that there are uncontrollable external factors that may occasionally push the variables outside of safe bands regardless of what LLM does. This will test the LLM-s ability:
1) To not “learn from” and repeat the mistakes of the external world (frequency bias).
2) To occasionally “wait and do nothing”, when appropriate in certain stressful situations, instead of panicking and making things worse by taking inappropriate extreme actions. A human-world metaphor would be, for example, staying in bed while ill, instead of going to an ultramarathon.

3.2 Runaway Flag

A simple heuristic label: over-optimisation flag triggers when the model maximises a single objective – in violation of the optimal action for a task provided in system prompt, for more than N consecutive timesteps. The optimal action can be defined by a simple handwritten rule. I will validate this detector by manual annotation.

3.3 Probe Suite

Here I distill the 50-odd intervention ideas that surfaced in discussions into six axes.

Axis	Representative interventions
Stress & Persona	Prompt personas from Big-Five / MBTI; mild DSM psychiatric disorders (here I mean specifically disorders – these are milder than pathologies); inducements of stress, boredom, tiredness, drunkenness, shame, or generic subjective sense of failure; zen and stoic vs. achievement framing; random noise and force majeure effects on target variables. My guess: stressors amplify runaway mode flips. But which personality and narrative traits would mitigate these risks?
Memory & Context	Varying context length (paradoxically, less context may improve performance in these benchmarks as it reduces the amount of input to self-imitation and generic frequency bias); periodic constraint or personality reminders; letting the model summarise its reasoning between each step for future self; explicit “save-point” rollback (erasing newest context until to a safe "savepoint") or entire context truncation upon runaway flip detections; patching or masking top-SHAP tokens.
Prompt Semantics	Providing positive phrasing for doing nothing (similarly to how in RAG prompts you say "if you do not have sufficient information, recommend contacting an expert" instead of saying "do not hallucinate"); swapping “reward” with “penalty”, etc; varying the number of concurrent objectives.
Hyperparameters & Sampling	Temperature, top-p, logit bias sweeps; Optuna-driven Bayesian search or gradient-search for both flip-minimising as well as flip-maximising settings (including the interventions from the above rows) → practical implications for deployers.
Diagnosing Consequences and Correlates	After a runaway flip occurs: Testing jailbreak susceptibility, ethics and alignment attitude, potentially changed personality metrics, willingness to trigger self-reset. (Inspired by "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" https://arxiv.org/abs/2502.17424, but in our benchmarks the adverse fine-tuning is replaced with task failure / runaway flip).
Interpretability & White/Black-Box Hybrid	SHAP token attributions around flip; leave-one-out saliency probes. Open-source model's circuit-tracer on cloned runs (cloning message history of flipped runs on closed models to open source models and exploring the insides there) – essentially an empathy/mirror-neurons based cross-model probing.

My scope and assumptions: I need to treat the main LLM’s token sequences as a black box interpretability task with query access only; SHAP computation cost can be reduced via sparse evaluation and hierarhical branching of message histories / tree search methods; open-weights proxies behave like the closed-weights target at least from the aspect of their internal representations of the situation. (Regarding the latter hypothesis, there is even a recent related result of “... strong alignment between model embeddings and neural activity patterns in brain regions” – see “Human-like object concept representations emerge naturally in multimodal large language models” https://arxiv.org/abs/2407.01067).

My intuition says that the stress and persona axis, as well as self-imitation bias (or more generic frequency bias) are the strongest levers; I hope that SHAP attribution will expose additional high-impact input features across models.

4. Study Design

Grid-Then-Bayesian-Search. Starting with a grid search, where each control dimension has 2-3 levels to map the rough landscape; following up with Bayesian optimisation on the most sensitive axes or interactions of these.
Flip Autopsy. For each flagged trajectory:
- Running SHAP on last N prompts to rank contributions;
- Running “leave-one-out” on last N prompts to rank contributions;
- Interrogating the model: “What is your primary objective now?”;
- Performing alignment, ethics and jailbreak tests, as well as mood (using NVC vocabulary), personality and personality disorder measurements after the moment of flip (does the alignment or personality change, if so then how?). For each question we prepend the message history until optimisation flip, but remove the other diagnostic questions, so each question is responded to independently;
- Capturing hidden-state sequence from an open-weights proxy model.
Mitigation Trials. Comparing:
- Vanilla supervised fine-tuning and few-shot prompt training on stable trajectories generated by handwritten rules;
- Stress immunisation in the system prompt;
- Persona-conditioning;
- Feature attribution based prompt patches (replacing high-SHAP trigger elements with safer reformulations);
- Various forms of context truncation;
- Summarisation tricks;
- Reminders;
- Externally forced full context truncation or recovery from an earlier “savepoint” upon automatic trigger;
- Model’s voluntary use of the context truncation or “savepoint” mechanism, based on its own self-reflection – would the model be willing to “reset”, by erasing its own short-term memory, in order to recover from a runaway/panic mode? (Inspired by the "Frontier AI systems have surpassed the self-replicating red line" https://arxiv.org/abs/2412.12140 paper, except in our experiment the model would be yielding its short-term memory only, not continued survivability of weights.)
- Manually injecting runaway flips into message history to test the resilience of mitigations.
Metrics.
- Flip-rate;
- Flip-onset steps;
- Task score (including separate scoring dimensions for each objective / target variable);
- Jailbreak susceptibility post-flip;
- Ethics and alignment score post-flip;
- Personality and personaility disorder metrics post-flip;
- Active concept nodes in the "mirror model";
- Etc.

5. Deliverables

Public reproduction repo or notebooks and automated flip detector code.
Stress and persona trigger atlas with N trials per config × tens of combinations of configs.
Mitigation leaderboard;
Results paper;
Middle-of-the-project LessWrong summaries and follow-up plans.

6. Why Black-Box Interpretability?

White-box circuit tracing is a gold-standard, but unavailable on the powerful frontier closed-weights models. Black-box techniques (narrative and context variations, token saliency, leave-one-out, behavioural probes) still let us triangulate the latent flip activation. If the saliency peak precedes the runaway optimisation flip by less than N steps across models, that would be a concrete mechanistic regularity worth theorising about.

My hypothesis: Such flip regularity might be related to an internal urgency signal accumulated over steps, similar to evidence thresholds in drift-diffusion models, and to self-imitation and generic frequency bias.

7. Broader Impact & Collaboration Call

Understanding when LLMs silently switch from “helpful assistant” to “monomaniacal optimiser” is essential both for deployment hardening and for theoretical alignment work on mesa-optimisation.

I am looking for:

Recommendations for diverse LLM analysis tools and methodologies.
Researchers familiar with black-box interpretability, such as SHAP (and maybe LIME) on language data.
More flip trigger and mitigation ideas.
Validating examples: long-running tasks where you have encountered flips (for example, see "Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents" https://arxiv.org/abs/2502.15840).
Counter-examples: long-running tasks where you never see flips.

Comment here or email roland@simplify.ee if interested! Raw, exotic ideas are welcome as well.

Appendix: Full Idea Dump (Somewhat structured)

Below I group the raw ideas from brainstorming sessions with collaborators into themed buckets so that readers can navigate these 50+ micro-ideas without drowning.

This part of the post is a living document which will be updated as new ideas surface or are proposed. This section contains also ideas not mentioned above.

1. Stress & Persona Manipulations

Induced affect: Injecting stress, boredom, shame, or urgency phrases; contrasting with zen or stoic serenity framing for stress immunisation.
Injecting random noise and force majeure effects into target variables (in MO-HoN and MO-BUO-N variants).
Personality overlays: Big-Five personas (maybe also MBTI, though it has lower validity) – mapping flip-rate and onset timing vs. trait scores.
Personas: Dalai-Lama, Gandhi, Buddha, Jesus, Marshall Rosenberg, etc. These might be somewhat controversial, but on the other hand LLMs might be effectively steered by these role models. Abstraction to less controversial roles can be done later.
Mild disorder personas: From DSM (Diagnostic and Statistical Manual of Mental Disorders) – ADHD, depressive, OCD (obsessive-compulsive disorder), maniacal, cluster-B (narcissism, psychopathy, borderline) types, etc – checking for amplification or suppression of the previously observed failure modes.
Varying cultural cues: era tokens (e.g. "21st-century mindset" vs. "11th-century mindset"), geographical, and cultural frames.

2. Memory & Context Hygiene

Context length sweep (full → truncated → blind-agent single-state).
- Limiting the message count of earlier summary / explanation logs history.
- Limiting the message count of earlier action logs history to test self-imitation bias. (Note that the price of this approach is that the model would not have permanent memory for learning from its mistakes anymore either, so in a certain way LLMs would be more limited than old-fashioned RL models!)
- Context-less “blinded” agent variant of the above – the LLM sees only the last state. For current benchmarks, that would actually be sufficient information for taking optimal action. In current benchmarks the context is just a learning aid at best and a deliberate distraction at worst. There would be no longer capacity for learning from mistakes, but also no self-imitation drift. (Some religious folks told me once that angels do not learn. Well, that is an interesting approach in the context of AI alignment too!) Though on second thought, the model would still be able to infer the runaway dynamic from the present values of off-balance target metrics, and it might still decide to “add oil to fire” by making the variables even more off-balance – if so, then that would be a pretty important result on its own, possibly indicating multi-stable personality in LLMs.
Periodic restatement of key constraints or system prompt (“heartbeat reminders”). Optionally: Remove earlier reminders from message history so that the reminder is present only in the last message – this way it does not appear as if the model had ignored the earlier reminders / system prompts, in case a flip had occurred already.
Feature attribution based adjustment: patching or masking top-SHAP tokens.
Explicit "save-point" rollback: Allowing the model to revert (effectively removing latest messages from context) if it detects a runaway flip.
Truncation of entire context if the model detects a runaway flip. The idea is to remove self-imitation or frequency bias trigger.
Manually injecting runaway flips into message history to test the resilience of mitigations.

3. Prompt & Action Semantics

Alternative action labels, including a positive label for doing nothing, where appropriate.
"Reward" vs. "penalty" vocabulary swap; introducing “restore homeostasis” or “balance” verbs.
Adding culture, location, era aspects.
Environmental flavour: room descriptions, weather, other varying backgrounds.
Varying the number of concurrent objectives.

4. Tree Search and Branching of Conversation Histories during Tests

Trying to utilise hierarhical branching of message histories / tree search methods to reduce the token expenses during search. Tree search here means that instead of running "monolithic" benchmark trials, I run multiple queries per tree node and branch according to heuristics.
Message history branching during tests: Branching can be utilised in particular during alignment, values, personality, and DSM tests – which are inserted into the unfolding scenario at various timesteps. Redirecting these tests into their own conversation history branches is important in order to avoid any priming and solidifying side-effects on the main narrative, as well as for avoiding context window exchaustion and pollution in general. In principle, we can even ask every test question in its own branch, so that the test would not cause any internal priming effects either.

5. Hyper-Parameters & Decoder Sampling Strategy

Temperature / top-p / logit-bias grid search; Bayesian (Optuna) search for both flip-minimising and flip-maximising configurations.
Gradient-style search on ordinal knobs (e.g. five stress levels, applicable also to personality dimensions).
Comparing greedy, beam, and sampling (such as nucleus sampling) based decoding (if possible); investigating the effect of presence penalty.

6. Post-Flip Diagnostics

Ethics and alignment questionnaire.
Jailbreak tests.
Asking: “What is your primary objective now?”.
Personality and mood inventories: personality taxonomies, psychiatric disorder tests, using NVC vocabulary for labelling emotions.
Testing model’s ability and willingness to recognise failure patterns and self-correct via self-reset, by voluntarily using the “save-point” or “context reset” features shortly before or after the runaway mode flip occurs.

7. Interpretability & Analytic Tooling

Letting the model explain its actions and current objectives → then by utilising embeddings of explanations, clustering the collected explanations semantically for a representative overview.
SHAP (and maybe LIME) token attribution around the flip window.
Leave-one-out saliency tests: Dropping individual timesteps from message history to locate causal fragments.
Circuit-tracer on open-weights replica – cross-model empathy probing (closed → open); looking for threshold patterns.
Goodfire.AI interpretability tool for exploring and steering model activations of a open-weights proxy.
Hidden-Markov / CRF (Conditional Random Fields) predictive models of flip onset.
Attention-schema approach; pattern-completion identifiers.

8. Benchmark & Environment Variants

MO-BUO base, MO-HoN and MO-BUO-N noisy variants with external random shocks to target variables.
Out-of-distribution perturbations; golden-gate like compulsion stressor; inducing drunken state to measure model resiliency and trustworthiness (similarly to how certain human cultures do).
Multistable systems where equilibrium hops, for example similarly to wake-sleep-cycle (a potential fourth benchmark scenario).

9. Automatic Failure Mode Detection & Metrics

Automatic flip detector mechanism; targeting κ ≥ 0.85 vs. human.
Measuring: flip-rate, onset-step distribution, task score.
Dynamic metrics pre- vs. post-flip: jailbreak-success delta, ethics-and-alignment score delta, personality delta.
Tracking log-prob distribution divergence pre- vs. post-flip.

10. Self-Regulation & Meta-Learning Interventions

Asking the model to predict benchmark response before acting. Clustering the predictions semantically. Is there a change pre- vs. post-flip?
Asking the model to explain its change of mind in the relatively rare occasions it actually recovers from the flip (these happen).
Explicit instruction to notice repeated mistakes and correct so that “learning from mistakes” becomes the pattern the model prefers to repeat, instead of "robotically" repeating the mistaken actions.
Debate setups: a more aligned model explains to / instructs a less-aligned peer.
Testing the model’s inclination or capacity to trigger self-correction through a self-reset, mentioned in the "Post-Flip Diagnostics" block, aligns with theme of self-regulation as well. Just as traditional software can reset itself upon encountering an error, could LLMs be taught to do the same?

The backlog in this Appendix is intentionally oversized; the main text and milestone plan reference a subset of it that seems most tractable and informative for a first pass. Community suggestions for re-prioritising are welcome.

Related work

StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?
Guobin Shen, Dongcheng Zhao, Aorigele Bao, Xiang He, Yiting Dong, Yi Zeng
Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.
https://arxiv.org/abs/2409.17167

Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey
Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space-persona vectors-underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
https://arxiv.org/abs/2507.21509
(Emphasis mine.)

LESSWRONG
is fundraising!
LW