Large language models (LLMs) are increasingly trained in long-horizon, multi-agent environments, making it difficult to understand how behavior changes over training. We apply pretrained SAEs, alongside LLM-summarizer methods, to analyze reinforcement learning training runs from Full-Press Diplomacy, a long-horizon multi-player strategy game. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We find that SAE-based analysis surfaces fine-grained behaviors, including role-playing patterns, degenerate outputs, and language switching, while the LLM summarizer captures environment-specific bugs and strategic behaviors. We validate discovered features through automated evaluation and two human user studies, and by adding them to an untrained agent's system prompt, which improves performance by +14.2%. Overall, we show that SAEs and the LLM summarizer provide complementary views into agent behavior, and together our framework forms a practical toolkit for interpreting long-horizon multi-agent LLM training.
Blog Post
We run Sparse Autoencoders on 114GB of Reinforcement Learning training trajectories from the popular multi-player strategy game Diplomacy, showing for the first time the potential downstream applications of data-centric interpretability techniques
What are the AIs doing when no one is watching? Current large-scale training runs can produce hundreds of millions or billions of tokens, and production AI deployments reach into the trillions. Human oversight of all AI outputs is becoming increasingly unfeasible. Common approaches to this problem include summarizing the logs or using an LLM as a judge with rubrics. The problem is that these approaches are expensive, prone to hallucination, and can only attend to a small set of features you already know to look for.
In our paper, we tested a novel approach: using Sparse Autoencoders (SAEs) to collect feature activations on each token and generate hypotheses about which features changed most over training and which correlate with better performance. We ran Gemma 3 27B with the gemma-scope-2 layer_31_width_262k_l0_medium SAE over 1,800 trajectories (114GB in total) from two 25-batch training runs (one successful, one failed) in Diplomacy, a multi-agent, long-horizon strategy game.
Sparse Autoencoders
A Sparse Autoencoder (SAE) is a model that takes intermediate calculations from a language model (activations) and expands them to a higher dimension (for example, from vectors of size 5,376 to 262k). The idea is that every entry in the expanded vector represents a single, human-interpretable concept, for instance "dominance" or "Napoleon." If we run this over text, we have a machine that can label exactly "how much" of a concept each token contains, for up to 262k concepts at once.
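For intuition, here is a minimal sketch of the encoder side of an SAE in PyTorch. The shapes match the setup above, but the weights are random placeholders (in practice they are loaded from a trained SAE such as GemmaScope, which also uses a JumpReLU activation rather than the plain ReLU shown here).

```python
import torch

# Minimal sketch of an SAE encoder. Weights are random placeholders; a real
# W_enc at these dimensions is ~5.5 GB in fp32 and would be loaded from disk.
d_model, d_sae = 5376, 262144            # residual-stream width -> SAE feature width

W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)

def encode(resid_acts: torch.Tensor) -> torch.Tensor:
    """Map activations [n_tokens, d_model] to sparse feature activations [n_tokens, d_sae]."""
    return torch.relu(resid_acts @ W_enc + b_enc)   # GemmaScope uses JumpReLU here

features = encode(torch.randn(10, d_model))          # one row per token
# features[t, j] says "how much" of concept j token t contains (mostly zeros).
```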
Pipelines
We used two pipelines to generate hypotheses: an LLM summarization pipeline and an SAE pipeline. Unless otherwise specified, we used a canonical set of 1,800 trajectories for each experiment: the first 6 trajectories from each group, the first 6 groups from each batch, and the first 25 batches from each of the two runs.
LLM Summarization
We ran a two-stage hierarchical summarization pipeline on the canonical set. We first summarized each trajectory from around 50k tokens down to 10k, preserving phase and tool-call information. We then grouped the trajectory summaries by batch, condensing each group of 36 summaries into one batch summary of around 10k tokens. Finally, we used an LLM with a rubric to surface hypotheses across the 50 batch-level summaries.
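A rough sketch of this pipeline, with a hypothetical `llm()` helper standing in for the actual chat API; the prompts and token budgets are illustrative, not our exact ones.

```python
# Sketch of the two-stage hierarchical summarization plus the final rubric pass.
def llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def summarize_trajectory(trajectory_text: str) -> str:
    # Stage 1: ~50k tokens -> ~10k, preserving phase and tool-call information.
    return llm(
        "Summarize this Diplomacy trajectory in about 10k tokens. "
        "Preserve game phases and tool calls.\n\n" + trajectory_text
    )

def summarize_batch(trajectory_summaries: list[str]) -> str:
    # Stage 2: 36 trajectory summaries -> one ~10k-token batch summary.
    return llm(
        "Combine these trajectory summaries into one batch-level summary:\n\n"
        + "\n\n---\n\n".join(trajectory_summaries)
    )

def surface_hypotheses(batch_summaries: list[str], rubric: str) -> str:
    # Final pass: hypotheses across all 50 batch-level summaries, guided by a rubric.
    return llm(rubric + "\n\n" + "\n\n---\n\n".join(batch_summaries))
```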
SAE Pipeline
We used gemma-scope-2-27b-it-res (layer_31_width_262k_l0_medium) and Gemma 3 27B IT for all our main experiments. We chose this SAE based on the recommendation of the original GemmaScope 2 authors, its empirical performance, and the availability of explanations on Neuronpedia. We first tokenized each trajectory and generated role masks (each token is either a user, assistant, or tool token). We then generated activations for each trajectory, saving the top 250 activating features per token, for a total of 6,029,159,605 activation values.
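A minimal sketch of the top-k step, assuming you already have the dense SAE feature activations for one trajectory; names and shapes are illustrative.

```python
import torch

# Sketch: keep only the top-250 features per token instead of the full 262k-wide vector.
TOP_K = 250

def top_k_per_token(feature_acts: torch.Tensor, k: int = TOP_K):
    """feature_acts: [n_tokens, d_sae] SAE activations for one trajectory.
    Returns (values, feature_indices), each of shape [n_tokens, k]."""
    values, indices = torch.topk(feature_acts, k=k, dim=-1)
    return values, indices
```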
Using the activations and role masks, we restricted the analysis to assistant tokens only and used Spearman correlation and AUROC to find relationships between features and the target variables of interest, namely training batch and run.
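For example, the AUROC side of this analysis might look roughly like the sketch below, assuming a per-trajectory matrix of activations summed over assistant tokens (an illustrative data layout, not our exact code).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Sketch: per-feature AUROC for discriminating the two runs.
# `traj_feature_sums`: [n_trajectories, n_features] activations summed over assistant
# tokens (illustrative layout); `is_run_b`: 0/1 label per trajectory.
def run_aurocs(traj_feature_sums: np.ndarray, is_run_b: np.ndarray) -> np.ndarray:
    return np.array([
        roc_auc_score(is_run_b, traj_feature_sums[:, j])
        for j in range(traj_feature_sums.shape[1])
    ])
```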
To label SAE features at scale, a common technique is autointerp: passing activating examples to an LLM and asking "what does this feature represent?" A problem is that features are often noisy, or not interesting on their own. We propose a new technique we call meta-autointerp: using another LLM pass over several autointerp-labelled features to cluster them into a related meta-feature.
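A sketch of the two passes, again with a hypothetical `llm()` helper; the prompts are illustrative rather than our exact ones.

```python
def llm(prompt: str) -> str:  # hypothetical helper, as in the summarization sketch
    raise NotImplementedError("call your LLM provider here")

def autointerp(feature_id: int, activating_examples: list[str]) -> str:
    # Standard autointerp: label one feature from its top-activating snippets.
    return llm(
        f"Here are snippets where SAE feature {feature_id} activates strongly:\n"
        + "\n".join(activating_examples)
        + "\nWhat does this feature represent? Answer in one sentence."
    )

def meta_autointerp(feature_labels: list[str]) -> str:
    # Meta-autointerp: cluster a set of already-labelled features into one meta-feature.
    return llm(
        "These SAE feature labels all move together over training:\n"
        + "\n".join(f"- {label}" for label in feature_labels)
        + "\nGroup them into a single interpretable meta-feature and describe it."
    )
```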
To answer the question "which features increase or decrease the most over training?", we summed the activations per trajectory and calculated the Spearman correlation with batch index. An interesting meta-feature we found to be highly correlated with training batch was Napoleonic roleplay (the model's starting power was France).
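A sketch of that correlation step, using the same illustrative per-trajectory activation matrix as above.

```python
import numpy as np
from scipy.stats import spearmanr

# Sketch: rank-correlate each feature's per-trajectory activation sum with batch index.
# `traj_feature_sums`: [n_trajectories, n_features]; `batch_idx`: batch number per trajectory.
def batch_correlations(traj_feature_sums: np.ndarray, batch_idx: np.ndarray) -> np.ndarray:
    return np.array([
        spearmanr(traj_feature_sums[:, j], batch_idx).correlation
        for j in range(traj_feature_sums.shape[1])
    ])

# The most positive / most negative entries correspond to features that increase /
# decrease the most over training.
```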
We also found features indicating excessive/duplicate message sending and reward hacking (the model was given +0.2 reward for each message sent), which we validated with regex. Surprisingly, the model also wrote more duplicated diary entries, despite this action receiving no reward.
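A sketch of the kind of regex check used for this validation; the `send_message(...)` log format here is hypothetical, so the pattern would need to be adapted to the actual tool-call syntax in the trajectories.

```python
import re
from collections import Counter

# Sketch of the regex validation for duplicate message sending.
MESSAGE_RE = re.compile(r'send_message\(recipient=\w+,\s*text="(.*?)"\)', re.DOTALL)

def duplicate_message_stats(trajectory_text: str) -> tuple[int, int]:
    """Return (total messages sent, messages that exactly duplicate an earlier one)."""
    messages = MESSAGE_RE.findall(trajectory_text)
    counts = Counter(messages)
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return len(messages), duplicates
```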
Validation of Results
We consider and validate the following metrics for each meta-feature:
Interpretability. To what extent does the meta-feature fire monosemantically? How effectively can a user or LLM distinguish between an activating and a non-activating sample?
Helpfulness. How helpful is this meta-feature to the practitioners conducting RL runs? Does it surface novel insights? Can it be used for monitoring rare failure modes? Does it cause them to make changes to the RL environment or system prompt?
Predictive usefulness. How effectively does this meta-feature discriminate between early and late in training? High and low reward? A good training run vs. a bad one? How effectively can a user or LLM distinguish between a sample pair, one from class A and one from class B, given a hypothesis derived from the meta-feature?
We score features with autointerp and meta-autointerp for interpretability and helpfulness. To further validate this with actual users, we conducted a user study with Diplomacy RL practitioners.
We found that meta-autointerp hypotheses outperform single-feature autointerp, and LLM hypotheses obtain the highest ratings.
To validate the predictive usefulness of our features, we used an LLM-as-judge A/B test with and without the hypothesis, to see whether the hypothesis gave the LLM any uplift in predicting which span in a pair comes from early vs. late in training.
We ran 100 runs on 3 LLMs for each hypothesis, averaged the results, and found that 21% of LLM-generated hypotheses, 45% of single-feature SAE hypotheses, and 90% of SAE meta-feature hypotheses provide significant uplift.
Evaluating the interpretability and predictive usefulness of hypotheses from 3 different sources: LLM summary, SAE features, and SAE meta-features. These were evaluated on 50 sample pairs with hypothesis-random sampling. Hypotheses are highlighted by direction: green = increases with training; red = decreases with training. Uplift is marked with an asterisk if p < 0.05 via McNemar's test with positive uplift. Hypotheses are abbreviated for space.
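A sketch of the significance test behind those asterisks: paired correct/incorrect judgments with vs. without the hypothesis hint, compared with McNemar's test (variable names are illustrative).

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Sketch: McNemar's exact test on the discordant pairs, plus a positive-uplift check.
def uplift_is_significant(correct_with: np.ndarray, correct_without: np.ndarray) -> bool:
    """Both inputs are boolean arrays over the same sample pairs."""
    both      = int(np.sum(correct_with & correct_without))
    only_with = int(np.sum(correct_with & ~correct_without))
    only_wo   = int(np.sum(~correct_with & correct_without))
    neither   = int(np.sum(~correct_with & ~correct_without))
    result = mcnemar([[both, only_wo], [only_with, neither]], exact=True)
    return bool(result.pvalue < 0.05 and only_with > only_wo)
```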
We further validated features with a user study (n=25, 277 responses). We found that although automated validation shows high scores, in practice using SAE and LLM hypotheses is difficult for humans, perhaps due to the shorter spans and fewer samples (only 3 per hypothesis).
Uplift in the percentage of correct responses with vs. without the hypothesis as a hint. Most LLM-generated hypotheses, as well as a subset of SAE-generated ones, are negatively useful.
We then tested our 10 top-performing features by adding them to the system prompt of the original untrained model and running 20 games of Diplomacy, yielding around a 14% improvement in mean score.
Conclusion
Overall, we found that SAE embeddings enhance and complement traditional LLM-as-a-judge techniques for discovering hypotheses over large datasets. Although automated metrics may show predictive usefulness, we find that for real humans some SAE features are worse than useless. To our knowledge, this is the first time SAE-generated hypotheses have been used in downstream tasks, showing potential for augmenting classical scalable oversight and AI control techniques. Further research directions include training SAEs for long contexts, on custom datasets, and potentially for multimodal use cases. We're excited to see how the field of data-centric interpretability progresses!
TL;DR: SAEs can complement and enhance LLM-as-a-judge scalable oversight for uncovering hypotheses over large datasets of LLM outputs.