What and Why: Developmental Interpretability of Reinforcement Learning

Garrett Baker

Introduction

I happen to be in that happy stage in the research cycle where I ask for money so I can continue to work on things I think are important. Part of that means justifying what I want to work on to the satisfaction of the people who provide that money.

This presents a good opportunity to say what I plan to work on in a more layman-friendly way, for the benefit of LessWrong, potential collaborators, interested researchers, and funders who want to read the fun version of my project proposal.

It also provides the opportunity for people who are very pessimistic about the chances I end up doing anything useful by pursuing this to have their say. So if you read this (or skim it), and have critiques (or just recommendations), I'd love to hear them! Publicly or privately.

So without further ado, in this post I will be discussing & justifying three aspects of what I'm working on, and my reasons for believing there are gaps in the literature in the intersection of these subjects that are relevant for AI alignment. These are:

Reinforcement learning
Developmental Interpretability
Values

Culminating in: Developmental interpretability of values in reinforcement learning.

Here are brief summaries of each of the sections:

Why study reinforcement learning?
1. Imposed-from-without or in-context reinforcement learning seems a likely path toward agentic AIs
2. The “data wall” means active-learning or self-training will get more important over time
3. The usual AI risk arguments have fewer ways to fail under RL with mostly outcome-based rewards than under supervised learning + RL with mostly process-based rewards (RLHF).
Why study developmental interpretability?
1. Causal understanding of the training process allows us to produce reward structure or environmental distribution interventions
2. Alternative & complementary tools to mechanistic interpretability
3. Connections with singular learning theory
Why study values?
1. The ultimate question of alignment is how we can make AI values compatible with human values, yet this is relatively understudied.
Where are the gaps?
1. Many experiments
2. Many theories
3. Few experiments testing theories or theories explaining experiments

Reinforcement learning

Agentic AIs vs Tool AIs

All generally capable adaptive systems are ruled by a general, ground-truth, but slow outer optimization process which reduces incoherency and continuously selects for systems which achieve outcomes in the world. Examples include evolution, business, cultural selection, and to a great extent human brains.

That is, except for LLMs. Most of the feedback LLMs receive is supervised, unaffected by the particular actions the LLM takes, and process-based (RLHF-like), where we reward the LLM according to how useful an action looks in contrast to a ground truth regarding how well that action (or sequence of actions) achieved its goal.

Now I don't want to make the claim that this aspect of how we train LLMs is clearly a fault of theirs, or in some way limits the problem solving abilities they can have. And I do think it possible we see in-context ground-truth optimization processes instantiated as a result of increased scaling, in the same way we see in-context learning.

I do, however, want to make the claim that this current paradigm of mostly process-based supervision, if it continues, and doesn't itself produce ground-truth based optimization, makes me optimistic about AI going well.

That is, if this lack of general ground-truth optimization continues, we end up with a cached bundle of not very agentic (compared to AIXI) tool AIs with limited search or bootstrapping capabilities.

Of course, supervised pretraining + RLHF does not optimize for achieving goals in the world, so why should we get anything else?

"Well, in a sense we are optimizing for agentic AIs..." The skeptic says, "Humans are agentic, and we're training LLMs to mimic humans! Mimicking agency is agency, so why won't LLMs be agentic?"

This is why I say I think it possible we see in-context ground-truth optimization criteria instantiated as a result of increased scaling.

However, I expect the lessons I learn from studying outside-imposed RL to be informative about in-context RL if it appears.

Data walls

As for fighting the data wall, already labs are researching ways to get AIs to give themselves feedback, generate their own synthetic datasets, perform self-play, and scalably learn from algorithmically checkable problems. Mostly by adaptation of RL algorithms. The best known example here for this audience is Anthropic's Constitutional AI (also known as reinforcement learning from AI feedback (RLAIF)).

One may ask: how likely are such active learning approaches to be based on RL algorithms, rather than something else entirely?

I do think there's a good chance that new RL algorithms are invented, or that other existing algorithms are adapted for RL. But to me the question isn't so much whether or not future active learning approaches will use PPO, but what dynamics are similar across different active learning approaches & why. I tend to think a lot of them are. The approaches aren’t all that different from each other.

Developmental interpretability

So why study developmental interpretability, instead of regular old mechanistic interpretability?

For me, the biggest reason is that I want to know why the structures in models exist in the first place, not just that they exist. We want to be able to make predictions about which structures are stable, how the training distribution affects which structures we see, in what order those structures form, and which points or events in training are most critical for their formation.

Studying developmental interpretability also lets us make connections with singular learning theory, and the local learning coefficient. It gives us a connection to the geometry of the loss landscape, which we have good ways of mathematically characterizing and describing.

Focusing on the development of models also allows me to ask more and (I think) quite interesting questions that mechanistic interpretability doesn't so much care about. We can abstract away, and ask questions about the dynamics of model evolution, which constrains 1) What algorithm is our model mechanistically implementing, and 2) What functional forms or measurable quantities should our theory of model development build up to or try to predict?

Values

Of course, I want to ultimately say something of relevance to AI alignment, and the most direct way of doing this is to talk about values.

Whether you plan on ensuring your AIs always follow instructions, are in some sense corrigible, have at least some measure of pro-sociality, or are entirely value aligned, you are going to need to know what the values of your AI system are, how you can influence them, and how to ensure they're preserved (or changed in only beneficial ways) when you train them (either during pretraining, post-training, or continuously during deployment). Many of the arguments for why we should expect AI to go wrong assume, as a key component, that we don't know how such training will affect the values of our agents.

Ok, but concretely what will you actually do?

Well, looking around the past literature which seems relevant, there seems to be a bunch of theories about why RL systems learn particular policies, and an awful lot of experiments on RL systems, but few who are trying to create theories to explain those experiments, or experiments to test those theories.

Some examples include the Causal Incentives Working Group on the theoretical side, and Jenner et al.'s Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, Colognese & Jose's High-level interpretability: detecting an AI's objectives, and Team Shard's Understanding and Controlling a Maze-Solving Policy Network on the experimental side^[1].

So the obvious place to come in here is to take those theories, and take those experimental results & methods, and connect the two.

So for example (taking the above papers as prototypical examples), to connect experimental results to theory, we could take Jenner et al.'s technique for detecting lookahead, extend Colognese & Jose's techniques for detecting objectives, decompose the "shards" of models (by looking for contexts and the relevant heuristics used or identifying the relevant activation vectors), or otherwise identify mechanisms of interest in RL models, quantify these, and track their progression over training.

After tracking this progression over training, we can then identify the features of the setup (environmental details & variables, reward structure) which affect this progression, track those details (if they vary) over time, determine functional forms for the relevant curves we end up with, study how the environmental variables affect those forms, and propose & test more ground-up hypotheses for how those forms could be produced by lower-level mechanisms.
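As a minimal sketch of the "determine functional forms" step: suppose some quantity of interest (say, a probe accuracy for lookahead) has been measured at a series of training checkpoints. The logistic form, the helper name `fit_logistic`, and the synthetic data below are all assumptions made for illustration, not results. The real curves may well need a different functional form.

```python
import numpy as np


def fit_logistic(steps, metric, lo=0.0, hi=1.0, clip=1e-6):
    """Fit metric ~ lo + (hi - lo) * sigmoid(k * (step - t0)).

    Uses a linear regression on the logit of the rescaled metric, so it
    assumes the curve really is (roughly) logistic between lo and hi.
    Returns the rate k and the midpoint t0 of the transition.
    """
    p = np.clip((np.asarray(metric, dtype=float) - lo) / (hi - lo), clip, 1 - clip)
    logit = np.log(p / (1 - p))
    k, intercept = np.polyfit(np.asarray(steps, dtype=float), logit, 1)
    return k, -intercept / k


# Hypothetical checkpoints and a synthetic, noiseless metric curve.
steps = np.arange(0, 1001, 50)
true_k, true_t0 = 0.02, 400.0
metric = 1.0 / (1.0 + np.exp(-true_k * (steps - true_t0)))

k, t0 = fit_logistic(steps, metric)
print(f"fitted rate k = {k:.4f}, fitted midpoint t0 = {t0:.1f}")
```

Once each run is summarized by a couple of fitted parameters like these, one can ask how they move as the reward structure or environment variables change, which is the comparison described above.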

And from the theoretical to experimental side of things, one question I'm pretty excited about is about how much singular learning theory (SLT), a theory of supervised learning, has to say about reinforcement learning, and in particular whether SLT's derived measure of algorithmic complexity---the "local learning coefficient"---can be adapted for reinforcement learning.

The algorithms used for estimating the local learning coefficient take in a model, and a dataset of labels & classifications.

In reinforcement learning, we have a model. So that aspect is fine. But we don't have a dataset of labels and classifications. We have environmental interactions instead. So if we want to use that same algorithm, we're going to need to synthesize a suitable dataset from those environmental interactions (or perhaps some other aspect of the environment).
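To make the supervised-case dependence on a dataset concrete, here is a rough sketch of the kind of estimator I have in mind, as I understand it from the SLT literature: sample weights from a tempered posterior localized around the trained weights w*, then compute n·β·(E[L_n(w)] − L_n(w*)) with β = 1/log n. Everything specific below is an assumption for illustration: the toy linear regression model, the use of full-batch Langevin updates rather than minibatch SGLD, and the hyperparameters `eps` and `gamma`.

```python
import numpy as np


def loss_and_grad(w, X, y):
    """Average Gaussian negative log-likelihood (up to a constant) and its gradient."""
    resid = X @ w - y
    return 0.5 * np.mean(resid ** 2), X.T @ resid / len(y)


def estimate_llc(w_star, X, y, steps=6000, burn_in=1000, eps=1e-3, gamma=1.0, seed=0):
    """Estimate the local learning coefficient at w_star by Langevin sampling.

    Samples from a density proportional to
        exp(-n * beta * L_n(w) - gamma / 2 * ||w - w_star||^2)
    and returns n * beta * (mean sampled loss - loss at w_star).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = 1.0 / np.log(n)
    loss_star, _ = loss_and_grad(w_star, X, y)
    w = w_star.copy()
    losses = []
    for t in range(steps):
        loss, grad = loss_and_grad(w, X, y)
        if t >= burn_in:
            losses.append(loss)
        # Gradient of the negative log-density, including the localization term.
        drift = n * beta * grad + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
    return n * beta * (np.mean(losses) - loss_star)


# Toy supervised dataset: linear regression with unit-variance Gaussian noise.
data_rng = np.random.default_rng(1)
n, d = 1000, 4
X = data_rng.standard_normal((n, d))
y = X @ data_rng.standard_normal(d) + data_rng.standard_normal(n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

llc_hat = estimate_llc(w_star, X, y)
print(f"estimated LLC: {llc_hat:.2f} (regular model, so roughly d/2 = {d / 2})")
```

Note that `X` and `y` appear in every step of the sampler. That is exactly the ingredient RL does not hand us directly, and what would need to be synthesized from environmental interactions.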

One very particular idea in this space would be to take Chen et al.'s Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition, and train the same models on the same tasks, but using PPO. We would then identify the phase transitions if there are any (easy for this setup), and see whether the usual way of measuring the local learning coefficient works in that circumstance. Next, add some dataset drift, and see how we need to modify the local learning coefficient calculation to detect the phase transitions in that circumstance. Essentially, we slowly add more of the difficult aspects of RL to that very simple environment until we get an estimation method which we can be reasonably experimentally confident has the same properties as the local learning coefficient in the supervised learning case.

Call to action

If any of this sounds exciting, there are two ways to help me out.

The first, and most relevant for LessWrong, is collaboration, either short-term or long-term.

If you suspect you're good at running and proposing experiments in ML (and in particular RL) systems, interpretability, or just finding neat patterns in data, I probably want to work with you, and we should set up a meeting to talk.

Similarly, if you suspect you'd be good at the theoretical end of what I describe--mathematical modeling, or inferring generating mechanisms from higher level descriptive models--then I also probably want to work with you, and similarly we should set up a meeting to talk.

If you do want to talk, use this page to set up a meeting with me.

The second, and less relevant, way you can help is via funding. Anyone can donate to this project via the corresponding Manifund project page, which closes on July 30th. That project page also gives a more detailed & concrete description of the project. Every bit counts, and if I don't reach my minimum funding amount (about $20k), no funds will be deducted from your Manifund account, so you can repurpose that funding to other causes.

^[1] Though the shard theory project is closer to a theory-experiment loop than the others here. They don't yet have math to go along with the intuitions they're presenting though.

I recently learned about developmental interpretability and got very excited about the prospect of being able to understand the development of structure that encodes things like piece evaluations, tactics, and search that chess engines must be doing under the hood, so I'm glad to find out I'm not the only one interested in this!

> Why study reinforcement learning/developmental interpretability?

The main reason I find this angle so compelling is that the learning process for games compresses what I see as "natural" reasoning to a form which we have some intuition for and can make reasonable hypotheses about. I think a lot about the limits of our cognitive senses, which I've heard Chris Olah talk about; naturally, a primary obstacle in interpretability research is that the actual computation done inside a model greatly exceeds our working memory capacity. Games provide one of the most information-dense interfaces between the internals of a model and our own minds; observing and steering the process of a network learning to play a game is the best way I can think to encode the thought process of a model into my own neural net (to the extent that it's possible).

To that end, I think it would be amazing if there was something like Neuronpedia for looking inside a chess engine as it learns. Something that let you view replays, had an analysis board where you could choose a weight snapshot and board state and play against it or have it play against itself, and that made pretty visualizations of various metrics. There's a large community of people interested in chess engines and I bet they would discover some cool things. If this sounds interesting to anyone else, please let me know!

By the way, if you want to donate to this but thought, like me, that you need to be an “accredited investor” to fund Manifund projects, that only applies to their impact certificate projects, not this one.

> Whether you plan on ensuring your AIs always follow instructions, are in some sense corrigible, have at least some measure of pro-sociality, or are entirely value aligned, you are going to need to know what the values of your AI system are, how you can influence them, and how to ensure they're preserved (or changed in only beneficial ways) when you train them (either during pretraining, post-training, or continuously during deployment).

This seems like it requires solving a very non-trivial problem of operationalizing values the right way. Developmental interpretability seems like it's very far from being there, and as stated doesn't seem to be addressing that problem directly.

> RLHF does not optimize for achieving goals

RLHF can be seen as optimizing for achieving goals in the world, not just in the sense in the next paragraph? You're training against a reward model that could be measuring performance on some real-world task.

> The “data wall” means active-learning or self-training will get more important over time

Out of curiosity, are you lumping things like "get more data by having some kind of good curation mechanism for lots of AI outputs without necessarily doing self-play and that just works (like say, having one model curate outputs from another, or even having light human oversight on outputs)" under this as well? Not super relevant to the content, just curious whether you would count that under an RL banner and subject to similar dynamics, since that's my main guess for overcoming the data wall.

> This seems like it requires solving a very non-trivial problem of operationalizing values the right way. Developmental interpretability seems like it's very far from being there, and as stated doesn't seem to be addressing that problem directly.

I think we can gain useful information about the development of values even without a full & complete understanding of what values are. For example by studying lookahead, selection criteria between different lookahead nodes, contextually activated heuristics / independently activating motivational heuristics, policy coherence, agents-and-devices (noting the criticisms) style utility-fitting, your own AI objective detecting (& derivatives thereof), and so on.

The solution to not knowing what you're measuring isn't to give up hope, it's to measure lots of things!

Alternatively, of course, you could think harder about how to actually measure what you want to measure. I know this is your strategy when it comes to value detection. And I don't plan on doing zero of that. But I think there's useful work to be done without those insights, and would like my theories to be guided more by experiment (and vice versa).

> RLHF can be seen as optimizing for achieving goals in the world, not just in the sense in the next paragraph? You're training against a reward model that could be measuring performance on some real-world task.

I mostly agree, though I don't think it changes too much. I still think the dominant effect here is on the process by which the LLM solves the task, and in my view there are many other considerations which have just as large an influence on general purpose goal solving, such as human biases, misconceptions, and conversation styles.

If you mean to say we will watch what happens as the LLM acts in the world, then reward or punish it based on how much we like what it does, then this seems a very slow reward signal to me, and in that circumstance I expect most human ratings to be offloaded to other AIs (self-play), or for there to be advances in RL methods before this happens. Currently my understanding is this is not how RLHF is done at the big labs, and instead they use MTurk interactions + expert data curation (+ also self-play via RLAIF/constitutional AI).

> Out of curiosity, are you lumping things like "get more data by having some kind of good curation mechanism for lots of AI outputs without necessarily doing self-play and that just works (like say, having one model curate outputs from another, or even having light human oversight on outputs)" under this as well? Not super relevant to the content, just curious whether you would count that under an RL banner and subject to similar dynamics, since that's my main guess for overcoming the data wall.

This sounds like a generalization of decision transformers to me (i.e. condition on the best of the best outputs, then train on those), and I also include those as prototypical examples in my thinking, so yes.
