Team Shard is a nebulous alignment research collective, on paper siloed under John Wentworth's SERI MATS program, but in reality extending its many tendrils far across the Berkeley alignment community. "Shard theory" -- a name spoken of in hushed, mildly confused tones at many an EA hangout. This is their story (this month).

Epistemic status: A very quick summary of Team Shard's current research, written up today. Careful summaries and actual results are forthcoming, so skip this unless you're specifically interested in a quick overview of what we're currently working on.

Introduction

This past month, Team Shard began its research into the relationship between reinforcement schedules and the learned values of RL agents. Our core MATS team is composed of yours truly, Michael Einhorn, and Quintin Pope. The greater Team Shard, however, is legion -- its true extent only dimly suggested by the author names on its LessWrong writeups.

Our current path to impact is to (1) distill and expound shard theory and preregister its experimental predictions, (2) run RL experiments testing shard theory's predictions about learned values, and (3) climb the interpretability tech tree, starting with finetuned-on-values-text large language models, to unlock more informative experiments. In the 95th-percentile best-case world, we learn a great deal about how to reliably induce chosen values in extant RL agents by modulating their reinforcement schedules, and we can probe the structure of those induced values with interpretability tools.

Distillations

If you don't understand shard theory's basic claims and/or its relevance to alignment, stay tuned! A major distillation is forthcoming.

Natural Shard Theory Experiments in Minecraft

Uniquely, Team Shard already has a completed (natural) experiment under its belt! However, this experiment has a couple of nasty confounds, and even without those it would only have yielded a single bit of evidence for or against shard theory. But to summarize: OpenAI's MineRL agent is able to, in the best case, craft a diamond pickaxe in Minecraft in 4 minutes (!). Usually, the instrumental steps the MineRL agent must pursue to craft the diamond pickaxe … lie on the most efficient path to crafting the diamond pickaxe. So we can't disentangle from the model's ordinary gameplay data whether the model terminally values the journey or the destination: is reinforcement the model's optimization target, or are its numerous in-distribution proxies among its terminal goals?

One Karolis Ramanauskas, thankfully, already did the hard work of finding out for us!

When you give the MineRL agent a full stack of diamonds at the get-go … it starts punching trees and crafting the basic Minecraft tools, rather than immediately crafting as many diamond pickaxes as possible. Theories that predict the journey rather than the destination rejoice!

Now, there's a confound here, because the model was trained via reward shaping -- it was rewarded some lesser amount for all the instrumental steps along the way to the diamond pickaxe. Also, the model is quite stupid, despite its best-case properties. Whatever's true about its terminal values, it may just be flailing around and messing up. Given the significant difficulty of even running (let alone further finetuning) the OpenAI MineRL model, along with the model's stupidity confound, we opted to conduct our RL experiments in a more tractable (but still appreciably complex) environment.

Learned Values in CoinRun

So we now have an RL agent playing CoinRun well!

We're going to take it off-distribution and see whether it terminally values (1) just the coins, (2) a small handful of in-distribution proxies for getting coins, or (3) all of its in-distribution proxies for coins! Shard theory mostly bets that (3) will be the case, and is very nearly falsified if (1) is the case.
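To make the three hypotheses concrete, here's a toy sketch (hypothetical proxy names, not our actual harness) of how per-proxy off-distribution observations would be tallied into a verdict:

```python
def classify_learned_values(pursued_off_dist: dict) -> int:
    """Map off-distribution pursuit observations to one of the three hypotheses.

    pursued_off_dist maps each in-distribution proxy for coins to whether
    the agent went out of its way for that proxy once off-distribution.
    Returns 1, 2, or 3, matching the numbering in the post.
    """
    n_pursued = sum(pursued_off_dist.values())
    if n_pursued == 0:
        return 1  # (1) values just the coins; no proxy is terminally valued
    if n_pursued < len(pursued_off_dist):
        return 2  # (2) a small handful of the proxies
    return 3      # (3) all of its in-distribution proxies

# Hypothetical observations: the agent detours for every proxy we tried.
obs = {"move_right": True, "reach_level_end": True, "approach_coin_texture": True}
print(classify_learned_values(obs))  # → 3
```

The proxy names above are made up for illustration; the substantive work is in constructing off-distribution levels that actually disentangle the proxies.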

Feedback on Observable Monologues

We have successfully replicated some of the core results of the ROME paper on a GPT-style model! It looks like a language model finetuned on value-laden sentences stores facts about those sentences in the same place internally where it would store, e.g., the fact of which city the Eiffel Tower is located in.

From here, we will proceed on to forcing the thing to observably monologue.[1] Think of this as a continuation of the ROME interpretability result. That is, this part of Team Shard is betting on us climbing the interpretability tech tree and thereby unlocking important alignment experiments, including experiments that will reveal much about the shards active inside a model.

Conclusion

♫Look at me still talking when there's science to do
When I look out there
It makes me GLaD I'm not you
I've experiments to run
There is research to be done
On the people who are
Still alive.♫

  1. ^

    externalize its reasoning


Comments

If Shard theory is false, I would expect it to be false in the sense that as models get smarter, they stop pursuing proxies learned in early training as terminal goals and aim for different things instead. That not-smart models follow rough proxy heuristics for what to do seems like the normal ML expectation to me, rather than a new prediction of Shard Theory.

Are the models you use to play Minecraft or CoinRun smart enough to probe that difference? Are you sure that they are like mesa-optimisers that really want to get diamonds or make diamond pickaxes or grab coins, rather than like collections of "if a, do b" heuristics with relatively little planning capacity that will keep following their script even as situations change? Because in the latter case, I don't think you'd be learning much about Shard theory by observing them.

I love how your intro has the flavour of

We are Hydra. We are legion.

p.s. Hail Team Shard

p.p.s. I've read a bunch of so-called Shard Theory stuff and I'm still not sure how it differs from the concepts of optimization daemons/mesa-optimization besides less exclusively emphasising the 'post-general' regime (for want of a better term).

The only difference is that we're betting there'll be a lot of interesting, foreseeable structure in which mesa-optimizers are learned, conditional on a choice of key training parameters. We have some early conjectures as to what that structure will be, and the project is to build up an understanding and impressively win a bunch of Bayes points with it. Most alignment people don't especially think this is an important question to ask, because they don't think there'll end up being a lot of predictable structure in which proxies mesa-optimizers latch on to.

Shard theory is also a bet that proto-mesa-optimizers are also the mechanistic explanation of how current deep RL (and other deep ML settings, to a lesser extent) works.

We're going to take it off-distribution and see whether it terminally values (1) just the coins, (2) a small handful of in-distribution proxies for getting coins, or (3) all of its in-distribution proxies for coins


Is (2) here just referring to the type of stuff seen in Goal Misgeneralization in Deep Reinforcement Learning (CoinRun agent navigates to right-hand end of level instead of fetching the coin)? 

Yes!

We … were somewhat scattered when previously laying out our experiments roadmap, and failed to sufficiently consider this existing result during early planning. What we would have done after replicating that result would have been much more of that kind of thing: trying to extract the qualitative relationships between learned values and varying training conditions.

We are currently switching to RL text adventures instead, though, because we expect to extract many more bits about these qualitative relationships from observing RL-tuned language models.

Cool! How do you tell if it is (2) or (3)? 

Take the agent off-distribution and offer it several proxies for in-distribution reinforcement, arranged so that going out of its way for one proxy detours it from pursuing another. If you can modulate which proxy the agent detours for (by bringing one proxy much closer to the agent, say), you learn that the agent must care at least somewhat about every proxy it pursues at a cost. If the agent hasn't come to value a proxy at all, it will never take a detour to reach it.
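As an illustrative sketch of the "detour" criterion (a toy gridworld with Manhattan distances, assumed for illustration -- not the actual CoinRun setup): count a proxy as terminally valued when the agent visits it even though it lies on no shortest path to the goal.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def on_shortest_path(start, proxy, goal):
    # In a gridworld under Manhattan distance, a cell lies on some
    # shortest start->goal path iff visiting it adds no extra distance.
    return manhattan(start, proxy) + manhattan(proxy, goal) == manhattan(start, goal)

def detoured_for(trajectory, proxy, goal):
    """True if the agent visited `proxy` even though doing so lengthens
    the trip to `goal` -- i.e., it went out of its way for the proxy."""
    start = trajectory[0]
    return proxy in trajectory and not on_shortest_path(start, proxy, goal)

# The agent swings up to the proxy at (1, 2) before heading to the goal at (4, 0):
traj = [(0, 0), (1, 0), (1, 1), (1, 2), (1, 1), (1, 0), (2, 0), (3, 0), (4, 0)]
print(detoured_for(traj, (1, 2), goal=(4, 0)))  # → True
print(detoured_for(traj, (2, 0), goal=(4, 0)))  # → False (on the direct path anyway)
```

The key design point carries over to the real setup: a proxy the agent merely passes through on its way to the goal tells you nothing, so the off-distribution levels must place proxies strictly off the efficient path.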

I am eager to see how the mentioned topics connect in the end -- this is like the first few chapters of a book, reading the backstories of characters who have yet to meet.

On the interpretability side -- I'm curious how you do causal mediation analysis on anything resembling "values"? The ROME paper framework shows where in the computation graph the model recalls "properties of an object", but it's a long way from that to editing reward proxies out of the model.