What should we think about shard theory in light of chain-of-thought agents?

Chris_Leong

LawrenceC proposed that the nine main theses of shard theory are as follows:

1. Agents are well modeled as being made of shards: contextually activated decision influences.
2. Shards generally care about concepts inside the agent's world model, as opposed to pure sensory experiences or maximizing reward.
3. Active shards bid for plans in a way shaped by reinforcement learning.
4. The optimization target is poorly modeled by the reward function.
5. Agentic shards will seize power.
6. Value formation is very path dependent and relatively architecture independent.
7. We can reliably shape an agent's final values by changing the reward schedule.
8. "Goal misgeneralization" is not a problem for AI alignment.
9. Shard theory is a good model of human value formation.

Back in the day, discussion of this was quite abstract, almost by necessity. Now, however, chain-of-thought provides us with much more (albeit imperfect) insight into how models reason.

It feels like the time is ripe to re-evaluate this theory. Does anyone have any takes on how this pans out?

Shard theory is dangerously incomplete as a model of humans and LLM-based AGI. It should be a useful lens, but the conclusions drawn from it about alignment are largely wrong.

Shard theory describes fairly well how humans usually work and how current LLMs make decisions. It's a better theory of animal motivation and behavior than of adult human motivation and behavior. This is bad news for how it generalizes from current LLMs to future LLM systems.

The most important conclusions for alignment IMO are that LLMs aren't maximizers, and that they don't treat reward as the optimization target. But shard theory doesn't rule out:

Future LLM systems becoming maximizers
Future LLM AGI treating reward as the optimization target

Beyond not being ruled out, these appear to be the default outcomes of increased intelligence and reflection.

A little more of the logic:

It does not rule out the possibility of such a system becoming more coherent and becoming effectively a maximizer for some goal or collection of goals that won a complex competition.

It does not rule out learning to treat reward as the optimization target. Humans and RL systems don't usually do that, but that's in part a product of our limited intelligence. Some humans sometimes DO seem to treat reward as the optimization target (let's do fun/exciting things!). And we might expect smarter networks to form representations of reward as an abstract (and correct) concept, rather than just representing the proximal inputs correlated with reward.

This concern that becoming smarter breaks the assumptions of shard theory makes it much less useful as a theory for the purpose of aligning future AGI.

But it is useful as a theory of how current networks behave and how future networks will behave until they hit those phase shifts.

(That should technically be "unless" they shift, but I strongly expect these shifts at some point; delaying them is one potential strategy.)

Shard Theory in Nine Theses is the LawrenceC post you refer to. It is fairly skeptical of shard theory.

Disentangling Shard Theory into Atomic Claims is another excellent breakdown of the claims.

Shard theory isn't wrong but it's incomplete in some really important ways. There's a weird contradiction: it should be useful, but the ways it's been deployed so far have probably caused more mistakes than progress.

So my feelings about it are mixed.

"This concern that becoming smarter breaks the assumptions of shard theory makes it much less useful as a theory for the purpose of aligning future AGI."

I've made a similar criticism myself: I didn't believe the shard theory model would hold up for long, because a more agentic shard (or a set of them) would eventually seize control. Then again, Lawrence writes that "agentic shards will seize power" is one of the assumptions of the theory. So maybe this isn't actually a criticism of shard theory? This is a point I'm still somewhat confused on: is shard theory just meant to be an intermediate theory, or does it still hold even after the more agentic shards seize power?

I am going back through some of the old shard theory articles. Hopefully that provides me with some more clarity.

Some more work in this vein, and connecting it to personas, seems like a clear next direction for chain-of-thought.