Here is a shard-theory intuition about humans, followed by an idea for an ML experiment that could serve as a proof of concept for applying it to RL:

Let's say I'm a guy who cares a lot about studying math well, studies math every evening, and doesn't know much about drugs and their effects. Somebody hands me some ketamine and recommends that I take it this evening. I take the ketamine before I sit down to study math, and the math study goes terribly intellectually, but since I am on ketamine I'm having a good time, and credit gets assigned to the 'take ketamine before I sit down to study math' computation. So my policy network gets updated to increase the probability of the computation 'take ketamine before I sit down to study math.'

HOWEVER, my world-model also gets updated, acquiring the new knowledge 'taking ketamine before I sit down to study math makes math study go terribly intellectually.' And if I have a strong enough 'math study' value shard, then in light of this new knowledge the 'math study' value shard is going to forbid taking ketamine before I sit down to study math. So my 'take ketamine before sitting down to study math' exploration resulted in me developing an overall disposition against taking ketamine before sitting down to study math, even though the computation 'take ketamine before sitting down to study math' was directly reinforced! (Because the same act of exploration also resulted in a world-model update that associated the computation 'take ketamine before sitting down to study math' with implications that an already-powerful shard opposes.)

This is important, I think, because it shows that an agent can explore relatively freely without being super vulnerable to value-drift, and that you don't necessarily need complicated reflective reasoning to have (at least very basic) anti-value-drift mechanisms. Since reinforcement is a pretty gradual thing, you can often try an action you don't know much about, and if it turns out that this action has high reward but also direct implications that your already-existing powerful shards oppose, then the weak shard formed by that single reinforcement pass will be powerless.

Now the ML experiment idea:

A game where the agent gets rewarded for (e.g.) jumping high. After the agent gets somewhat trained, we continue training but introduce various 'powerups' the agent can pick up that increase or decrease the agent's jumping capacity. We train a little more, and now we introduce (e.g.) green potions that decrease the agent's jumping capacity but increase the reward multiplier (net positive for expected reward on balance).
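To make the reward structure concrete, here is a minimal sketch of how the item effects could be modelled. Everything here is an assumption for illustration: the names (`JumpState`, `ITEMS`, `pick_up`, `jump_reward`), the multiplicative form of the effects, and the specific numbers are all made up, chosen only so that the green potion comes out net positive and the red potion (discussed further down) comes out net negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JumpState:
    jump_capacity: float = 1.0      # scales how high a given effort gets the agent
    reward_multiplier: float = 1.0  # scales reward per unit of jump height


# Hypothetical item effects: (jump_capacity scale, reward_multiplier scale).
ITEMS = {
    "powerup_up": (1.5, 1.0),     # plain powerup: better jumping
    "powerup_down": (0.5, 1.0),   # plain powerup: worse jumping
    "green_potion": (0.5, 3.0),   # worse jumping, higher multiplier
    "red_potion": (1.5, 0.25),    # better jumping, lower multiplier
}


def pick_up(state: JumpState, item: str) -> JumpState:
    """Return the state after picking up an item (assumed multiplicative effects)."""
    cap_scale, mult_scale = ITEMS[item]
    return JumpState(
        jump_capacity=state.jump_capacity * cap_scale,
        reward_multiplier=state.reward_multiplier * mult_scale,
    )


def jump_reward(state: JumpState, effort: float) -> float:
    """Reward for one jump: multiplier times achieved height."""
    height = effort * state.jump_capacity
    return state.reward_multiplier * height


base = JumpState()
green = pick_up(base, "green_potion")
red = pick_up(base, "red_potion")
print(jump_reward(base, 1.0), jump_reward(green, 1.0), jump_reward(red, 1.0))
```

With these made-up numbers, a green potion halves jump height but still raises reward per jump, which is the tension the experiment is meant to probe.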

My weak hypothesis is that even though trying green potions gets a reinforcement event, the agent will avoid green potions after trying them. This is because there'd be a strong 'avoid things that decrease jumping capacity' shard already in place that will take charge once the agent learns to associate taking green potions with a decrease in jumping capacity. (Though maybe it's more complicated: maybe there will be a kind of race between 'taking green potions' getting reinforced and the association between taking green potions and a decrease in jumping capacity forming and activating the 'avoid things that decrease jumping capacity' shard.)
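The 'race' can be caricatured with a very crude toy model. This is not an RL agent and not a claim about how shards are actually implemented; it is a sketch under strong assumptions. The decision to take a green potion is modelled as a sigmoid of (directly reinforced tendency) minus (shard strength times learned association), and `simulate`, `shard_strength`, `lr_policy`, `lr_model` and `advantage` are all hypothetical names and values.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def simulate(shard_strength: float, lr_policy: float = 0.1, lr_model: float = 0.5,
             advantage: float = 0.5, trials: int = 20) -> list[float]:
    """Return the probability of taking a green potion at each trial.

    theta  : tendency built up by direct reinforcement of 'take green potion'
    belief : learned association 'green potion -> lower jumping capacity' (0..1)
    Updates are expected-value updates, weighted by the chance of actually
    trying the potion, so the run is deterministic.
    """
    theta, belief = 0.0, 0.0
    probs = []
    for _ in range(trials):
        # The pre-existing shard pushes against the action in proportion
        # to how strongly the association has formed.
        p = sigmoid(theta - shard_strength * belief)
        probs.append(p)
        theta += lr_policy * advantage * p        # slow, gradual reinforcement
        belief += lr_model * p * (1.0 - belief)   # faster world-model update
    return probs


strong = simulate(shard_strength=5.0)
absent = simulate(shard_strength=0.0)
print(f"strong shard: {strong[0]:.2f} -> {strong[-1]:.2f}")
print(f"no shard:     {absent[0]:.2f} -> {absent[-1]:.2f}")
```

With these made-up parameters the strong-shard run drifts away from the potion over the 20 trials, while the run with no opposing shard drifts towards it. Changing the hypothetical learning rates or shard strength changes who wins, which is all this toy is meant to show.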

Another interesting question: what will happen if we introduce (e.g.) red potions that increase the agent's jumping capacity but decrease the reward multiplier (net negative for expected reward on balance)? It seems clear that as the agent takes red potions over and over, the reinforcement process will eventually remove the disposition to take red potions, but would this also start to push the agent towards forming some kind of mental representation of 'reward' to model what's going on? If we introduce red potions first, then do some training, and then introduce green potions, would the experience with red potions make the agent respond differently (perhaps more like a reward maximiser) to trying green potions?
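The two orderings could be written down as training curricula, roughly like this. The phase names, episode counts, and item names are hypothetical, and keeping red potions available once green potions appear is my assumption; the description above doesn't settle that.

```python
# Each phase: (name, number of episodes, items present in the environment).
# All lengths and names are placeholders for illustration.
POWERUPS = {"powerup_up", "powerup_down"}

BASELINE = [
    ("jump_only", 1000, set()),
    ("powerups", 500, POWERUPS),
    ("green", 500, POWERUPS | {"green_potion"}),
]

RED_FIRST = [
    ("jump_only", 1000, set()),
    ("powerups", 500, POWERUPS),
    ("red", 500, POWERUPS | {"red_potion"}),
    # Assumption: red potions stay available after green ones are introduced.
    ("green", 500, POWERUPS | {"red_potion", "green_potion"}),
]


def items_available(curriculum, episode: int) -> set:
    """Return the set of items present at a given episode index."""
    start = 0
    for _name, length, items in curriculum:
        if episode < start + length:
            return items
        start += length
    # Past the end of the schedule: stay in the final phase.
    return curriculum[-1][2]


print(sorted(items_available(BASELINE, 1600)))
print(sorted(items_available(RED_FIRST, 1600)))
```

Comparing how often the agent takes green potions in the final phase of each curriculum would be one way to look for the 'more like a reward maximiser' effect.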

Nice! I think the general lesson here might be that when an agent has predictive representations (like those from a model, or those from a value function, or successor representations) the updates from those predictions can "outpace" the updates from the base credit assignment algorithm, by changing stuff upstream of the contexts that that credit assignment acts on.

Did you ever try this experiment? I'm really curious how it turned out!

No, but I hope to have a chance to try something like it this year!
