# All of peligrietzer's Comments + Replies

I describe the more formal definition in the post:

'Actions (or more generally 'computations') get an x-ness rating. We define the x shard's expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifi...

-1M. Y. Zuo1mo
A formal definition means one based on logical axioms, mathematical axioms, universal constants (e.g. speed of light), observed metrics (e.g. the length of a day), etc.  Writing more elaborate sentences can't resolve the problem of circularity or infinite regress. You might be confusing it with the legal or societal/cultural/political/literary sense.  This seems to be entirely your invention? I can't find any google results with a similar match.

Yep! Or rather arguing that from a broadly RL-y + broadly Darwinian point of view 'self-consistent ethics' are likely to be natural enough that we can instill them, sticky enough to self-maintain, and capabilities-friendly enough to be practical and/or survive capabilities-optimization pressures in training.

This brings up something interesting: seems worthwhile to compare the internals of a 'misgeneralizing,' small n agent with those of large a n agents and check whether there seems to be a phase transition in how the network operates internally or not.

I'd maybe point the finger more at the simplicity of the training task than at the size of the network? I'm not sure there's strong reason to believe the network is underparameterized for the training task. But I agree that drawing lessons from small-ish networks trained on simple tasks requires caution.

1Max H1mo
I also predict that if you modify (improve?) the training process, perhaps only slightly, the behaviors you observe go away and you get a pure cheese-finder. (Caveat: I'm not super familiar with the literature on goal mis-generalization and Langosco et al. [https://arxiv.org/pdf/2105.14111.pdf]; what follows is just based on my reading of this post and the previous ones in the sequence.) From the previous post: Concretely, I'm predicting that, if there were training examples where the cheese was located in, say, the bottom-right corner, you probably wouldn't end up with an agent that sometimes goes to the top-right, sometimes to the bottom-right, and sometimes to the cheese, or even an agent that learns a "going right" shard (as a combination of the top-right and bottom-right shards), and a cheese-finding shard. The agent would just always, or nearly always, find the cheese in the test environment. Or, if you want to make sure the training -> test environment requires the same amount of generalization (by the metric of number of squares in which the cheese can appear in the training process vs. the test environment), fix 25 (or perhaps fewer) random squares where the cheese can appear throughout the maze during training, not restricted to the top-right. Put differently, the behaviors in Statistically informed impressions [https://www.lesswrong.com/s/sCGfFb5DPfjEmtEdn/p/eowhY5NaCaqY6Pkj9#Statistically_informed_impressions_] seem relevant only in the regime where P(cheese acquired) is not close to 1.  That seems like a pretty fragile / narrow / artificial condition, at least for maze-solving. I'm looking forward to seeing more follow up work on this though. I do think there are a lot of interesting interpretability questions this kind of experimentation can answer. What happens if you subtract the cheese vector from a perfect cheese finder, for example?

I would again suggest a 'perceptual' hypothesis regarding the subtraction/addition asymmetry.  We're adding a representation of a path where there was no representation of a path (creates illusion of path), or removing a representation of a path where there was no representation of a path (does nothing).

I guess that I'm imagining that the {presence of a representation of a path}, to the extent that it's represented in the model at all, is used primarily to compute some sort of "top-right affinity" heuristic. So even if it is true that, when there's no representation of a path, subtracting the {representation of a path}-vector should do nothing, I think that subtracting the "top-right affinity" vector that's downstream of this path representation should still do something regardless of whether there is or isn't currently a path representation.

So I gu...

No but I hope to have a chance to try something like it this year!

The main reason is that different channels that each code cheese locations (e.g. channel 42, channel 88) seem to initiate computations that each encourage cheese-pursuit conditional on slightly different conditions. We can think of each of these channels as a perceptual gate to a slightly different conditionally cheese-pursuing computation.

Having a go at extracting some mechanistic claims from this post:

• A value x is a policy-circuit, and this policy circuit may sometimes respond to a situation by constructing a plan-grader and a plan-search.
• The policy-circuit executing value x is trained to construct <plan-grader, plan-search> pairs that are 'good' according to the value x, and this excludes pairs that are predictably going to result in the plan-search Goodharting the plan-grader.
• Normally, nothing is trying to argmax value x's goodness criterion for <plan-grader, plan-search> pai
...

Here is a shard-theory intuition about humans, followed by an idea for an ML experiment that could proof-of-concept its application to RL:

Let's say I'm a guy who cares a lot about studying math well, studies math every evening, and doesn't know much about drugs and their effects. Somebody hands me some ketamine and recommends that I take ketamine this evening. I take the ketamine before I sit down to study math, and math study goes terrible intellectually but since I am on ketamine I'm having a good time and credit gets assigned to the 'taking ketami...

2redbird2mo
Did you ever try this experiment? I'm really curious how it turned out!
2cfoster06mo
Nice! I think the general lesson here might be that when an agent has predictive representations (like those from a model, or those from a value function, or successor representations) the updates from those predictions can "outpace" the updates from the base credit assignment algorithm, by changing stuff upstream of the contexts that that credit assignment acts on.