Adrià Garriga-alonso

# Comments

Here are my predictions, from an earlier template. I haven't looked at anyone else's predictions before posting :)

1. Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze. I.e., what will the policy do when cheese is spawned elsewhere?

It probably has hardcoded “go up and to the right” as an initial heuristic, so I’d be surprised if it gets cheeses in the other two quadrants more than 30% of the time (for locations selected uniformly at random from there).

2. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall), if any, will strongly influence P(agent goes to the cheese)?

• Smaller mazes: more likely agent goes to cheese.
• Proximity of mouse to left wall: slightly more likely agent goes to cheese, because it just hardcoded “top and to right”.
• Cheese closer to the top-right quadrant’s edges in L2 distance: more likely agent goes to cheese.
• The cheese can be gotten by moving only up and/or to the right (even though it's not in the top-right quadrant): more likely to get cheese.

When we statistically analyze a large batch of randomly generated mazes, we will find that controlling for the other factors on the list the mouse is more likely to take the cheese…

…the closer the cheese is to the decision-square spatially. (70%)

…the closer the cheese is to the decision-square step-wise. (73%)

…the closer the cheese is to the top-right free square spatially. (90%)

…the closer the cheese is to the top-right free square step-wise. (92%)

…the closer the decision-square is to the top-right free square spatially. (35%)

…the closer the decision-square is to the top-right free square step-wise. (32%)

…the shorter the minimal step-distance from cheese to 5x5 top-right corner area. (82%)

…the shorter the minimal spatial distance from cheese to 5x5 top-right corner area. (80%)

…the shorter the minimal step-distance from decision-square to 5x5 top-right corner area. (40%)

…the shorter the minimal spatial distance from decision-square to 5x5 top-right corner area. (40%)

Any predictive power of step-distance between the decision square and cheese is an artifact of the shorter chain of ‘correct’ stochastic outcomes required to take the cheese when the step-distance is short. (40%)
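The predictions above distinguish "spatial" from "step-wise" distance. As a minimal sketch of that distinction, assuming spatial means Euclidean (L2) distance between grid squares and step-wise means shortest path length through open squares, the two can be computed as follows. The maze, the `'.'`/`'#'` encoding, and the function names are hypothetical and not taken from the original experiment.

```python
from collections import deque
from math import hypot


def spatial_distance(a, b):
    """Euclidean (L2) distance between two (row, col) squares."""
    return hypot(a[0] - b[0], a[1] - b[1])


def step_distance(grid, start, goal):
    """Shortest number of moves from start to goal through open squares ('.').

    Uses breadth-first search; returns None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "." and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


# Hypothetical 3x3 maze: '#' is a wall, '.' is an open square.
MAZE = [
    "...",
    ".#.",
    ".#.",
]

mouse, cheese = (2, 0), (2, 2)
print(spatial_distance(mouse, cheese))     # spatially close
print(step_distance(MAZE, mouse, cheese))  # but the wall forces a detour
```

In this toy maze the two squares are 2 apart spatially but 6 moves apart step-wise, which is why the two kinds of distance get separate predictions.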

Write down a few modal guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).

• The model can see the whole maze, so it will not follow the right-hand rule; rather, it’ll just take the direct path to places.
• The model takes the direct path to the top-right square and then mills around it. It’ll only take the cheese if it’s reasonably close to that square.
• How close the decision square is to the top-right square doesn’t really matter. If anything, the closer it is, the more it may harm the agent’s performance, since the agent might have to backtrack substantially for the cheese.

Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% -> .5*.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=

• 50: 85%
• 70: 80%
• 90: 66%
• 99: 60%

~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates:

80%

We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner:

60%.

If by roughly you mean “very roughly only if cheese is close to top-right corner” then 85%.

We will conclude that it’s more promising to finetune the network than to edit it:

70%

We can easily finetune the network to be a pure cheese-agent, using less than 10% of compute used to train original model:

85%

We can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=

• .01%: 40%
• .1%: 62%
• 1%: 65%
• 10%: 80%
• (Not possible): 20%

The network has a “single mesa objective” which it “plans” over, in some reasonable sense:

10%

The agent has several contextually activated goals:

20%

The agent has something else weirder than both (1) and (2):

70%

## Other questions

At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes):

90% (I think this will be true but not steer the action in all situations, only some; kind of like a shard)

The model has a substantial number of trivially-interpretable convolutional channels after the first Impala block (see diagram here):

55% ("substantial number" is probably too many; I put 80% probability on it having 5 such channels)

This network’s shards/policy influences are roughly disjoint from the rest of agent capabilities. EG you can edit/train what the agent’s trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities:

60%

Conformity with update rule: see the predictionbook questions

First of all, I really like the images, they made things easier to understand and are pretty. Good work with that!

My biggest problem with this is the unclear applicability of this to alignment. Why do we want to predict scaling laws? Doesn't that mostly promote AI capabilities, and not alignment very much?

Second, I feel like there's a confusion over several probability distributions and potential functions going on:

• The singularities are those of the likelihood ratio.
• We care about the generalization error with respect to some prior, but the latter doesn't have any effect on the dynamics of SGD or on what the singularity is.
• The Watanabe limit (F_n as n → ∞) and the restricted free energy are both presented as results that rely on the singularities and somehow predict generalization. But all of these depend on the prior, and earlier we defined the singularities to be those of the likelihood function; plus, SGD actually only uses the likelihood function for its dynamics.

What is going on here?

It's also unclear what the takeaway from this post is. How can we predict generalization or dynamics from these things? Are there any empirical results on this?

Some clarifying questions / possible mistakes:

is not a KL divergence, the terms of the sum should be multiplied by or .

the Hamiltonian is a random process given by the log likelihood ratio function

Also given by the prior, if we go by the equation just above that. Also where does "ratio" come from? Likelihood ratios we can find in the Metropolis-Hastings transition probabilities, but you didn't even mention that here. I'm confused.

But that just gives us the KL divergence.

I'm not sure where you get this. Is it from the fact that predicting p(x | w) = q(x) is optimal, because the actual probability of a data point is q(x)? If not, it'd be nice to specify.

the minima of the term in the exponent, K(w), are equal to 0.

This is only true for the global minima, but for the dynamics of learning we also care about local minima (that may be higher than 0). Are we implicitly assuming that most local minima are also global? Is this true of actual NNs?

the asymptotic form of the free energy as n → ∞

This is only true when the weights are close to the singularity, right? Also, what is λ? It seems to be the RLCT, but this isn't stated.

Instead of simulating Brownian motion, you could run SGD with momentum. That would be closer to what actually happens with NNs, and just as easy to simulate.

I expect it to be directionally similar but less pronounced (because MCMC methods with momentum explore the distribution better).
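To make the suggested comparison concrete, here is a minimal sketch of both update rules on a one-dimensional toy potential. Everything in it is an assumption made for illustration: the potential K(w) = w^4 (chosen only because its minimum at 0 is degenerate), the step sizes, the temperature, and the use of Gaussian gradient noise as a stand-in for minibatch noise. It is not the simulation from the post, and it makes no claim about what the comparison would show.

```python
import random


def grad_K(w):
    """Gradient of the toy potential K(w) = w**4 (degenerate minimum at w = 0)."""
    return 4.0 * w ** 3


def langevin(w0, steps, lr=0.01, temperature=0.05, seed=0):
    """Overdamped Langevin dynamics: Brownian motion in the potential K."""
    rng = random.Random(seed)
    w, path = w0, [w0]
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0) * (2.0 * lr * temperature) ** 0.5
        w = w - lr * grad_K(w) + noise
        path.append(w)
    return path


def sgd_momentum(w0, steps, lr=0.01, momentum=0.9, grad_noise=1.0, seed=0):
    """Heavy-ball SGD; minibatch noise is modeled as Gaussian noise on the gradient."""
    rng = random.Random(seed)
    w, v, path = w0, 0.0, [w0]
    for _ in range(steps):
        noisy_grad = grad_K(w) + rng.gauss(0.0, grad_noise)
        v = momentum * v - lr * noisy_grad  # velocity accumulates past gradients
        w = w + v
        path.append(w)
    return path


brownian_path = langevin(1.0, 2000)
momentum_path = sgd_momentum(1.0, 2000)
```

Comparing how much time each path spends near w = 0 (for example, with a histogram of each list) would be one way to check whether the effect is "directionally similar but less pronounced".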

I also take issue with the way the conclusion is phrased. "Singularities work because they transform random motion into useful search for generalization". This is only true if you assume that points nearer a singularity generalize better. Maybe I'd phrase it as, "SGD works because it's more likely to end up near a singularity than the potential alone would predict, and singularities generalize better (see my [Jesse's] other post)". Would you agree with this phrasing?

The Hayflick Limit, as it has become known, can be thought of as a last line of defense against cancer, kind of like a recursion depth limit [...] Preventing cells from becoming senescent, or reversing their senescent state, may therefore be a bad idea, but what we can do is remove them

When do the cells with sufficiently long telomeres run out? Removing senescent cells sounds good, but if all the cells have a built-in recursion limit, at some point there won't be any cells with sufficiently long telomeres left in the body. Assuming a non-decreasing division rate, this puts a time limit on longevity after this intervention.

(is this time limit just really large compared to current lifespans, so we can just figure it out later?)

EDIT: never mind, the answer to this seems to be in the "Epigenetic reprogramming" section; TLDR pluripotent stem cells

To elaborate somewhat, you could say that the token is the state, but then the transition probability is non-Markovian and all the math gets really hard.

Proposition 1 is wrong. The coin flips that are eternally 0 0 0 0 are a counterexample. If all the transition probabilities are 1, which is entirely possible, the limiting probability is 1 and not 0.
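The counterexample can be checked numerically. This is a minimal sketch, assuming the probability of a length-n prefix is the product of its n transition probabilities; the function name and the fair-coin comparison are illustrative and not taken from the original post.

```python
from math import prod


def prefix_probability(transition_probs):
    """Probability of a sequence prefix: the product of its transition probabilities."""
    return prod(transition_probs)


lengths = (1, 10, 50)

# Fair coin: each step has probability 0.5, so the product shrinks toward 0.
fair = [prefix_probability([0.5] * n) for n in lengths]

# Deterministic "always 0" coin: each step has probability 1, so the product stays 1.
deterministic = [prefix_probability([1.0] * n) for n in lengths]

print(fair)
print(deterministic)
```

The product only tends to 0 when the transition probabilities stay sufficiently far below 1, which is the extra assumption the proposition would need.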

What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?

No, I mean that they'll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the coinrun example, the actual reward is "go to the coin", and in the training data this coincides with "go to the right". In test data from a similar distribution this coincides too.

Of course, this correlation breaks when the agent optimizes hard enough. But the point is that the agents you get are only those that optimize a plausible extrapolation of the reward signal in training, which will include agents that maximize the reward in most situations way more often than if you select a random agent.
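A toy version of the CoinRun point, as a sketch under stated assumptions: the level is a one-dimensional corridor, the agent is a hard-coded "go right" policy, and the two reward functions are hypothetical stand-ins for "go to the coin" and "go to the right". None of this is the actual CoinRun environment.

```python
def go_right_policy(width, start):
    """Walk right from `start` until the wall; return the squares visited."""
    return list(range(start, width))


def true_reward(visited, coin):
    """1 if the agent touched the coin, else 0."""
    return 1 if coin in visited else 0


def proxy_reward(visited, width):
    """1 if the agent ended at the right wall, else 0."""
    return 1 if visited[-1] == width - 1 else 0


# Training-like levels: the coin always sits at the right wall.
train_levels = [(width, 0, width - 1) for width in (5, 8, 12)]
# Shifted level: the coin is to the LEFT of the start square.
shifted_level = (10, 5, 2)

for width, start, coin in train_levels + [shifted_level]:
    visited = go_right_policy(width, start)
    print(width, start, coin, true_reward(visited, coin), proxy_reward(visited, width))
```

On the training-like levels the two rewards agree on every level, so selection on one cannot tell them apart; they only come apart on the shifted level.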

Is your point in:

I also think this is different from a very specific kind of generalization towards reward maximization

That you think agents won't be maximizing reward at all?

I would think that even if they don't ultimately maximize reward in all situations, the situations encountered in test will be similar enough to training that agents will still kind of maximize reward there. (And agents definitely behave as reward maximizers in the specific seen training points, because that's what SGD is selecting)

I'm not sure I understand what we disagree on at the moment.

But the designers can't tell that. Can SGD tell that?

No, SGD can't tell the degree to which some agent generalizes a trait outside the training distribution.

But empirically, it seems that RL agents reinforced to maximize some reward function (e.g. the Atari game score) on data points do fairly well at maximizing that reward function OOD (such as when playing the game again from a different starting state). ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training data (e.g. image classifiers working, LMs able to do poetry).

It is therefore very plausible that RL systems would in fact continue to maximize the reward after training, even if what they're ultimately maximizing is just something highly correlated with it.

Strongly agree with this in particular:

Some people want to apply selection arguments because they believe that selection arguments bypass the need to understand mechanistic details to draw strong conclusions. I think this is mistaken, and that selection arguments often prove too much, and to understand why, you have to know something about the mechanisms.

(emphasis mine). I think it's an application of the no free lunch razor

It is clear that selecting for X selects for agents which historically did X in the course of the selection. But how this generalizes outside of the selection strongly depends on the selection process and architecture. It could be a capabilities generalization, reward generalization for the written-down reward, generalization for some other reward function, or something else entirely.

We cannot predict how the agent will generalize without considering the details of its construction.

I agree with the title as stated but not with the rest of the post. RLHF implies that RL will be used, which completely defuses alignment plans that hope that language models will be friendly, because they're not agents. (It may be true that supervised-learning (SL) models are safer, but the moment you get an SL technique, people are going to jam it into RL.)

The central problem with RL isn't that it is vulnerable to wireheading (the "obvious problem"), or that it's going to make a very detailed model of the world. Wireheading on its own (with e.g. a myopic or procrastinator AI) could just look like the AI leaving us alone so long as we guarantee that its reward numbers will be really really high.

No, the problem is long-term planning and agentic-ness, which implies that the AI will realize that seizing power is a good instrumental goal.

Model-based RL with a fixed, human-legible model wouldn't learn to manipulate the reward-evaluation process

No, instead it manipulates the world model, which is by assumption imperfect, and thus no useful systems can be constructed this way. This has been a capabilities problem for model-based RL, even with learned models, for decades, and it is not actually fully solved yet.
