Sam Ringer

Wiki Contributions


Ah ok. If a reward function is taken as a preference ordering then you are right the model is optimizing for reward as the preference ranking is literally identical.

I think the reason we have been talking past each other is in my head when I think of "reward function" I am literally thinking of the reward function (i.e the actual code), and when I think of "reward maximiser" I think of a system that is trying to get that piece of code to output a high number.

So I guess it's a case of us needing to be very careful by exactly what we mean by reward function, and my guess is as long as we use the same definition then we are in agreement? Does that make sense?

This is an interesting point. I can imagine a case where our assigned reward comes from a simple function (e.g reward = number of letter 'e's in output) and we also have a model which is doing some internal optimization to maximise the number of 'e's produced in its output, so it is "goal-directed to produce lots of 'e's".

Even in this case, I would still say this model isn't a "reward maximiser". It is a "letter 'e' maximiser".

(I also want to acknowledge that thinking this through makes me feel somewhat confused. I think what I said is correct. My guess is the misunderstanding I highlight in the post is quite pervasive, and the language we use isn't current up-to-scratch to write about these things clearly. Good job thinking of a case that is pushing against my understanding!)

Thanks for the feedback!

From reading your linked comment, I think we agree about selection arguments. In the post, when I mention "selection pressure towards a model", I generally mean "such a model would score highly on the reward metric" as opposed to "SGD is likely to reach such a model". I believe the former is correct and the later is very much an open-question.

To second what I think is your general point, a lot of the language used around selection can be confusing because it confounds "such a solution would do well under some metric" with "your optimization process is likely to produce such a solution". The wolves with snipers example illustrates this pretty clearly. I'm definitely open to ideas for better language to distinguish the two cases!

I agree with the sentiment of this. When writing this I was aware that this would not apply to all cases of RL.

However, I think I disagree with w.r.t non-vanilla policy gradient methods (TRPO, PPO etc). Using advantage functions and baselines doesn't change how things look from the perspective of the policy network. It is still only observing the environment and taking appropriate actions, and never "sees" the reward. Any advantage functions are only used in step 2 of my example, not step 1. (I'm sure there are schemes where this is not the case, but I think I'm correct for PPO.)

I'm less sure of this, but even in model-based systems, Q-learning etc, the planning/iteration happens with respect to the outputs of a value network, which is trained to be correlated with the reward, but isn't the reward itself. For example, I would say that the MCTS procedure of MuZero does "want" something, but that thing is not plans that get high reward, but plans that score highly according to the system's value function. (I'm happy to be deferential on this though.)

The other interesting case is Decision Transformers. DTs absolutely "get" reward. It is explicitly an input to the model! But I mentally bucket them up as generative models as opposed to systems that "want" reward.

Instrumental convergence only comes into play when there are free variables in action space which are optimized with respect to their consequences.

I roughly get what this is gesturing at, but I'm still a bit confused. Does anyone have any literature/posts they can point me at which may help explain?

Also great post janus! It has really updated my thinking about alignment.

I think the failure case identified in this post is plausible (and likely) and is very clearly explained so props for that!

However, I agree with Jacob's criticism here. Any AGI success story basically has to have "the safest model" also be "the most powerful" model, because of incentives and coordination problems.

Models that are themselves optimizers are going to be significantly more powerful and useful than "optimizer free" models. So the suggestion of trying to avoiding mesa-optimization altogether is a bit of a fabricated option. There is an interesting parallel here with the suggestion of just "not building agents" (

So from where I am sitting, we have no option but to tackle aligning the mesa-optimizer cascade head-on.

This post seems to be using a different meaning of "consequentialism" to what I am familiar with (that of moral philosophy). Subsequently, I'm struggling to follow the narrative from "consequentialism is convergently instrumental" onwards.

Can someone give me some pointers of how I should be interpreting the definition of consequentialism here? If it is just the moral philosophy definition, then I'm getting very confused as to why "judge morality of actions by their consequences" is a useful subgoal for agents to optimize against...

Load More