All of Sam Ringer's Comments + Replies

Ah ok. If a reward function is taken as a preference ordering, then you are right that the model is optimizing for reward, as the preference ranking is literally identical.

I think the reason we have been talking past each other is that, in my head, when I think of "reward function" I am literally thinking of the reward function itself (i.e. the actual code), and when I think of "reward maximiser" I think of a system that is trying to get that piece of code to output a high number.

So I guess it's a case of us needing to be very careful about exactly what we mean by "reward function", and my guess is that as long as we use the same definition then we are in agreement? Does that make sense?

3DragonGod5mo
It doesn't have to be a preference ordering. My point was that, depending on the level of detail at which you consider the reward function, slightly different functions could be identical. I don't think it makes sense to tie a reward function to a piece of code; a function can have multiple implementations. My contention is that it seems possible for the model's objective function to be identical (at the level of detail we care about) to the reward function. In that case, I think the model is indistinguishable from a reward maximiser, and it doesn't make sense to say that it's not a reward maximiser.

This is an interesting point. I can imagine a case where our assigned reward comes from a simple function (e.g. reward = number of letter 'e's in the output), and we also have a model which is doing some internal optimization to maximise the number of 'e's produced in its output, so it is "goal-directed to produce lots of 'e's".

Even in this case, I would still say this model isn't a "reward maximiser". It is a "letter 'e' maximiser".
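To make the distinction concrete, here is a toy sketch (illustrative Python; every name in it is made up for this example). The reward function and the model's internal objective are different objects that just happen to agree on every output:

```python
# The "reward function" as a literal piece of code:
def reward_fn(output: str) -> float:
    return output.count("e")

# The model's internal objective: "produce lots of 'e's". It never
# represents or calls reward_fn; it merely agrees with it on every output.
def model_internal_objective(candidate: str) -> float:
    return sum(1 for ch in candidate if ch == "e")

candidates = ["sky", "tree", "eevee"]
best = max(candidates, key=model_internal_objective)   # picks "eevee"
assert reward_fn(best) == max(reward_fn(c) for c in candidates)
```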

(I also want to acknowledge that thinking this through makes me feel somewhat confused. I think what I said is correct. My guess is...

2DragonGod5mo
But "reward" is governed by the number of letter 'e's in output! If the objective function that the model optimises for, and the reward function are identical[1], then saying the model is not a reward maximiser seems to me like a distinction without a difference. Ā  (Epistemic status: you are way more technically informed than this on me, I'm just trying to follow your reasoning.) 1. ^ Modulo transformations that preserve the properties we're interested in (e.g. consider a utility function as solely representing a preference ordering the preference ordering is left unchanged by linear transformations [just scaling up all utilities uniformly or adding the same constant to all utilities preserves the ordering]).

Thanks for the feedback!


From reading your linked comment, I think we agree about selection arguments. In the post, when I mention "selection pressure towards a model", I generally mean "such a model would score highly on the reward metric" as opposed to "SGD is likely to reach such a model". I believe the former is correct and the latter is very much an open question.

To second what I think is your general point, a lot of the language used around selection can be confusing because it confounds "such a solution would do well under some metric" with "your opti...

I agree with the sentiment of this. When writing the post I was aware that it would not apply to all cases of RL.

However, I think I disagree w.r.t. non-vanilla policy gradient methods (TRPO, PPO, etc.). Using advantage functions and baselines doesn't change how things look from the perspective of the policy network. It is still only observing the environment and taking appropriate actions, and never "sees" the reward. Any advantage functions are only used in step 2 of my example, not step 1. (I'm sure there are schemes where this is not the case, but I thi...
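For concreteness, here is a rough sketch of the two steps as I think of them (PyTorch plus an old-style gym API; all names are illustrative, and details like discounting and batching are omitted):

```python
import torch

def step_1_rollout(policy, env, horizon):
    # The policy network only maps observations to an action distribution.
    # Reward is recorded by the surrounding training loop; the policy
    # itself never receives it as an input.
    obs = env.reset()
    log_probs, rewards = [], []
    for _ in range(horizon):
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())  # discrete actions assumed
        rewards.append(reward)
        if done:
            break
    return log_probs, rewards

def step_2_update(log_probs, rewards, optimizer, baseline=0.0):
    # Reward (minus any baseline / advantage estimate) only shows up here,
    # as a scalar weight on the policy-gradient loss.
    ret = sum(rewards)  # crude undiscounted return
    loss = -(ret - baseline) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```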

5KanHar5mo
Given that weight sharing between actor and critic networks is common practice, and that the critic passes gradients (learned from the reward) back to the actor, for most practical purposes the actor gets all of the information it needs about the reward. This is the case for many common architectures.
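A minimal sketch of the shared-trunk setup being described (PyTorch-style, with made-up sizes): the critic head is trained directly on reward-derived targets, and its gradients flow back through the trunk that the actor also uses.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Trunk shared by actor and critic.
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.actor_head = nn.Linear(hidden, n_actions)  # action logits
        self.critic_head = nn.Linear(hidden, 1)         # value estimate

    def forward(self, obs):
        h = self.trunk(obs)
        return self.actor_head(h), self.critic_head(h)

# Training the critic on returns pushes reward information into the
# shared trunk, which the actor's action logits are computed from.
model = SharedActorCritic(obs_dim=4, n_actions=2)
obs = torch.randn(8, 4)
returns = torch.randn(8, 1)  # stand-in for reward-derived targets
logits, values = model(obs)
value_loss = ((values - returns) ** 2).mean()
value_loss.backward()
assert model.trunk[0].weight.grad is not None  # reward gradient reached the trunk
```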
2cfoster05mo
Yeah. For non-vanilla PG methods, I didn't mean to imply that the policy "sees" the rewards in step 1. I meant that a part of the agent (its value function) "sees" the rewards, in the sense that those are direct supervision signals used to train it in step 2, where we're determining the direction and strength of the policy update. And yeah, the model-based case is weirder. I can't recall whether or not predicted rewards (from the dynamics model, not from the value function) are a part of the upper confidence bound score in MuZero. If they are, I'd think it's fair to say that the overall policy (i.e. not just the policy network, but the policy w/ MCTS) "wants" reward.

Instrumental convergence only comes into play when there are free variables in action space which are optimized with respect to their consequences.


I roughly get what this is gesturing at, but I'm still a bit confused. Does anyone have any literature/posts they can point me at which may help explain?

Also great post janus! It has really updated my thinking about alignment.

1NicholasKees5mo
To me this statement seems mostly tautological. Something is instrumental if it is helpful in bringing about some kind of outcome. The term "instrumental" is always (as far as I can tell) used in reference to some sort of consequence-based optimization.

Yeah, so thinking a little more, I'm not sure my original comment conveyed everything I was hoping it would. I'll add that even if you could get a side of A4 explaining AI x-risk in front of a capabilities researcher at <big_capabilities_lab>, I think they would be much more likely to engage with it if <big_capabilities_lab> is mentioned.

I think arguments will probably be more salient if they include "and you personally, intentionally or not, are entangled with this."

Saying that, I don't have any data about the above. I'm keen to hear any personal experiences anyone else might have in this area.

Ok, not sure I understand this. Are you saying "Big corps are both powerful and complicated. Trying to model their response is intractably difficult, so under that uncertainty you are better off just steering clear"?

2trevor1y
Yes, that's a very good way of putting it. I will be more careful to think about inferential distance [https://www.lesswrong.com/posts/HLqWn5LASfhhArZ7w/expecting-short-inferential-distances] from now on.

I think it's good that someone is bringing this up. I think as a community we want to be deliberate and thoughtful with this class of things.

That being said, my read is that the main failure mode with advocacy at the moment isn't "capabilities researchers are having emotional responses to being called out which is making it hard for them to engage seriously with x-risk."
It's "they literally have no idea that anyone thinks what they are doing is bad."

Consider FAIR trying their hardest to open-source capabilities work with OPT. The tone and content of the re...

2[comment deleted]1y
2trevor1y
I've done quite a bit of thinking about this, and I'm pretty familiar with the area. If a corporation has a brand, and you have no idea how powerful, aggressive, or exploitative that corporation is (e.g. Facebook, Disney, etc.), then it's best not to write anything that calls out that brand. If you go on Reddit and write something publicly about how awful Dr. Pepper is, then you're entangling yourself in the ongoing conflict between Coca-Cola and Pepsi, whether you know about it or not. And if you don't know what you're getting into, or even aren't sure, then you certainly aren't prepared to model the potential consequences.

I think the failure case identified in this post is plausible (and likely) and is very clearly explained, so props for that!

However, I agree with Jacob's criticism here. Any AGI success story basically has to have "the safest model" also be "the most powerful" model, because of incentives and coordination problems.

Models that are themselves optimizers are going to be significantly more powerful and useful than "optimizer-free" models. So the suggestion of trying to avoid mesa-optimization altogether is a bit of a fabricated option. There is an interestin...

This post seems to be using a different meaning of "consequentialism" from the one I am familiar with (that of moral philosophy). As a result, I'm struggling to follow the narrative from "consequentialism is convergently instrumental" onwards.

Can someone give me some pointers on how I should be interpreting the definition of consequentialism here? If it is just the moral philosophy definition, then I'm getting very confused as to why "judge the morality of actions by their consequences" is a useful subgoal for agents to optimize against...