This is an interesting point. I can imagine a case where our assigned reward comes from a simple function (e.g. reward = number of letter 'e's in output) and we also have a model which is doing some internal optimization to maximise the number of 'e's produced in its output, so it is "goal-directed to produce lots of 'e's".
Even in this case, I would still say this model isn't a "reward maximiser". It is a "letter 'e' maximiser".
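To make the distinction concrete, here is a minimal sketch. Everything in it is hypothetical: the candidate strings, the two agent functions and the `swapped_reward` function are made up for illustration and are not from the post. The two agents behave identically under the assigned reward and only come apart when the reward function is swapped out.

```python
def reward(text: str) -> int:
    """The assigned reward from the example: number of 'e' characters."""
    return text.count("e")


def swapped_reward(text: str) -> int:
    """Hypothetical alternative reward, used only to tell the agents apart."""
    return text.count("a")


def e_maximiser(candidates):
    """Picks the output with the most 'e's. It never consults a reward function."""
    return max(candidates, key=lambda t: t.count("e"))


def reward_maximiser(candidates, reward_fn):
    """Picks whichever output the supplied reward function scores highest."""
    return max(candidates, key=reward_fn)


# Hypothetical candidate outputs.
candidates = ["banana", "eel", "tree", "beekeeper"]

# Under the assigned reward the two agents are indistinguishable.
same_choice = e_maximiser(candidates) == reward_maximiser(candidates, reward)

# Swap the reward: the 'e' maximiser keeps producing 'e's,
# while the reward maximiser follows the new function.
e_choice_after_swap = e_maximiser(candidates)
r_choice_after_swap = reward_maximiser(candidates, swapped_reward)
```

On this toy picture, the model I am describing corresponds to `e_maximiser`: it scores well on the reward, but the reward function itself plays no part in how it chooses.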
(I also want to acknowledge that thinking this through makes me feel somewhat confused. I think what I said is correct. My guess is...
Thanks for the feedback!
From reading your linked comment, I think we agree about selection arguments. In the post, when I mention "selection pressure towards a model", I generally mean "such a model would score highly on the reward metric" as opposed to "SGD is likely to reach such a model". I believe the former is correct and the latter is very much an open question.
To second what I think is your general point, a lot of the language used around selection can be confusing because it conflates "such a solution would do well under some metric" with "your opti...
I agree with the sentiment of this. When writing this I was aware that this would not apply to all cases of RL.
However, I think I disagree w.r.t. non-vanilla policy gradient methods (TRPO, PPO etc). Using advantage functions and baselines doesn't change how things look from the perspective of the policy network. It is still only observing the environment and taking appropriate actions, and never "sees" the reward. Any advantage functions are only used in step 2 of my example, not step 1. (I'm sure there are schemes where this is not the case, but I thi...
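A minimal sketch of that separation, under loud assumptions: this is a toy two-armed bandit with a softmax policy and a running-mean baseline, not TRPO or PPO, and I am assuming "step 1" means acting and "step 2" means updating the parameters. All names here (`policy`, `environment_reward`, `train`) are hypothetical. The thing to notice is that `policy` takes no reward argument; the reward and the advantage only appear in the update.

```python
import math
import random


def policy(observation, theta):
    """Step 1: map an observation to action probabilities (softmax).

    Note the signature: the reward is not an input.
    """
    logits = [t * observation for t in theta]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def environment_reward(action):
    """Toy environment (assumption): action 1 pays 1.0, action 0 pays 0.0."""
    return 1.0 if action == 1 else 0.0


def train(steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # one logit weight per action
    baseline = 0.0       # running mean of past rewards
    observation = 1.0    # constant observation in this toy setting
    for _ in range(steps):
        # Step 1: the policy observes and acts. No reward is visible here.
        probs = policy(observation, theta)
        action = rng.choices([0, 1], weights=probs)[0]

        # Step 2: the training loop uses reward and advantage to update theta.
        r = environment_reward(action)
        advantage = r - baseline
        for a in range(len(theta)):
            # Gradient of log softmax probability of the taken action.
            grad_logp = ((1.0 if a == action else 0.0) - probs[a]) * observation
            theta[a] += lr * advantage * grad_logp
        baseline += 0.05 * (r - baseline)
    return theta


theta = train()
final_probs = policy(1.0, theta)
```

Adding the baseline changes the size of the update in step 2, but the inputs to `policy` in step 1 are the same as in the vanilla case.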
Instrumental convergence only comes into play when there are free variables in action space which are optimized with respect to their consequences.
I roughly get what this is gesturing at, but I'm still a bit confused. Does anyone have any literature/posts they can point me at which may help explain?
Also, great post, janus! It has really updated my thinking about alignment.
Yeah, so thinking a little more, I'm not sure my original comment conveyed everything I was hoping to. I'll add that even if you could get a side of A4 explaining AI x-risk in front of a capabilities researcher at <big_capabilities_lab>, I think they would be much more likely to engage with it if <big_capabilities_lab> is mentioned.
I think arguments will probably be more salient if they include "and you personally, intentionally or not, are entangled with this."
Saying that, I don't have any data about the above. I'm keen to hear any personal experiences anyone else might have in this area.
Ok, I'm not sure I understand this. Are you saying "Big corps are both powerful and complicated. Trying to model their response is intractably difficult, so under that uncertainty you are better off just steering clear"?
I think it's good that someone is bringing this up. I think as a community we want to be deliberate and thoughtful with this class of things.
That being said, my read is that the main failure mode with advocacy at the moment isn't "capabilities researchers are having emotional responses to being called out which is making it hard for them to engage seriously with x-risk."
It's "they literally have no idea that anyone thinks what they are doing is bad."
Consider FAIR trying their hardest to open-source capabilities work with OPT. The tone and content of the re...
I think the failure case identified in this post is plausible (and likely) and is very clearly explained so props for that!
However, I agree with Jacob's criticism here. Any AGI success story basically has to have "the safest model" also be "the most powerful model", because of incentives and coordination problems.
Models that are themselves optimizers are going to be significantly more powerful and useful than "optimizer free" models. So the suggestion of trying to avoid mesa-optimization altogether is a bit of a fabricated option. There is an interestin...
This post seems to be using a different meaning of "consequentialism" from the one I am familiar with (that of moral philosophy). Consequently, I'm struggling to follow the narrative from "consequentialism is convergently instrumental" onwards.
Can someone give me some pointers on how I should be interpreting the definition of consequentialism here? If it is just the moral philosophy definition, then I'm getting very confused as to why "judge morality of actions by their consequences" is a useful subgoal for agents to optimize against...
Ah ok. If a reward function is taken as a preference ordering, then you are right: the model is optimizing for reward, as the preference ranking is literally identical.
I think the reason we have been talking past each other is that, in my head, when I think of "reward function" I am literally thinking of the reward function (i.e. the actual code), and when I think of "reward maximiser" I think of a system that is trying to get that piece of code to output a high number.
So I guess it's a case of us needing to be very careful about exactly what we mean by reward function, and my guess is that as long as we use the same definition, we are in agreement? Does that make sense?
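To illustrate the two readings, here is a rough sketch with hypothetical names and made-up candidate outputs. Two different pieces of code can induce exactly the same preference ordering over outputs, so "reward function as ordering" and "reward function as this specific code" really are different objects.

```python
def reward_code_a(text: str) -> float:
    """One concrete piece of reward code: count the 'e's."""
    return float(text.count("e"))


def reward_code_b(text: str) -> float:
    """Different code and different numbers, but the same ordering over
    outputs, since this is a positive affine transform of the 'e' count."""
    return 3.0 * text.count("e") + 7.0


def ranking(candidates, reward_fn):
    """The preference ordering a reward function induces, best first."""
    return sorted(candidates, key=reward_fn, reverse=True)


# Hypothetical candidate outputs with distinct 'e' counts (0, 2, 5).
candidates = ["banana", "tree", "beekeeper"]

ranking_a = ranking(candidates, reward_code_a)
ranking_b = ranking(candidates, reward_code_b)
```

On the ordering definition, a model that tops `ranking_a` is equally a maximiser of `reward_code_b`. On the "actual code" definition, the two are separate targets, because the numbers each piece of code outputs are different.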