I didn’t mean it as a criticism, more as the way I understand it. Misalignment is a “definite” reason for pessimism, which makes me somewhat doubtful about whether it will actually play out that way. Gradual disempowerment is less definite about what form the problems may take, but it is also a more robust reason to think there is a risk.
That’s a good explanation of the distinction
I share your general feelings about shard theory, but think you were being a bit too stingy with credit in this particular case.
This seems different from “maximising reward for the wrong reasons”. That view generally sees the reward maximised because doing so is instrumental for, or aliased with, the wrong goal. Here it’s just a separate behaviour that is totally unhelpful for maximising reward but is learned as a reflex anyway.
Given the apparent novelty of this interpretation, it doesn't actually undercut your broader thesis.
Wait, wherefore is probably better translated as "for what reason" than "why". But this makes it much more sensible! Romeo Romeo, what makes you Romeo? Not your damn last name, that's for sure!
I see the gradual disempowerment story as a simple, outside-view-flavoured reason why things could go badly for many people. I call it outside-view-flavoured because it’s a fairly direct answer to “well, things seem to have been getting better for people so far”. While, as you point out, misalignment seems to make the prospects much worse, it’s worth bearing in mind that the economic irrelevance of people also strongly supports the case for bad outcomes from misalignment: if people remained economically indispensable, even fairly serious misalignment could have non-catastrophic outcomes.
Someone I was explaining it to described it as “indefinite pessimism”.
Sorry, but this is nonsense. JBlack's comment shows the argument works fine even if you take a lot of trouble to construct P(count|Y) so that it gives a better answer.
But this isn't even particularly important, because for your objection to stand it must be impossible to find any situation where P(A|Y) gives a silly answer, which is completely false.
You’ve just substituted a different proposition and then claimed that the implication doesn’t hold because it doesn’t hold for your alternative proposition. “We’re counting kids” absolutely implies “the count can be represented by a nonnegative int32”. If I want to show that an argument is unsound, I am allowed to choose the propositions that demonstrate its unsoundness.
To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
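To make the policy-gradient intuition concrete, here's a minimal REINFORCE sketch (a toy I made up for this comment, not code from any of the posts under discussion). Reward depends only on a "work" action; an independent "reflex" action contributes nothing. With no baseline and positive-only rewards, the reflex action's expected update is zero, but its logit still gets pushed around every rewarded episode, so it random-walks and can entrench as a habit that never helped earn reward.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Policy: two independent Bernoulli action heads.
# "work" actually earns reward; "reflex" has no effect on it.
work_logit = 0.0
reflex_logit = 0.0
lr = 0.5

for episode in range(2000):
    p_work = sigmoid(work_logit)
    p_reflex = sigmoid(reflex_logit)
    work = random.random() < p_work
    reflex = random.random() < p_reflex
    reward = 1.0 if work else 0.0  # reflex is irrelevant to reward

    # REINFORCE without a baseline: every action taken in a rewarded
    # episode gets its log-probability pushed up, relevant or not.
    # grad of log p(a) w.r.t. logit is (a - p) for a Bernoulli head.
    work_logit += lr * reward * (float(work) - p_work)
    reflex_logit += lr * reward * (float(reflex) - p_reflex)

print("p(work)  =", round(sigmoid(work_logit), 3))
print("p(reflex)=", round(sigmoid(reflex_logit), 3))
```

The work head reliably converges towards always acting (its updates are all positive), while the reflex head drifts despite being useless, which is roughly the "learned as a reflex anyway" picture rather than "maximising reward for the wrong reasons".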