To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
I didn’t mean it as a criticism, more as the way I understand it. Misalignment is a “definite” reason for pessimism - and therefore there’s some doubt about whether it will actually play out that way. Gradual disempowerment is less definite about what form the problems may take, but it’s also a more robust reason to think there is a risk.
That’s a good explanation of the distinction
I share your general feelings about shard theory, but think you were being a bit too stingy with credit in this particular case.
This seems different to “maximising rewards for the wrong reasons”. That view generally sees the reward maximised because it is instrumental for or aliased with the wrong goal. Here it’s just a separate behaviour that is totally unhelpful for maximising rewards but is learned as a reflex anyway.
Despite the apparent novelty of this interpretation, it doesn't actually obviate your broader thesis.
Wait, "wherefore" is probably better translated as "for what reason" than "why". But this makes it much more sensible! Romeo, Romeo, what makes you Romeo? Not your damn last name, that's for sure!
I see the gradual disempowerment story as a simple, outside-view-flavoured reason why things could go badly for many people. I think it’s outside-view flavoured because it’s a fairly direct answer to “well, things seem to have been getting better for people so far”. While, as you point out, misalignment seems to make the prospects much worse, it’s worth bearing in mind that the economic irrelevance of people also strongly supports the case for bad outcomes from misalignment. If people remained economically indispensable, even fairly serious misalignment could have non-catastrophic outcomes.
Someone I was explaining it to described it as “indefinite pessimism”.
Sorry but this is nonsense. JBlack's comment shows the argument works fine even if you take a lot of trouble to construct P(count|Y) to give a better answer.
But this isn't even particularly important, because for your objection to stand, it must be impossible to find any situation where P(A|Y) would give you a silly answer, which is completely false.
We did some related work: https://arxiv.org/pdf/2502.03490.
One of our findings was that with synthetic data, it was necessary to have e1->e2 as the first hop in some two-hop question and e2->e3 as the second hop in some two-hop question in order to learn e1->e3. This differs from your finding with "natural" facts: if e2->e3 is a "natural" fact, then it plausibly does appear as a second hop in some of the pretraining data. But you find generalization even when the synthetic e1->e2 is present only by itself, so there seems to be a further difference between natural facts and synthetic facts that appear as second hops.
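To make the condition concrete, here's a toy sketch (the entities, relations, and question templates are invented for illustration and are not from our actual dataset): the bridge fact e1->e2 has to show up as the first hop of some training two-hop question, and e2->e3 has to show up as the second hop of some other training two-hop question, before the held-out composition e1->e3 generalizes.

```python
# Toy illustration of the training condition (hypothetical entities and
# templates, not our actual data format).

# Synthetic atomic facts.
employer = {"Alice": "Acme", "Bob": "Acme"}   # e1 -> e2 ("bridge" facts)
hq_city  = {"Acme": "Zurich"}                 # e2 -> e3
ceo      = {"Acme": "Carol"}                  # an alternative second-hop relation

# Training two-hop questions:
#  - "Alice -> Acme" is exercised as a FIRST hop (composed with the CEO relation),
#  - "Acme -> Zurich" is exercised as a SECOND hop (composed with Bob's employer).
train = [
    ("Who is the CEO of Alice's employer?",
     ceo[employer["Alice"]]),                  # first-hop use of e1 -> e2
    ("In which city is Bob's employer headquartered?",
     hq_city[employer["Bob"]]),                # second-hop use of e2 -> e3
]

# Held-out composition e1 -> e3: in our synthetic setting, models only learned
# to answer this latently (without chain of thought) when BOTH of the above
# kinds of training questions were present.
test = ("In which city is Alice's employer headquartered?",
        hq_city[employer["Alice"]])

print(*train, test, sep="\n")
```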
We also found that learning synthetic two-hop reasoning seems to take about twice as many parameters (or twice as much "knowledge capacity") as learning only the one-hop questions from the same dataset, supporting the idea that, for transformers, learning to use a fact in either hop of a latent two-hop question requires something like learning that fact twice.
Did you try any experiments with a synthetic second hop instead of a synthetic first hop? It would be interesting to know whether "natural facts" can be composed flexibly with new facts or whether they can only be composed with new first hops. Our results suggest that there's a substantial cost to making facts latently composable, so I think it would be surprising if many facts were flexibly composable, especially if many of those facts were reasonably rare.