Did you actually test whether Claude is less susceptible to this issue? Otherwise I'm not sure where your comment comes from. Testing this myself, I saw similar or worse behavior from that model, although GPT-4 definitely has this issue as well.
https://twitter.com/mobav0/status/1637349100772372480?s=20
What do you mean by the Scaling Hypothesis? Do you believe that extremely large transformer models trained with an autoregressive loss will have superhuman capabilities?
I think there is a big danger in relying solely on papers rather than running empirical tests.