Thoughts on the Alignment Implications of Scaling Language Models

I think a major crux is that the things you couldn't impart to Mary through language (assuming such things exist at all) would be wishy-washy stuff like qualia, whose existence essentially doesn't matter for the predictive accuracy of a nonhuman system modelling humans. In other words, a universe where Mary does learn something new upon leaving the room and a universe where she doesn't are indistinguishable from the outside, so whether that difference shows up in a model's world model is irrelevant.