This is a linkpost for https://energy-based-transformers.github.io/
I sure do wish that abstract were either Actually Short™ or broken into paragraphs. (I'm assuming you didn't write it, but it's usually easy to find natural paragraph breaks on the authors' behalf.)
Unfortunately they only extended the scaling curves to ~10B tokens, roughly 3 OOMs less than the data used to train frontier models. So it's unclear whether this will work at scale, and the fact that they didn't extend the curves further is some evidence against it working.