1、Although I could not find any papers that analyze transformers with information theory, there is research on DNNs and information theory that characterizes sample complexity and generalization error in terms of **mutual information**. For example: https://www.youtube.com/watch?v=XL07WEc2TRI
2、There are experiments that train an LM on LM-generated data and observe a loss of performance.
3、Ilya gave a lecture titled "An Observation on Generalization", which uses compression theory to understand self-supervised learning (SSL), the paradigm of LLM pretraining. This is among the few attempts to interpret LLMs mathematically. SSL is very different from supervised learning (and the research in 1、 actually focuses on supervised learning, AFAIK).
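
To make the compression view in 3、 a bit more concrete, here is a toy sketch. It is not taken from the lecture; it only illustrates the general intuition that a compressor which sees two datasets together can exploit structure they share. It assumes `zlib` as a very crude stand-in for an ideal compressor, and the helper names `clen` and `joint_saving` are hypothetical, introduced just for this example.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Compressed size in bytes; zlib is only a crude proxy for description length."""
    return len(zlib.compress(data, 9))


def joint_saving(x: bytes, y: bytes) -> int:
    """Bytes saved by compressing x and y together instead of separately."""
    return clen(x) + clen(y) - clen(x + y)


rng = random.Random(0)  # fixed seed so the toy data is reproducible
x = bytes(rng.getrandbits(8) for _ in range(2000))

# "Related" data: a copy of x with 20 positions overwritten.
y = bytearray(x)
for i in rng.sample(range(len(y)), 20):
    y[i] ^= 0xFF
y = bytes(y)

# "Unrelated" data: fresh random bytes that share no structure with x.
z = bytes(rng.getrandbits(8) for _ in range(2000))

saving_related = joint_saving(x, y)
saving_unrelated = joint_saving(x, z)
print(saving_related, saving_unrelated)
```

The saving should be large for the related pair and close to zero for the unrelated pair. Loosely speaking, the more structure the unlabeled data shares with the data you care about, the more a good joint compressor (or predictor) can gain from it. Real LLM pretraining is of course far richer than this toy.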