The scaling laws for neural language models (Kaplan et al., 2020) showed that cross-entropy loss follows a power law in each of three factors: model size, dataset size, and training compute.
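In the form reported by Kaplan et al. (2020), each factor alone predicts the loss as a power law:

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
$$

where $N$ is the number of parameters, $D$ the dataset size in tokens, and $C$ the training compute.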
The lower the cross-entropy loss, the better the model's next-token prediction. Since prediction and compression are deeply linked (see "The Intricate Link Between Compression and Prediction"), a lower loss should translate into better lossless compression. This raises the question:
Do LLMs exhibit clean power-law scaling in compression performance?
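To make the prediction-compression link quantitative: an arithmetic coder driven by a model's predictive distribution $q$ spends about $-\log_2 q(x)$ bits to encode a sequence $x$ (within a couple of bits of overhead), so the expected code length equals the cross-entropy between the data distribution $p$ and the model:

$$
\mathbb{E}_{x \sim p}\left[-\log_2 q(x)\right] = H(p, q) \;\geq\; H(p)
$$

Lower cross-entropy loss therefore means a shorter expected code length, i.e. better lossless compression.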
For the experiment, I considered the Pythia models, which are available at multiple training checkpoints and parameter sizes. All Pythia models are trained on the same dataset (The Pile), ensuring a uniform comparison. Following Delétang et al. (2023), the LLM's predicted probability for each token drives an arithmetic encoder (a near-optimal entropy coder). I took the first 2048 chunks of the enwik8 dataset (each chunk is 2048 bytes) and fed them to the LLM to obtain next-token probabilities.
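As a minimal sketch of this pipeline (not the exact evaluation code behind the tables below): since the ideal arithmetic-coding cost of a chunk is its total negative log-probability under the model, the compression ratio can be estimated directly from the token log-probs. The Hugging Face model and revision names follow the Pythia releases; the latin-1 byte-to-character mapping and the local `enwik8` path are my assumptions.

```python
# Minimal sketch: estimate the compression ratio of one byte chunk from a
# Pythia checkpoint by summing the ideal arithmetic-coding cost -log2 q(token).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"   # Hugging Face Pythia release
REVISION = "step143000"            # training checkpoint (final step)

tok = AutoTokenizer.from_pretrained(MODEL, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL, revision=REVISION).eval()

def compression_ratio(chunk: bytes) -> float:
    """Ideal arithmetic-coding cost of `chunk`, as compressed size / raw size."""
    text = chunk.decode("latin-1")                  # lossless byte -> char mapping (assumed)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                  # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    # log-probability the model assigns to each actual next token
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    bits = -token_lp.sum().item() / math.log(2)
    return bits / (8 * len(chunk))                  # arithmetic coding adds only ~2 bits overhead

with open("enwik8", "rb") as f:                     # path to a local enwik8 copy (assumed)
    print(compression_ratio(f.read(2048)))
```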
The other experimental parameters were kept identical across all models and checkpoints.
**Text (enwik8)**: compression ratio (compressed size / raw size, lower is better) by training checkpoint (steps).

Model | 1k | 8k | 32k | 128k | 143k |
---|---|---|---|---|---|
pythia-70M | 0.223 | 0.176 | 0.170 | 0.173 | 0.175 |
pythia-160M | 0.218 | 0.159 | 0.149 | 0.149 | 0.150 |
pythia-410M | 0.223 | 0.148 | 0.136 | 0.129 | 0.128 |
pythia-1B | 0.207 | 0.140 | 0.128 | 0.120 | 0.120 |
pythia-1.4B | 0.207 | 0.137 | 0.124 | 0.115 | 0.115 |
We fit a Kaplan-style power law of the form $r(X) = (X_c / X)^{\alpha_X}$ to the compression ratios.
For **text**, the best-fit coefficients are:
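(A sketch of how such a fit can be carried out, using only the 143k-step column of the text table above and the nominal Pythia parameter counts; the actual fit behind the reported coefficients may use exact parameter counts, more checkpoints, or a different functional form.)

```python
# Sketch: fit r(N) = (Nc / N)**alpha in log space to the 143k-step text ratios.
import numpy as np

N = np.array([70e6, 160e6, 410e6, 1.0e9, 1.4e9])   # nominal Pythia parameter counts (assumed)
r = np.array([0.175, 0.150, 0.128, 0.120, 0.115])   # 143k-step column of the text table

# log r = alpha*log Nc - alpha*log N, i.e. a straight line in log-log space
slope, intercept = np.polyfit(np.log(N), np.log(r), 1)
alpha = -slope
Nc = np.exp(intercept / alpha)
print(f"alpha = {alpha:.3f}, Nc = {Nc:.3g}")
```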
The "Language Modeling Is Compression" paper (Delétang et al., 2023) shows that an LLM trained only on text can compress arbitrary byte streams. I applied the same pipeline to ImageNet-1k patches and LibriSpeech audio.
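For completeness, here is one simple way to turn image patches and audio clips into 2048-byte chunks that the `compression_ratio` sketch above can consume. The patch size, bit depth, file names, and byte mapping are illustrative assumptions, not necessarily what was used for the tables below.

```python
# Sketch: converting non-text data into 2048-byte chunks for the same pipeline.
import numpy as np
from PIL import Image
import soundfile as sf

# Image: a grayscale patch flattened to raw bytes (32 x 64 = 2048 bytes, assumed)
img = np.array(Image.open("example_image.jpg").convert("L"), dtype=np.uint8)  # placeholder path
image_chunk = img[:32, :64].tobytes()

# Audio: mono 16-bit PCM samples truncated to 2048 bytes (= 1024 samples, assumed)
audio, sr = sf.read("example_clip.flac", dtype="int16")                       # placeholder path
audio_chunk = audio[:1024].tobytes()

# compression_ratio() is the function from the earlier sketch
print(compression_ratio(image_chunk), compression_ratio(audio_chunk))
```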
**ImageNet-1k (image patches)**: compression ratio (compressed size / raw size) by training checkpoint (steps).

Model | 1k | 8k | 32k | 128k | 143k |
---|---|---|---|---|---|
pythia-70M | 0.601 | 0.499 | 0.492 | 0.505 | 0.513 |
pythia-160M | 0.615 | 0.483 | 0.471 | 0.482 | 0.492 |
pythia-410M | 0.668 | 0.506 | 0.461 | 0.444 | 0.447 |
pythia-1B | 0.601 | 0.470 | 0.456 | 0.436 | 0.440 |
pythia-1.4B | 0.643 | 0.482 | 0.470 | 0.434 | 0.436 |
**LibriSpeech (audio)**: compression ratio (compressed size / raw size) by training checkpoint (steps).

Model | 1k | 8k | 32k | 128k | 143k |
---|---|---|---|---|---|
pythia-70M | 0.695 | 0.460 | 0.439 | 0.475 | 0.466 |
pythia-160M | 0.678 | 0.440 | 0.430 | 0.433 | 0.456 |
pythia-410M | 0.770 | 0.505 | 0.404 | 0.383 | 0.391 |
pythia-1B | 0.677 | 0.424 | 0.444 | 0.376 | 0.384 |
pythia-1.4B | 0.752 | 0.469 | 0.443 | 0.378 | 0.385 |
The plot below shows the scaling-law fits across all three modalities. While the scaling-law trend is present in the non-textual modalities, the compression is not as strong as it is for text.
It has been argued that LLMs trained only on text can form world models for proto-AGI. These compression results reinforce that claim: despite text-only pretraining, LLMs compress images and speech following clear power laws.
We hypothesize two primary mechanisms responsible for this:
An interesting direction for future work would be to quantify each mechanism's contribution and trace how each emerges during pretraining.