Introduction

The scaling laws for neural language models showed that cross-entropy loss follows a power law in three factors:

Dataset size
Model parameters
Training compute (steps or epochs)

The lower the cross-entropy loss, the better the model’s next-token prediction. Since prediction and compression are deeply linked (The Intricate Link Between Compression and Prediction), a lower loss should translate into better lossless compression. This raises the question:

Do LLMs exhibit clean power-law scaling in compression performance?

Experiment

For the experiment, I considered the Pythia models, which are available at multiple checkpoints and parameter sizes. All Pythia models are trained on the same dataset (The Pile), ensuring a uniform comparison. Following Delétang et al. (2023), The LLM’s predicted probabilities for each token are used to update the symbol table of an arithmetic encoder (a near-optimal entropy coder). I took the first 2048 chunks of the enwik8 dataset each chunk is 2048 bytes and fed them to the LLM to predict the next-token probabilities.

The other experimental parameters are:

Models: EleutherAI Pythia family (8-bit quantized): 70 M, 160 M, 410 M, 1 B, 1.4 B parameters
Checkpoints: 1 k, 8 k, 32 k, 128 k, 143 k training steps
Chunking:
- CHUNK_SIZE: 2048 bytes
- NUM_CHUNKS: 2048 per dataset
Compression pipeline:
1. Raw bytes → ASCII
2. Tokenize and pass the ASCII symbols to LLM
3. Arithmetic coding on LLM probabilities
4. CR = compressed_bits / original_bits
Datasets & Preprocessing:
- Text: enwik8 (Wikipedia XML)
- Image: ImageNet-1k validation → 32 × 64 grayscale patches (uint8 bytes)
- Speech: LibriSpeech “clean” train.100 → 16 kHz PCM → int16 bytes

Results: Text Compression

Model	1 k	8 k	32 k	128 k	143 k
pythia-70M	0.223	0.176	0.17	0.173	0.175
pythia-160M	0.218	0.159	0.149	0.149	0.150
pythia-410M	0.223	0.148	0.136	0.129	0.128
pythia-1B	0.207	0.140	0.128	0.120	0.120
pythia-1.4B	0.207	0.137	0.124	0.115	0.115

Text scaling-law fit:

We fit a Kaplan-style power law
$C R (P, S) = a + b P^{- α} + c S^{- β}$
For **text**, the best-fit coefficients are:
${C R}_{text} = 0.10 + 60 P^{- 0.38} + 10 S^{- 0.75}$

Compression on Non-Text Modalities

The “Language Modeling Is Compression” paper shows that an LLM trained only on text can compress arbitrary byte streams Language Modeling Is Compression. I applied the same pipeline to ImageNet-1k patches and LibriSpeech audio.

Image Results

Model	1 k	8 k	32 k	128 k	143 k
pythia-70M	0.601	0.499	0.492	0.505	0.513
pythia-160M	0.615	0.483	0.4

...

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments