Scaling Laws for LLM Based Data Compression

by rokosbasilisk
5th Aug 2025
4 min read

Introduction

The scaling laws for neural language models (Kaplan et al., 2020) showed that cross-entropy loss follows a power law in three factors:

  1. Dataset size
  2. Model parameters
  3. Training compute (steps or epochs)

The lower the cross-entropy loss, the better the model’s next-token prediction. Since prediction and compression are deeply linked (The Intricate Link Between Compression and Prediction), a lower loss should translate into better lossless compression. This raises the question:

Do LLMs exhibit clean power-law scaling in compression performance?
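
To make the prediction-compression link precise: an arithmetic coder driven by the model's predictive distribution $p_\theta$ spends roughly $-\log_2 p_\theta(x_t \mid x_{<t})$ bits on token $x_t$, so the total code length is essentially the summed cross-entropy in bits:

$$\text{bits}(x_{1:T}) \approx \sum_{t=1}^{T} -\log_2 p_\theta(x_t \mid x_{<t})$$

Dividing by the raw size in bits (8 per byte) gives the compression ratio (CR) used below.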

Experiment

For the experiment, I considered the Pythia models, which are available at multiple checkpoints and parameter sizes. All Pythia models are trained on the same dataset (The Pile), which ensures a uniform comparison. Following Delétang et al. (2023), the LLM’s predicted probabilities for each token are used to update the symbol table of an arithmetic encoder (a near-optimal entropy coder). I took the first 2048 chunks of the enwik8 dataset (each chunk is 2048 bytes) and fed them to the LLM to predict next-token probabilities.

The other experimental parameters are:

  • Models: EleutherAI Pythia family (8-bit quantized): 70M, 160M, 410M, 1B, 1.4B parameters
  • Checkpoints: 1k, 8k, 32k, 128k, 143k training steps
  • Chunking:
    • CHUNK_SIZE: 2048 bytes
    • NUM_CHUNKS: 2048 per dataset
  • Compression pipeline (sketched after this list):
    1. Raw bytes → ASCII
    2. Tokenize the ASCII text and pass it to the LLM
    3. Arithmetic coding on the LLM’s predicted probabilities
    4. CR = compressed_bits / original_bits
  • Datasets & Preprocessing:
    • Text: enwik8 (Wikipedia XML)
    • Image: ImageNet-1k validation → 32 × 64 grayscale patches (uint8 bytes)
    • Speech: LibriSpeech “clean” train.100 → 16 kHz PCM → int16 bytes
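
A minimal sketch of this pipeline is below. It assumes the Hugging Face `EleutherAI/pythia-70m` checkpoint and a local `enwik8` file, and it skips the 8-bit quantization used in the experiments. Rather than running a full arithmetic coder, it sums $-\log_2 p$ over tokens: arithmetic coding gets within a few bits of this ideal code length, so the ratio is a close estimate of the CR defined above.

```python
# Sketch of the compression pipeline (assumed names: EleutherAI/pythia-70m, enwik8).
# Arithmetic coding achieves within a few bits of the model's negative
# log-likelihood, so sum(-log2 p) / (8 * n_bytes) estimates the CR.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"  # Pythia also publishes checkpoints as revisions, e.g. revision="step143000"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def estimated_cr(chunk: bytes) -> float:
    """Ideal code length in bits under the model, divided by the raw size in bits."""
    text = chunk.decode("latin-1")                 # raw bytes -> text symbols
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)          # predictions for tokens 2..T
    token_logp = logp.gather(1, ids[0, 1:, None]).squeeze(1)  # log p(x_t | x_<t)
    bits = -token_logp.sum().item() / math.log(2)
    return bits / (8 * len(chunk))

with open("enwik8", "rb") as f:
    print(estimated_cr(f.read(2048)))              # CR estimate for the first chunk
```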

Results: Text Compression

| Model | 1k steps | 8k steps | 32k steps | 128k steps | 143k steps |
|---|---|---|---|---|---|
| pythia-70M | 0.223 | 0.176 | 0.170 | 0.173 | 0.175 |
| pythia-160M | 0.218 | 0.159 | 0.149 | 0.149 | 0.150 |
| pythia-410M | 0.223 | 0.148 | 0.136 | 0.129 | 0.128 |
| pythia-1B | 0.207 | 0.140 | 0.128 | 0.120 | 0.120 |
| pythia-1.4B | 0.207 | 0.137 | 0.124 | 0.115 | 0.115 |

Text scaling-law fit:

We fit a Kaplan-style power law

$$\mathrm{CR}(P, S) = a + b\,P^{-\alpha} + c\,S^{-\beta},$$

where $P$ is the model’s parameter count and $S$ is the number of training steps. For **text**, the best-fit coefficients are:

$$\mathrm{CR}_{\text{text}} = 0.10 + 60\,P^{-0.38} + 10\,S^{-0.75}$$
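
The fit itself can be reproduced with an off-the-shelf least-squares routine. The sketch below uses `scipy.optimize.curve_fit` on the text table above, assuming $P$ is the raw parameter count and $S$ the raw step count; the exact coefficients depend on the initial guess and any weighting, so they will only roughly match the numbers quoted here.

```python
# Sketch: fit CR(P, S) = a + b * P**(-alpha) + c * S**(-beta) to the text table.
import numpy as np
from scipy.optimize import curve_fit

params = np.array([70e6, 160e6, 410e6, 1.0e9, 1.4e9])   # model sizes (rows)
steps  = np.array([1e3, 8e3, 32e3, 128e3, 143e3])        # checkpoints (columns)
cr = np.array([
    [0.223, 0.176, 0.170, 0.173, 0.175],   # pythia-70M
    [0.218, 0.159, 0.149, 0.149, 0.150],   # pythia-160M
    [0.223, 0.148, 0.136, 0.129, 0.128],   # pythia-410M
    [0.207, 0.140, 0.128, 0.120, 0.120],   # pythia-1B
    [0.207, 0.137, 0.124, 0.115, 0.115],   # pythia-1.4B
])

P, S = np.meshgrid(params, steps, indexing="ij")

def power_law(X, a, b, alpha, c, beta):
    P, S = X
    return a + b * P ** (-alpha) + c * S ** (-beta)

popt, _ = curve_fit(power_law, (P.ravel(), S.ravel()), cr.ravel(),
                    p0=[0.1, 50.0, 0.4, 10.0, 0.7], maxfev=20000)
print(dict(zip(["a", "b", "alpha", "c", "beta"], popt)))
```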

Compression on Non-Text Modalities

The “Language Modeling Is Compression” paper (Delétang et al., 2023) shows that an LLM trained only on text can compress arbitrary byte streams. I applied the same pipeline to ImageNet-1k patches and LibriSpeech audio.

Image Results

| Model | 1k steps | 8k steps | 32k steps | 128k steps | 143k steps |
|---|---|---|---|---|---|
| pythia-70M | 0.601 | 0.499 | 0.492 | 0.505 | 0.513 |
| pythia-160M | 0.615 | 0.483 | 0.471 | 0.482 | 0.492 |
| pythia-410M | 0.668 | 0.506 | 0.461 | 0.444 | 0.447 |
| pythia-1B | 0.601 | 0.470 | 0.456 | 0.436 | 0.440 |
| pythia-1.4B | 0.643 | 0.482 | 0.470 | 0.434 | 0.436 |

Image scaling-law fit:

$$\mathrm{CR}_{\text{image}} = 0.38 + 2\,P^{-0.15} + 60\,S^{-0.86}$$

Speech Results

| Model | 1k steps | 8k steps | 32k steps | 128k steps | 143k steps |
|---|---|---|---|---|---|
| pythia-70M | 0.695 | 0.460 | 0.439 | 0.475 | 0.466 |
| pythia-160M | 0.678 | 0.440 | 0.430 | 0.433 | 0.456 |
| pythia-410M | 0.770 | 0.505 | 0.404 | 0.383 | 0.391 |
| pythia-1B | 0.677 | 0.424 | 0.444 | 0.376 | 0.384 |
| pythia-1.4B | 0.752 | 0.469 | 0.443 | 0.378 | 0.385 |

Speech scaling-law fit:

$$\mathrm{CR}_{\text{speech}} = 0.39 + 8\times 10^{3}\,P^{-0.68} + 1\times 10^{2}\,S^{-0.85}$$

Combined Scaling Curves

The plot below shows the scaling curves across all three modalities. While the scaling-law trend is present in the non-textual modalities, their compression is not as strong as on text.

*[Figure: Scaling curves for text, image, and speech]*
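
A quick way to see this from the fits alone: evaluate each law at the largest model and final checkpoint and compare the constant terms. Text bottoms out near 0.10, while image and speech level off around 0.38-0.39 (this is a reading of the fitted formulas, not an additional experiment):

```python
# Evaluate the three fitted laws at P = 1.4e9 parameters, S = 143e3 steps.
P, S = 1.4e9, 143e3
cr_text   = 0.10 + 60  * P**-0.38 + 10  * S**-0.75
cr_image  = 0.38 + 2   * P**-0.15 + 60  * S**-0.86
cr_speech = 0.39 + 8e3 * P**-0.68 + 1e2 * S**-0.85
print(f"text={cr_text:.3f}  image={cr_image:.3f}  speech={cr_speech:.3f}")
```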


Conclusion

It has been argued that LLMs trained only on text can form world models relevant to proto-AGI (Gao, 2021). These compression results reinforce that claim: despite text-only pretraining, LLMs compress images and speech following clear power laws.

We hypothesize two primary mechanisms responsible for this:

  1. In-Context Learning
    Within a given token window, self-attention adapts on-the-fly to file-specific patterns (repeating pixels, periodic audio cycles), shaving off most of the entropy.
  2. Universal Sequence Prior
    Pretraining internalizes bursty repetitions, Zipf’s law, heavy tails, and long-range autocorrelations: statistics shared by all byte streams. Even without context, the compression ratio should therefore be far below uniform-noise rates (see the order-0 baseline sketch after this list). See Benford’s Law, Zipf’s Law and the Pareto Distribution (Tao, 2009) for more on naturally occurring data distributions and their universality.
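
As a rough check on the no-context point, even an order-0 (context-free) byte model already sits well below the 8 bits/byte of a uniform coder. The sketch below computes that baseline for a slice of enwik8 (the file path and slice size are assumptions chosen to mirror the setup above):

```python
# Order-0 entropy baseline: bits/byte from byte frequencies alone, no context.
import math
from collections import Counter

with open("enwik8", "rb") as f:
    data = f.read(2048 * 2048)   # same total size as the 2048 x 2048-byte chunks

counts = Counter(data)
n = len(data)
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
print(f"order-0 entropy: {entropy:.2f} bits/byte  ->  CR ≈ {entropy / 8:.3f}")
```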

An interesting direction for future work would be to quantify each mechanism’s contribution and trace how both emerge during pretraining.

References

  • Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. “Scaling Laws for Neural Language Models.” arXiv 2001.08361 (2020). https://arxiv.org/abs/2001.08361
  • The Intricate Link Between Compression and Prediction. Mindful Modeler Substack (2024). https://mindfulmodeler.substack.com/p/the-intricate-link-between-compression
  • Delétang, G.; Ruoss, A.; Duquenne, P.-A.; Catt, E.; Genewein, T.; Mattern, C.; Grau-Moya, J.; Wenliang, L.K.; Aitchison, M.; Orseau, L.; Hutter, M.; Veness, J. “Language Modeling Is Compression.” arXiv 2309.10668 (2023). https://arxiv.org/abs/2309.10668
  • Gao, Leo. “Thoughts on the Alignment Implications of Scaling Language Models.” BMK.sh Blog (2021). https://bmk.sh/2021/06/02/Thoughts-on-the-Alignment-Implications-of-Scaling-Language-Models/
  • Tao, T. “Benford’s Law, Zipf’s Law and the Pareto Distribution.” Terence Tao’s Blog (2009). https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/