Compression Presupposes Its Object: Why Intelligence Is Not Fundamentally the Search for Short Descriptions

Anthony_Ayeke

Rejected for the following reason(s):

This is an automated rejection. No LLM generated, assisted/co-written, or edited work.

I used AI to frame and polish the writing but the ideas are mine.

You cannot compress your way to the choice of what to compress.

Abstract

A recurring thesis holds that intelligence is fundamentally compression: to be intelligent is to find short descriptions, and the ideal intelligent agent is the one that, in the spirit of Kolmogorov complexity and the Minimum Description Length principle, identifies the shortest program generating its observations. The thesis is formidable. It is backed by exact theorems (compression and prediction are inter-derivable), by a rigorous published definition of machine intelligence built entirely on a compression prior, and by the empirical fact that today's most capable AI systems are trained on an objective that is literally expected code length.

I argue the thesis is false on every non-trivial reading, and that all of its defenses fail for a single reason. The word "compression" is made to straddle two relations with opposite type-signatures: compression of an extant object (a map from a pre-given object to a shorter re-expression of that same object) and compact specification of a generator (a short program that produces an object which need not have pre-existed). The first relation is meaningful but is not necessary for intelligence. The second is true of intelligence but vacuously — it is equally true of a crystal, a logic gate, and the decimal expansion of π. No reading of "intelligence is compression" is at once non-vacuous and true. Compression presupposes its object; intelligence frequently constitutes it. That difference is the whole of the matter.

1. The claim, stated at full strength

It would be cheap to disprove a weak version, so let me state the strongest one and grant it everything it is owed.

Prediction and compression are the same operation. Given a probability model q over the next symbol, arithmetic coding encodes that symbol in essentially −log₂ q bits, meeting Shannon's bound (Shannon, 1948). A compressor therefore is a predictor: to compress well you need a good model of what comes next, and any good predictive model immediately yields a near-optimal code. They are inter-derivable, not merely analogous.
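
As a minimal sketch of the point above, the snippet below computes the ideal code length, the sum of −log₂ q over the symbols, for one message under two models. The message and both models are hypothetical choices made for illustration, and the calculation is the bound an entropy coder approaches, not an arithmetic coder itself.

```python
import math
from collections import Counter


def ideal_code_length_bits(message: str, model: dict[str, float]) -> float:
    """Sum of -log2 q(symbol): the length an ideal entropy coder approaches."""
    return sum(-math.log2(model[symbol]) for symbol in message)


message = "abracadabra"  # hypothetical example message

# Hypothetical model 1: uniform over the distinct symbols of the message.
alphabet = sorted(set(message))
uniform = {s: 1 / len(alphabet) for s in alphabet}

# Hypothetical model 2: the empirical symbol frequencies of the message.
counts = Counter(message)
fitted = {s: c / len(message) for s, c in counts.items()}

uniform_bits = ideal_code_length_bits(message, uniform)
fitted_bits = ideal_code_length_bits(message, fitted)
print(f"uniform model: {uniform_bits:.2f} bits")
print(f"fitted model:  {fitted_bits:.2f} bits")
```

The better predictor gives the shorter code. Note that the fitted model is read off the message itself and its own description cost is ignored here; charging for that cost is what the MDL criterion discussed below adds.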

The optimal agent is, formally, a compressor. Solomonoff induction defines the ideal predictor as a Bayesian mixture over all programs weighted by 2^(−length) — Occam's razor rendered as mathematics — and this mixture provably dominates every computable predictor (Solomonoff, 1964). Hutter's AIXI adds sequential decision-making to that prior to define a single optimal reinforcement learner (Hutter, 2005). Legg and Hutter then define universal intelligence as expected reward across all computable environments, weighting simpler environments more heavily by description length (Legg & Hutter, 2007). There already exists, in print, a rigorous object called "universal intelligence" whose most intelligent possible agent is precisely the one running compression-based induction.

Practical learning is compression, and that is why it generalizes. MDL says: choose the model minimizing L(model) + L(data | model) (Rissanen, 1978); overfitting is penalized automatically because a baroque model has a long description. And the Occam's-razor theorem of Blumer, Ehrenfeucht, Haussler, and Warmuth (1987) shows that any algorithm compressing a labeled sample is, provably, a learning algorithm with generalization guarantees. Compression is the thing that forbids memorization and forces capture of the transferable regularity.
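
The two-part criterion can be sketched on a toy case. Assume, purely for illustration, that the data are a bit string, the model is a Bernoulli parameter, and L(model) is the number of bits used to name a cell on a parameter grid. None of these choices comes from Rissanen's paper; they are one simple way to make L(model) + L(data | model) computable.

```python
import math


def data_bits(data: str, p_one: float) -> float:
    """L(data | model): ideal code length of a bit string under Bernoulli(p_one)."""
    ones = data.count("1")
    zeros = len(data) - ones
    return -(ones * math.log2(p_one) + zeros * math.log2(1 - p_one))


def two_part_bits(data: str, precision_bits: int) -> tuple[float, float]:
    """Return (total bits, chosen parameter) for a grid of 2**precision_bits cells."""
    cells = 2 ** precision_bits
    # Cell midpoints keep p strictly inside (0, 1), so every log is finite.
    grid = [(i + 0.5) / cells for i in range(cells)]
    best_p = min(grid, key=lambda p: data_bits(data, p))
    # L(model) is charged as the bits needed to name the chosen cell.
    return precision_bits + data_bits(data, best_p), best_p


data = "1" * 90 + "0" * 10  # hypothetical sample: 90 ones, 10 zeros

scores = {k: two_part_bits(data, k) for k in range(0, 11)}
best_k = min(scores, key=lambda k: scores[k][0])
for k, (bits, p) in scores.items():
    print(f"precision={k:2d} bits  p={p:.4f}  total={bits:.2f} bits")
print("MDL choice of precision:", best_k)
```

Too coarse a parameter wastes bits on the data; too fine a parameter wastes bits on the model. The minimum falls in between, which is the automatic penalty on baroque models that the paragraph describes.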

The empirical case arrived on its own. Large language models are trained on next-token prediction, whose cross-entropy loss is expected code length; the training objective is "compress this corpus." Scaled, a single such objective yields arithmetic, translation, code, and reasoning nobody explicitly supplied. A language model used as a general compressor can out-compress PNG on images and FLAC on audio — modalities it was never built for (Delétang et al., 2023) — and the Hutter Prize encodes the same wager as a cash bounty for compressing Wikipedia near its entropy, on the explicit thesis that doing so is AI-complete.

So the position to beat is not a slogan. It is: intelligence is the capacity to find and exploit structure so as to predict and act; finding structure is mathematically identical to compression; the optimal such agent is formally a compressor; and our best learning theory, our best AI, and our best science all corroborate it.

I will grant, throughout, the premise this rests on: that every phenomenon worth modeling has a computable generator. Denying it (with Penrose, say) would be a different essay. I concede it, and argue the thesis fails anyway.

Four readings of "intelligence is compression" need separating, because a careful advocate retreats down this ladder under fire and a disproof must follow them all the way down:

Identity — intelligence is compression.
Sufficiency — enough compression is enough to be intelligent.
Necessity — anything intelligent must be compressing.
Vacuity — any finite representation counts as "compression," so the thesis is trivially true.

I take them in turn. Each falls to the same distinction, which I state first.

2. The one distinction

Everything below rests on separating two things the word "compression" is used to mean.

Compression of an extant object. A map from a pre-given object to a shorter re-expression of that same object. Its type-signature is Object → ShorterEncoding(Object). It requires the object to exist prior to the act: you cannot compress what is not yet there. Lossless compression is an invertible re-encoding (no information added); lossy compression strictly discards (information removed). In neither case does the output contain structure not already entailed by the input. This relation is contentful, directional, and, I will argue, not necessary for intelligence.

Compact specification of a generator. A short program that produces an object. Its type-signature is () → Object. The object need not have pre-existed; it can be brought into being by the generator. A 12-line program prints the first million digits of π, but those digits were never an input dataset that got summarized — they have no existence prior to the generator. This relation is true of intelligence — intelligent products do have short generators — but it is equally true of a crystal lattice, a NAND gate, and the laws of physics. It is, with respect to intelligence, vacuous.
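
The two type-signatures can be written down directly. The sketch below is illustrative only: the alias names Compressor and Producer are mine, zlib stands in for any lossless codec, and the digit generator follows the well-known unbounded spigot scheme for π (usually credited to Gibbons), which is not a source the essay cites.

```python
import zlib
from itertools import islice
from typing import Callable, Iterator

# Relation 1: Object -> ShorterEncoding(Object). The input must already exist.
Compressor = Callable[[bytes], bytes]
# Relation 2: () -> Object. No input object is consumed.
Producer = Callable[[], Iterator[int]]


def compress_extant(obj: bytes) -> bytes:
    """Losslessly re-encode an object that is handed in from outside."""
    return zlib.compress(obj, 9)


def pi_digits() -> Iterator[int]:
    """Yield the decimal digits of pi one at a time, taking no input data."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)


extant = b"abc" * 1000  # a pre-given object
encoded = compress_extant(extant)
assert zlib.decompress(encoded) == extant  # invertible: nothing added, nothing lost

first_digits = list(islice(pi_digits(), 10))
print(len(extant), "->", len(encoded), "bytes")
print(first_digits)
```

The first function cannot be called without an argument; the second takes none. That difference in arity is the distinction the rest of the essay leans on.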

The thesis lives by sliding between these. Read as the first relation, it is false (Sections 3–5). Read as the second, it is true but says nothing about intelligence specifically (Section 6). There is no third reading that is both true and non-empty. Hold this distinction in view; every rescue I examine is an attempt to blur it.

3. Identity fails

Two independent strikes.

The information sign-flip. Compression is conservative: its output is a function of its input alone, and contains no structure the input did not entail. Intelligence's signature acts are ampliative: the solution contains structure the problem statement did not supply.

Let me state this extensionally, to avoid the trap that sinks careless versions of the argument. (Leibniz's law fails in intensional contexts — "necessarily 8 > 7" is true while "necessarily the number of planets > 7" is false, though 8 is the number of planets — so a property like "conceived as novel" cannot do the work; the difference must be in what the two things are or do, not in how they are described.) Here is the extensional statement. For a compression, the output o is a function of the input i alone: fix the codec and o = f(i). For an ampliative act, o is not a function of i alone; it is o = f(i, g), where the agent's generator g contributes structure not recoverable from i. Solomon, facing two women and one child, has symmetric testimony — compress it however you like, it is information-balanced, and the answer is not a function of it. His "cut the child in two" is a new datum he injects, a counterfactual probe that did not exist in the problem and along which the hidden variable becomes legible. Cohen's forcing is not a point in the space of models the continuum hypothesis induces, nor a compression of that space; it is a previously nonexistent method that enlarges what is expressible (Cohen, 1963). The contributed bit comes from g, and g is not in i. Compression's sign on information is non-positive; the ampliative act's is positive. They run in opposite directions.

The optimal agent's own architecture refutes the identity. AIXI, the formal optimum the thesis points to, is two components, not one. There is the Solomonoff prior, which is compression and builds the world-model; and there is an expectimax planner, which selects actions to maximize expected reward over a search tree (Hutter, 2005). The planner is not compression. It is utility-maximization over possible futures, and it is where goals, action, and the generation of what-is-not-given live. The two hard cases above, Solomon and Cohen, exhibit the planner half. The formalism that is supposed to ground "intelligence is compression" contains, by its own construction, a second irreducible part that is not compression. Identity is dead.
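
The two-component structure can be made concrete with a toy stand-in. This is not AIXI: the world model below is a fixed hypothetical table rather than a Solomonoff mixture, and the actions, percepts, and rewards are invented for the sketch. What it does show is that the planner is a separate function that consumes a predictive model without being one.

```python
from typing import Callable, Optional

# Hypothetical toy setting: two actions, two percepts, reward carried by the percept.
ACTIONS = ("left", "right")
PERCEPT_REWARD = {"nothing": 0.0, "food": 1.0}

History = tuple[str, ...]
Model = Callable[[History, str], dict[str, float]]


def world_model(history: History, action: str) -> dict[str, float]:
    """Predictive half: P(percept | history, action). Stand-in for a learned mixture."""
    p_food = 0.8 if action == "right" else 0.3
    return {"food": p_food, "nothing": 1.0 - p_food}


def expectimax(model: Model, history: History, depth: int) -> tuple[float, Optional[str]]:
    """Planning half: maximize, over actions, the expected reward plus future value."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in ACTIONS:
        value = 0.0
        for percept, prob in model(history, action).items():
            future, _ = expectimax(model, history + (action, percept), depth - 1)
            value += prob * (PERCEPT_REWARD[percept] + future)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


value, action = expectimax(world_model, (), depth=3)
print(action, round(value, 3))
```

Swapping in a better or worse predictor changes what `expectimax` is told about the world, but the maximization over the search tree is untouched, which is the sense in which the planner is a second part.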

4. Sufficiency fails: the crystal

Take the definition at face value — intelligence is finding the shortest correct computation of a phenomenon — and consider a crystal. It minimizes free energy and settles into a lattice that is about as short to describe as a physical structure gets. Under the definition, it is maximally intelligent. It is not.

The reason it is not is exactly the reason that matters. The crystal performs one minimization regardless of environment. Change the world and it does the single thing it does; it has no hypothesis space, runs no search, and exhibits no counterfactual competence — it cannot find the short description of a different regularity if the regularity changes. A pure compression-act, with no search across hypotheses, is unintelligent. So compression — even compression that demonstrably produces a maximally compact result — is not sufficient. Sufficiency is dead.

Note what this forces. To exclude the crystal one reaches for counterfactual competence — search across a space of tasks, adaptability to whatever the data turn out to be. That is not a compression notion. It is about flexible generation. Hold onto this; it returns to convict the thesis from the other side.

5. Necessity fails

This is the thesis's last redoubt and it takes three blows, each independent.

The compressed Blockhead recants the witness. Block's "Blockhead" is a giant lookup table: it has memorized a correct response to every situation, exhibits flexible output, and is intuitively not intelligent (Block, 1981). That intuition is the natural witness for necessity — "no compression (it's pure memorization) → not intelligent." But now compress it. Imagine an agent running a perfectly compressed correct program in place of the table. Does it become intelligent? It does not: both the table and the compressed program merely execute a fixed correct policy; they differ only in memory footprint. Compressing Blockhead does not cross any threshold. Therefore whatever makes Blockhead unintelligent was never about compression — for if it were, compression would help, and it does not. The single best witness for necessity, examined closely, testifies against it. What Blockhead lacks is not compression but search: its answers were put there, not found.
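
A small sketch of the comparison, under loud assumptions: the "situations" are 16-bit integers and the "correct response" is bit parity, both invented for illustration. The table is filled by calling the short program only as a convenience; in Block's thought experiment it is written in by hand.

```python
import sys


def compressed_policy(situation: int) -> str:
    """A short program: answer with the parity of the number of set bits."""
    return "odd" if bin(situation).count("1") % 2 else "even"


# Blockhead: the same policy stored as an exhaustive table over every situation.
lookup_table = {s: compressed_policy(s) for s in range(2 ** 16)}


def blockhead_policy(situation: int) -> str:
    """Answer by pure retrieval; nothing is computed at response time."""
    return lookup_table[situation]


same_behavior = all(
    blockhead_policy(s) == compressed_policy(s) for s in range(2 ** 16)
)
print("identical input-output behavior:", same_behavior)
print("table container size (bytes):", sys.getsizeof(lookup_table))
```

The two policies are extensionally identical and differ only in footprint, which is the sense in which compressing Blockhead crosses no threshold.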

Pruning is not compression. The most sophisticated defense of necessity is that a compact model is what makes a search tractable — Edison's filament search is feasible only because he has compressed materials-space into "high-melting-point conductors," pruning millions of candidates. But pruning requires only that the agent possess usable relevant structure; it does not require that the structure be minimal, nor that it was arrived at by compression. A bloated, redundant, gloriously uncompressed body of knowledge about materials prunes the search just as well. Mendeleev's regularities prune chemistry whether stored as eight tidy rules or as a giant unfactored table of properties. Compression and pruning are doubly dissociated: one can prune with uncompressed knowledge, and one can compress a goal-irrelevant regularity while pruning nothing useful. The work in a tractable search is done by having relevant structure, and "relevant structure" is not a synonym for "short description."

The amnesiac. Consider an agent fed a stream of environment input, computing at each step on a finite live window of perception, writing its actions back into the world, and forgetting — reusing no stored compression of its past experience. Over time, by goal-directed online reasoning against a freshly-read world, it produces a useful result. This is at least the semblance of intelligence, and it compresses nothing: there is no retained representation of experience to be a compression of. The natural objection — "its update rule is a reused, hence compressed, structure" — fails on the central distinction. The update rule is a function present before any experience exists; it is a compression of nothing, like the program for π, not a summary of data the agent saw. (And should the agent rewrite its own reasoning each tick, the persisting rewrite-rule is likewise a stipulated function, not a précis of anything.) To call such a function "compression" is to retreat to the vacuous reading, where a NAND gate compresses too — which is Section 6, not a defense of necessity.

Three independent demonstrations, one conclusion. Intelligence does not require compression of its experience — not as essence, not as substrate. Necessity is dead.

6. Vacuity: the only surviving reading, and why it says nothing

A defender can always retreat to: "fine — not the shortest, and not necessarily of experience, but any finite reusable representation is some compression, and intelligence uses finite representations." This reading is true. It is also empty, and three observations show why.

Enumeration suffices, so completeness — not compression — was operative. Suppose we say "to solve a problem is to compress the solution space it induces." If that space is enumerable, then enumerating it — the maximally un-compressed representation — solves the problem exactly as well as finding any deep structure. One can hold the longest possible description and be precisely as solved. So whatever "solving" tracks, it is completeness with respect to the induced space, not brevity. The compression was never the operative property; it was smuggled in alongside the property that mattered.

Generation and compression have opposite type-signatures. The refined thesis says a solution is "the structure from which the solution space falls out." But falls out means is generated by — and a generator constitutes its space rather than summarizing a pre-given one. Before forcing, there was no "space of forcing extensions" sitting in front of Cohen awaiting a short summary; the method brought that space into being. You cannot compress what your act of solving creates. The solution space is the output of the structure the solver finds, not its input. Calling the generator a "compression of the space" inverts the direction of the relation — it is the π error in its final form: a compact specification of a generator masquerading as a compression of data.

The oracle fork. Every goal decomposes into two tasks: (A) find a specific path that a given discriminator accepts, or (B) construct the discriminator itself. Even granting an oracle that labels paths useful/useless, the intelligent act is frequently (B), building the oracle, and (B) does not consist in compressing any pre-given object, because the object (the discriminator, the standard of success) is what (B) produces. The thesis only ever covered (A), and Sections 3–5 show it fails even there.

So the surviving reading is true of the crystal's lattice, of π's program, of a logic gate's truth table, and of every law of physics. Made broad enough to be true, "intelligence is compression" becomes a statement about every finite dynamical system and tells us nothing about intelligence in particular. Vacuity is the price of truth here.

7. The hardest case, and the turn: even Gauss is constitutive

The strongest case for the thesis is the one that looks like pure compression. The schoolboy Gauss, told to sum the integers from 1 to 100, returns 5050 at once: pair 1 with 100, 2 with 99, and so on — fifty pairs of 101. The closed form n(n+1)/2 is unmistakably a shorter program than adding sequentially. Surely this is compression, and surely we call it intelligent because it is compressed.
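
The three routes to the same number can be set side by side. The function names are mine; the arithmetic is just the sequential sum, the pairing described above (k matched with n + 1 − k), and the closed form n(n+1)/2.

```python
def sum_sequential(n: int) -> int:
    """Add 1 + 2 + ... + n term by term, as the other students did."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def sum_by_pairing(n: int) -> int:
    """Pair k with n + 1 - k; each pair sums to n + 1."""
    pairs, leftover = divmod(n, 2)
    total = pairs * (n + 1)
    if leftover:
        # For odd n the middle term (n + 1) / 2 has no partner.
        total += (n + 1) // 2
    return total


def sum_closed_form(n: int) -> int:
    """The closed form n(n+1)/2."""
    return n * (n + 1) // 2


print(sum_sequential(100), sum_by_pairing(100), sum_closed_form(100))
```

For n = 100 the pairing gives fifty pairs of 101, so all three agree on 5050; only the amount of work differs.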

But look at what Gauss actually did. The other students were solving "add these numbers in order." Gauss solved a different problem — "pair the terms" — and the pairing symmetry is a structure that the task "add these up" does not contain. The product is a short program; genuinely so. The act was constitutive: it injected a reframing the problem did not supply. This is the same shape as Newton compressing Kepler — the product, F = G·m·M / r², is a short description of the planetary data, but the act invented the calculus and the concept of universal gravitation, apparatus nowhere in the tables.

Now the turn that closes the case. Consider when an act is purely compression, with no constitutive component: a gzip run, a crystal reaching its lattice. Those acts are unintelligent. And consider the acts we unhesitatingly crown as intelligent: each carries a constitutive component, and its compressed product is merely the shadow that component casts. So intelligence and compression do not even co-occur as acts. When the act is only compression, it is not intelligent; when it is intelligent, the act is constitutive and compression is the trace, not the deed.

This is why "intelligence sometimes compresses" is the wrong concession, as though compression were a sub-routine intelligence occasionally invokes and could, on other occasions, skip. It is not a sub-routine of the act at all. It is the form of the trace the act leaves behind. Even the example built to look like pure compression turns out, on inspection, to be the constitution of a new structure, and the brevity we admired was the fingerprint, not the hand.

8. The diagnosis: the fingerprint fallacy

Why did the thesis seem so nearly true? Because intelligent products are compressible the way a crime scene bears fingerprints — and the fingerprint is evidence that an agent was present, not the agent.

Any insight whatsoever, stated after the fact, has the form "such-and-such compact principle holds." That is precisely what makes "I can always redescribe the clever part as a short description it found" carry zero explanatory content: a clever solution imposes compressible order by definition, so re-describing it as "a short description" is not a discovery about the solution but a restatement of its cleverness. Compressibility is the trace intelligence leaves, not the mechanism by which it works. Reading the trace as the cause — taking the fact that intelligence's products are short for evidence that intelligence's process is compression — is the single error the whole thesis runs on.

There is a real overlap underneath the illusion, and honesty requires naming it: modeling a given system genuinely is compression. Kepler's data pre-existed and were fixed; finding their shorter generator is compression of an extant object, and that is why science feels like compression and partly is. The error is the generalization from this descriptive half to the whole. Intelligence's other signature act — choosing what system to write down, inventing the apparatus, injecting the discriminating datum — is constitutive, and no amount of the first kind of activity adds up to the second. The thesis is true on the descriptive half and illicitly extended to the constitutive half. It is the reification of one real part, and one real correlation, into a false identity of the whole.

9. The fence: what the formalism does and does not say

The disproof must survive the formal machinery, so I grant that machinery in full and then mark its exact boundary.

Grant that compression is prediction, exactly. Arithmetic coding achieves the Shannon bound; the expected code length of a code optimal for model q, under the true distribution p, is the cross-entropy H(p) + D_KL(p ‖ q), minimized precisely when q = p. A good compressor needs q ≈ p; a good q yields a good code. There is no daylight between them. (The formal statements are in Appendix A.)

Grant Solomonoff dominance and the invariance theorem — the strongest formal expression of "ideal intelligence is ideal compression" — in full. Grant the Hutter Prize's premise and the empirical result that a large model compresses images and audio better than format-specific codecs (Delétang et al., 2023).

Now the boundary. Every one of these results is conditional on a given measure, model class, or reference machine. Arithmetic coding's optimality is optimality relative to q; it is silent on where q came from. Solomonoff dominance is dominance relative to a chosen universal machine U; the invariance theorem says a different machine V shifts every description length by a bounded constant c_{U,V} — bounded in theory, arbitrary and decisive for any finite agent in practice. No coding theorem selects U, and none constructs a new model class richer than the ones on offer. The constitutive act — selecting or inventing the representation in which compression then operates — lives entirely in the silence of these theorems. Indeed, the very empirical result cuts this way: a 70-billion-parameter model "compresses" image data only by bringing an enormous stipulated structure to bear — a structure that is itself a compression of nothing the test data contained, in the π sense. The deepest formal identity in the field is conditional on a space. Intelligence makes the space.

10. The one open question, and why it is not a rescue

One honest question remains, and a hostile reader should be handed it rather than have it hidden: does the ampliative mechanism — the generation of the new bit — employ compression as an internal subroutine? Perhaps to generate Solomon's test or Cohen's forcing, some compressive operation runs inside the generator.

This is genuinely open. It is a question about the engineering of generation, and I do not claim to have settled it. But it cannot resurrect the thesis, for two reasons. First, "employs compression as a part" is the chef's-knife relation, not identity: a chef needs a sharp knife and cooking is not fundamentally cutting. Second, identity, sufficiency, and necessity-of-the-whole are refuted in Sections 3–5 independently of what runs inside generation. Even if a compressive subroutine were found at the heart of every ampliative act, the most that would establish is that compression is one component among others of a process whose defining work — constituting the space — it is not. Flagging this question openly is stronger than feigning its closure; the disproof does not depend on its answer.

11. Conclusion

Intelligence is not, fundamentally, compression. Not by identity: the optimal agent's own architecture carries a non-compressive planner, and intelligence's signature acts add structure their inputs do not entail. Not by sufficiency: the crystal compresses maximally and is inert, because it cannot search. Not by necessity: the compressed Blockhead is no more intelligent than the uncompressed one, pruning runs on relevant-but-unminimal structure, and the amnesiac reasons without retaining any compression of its experience. And the only reading under which the thesis is true — that any finite representation is "compression" — is true of a crystal, a logic gate, and the digits of π, and so says nothing about intelligence at all.

Under every failed defense lies one error: the conflation of summarizing a given object with constituting a space. These have opposite type-signatures. Compression maps a pre-existing object to a shorter encoding of itself; it presupposes its object. Intelligence's defining act frequently makes the object — chooses the problem to write down, invents the method the problem did not contain, injects the datum that breaks the symmetry. The brevity we so admire in its products is the fingerprint it leaves, not the hand that acts.

You cannot compress your way to the choice of what to compress. That sentence is the whole proof, and everything above is its expansion.

Appendix A — Formal spine

A.1 Compression is prediction (exact). For a code optimal under model q, the codeword length of symbol x is ℓ(x) = −log₂ q(x) bits (Shannon, 1948), achievable to within a vanishing per-symbol overhead by arithmetic coding. Kraft's inequality guarantees that a prefix code with lengths ℓ(x) exists iff Σ_x 2^(−ℓ(x)) ≤ 1 (Kraft, 1949). The expected length under the true distribution p is

E_p[ℓ(x)] = − Σ_x p(x) log₂ q(x) = H(p) + D_KL(p ‖ q),

the cross-entropy of p relative to q. It is minimized, at value H(p), exactly when q = p. Minimizing expected code length is therefore identical to minimizing predictive cross-entropy. The training loss of a next-token language model is exactly this cross-entropy; "train to predict" and "train to compress" denote one objective.
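
The identity in A.1 can be checked numerically. The distributions p and q below are hypothetical and chosen with dyadic probabilities so the values are easy to follow by hand; the functions simply transcribe the definitions of H, D_KL, and cross-entropy in bits.

```python
import math


def entropy(p: list[float]) -> float:
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p: list[float], q: list[float]) -> float:
    """Expected code length under p of a code that is optimal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.125, 0.125]   # hypothetical true distribution
q = [0.25, 0.25, 0.25, 0.25]    # hypothetical model

print("H(p)          =", entropy(p))
print("D_KL(p || q)  =", kl_divergence(p, q))
print("cross-entropy =", cross_entropy(p, q))
print("with q = p    =", cross_entropy(p, p))
```

With these numbers H(p) is 1.75 bits and the mismatch costs a further 0.25 bits, so the uniform model pays 2 bits per symbol; setting q = p removes the excess entirely.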

A.2 The compression prior as ideal induction. With U a fixed universal prefix machine, the Kolmogorov complexity of a string is

K(x) = min { |p| : U(p) = x }.

Solomonoff's universal prior assigns

M(x) = Σ_{p : U(p) = x∗} 2^(−|p|),

where the sum runs over programs p whose output begins with x. M dominates every computable measure μ: there is a constant c_μ > 0 with M(x) ≥ c_μ · μ(x) for all x, whence the M-based predictor's cumulative prediction error against any computable environment is bounded (Solomonoff, 1964). MDL operationalizes the finite case by selecting the hypothesis H minimizing the two-part code length L(H) + L(D | H) (Rissanen, 1978). Legg and Hutter define universal intelligence as

Υ(π) = Σ_{μ ∈ E} 2^(−K(μ)) · V_μ^π,

the expected total reward of policy π across all computable environments μ, weighted by 2^(−K(μ)) so that simpler environments dominate (Legg & Hutter, 2007). AIXI couples this prior to expectimax action selection — a Solomonoff-prior world-model and a separate planning maximization (Hutter, 2005).

A.3 The unreachable optimum. K(x) is uncomputable: no algorithm, ideal or real, returns the shortest program for arbitrary x (it is upper semi-computable — one obtains ever-better upper bounds but can never certify the minimum). Hence "intelligence = finding the shortest description" defines intelligence as something provably no intelligence performs; the only defensible reading is comparative (shorter, not shortest), and Section 6 shows that even the comparative reading is not what we track, since enumeration — the longest representation — solves an enumerable problem equally well.
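
Upper semi-computability can be illustrated with off-the-shelf compressors. The sketch assumes that the output size of a standard codec is a fair proxy for "the length of some program that prints x"; strictly, the fixed size of the decompressor should be added as a constant, and that constant is ignored here. The two test strings are hypothetical.

```python
import bz2
import lzma
import os
import zlib


def complexity_upper_bound(data: bytes) -> int:
    """Smallest compressed size, in bytes, over a few standard compressors.

    Each compressed form, paired with its decompressor, is one program that
    outputs `data`, so its length bounds K(data) from above up to a constant.
    """
    candidates = (
        zlib.compress(data, 9),
        bz2.compress(data, 9),
        lzma.compress(data),
    )
    return min(len(c) for c in candidates)


regular = b"01" * 5000       # highly regular string, 10,000 bytes
noise = os.urandom(10000)    # random bytes, almost surely incompressible

print("regular:", len(regular), "->", complexity_upper_bound(regular))
print("noise:  ", len(noise), "->", complexity_upper_bound(noise))
```

Adding another compressor to the tuple can only lower the bound, never certify that the minimum has been reached, which is the one-sided situation A.3 describes.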

A.4 The fence, formally. Each result in A.1–A.2 is conditional on a fixed q, U, or environment class E. The invariance theorem bounds the dependence on the reference machine,

K_U(x) ≤ K_V(x) + c_{U,V},

with c_{U,V} independent of x but otherwise arbitrary — and for any finite agent this constant is precisely the gap between a brilliant reframing and a mediocre one. No theorem in A.1–A.2 selects U, fixes q, or constructs a model class not already supplied. The selection and invention of the representation — the constitutive act — is exactly what these theorems leave unspecified. The formal apparatus characterizes compression within a chosen space; it is silent on the making of the space.

References

Block, N. (1981). Psychologism and behaviorism. The Philosophical Review, 90(1), 5–43.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377–380.
Cohen, P. J. (1963). The independence of the continuum hypothesis. Proceedings of the National Academy of Sciences, 50(6), 1143–1148.
Delétang, G., Ruoss, A., Duquenne, P.-A., et al. (2023). Language modeling is compression. arXiv:2309.10668.
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1), 1–7.
Kraft, L. G. (1949). A device for quantizing, grouping, and coding amplitude-modulated pulses. MSc thesis, MIT.
Legg, S., & Hutter, M. (2007). Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4), 391–444.
Li, M., & Vitányi, P. (2008). An Introduction to Kolmogorov Complexity and Its Applications (3rd ed.). Springer.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5), 465–471.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423, 623–656.
Solomonoff, R. J. (1964). A formal theory of inductive inference, Parts I and II. Information and Control, 7(1), 1–22; 7(2), 224–254.

A note on two coinages

The fingerprint fallacy (Section 8) and the compression-of-an-extant-object vs. compact-specification-of-a-generator distinction (Section 2) are introduced here as framing devices rather than established terms of art. The fingerprint point has a respectable neighbor in the philosophy of science — the distinction between the context of justification and the context of discovery — but the specific claim that compressibility is a trace left by intelligence rather than its mechanism is the contribution of this essay, and should be attributed accordingly.

The retreat I find hardest is whether generation uses compression internally (§10); I argue it's useful-but-not-constitutive, and I'd welcome pushback on that specifically.