There's something called the No Free Lunch theorem, which says, approximately, that there's no truly general algorithm for learning: if an algorithm predicts some environment better than chance, there must exist some adversarial environment on which it does at least that much worse than chance. (Yes, this is even true of Solomonoff induction.)
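
To make the averaging claim concrete, here is a minimal sketch on a toy domain of five points with binary labels. The learner (`majority_learner`) and the train/test split are hypothetical choices made only for illustration. The point is that once you average over every possible labeling of the domain, accuracy on the unseen points comes out at chance no matter which deterministic learner you plug in.

```python
from itertools import product


def majority_learner(train):
    """Hypothetical learner: always predict the most common training label (ties go to 0)."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones > len(train) else 0
    return lambda x: guess


def contrarian_learner(train):
    """Hypothetical learner that predicts the opposite of the majority learner."""
    base = majority_learner(train)
    return lambda x: 1 - base(x)


def average_off_training_accuracy(learner, n_points=5, n_train=3):
    """Average accuracy on unseen points over every binary labeling of the domain."""
    domain = list(range(n_points))
    train_x, test_x = domain[:n_train], domain[n_train:]
    total = 0.0
    count = 0
    for labels in product([0, 1], repeat=n_points):
        train = [(x, labels[x]) for x in train_x]
        predict = learner(train)
        correct = sum(predict(x) == labels[x] for x in test_x)
        total += correct / len(test_x)
        count += 1
    return total / count


def accuracy_on(learner, labels, n_train=3):
    """Accuracy on the unseen points for one specific environment (labeling)."""
    train = [(x, labels[x]) for x in range(n_train)]
    predict = learner(train)
    test_x = range(n_train, len(labels))
    return sum(predict(x) == labels[x] for x in test_x) / (len(labels) - n_train)


print(average_off_training_accuracy(majority_learner))
print(average_off_training_accuracy(contrarian_learner))
print(accuracy_on(majority_learner, (1, 1, 1, 1, 1)))  # a structured environment
print(accuracy_on(majority_learner, (1, 1, 1, 0, 0)))  # an adversarial one
```

The majority learner is perfect on the all-ones environment and pays for it exactly on the environment where the unseen labels flip. This is only the uniform-average version of the statement on a tiny domain, not a proof of the theorem.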

In the real world, this is almost completely irrelevant; empirically, general intelligence exists. However, leaving anthropics aside for a moment, we ought to find this irrelevance surprising in some sense; a robust theory of learning first needs to answer the question of why, in our particular universe, it's possible to learn anything at all.

I suspect that Wentworth's Telephone Theorem, which says that in the limit of causal distance, information is either completely preserved or completely destroyed, may be a component of a possible answer. The Telephone Theorem is not itself a property of our universe, but it does single out a property of things we expect to be learnable in the first place. We ourselves are very large in terms of underlying physics, so mostly we can only make observations at large causal distance, and therefore we only care about the preserved information, not the destroyed information. A maximum-entropy universe, of the sort usually considered by no-free-lunch theorems, would actually look simpler to a macroscale observer, since macroscopic properties like temperature, density, etc. would be approximately uniform throughout.
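
As a toy illustration of the preserved-or-destroyed dichotomy (not the theorem itself), consider a single uniformly random bit passed through a chain of identical noisy copies. The model is my own assumption: each step flips the bit independently with probability `flip_prob`, so after `n_steps` the net flip probability is `(1 - (1 - 2 * flip_prob) ** n_steps) / 2`, and the mutual information with the original bit is one minus the binary entropy of that number.

```python
import math


def binary_entropy(p):
    """Shannon entropy, in bits, of a coin with bias p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def mutual_information_after(n_steps, flip_prob):
    """I(X_0; X_n) in bits for a uniform bit sent through n identical binary symmetric channels."""
    effective_flip = (1 - (1 - 2 * flip_prob) ** n_steps) / 2
    return 1.0 - binary_entropy(effective_flip)


for n in (1, 10, 100, 1000):
    noiseless = mutual_information_after(n, 0.0)
    noisy = mutual_information_after(n, 0.1)
    print(n, noiseless, noisy)
```

With no noise the full bit survives at every distance; with any noise strictly between 0 and 1 the information decays toward zero. In the limit there is nothing in between, which is the flavor of the claim above.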

I expect that this ought to imply something about the class of learning algorithms that work well on the type of data we want to predict, but I'm not sure what.

2 Answers

My intuition is that the learnability of our universe is mostly because it's not a max entropic universe. There is real structure to it, and there are hyperpriors and inductive biases that let one effectively learn it. Because we evolved in such a universe, we have such machinery.

I haven't been thinking of it in terms of the Telephone Theorem.

I don't agree that max entropic universes are simpler. I think a lot of intelligence is compression (efficiently generating accurate world models, prediction, etc.), and I don't agree that one can better compress or predict a max entropic universe. I also think the choice of which macroscale properties to care about is somewhat arbitrary. See also: "utility maximisation = description length minimisation".

This is something I've thought about recently. A full answer would take too long to write, but I'll leave a couple of comments.

First, what this implies about learning algorithms can be summarized as "it explains the manifold hypothesis." The Telephone Theorem creates an information bottleneck that limits how much information can be captured at a distance. This means that a 64x64 RGB image, despite being nominally 12288-dimensional, in reality captures far less information and lies on a much lower-dimensional latent space: chaos has irreversibly dispersed all the information about the microscopic details of the object being imaged. "Free lunch" follows quite easily from this, since the set of functions you care about is not really the set of functions on all RGB images, but the set of functions on a much smaller latent space.
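
A quick back-of-the-envelope version of that last step, as a sketch. The 8 bits per channel and the latent dimension of 50 are assumptions I picked purely for illustration; nothing here measures the real latent dimension of any image distribution. If an input takes `b` bits to specify, there are `2**b` distinct inputs and `2**(2**b)` binary functions on them, so the number of bits needed per input is a handy proxy for how large the function class is.

```python
def image_dimensions(width=64, height=64, channels=3):
    """Nominal dimensionality of an image: one coordinate per pixel channel."""
    return width * height * channels


def bits_to_specify(dimensions, bits_per_value=8):
    """Bits needed to pin down one point, assuming a fixed precision per coordinate."""
    return dimensions * bits_per_value


nominal_dims = image_dimensions()  # the 64x64 RGB case from the text
latent_dims = 50                   # hypothetical latent dimension, for illustration only

nominal_bits = bits_to_specify(nominal_dims)
latent_bits = bits_to_specify(latent_dims)

# There are 2**bits distinct inputs, hence 2**(2**bits) binary functions on them.
print(f"nominal: {nominal_dims} dims, {nominal_bits} bits per input")
print(f"latent:  {latent_dims} dims, {latent_bits} bits per input")
print(f"the input space shrinks by a factor of 2**{nominal_bits - latent_bits}")
```

Under these made-up numbers, the class of functions a learner has to distinguish between is doubly exponentially smaller on the latent space, which is why averaging over all functions on raw pixels says little about the problems we actually pose.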

Second, the vanilla Telephone Theorem isn't quite sufficient: the only information that persists in the infinite-time limit consists of conserved quantities (e.g. energy), which isn't very interesting. You need to restrict to some finite time (one sufficiently longer than the timescale of your microscopic dynamics) instead. In this case, persistent information now includes "conditionally conserved" quantities, such as the color of a solid object (caused by spontaneous symmetry-breaking reducing the permanently-valid Lorentz symmetry to the temporarily-valid space group symmetry). I believe the right direction to go here is ergodic theory and KAM theory, although the details are fuzzy to me.