I'd like to introduce a mathematical definition that tries to capture some of our intuitive notion of 'complexity'. People use this word in lots of ways. What I mean by this word is the sort of 'complexity' that seems to have been increasing over the history of our planet for the past few billion years. By excample:

--Simple self-replicators are more complex than a random sea of chemicals

--Single-celled bacteria are more complex than simple replicators

--Multicellular life is more complex than single-celled life

--Hunter-gatherer bands are more complex than most other life forms

--Modern civilizations are more complex than hunter-gatherer bands

This definition would be in terms of the underlying computer program of physics, and in the style of Kolmogorov complexity. Kolmogorov complexity itself doesn't work, because it seems that our entire universe can be generated by a short computer program, namely the laws of physics running on some simple intitial state. (Yes, our universe is in fact quantum-mechanical and hence randomized. But it's still seems likely that we can sample from the distribution of possible worlds with a short program) Since our universe can be generated by a short program, it seems that any sort of 'complexity' that can be found must lie in something besides the underlying program itself.

Now, let's try to construct a formal model. We'll consider some sort of computable dynamical system with size N. For concreteness let's imagine it's a (randomized) cellular automaton. Then running it for T steps should take O(NT) steps of computation.

We're trying to predict the evolution of the system with limited computational resources. We'll build a randomized Turing machine and only let it run for m timesteps. We'll score our predictor on the summed KL divergence of its predictions at each timestep and the true distribution at that step.

Suppose we are using the best possible predictor for a given runtime. The entire history of the universe can be run in time NT simply by running physics. Then our KL divergence at each timestep will be zero.

Conversely, if we are given no time at all the best we can do is output the uniform distribution at each timestep. the KL divergence will be 1 - o(T), pretty close to 1 at each timestep. (Assuming the system continues to change at least somewhat over the course of the evolution)

In between, how will our predictors fare? That depends on the structure of the underlying program, the initial state, and the details of the trajectory.

At one extreme, imagine a dynamical system that is completely pseudorandom. There is no way to predict the output of the system short of running it. Then our predictors wouldn't start to perform very well until their runtime was actually on the order of NT:

At another extreme, picture an extremely simple system. For example, perhaps the initial state is such that subsequent states simply alternate in a cycle of length 2 forever. Then we can get perfect accuracy with programs of runtime just O(log T + N)

Something like our world seems a bit subtler. Initially you can model things pretty well with some sort of simple chemistry simultation. Getting into the finer details of the movement of the chemicals, there isn't really any structure to it, you just have to follow each individual particle. Later, life arises, and you can start to get a bit more detail. You can predict that fit organisms will survive and those that are unfit will die, and can predict the behavior of each organism. When humans arise, you can start to predict things based on consequentialism and social dynamics, which are more subtle. And so on:

Here, then, is my candidate definition of complexity. It is how broadly spread out the 'dips' in the graphs above are. How many layers of structure can be uncovered by 'thinking' longer and longer about the evolution. In both the pseudorandom and simple cases, there is really only one layer of structure to be considered. But in complex time-histories, like that of our world, there are many layers, like an onion. Further, the number of layers that can be usefully considered will increase with time.

Formally, the negative derivative of the divergence defines a probability distribution over the interval [O(N), O(NT)]. Let's say that the complexity is the mean absolute difference of this distribution(perhaps with the x-axis scaled logarithmically).

I realize that it is not obvious that this definition would in fact predict higher complexity for the example cases I talked about earlier. However, I think it's plausible that it would, and it at least is not obviously ill-fitted for this purpose like Kolmogorov complexity.

There are some subtleties to address. It seems like it might be necesarry to add some constraint on the program length to prevent simply encoding the entire evolution in the body of the program.