I am sure you are already aware of this, but the conserved quantities we see come from symmetries in the function space (Noether's theorem). The question I think you should be asking is, how do we extend this to random variables in information theory?
I am not sure of The Answer™, but I have an answer and I believe it is The Answer™: with the information bottleneck. Suppose we have a map from some bits in the world $X$ to some properties we care about, $Y$. In my head, I'm using the example of $X =$ an MNIST digit image and $Y =$ its label. If there is a symmetry in $X$ that $Y$ is invariant to, this means we should lose no predictive information by transforming a sample according to that symmetry.
If we already know the symmetry, we can force our predictor to be invariant to it. This is most commonly seen in chemistry models, where they choose graph neural networks that are invariant to vertex and edge relabeling. However, usually it is hard to specify the exact symmetries (how do you code that a "seven" is symmetric to stretching out its leg a little?), and the symmetries may also not be exact (a "seven" can bleed into a "one" by shortening its arm). The solution is to first run the data through an autoencoder model that automatically finds these exact symmetries, and the inexact ones up to whatever bit precision you care about.
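As a concrete toy version of "force the predictor to be invariant" (my own sketch, nothing from the thread): a sum-pooling readout over node features is unchanged by any relabeling of the vertices, which is the kind of invariance those graph networks bake in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "graph": 5 nodes, each with a 3-dimensional feature vector.
node_features = rng.normal(size=(5, 3))

def invariant_readout(feats):
    """Sum-pool over nodes: the output cannot depend on how the nodes are labeled."""
    return feats.sum(axis=0)

# Relabel (permute) the nodes and check the readout is unchanged.
perm = rng.permutation(5)
assert np.allclose(invariant_readout(node_features),
                   invariant_readout(node_features[perm]))
```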
If we replace the exact symmetry-invariant representation by the autoencoder encoding $Z$, then
$$I(Y;Z) \approx I(Y;X)$$
means we want to maximize the mutual information between $Y$ (the property we care about, like "sevenness") and $Z$. Also,
$$H(Z) \ll H(X),$$
meaning the more symmetries $Z$ captures, the smaller its entropy should be. Since $Z$ given $X$ probably has some fixed entropy, this is equivalent to minimizing the mutual information between $X$ and $Z$. Together, we have a tradeoff between maximizing $I(Y;Z)$ and minimizing $I(X;Z)$, which is just the information bottleneck:
$$\max_Z \; I(Y;Z) - \beta\, I(X;Z).$$
The larger the $\beta$, the more "stochastic symmetries" are eliminated, which means $Z$ gets closer to the essence of "sevenness" or whatever properties are in $Y$, but further from saying anything else about $X$. The fun thing you can do is make $X$ and $Y$ the same entity, and now you are getting the essence of $X$ unsupervised (e.g., with MNIST, though it does have a reconstruction loss too).
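Here is a minimal sketch of that objective in the style of a deep variational information bottleneck (Alemi et al.); the architecture and dimensions are placeholders, and the KL-to-prior term is the usual tractable stand-in for the $I(X;Z)$ penalty rather than the exact mutual information.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBClassifier(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, n_classes=10):
        super().__init__()
        # Encoder outputs a mean and log-variance for the stochastic code Z.
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * z_dim))
        self.classifier = nn.Linear(z_dim, n_classes)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample of Z
        return self.classifier(z), mu, logvar

def ib_loss(model, x, y, beta):
    logits, mu, logvar = model(x)
    ce = F.cross_entropy(logits, y)  # surrogate for maximizing I(Y;Z)
    # KL( q(z|x) || N(0,I) ), a tractable upper-bound surrogate for I(X;Z).
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
    return ce + beta * kl

# Usage sketch on fake data:
model = IBClassifier()
x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
ib_loss(model, x, y, beta=1e-3).backward()
```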
Finally, a little evidence that seems to align with autoencoders being the solution comes from adversarial robustness. For a decade or so, it was believed that generalization and adversarial robustness are counter to one another. This seems a little ridiculous to me now, but then again I was not old enough to be aware of the problem before the myth was dispelled (this myth has been dispelled, right? People today know that generalization and robustness are essentially the same problem, right?). Anyway, the way everyone was training for adversarial robustness is they took the training images, perturbed them as little as possible to get the model to mispredict (adversarial backprop), and then trained on these new images. This ended up making the generalization error worse ("Robustness May Be at Odds with Accuracy"). It turns out if you just first autoencode the images or use a GAN to keep the perturbations on-manifold, then it generalizes better ("Disentangling Adversarial Robustness and Generalization"). Almost like it captured the ontology better.
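For concreteness, a hedged sketch of the two recipes being contrasted, with hypothetical `classifier` and `autoencoder` modules: plain adversarial backprop perturbs raw pixels, while the on-manifold variant passes the perturbed image back through an autoencoder before training on it (one simple way to keep perturbations near the data manifold, not necessarily the exact setup of those papers).

```python
import torch
import torch.nn.functional as F

def off_manifold_adversary(classifier, x, y, eps=0.03):
    """FGSM-style perturbation directly in pixel space (the 'odds with accuracy' recipe)."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(classifier(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def on_manifold_adversary(classifier, autoencoder, x, y, eps=0.03):
    """Perturb, then project back through the autoencoder so the example stays on-manifold."""
    x_adv = off_manifold_adversary(classifier, x, y, eps)
    with torch.no_grad():
        return autoencoder(x_adv)  # reconstruction ≈ nearest point on the learned manifold
```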
Perhaps relevant: An Informational Parsimony Perspective on Probabilistic Symmetries (Charvin et al 2024), on applying information bottleneck approaches to group symmetries:
... the projection on orbits of a symmetry group’s action can be seen as an information-preserving compression, as it preserves the information about anything invariant under the group action. This suggests that projections on orbits might be solutions to well-chosen rate-distortion problems, hence opening the way to the integration of group symmetries into an information-theoretic framework. If successful, such an integration could formalise the link between symmetry and information parsimony, but also (i) yield natural ways to “soften” group symmetries into flexible concepts more relevant to real-world data — which often lacks exact symmetries despite exhibiting a strong “structure” — and (ii) enable symmetry discovery through the optimisation of information-theoretic quantities.
First: Yes, this post seems to essentially be about thermodynamics, and either way it is salient to immediately bring up symmetry. So I agree on that point.
Symmetry, thermodynamics, information theory and ontology happen to be topics I take interest in (as stated in my LW bio).
Now, James, for your approach, I would like to understand better what you are saying here, and what you are actually claiming. Could you dumb this down or make it clearer? What scope/context do you intend for this approach? How far do you take it? And how much have you thought about it?
The tricky part when parsing John's post is understanding what he means by "insensitive functions." He doesn't define it anywhere, and I think it's because he was pointing at an idea but didn't yet have a good definition for it. However, the example he gives—conservation of energy—occurs because the laws of physics are insensitive to some kind of symmetry, in this particular case time-translation. I've been thinking a lot about the relationship between symmetries + physics + information theory this past year or two, and you can see some of my progress here and here. To me, it felt kind of natural to jump to "insensitive functions" being a sort of stochastic symmetry in the data.
I haven't fleshed out exactly what that means. For exact symmetries, we can break up the data $X$ into a symmetry-invariant piece and the symmetry factor.
However, it feels like in real data, there is not such a clean separation. It's closer to something like this: we could write $X$ in "big-endian" form, so that we get finer and finer details about $X$ as we read off more bits. My guess is there is an "elbow" in the importance of the bits, similar to how Chebyshev series have an elbow in coefficient magnitude that chebfun identifies and chops off for quicker calculations:
(Source: Chopping a Chebyshev Series)
In fact, as I'm writing this, I just realized that your autoencoder model could just be a discrete cosine (Chebyshev series) transform. It won't be the best autoencoder to exist, but it is what JPEG uses. Anyway, I think the "arm"—or the bits to the left of the elbow—seems to form a natural ontology. Those bits seem to be doing something to help describe $Y$; the bits to the right of the elbow do not.
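As a toy version of the DCT-as-autoencoder idea (my own sketch, with an arbitrary signal and threshold): take the DCT, look for the elbow in coefficient magnitudes, and chop everything past it.

```python
import numpy as np
from scipy.fft import dct, idct

# Smooth signal plus a little noise: most energy lands in the leading DCT coefficients.
t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.cos(2 * np.pi * 7 * t) + 0.01 * np.random.randn(256)

coeffs = dct(signal, norm="ortho")
mags = np.abs(coeffs)

# Crude "elbow" rule: keep coefficients above a small fraction of the largest one.
keep = mags > 1e-3 * mags.max()
chopped = np.where(keep, coeffs, 0.0)
reconstruction = idct(chopped, norm="ortho")

print(f"kept {keep.sum()} of {len(coeffs)} coefficients")
print("max reconstruction error:", np.abs(reconstruction - signal).max())
```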
How does this relate to symmetries? Well, an exact symmetry is cleanly separable, which means its bits could be appended after all the other bits—it's far to the right of the elbow. Chopping at the elbow does satisfy our idea of "ontology" in the exact-symmetry case. Then all we need to do is create a model that chops off those uninteresting bits. The parameter $\beta$ in the information bottleneck pretty much specifies a chopping point. The first term, $I(Y;Z)$, says to keep important bits, while the second term, $\beta I(X;Z)$, says to cut out unimportant bits, and $\beta$ specifies at what point bits become too unimportant to leave in. You can slowly increase $\beta$ until things start catastrophically failing (e.g. validation performance drops), at which point you've probably identified the elbow.
So, I feel like you just got deeper into the weeds here, thinking aloud. This seems interesting. I am trying to parse, but there is not enough formal context to make it make sense to me.
My main question was, anyway: what would (or could) you use it for? What is the scope/context?
(Making some light banter) Maybe you are American, so I need to "debate" you to make it more obvious. "James, this is all a nice theoretical concept, but it seems useless practically. In its current form, I don't see how it could be used for anything important."
Haha, I did initially start with trying to be more explanatory but that ended after a few sentences. Where I think this could immediately improve a lot of models is by replacing the VAEs everyone is using in diffusion models with information bottleneck autoencoders. In short: VAEs are viruses. In long: VAEs got popular because they work decently well, but they are not theoretically correct. Their paper gestures at a theoretical justification, but it settles for less than is optimal. They do work better than vanilla autoencoders, because they "splat out" encodings, which lets you interpolate between datapoints smoothly, and this is why everyone uses them today. If you ask most people using them, they will tell you it's "industry standard" and "the right way to do things, because it is industry standard." An information bottleneck autoencoder also ends up "splatting out" encodings, but has the correct theoretical backing. My expectation is that you will automatically get things like finer details and better instruction following ("the table is on the apple"), because bottleneck encoders have more pressure to conserve encoding bits for such details.
There are probably a few other places this would be useful—for example, in LLM autoregression, you should try to minimize the mutual information between the embeddings and the previous tokens—but I have yet to do any experiments in other places. This is because estimating the mutual information is hard and makes training more fragile.
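To give a flavor of why the MI estimation is the fragile part, here is a MINE-style Donsker-Varadhan lower bound sketch (Belghazi et al.); the critic network, dimensions, and data are all placeholders, and in practice you would have to train the critic to maximize this bound while the main model tries to push the estimate down.

```python
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Scores (embedding, context) pairs for the Donsker-Varadhan bound."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, a, b):
        return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)

def mi_lower_bound(critic, a, b):
    """I(A;B) >= E[T(a,b)] - log E[exp(T(a, b_shuffled))]."""
    joint = critic(a, b).mean()
    marginal = critic(a, b[torch.randperm(b.size(0))])
    return joint - (torch.logsumexp(marginal, dim=0) - math.log(b.size(0)))

# Usage sketch on fake embeddings:
a, b = torch.randn(256, 64), torch.randn(256, 64)
estimate = mi_lower_bound(Critic(), a, b)
```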
In terms of philosophy, well, I don't particularly care for the subject of philosophy itself. Philosophers too often assign muddy meanings to words and wonder why they're confused ten propositions in. My goal when interacting with such sophistry is usually to define the words and figure out what that entails. I think philosophers just do not have the mathematical training to put into words what they mean, and even with that training it's hard to do and will often be wrong. For example, I do not think the information bottleneck is a proper definition of "ontology"; it is closer to "describing an ontology". It does not say why something is the way it is, but it helps you figure out what it is. It's a way to find natural ontologies, but it does not say anything about how they came to be.
Thank you, just knowing you are coming strictly from an ML perspective already helps a lot. This was not obvious to me, since I have approached these topics more from a physics lens.
//
So, addressing your implementation ideas, this approach is practically speaking pretty neat! I lack formal ML background to properly evaluate it, but it seems neat.
Now, I will try to succinctly decipher the theory behind your core idea, and you let me know how I do.
You propose compressing data into a form that preserves the core identity. It gives us something practical we can work with.
The elbow has variables that break symmetry to the left and variables that hold symmetry to the right. This is an important distinction between noise and signal that I think many miss.
This is all context dependent? Context defines the curve, the parameter.
// How did I do?
Note: I should say at this point that understanding fundamental reality is my lifelong quest (constantly ignored in order to live out my little side quests), and I care about this topic. This quest is what ontology means in the classical, philosophical sense. When I speak about ontology in an AI context, I usually mean formal representations of reality, not induced ones. You seem to use the AI context but mean induced ontologies.
The 'ontology as insensitivity' concept described by johnswentworth is interesting, and basically follows from statistical mechanics. But it is perhaps missing the inherent symmetry aspect, or something replacing it, as a fundamental factor. You can't remove all symmetry. Everything with identity exists within a symmetry. This is non-obvious and partly my own assertion, but looking at modern group theory, this is indeed how mathematics defines objects, so I feel supported within this framework.
If we take Wentworth's idea and your elbow analogy, and try to define an object within a formal ontology, within my framework where all objects exist within symmetries, then we get:
The "Elbow" doesn't mark where reality ends and noise begins. It marks the resolution limit of your current context.
Your example was a hand-written digit "7". The Tail: these are the symmetries. You can slant the digit, thicken the line, or shift it left. As long as the variation stays in the tail of the curve, the identity "7" is preserved. (Note that the identity is relative and context dependent.)
The Elbow: This is the breaking point. If you bend the top horizontal line too much, it becomes a "1". You have left the chosen symmetry group of "7" and entered the chosen symmetry group of "1".
This is mostly correct, though I think there are phase changes making some chopping points more natural than others.
If so, I would be genuinely curious to hear your ideas here. This might be an actually powerful concept if it holds up and you can formalize it properly. I assume you are an engineer, not a scientist? I think this idea deserves some deep thinking.
I don't have any more thoughts on this at present, and I probably won't think too much on it in the future, as it isn't super interesting to me.
Something which I think is highly relevant, and which might inform your GAN discussion, is the difference in performance of the W-GAN - basically, if you train the GAN using an optimal transport metric instead of an information-theoretic one, it seems to have much better robustness properties, and this is probably because Shannon entropy doesn't respect continuity of your underlying metric space (e.g. the KL divergence between Delta(x0) and Delta(x0 + epsilon) is infinite for any nonzero epsilon, so it doesn't capture 'closeness'). I don't yet know how I think this should tie into the high-probability latent manifold story you tell, but it seems like part of it.
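A tiny numerical illustration of that continuity point (not tied to any particular GAN code): two point masses one grid step apart have infinite KL divergence, while their Wasserstein-1 distance is just the size of the shift.

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import wasserstein_distance

grid = np.linspace(0.0, 1.0, 101)
p = np.zeros_like(grid); p[50] = 1.0   # point mass at x0 = 0.5
q = np.zeros_like(grid); q[51] = 1.0   # point mass at x0 + 0.01

kl = rel_entr(p, q).sum()                    # inf: KL ignores how close the two points are
w1 = wasserstein_distance(grid, grid, p, q)  # 0.01: optimal transport sees the metric
print(kl, w1)
```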
Have you heard of Rene Thom's work on Structural Stability and Morphogenesis? I haven't been able to read this book yet[1], but my understanding[2] of its thesis is that "development of form" (i.e. morphogenesis, broadly construed, e.g. biological or otherwise) depends on information from the structurally stable "catastrophe sets" of the potential driving (or derived from) the dynamics - structurally stable ones, precisely because what is stable under infinitesimal perturbation is the only kind of information observable in nature.
Rene Thom puts all of this in a formal model - and, using tools of algebraic topology, shows that these discrete catastrophes (under some conditions, like the number of variables) have a finite classification, and thus (in the context of this morphological model) form a sort of finitary "sufficient statistic" of the developmental process.
This seems quite similar to the point you're making: [insensitive / stable / robust] things are rare, but they organize the natural ontology of things because they're the only information that survives.
... and there seems to be the more speculative thesis of Thom (presumably; again, I don't know this stuff), where geometric information about these catastrophes directly corresponds to functional / internal-structure information about the system (in Thom's context, the Organism whose morphogenic process we're modeling) - this presumably is one of the intellectual predecessors of Structural Bayesianism, the thesis that there is a correspondence between the internal structures of Programs or the Learning Machine and the local geometry of some potential.
[1] I don't think I have enough algebraic topology background yet to productively read this book. Everything in this comment should come with Epistemic Status: Low Confidence.
[2] From discussions and reading distillations of Thom's work.
I agree Thom's work is interesting and relevant here; I've seen it bruited about a lot, but likewise haven't gotten around to seriously studying it. From my perspective, structural stability is extremely important, but I am curious if there is any reason to identify this kind of stability with catastrophes? I've always looked at the classification and figured it's just one example of this kind of structural stability, and not one I've really even seen much of in physics, compared to the I-think-close-but-not-exactly-the-same more general idea of critical phenomena and phase transitions. I'd be very interested if there were some sort of RG-based stability analysis.
The dynamical systems way of framing this I like the most is Koopman analysis; Steve Brunton has a great family of talks on YouTube: https://www.youtube.com/watch?v=J7s0XNT96ag
Tldr is that the Koopman operator evolves functions of the dynamical variables; it's an infinite-dimensional linear operator, thus it has eigenfunctions and eigenvalues. Conserved functions of state have eigenvalue zero (of the generator); nearly conserved quantities have 'small' eigenvalues, corresponding to a slower rate of decay in time. Then you can characterise all functions in terms of the decay rate of their self-correlation. So the dominant Koopman eigenvalue of a function (not necessarily an eigenfunction) is a scalar value corresponding to how "not conserved" it is.
You can also look at the Jordan blocks, which correspond to groups of functions which are predictively closed (you can predict their future values by knowing only the present values of those functions). These are some threads I've found interesting in thinking about natural ontology but I do not think they are sufficient - for example they give a very clear picture of intra-realization correlations, but do not have a good way of talking about inter-realization correlations or sensitivity to initial conditions.
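For concreteness, the standard data-driven way to approximate those eigenvalues is dynamic mode decomposition; here's a bare-bones numpy sketch on a made-up linear toy system. (This is discrete time, so a conserved observable shows up as eigenvalue 1 rather than as a zero eigenvalue of the generator.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete-time dynamics: one conserved direction (eigenvalue 1)
# and one slowly decaying direction (eigenvalue 0.9).
A_true = np.diag([1.0, 0.9])
x = rng.normal(size=2)
snapshots = [x]
for _ in range(100):
    x = A_true @ x
    snapshots.append(x)
X = np.array(snapshots[:-1]).T  # states at time t
Y = np.array(snapshots[1:]).T   # states at time t + 1

# Exact DMD: least-squares fit of the one-step linear map, then look at its spectrum.
A_fit = Y @ np.linalg.pinv(X)
print(np.sort(np.linalg.eigvals(A_fit).real))  # ~[0.9, 1.0]; eigenvalue 1 <-> conserved observable
```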
The most canonical example of a "natural ontology" comes from gases in stat mech. In the simplest version, we model the gas as a bunch of little billiard balls bouncing around in a box.
The dynamics are chaotic. The system is continuous, so the initial conditions are real numbers with arbitrarily many bits of precision - e.g. maybe one ball starts out centered at x = 0.8776134000327846875..., y = 0.0013617356590430716..., z = 0.132983270923481... . As balls bounce around, digits further and further back in those decimal representations become relevant to the large-scale behavior of the system. (Or, if we use binary, bits further and further back in the binary representations become relevant to the large-scale behavior of the system.) But in practice, measurement has finite precision, so we have approximately-zero information about the digits/bits far back in the expansion. Over time, then, we become maximally-uncertain about the large-scale behavior of the system.
... except for predictions about quantities which are conserved - e.g. energy.
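A quick numerical illustration of the "digits further back become relevant" point, using the chaotic logistic map as a stand-in for the billiard dynamics (my toy, not the post's; it has no conserved quantity, so it only shows the sensitivity half of the story):

```python
def logistic(x, steps):
    for _ in range(steps):
        x = 4.0 * x * (1.0 - x)  # fully chaotic logistic map
    return x

x0 = 0.3776134000327846
x0_truncated = round(x0, 8)  # same leading digits, digits past the 8th thrown away

for steps in (10, 30, 60):
    gap = abs(logistic(x0, steps) - logistic(x0_truncated, steps))
    print(steps, gap)
# The gap grows roughly exponentially until it saturates at order 1:
# the discarded digits end up dominating the large-scale prediction.
```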
Conversely, our initial information about the large-scale system behavior still tells us a lot about the future state, but most of what it tells us is about bits far back in the binary expansion of the future state variables (i.e. positions and velocities). Another way to put it: initially we have very precise information about the leading-order bits, but near-zero information about the lower-order bits further back. As the system evolves, these mix together. We end up with a lot of information about the leading-order and lower-order bits combined, but very little information about either one individually. (Classic example of how we can have lots of information about two variables combined but little information about either individually: I flip two coins in secret, then tell you that the two outcomes were the same. All the information is about the relationship between the two variables, not about the individual values.) So, even though we have a lot of information about the microscopic system state, our predictions about large-scale behavior (i.e. the leading-order bits) are near-maximally uncertain.
... again, except for conserved quantities like energy. We may have some initial uncertainty about the energy, or there may be some noise from external influences, etc, but the system’s own dynamics will not "amplify" that uncertainty the way it does with other uncertainty.
So, while most of our predictions become maxentropic (i.e. maximally uncertain) as time goes on, we can still make reasonably-precise predictions about the system’s energy far into the future.
That's where the natural ontology comes from: even a superintelligence will have limited-precision measurements of initial conditions, so insofar as the billiard balls model is a good model of a particular gas, even a superintelligence will make the same predictions about this gas that a human scientist would. It will measure and track conserved quantities like the energy, and then use a maxent distribution subject to those conserved quantities - i.e. a Boltzmann distribution. That's the best which can realistically be done.
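To spell out the "maxent subject to the conserved quantity" step: with toy energy levels of my own choosing, the maximum-entropy distribution with a fixed mean energy is exactly a Boltzmann distribution, and the inverse temperature is the Lagrange multiplier you solve for.

```python
import numpy as np
from scipy.optimize import brentq

energies = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # toy energy levels
target_mean_energy = 1.2                        # the measured, conserved quantity

def mean_energy(beta):
    w = np.exp(-beta * energies)
    return (energies * w).sum() / w.sum()

# The maxent distribution subject to a mean-energy constraint is p_i ∝ exp(-beta * E_i);
# solve for the Lagrange multiplier beta that reproduces the measured mean energy.
beta = brentq(lambda b: mean_energy(b) - target_mean_energy, -10.0, 10.0)
p = np.exp(-beta * energies); p /= p.sum()
print(beta, p, (p * energies).sum())
```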
Emphasizing Insensitivity
In the story above, I tried to emphasize the role of sensitivity. Specifically: whatever large-scale predictions one might want to make (other than conserved quantities) are sensitive to lower and lower order bits/digits, over time. In some sense, it's not really about the "size" of things, it's not really about needing more and more precise measurements. Rather, the reason chaos induces a natural ontology is because non-conserved quantities of interest depend on a larger and larger number of bits as we predict further and further ahead. There are more and more bits which we need to know, in order to make better-than-Boltzmann-distribution predictions.
Let's illustrate the idea from a different angle.
Suppose I have a binary function $f$, with a million input bits and one output bit. The function is uniformly randomly chosen from all such functions - i.e. for each of the $2^{1000000}$ possible inputs $x$, we flipped a coin to determine the output $f(x)$ for that particular input.
Now, suppose I know f (i.e. I know the output produced by each input), and I know all but 50 of the input bits - i.e. I know 999950 of the input bits. How much information do I have about the output?
Answer: almost none. For almost all such functions, knowing 999950 input bits gives us $\sim \frac{1}{2^{50}}$ bits of information about the output. More generally, if the function has $n$ input bits and we know all but $k$, then we have $o\!\left(\frac{1}{2^k}\right)$ bits of information about the output. (That's "little o" notation; it's like big O notation, but for things which are small rather than things which are large.) Our information drops off exponentially with the number of unknown bits.
Proof Sketch
With $k$ input bits unknown, there are $2^k$ possible inputs. The output corresponding to each of those inputs is an independent coin flip, so we have $2^k$ independent coin flips. If $m$ of those flips are 1, then we assign a probability of $\frac{m}{2^k}$ that the output will be 1.
As long as $2^k$ is large, the Law of Large Numbers will kick in, and very close to half of those flips will be 1 almost surely - i.e. $m \approx \frac{2^k}{2}$. The error in this approximation will (very quickly) converge to a normal distribution, and our probability that the output will be 1 converges to a normal distribution with mean $\frac{1}{2}$ and standard deviation $\frac{1}{2^{k/2}}$. So, the probability that the output will be 1 is roughly $\frac{1}{2} \pm \frac{1}{2^{k/2}}$.
We can then plug that into Shannon's entropy formula. Our prior probability that the output bit is 1 is $\frac{1}{2}$, so we're just interested in how much that $\pm\frac{1}{2^{k/2}}$ adjustment reduces the entropy. This works out to $o\!\left(\frac{1}{2^k}\right)$ bits.
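A quick Monte Carlo sanity check of that scaling (my own sketch, not from the post): for each $k$, sample the $2^k$ relevant coin flips many times, compute the entropy deficit of the resulting prediction, and watch the information roughly halve with every extra unknown bit.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for k in range(4, 12):
    # For a random function, the outputs over the 2^k consistent inputs are iid fair coin flips,
    # so our predicted probability of output 1 is m / 2^k with m ~ Binomial(2^k, 1/2).
    m = rng.binomial(2**k, 0.5, size=20000)
    info = 1.0 - binary_entropy(m / 2**k).mean()  # entropy deficit = bits of information
    print(k, info, info * 2**k)  # info * 2^k stays roughly constant (~1 / (2 ln 2))
```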
The effect here is similar to chaos: in order to predict the output of the function better than 50/50, we need to know basically-all of the input bits. Even a relatively small number of unknown bits - just 50 out of 1000000 - is enough to wipe out basically-all of our information and leave us basically back at the 50/50 prediction.
Crucially, this argument applies to random binary functions - which means that almost all functions have this property, at least among functions with lots of inputs. It takes an unusual and special function to not lose basically-all information about its output from just a few unknown inputs.
In the billiard balls case, the "inputs" to our function would be the initial conditions, and the "outputs" would be some prediction about large-scale system behavior at a later time. The chaos property very roughly tells us that, as time rolls forward enough, the gas-prediction function has the same key property as almost all functions: even a relatively small handful of unknown inputs is enough to totally wipe out one's information about the outputs. Except, of course, for conserved quantities.
Characterization of Insensitive Functions/Predictions?
Put this together, and we get a picture with a couple of pieces: natural ontologies are organized around the few insensitive/conserved quantities, since predictions about everything else wash out to maximum entropy, and almost all functions are sensitive, so those insensitive quantities are rare and special.
So if natural ontologies are centrally about insensitive functions, and nearly all functions are sensitive... seems maybe pretty useful to characterize insensitive functions?
This has been done to some extent in some narrow ways - e.g. IIRC there's a specific sense in theory of computation under which the "least sensitive" binary functions are voting functions, i.e. each input bit gets a weight (positive or negative) and then we add them all up and see whether the result is positive or negative.
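A small exhaustive check of that claim, for illustration (my own toy code): the average sensitivity of majority voting grows only like $\sqrt{n}$, while parity is maximally sensitive at $n$.

```python
import itertools

def average_sensitivity(f, n):
    """Average number of single-bit flips that change f's output, over all 2^n inputs."""
    total = 0
    for bits in itertools.product((0, 1), repeat=n):
        y = f(bits)
        total += sum(f(bits[:i] + (1 - bits[i],) + bits[i + 1:]) != y for i in range(n))
    return total / 2**n

majority = lambda bits: int(sum(bits) > len(bits) / 2)
parity = lambda bits: sum(bits) % 2

n = 11
print("majority:", average_sensitivity(majority, n))  # ≈ 2.7, roughly sqrt(n) scale
print("parity:  ", average_sensitivity(parity, n))    # = 11, maximally sensitive
```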
But for natural ontology purposes, we'd need a more thorough characterization. Some way to take any old function - like e.g. the function which predicts a later billiard-ball-gas state from an earlier billiard-ball-gas state - and quantitatively talk about its "conserved quantities"/"insensitive quantities" (or whatever the right generalization is), its "sensitive quantities", and useful approximations when some quantities are on a spectrum between fully sensitive and fully insensitive.