I know you asked for other people (presumably not me) to confirm this but I can point you to the statement of the theorem, as written by Conant and Ashby in the original paper:
Theorem: The simplest optimal regulator R of a reguland S produces events R which are related to the events S by a mapping h: S → R.
Restated somewhat less rigorously, the theorem says that the best regulator of a system is one which is a model of that system in the sense that the regulator’s actions are merely the system’s actions as seen through a mapping h.
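In symbols (just restating the quoted theorem, with h being the mapping it names), the content is that the regulator's event is a fixed function of the system's event:

```latex
r = h(s) \quad \text{for some mapping } h : S \to R
```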
I agree that it has nothing to do with modelling and is not very interesting! But the simple theorem is surrounded by so much mysticism (both in the paper and in discussions about it) that it is often not obvious what the theorem actually says.
The diagram is a causal Bayes net which is a DAG so it can't contain cycles. Your diagram contains a cycle between R and Z. The diagram I had in mind when writing the post was something like:
which is a thermostat over a single timestep.
If you wanted to have a feedback loop over multiple timesteps, you could conjoin several of these diagrams:
Each node along the top row is the temperature at successive times. Each node along the bottom row is the controller state at different times.
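To make the conjoined, multi-timestep picture concrete, here is a rough sketch of the edge structure I have in mind. The node names (T_t for temperature, C_t for controller state) and the exact edge set are my own illustrative choices, not taken from the figure:

```python
# Sketch of the conjoined DAG: temperature nodes T_t along the top row,
# controller-state nodes C_t along the bottom row. Edge set is an assumption.
n_steps = 3

edges = []
for t in range(n_steps):
    edges.append((f"T{t}", f"C{t}"))    # controller reads the current temperature
    edges.append((f"C{t}", f"T{t+1}"))  # controller's action influences the next temperature
    edges.append((f"T{t}", f"T{t+1}"))  # temperature also depends on its previous value

for parent, child in edges:
    print(f"{parent} -> {child}")
```

Note that, unlike a diagram with an R–Z cycle, this unrolled version stays acyclic: the feedback only ever points forward in time.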
Thanks for the clarifications, that all makes sense. I will keep thinking about this!
This seems like an interesting problem! I've been thinking about it a little bit but wanted to make sure I understood before diving in too deep. Can I see if I understand this by going through the biased coin example?
Suppose I have 2^5 coins and each one is given a unique 5-bit string label, covering all binary strings from 00000 to 11111. Call the string on the label λ.
The label given to the coin indicates its 'true' bias. The string 00000 indicates that the coin with that label has p(heads)=0. The coin labelled 11111 has p(heads)=1. The ‘true’ p(heads) increases in equal steps going up from 00000 to 00001 to 00010 etc. Suppose I randomly pick a coin from this collection, toss it 200 times and call the number of heads X_1. Then I toss it another 200 times and call the number of heads X_2.
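If it helps, here is a quick simulation of the setup I'm describing. The bias k/31 for the coin labelled with the binary representation of k is my reading of 'equal steps'; the rest is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_bits = 5
n_flips = 200

# Pick a coin uniformly at random; its label is the 5-bit binary string of k,
# and its true bias is k/31 ("equal steps" from 0 at 00000 up to 1 at 11111).
k = rng.integers(0, 2**n_bits)
label = format(k, f"0{n_bits}b")
p_heads = k / (2**n_bits - 1)

# Two independent batches of 200 tosses of the same coin.
X1 = rng.binomial(n_flips, p_heads)
X2 = rng.binomial(n_flips, p_heads)

print(label, p_heads, X1, X2)
```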
Now, if I tell you what the label on the coin was (which tells us the true bias of the coin), telling you X_1 would not give you any more information to help you guess X_2 (and vice versa). This is the first Natural Latent condition (λ induces independence between X_1 and X_2). Alternatively, if I didn't tell you the label, you could estimate it from either X_1 or X_2 equally well. This corresponds to the other two diagrams.
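For reference, my understanding is that the three conditions (the three diagrams) amount to the following conditional independences, with λ as the latent:

```latex
% Mediation: the label screens off the two batches of tosses from each other
X_1 \perp X_2 \mid \lambda
% Redundancy: given either batch, the other batch tells you nothing more about the label
\lambda \perp X_2 \mid X_1, \qquad \lambda \perp X_1 \mid X_2
```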
I think that the full label will be an approximate stochastic natural latent. But if we consider only the first bit[1] of the label (which roughly tells us whether the bias is above or below 50% heads) then this bit will be a deterministic natural latent, because with reasonably high certainty you can guess the first bit of λ from X_1 or X_2. This is because the conditional entropy H(first bit of λ | X_1) is low. On the other hand, H(λ | X_1) will be high. If I get only 23 heads out of 200 tosses, I can be reasonably certain that the first bit of λ is a 0 (i.e. the coin has a less than 50% chance of coming up heads) but can't be as certain what the last bit of λ is. Just because λ satisfies the Natural Latent conditions within ε, this doesn't imply that λ satisfies the deterministic latent condition within ε.

We can use X_1 to find a 5-bit estimate of λ, but most of the useful information in that estimate is contained in the first bit. The second bit might be somewhat useful, but it's less certain than the first. The last bit of the estimate will largely be noise. This means that going from using λ to using 'first bit of λ' doesn't decrease the usefulness of the latent very much, since the stuff we are throwing out is largely random. As a result, the 'first bit of λ' will still satisfy the natural latent conditions almost as well as λ does. By throwing out the later bits, we threw away the most 'stochastic' bits, while keeping the most 'latenty' bits.
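To put rough numbers on the '23 heads out of 200' point, here is a quick sketch that computes the posterior over the 32 labels given X_1 and compares H(λ | X_1 = 23) with H(first bit of λ | X_1 = 23), assuming a uniform prior over coins and the k/31 bias spacing from above:

```python
import numpy as np
from scipy.stats import binom

n_flips, x1 = 200, 23
n_coins = 32

# Uniform prior over the 32 coins; coin k has bias k/31.
biases = np.arange(n_coins) / (n_coins - 1)
likelihood = binom.pmf(x1, n_flips, biases)
posterior = likelihood / likelihood.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Entropy of the full 5-bit label given X_1 = 23 heads...
h_full = entropy(posterior)

# ...versus entropy of just the first bit (first bit = 1 iff k >= 16).
first_bit_is_one = np.arange(n_coins) >= n_coins // 2
p_first_bit = np.array([posterior[~first_bit_is_one].sum(),
                        posterior[first_bit_is_one].sum()])
h_first_bit = entropy(p_first_bit)

print(f"H(label | X1=23) ≈ {h_full:.3f} bits")
print(f"H(first bit | X1=23) ≈ {h_first_bit:.3f} bits")
```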
So in this case, we have started from a stochastic natural latent and used it to construct a deterministic natural latent which is almost as good. I haven't done the calculation, but hopefully we could say something like 'if λ satisfies the natural latent conditions within ε then the first bit of λ satisfies the natural latent conditions within 2ε (or √ε or something else)'. Would an explicit proof of a statement like this for this case be a special case of the general problem?
The problem could be framed as something like: "Is there some standard process we can apply to every stochastic natural latent in order to obtain a deterministic natural latent which is almost as good (in terms of ε)?" This process would be analogous to the 'throwing away the less useful/more random bits of λ' which we did in the example above. Does this sound right?
Also, can all stochastic natural latents be thought of as 'more approximate' deterministic latents? If a latent satisfies the three natural latent conditions within ε, we can always find a (potentially much bigger) ε′ such that this latent also satisfies the deterministic latent condition, right? This is why you need to specify that the problem is showing that a deterministic natural latent exists with 'almost the same' ε. Does this sound right?
[1] I'm going to talk about the 'first bit' but an equivalent argument might also hold for the 'first two bits' or something. I haven't actually checked the maths.
It seems that the relevant thing is not so much how many values you have tested as the domain size of the function. A function with a large domain cannot be explicitly represented with a small lookup table. But this means you also have to consider how the black box behaves when you feed it something outside of its domain, right?
This sounds right. Implicitly, I was assuming that when the black box was fed an x outside of its domain it would return an error message or at least, something which is not equal to f(x). I realise that I didn't make this clear in the post.
But what if it is a lookup table, but the index is taken mod the length of the table? Or more generally, what if a hash function is applied to the index first?
In the post I was considering programs which are 'just' lookup tables. But I agree that there are other programs that use lookup tables but are not 'just' tables. The motivation is that if you want to simulate f(x) over a large domain using a program smaller than the bound given in the post, then the program needs to contain some feature which captures something nontrivial about the nature of f.
Like (simple example just to be concrete) suppose f(x) = [x (mod 10)]^2, defined on all real numbers. Then the program in the black box might have two steps: 1) compute x (mod 10) using a non-lookup-table method, then 2) use a lookup table containing x^2 for x = 1 to 10. This (by my slightly arbitrary criteria) would count as a program which 'actually computes the function', since it has fixed length and its input/output behaviour coincides with f(x) for all possible inputs. It seems to me that if a black-box program contains a subsection which computes x (mod 10) then it has captured something important about f(x), more so than if the program only contains a big lookup table.
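A toy version of that two-step black box might look like the sketch below. I've restricted to integer inputs and used residues 0 through 9 for the table, so the details differ slightly from the prose; it's only meant to show the structure:

```python
# Step 1: a non-lookup-table computation that reduces the input.
# Step 2: a small fixed lookup table for the squares of the residues.
SQUARE_TABLE = {r: r**2 for r in range(10)}  # table part: residue -> residue^2

def f_blackbox(x: int) -> int:
    residue = x % 10              # "actually computes" the mod-10 structure of f
    return SQUARE_TABLE[residue]  # table part: only 10 rows, however large the domain

print(f_blackbox(123))  # 9, since 123 mod 10 = 3 and 3^2 = 9
```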
It seems that counting the number of unique output values taken by the function more robustly lower-bounds its size as a (generalized) lookup table?
I think this is right. If you wanted to rule out lookup tables being used at any point in the program then this is bottlenecked by the number of unique outputs.
Cool, that all sounds fair to me. I don't think we have any substantive disagreements.
hmm, we seem to be talking past each other a bit. I think my main point in response is something like this:
In non-trivial settings, (some but not all) structural differences between programs lead to differences in input/output behaviour, even if there is a large domain for which they are behaviourally equivalent.
But that sentence lacks a lot of nuance! I'll try to break it down a bit more to find if/where we disagree (so apologies if a lot of this is re-hashing).
For most practical purposes, "calculating a function" is only and exactly a very good compression algorithm for the lookup table.
I think I disagree. Something like this might be true if you just care about input and output behaviour (it becomes true by definition if you consider that any functions with the same input/output behaviour are just different compressions of each other). But it seems to me that how outputs are generated is an important distinction to make.
I think the difference goes beyond 'heat dissipation or imputed qualia'. As a crude example, imagine that, for some environment, f(x) is an 'optimal strategy' (in some sense) for input x. Suppose we train AIs in this environment and AI A learns to compute f(x) directly whereas AI B learns to implement a lookup table. Based on performance alone, A and B are equal, since they have equivalent behaviour. But it seems to me that there is a meaningful distinction between the two. This distinction is important since you could use it to make predictions about how the AIs might behave in different environments or when exposed to different inputs.
I agree that there are more degrees of distinction between algorithms than just "lookup table or no". These are interesting, just not covered in the post!
I agree that there is a spectrum of ways to compute f(x), ranging from efficient to inefficient (in terms of program length). But I think that lookup tables are structurally different from direct ways of computing f because they explicitly contain the relationships between inputs and outputs. We can point to a 'row' of a lookup table and say 'this corresponds to the particular input x_1 and the output y_1', and do this for all inputs and outputs, in a way that we can't with a program which directly computes f(x). I think that allowing for compression preserves this important property, so I don't have a problem calling a compressed lookup table a lookup table. Note that the compression allowed in the derivation is not arbitrary, since it only applies to the 'table' part of the program, not the full input/output behaviour of the program.
if you have a program computing a predicate P(x, y) that is only true when y = f(x), and then the program just tries all possible y - is that more like a function, or more like a lookup?
In order to test whether y=f(x), the program must have calculated f(x) and stored it somewhere. How did it calculate f(x)? Did it use a table or calculate it directly?
What about if you first compute some simple function of the input (e.g. x mod N), then do a lookup?
I agree that you can have hybrid cases. But there is still a meaningful distinction between the part of the program which is a lookup table and the part of the program which isn't (in describing the program you used this distinction!). In the example you have given, the pre-processing function (x mod N) is not a bijection. This means that we couldn't interpret the pre-processing as an 'encoding', so we couldn't point to parts of the program corresponding to each unique input and output. Suppose the function was f(x) = (x mod N) + 2, the pre-processing captured the x mod N part, and it then used a 2×N lookup table to calculate the '+2'. I think this program is importantly different from one which stores the input and output for every single x. So when taken as a whole the program would not be a lookup table and might be shorter than the lookup table bound presented above. But this captures something important! Namely, that the program is doing some degree of 'actually computing the function'.
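For concreteness, here is a minimal sketch of the hybrid program I have in mind (N and the table layout are placeholders of my own choosing):

```python
N = 7  # placeholder modulus

# The "table part": N rows mapping each residue r to r + 2.
PLUS_TWO_TABLE = {r: r + 2 for r in range(N)}

def f_hybrid(x: int) -> int:
    r = x % N                  # pre-processing: computed, not looked up
    return PLUS_TWO_TABLE[r]   # lookup part: size N, independent of the domain size

print(f_hybrid(100))  # (100 mod 7) + 2 = 2 + 2 = 4
```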
In the post 'Can economics change your mind?' he has a list of examples where he has changed his mind due to evidence:
I don't know enough about economics to tell how well these meet your criteria for 'I was wrong' rather than 'revised estimates' or something else (he doesn't use the exact phrase 'I was wrong'), but it seems in the spirit of what you are looking for.