Epistemological Status: I attempt to distill the minimal map argument and provide a natural way to chain a sequence of queries together. I argue that this definition, together with reasonable symmetry constraints on the minimal map, provides an intuitive explanation for the appearance of neural networks.
Say we sample an observation from a state-observation pair and want to decide based only on whether has a property . We can define the function to indicate whether or not an observation has the property we have in mind. Concretely, I've chosen to suppress the reference to the hidden state since it's not an observable. This could alternatively be viewed as a semi-decision that either terminates in finite time with a positive answer or runs forever. The catch is that each observation is partial and dependent on an unobserved hidden state. Here's one natural question: what information summarizes the outcome of the semi-decision?
Minimal Map Lemma: Up to isomorphism, any information in observing the outcome of for a given is equivalent to . Moreover, this map is minimal in the sense that any other representation is either equivalent in information content or lossy.
Proof: First, fix some other map and consider the sequence . By the data processing inequality the total mutual information satisfies . Second, by definition, the event is equivalent to any observation because the outcome of the semi-decision is unaffected. Because we also have we can just write, Finally, we have for the entropy, Thus, with respect to the semi-decision, the information contained in the probability of a decision is equivalent to that contained in the observation.
At this point, we conclude that it's natural to represent the semi-decision on the observation as a probability. Now we can consider something more familiar. Consider a set of properties that satisfy the following for , If the previous properties are satisfied we'll say that the objects can be classified by . Alternatively, we can say is a classifier for the objects . It's worth noting that we can always add a 'non of the above' property in the case that our is only mutually exclusive which results in a proper classifier.
Fact: Every property can be associated to a classifier .
This essentially is the motivation for the choice of definition.
It’d be nice to have a natural way to compose classifiers together on observations. Why? A central advantage is that if we do this correctly then we would be able to build up complicated classifiers from simpler ones.
With this goal in mind, we’ll first define, The is an isomorphism that comes from the 'up to isomorphism' clause of the minimal map lemma. This is actually quite useful as we'll see in a bit. For the moment, our definition of return can be understood as a projection of the object onto a basic collection of minimal maps. After this, we need one more definition, This projects the resulting minimal maps onto a new collection of minimal maps. What we're trying to do is allow for queries about query outcomes. This allows us to chain together a sequence of queries. Specifically, there will be some number of sequential binds with queries at each bind for . The sequence of queries at each bind defines a semi-decision process over the observations.
This definition allows us to treat our queries as a monad. While the only immediate use is for function composition, which could've been defined without the formality, this approach makes it easier to think about side-effects and related constructs such as the comonad (categorical dual of the monad) based on the exact outcomes of the queries.
This representation might also be useful to anticipate a natural class of minimal maps to consider. This is because our bind maintains the conditional notation and almost appears as and we could view each query as a projection against the objects.
We'll say the observations and queries have (vector) representations that are rotation invariant if we have, The property is telling us that if we rotate our observations and our query then the resulting determination should remain the same. We'll say that the observation and queries are dual if we have, In other words, if the process is dual then mapping the query over the observations is equivalent to mapping the data over the query. With these two notions, we can state the main insight,
If our semi-decision process on the observations is rotation invariant and dual then we have, Moreover, is a ridge-activation and the outcome of -sequential binds is equivalent to a neural network with -layers.
Proof: The first equality follows from the fact the inner-products are the only rotationally invariant quantities that can be made out of the query and the observation. The second inequality follows from symmetry. Only the inner-product between the query and observation is symmetric under swapping.
For the next part we'll relabel . Say the -th sequential bind has queries. Then we have, In the last equation, we've replaced the index to the queries with a matrix for compactness. The last expression is the definition of a multi-layer neural network.
As an immediate corollary, we see that this covers both radial and ridge activation functions. Our activation is fixed as a ridge activations since duality holds, but if we were to relax the condition to just rotation invariance then we'd have the possibility of radial activations. Moreover, if is monotone in its argument then because is an isomorphism the activation will also be monotone. For example, if is a sigmoid and then is Softplus.
Even if our data doesn't satisfy our representation property, every (reasonable) classifier can be approximated as such by the universal approximation theorem. This means that if we only care about producing approximations of minimal maps we can view the assumptions we made as a Wittgenstein ladder. Overall, what we showed is that the minimal map theorem, the monadic representation of a semi-decision process, and basic symmetry assumptions are enough to derive the general form of a neural network.