TL;DR: I verify that the minimal map lemma can be used as a criterion for functional optimization. Specifically, I derive a broad class of consistent loss functions according to the criterion. I also give an application of the result to determining how to train a neural representation.
Say we receive data where the are observations and are binary class labels. We wish to learn a map that is able to properly classify the data. By the minimal map lemma this is information equivalent to learning the map where is an arbitrary isomorphism. We can make a stylistic choice on the the model class by imposing the constraint, Consider the term . If the prediction is correct then or otherwise . However, we're either right or wrong with probability one. Therefore, if that label and the classifier most likely agree. In other words, we should choose the that maximizes the probability of agreement.
In practice, this means we calculate, to collapse probabilities into binary decisions. The variable is threshold that returns one if the classification is less likely than the other and zero otherwise. This is effectively the zero-one loss.
Theorem: Consider the population risk, The optimal decision rule is given by .
Proof: We should set if which implies that . The result is that the risk is minimized. Thus, the optimal decision rule is given as stated.
As stated before, if then we should take that as our classification. The trick now is to replace with the most general class of loss functions such that the optimal classifier remains unchanged. So we'll consider, where we've introduced a twice differentiable loss function and collapsed our conditional probability to just the function .
Definition: A loss function is consistent if the optimal classifier for is equivalent to the optimal classifier for .
We'll also need a lemma to proceed,
Lemma: If the functional derivative exists, then is locally optimal iff, Proof: This is the first order condition for optimality.
Now we can state the main result.
Theorem: A loss function is consistent if it is of the form, where is a twice differentiable strictly convex function and respects as a minimal map according to the constraint, and for the optimal classifier satisfies, Proof:
To minimize the population risk we need to use the functional deriviative. Now, we want to fix a class of with the assumption that will determine our classifier. Any particulars beyond this are irrelevant to producing a 0-1 optimal classifier. This simplifies the optimality condition to, Note that we assume here that and is differentiable. If we have, then we can do this trick again. So our loss function is a function times a strictly convex function. Thus, Recall that so at we must be at an optimum if we're to respect the minimal map property and treat as a probability. This is the invertible link property. To satisfy this property we need, subsitution makes it clear that the solution we want is . The other solution is degenerate. We're almost done, now we solve for with an integration by parts,
Take and then we have,
which is the logistic loss.
More abstractly, for a neural representation we compute layers as, where is an activation for each layer . Previously I showed that if a neural network is to be interpreted as a probabalistic decision process then we need where is invertible and can be an associated probability.
When we arrive at the output layer, we need to apply a loss function. Given the above analysis it would be best to change the activation to and then apply a consistent loss from above. This ensures that what is going into is indeed a probability.