TL;DR: I verify that the minimal map lemma can be used as a criterion for functional optimization. Specifically, I derive a broad class of consistent loss functions according to the criterion. I also give an application of the result to determining how to train a neural representation.
Say we receive data $(x,y) \sim X \times \{-1,1\}$ where the $x$ are observations and the $y$ are binary class labels. We wish to learn a map $h \in \mathcal{H}$ that properly classifies the data. By the minimal map lemma this is informationally equivalent to learning the map $\zeta \circ p(y|x)$ where $\zeta$ is an arbitrary isomorphism. We can make a stylistic choice on the model class by imposing the constraint,
$$\zeta^{-1}(h) + \zeta^{-1}(-h) = 1$$
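As a quick sanity check, here is a minimal numerical sketch (Python; the logistic sigmoid is assumed as one admissible choice of $\zeta^{-1}$, and the name `zeta_inv` is mine) confirming that this choice satisfies the constraint:

```python
import numpy as np

def zeta_inv(v):
    """One admissible choice of zeta^{-1} (an assumption): the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-v))

h = np.linspace(-10, 10, 1001)
# The constraint: zeta^{-1}(h) + zeta^{-1}(-h) = 1 for every score h.
assert np.allclose(zeta_inv(h) + zeta_inv(-h), 1.0)
```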
Consider the term $y \cdot h(x)$. If the prediction is correct then $y \cdot h(x) > 0$, otherwise $y \cdot h(x) < 0$. However, we're either right or wrong with probability one. Therefore, if $\zeta^{-1}(y \cdot h(x)) > 1/2$ then the label and the classifier most likely agree. In other words, we should choose the $y$ that maximizes the probability of agreement.
In practice, this means we calculate,
$$\iota(y \cdot h(x)) = \frac{1}{2}\left(1 - \operatorname{sign}(y \cdot h(x))\right)$$
to collapse probabilities into binary decisions. The function $\iota$ is a threshold that returns one if the chosen classification is less likely than the alternative and zero otherwise. This is effectively the zero-one loss.
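A minimal sketch of this threshold as an empirical zero-one loss (the arrays below are illustrative, not from the text):

```python
import numpy as np

def zero_one_loss(y, h_x):
    """iota(y * h(x)) = (1 - sign(y * h(x))) / 2: one on a disagreement, zero otherwise."""
    return 0.5 * (1.0 - np.sign(y * h_x))

y   = np.array([1, -1, 1, -1])              # labels in {-1, +1}
h_x = np.array([2.3, -0.7, -1.1, 0.4])      # classifier outputs h(x)
print(zero_one_loss(y, h_x))                # [0. 0. 1. 1.]
print(zero_one_loss(y, h_x).mean())         # empirical zero-one risk: 0.5
```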
Theorem: Consider the population risk,
$$R_{0\text{-}1}(h) = \int_{X \times Y} \iota(h)\, p(1|x) + \iota(-h)\,\big(1 - p(1|x)\big)\, d\mu(x)$$
The optimal decision rule is given by $\operatorname{sign}(h) = \operatorname{sign}(p(1|x) - 1/2)$.
Proof: The risk can be minimized pointwise in $x$: we should set $\operatorname{sign}(h) = 1$ if $p(1|x) > 1 - p(1|x)$, which is equivalent to $p(1|x) > 1/2$, and $\operatorname{sign}(h) = -1$ otherwise. This minimizes the integrand at every $x$, and hence the risk. Thus, the optimal decision rule is given as stated.
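Here is a small numerical sketch of the pointwise argument: for each value of $p(1|x)$ on a grid, the conditional risk $\iota(h)p + \iota(-h)(1-p)$ is compared for the two possible signs of $h$ (the grid and helper names are my own):

```python
import numpy as np

def conditional_risk(sign_h, p):
    """Pointwise 0-1 risk iota(h) * p + iota(-h) * (1 - p), with iota(h) = (1 - sign(h)) / 2."""
    return 0.5 * (1 - sign_h) * p + 0.5 * (1 + sign_h) * (1 - p)

p = np.linspace(0.01, 0.99, 50)             # grid of conditional probabilities p(1|x)
best = np.where(conditional_risk(+1, p) < conditional_risk(-1, p), +1, -1)
assert np.all(best == np.sign(p - 0.5))     # matches sign(p(1|x) - 1/2)
```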
As stated before, if $\zeta^{-1}(y \cdot h(x)) > 1/2$ then we should take that as our classification. The trick now is to replace $\iota \to \phi$ with the most general class of loss functions such that the optimal classifier remains unchanged. So we'll consider,
$$R_\phi(h) = \int_{X \times Y} \phi \circ \zeta^{-1}(h)\, \eta(x) + \phi \circ \zeta^{-1}(-h)\,\big(1 - \eta(x)\big)\, d\mu(x)$$
where we've introduced a twice differentiable loss function $\phi: \mathbb{R} \to \mathbb{R}$ and collapsed our conditional probability $p(1|x)$ to just the function $\eta$.
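To make the surrogate risk concrete, here is a Monte Carlo sketch of $R_\phi$ over samples of $x$; the particular $\eta$, $h$, $\phi$, and $\zeta^{-1}$ below are illustrative assumptions (logistic link and log loss), not fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def zeta_inv(v):                      # assumed link: the logistic sigmoid
    return 1.0 / (1.0 + np.exp(-v))

def phi(eta_hat):                     # assumed loss on probabilities: negative log
    return -np.log(eta_hat)

def surrogate_risk(h, eta, x):
    """Monte Carlo estimate of
    R_phi(h) = E_x[ phi(zeta^{-1}(h(x))) eta(x) + phi(zeta^{-1}(-h(x))) (1 - eta(x)) ]."""
    return np.mean(phi(zeta_inv(h(x))) * eta(x) + phi(zeta_inv(-h(x))) * (1 - eta(x)))

x   = rng.normal(size=100_000)
eta = lambda t: zeta_inv(2.0 * t)     # toy conditional probability p(1|x)

print(surrogate_risk(lambda t: 2.0 * t, eta, x))   # h matched to eta: lower risk
print(surrogate_risk(lambda t: 0.5 * t, eta, x))   # miscalibrated h: higher risk
```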
Definition: A loss function $\phi$ is consistent if the optimal classifier for $R_\phi$ is equivalent to the optimal classifier for $R_{0\text{-}1}$.
We'll also need a lemma to proceed,
Lemma: If the functional derivative $\nabla_h R$ exists, then $h^*$ is locally optimal iff
$$\langle \nabla_h R(h^*),\, h - h^* \rangle \ge 0, \quad \forall h \in \mathcal{H}$$
Proof: This is the standard first-order condition for optimality.
Now we can state the main result.
Theorem: A loss function $\phi: \mathbb{R} \to \mathbb{R}$ is consistent if it is of the form,
$$\phi(\eta) = C(\eta) + (1 - \eta)\, C'(\eta)$$
where $C(\eta) = C(1-\eta)$ is a twice differentiable strictly concave function and $\eta = \zeta^{-1}(h(x))$ respects $h$ as a minimal map according to the constraint,
$$\zeta^{-1}(h) + \zeta^{-1}(-h) = 1$$
and the optimal classifier $h^*$ satisfies,
$$\zeta^{-1} \circ h^*(x) = p(1|x)$$

Proof:
To minimize the population risk we need to use the functional derivative,
$$\langle \nabla_h R, \psi \rangle = \lim_{\epsilon \to 0} \frac{R(h + \epsilon \psi) - R(h)}{\epsilon} = \left[\frac{d}{d\epsilon} R(h + \epsilon \psi)\right]_{\epsilon = 0} = \int_{X \times Y} \Big[ \partial_h \phi \circ \zeta^{-1}(h)\, \eta(x) + \partial_h \phi \circ \zeta^{-1}(-h)\,\big(1 - \eta(x)\big) \Big]\, \psi(x)\, d\mu(x)$$
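To make the functional derivative concrete, below is a small finite-difference check of the directional (Gâteaux) derivative against the integrand above, reusing the toy logistic link and log loss from before (all illustrative assumptions; for that particular pair the bracketed term simplifies to $\sigma(h(x)) - \eta(x)$):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
phi     = lambda u: -np.log(u)              # assumed loss on probabilities

x   = rng.normal(size=200_000)
eta = sigmoid(2.0 * x)                      # toy conditional probability p(1|x)
h   = 0.7 * x                               # current (non-optimal) classifier values h(x)
psi = np.sin(x)                             # arbitrary perturbation direction psi(x)

def risk(h_values):
    return np.mean(phi(sigmoid(h_values)) * eta + phi(sigmoid(-h_values)) * (1 - eta))

# Finite-difference directional derivative [R(h + eps*psi) - R(h)] / eps ...
eps = 1e-5
fd = (risk(h + eps * psi) - risk(h)) / eps

# ... versus the integral of the bracketed term times psi(x); for phi = -log and the
# sigmoid link the bracketed term simplifies to sigmoid(h(x)) - eta(x).
analytic = np.mean((sigmoid(h) - eta) * psi)

print(fd, analytic)                         # agree to several decimal places
assert np.isclose(fd, analytic, atol=1e-4)
```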
Now, we want to fix a class of $\phi$ with the assumption that $\operatorname{sign}(h)$ will determine our classifier. Any particulars beyond this are irrelevant to producing a 0-1 optimal classifier. This simplifies the optimality condition to,
$$\begin{aligned}
&\operatorname{sign}\!\big(\partial_h \phi \circ \zeta^{-1}(h)\, \eta + \partial_h \phi \circ \zeta^{-1}(-h)\,(1-\eta)\big) = \operatorname{sign}(h) \\
\iff\ &\operatorname{sign}\!\big(\phi'(\hat\eta)\, \partial_h \zeta^{-1}(h)\, \eta - \phi'(1-\hat\eta)\, \partial_h \zeta^{-1}(h)\,(1-\eta)\big) = \operatorname{sign}(h) \\
\iff\ &\operatorname{sign}\!\big(\phi'(\hat\eta)\, \eta - \phi'(1-\hat\eta)\,(1-\eta)\big) = \operatorname{sign}(h)
\end{aligned}$$
Note that we assume here that $\zeta^{-1}$ is differentiable and increasing, with $\lim_{h \to \infty} \zeta^{-1}(h) = 1$; together with the constraint this gives $\zeta^{-1}(-h) = 1 - \hat\eta$ and $\partial_h \zeta^{-1}(-h) = -\partial_h \zeta^{-1}(h)$, and the positive factor $\partial_h \zeta^{-1}(h)$ can be dropped from inside the sign. If we have,
$$\phi'(\hat\eta) = f(\hat\eta)\, C''(\hat\eta) \quad \text{s.t.} \quad C''(\eta) < 0$$
then we can apply the same trick again: $\phi'$ is an arbitrary function $f$ times the second derivative of the symmetric, strictly concave $C$, and by the symmetry $C''(\hat\eta) = C''(1-\hat\eta)$ this common nonzero factor can be pulled out of the sign without affecting where the expression vanishes. Thus,
$$\begin{aligned}
&\operatorname{sign}\!\big(\phi'(\hat\eta)\, \eta - \phi'(1-\hat\eta)\,(1-\eta)\big) = \operatorname{sign}(h) \\
\iff\ &\operatorname{sign}\!\big(f(\hat\eta)\, C''(\hat\eta)\, \eta - f(1-\hat\eta)\, C''(1-\hat\eta)\,(1-\eta)\big) = \operatorname{sign}(h) \\
\iff\ &\operatorname{sign}\!\big(f(\hat\eta)\, \eta - f(1-\hat\eta)\,(1-\eta)\big) = \operatorname{sign}(h)
\end{aligned}$$
Recall that $\hat\eta(x) = \zeta^{-1}(h(x))$, so at $\hat\eta = \eta$ we must be at an optimum if we're to respect the minimal map property and treat $\zeta^{-1} \circ h$ as a probability. This is the invertible link property. To satisfy this property we need,
$$\frac{f(\hat\eta)}{f(1-\hat\eta)} = \frac{1-\eta}{\eta} \;\Rightarrow\; f(\eta) \cong (1-\eta) \ \text{ or } \ 1/\eta$$
Substitution makes it clear that the solution we want is $f(\eta) = (1-\eta)$; the other solution is degenerate. We're almost done. Now we solve for $\phi$ with an integration by parts,
$$\phi(\eta) = \int (1-\eta)\, C''(\eta)\, d\eta = (1-\eta)\, C'(\eta) + \int C'(\eta)\, d\eta = C(\eta) + (1-\eta)\, C'(\eta) \qquad \square$$
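As a sanity check on this closed form, here is a short symbolic sketch (sympy; the choice $C = H$, the binary entropy in nats, is illustrative) that builds $\phi(\eta) = C(\eta) + (1-\eta)C'(\eta)$ and confirms that the pointwise risk $\phi(\hat\eta)\eta + \phi(1-\hat\eta)(1-\eta)$ is minimized exactly at $\hat\eta = \eta$:

```python
import sympy as sp

eta, eta_hat = sp.symbols('eta eta_hat', positive=True)

# Illustrative choice: C is the symmetric, concave binary entropy (in nats; bits just
# rescale everything by 1/log 2).
C = -eta * sp.log(eta) - (1 - eta) * sp.log(1 - eta)

# The loss from the theorem: phi(eta) = C(eta) + (1 - eta) * C'(eta).
phi = sp.expand(C + (1 - eta) * sp.diff(C, eta))
print(phi)                                      # -log(eta)

# Pointwise risk r(eta_hat) = phi(eta_hat) * eta + phi(1 - eta_hat) * (1 - eta).
r = phi.subs(eta, eta_hat) * eta + phi.subs(eta, 1 - eta_hat) * (1 - eta)

# Its derivative in eta_hat vanishes exactly at eta_hat = eta (the minimal map property) ...
print(sp.solve(sp.diff(r, eta_hat), eta_hat))   # [eta]

# ... and the second derivative there is positive (checked at a sample point), so it is a minimum.
second = sp.diff(r, eta_hat, 2).subs(eta_hat, eta)
assert second.subs(eta, sp.Rational(1, 3)) > 0
```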
Examples:
Take $C(\eta) = H(\eta)$, the binary entropy in bits, and $\zeta^{-1}(v) = \frac{e^v}{1+e^v}$. Then $\phi(\eta) = -\log_2(\eta)$ and, written as a function of the margin $v$,
$$\phi \circ \zeta^{-1}(v) = \frac{1}{\log(2)}\log\left(1+e^{-v}\right)$$
which is the logistic loss.
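A quick numerical sketch (plain numpy; the grids are arbitrary) confirming both identities in this example: $C(\eta) + (1-\eta)C'(\eta) = -\log_2 \eta$ for the binary entropy, and $-\log_2\!\big(\zeta^{-1}(v)\big) = \frac{1}{\log 2}\log(1+e^{-v})$ for the sigmoid link:

```python
import numpy as np

eta = np.linspace(0.01, 0.99, 99)
v   = np.linspace(-8, 8, 161)

# Binary entropy in bits and its derivative.
H       = lambda p: -p * np.log2(p) - (1 - p) * np.log2(1 - p)
H_prime = lambda p: np.log2((1 - p) / p)

# phi(eta) = C(eta) + (1 - eta) C'(eta) collapses to -log2(eta) ...
assert np.allclose(H(eta) + (1 - eta) * H_prime(eta), -np.log2(eta))

# ... and composing with the sigmoid link gives the logistic loss.
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
assert np.allclose(-np.log2(sigmoid(v)), np.log(1 + np.exp(-v)) / np.log(2))
```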
More abstractly, for a neural representation we compute layers as,
$$\vec{h}_{j+1} = \sigma(W_j \vec{h}_j)$$
where $\sigma$ is an activation for each layer $\vec{h}_j$. Previously I showed that if a neural network is to be interpreted as a probabilistic decision process then we need $\sigma = \zeta \circ p$ where $\zeta$ is invertible and $p$ can be an associated probability.
When we arrive at the output layer, we need to apply a loss function. Given the above analysis, it would be best to change the final activation to $\zeta^{-1}$ and then apply a consistent loss $\phi$ from above. This ensures that what goes into $\phi$ is indeed a probability.
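One way this could look in practice is sketched below (PyTorch is an illustrative framework choice; the class, layer sizes, and tensor names are made up): the last linear layer produces the score $h(x)$, the final activation is taken to be $\zeta^{-1} = \text{sigmoid}$, and the consistent loss $\phi(\eta) = -\log \eta$ from the example above is applied to the probability the model assigns to the observed label.

```python
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    """Final layer: score h(x), then zeta^{-1} (here the sigmoid) to get a probability."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features):
        h = self.score(features).squeeze(-1)   # real-valued h(x)
        return torch.sigmoid(h)                # eta_hat = zeta^{-1}(h(x)) in (0, 1)

def consistent_loss(eta_hat, y):
    """phi(eta) = -log(eta) applied to the probability of the observed label y in {-1, +1}.
    By the constraint, the probability of y = -1 is 1 - eta_hat."""
    p_observed = torch.where(y > 0, eta_hat, 1.0 - eta_hat)
    return -torch.log(p_observed).mean()

# Illustrative usage with random features and labels.
head = ProbabilisticHead(dim=16)
features = torch.randn(32, 16)
y = torch.randint(0, 2, (32,)) * 2 - 1         # labels in {-1, +1}
loss = consistent_loss(head(features), y)
loss.backward()
```

For this particular choice the loss coincides with the usual binary cross-entropy, which is one way to read the claim that a sigmoid output followed by the log loss already respects the minimal map constraint.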