This is an overview for advanced readers. Main post: Information Loss --> Basin flatness
Inductive bias is related to, among other things:
- Basin flatness
- Which solution manifolds (manifolds of zero loss) are higher dimensional than others. This is closely related to "basin flatness", since each dimension of the manifold is a direction of zero curvature.
In relation to basin flatness and manifold dimension:
- It is useful to consider the "behavioral gradients" $\nabla_\theta f(x_i)$ for each input $x_i$.
- Let $G$ be the matrix of behavioral gradients. (The $i$-th column of $G$ is $\nabla_\theta f(x_i)$.) We can show that $\operatorname{rank}(G) = N - \dim(\text{solution manifold})$; see the sketch after this list.
- Flat basin $\Leftarrow$ Low-rank Hessian $\Leftarrow$ Low-rank $G$ $\Leftarrow$ High manifold dimension
- High manifold dimension $\Leftrightarrow$ Low-rank $G$ $\Leftrightarrow$ Linear dependence of behavioral gradients
- A case study in a very small neural network shows that "information loss" is a good qualitative interpretation of this linear dependence.
- Models that throw away enough information about the input in early layers are guaranteed to live on particularly high-dimensional manifolds. Precise bounds seem easily derivable and might be given in a future post.
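To make these objects concrete, here is a minimal sketch in JAX (not from the post; the tiny MLP `f`, the parameter count $N = 12$, and the five sample inputs are all illustrative assumptions). It assembles $G$ with one behavioral gradient per column and reports its rank:

```python
import jax
import jax.numpy as jnp

def f(theta, x):
    """Tiny scalar-output MLP: one hidden layer of width 3 on 2-d inputs."""
    W1 = theta[:6].reshape(3, 2)
    b1 = theta[6:9]
    w2 = theta[9:12]
    h = jnp.tanh(W1 @ x + b1)
    return w2 @ h

N = 12                                     # number of parameters
theta = jax.random.normal(jax.random.PRNGKey(0), (N,))
xs = jax.random.normal(jax.random.PRNGKey(1), (5, 2))    # 5 sample inputs

# G: one column per input, one row per parameter, so G^T is the Jacobian
# of the concatenated outputs w.r.t. the parameters.
G = jnp.stack([jax.grad(f)(theta, x) for x in xs], axis=1)
print(G.shape)                             # (12, 5)
print(jnp.linalg.matrix_rank(G))           # generically min(N, #inputs) = 5;
                                           # lower rank = linearly dependent
                                           # behavioral gradients
```

At a generic $\theta$ this only illustrates the construction; the identity $\operatorname{rank}(G) = N - \dim(\text{manifold})$ is a statement about points on a solution manifold, which the check after the proof sketch below arranges for.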
See the main post for details.
In standard terminology, $G^\top$ is the Jacobian of the concatenation of all outputs, w.r.t. the parameters.
$N$ is the number of parameters in the model. See claims 1 and 2 here for a proof sketch.
Proof sketch for $\operatorname{rank}(G) = N - \dim(\text{manifold})$:
- $\ker(G^\top)$ is the set of directions in which the output is not first-order sensitive to parameter change. Its dimensionality is $N - \operatorname{rank}(G)$.
- At a local minimum, first-order sensitivity of behavior translates to second-order sensitivity of loss.
- So $\ker(G^\top)$ is the null space of the Hessian, whose dimension is the dimension of the solution manifold. (A numerical check follows this list.)
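Continuing the sketch above (still illustrative, assuming sum-form MSE), we can verify this numerically: choosing the targets $y_i = f(\theta, x_i)$ puts $\theta$ exactly on a zero-loss manifold, where the Hessian should annihilate every direction in $\ker(G^\top)$.

```python
# Targets are chosen so that theta is exactly a zero-loss point.
ys = jnp.array([f(theta, x) for x in xs])       # loss(theta) == 0 by construction

def loss(t):
    return sum((f(t, x) - y) ** 2 for x, y in zip(xs, ys))

H = jax.hessian(loss)(theta)                    # (N, N) Hessian at the minimum

# ker(G^T) = directions orthogonal to the column space of G; read it off the SVD.
U, S, _ = jnp.linalg.svd(G, full_matrices=True)
r = int(jnp.sum(S > 1e-6))                      # numerical rank of G
null_basis = U[:, r:]                           # N - rank(G) flat directions
print(jnp.max(jnp.abs(H @ null_basis)))         # ~0: H annihilates ker(G^T)
print(jnp.linalg.matrix_rank(H) == r)           # rank(H) == rank(G)
```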
There is an alternate proof going through the result $H = 2GG^\top$. (The constant 2 depends on MSE loss.)
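For completeness, here is the one-step computation behind that identity, assuming the unnormalized (sum-form) MSE convention; a mean would just rescale the constant:

$$L(\theta) = \sum_i \big(f(\theta, x_i) - y_i\big)^2 \;\Rightarrow\; \nabla^2_\theta L = 2\sum_i \nabla_\theta f_i\, \nabla_\theta f_i^\top + 2\sum_i \big(f_i - y_i\big)\, \nabla^2_\theta f_i.$$

At a zero-loss point $f_i = y_i$, so the second term vanishes, leaving $H = 2\sum_i \nabla_\theta f_i\, \nabla_\theta f_i^\top = 2GG^\top$; in particular $\operatorname{rank}(H) = \operatorname{rank}(G)$, which is the link used in the implication chain above.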