
Careful formulations of Occam's Razor have always spoken of bits of parameters, rather than count of parameters, for basically this reason: a "parameter" could be a boolean yes/no, or a number, or a five-node digraph, or... pretty much anything you can write down. This paper demonstrates that point very well, but I don't think the point itself is new.

Naturally, that one parameter has to be very precise in order to work - if you have 1000 bits of data, the parameter will take at least 1000 bits to write down.
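Right — at roughly k bits of precision per point, n points need about nk bits packed into the parameter. A toy illustration of why (this packing scheme is my own, not the paper's logistic-map construction): concatenate the leading binary digits of each y-value into one rational θ, and read them back by shifting.

```python
from fractions import Fraction

BITS = 8  # precision per data point: recovery error < 2**-BITS

def encode(ys):
    """Pack each y in (0, 1) into one rational theta by concatenating
    the first BITS binary digits of each value."""
    theta = Fraction(0)
    for i, y in enumerate(ys):
        digits = int(y * 2**BITS)  # first BITS binary digits of y
        theta += Fraction(digits, 2**(BITS * (i + 1)))
    return theta

def decode(theta, j):
    """Recover the j-th value from theta's binary expansion."""
    shifted = theta * 2**(BITS * (j + 1))
    digits = int(shifted) % 2**BITS
    return digits / 2**BITS

ys = [0.125, 0.7, 0.333]
theta = encode(ys)
approx = [decode(theta, j) for j in range(len(ys))]
# each |approx[j] - ys[j]| < 2**-BITS, and theta costs BITS * len(ys) bits
```

The "one parameter" does all the work precisely because its description length grows linearly with the data — exactly the commenter's point.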

Pretty cool scheme for fitting general scatterplots. You could do the same in higher dimensions, but intuitively it seems like you are actually anti-compressing the data. It makes their point that complexity shouldn't be measured by parameter count.

I didn't read the paper, but you can already fit arbitrary data with a high-frequency sine curve, sin(x/T) for T close to 0. Not sure if they gain anything extra from their complicated-looking function (it looks like their result is weaker).

ETA: I'm totally wrong (I like that you can downvote yourself).

They want all of the error to be in the vertical direction. So e.g. sin(x/T) cannot approximate (0,1) in their sense for any T. Similarly, it cannot simultaneously approximate (1,0) and (2,1).

Sorry, we need sin(ax + b).

Still doesn't work. If x_j is an arithmetic progression then e^{i(a x_j + b)} is a geometric progression, the imaginary part of which is sin(a x_j + b). Consecutive terms therefore satisfy the linear recurrence y_{j-1} + y_{j+1} = 2 cos(a) y_j. This implies that sin(ax + b) cannot, for example, simultaneously approximate (0, 0), (1, 0), and (2, 1): if y_0 and y_1 are both near 0, the recurrence forces y_2 near 0.
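The recurrence behind this argument is easy to check numerically — sine sampled at equally spaced points always satisfies a two-term linear recurrence, regardless of a and b:

```python
import math

def recurrence_residual(a, b):
    """sin at equally spaced points obeys the linear recurrence
    sin(a(x+1)+b) + sin(a(x-1)+b) = 2*cos(a)*sin(a*x+b).
    Returns how far samples at x = 0, 1, 2 are from satisfying it."""
    y0, y1, y2 = (math.sin(a * x + b) for x in (0, 1, 2))
    return (y0 + y2) - 2 * math.cos(a) * y1

# Zero (up to rounding) for any choice of a, b — so if y0 and y1 are
# both close to 0, y2 is forced close to 0 as well, and the points
# (0, 0), (1, 0), (2, 1) can never all be approximated at once.
print(recurrence_residual(0.7, 1.3))
print(recurrence_residual(2.5, -0.4))
```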

Somehow related papers in ML / DL:

  • Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton & van Camp, 1993);
  • Binarized Neural Networks (Courbariaux, 2016).

Via Alex Tabarrok at Marginal Revolution:

Title: "One parameter is always enough"

Author: Steven T. Piantadosi (University of Rochester)


We construct an elementary equation with a single real valued parameter that is capable of fitting any "scatter plot" on any number of points to within a fixed precision. Specifically, given a fixed ε > 0, we may construct f_θ so that for any collection of ordered pairs {(x_j, y_j)}_{j=0}^{n} with n, x_j ∈ ℕ and y_j ∈ (0, 1), there exists a θ ∈ [0, 1] giving |f_θ(x_j) − y_j| < ε for all j simultaneously. To achieve this, we apply prior results about the logistic map, an iterated map in dynamical systems theory that can be solved exactly. The existence of an equation f_θ with this property highlights that "parameter counting" fails as a measure of model complexity when the class of models under consideration is only slightly broad.
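The construction leans on the exact solvability the abstract mentions: the fully chaotic logistic map x_{n+1} = 4 x_n (1 − x_n) has the closed form x_n = sin²(2ⁿ arcsin √x₀), so each iteration shifts the binary expansion of the underlying angle one bit to the left — which is what lets a single high-precision θ store arbitrarily many values. A minimal sketch verifying just the closed form (not the paper's full f_θ):

```python
import math

def logistic_iterates(x0, n):
    """Iterate x -> 4x(1-x) n times, returning [x0, x1, ..., xn]."""
    xs = [x0]
    for _ in range(n):
        xs.append(4 * xs[-1] * (1 - xs[-1]))
    return xs

def closed_form(x0, n):
    """Exact solution of the fully chaotic logistic map:
    x_n = sin^2(2^n * arcsin(sqrt(x0)))."""
    return math.sin(2**n * math.asin(math.sqrt(x0))) ** 2

# the iterated map and the closed form agree at every step
# (up to floating-point error, which grows with n since the map is chaotic)
xs = logistic_iterates(0.3, 10)
for n, x in enumerate(xs):
    assert abs(x - closed_form(0.3, n)) < 1e-6
```

With arbitrary-precision arithmetic in place of floats, the same closed form lets one pick θ so that successive iterates land near any prescribed sequence of targets.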

After highlighting the two examples in the paper, Tabarrok provocatively writes:

Aside from the wonderment at the result, the paper also tells us that Occam’s Razor is wrong. Overfitting is possible with just one parameter and so models with fewer parameters are not necessarily preferable even if they fit the data as well or better than models with more parameters.

Occam's Razor in its narrow form--the insight that simplicity renders a claim more probable--is a consequence of the interaction between Kolmogorov complexity and Bayes' theorem. I don't see how this result affects that idea per se. But perhaps it shows the flaws of conceptualizing complexity as "number of parameters."