
Careful formulations of Occam's Razor have always spoken of the number of bits in the parameters, rather than the number of parameters, for basically this reason: a "parameter" could be a boolean yes/no, or a number, or a five-node digraph, or... pretty much anything you can write down. This paper demonstrates that point very well, but I don't think the point itself is new.

Naturally, that one parameter has to be very precise in order to work - if you have 1000 bits of data, the parameter will take at least 1000 bits to write down.
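To put rough numbers on that (my arithmetic, not from the paper): if each of $n$ data points must be reproduced to $r$ bits of precision, then the single parameter has to carry at least all of those bits,

$$\text{bits}(\theta) \;\gtrsim\; \sum_{j=1}^{n} \text{bits}(y_j) \;=\; n \cdot r,$$

so e.g. 1000 points at 32 bits each force a $\theta$ of about 32,000 bits. Measured in description length, nothing has been compressed; the information has just been moved into the parameter's digits.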

Pretty cool scheme for fitting general scatterplots. You could do the same in higher dimensions, but intuitively it seems like you are actually anti-compressing the data. Their point that complexity should not be measured by parameter count stands, though.

I didn't read the paper, but you can already fit arbitrary data with a high-frequency sine curve, sin(x/T) for T close to 0. Not sure if they gain anything extra from their complicated-looking function (it looks like their result is weaker).

ETA: I'm totally wrong (I like that you can downvote yourself).

They want all of the error to be in the vertical direction. So e.g. sin(x/T) cannot approximate (0,1) in their sense for any T. Similarly, it cannot simultaneously approximate (1,0) and (2,1).

Sorry, we need sin(ax + b).

Still doesn't work. If $x_j$ is an arithmetic progression then $e^{i(ax_j+b)}$ is a geometric progression, the imaginary part of which is $\sin(ax_j+b)$. This implies that, for example, $\sin(ax+b)$ cannot simultaneously approximate $(0,1)$, $(1,0)$, and $(2,1)$.
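To spell out the step from geometric progression to that counterexample (my own elaboration, not in the original comment): consecutive values of $\sin(ax+b)$ at unit-spaced points satisfy a three-term recurrence. From $e^{i(a(x+1)+b)} + e^{i(a(x-1)+b)} = 2\cos(a)\,e^{i(ax+b)}$, taking imaginary parts gives

$$\sin(a(x+1)+b) + \sin(a(x-1)+b) = 2\cos(a)\,\sin(ax+b),$$

so if the value at $x=1$ is within $\epsilon$ of $0$, the values at $x=0$ and $x=2$ must sum to within $2\epsilon$ of $0$ and therefore cannot both be close to $1$.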

Somewhat related papers in ML / DL:

  • Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton & van Camp, 1993);
  • Binarized Neural Networks (Courbariaux et al., 2016).

Thanks to Alex Tabarrok at Marginal Revolution: https://marginalrevolution.com/marginalrevolution/2018/05/one-parameter-equation-can-exactly-fit-scatter-plot.html

Title: "One parameter is always enough"

Author: Steven T. Piantadosi (University of Rochester)

Abstract:

We construct an elementary equation with a single real valued parameter that is capable of fitting any “scatter plot” on any number of points to within a fixed precision. Specifically, given a fixed $\epsilon > 0$, we may construct $f_\theta$ so that for any collection of ordered pairs $\{(x_j, y_j)\}_{j=0}^{n}$ with $n, x_j \in \mathbb{N}$ and $y_j \in (0,1)$, there exists a $\theta \in [0,1]$ giving $|f_\theta(x_j) - y_j| < \epsilon$ for all $j$ simultaneously. To achieve this, we apply prior results about the logistic map, an iterated map in dynamical systems theory that can be solved exactly. The existence of an equation $f_\theta$ with this property highlights that “parameter counting” fails as a measure of model complexity when the class of models under consideration is only slightly broad.
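For concreteness, here is a small numerical sketch of the bit-packing idea behind the construction (my reconstruction from the abstract and the exact solution of the logistic map, not the author's code; it assumes the points sit at $x = 0, 1, \dots, n$, uses $r$ bits per point, and leans on mpmath because the single parameter needs roughly $n \cdot r$ bits of precision):

```python
# Sketch of a one-parameter fit via the logistic map's exact solution
# x_{k+1} = 4 x_k (1 - x_k), which is conjugate to the binary shift map.
from mpmath import mp, mpf, asin, sin, sqrt, pi

def encode(ys, r=32):
    """Pack all y-values into one real theta in [0, 1], r bits per point."""
    mp.dps = 30 + (len(ys) * r) // 3       # ~3.3 bits per decimal digit, plus slack
    bits = []
    for y in ys:
        z = asin(sqrt(mpf(y))) / pi        # conjugate coordinate, z in (0, 1/2)
        for _ in range(r):                 # take the first r binary digits of z
            z *= 2
            b = int(z)
            bits.append(b)
            z -= b
    v = mpf(0)
    for b in reversed(bits):               # v = 0.b1 b2 b3 ... in base 2
        v = (v + b) / 2
    return sin(pi * v) ** 2                # theta = sin^2(pi * v)

def f(theta, x, r=32):
    """f_theta(x) = sin^2( 2^(x*r) * arcsin(sqrt(theta)) ); precision set in encode()."""
    return sin(mpf(2) ** (x * r) * asin(sqrt(theta))) ** 2

ys = [0.31, 0.72, 0.05, 0.99]
theta = encode(ys, r=32)
print([float(f(theta, x)) for x in range(len(ys))])
# each printed value agrees with ys[x] to within roughly 1e-9
```

Doubling the number of points or the per-point precision doubles the number of bits $\theta$ has to carry, which is exactly the commenters' point about bits versus parameter count.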

After highlighting the two examples in the paper, Tabarrok provocatively writes:

Aside from the wonderment at the result, the paper also tells us that Occam’s Razor is wrong. Overfitting is possible with just one parameter and so models with fewer parameters are not necessarily preferable even if they fit the data as well or better than models with more parameters.

Occam's Razor in its narrow form--the insight that simplicity renders a claim more probable--is a consequence of the interaction between Kolmogorov complexity and Bayes' theorem. I don't see how this result affects that idea per se. But perhaps it shows the flaws of conceptualizing complexity as "number of parameters."
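One way to make that last sentence concrete (a standard Solomonoff/MDL framing, not something claimed in the post): under a universal prior, a model's weight is governed by its total description length, parameters included,

$$P(M) \propto 2^{-K(M)}, \qquad K\big(f_\theta \text{ with } \theta \text{ specified to } n \cdot r \text{ bits}\big) \approx K(f) + n \cdot r,$$

so the single logistic-map parameter pays the same prior penalty as $n$ ordinary parameters of $r$ bits each, and the Bayesian form of Occam's Razor is untouched.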