What we know about machine learning's replication crisis

[-]TLW4y40

Note that these algorithms were accepted for publication by claiming they made significant improvements, but it turns out most of them could be beaten by the untuned version of a competitor on the 5 five datasets.

Improving by 2 full standard deviations is a significant improvement; this would result in being beaten by the untuned version ~7.9% of the time^[1]^[2], just by random chance. Ditto, improving on task-specific performance by a full standard deviation would still result in ~24%^[1] of task-specific performances getting worse just by random chance.

Why is this considered benchmark hacking?

^{^}
Not analytically calculated; just done via a quick Monte-carlo on 1m samples. If anyone has the 'true' calculation I'm all ears.

>>> import random
>>> count = 10**6
>>> sum(random.normalvariate(mu=0, sigma=1) > random.normalvariate(mu=2, sigma=1) for _ in range(count)) / count
0.07863699999999996
>>> sum(random.normalvariate(mu=0, sigma=1) > random.normalvariate(mu=1, sigma=1) for _ in range(count)) / count
0.23995100000000003
^{^}
...assuming normally-distributed performances, which is a terrible assumption but meh.

[-]Younes Kamel4y20

You're right, I should have written "but it turns out most of them could be beaten by the untuned version of several competitors on the 5 five datasets", as one can see in the figures. Thank you for pointing it out, I'll edit the post.

[-]TLW4y30

I should have written "but it turns out most of them could be beaten by the untuned version of several competitors on the 5 five datasets",

You're inadvertently P-hacking I think. There are some that are beaten on specific tasks by the untuned version of several competitors, but these improve in other tasks.

Consider the following simple model:

Each algorithm has a normal distribution of scenario-specific performances with mean =<some parameter dependent on algorithm> and stddev=sqrt(2)/2
Each scenario has a normal distribution of run-specific performances with mean=<scenario-specific performance for this algorithm and task> and stddev=sqrt(2)/2

I have algorithm X, with mean=0. Overall run performance is roughly ^[1] a normal distribution with mean=0 and stddev=1.

I improve this into algorithm Y, with mean=1. Overall run performance is roughly ^[1] a normal distribution with mean=1 and stddev=1.

Algorithm X will have a higher mean scenario-specific performance than algorithm Y ~16%^[2] of the time!

^{^}
I don't know offhand if this is exact; it appears close at the very least.
^{^}
>>> import random
>>> count = 10**6
>>> sum(random.normalvariate(mu=0, sigma=2**0.5/2) > random.normalvariate(mu=1, sigma=2**0.5/2) for _ in range(count)) / count
0.158625

[-]Younes Kamel4y30

Perhaps the most important takeaway from our study is hidden in plain sight: the field is in danger of being drowned by noise. Different optimizers exhibit a surprisingly similar performance distribution compared to a single method that is re-tuned or simply re-run with different random seeds. It is thus questionable how much insight the development of new methods yields, at least if they are conceptually and functionally close to the existing population.

This is from the author's conclusion. They do also acknowledge that a couple optimizers seem to be better than others across tasks and datasets, and I agree with them (and with you if that's your point). But most optimizers do not meet the "significant improvement" claims their authors have been making. They also say most tuned algorithms can be equaled by trying seevral un-tuned algorithms. So the point is twofold :

1. Most new algorithms can be equaled or beaten by re-tuning of most old algorithms.
2. Their tuned versions can be equaled or beaten by many un-tuned versions of old algorithms.

This seems to be consistent with there being no overwhelming winner and low variance in algorithm performance.

If I understand your model correctly, and let me know if I do, if an algorithm Y improves performance by 1 std on a specific task, it woulds still get beaten by an unimproved algorithm 16% of the time. Sure, but you have to compute the probability of the Y algorithm (mean=1, std=1) being beaten by the X1, X2, X3, X4 algorithms (all mean=0, std=1) , which is what is happening in the authors' experiment, and it is much lower.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

36

What we know about machine learning's replication crisis

36

36

Peer review in machine learning

Benchmark hacking

Recommender systems replication attempt

Kapoor et al. replication attempt in applied ML

Is it possible to estimate the replication rate in ML ?

Final thoughts

Further reading and links :