The setting was adversarial training and adversarial evaluation. During training, PGD attacker of 30 iterations is used to construct adversarial examples used for training. During testing, the evaluation test set is an adversarial test set that is constructed via PGD attacker of 20 iterations.
Experimental data of y-axis is obtained from Table 7 of https://arxiv.org/abs/1906.03787; experimental data of x-axis is obtained from Figure 7 of https://arxiv.org/abs/1906.03787.
"However, to the best of our knowledge there are no quantitative scaling laws for robustness yet."
For scaling laws for adversarial robustness, see appendix A.15 of openreview.net/pdf?id=sckjveqlCZ#page=22
See section 5.3 "Reinforcement Learning" of https://arxiv.org/abs/2210.14891 for more RL scaling laws with number of model parameters on the x-axis (and also RL scaling laws with the amount of compute used for training on the x-axis and RL scaling laws with training dataset size on the x-axis).
re: youtube estimates
You'll probably find some of this twitter discussion useful:
I give a crisp definition from 6:27 to 7:50 of this video:
> Re: "Extrapolating GPT-N performance" and "Revisiting ‘Is AI Progress Impossible To Predict?’" sections of google doc
Read Section 6 titled "The Limit of the Predictability of Scaling Behavior" of "Broken Neural Scaling Laws" paper:
One other goal / theme of mechanistic interpretability research imo:
When f (in equation 1 of the paper ( https://arxiv.org/abs/2210.14891 ) not the video) of next break is sufficiently large, it gives you predictive ability to determine when that next break will occur; also, the number of seeds needed to get such predictive ability is very large. When f of next break is sufficiently small (& nonnegative), it does not give you predictive ability to determine when that next break will occur.
Play around with in this code to see what I mean:
Sections 3.1 and 6.6 titled "Ossification" of "Scaling Laws for Transfer" paper (https://arxiv.org/abs/2102.01293) show that current training of current DNNs exhibits high path dependence.
Did ARC try making a scaling plot with training compute on the x-axis and autonomous replication on the y-axis?