The setting was adversarial training and adversarial evaluation. During training, a PGD attacker with 30 iterations is used to construct the adversarial examples used for training. During testing, the evaluation set is an adversarial test set constructed via a PGD attacker with 20 iterations.
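That setup can be sketched as follows. This is a minimal toy illustration (a fixed linear classifier with a hand-derived logistic-loss gradient, made-up eps/step-size values), not the paper's actual attack code or model:

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.3, alpha=0.01, iters=30):
    """L-infinity PGD: `iters` gradient-ascent steps on the logistic loss,
    projected back into the eps-ball around the clean input after each step."""
    x0 = x.copy()
    x_adv = x.copy()
    for _ in range(iters):
        # Logistic loss gradient w.r.t. the input, for label y in {0, 1}.
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
        grad = (p - y) * w
        # Ascend the loss via the gradient sign, then project.
        x_adv = x_adv + alpha * np.sign(grad)
        x_adv = np.clip(x_adv, x0 - eps, x0 + eps)
    return x_adv

# Mirroring the setup described above: iters=30 to construct training
# examples, iters=20 to construct the adversarial test set.
```

In the adversarial-training loop one would regenerate `x_adv` for each minibatch against the current model and take the gradient step on those examples rather than the clean ones.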

The experimental data for the y-axis is obtained from Table 7 of https://arxiv.org/abs/1906.03787; the data for the x-axis is obtained from Figure 7 of the same paper.

"However, to the best of our knowledge there are no quantitative scaling laws for robustness yet."

For scaling laws for adversarial robustness, see appendix A.15 of openreview.net/pdf?id=sckjveqlCZ#page=22


Thanks, I'd missed that!
Curious if you have any high-level takeaways from that? Bigger models do better,
clearly, but e.g. how low do you think we'll be able to get the error rate in
the next 5-10 years given expected compute growth? Are there any follow-up
experiments you'd like to see happen in this space?
Also, could you clarify whether the setting was adversarial training or just
a vanilla model? "During training, adversarial examples for training are
constructed by PGD attacker of 30 iterations" makes me think it's adversarial
training, but I could imagine this just being used for evals.

See Section 5.3 ("Reinforcement Learning") of https://arxiv.org/abs/2210.14891 for more RL scaling laws with the number of model parameters on the x-axis (and also RL scaling laws with training compute on the x-axis, and with training dataset size on the x-axis).

re: youtube estimates

You'll probably find some of this twitter discussion useful:

https://twitter.com/HenriLemoine13/status/1572846452895875073

OP will find this paper useful:

https://arxiv.org/abs/2210.14891

I give a crisp definition from 6:27 to 7:50 of this video:


Ethan finds empirically that neural network scaling laws (performance vs size,
data, other things) are characterised by functions that look piecewise linear on
a log log plot, and postulates that a “sharp left turn” describes a transition
from a slower to a faster scaling regime. He also postulates that it might be
predictable in advance using his functional form for scaling.
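That functional form (Equation 1 of https://arxiv.org/abs/2210.14891, as I read it) is a power law multiplied by one smoothly-broken factor per break; on a log-log plot it looks piecewise linear, with the parameter f_i controlling how sharp the i-th transition is. A sketch in numpy, with made-up parameter values for illustration:

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Smoothly broken neural scaling law (Eq. 1 of the BNSL paper).
    breaks: list of (c_i, d_i, f_i) tuples, one per break, where d_i is
    the break location and f_i the sharpness of the transition."""
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# One sharp break at d_1 = 1e3: log-log slope is about -c0 = -0.1
# before the break and about -(c0 + c1) = -0.5 after it.
xs = np.logspace(0, 6, 7)
ys = bnsl(xs, a=0.0, b=1.0, c0=0.1, breaks=[(0.4, 1e3, 0.05)])
```

For x far below d_i the i-th factor is close to 1 (the break contributes nothing); for x far above d_i it behaves like (x/d_i)^(-c_i), which is what adds c_i to the log-log slope.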



You drew a right turn, the post is asking about a left turn.

> Re: "Extrapolating GPT-N performance" and "Revisiting ‘Is AI Progress Impossible To Predict?’" sections of google doc

Read Section 6 titled "The Limit of the Predictability of Scaling Behavior" of "Broken Neural Scaling Laws" paper:

https://arxiv.org/abs/2210.14891

One other goal / theme of mechanistic interpretability research imo:

twitter.com/norabelrose/status/1588571609128108033

When f (in Equation 1 of the paper ( https://arxiv.org/abs/2210.14891 ), not the video) for the next break is sufficiently large, it gives you the predictive ability to determine when that next break will occur; however, the number of seeds needed to obtain such predictive ability is **very** large. When f for the next break is sufficiently small (and nonnegative), it does not give you the predictive ability to determine when that next break will occur.
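A numeric sketch of that point, assuming the single-break case of Equation 1 (all parameter values below are made up): a large f means a gradual transition, so the curve already deviates from the unbroken power law well before the break location d_1 and data collected before the break carries a signal about it; a small f means a sharp transition with essentially no advance signal.

```python
import numpy as np

def bnsl_one_break(x, a, b, c0, c1, d1, f1):
    # Single-break smoothly broken power law (Eq. 1 of the BNSL paper, n=1).
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

def deviation_before_break(f1, d1=1e4):
    """Relative deviation from the unbroken power law at x = d1 / 10,
    i.e. how visible the upcoming break is before reaching it."""
    x = d1 / 10.0
    unbroken = 1.0 * x ** (-0.1)
    broken = bnsl_one_break(x, 0.0, 1.0, 0.1, 0.4, d1, f1)
    return abs(broken - unbroken) / unbroken

# Smooth break (f1 = 2): a double-digit-percent deviation one decade early.
# Sharp break (f1 = 0.01): deviation is numerically indistinguishable from zero.
```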

Play around with this code to see what I mean:

https://github.com/ethancaballero/broken_neural_scaling_laws/blo...


Ethan posts an annotated image from OpenAI's paper
https://arxiv.org/pdf/2001.08361.pdf , stating that it's "apparently wrong now"
after the compute-optimal scaling laws paper from DeepMind
[https://www.semanticscholar.org/paper/Training-Compute-Optimal-Large-Language-Models-Hoffmann-Borgeaud/011a4019aa0d0ce3edfa56bb2ca1e7586eb43fb2]:
https://cdn.discordapp.com/attachments/785968841301426216/958570284665946122/Screen_Shot_2021-10-20_at_12.30.58_PM_1.png
The screenshot claims that the crossover point between data and compute in the
original OpenAI paper predicts AGI.
Ethan, my impression is that you're mildly overfitting. I appreciate your
intellectual arrogance quite a bit; it's a great attitude to have as a
researcher, and more folks here should have attitudes like yours, IMO. But I'd
expect that data causal-isolation quality is going to throw a huge honkin'
wrench into any expectations we form about how we can use strong models; note
that even humans whose training data has low causal quality form weird and
false superstitions! I agree with the "test loss != capability" claim, because
the test distribution is weird and made up and doesn't exist outside the
original dataset. IID is catastrophically false, and figuring that out is the
key limiter preventing robotics from matching pace with the rest of ML/AI right
now, imo. So your scaling model might even be a solid representation space, but
it's misleading because of the correlation problem.

Sections 3.1 and 6.6 titled "Ossification" of "Scaling Laws for Transfer" paper (https://arxiv.org/abs/2102.01293) show that current training of current DNNs exhibits high path dependence.

Did ARC try making a scaling plot with training compute on the x-axis and autonomous replication on the y-axis?