All of Ethan Caballero's Comments + Replies

Did ARC try making a scaling plot with training compute on the x-axis and autonomous replication on the y-axis?

The setting was adversarial training and adversarial evaluation. During training, a 30-iteration PGD attacker is used to construct the adversarial examples that the model is trained on. During testing, the evaluation set is an adversarial test set constructed via a 20-iteration PGD attacker.
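(For concreteness, here is a minimal PyTorch sketch of what an n-iteration L-infinity PGD attacker looks like; the function name, epsilon, and step size are illustrative choices, not values taken from the paper.)

```python
import torch

def pgd_attack(model, x, y, loss_fn, eps=8/255, step_size=2/255, n_iter=30):
    """L-infinity PGD adversarial examples; eps and step_size are illustrative, not the paper's."""
    # Random start inside the eps-ball, then n_iter steps of signed-gradient ascent.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()                # step up the loss
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                          # keep a valid pixel range
    return x_adv.detach()
```

In the setting described above, training would call something like this with n_iter=30 on each batch and train on the returned examples, while evaluation would call it with n_iter=20 on the test set.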

The y-axis data are obtained from Table 7 of https://arxiv.org/abs/1906.03787; the x-axis data are obtained from Figure 7 of https://arxiv.org/abs/1906.03787.
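(To make the "scaling plot" concrete, here is a minimal sketch of fitting a saturating power law to (training compute, adversarial test error) points. The arrays below are synthetic stand-ins, not the numbers from Table 7 / Figure 7, and the appendix presumably fits the paper's broken neural scaling law form rather than the single power law used here.)

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic (training compute, adversarial test error) points standing in for the
# values one would read off Table 7 / Figure 7 of arXiv:1906.03787 -- not real data.
rng = np.random.default_rng(0)
compute = np.logspace(0, 4, num=8)            # training compute, arbitrary normalized units
true_error = 0.45 * compute ** -0.15 + 0.30   # curve used only to generate the demo points
adv_error = true_error + rng.normal(0.0, 0.005, size=compute.shape)

def saturating_power_law(x, a, b, c):
    """error ~ a * x^(-b) + c; c is the irreducible adversarial error floor."""
    return a * x ** (-b) + c

(a, b, c), _ = curve_fit(saturating_power_law, compute, adv_error, p0=[0.5, 0.1, 0.2])
print(f"fit: adversarial error ~= {a:.3f} * compute^(-{b:.3f}) + {c:.3f}")
```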

"However, to the best of our knowledge there are no quantitative scaling laws for robustness yet."


For scaling laws for adversarial robustness, see appendix A.15 of openreview.net/pdf?id=sckjveqlCZ#page=22

1 · AdamGleave · 24d
Thanks, I'd missed that! Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?

Also, could you clarify whether the setting was for adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training, but I could imagine this just being used for evals.

See Section 5.3, "Reinforcement Learning", of https://arxiv.org/abs/2210.14891 for more RL scaling laws with the number of model parameters on the x-axis (and also RL scaling laws with the amount of compute used for training on the x-axis, and with training dataset size on the x-axis).
 

re: youtube estimates

You'll probably find some of this twitter discussion useful:
https://twitter.com/HenriLemoine13/status/1572846452895875073

I give a crisp definition from 6:27 to 7:50 of this video: 

1 · David Johnston · 4mo
Ethan finds empirically that neural network scaling laws (performance vs size, data, other things) are characterised by functions that look piecewise linear on a log-log plot, and postulates that a “sharp left turn” describes a transition from a slower to a faster scaling regime. He also postulates that it might be predictable in advance using his functional form for scaling.
1 · weverka · 4mo
You drew a right turn; the post is asking about a left turn.

> Re: "Extrapolating GPT-N performance" and "Revisiting ‘Is AI Progress Impossible To Predict?’" sections of google doc


Read Section 6, titled "The Limit of the Predictability of Scaling Behavior", of the "Broken Neural Scaling Laws" paper:
https://arxiv.org/abs/2210.14891

When f (in equation 1 of the paper (https://arxiv.org/abs/2210.14891), not the video) of the next break is sufficiently large, it gives you the predictive ability to determine when that next break will occur; that said, the number of seeds needed to get such predictive ability is very large. When f of the next break is sufficiently small (and nonnegative), it does not give you the predictive ability to determine when that next break will occur.

Play around with f in this code to see what I mean:
https://github.com/ethancaballero/broken_neural_scaling_laws/blo...
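(Since the repo link above is truncated, here is a minimal self-contained sketch of the functional form in question, written from equation 1 of the paper rather than copied from the repo; the parameter values are arbitrary and only meant to show how f controls the sharpness of a break.)

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Broken neural scaling law, equation 1 of arXiv:2210.14891 (as written there).

    `breaks` is a list of (c_i, d_i, f_i) tuples: d_i is roughly where break i occurs,
    c_i is how much the log-log slope changes, and f_i is how sharp the break is
    (smaller f_i -> sharper break).
    """
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# Two curves that differ only in f of their single break: the small-f curve has an
# abrupt kink near d = 1e3, the large-f curve bends gradually (values are arbitrary).
x = np.logspace(0, 6, 200)
sharp_break   = bnsl(x, a=0.0, b=1.0, c0=0.1, breaks=[(0.4, 1e3, 0.05)])
gradual_break = bnsl(x, a=0.0, b=1.0, c0=0.1, breaks=[(0.4, 1e3, 2.0)])
```

The question in the comment above is, roughly, how large f of a not-yet-reached break has to be before data collected to the left of d_i lets you fit d_i reliably.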

9 · the gears to ascension · 5mo
Ethan posts an annotated image from openai's paper https://arxiv.org/pdf/2001.08361.pdf, stating that it's "apparently wrong now" after the compute-efficient scaling laws paper from deepmind (https://www.semanticscholar.org/paper/Training-Compute-Optimal-Large-Language-Models-Hoffmann-Borgeaud/011a4019aa0d0ce3edfa56bb2ca1e7586eb43fb2): https://cdn.discordapp.com/attachments/785968841301426216/958570284665946122/Screen_Shot_2021-10-20_at_12.30.58_PM_1.png - the screenshot claims that the crossover point between data and compute in the original openai paper predicts agi.

Ethan, my impression is that you're mildly overfitting. I appreciate your intellectual arrogance quite a bit; it's a great attitude to have as a researcher, and more folks here should have attitudes like yours, IMO. But, I'd expect that data causal isolation quality is going to throw a huge honkin wrench into any expectations we form about how we can use strong models - note that even humans who have low causal quality training data form weird and false superstitions!

I agree with the "test loss != capability" claim because the test distribution is weird and made up and doesn't exist outside the original dataset. IID is catastrophically false, and figuring that out is the key limiter preventing robotics from matching pace with the rest of ML/AI right now, imo. So, your scaling model might even be a solid representation space, but it's misleading because of the correlation problem.

Sections 3.1 and 6.6, titled "Ossification", of the "Scaling Laws for Transfer" paper (https://arxiv.org/abs/2102.01293) show that current training of current DNNs exhibits high path dependence.