See section 5.3 "Reinforcement Learning" of https://arxiv.org/abs/2210.14891 for more RL scaling laws with number of model parameters on the x-axis (and also RL scaling laws with the amount of compute used for training on the x-axis and RL scaling laws with training dataset size on the x-axis).
re: youtube estimates
You'll probably find some of this twitter discussion useful:
https://twitter.com/HenriLemoine13/status/1572846452895875073
OP will find this paper useful:
https://arxiv.org/abs/2210.14891
I give a crisp definition from 6:27 to 7:50 of this video:
> Re: "Extrapolating GPT-N performance" and "Revisiting ‘Is AI Progress Impossible To Predict?’" sections of google doc
Read Section 5.6 titled "The Limit of the Predictability of Scaling Behavior" of "Broken Neural Scaling Laws" paper:
https://arxiv.org/abs/2210.14891v4
One other goal / theme of mechanistic interpretability research imo:
twitter.com/norabelrose/status/1588571609128108033
When f (in equation 1 of the paper ( https://arxiv.org/abs/2210.14891 ) not the video) of next break is sufficiently large, it gives you predictive ability to determine when that next break will occur; also, the number of seeds needed to get such predictive ability is very large. When f of next break is sufficiently small (& nonnegative), it does not give you predictive ability to determine when that next break will occur.
Play around with in this code to see what I mean:
https://github.com/ethancaballero/broken_neural_scaling_laws/blob/main/make_figure_1__decomposition_of_bnsl_into_power_law_segments.py#L25-L29
Sections 3.1 and 6.6 titled "Ossification" of "Scaling Laws for Transfer" paper (https://arxiv.org/abs/2102.01293) show that current training of current DNNs exhibits high path dependence.
arxiv.org/abs/2210.14891