Compute the small change in data dx which would induce a small change in trained parameter values d\theta along each of the narrowest directions of the ridge in the loss landscape (i.e. eigenvectors of the Hessian with largest eigenvalue).
Can you unroll that?
"Small change in data" = one additional training sample is slightly modified? "Induce" = via an SGD update step on that additional training sample? Why is there a ridge in the loss landscape? What are "the narrowest directions"?
Is the difference mostly the learning rate schedule? I read that it was also AdamW, and it is at least conceivable that AdamW somehow gets better results for smaller models trained on more data but maxes out on the benefits of model size more quickly than plain Adam. So it could in theory be the case that scaling continues under the old scaling laws beyond what the new scaling laws say is possible, because Adam and AdamW just work differently enough. Of course that's not very plausible, and for different learning rate schedules it is maybe even less plausible.
Another way to phrase the question: Are the old and the new scaling laws roughly compatible? I.e. do the old scaling laws drop out of the new scaling laws if you use the old compute-optimal data/params distribution? I interpret your answer as that being roughly the case for the current models, but maybe not when you extrapolate further along the old scaling laws?
If the old scaling laws are still correct for a fixed dataset with a correspondingly fixed learning rate schedule, then we can reasonably say that the new scaling laws show us where the old scaling would have hit a wall.
Some more questions:
Meanwhile, the resulting model would not be nearly as big as PaLM. The optimal compute law actually puts it at 63B params.
How come PaLM_opt is smaller than Chinchilla? Isn't Chinchilla supposed to be Gopher_opt?
Insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got.
These models were trained differently, which is why they had different scaling laws. Can we suppose that the new scaling laws tell us where the old scaling would have broken down?
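To make the quoted claim concrete, here is a minimal sketch of how I read it. It assumes "our equation" is the parametric fit L(N, D) = E + A/N^alpha + B/D^beta from the Chinchilla paper, with the fitted constants as I remember them from that paper, and it assumes the older models were trained on roughly 300B tokens. None of these numbers appear in this thread, so treat them as assumptions rather than as anyone's official figures.

```python
# Parametric loss fit: L(N, D) = E + A / N**alpha + B / D**beta
# ASSUMPTION: constants are the fitted values reported in the Chinchilla
# paper; they are not stated anywhere in this thread.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def loss_floor(n_tokens: float) -> float:
    """Limit of loss() as n_params -> infinity with the data held fixed."""
    return E + B / n_tokens**BETA


# ASSUMPTION: Chinchilla is taken as 70B params on 1.4T tokens, and the
# older line of models as trained on roughly 300B tokens.
chinchilla = loss(70e9, 1.4e12)
floor_300b = loss_floor(300e9)

print(f"Chinchilla (70B params, 1.4T tokens): {chinchilla:.3f}")
print(f"Infinite params, 300B tokens:         {floor_300b:.3f}")
print("Fixed-data floor is above Chinchilla:", floor_300b > chinchilla)
```

If the fit is trusted, the finite-data term B/D^beta alone keeps the fixed-data models above Chinchilla's predicted loss, however large N gets. Whether that floor is also where the old scaling laws would have broken down is exactly what I am asking.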
I think it would be a great follow-up post to explain why you think repeating data is not going to be the easy way out for the scaling enthusiasts at Deepmind and OpenAI.
I find the Figure 4 discussion at your first link quite confusing. They study repeated data, i.e. imbalanced datasets, and then draw conclusions about repeating data, i.e. training for several epochs. The performance hit they observe does not seem massive (when talking about scaling by a couple of OOMs), and they keep the number of training tokens constant.
I really can't tell how this informs me about what would happen if somebody tried to scale compute 1000-fold and had to repeat data to do it compute-optimally, which seems to be the relevant question.
3. Stop worrying about finding “outer objectives” which are safe to maximize. I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function). Instead, focus on building good cognition within the agent. In my ontology, there's only an inner alignment problem: How do we grow good cognition inside of the trained agent?
This vibes well with what I've been thinking about recently.
There's a post in the back of my mind called "Character alignment", which is about how framing alignment in terms of values, goals, reward, etc. is maybe not always ideal, because at least introspectively for me these seem to be strongly influenced by a more general structure of my cognition, i.e. my character.
Here character can be understood as a certain number of specific strategic priors, which might make good optimisation targets because they drop out of game-theoretic considerations and are therefore possibly quite generally and robustly modelled by sufficiently advanced agents.
I see many incorrect assumptions about what it takes to be a good conceptual researcher floating around [...] you can pick up the relevant part [of ML] and just work on approaches different to pure prosaic alignment
This seemed to imply that you might be a conceptual alignment researcher but also work on pure prosaic alignment, which was the point where I thought: OK, maybe I don't know what "conceptual alignment research" means. But the link definitely clears it up, thank you!
Could you maybe add a paragraph (or comment) on how exactly you define "conceptual" alignment research? What would be an example of alignment research that is not conceptual?
Well, I don't, so I don't.
Seriously, there is absolutely nothing about strong commitment being bad in my comment, or is there?
I think there are many marriages where one side defects (be it cheating, alcohol, abuse, ...) and the other side has very good reason to get out, and conversely there are many cases where it is advantageous for one side to hold on to the marriage long after things have gone south (for example, for financial reasons).
That's why I think allowing one side to veto divorce is a bad idea. Making divorce harder is very different from making it impossible.
The marriage vow allows one partner to veto divorce, which seems like a bad idea.
In that case, did death part you?
Not more than sleep, I would say.