The APPS repository also gives the fine-tuned weights for GPT-Neo-2.7 and code to run it. Though without a GPU it takes roughly forever.
I asked Dan Hendrycks for the performance of GPT-J-6B on APPS on the Eleuther AI discord. He didn't say they were definitely going to test it, but my take-away was that it might happen.
I could image a test driven automated programming evolving in the next ten to twenty years, were a LM-guided search tries to create functions according to a description that pass all the test cases.
The second idea reminds me of a talk years back about swarm behavior. Some fish swim faster in the sunlight, which makes the entire swarm "seek out" the shady parts of the pond.
There is a second mechanism at play here, where fish try to keep close to their neighbors, so the entire swarm kind of turns into the direction of shade as soon as the part of the swarm in the shade slows down.
This suggests an optimizer for parallel training which doesn't completely synchronize the weights on the different machines, but instead only tries to keep all sets of weights reasonably close to some of the other sets of weights.
The effect should be that the swarm of different weights turn into the direction of low noise.
Yes, definitely, thank you!
Though I was originally confused on a much more basic level, due to superficial reading, jumping to conclusions and not having touched much calculus notation in the last 15 years.
Ah, I guess I understand now. I was always thinking about an updating of the parameters. But you are talking about adding to the function output.
Ok, I thought your F(x) was one update step of the gradient of f times ΔΘ away from f. I guess then I just don't understand the equation.
No, when I say single update, I just mean that the final model can in principle be reached by a single update with the initial gradient. I'm aware that in practice you need more steps to compute the correct delta.
My argument is solely about the initial gradient. It does not point to the minimum SGD would reach, because the initial gradient tries harder to solve common problems, but the SGD-minimum (ideally) solves even rare problems. SGD manages to do this because common problems do not influence later gradients, because they will already be solved.
Maybe I am misunderstanding something, but I don’t think the parameter tangent hypothesis can be generally correct.
Let’s say we have 1 datapoint A to be mapped to -1 and 100 datapoint B to be mapped to +1. The model is randomly initialised. Now the parameter tangent space is just the current model + the gradient over the dataset * delta. The gradient over the entire dataset is the sum of the gradients for each datapoint. Therefore the gradient points a hundred times more towards the solution for the input B than towards the solution for input A. If we search solely in the tangent space, we will either solve B and get a miniscule improvement for A or solve A and massively overshoot the best parameters for B.
I.e. to reach the final parameters with a single update, the computed gradient has to be balanced between all possible skills the model is supposed to be exhibiting after training. Otherwise the gradient will not point towards a global-ish minimum but a frequency-of-problem-weighted minimum.
SGD solves this problem because the gradient for B shrinks, the more updates into the direction of B-solution have been made. So in our examples the gradient for B would shrink to zero during training and A would get its time in the sun.
Inutitively, the parameter tangent space would be correct for MNIST and other well balanced small datasets. But for large language models to pick a random example, it is not clear what „well balanced“ in the above sense even means.
Wouldn't the role of transposons be easy enough to investigate by incapacitating functional transposons with Crispr/CAS9? Has something like that been done in mice?
Maybe I am misunderstanding something, but to me it is very intuitive that there is a big jump from the embedding output to the first transformer block output. The embedding is backpropagated into so it makes sense to see all representations as representations of the prediction we are trying to make, i.e. of the next word.
But the embedding is a prediction of the next word based on only a single word, the word that is being embedded. So the prediction of the next word is by necessity very bad (the BPE ensures that, IIUC, because tokens that would always follow one another are merged).
The first transformer block integrates hundreds of words of context into the prediction, that’s where the big jump comes from.
Great post! I had trouble wrapping my head around the „inconsistency“ in the first paper, now I think I get it: TL;DR in my own words:
There are three regimes of increasing information uptake, ordered by how cheap they are in terms of compute:
- Increasing sampling efficiency by increasing model size —> this runs into diminishing returns because sample efficiency has a hard upper bound. —> context window increase?
- Accessing more information by training over more unique samples —> will run into diminishing returns when unique data runs out. —> multi-modal data?
- Extracting more information by running over the same samples several times —> this intuitively crashes sampling efficiency because you can only learn the information not already extracted in earlier passes. —> prime candidate for active learning?
I had also missed the implication of the figure in the second paper that shows that GPT-3 is already very close to optimal sampling efficiency. So it seems that pure text models will only see another order of magnitude increase in parameters or so.
If you are looking for inspiration for another post about this topic: Gwern mentions the human level of language modeling and Steve Omohundro also alludes to the loss that would signify human level, I don’t really understand neither the math nor where the numbers come from. It would be very interesting to me to see an explanation of the „human level loss“ to put the scaling laws in perspective. Of course I assume that a „human level“ LM would have very different strengths and weaknesses compared to a human, but still.