Biological [perfectly editing our own DNA], Financial [mass switch to blockchain], and Computational [AI] singularities may occur together. My current prior on a singularity arriving within 1 year is 70% [as of July 2020].
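The data for GPT-2 has been replicated by the open source OpenWebText project. To my knowledge GPT-3 was trained on an expanded version of the same dataset (WebText2) alongside filtered Common Crawl, books corpora, and Wikipedia, so accessing most of it is not a problem.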
The parallelizability of GPT-3 is something I've been looking into. The current implementation of ZeRO-2 seems like the most memory-efficient way to train a 175B parameter transformer model.
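A rough sketch of the memory arithmetic behind this, using the per-parameter accounting from the ZeRO paper for mixed-precision Adam (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and K = 12 bytes of optimizer state per parameter). The function name and the 1024-GPU example are my own assumptions for illustration, and activations and buffers are ignored.

```python
def zero_bytes_per_gpu(n_params: float, n_gpus: int, stage: int, k: int = 12) -> float:
    """Approximate model-state bytes per GPU under ZeRO data parallelism.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, k bytes/param for optimizer state.
    Stage 0 is plain data parallelism (everything replicated).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2 * n_params
    grads = 2 * n_params
    opt = k * n_params
    if stage >= 1:  # stage 1 partitions the optimizer state
        opt /= n_gpus
    if stage >= 2:  # stage 2 also partitions the gradients
        grads /= n_gpus
    if stage >= 3:  # stage 3 also partitions the weights
        params /= n_gpus
    return params + grads + opt


if __name__ == "__main__":
    n_params = 175e9  # assumed GPT-3-scale parameter count
    n_gpus = 1024     # hypothetical cluster size
    for stage in range(4):
        gb = zero_bytes_per_gpu(n_params, n_gpus, stage) / 1e9
        print(f"stage {stage}: {gb:,.1f} GB of model state per GPU")
```

Under these assumptions the stage 2 figure is dominated by the fp16 weights, which stage 2 still replicates on every GPU, so at this scale it would presumably have to be combined with some form of model parallelism.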
Artificial eXpert Intelligence and Artificial Super Intelligence. Sorry for not being clear; I've edited the title to be more obvious.
What I'm going for here is that the few-shot examples show GPT-3 is a sort of DWIM AI (Do What I Mean): something it has implicitly learned from examples of humans responding to other humans' requests. It is able to understand simple requests (like unscramble, use word in sentence, etc.) with about as much data as a human would need: understanding the underlying motive of the request and attempting to fulfill it.
On firefighters, the instruction would work even with children who had never seen a fire, but only heard its attributes verbally described ("hot bright thing that causes skin damage when touched; usually diminished by covering with water or blocking air").
On cancer, have a look at the CRISPR completion - what do you think GPT-4 would say? Is it really that far out to believe that in an endeavour to predict the next word in thousands of biology research papers, GPT-4 will gain an implicit understanding of biology? In a literal sense, GPT would've "read" more papers than any human possibly could, and might be better placed than the best human researcher (who is also relying on a less probabilistic grasp of the same papers) to probabilistically rank all genes that might be involved in a cancer cure.
The big jump in performance between the zero-shot and few-shot settings in arithmetic and other non-linguistic reasoning tasks [esp. 3D- and 3D+] is why I think it is almost certain #2 is true. Few-shot inference relies on no further training [unlike fine-tuning], so the improvement in 'pattern recognition', so to speak, is happening entirely at inference. It follows that the underlying model has general reasoning abilities, i.e. the ability to detect and repeat arbitrary patterns of ever-increasing complexity that occur in its input (conditioning) data.
Interestingly, the model fails to completely learn 4D and 5D arithmetic, where its zero-shot scores were really low. However, few-shot inference does show improvement. I wonder if problems of increasing complexity can also be solved using increasing numbers of few-shot examples (say k=500 for 4D+). Though of course this will run into the roadblock of context size very soon.
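To make the zero-shot vs few-shot setup concrete, here is a minimal sketch of how a k-shot arithmetic prompt could be assembled. The "Q: What is ... A:" wording is modelled on the style of the examples in the GPT-3 paper, but the exact format, the function names, and the sampling range are my assumptions, not the paper's evaluation code.

```python
import random


def make_problem(n_digits: int, op: str, rng: random.Random) -> tuple[str, str]:
    """Return (question, answer) for one n-digit addition or subtraction."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    if op == "+":
        return f"Q: What is {a} plus {b}?", str(a + b)
    if op == "-":
        return f"Q: What is {a} minus {b}?", str(a - b)
    raise ValueError("op must be '+' or '-'")


def build_few_shot_prompt(n_digits: int, op: str, k: int, seed: int = 0) -> tuple[str, str]:
    """Build a prompt with k solved examples followed by one unsolved query.

    Returns (prompt, expected_answer). k=0 gives the zero-shot prompt.
    """
    rng = random.Random(seed)
    blocks = []
    for _ in range(k):
        question, answer = make_problem(n_digits, op, rng)
        blocks.append(f"{question}\nA: {answer}")
    query, expected = make_problem(n_digits, op, rng)
    blocks.append(f"{query}\nA:")  # the model is asked to continue from here
    return "\n\n".join(blocks), expected


if __name__ == "__main__":
    prompt, expected = build_few_shot_prompt(n_digits=3, op="+", k=3, seed=42)
    print(prompt)
    print("expected completion:", expected)
```

Nothing about the model changes between k=0 and k=50 here; only the conditioning text grows, which is the point being made above.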
If increasing the number of few-shot examples allows it to correctly solve ever-harder problems, there is a strong case for scaling the Reformer, with a context window of 1 million tokens, to a GPT-3-like size.
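A back-of-the-envelope sketch of the context-size roadblock. GPT-3's context window is 2048 tokens; the 20 tokens per solved example is purely an assumed figure for illustration (the real number depends on the tokenizer and the prompt format), and `max_shots` is a hypothetical helper.

```python
def max_shots(context_tokens: int, tokens_per_example: int, reserved_tokens: int = 0) -> int:
    """How many solved examples fit in the context window.

    reserved_tokens is the space kept free for the final query and its answer.
    """
    if tokens_per_example <= 0:
        raise ValueError("tokens_per_example must be positive")
    usable = context_tokens - reserved_tokens
    return max(usable // tokens_per_example, 0)


if __name__ == "__main__":
    tokens_per_example = 20  # assumption, not a measured value
    print(max_shots(2048, tokens_per_example, reserved_tokens=20))       # 101
    print(max_shots(1_000_000, tokens_per_example, reserved_tokens=20))  # 49999
```

So under this assumption k=500 does not fit in a 2048-token window at all, while a 1 million token window would leave room for tens of thousands of examples.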
It would be fascinating to probe how much of the general reasoning capability arises from the size of the transformer itself, and how much from training on a large volume of language data. Does language training implicitly impart it with the tools for all human symbolic reasoning?
A test anybody with 1024 GPUs for even a few minutes can perform is to load an untrained GPT-3-sized model, train it for a few steps on a few hundred 3D, 4D, and 5D calculations, and then test its inference. This would help show whether these skills can be learnt absent a basis in language. It parallels a question in humans - can a human learn math without first learning language?
A success would indicate the existence of general reasoning as an innate attribute of large transformers themselves. Failure, however, would not falsify general reasoning: it would imply that any general reasoning originates in language learning - which could explain why pre-trained models can perform arithmetic but untrained models can't.
[Note: my use of "trained" and "untrained" refers to pre-training on CommonCrawl.]
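The data and scoring side of that experiment can be sketched without any GPUs. This is only a sketch under my own assumptions: the "a+b=" text format, the function names, and the `oracle` stand-in are hypothetical, and the actual training of a large untrained transformer is left out entirely.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, expected completion), e.g. ("123+456=", "579")


def make_split(n_digits: int, n_train: int, n_test: int, seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Sample disjoint train/test sets of n-digit additions."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    if n_train + n_test > (hi - lo + 1) ** 2:
        raise ValueError("not enough distinct problems for this split")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_train + n_test:
        pairs.add((rng.randint(lo, hi), rng.randint(lo, hi)))
    ordered = sorted(pairs)  # fixed order before shuffling, for reproducibility
    rng.shuffle(ordered)
    examples = [(f"{a}+{b}=", str(a + b)) for a, b in ordered]
    return examples[:n_train], examples[n_train:]


def exact_match_accuracy(predict: Callable[[str], str], test: List[Example]) -> float:
    """Fraction of test prompts whose completion matches the answer exactly."""
    if not test:
        raise ValueError("test set is empty")
    correct = sum(predict(prompt).strip() == answer for prompt, answer in test)
    return correct / len(test)


def oracle(prompt: str) -> str:
    """Stand-in for model inference; swap in the trained model's completion."""
    a, b = prompt.rstrip("=").split("+")
    return str(int(a) + int(b))


if __name__ == "__main__":
    for n_digits in (3, 4, 5):
        train, test = make_split(n_digits, n_train=300, n_test=100, seed=n_digits)
        acc = exact_match_accuracy(oracle, test)
        print(f"{n_digits}D+: {len(train)} train, {len(test)} test, accuracy {acc:.2f}")
```

Keeping the test problems disjoint from the training problems matters here, since otherwise a success could be memorisation rather than anything resembling reasoning.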
GPT-3 made me update my prior for "scaling current techniques will get us to Superintelligence" from probably not (<30%) to likely (>60%). The phase shifts in many tasks mentioned by dxu, and its ability to perform non-linguistic reasoning at inference, are the reasons for this shift. I tried a number of ways to make GPT-2 perform basic arithmetic but always failed, which was responsible for my earlier prior.
My updated models predict that a model 1-2 orders of magnitude bigger will almost certainly be able to utilise calculus, trigonometry, and derivations in a human-like way to reach conclusions, given a few examples.
Essentially, I see no evidence against the proposition that language, math, and abstract reasoning are points along the same continuum, and this paper provides strong evidence for it: the difference between these abilities is only one of complexity.
I feel the few shot arithmetic results were some of the most revolutionary results we've received this decade. Learning math from nothing but verbal examples shows that we have an agent that can reason like us. The progression to complex algebra seems inevitable, with nothing but more parameters.
Assumption #2 is entirely correct, which is why few shot examples matter. The system is literally smart enough to figure out what addition is with those 50 examples. I bet most of us took longer than 50 examples.
"But #2 is wild: it would represent a kind of non-linguistic general intelligence ability which would be remarkable to find in a language model."
Yes, it does represent exactly that. The few-shot results on word scramble further bear this out. I would go so far as to say GPT-3 is the first ever proto-AGI. For all the things people say it can't do, it's on track to becoming superintelligent in everything that matters: reasoning, math, and pattern recognition.
GPT-2 was a toddler with decent English. GPT-3 just reached primary school.