The most relevant paper I know of comes out of data privacy concerns. See Extracting Training Data from Large Language Models, which defines "k-eidetic memorization" as a string that can be elicited by some prompt and appears in at most k documents in the training set. They find several examples of k=1 memorization, though the strings appear repeatedly in the source documents. Unfortunately their methodology is targeted towards high-entropy strings and so is not universal.
I have a related question I've been trying to operationalize. How well do GPT-3's memories "generalize"? In other words, given some fact in the training data, how far out of the source distribution can GPT-3 "gain information" from that fact?
E.g. training: "Ixlthubs live in the water." Test: does this affect the predicted likelihood of "Ixlthubs live in the Pacific"? What about "Ixlthubs cannot survive on land"? I'd consider this another interesting measure of sample efficiency/generalization performance. I'm attempting to put together a proposal for the BigScience project (some set of synthetic facts to sprinkle throughout the data), but it's my first try at something like this and slow going.
See also "Evaluating Large Language Models Trained on Code", OpenAI's contribution. They show progress on the APPS dataset (Intro: 25% pass, Comp: 3% pass @ 1000 samples), though note there was substantial overlap with the training set. They also only benchmark up to 12 billion params, but have also trained a related code-optimized model at GPT-3 scale (~100 billion).
Notice that technical details are having a large impact here:
There is no state saving or learning at test time. The prompts were prepended to the API calls, you could see it in the requests
I think the appeal of symbolic and hybrid approaches is clear, and progress in this direction would absolutely transform ML capabilities. However, I believe the approach remains immature in a way that the phrase "Human-Level Reinforcement Learning" doesn't communicate.
The paper uses classical symbolic methods and so faces that classic enemy of GOFAI: super-exponential asymptotics. In order to make the compute more manageable, the following are hard-coded into EMPA:
Additionally, the entire algorithm is tuned to their own custom dataset,. None of this would be feasible for Atari games, or indeed the GVGAI competition, whose video game descriptive language they use to write their own environments. There's a reason they don't evaluate on any of the many existing benchmarks.
I come across a paper like this every once in a while: "The Revenge of GOFAI". Dileep George et al's Recursive cortical networks. Deepmind's Apperception engine. Tenenbaum's own Omniglot solver. They have splashy titles and exciting abstracts, but look into the methods section and you'll find a thousand bespoke and clever shortcuts, feature engineering for the modern age. It's another form of overfitting, it doesn't generalize. The super-exponential wall remains as sheer ever and these approaches simply cannot scale.
I'll reiterate that any progress in these areas would mean substantially more powerful, more explainable models. I applaud these researchers for their work on a hard and important problem. However, I can't consider these papers to represent progress. Instead, I find them aspirational, like the human mind itself: that our methods might someday truly be this capable, without the tricks. I'm left hoping and waiting for insight of a qualitatively different sort.
I think this is a very interesting discussion, and I enjoyed your exposition. However, the piece fails to engage with the technical details or existing literature, to its detriment.
Take your first example, "Tricking GPT-3". GPT is not: give someone a piece of paper and ask them to finish it. GPT is: You sit behind one way glass watching a man at a typewriter. After every key he presses you are given a chance to press a key on an identical typewriter of your own. If typewriter-man's next press does not match your prediction, you get an electric shock. You always predict every keystroke, even before he starts typing.
In this situation, would a human really do better? They might well begin a "proper continuation" after rule 3 only to receive a nasty shock when the typist continues "4. ". Surely by rule 11 a rule 12 is ones best guess? And recall that GPT in its auto-regressive generation mode experiences text in exactly the same way as when simply predicting; there is no difference in its operation, only in how we interpret that operation. So after 12 should come 13, 14... There are several other issues with the prompt, but this is the most egregious.
As for Winograd, the problem of surface associations mimicking deeper understanding is well known. All testing today is done on WinoGrande which is strongly debiased and even adversarially mined (see in particular page 4 figure 1). GPT-3 0-shot scores (70%) well below the human level (94%) but also well above chance (50%). For comparison, BERT (340 million param) 0-shot scores 50.2%.
There are also cases, like multiplication, where GPT-3 unequivocally extracts a deeper "world model", demonstrating that it is at least possible to do so as a language model.
Of course, all of this is likely to be moot! Since GPT-3's release, a primary focus of research has been multimodality, which provides just the sort of grounding you desire. It's very difficult to argue that CLIP, for instance, doesn't know what an avocado looks like, or that these multimodal agents from Deepmind aren't grounded as they follow natural language instructions (video, top text is received instruction).
In all, I find the grounding literature interesting but I remain unconvinced it puts any limits on the capabilities even of the simplest unimodal, unagentic models (unlike, say, the causality literature).