Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A postmortem of my predictions about GPT-3 from 21 March 2019:

1. When will it appear? (My guess: 2020.) True
2. Will it be created by OpenAI, and will it be advertised? (My guess: it will not be publicly known until 2021, but other companies may create open versions before it.) False
3. How much data will be used for its training, and what type of data? (My guess: 400 GB of text plus illustrating pictures, but not audio and video.) True for text, false for pictures: "The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens"
4. What will it be able to do? (My guess: translation, picture generation based on text, text generation based on pictures – with 70 per cent of human performance.) False for pictures
5. How many parameters will be in the model? (My guess: 100 billion to a trillion.) True: "175 billion parameters"
6. How much compute will be used for training? (No idea.) "training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model"

With #3, I think you fell into the trap of being overly-specific and overly-committed to a specific organizational strategy. It would be very reasonable to assume that OA would be working on multimodal, because you need that for efficiency & generalization & ability to do things like text instructions to control a robot arm, and indeed, I quote TR about how they are working hard on large multimodal self-supervised Transformers... but you assumed that would have to be the "GPT-3", instead of a parallel project while GPT-3 winds up being a scaled up GPT-2. It would have made more sense to split the predictions and try to be agnostic about whether OA would choose to do 2 big models or attempt 1 multimodal model, since it could be the case that the multimodal stuff would not mature in time (as seems to be the case), and predict instead more end outcomes like "human-level text article generation" or "models with >100b parameters", since there are many possible routes to relatively few outcomes of interest.


Awesome, thanks for following up on this.

Rohin Shah:

Planned summary for the Alignment Newsletter:

The biggest <@GPT-2 model@>(@Better Language Models and Their Implications@) had 1.5 billion parameters, and since its release people have trained language models with up to 17 billion parameters. This paper reports GPT-3 results, where the largest model has _175 billion_ parameters, a 10x increase over the previous largest language model. To get the obvious out of the way, it sets a new state of the art (SOTA) on zero-shot language modeling (evaluated only on Penn Tree Bank, as other evaluation sets were accidentally a part of their training set).

The primary focus of the paper is on analyzing the _few-shot learning_ capabilities of GPT-3. In few-shot learning, after an initial training phase, at test time models are presented with a small number of examples of a new task, and then must execute that task for new inputs. Such problems are usually solved using _meta-learning_ or _finetuning_, e.g. at test time MAML takes a few gradient steps on the new examples to produce a model finetuned for the test task. In contrast, the key hypothesis with GPT-3 is that language is so diverse that doing well on it already requires adaptation to the input, and so the learned language model will _already be a meta-learner_. This implies that they can simply "prime" the model with examples of a task they care about, and the model can _learn_ what task is supposed to be performed, and then perform that task well.

For example, consider the task of generating a sentence using a newly made-up word whose meaning has been explained. In one notable example, the prompt for GPT-3 is:

_A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:_
_We were traveling in Africa and we saw these very cute whatpus._
_To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:_

Given this prompt, GPT-3 generates the following output:

_One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles._
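To make the "priming" idea concrete, here is a minimal sketch of how such few-shot prompts are assembled: solved examples are concatenated, followed by an unsolved query the model is expected to continue. The function name and the `=>` format are illustrative, not anything from the paper or OpenAI's API.

```python
def build_prompt(examples, query):
    """Concatenate solved (input, output) examples and an unsolved query
    into a single few-shot prompt for a language model to continue."""
    lines = [f"{x} => {y}" for x, y in examples]
    lines.append(f"{query} =>")  # the model's continuation is the "answer"
    return "\n".join(lines)

prompt = build_prompt([("cheese", "fromage"), ("house", "maison")], "cat")
print(prompt)
# cheese => fromage
# house => maison
# cat =>
```

No gradient steps happen at test time: the "learning" is entirely in how the model conditions on this prompt.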

The paper tests on several downstream tasks for which benchmarks exist (e.g. question answering), and reports zero-shot, one-shot, and few-shot performance on all of them. On some tasks, the few-shot version sets a new SOTA, _despite not being finetuned using the benchmark’s training set_; on others, GPT-3 lags considerably behind finetuning approaches.

The paper also consistently shows that few-shot performance increases as the number of parameters increase, and the rate of increase is faster than the corresponding rate for zero-shot performance. While they don’t outright say it, we might take this as suggestive evidence that as models get larger, they are more incentivized to learn “general reasoning abilities”.

The most striking example of this is in arithmetic, where the smallest 6 models (up to 6.7 billion parameters) have poor performance (< 20% on 2-digit addition), then the next model (13 billion parameters) jumps to > 50% on 2-digit addition and subtraction, and the final model (175 billion parameters) achieves > 80% on 3-digit addition and subtraction and a perfect 100% on 2-digit addition (all in the few-shot regime). They explicitly look for their test problems in the training set, and find very few examples, suggesting that the model really is learning “how to do addition”; in addition, when it is incorrect, it tends to be mistakes like “forgetting to carry a 1”.

On broader impacts, the authors talk about potential misuse, fairness and bias concerns, and energy usage concerns, and say about what you’d expect. One interesting note: “To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed.” They find that while there was significant discussion of misuse, they found no successful deployments. They also consulted with professional threat analysts about the possibility of well-resourced actors misusing the model. According to the paper: “The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.”

Planned opinion:

For a long time, I’ve heard people quietly hypothesizing that with a sufficient diversity of tasks, regular gradient descent could lead to general reasoning abilities allowing for quick adaptation to new tasks. This is a powerful demonstration of this hypothesis.

One critique is that GPT-3 still takes far too long to “identify” a task -- why does it need 50 examples of addition in order to figure out that what it should do is addition? Why isn’t 1 sufficient? It’s not like there are a bunch of other conceptions of “addition” that need to be disambiguated. I’m not sure what’s going on mechanistically, but we can infer from the paper that as language models get larger, the number of examples needed to achieve a given level of performance goes down, so it seems like there is some “strength” of general reasoning ability that goes up. Still, it would be really interesting to figure out mechanistically how the model is “reasoning”.

This also provides some empirical evidence in support of the threat model underlying <@inner alignment concerns@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@): they are predicated on neural nets that implicitly learn to optimize. (To be clear, I think it provides empirical support for neural nets learning to “reason generally”, not neural nets learning to implicitly “perform search” in pursuit of a “mesa objective” -- see also <@Is the term mesa optimizer too narrow?@>.)
A year ago, Joaquin Phoenix made headlines when he appeared on the red carpet at the Golden Globes wearing a tuxedeo with a paper bag over his head that read, "I am a shape-shifter. I can't change the world. I can only change myself."

-- GPT-3 generated news article humans found easiest to distinguish from the real deal.

... I haven't read the paper in detail but we may have done it; we may be on the verge of superhuman skill at absurdist comedy! That's not even completely a joke. Look at the sentence "I am a shape-shifter. I can't change the world. I can only change myself." It's successful (whether intended or not) wordplay. "I can't change the world. I can only change myself" is often used as a sort of moral truism (e.g. Man in the Mirror, Michael Jackson). In contrast, "I am a shape-shifter" is a literal claim about one's ability to change themselves.

The upshot is that GPT-3 can equivocate between the colloquial meaning of a phrase and the literal meaning of a phrase in a way that I think is clever. I haven't looked into whether the other GPTs did this (it makes sense that a statistical learner would pick up this kind of behavior) but dayum.

Google's Meena (2.6 billion parameters, February 2020) creates original puns in "Conversation D", and I think "Cross-turn Repetition Example 2" is absurdist comedy, but maybe more as a result of the way the model fails.

Conversation D
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.



Thanks! This is great.

"In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table. Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings." (emphasis mine)

Does this seem right? If so, is this impressive? It seems so to me; people often say "reasoning" is something current methods can't do, and this is updating me more towards thinking that's false.


I'm a bit confused about this as a piece of evidence-- naively, it seems to me like not carrying the 1 would be a mistake that you would make if you had memorized the pattern for single-digit arithmetic and were just repeating it across the number. I'm not sure if this counts as "memorizing a table" or not.
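To make this failure mode concrete, here is a sketch of what "memorized single-digit sums, no carry propagation" would produce. This is purely illustrative of the hypothesis in the comment above, not a claim about GPT-3's internals.

```python
def add_without_carry(a, b):
    """Add two numbers digit by digit, dropping every carry -- the
    behavior you'd get from a memorized single-digit addition table
    applied column-wise with no carry propagation."""
    sa, sb = str(a), str(b)
    width = max(len(sa), len(sb))
    sa, sb = sa.zfill(width), sb.zfill(width)
    return int("".join(str((int(x) + int(y)) % 10) for x, y in zip(sa, sb)))

print(add_without_carry(47, 35))  # 72, not 82: the carry from 7+5 is lost
print(add_without_carry(23, 45))  # 68 -- correct, since no column carries
```

Note that this strategy still gets every carry-free problem right, which is why high accuracy alone doesn't distinguish it from true addition; the pattern of errors does.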

Excellent point! Well, they do get the answer right some of the time... it would be interesting to see how often they "remember" to carry the one vs. how often they "forget." It looks like the biggest model got basically 100% correct on 2-digit addition, so it seems that they mostly "remember."


But does it ever hallucinate the need to carry the one when it shouldn't?


As abergal wrote, not carrying the "1" can simply mean it does digit-wise addition (which seems trivial via memorization). But notice that just before that quote they also write:

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized.

That seems like evidence against memorization, but maybe their simple search failed to find most cases with some relevant training signal, eg: "In this diet you get 350 calories during breakfast: 200 calories from X and 150 calories from Y."
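For reference, the paper's spot-check amounts to a verbatim substring search over the training text, something like the sketch below (function and corpus are made up for illustration). As the comment notes, this would miss arithmetic facts stated in any other phrasing.

```python
def count_matches(problems, corpus):
    """Count test problems appearing verbatim in the training text, in
    either of the two forms the paper searched for."""
    hits = 0
    for a, b in problems:
        patterns = (f"{a} + {b} =", f"{a} plus {b}")
        if any(p in corpus for p in patterns):
            hits += 1
    return hits

corpus = "trivia: 123 plus 456 is 579. also 12 + 7 = 19."
print(count_matches([(123, 456), (800, 801)], corpus))  # 1
```

The "350 calories" example above is exactly the kind of implicit training signal a search like this cannot detect.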

To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion parameter model improving by over 10% compared to the 13 billion parameter model.

This seems like a data point in favor of Yudkowsky's old argument about crossing the human range. I wonder what the standard deviation is for humans answering SAT questions like this; I would guess it is something like 10 percentage points (though probably with a non-normal distribution?) So in this case at least, it looks like all they had to do to get a human-standard-deviation of improvement was add another order of magnitude of compute.
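The back-of-the-envelope version of this, using the guessed (not measured) 10-percentage-point human standard deviation:

```python
# How far each GPT-3 setting sits from the human mean, in units of an
# ASSUMED 10-point human SD on SAT analogies. The SD is a guess from the
# comment above, not a figure from the paper.
human_mean, assumed_sd = 57.0, 10.0
scores = {"zero-shot": 53.7, "one-shot": 59.1, "few-shot": 65.2}
for setting, s in scores.items():
    print(f"{setting}: {(s - human_mean) / assumed_sd:+.2f} SD vs. human mean")
```

Under that assumption, one order of magnitude of scale (13B to 175B, >10 points) indeed moves the model by roughly one human standard deviation.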

On the other hand, this still looks more like a service than part of a path towards general intelligence, even if it's a very broad, flexible, and fairly general service. For example, I don't expect GPT-3 to come up with things to do on its own, only to do the things it is asked to (although I'm sure there's some interesting wandering that can happen by applying GPT-3 recursively).


The obvious thing to do here is to plug it into a DRL agent. Something like learning from instructions or from game manuals: Nethack was recently packaged up, so imagine finetuning GPT-3 on the Nethack wiki and then providing text embeddings from GPT-3 to MuZero or Agent57 etc.

One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources. A small team has been assigned to the initial effort, with an expectation that other teams, along with their work, will eventually fold in. On the day it was announced at an all-company meeting, interns weren’t allowed to attend. People familiar with the plan offer an explanation: the leadership thinks this is the most promising way to reach AGI.

As Schmidhuber put it: "one model to rule them all". Cross-modal learning ought to be much more efficient and give even more human-like reasoning eg GPT-3 is a text-only self-supervised world-model; being unimodal (so no visual transfer from SimCLR or other recent highly-successful image self-supervision) and not benefiting from any RL loops, it has a lot of weaknesses, but it's a start.

Between the continued scaling, how scaling/pretraining produces ever more human-like systems in terms of performance/adversarial-examples, cross-modal learning, transfer learning working in RL, self-supervised learning suddenly crushing it, the potential of brain imitation learning, the next decade is very exciting indeed (contra predictions that DL will surely top out any time - real soon now, just you wait and see). One can easily imagine a multi-headed architecture where a multimodal GPT-3-like module, trained by self-supervised learning on large text and image and video datasets (like VideoBERT), feeds into a trunk with modules for ALE, DMLab, Dactyl robot arm etc, doing per-task MuZero-style policy-learning+planning, collecting new experience which is fed back into the self-supervised model, enabling it to do zero-shot tasks in games or robotics or text generation from video or text inputs, learning extremely sample-efficiently (and the more so the more tasks it trains on)...

We are increasingly limited mostly by researchers' ability to actually write and tweak and integrate these darn things.


Doesn't seem too hard. Here's a DM example tweeted about today: (videos).

An HN commenter, unsure about the meta-learning generalization claims, writes that OpenAI has a "serious duty [...] to frame their results more carefully"

So, the paper title is "Language Models are Few-Shot Learners" and this commenter's suggested "more conservative interpretation" is "Lots of NLP Tasks are Learned in the Course of Language Modeling and can be Queried by Example." Now, I agree that version states the thesis more clearly, but it's pretty much saying the same thing. It's a claim about properties fundamental to language models, not about this specific model. I can't fully evaluate whether the authors have enough evidence to back that claim up but it's an interesting and plausible idea, and I don't think the framing is irresponsible if they really believe it's true.


I'd like to see them using the model to generate the problem framing which produces the highest score on a given task.

Even if it's just the natural language description of addition that comes before the addition task, it'd be interesting how it thinks addition should be explained. Does some latent space of sentences one could use for this fall out of the model for free?

More generally, a framing is a function turning data like [(2,5,7), (1,4,5), (1,2,_)] into text like "Add. 2+5=7, 1+4=5, 1+2=", and what we want is a latent space over framings.
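A minimal concrete instance of such a framing function, using the addition example from the comment (all names here are illustrative):

```python
def addition_framing(data):
    """Turn tuples like [(2,5,7), (1,4,5), (1,2,None)] into prompt text;
    None marks the slot the model is asked to fill in."""
    lines = ["Add."]
    for a, b, c in data:
        rhs = "" if c is None else str(c)
        lines.append(f"{a}+{b}={rhs}")
    return " ".join(lines)

print(addition_framing([(2, 5, 7), (1, 4, 5), (1, 2, None)]))
# Add. 2+5=7 1+4=5 1+2=
```

A latent space over framings would then parameterize choices like the instruction text, the separator, and the `a+b=c` template, and search that space for the framing that maximizes task score.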

More generally, I expect that getting the full power of the model requires algorithms that apply the model multiple times. For example, what happens if you run the grammar correction task multiple times on the same text? Will it fix errors it missed the first time on the second try? If so, the real definition of framing should allow multiple applications like this. It would look like a neural net whose neurons manipulate text data instead of number data. Since it doesn't use weights, we can't train it, and instead we have to use a latent space over possible nets.
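The multiple-application idea above can be sketched as iterating a text-to-text function to a fixed point. The "model" here is a trivial stand-in stub (a dictionary of spelling fixes), not a real language model; with an actual model, later passes could catch errors the first pass missed.

```python
def stub_correct(text):
    """Stand-in for a grammar-correction model: fixes a few known typos."""
    fixes = {"teh": "the", "recieve": "receive"}
    return " ".join(fixes.get(w, w) for w in text.split())

def fixed_point(f, text, max_rounds=5):
    """Apply f repeatedly until the output stops changing (or a round
    limit is hit) -- the 'run the correction task multiple times' idea."""
    for _ in range(max_rounds):
        new = f(text)
        if new == text:
            return text
        text = new
    return text

print(fixed_point(stub_correct, "teh cat will recieve teh prize"))
# the cat will receive the prize
```

Composing and iterating framings like this is what the comment's "neural net whose neurons manipulate text" would look like at its simplest.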