GPT-3 Fiction Samples

by gwern · 25th Jun 2020 · 1 min read

This is a linkpost for https://www.gwern.net/GPT-3

15 comments, sorted by top scoring
[-] Kaj_Sotala · 5y · 13

Okay, my intuitions for AI timelines just shifted to put considerably more probability on the short end.

[-] Quintin Pope · 5y* · 12

Same. Specifically, I went from predicting 50% chance of human-level AGI within 40 years to 50% chance within 10 years.

Andrew Mayne was also given access to the GPT-3 API. You can read his impressions here: https://andrewmayneblog.wordpress.com/

I found his results very impressive as well. For example, he's able to prompt GPT-3 to summarize a Wikipedia article on quantum computing at either a second-grade or an eighth-grade reading level, depending on the prompt.
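The trick is just how the prompt frames the audience; something like this (paraphrased from memory, not Mayne's exact wording):

```python
# Paraphrase of the prompt style Mayne describes, not his exact text.
# Swapping "second grader" for "eighth grader" changes the reading level.
prompt = (
    "My second grader asked me what this passage means:\n"
    '"""\n'
    "<Wikipedia passage on quantum computing goes here>\n"
    '"""\n'
    "I rephrased it for him, in plain language a second grader can "
    "understand:\n"
    '"""\n'
)
```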

I actually put together a presentation on GPT-like architectures and their uses for my advisor: https://docs.google.com/presentation/d/1kCJ2PJ_3UteHBX5TWZyrF5ontEdNx_B4vi6KTmQmPNo/edit?usp=sharing

It's not really meant to be a standalone explanation, but it does list some of GPT-2/3's more impressive abilities. Having compiled the presentation, I think we'll look back on GPT-3 as the "Wright brothers" moment for AGI.

Consider: this post suggests GPT-3 cost ~$4.6 million to train: https://lambdalabs.com/blog/demystifying-gpt-3. It would be well within Google/Microsoft/Amazon/DoD/etc.'s budget to increase model size by another 2 (possibly 3) orders of magnitude. Based on the jump in GPT-3's performance going from 13B parameters to 175B parameters, such a "GPT-4" would be absolutely stunning.
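For a rough sense of what that would cost, here's a naive back-of-envelope (my own numbers; it assumes cost scales linearly with parameter count, which the scaling-laws picture suggests actually understates the compute needed, since training data should grow too):

```python
# Naive back-of-envelope: assumes training cost scales linearly with
# parameter count. Real cost also grows with training data, so treat
# these as lower bounds.
gpt3_params = 175e9   # parameters
gpt3_cost = 4.6e6     # USD, per the Lambda Labs estimate above

for factor in (100, 1_000):  # +2 and +3 orders of magnitude
    cost = gpt3_cost * factor
    print(f"{gpt3_params * factor:.2e} params -> ~${cost / 1e9:.2f}B")
# 1.75e+13 params -> ~$0.46B
# 1.75e+14 params -> ~$4.60B
```

Even the naive 3-orders-of-magnitude figure lands around $4-5B: large, but not obviously out of reach for the organizations listed.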

[-] Daniel Kokotajlo · 5y · 3

On the bright side, according to OpenAI's scaling laws paper, GPT-3 is about the size that scaling was predicted to start breaking down. So maybe GPT-4 won't actually be better than GPT-3. I'm not counting on it though.

[-] gwern · 5y · 9

It's possible that GPT-3 is roughly where the maximally naive simple text LM begins to hit the constant wall, but I don't regard this as important; as I emphasize at every turn, there are many distinct ways to improve it greatly using purely known methods, never mind future research approaches. The question is not whether there is any way GPT-4 might fail, but whether there is any way in which it might succeed.

[-] Charlie Steiner · 5y · 2

There's a typo in your Andrew Mayne link, but thanks for linking it - that's wild!

[-] Pattern · 5y · 2

https://andrewmayneblog.wordpress.com/

[-] Quintin Pope · 5y · 1

Thanks, fixed.

[-] [anonymous] · 5y* · 2 · [comment deleted]
[-] Raemon · 5y · 5

My own take (not meant to be strong evidence of anything, mostly just kinda documenting my internal updating experience):

I had already updated towards fairly shortish timelines (like, maybe 20% chance of AGI in 20 years?). With GPT-3 I initially had a surge of "AAAAUGH, maybe the end times are right around the corner", but I think that was mostly unwarranted. (At least, GPT-3 didn't seem like new information; it was roughly what I'd have expected GPT-3 to be like, and insofar as I'm updating shorter, it seems like I just made a mistake last year when first evaluating GPT-2.)

I'm also interested in more of Kaj's thoughts.

[-] Kaj_Sotala · 5y · 3

My largest update came from the bit where it figured out that it was expected to produce Harry Potter parodies in different styles. Previously GPT had felt cool, but basically like a very advanced version of a Markov chain. But the HP thing felt like it would have required some kind of reasoning.

[-] [anonymous] · 5y* · 1 · [comment deleted]
[-] Kaj_Sotala · 5y · 6

I'm not sure how exactly reasoning should be defined and whether that part really requires reasoning or not. But if it's just very advanced and incredible recognition and mimicry abilities, it still shifts my impression of what can be achieved using just advanced and incredible recognition and mimicry abilities. I would previously have assumed that you need something like reasoning for it, but if you don't, then maybe the capacity for reasoning is slightly less important than I had thought.

[-] fiddler · 5y · 4

I'm really curious to see some of the raw output (not curated), to try to get an estimate of how many oysters you have to pick through to find the pearls. (I'm especially interested w.r.t. the essay-like things: the extension of the essay on assertions was by far the scariest and most impressive thing I've seen from GPT-3, because the majority of its examples were completely correct, and it held a thesis for the majority of the piece.)

On a similar note, I know there have been experiments using either a differently-trained GPT or other text-prediction models to try to score and collate GPT-3 output. I wonder (a) whether the best-of functionality could be used for something like this with some tweaks, and (b) whether there would be a way to embed a simple reasoning framework into the best-of instead of scoring with GPT-3, so that the resulting pieces were scored on their logical sensibility instead of text quality, given that the text quality already seems to be universally acceptable. Encoding seems like the barrier here, but it might not be completely impossible, especially because raw->tagged data processors exist.

[-] gwern · 5y* · 3

> I'm really curious to see some of the raw output (not curated)

You can read the random sample dump to get an idea of that, or Max Woolf's repo (both of which I link near the beginning). I'm not doing that for any of my prompts because right now the Playground is just way too much of a pain, and errors out too regularly, to make it feasible to generate, say, 100 1024-token completions for a specific prompt. I would need to get set up with the Python library for the API, and I've been busy exploring prompts & writing them up rather than programming.
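For reference, the setup would only be a few lines (a sketch against the OpenAI Python client as it worked circa 2020, assuming API access; the engine name, prompt, and sampling settings here are placeholders, not the ones I use):

```python
import openai  # pip install openai; API access required

openai.api_key = "sk-..."  # your key here

# Batch-generate uncurated completions for one prompt, instead of
# clicking through the Playground by hand.
completions = []
for _ in range(10):  # 10 batches of 10 = 100 samples
    response = openai.Completion.create(
        engine="davinci",            # placeholder engine name
        prompt="PROMPT GOES HERE",   # the fiction prompt of interest
        max_tokens=1024,
        n=10,
        temperature=0.8,
    )
    completions.extend(choice.text for choice in response.choices)
```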

> On a similar note, I know there have been experiments using either a differently-trained GPT or other text-prediction models to try to score and collate GPT-3 output. I wonder (a) whether the best-of functionality could be used for something like this with some tweaks

Yes, best-of ranking like Meena's is basically just a ranker which happens to use the same model, scoring by the total likelihood of the final sample completion. It works because a finished sample may have a different (better) total likelihood than its partial completions would indicate; and if you greedily maximize likelihood, you immediately fall into repetition traps, while quasi-random (but still local) samples of the tree avoid those very-high-likelihood traps in favor of sensible but still high-likelihood completions.
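Concretely, the whole scheme fits in a few lines; here's a sketch with GPT-2 via Hugging Face's transformers (GPT-3 itself can't be run locally; model size and sampling settings are illustrative): sample a pool of quasi-random completions, then keep the one whose total likelihood under the same model is highest.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def total_log_likelihood(text: str) -> float:
    """Total (not per-token) log-likelihood of `text` under the model."""
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)

prompt = "Once upon a time,"
inputs = tokenizer.encode(prompt, return_tensors="pt")
# Quasi-random sampling, NOT greedy decoding -- greedy maximization
# is what falls into the repetition traps.
pool = model.generate(
    inputs,
    do_sample=True,
    top_p=0.9,
    max_length=80,
    num_return_sequences=8,   # the best-of pool
    pad_token_id=tokenizer.eos_token_id,
)
texts = [tokenizer.decode(s, skip_special_tokens=True) for s in pool]
best = max(texts, key=total_log_likelihood)  # rank by total likelihood
print(best)
```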

Preference learning would be nice, but at least for GPT-2 it didn't work too well for me. I don't know if you could finetune a sanity-checking GPT-3 by doing something like flipping texts to generate logical vs illogical completions.

[-] emanuele ascani · 5y · 1

I searched for whether there was a funny cat video called "what do you mean, Fetch?" and I found this. (Not that it was necessary for the meaning, though; sorry if this is noise.)

[+] [comment deleted] · 5y · 1
[+] [comment deleted] · 5y · 1
Mentioned in: Collection of GPT-3 results