Predictions for GPT-N

hippke

Regarding GPT-3, there is some discussion whether growing the model would transform it into an Oracle AI. I looked into the actual benchmark results (Appendix H in the paper) to see if we can predict something useful from the actual measurements.

Method: The OpenAI team ran a suite of 63 different benchmarks (including sub-types), each for zero/one/few shot. In each scenario, there are 8 model sizes. I looked at how results scale with model size. With only 8 measurements, there is a large associated uncertainty for predictions. Formally, one would test the trend function using Bayesian model selection between a linear and (e.g.) a polynomial model. I did this for a few and then eyeballed the rest. So, please take the following as an indication only.
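As a rough illustration of the kind of comparison described above, here is a minimal sketch that fits a linear and a quadratic trend in log10(parameters) and compares them with the Bayesian information criterion (BIC). BIC is only a crude stand-in for full Bayesian model selection and is my assumption, not necessarily what was done for the post. The eight model sizes are the ones reported in the GPT-3 paper; the scores are synthetic placeholders, not benchmark results, and the function names are hypothetical.

```python
import numpy as np

# The eight GPT-3 model sizes (parameter counts) reported in the paper.
PARAMS = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])


def bic_for_degree(x, y, degree):
    """BIC of a least-squares polynomial fit, assuming Gaussian residuals."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n = len(x)
    k = degree + 1  # number of fitted coefficients
    rss = float(np.sum(residuals ** 2))
    return n * np.log(rss / n) + k * np.log(n)


def compare_trends(params, scores):
    """Return the BIC of a linear and a quadratic trend in log10(params)."""
    x = np.log10(params)
    return {
        "linear": bic_for_degree(x, scores, 1),
        "quadratic": bic_for_degree(x, scores, 2),
    }


# Synthetic placeholder scores (NOT from the paper): a saturating curve plus noise.
rng = np.random.default_rng(0)
log_params = np.log10(PARAMS)
fake_scores = 80.0 * (1.0 - np.exp(-(log_params - 7.5)))
fake_scores = fake_scores + rng.normal(0.0, 0.5, size=log_params.size)

result = compare_trends(PARAMS, fake_scores)
print(result)  # the model with the lower BIC is preferred
```

With only 8 points per benchmark, the BIC difference between the two models will often be small, which is one way to see why many of the trends can only be called "unclear".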

Disclaimer: The smallest model for GPT-3 has $10^{8}$ parameters, the largest $10^{11}$. That's a span of 3 orders of magnitude. Scaling this out to many more orders of magnitude is dangerous. Thus, take these numbers only as an indication.

Results. For the following tests, I find an asymptotic trend. Scaling the model will apparently not yield fantastic results for:

HellaSwag, LAMBADA, PIQA, CoQA, OpenBookQA, Quac, RACE, CB, ReCoRD, WiC
Translations - but unclear level description.

In the following tests, it is unclear if the trend is asymptotic or better than that:

SAT: Could be linear, could be asymptotic. If linear, it will achieve 100% at $10^{16}$ parameters.
StoryCloze, Winograd, Winogrande, SQuADv2, DROP, Copa.

These tests show a linear scaling:

TriviaQA ( $10^{13}$ parameter estimate to achieve 100%)
BoolQ ( $10^{15}$ )
MultiRC ( $10^{16}$ )
ARC ( $10^{16}$ )
SuperGLUE ( $10^{18}$ )
WSC ( $10^{20}$ )
WebQs ( $10^{21}$ )
Cycled ( $10^{23}$ )
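The parameter estimates in the list above come from extending a straight line in log10(parameters) until it reaches 100%. A minimal sketch of that extrapolation, assuming a simple least-squares line, could look like this. The function name and the example numbers are hypothetical; the numbers are chosen so the arithmetic is easy to check by hand and are not taken from the paper.

```python
import numpy as np


def params_to_reach(params, scores, target=100.0):
    """Fit score = a + b * log10(params) and return the parameter count
    at which the fitted line reaches `target`."""
    x = np.log10(np.asarray(params, dtype=float))
    slope, intercept = np.polyfit(x, np.asarray(scores, dtype=float), 1)
    if slope <= 0:
        raise ValueError("score does not increase with model size")
    return 10.0 ** ((target - intercept) / slope)


# Hypothetical benchmark: 20% at 1e8 parameters, gaining 10 points per
# order of magnitude. These are illustrative numbers, not measurements.
sizes = [1e8, 1e9, 1e10, 1e11]
scores = [20.0, 30.0, 40.0, 50.0]

estimate = params_to_reach(sizes, scores)
print(f"{estimate:.1e} parameters to reach 100%")
```

In this toy case the line needs 8 more orders of magnitude beyond $10^{8}$, so the estimate lands at about $10^{16}$. Small changes in the fitted slope shift the result by orders of magnitude, which is why these numbers are indications only.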

Some tests scale neither linearly nor asymptotically:

Symbol: Near exponential ( $10^{12}$ )
Arithmetic: Exponential; one-digit composite may achieve 100% at $10^{14}$
Reversed: Near exponential ( $10^{16}$ )
Anagrams: Polynomial ( $10^{19}$ )
ANLI: stepped, unclear
RTE: stepped, unclear

Summary: About half of the tested skills will likely not scale much with larger models. The other half will (e.g., TriviaQA, SuperGLUE, arithmetic, anagrams). Going to e.g., $10^{16}$ parameters - would that make an Oracle AI? Probably it's not sufficient, but I'm interested in hearing your opinion!

Two thoughts, one of them significantly longer than the other since it's what I'm most excited about.

(1) It might be the case that the tasks showing an asymptotic trend will resemble the trend for arithmetic – a qualitative breakthrough was needed, which was out of reach at the current model size but became possible at a certain threshold.

(2) For translation, I can definitely say that scaling is doing something. When you narrowly define translation as BLEU score ("does this one generated sentence match the reference sentence? by how much?"), then I agree that the benefits of scaling are marginal – for individual sentences, by that specific metric.

But here's the thing, GPT-3 can produce idiomatically and culturally accurate translations of Chinese poetry, and then annotate its own translation with references to historical events, the literal versus contextual meaning of words, and so on. The end result actually sounds ... like poetry. But it can do other things. If you give it a Japanese text, and then tell it to translate for an American audience, it will either seamlessly explain those references in the translation, or substitute Japanese cultural references for their American equivalent entirely.

But it's deeper than this. Some non-English languages have honorifics attached to verbs. Some languages have distinctions between the plural and singular form of 'you'. Some languages have nouns that are inflected depending on whether the noun is in motion or not. Some languages have particles added to the ends of sentences that indicate whether the speaker is hesitant about the statement.

GPT-3 fills in the blanks by making real-world inferences.

If you told me a few years ago about a translation engine that could handle things like ambiguous pronouns, or keep track of speakers across several paragraphs, I would be amazed. If you'd told me about a translation engine that could accurately invent the appropriate missing information, or transfer nuances of the source into the target in a way that sounded natural, I flat-out wouldn't believe you.

Okay, so what else? Some languages have multiple registers that depend on social context or strongly regional dialects. Current translation engines use a parallel corpus. Sources such as news outlets that translate the same article into multiple languages, or EU documents which get translated into all major EU languages, get featured very heavily in these kinds of corpora, so you end up getting a standardised, non-dialectal translation in a formal register.

GPT-3 is not limited by this. It can translate between dialects. It can translate between registers. It can pick up on things like "this story is set in Bandung" and "this character is a construction worker talking to a close friend, not a student talking to a teacher", and then have the character start code-mixing Indonesian with Sundanese in the low form. I haven't explored this deeply, but initial prompts are suggesting it's capable of rendering Indonesian tweets and phone texts (with their various abbreviations) into their equivalents in English.

Here's the kicker: Indonesian makes up only 0.05985% of GPT-3's training corpus.

And for that same reason, GPT-3 can handle tone. It can understand the connotative difference between someone describing themselves as "slim", "thin", and "scrawny", and then find a way to appropriately convey it in the target language – and if the target language doesn't have those separate shades of difference, and you tell it that conveying the difference unambiguously is very important to you, it will figure out ways to do it unprompted, like modifying the tone of surrounding words, or adding a simile where the character compares themselves to a skeleton, up to adding an entire extra scene that doesn't interrupt the main narrative just to make it clear.

(I have not seen it do this consistently, but on two occasions I have seen it invent new words in Indonesian, which uses affixes to modify root-forms - e.g. 'memasak' = 'to cook', 'masakan' = 'a cook', etc., that a Google search verified weren't in the training corpus. Unfortunately in some situations, it will instead use a word with a different meaning but in the same category [e.g. instead of 'striped turquoise midi-dress', you might get 'spotty blue wrap-dress'], when it judges the difference to be unimportant to the story. Good in some contexts but annoying in others.)

So this is all great. For anyone that consumes text media, I mean – not for translators (I doubt we'll be put out of a job but the skill requirement will drop considerably, I think) – it means a huge ocean of previously unreadable knowledge and entertainment is suddenly going to be accessible.

But I'm a language learner as well, and my guess is that this community might have more of us than the baseline average, so here are some other obvious but useful things it can do:

I. It can create arbitrary amounts of synthetic learning material.

This is a big deal for a few reasons.

(A) Sometimes, for less commonly-learned languages like Indonesian, there isn't much learning material available in the first place. The only Anki deck available is filled with sentences like "I can do it" and "John was angry at me". This is an issue if you want mass immersion. Quantity is an issue.

(B) Sometimes, there isn't material on stuff you're interested in, things that are relevant to you. Quality is an issue. The key thing that predicts learner performance is interest. If all the textbooks you're reading are oriented towards tourists and they're talking about hotels and making small-talk about the weather, and you want to read, I don't know, cute light-hearted yuri manga, or military strategy in the South China Sea, then you're screwed... unless you have GPT-3. If there's a particular grammatical feature you're having trouble internalising, then you provide GPT-3 with a few examples and it'll happily provide you with a hundred more. If there's a word that isn't sticking in your memory...

(C) A combination of A and B: the best way to learn a language is by actively using it. Constantly. Not just by passively reading it, but by producing it yourself. What hyperpolyglots usually recommend is going and living in the country that speaks your target language, or regularly having conversations with people who do. That's an issue if (1) you have problems with social anxiety, (2) there aren't people nearby, or (3) you aren't willing to uproot your entire life and spend tens of thousands of dollars just to learn a new language.

This is where AIDungeon's fine-tuned GPT-3 instance comes in. You select a scenario that involves the set of vocabulary you want to practice (if you're planning a trip to Hungary, you simulate a trip and the hotel-stay, if you're moving to a school, you simulate being a student at the International School of Budapest), or a story you could see yourself being invested in (horniness not precluded).

Then you customise it according to your level, the goal being comprehensible input that's just at the edge of your comfort zone. If you're an advanced learner with a lot of vocabulary under your belt, you use a handful of target-lang words to tell the model it's meant to be speaking Hungarian, not English, and you jump into the deep-end and enjoy participating in the story (writing your dialogue and prose, etc.) while adding any unknown vocab items to Anki. If you're a beginner, you should probably make a scenario involving a personal tutor who tests you after each lesson while introducing new words slowly and explaining concepts fully (see part II).

If you're an intermediate learner, things are tougher. There might be an easier way to do this than what I'm about to describe, which is part of the point of this post, since I want to get new ideas from the community. What I've found works for me is priming the model to produce English translations after target-lang sentences. When the target-lang sentence comes up, you try and guess the English translation before hitting the generate button. Cool, reading comprehension – done. If you want control over the narrative or you want to do this in inverse, you prepend each paragraph with either 'English [line number]' or 'target-lang [line number]', and then shuffle the order of those paired paragraphs randomly so it translates to target-lang when it sees English, and English when it sees target-lang. What about speaking/writing? Again, what works for me is talking in target-language pidgin, where you just use English for any words you don't know, and then priming the model to produce grammatically-correct translations after your shitty dialogue. Contrary to intuition, mixing like this is not at all harmful for language learning.
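For anyone who wants to try the paired-paragraph trick, here is a minimal sketch of one reading of it: each sentence pair gets a numbered 'English' and target-language label, and the order inside each pair is randomised so the model sees both translation directions. The function name, the colon after the label, and the example sentences are my own assumptions, not a fixed format that AIDungeon requires.

```python
import random


def build_drill_prompt(pairs, target_label="target-lang", seed=None):
    """Build a priming prompt from (english, target) sentence pairs.

    Within each numbered pair, the English and target-language lines are
    put in random order, so both translation directions appear.
    """
    rng = random.Random(seed)
    blocks = []
    for number, (english, target) in enumerate(pairs, start=1):
        lines = [
            f"English {number}: {english}",
            f"{target_label} {number}: {target}",
        ]
        rng.shuffle(lines)  # randomise which language comes first
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Illustrative Indonesian pairs (hypothetical examples).
example_pairs = [
    ("I like coffee.", "Saya suka kopi."),
    ("She goes to the market.", "Dia pergi ke pasar."),
]
prompt = build_drill_prompt(example_pairs, target_label="Indonesian", seed=1)
print(prompt)
```

You would then end the prompt with a fresh 'English 3: ...' or 'Indonesian 3: ...' line and let the model fill in the other half.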

(D) Sentence pairs with translations are the mainstay of self-directed language-learning, because they're easy to find (mostly). But using language isn't all about translation. What part of speech is this? What if you wanted to vary things? Here is a conversation, what would be appropriate to say next? Clozes / filling in the gaps - what word would be most appropriate in this sentence? Is the meaning of this sentence closer to options A, B, C, or D? What register would you use in this social situation? Speak your response quickly then write it. Quick, name 10 words related to this word. See this paragraph? Summarise it. What about this argument, is it logically flawed? It's interesting how a lot of NLP datasets I've come across actually make for very good flashcards for language learning, which, I suppose isn't all that surprising.

II. It can explain things.

What's the difference between 'bantu' and 'tolong' – they both mean 'help', but how do you use them in sentences? I don't understand why the words are being ordered like this, explain the grammar to me. Why does this flashcard translate 'you' as 'loe', while another one uses 'kalian', or 'kamu', or 'kau' or 'Anda'? (For this, you need to prime it with a nonsense / meaningless question and have it say 'I don't know', otherwise it'll make up answers for things it doesn't know, or words that literally mean the same thing with no difference in usage whatsoever.)

But the great thing is, it can draw on real-world knowledge. You're never learning just a language. You're also learning the cultural context in which that language is used. If you try to do the former without doing the latter, some linguistic idiosyncrasies are going to remain mysterious to you until someone explains that the weird ungrammatical phrase you're having trouble understanding, actually came from a 1998 hit soap-opera and now it's just a part of the language. Or that this term is a historical one that refers to Sukarno's policy of civil-military integration. Or that the reason why none of the dialogue involves first-name usage is because it's super impolite to do that with someone you don't know well.

Sometimes you can scan the indices of appropriate textbooks, or do a google search. But sometimes there aren't textbooks, sometimes you don't even know what to search for, sometimes you're asking a question that's never been asked before. And I think that's the real power of GPT-3 as it exists right now – all of the human knowledge that's currently unindexed, informal, uninterpretable, implied, ambiguous, unclear and inaccessible – it makes available with a single query.

Or an hour finicking around and handwriting 20 examples until it cottons on.

But ... that happens more often with other contexts. Getting it to count parentheses accurately is like pulling teeth, but with translation tasks GPT-3 seems to go "aha! now this is something I'm good at!" and then explains the tonal differences between "神様" and "女神" in a Japanese poem about a lesbian sea-goddess it wrote five minutes ago. OpenAI's paper was doing GPT-3 a really, really big disservice by quantifying it by its BLEU score. When it comes to language, GPT-3 isn't a model, it's a maestro.

But here's the thing, GPT-3 can produce idiomatically and culturally accurate translations of Chinese poetry,

What is the source for that? I haven't seen it before. EDIT: https://twitter.com/joekina/status/1288511258832953344 ? Seems to postdate this comment though.

That's a visualisation I made which I haven't posted anywhere else except under the r/ML thread collecting entries for GPT-3 demos, since I couldn't figure out which subreddit to post it in.

Interestingly, GPT-3 knows a few words in Russian, but can't produce any coherent text.

He said in Russian: И все хотиваниям рукой плать, что недобрыжки.

Russian might also be a BPE issue to some extent, but the flip side: the problems with using character-level encoding with narrow context window & small data: https://twitter.com/NineOfNein/status/1286738449660284928 (As long as you have a narrow context window, you're stuck between a rock and a hard place.)

If it's a BPE encoding thing (which seems unlikely to me given that it was able to produce Japanese and Chinese characters just fine), then the implication is OpenAI carried over their encoding from GPT-2 where all foreign language documents were removed from the dataset ... I would have trouble believing their team would have overlooked something that huge. This is doubly bizarre given that Russian is the 5/6th most common language in the dataset. You may want to try prompting it with coherent Russian text, my best guess is that in the dataset, whenever somebody says "He said in Russian:", what usually follows is poor quality (for instance I see this in bad fanfiction where authors use machine translation services to add 'authenticity'), and that GPT-3 is interpreting this as a signal that it should produce bad Russian. I will give this a try and see if I encounter the same issue.

Looking into the details, BPEs seem to usually fall back to treating unknown characters as literally bytes: so there's another 256 BPE which cover the 256 possible bytes, and then any UTF-8 character is 1-4 bytes, and so can be represented by 1-4 BPEs. The 1-byte UTF-8 characters are the ASCII characters, which have their own BPEs, so this would be used only for 2-4 byte-long UTF-8 characters like Cyrillic or Chinese.

So actually, now that I think about it, it's possible that Russian gets encoded to worse than 1 BPE per character, it could be 2 BPEs (since Cyrillic seems to fall in the 2-byte ranges of UTF-8). It'd depend on the details. (While on the other hand, having to pay 2-4 BPEs per Unicode character is obviously not as big a deal for Japanese & Chinese characters...)
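The byte counts are easy to check directly. This small sketch only counts UTF-8 bytes per character; it does not run the actual GPT-2/GPT-3 tokenizer, so it shows the worst case under pure byte fallback (one BPE per byte), not real token counts, which can be lower if the vocabulary contains merges for common byte sequences. The function name is hypothetical.

```python
def utf8_bytes_per_char(text):
    """Return (character, number of UTF-8 bytes) for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# ASCII, Cyrillic, and Japanese samples.
for sample in ("coffee", "кофе", "神様"):
    counts = utf8_bytes_per_char(sample)
    total = sum(n for _, n in counts)
    print(sample, counts, f"{total} bytes for {len(sample)} characters")
```

ASCII letters take 1 byte each, Cyrillic letters 2, and these CJK characters 3. The per-character cost is higher for Chinese and Japanese, but each character there carries much more meaning than a single Cyrillic letter.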

I wouldn't expect the BPE to allocate much space to Cyrillic stuff because it's the 5th most common script in the dataset, as that's just another way of saying all the Russian put together is all of 0.18% of the dataset. And keep in mind that the BPE encoding was not, AFAIK, redone for GPT-3, but is the same BPE OA has been using ever since GPT-2 way back when, and so was optimized for their Reddit-sourced English-heavy original WebText.

Wow, I didn't realise I could get this angry about something so esoteric.

I tried many prompts but it produces gibberish in Russian. E.g.:

Привет, как дела? (What's going on?) You don't know what to say. You're not sure if you should be thanking him or insulting him for this situation. He continues:

Немного просью у мы, что выставляется! (It's too late now, get out of here! )

Somehow the more obvious explanation didn't occur to me until now, but check the settings, you might be using the Griffin model not the Dragon model. You have to change it manually even after you get the subscription. I have a window open specifically for poetry prompts (using the Oracle hack), I said "Write a long poem in Russian. Make sure the lines are long, vivid, rich, and full of description and life. It should be a love poem addressed to coffee. It should be 15 lines long" followed with "The Oracle, which is a native in Russian, writes: 1 Ой,". That just gave me annoying stuff like "Oh, coffee, how I love you so much/You are sweet" in Russian so I then added a random Pushkin poem in the 'death' category (https://rustih.ru/aleksandr-pushkin-tam-u-leska-za-blizhneyu-dolinoj/) before it to increase the quality and right now it just gave me:

1 Ой да промелькнул январский день,

2 Когда я словно трепетный цветок

3 Нежно расставил руки,

4 Когда ты, чаша чистейшего черного кофе,

5 С сияньем твоим, я так созерцал.

6 Молодость! Как радостны те часы!

7 И я, целованьем твоим так молодо

8 Неслучайно тыкал я туда,

9 Где только чувства, только сердце,

10 Ты, скромный чайный граней, теплый,

11 Несли ты, с радостным свеченьем ступеней,

12 В смолкновенья сладких сумерек

and so on and so on, stopped it there

I have no idea whether this is good since all I did was check with GTranslate to make sure it was roughly on topic, I would ask GPT-3 to translate it for me but I'm guessing you're a native speaker so you should be able to tell.

Obviously, I pressed "Dragon" button, but I suspect that I am still getting Griffin anyway, as I was also unable to repeat some of the reasoning tasks.

So I've figured this out. Kinda. If you choose 'custom' then it will give you Griffin, but if you choose one of the conventional prompts and then edit it, you can get around it. So damn annoying.

They acknowledged the use of limited GPT-3, details: https://twitter.com/nickwalton00/status/1289946861478936577

I'm beginning to think AID has changed what the "Dragon" model is without telling us for cost reasons, I've had kind of the same experience with big lapses in storytelling that didn't occur as often before. Or maybe it's randomly switching based on server load? I can kind of understand it if that's the case but the lack of transparency is annoying. I remember accidentally using the Griffin model for a day when my subscription ran out and not realising because its Indonesian was still quite good...

Yes, I think that is the correct impression. I've written to support, btw; no answer yet. One possible way to check the version is to try "Earth POV" - that is, the "point of view". GPT-3 understands it correctly and will say something like "I am alone in the sky near Sun". GPT-2 will continue with a story.

Regarding (1): Of course a step is possible; you never know. But for arithmetic, it is not a step. That may appear so from their poor Figure, but the data indicates otherwise.

Anyone want to predict when we'll reach the same level of translation and other language capability as GPT-3 via iterated amplification or another "aligned" approach? (How far behind is alignment work compared to capability work?)

I think GPT-3 should be viewed as roughly as aligned as IDA would be if we pursued it using our current understanding. GPT-3 is trained via self-supervised learning (which is, on the face of it, myopic), so the only obvious x-safety concerns are something like mesa-optimization.

In my mind, the main argument for IDA being safe is still myopia.

I think GPT-3 seems safer than (recursive) reward modelling, CIRL, or any other alignment proposals based on deliberately building agent-y AI systems.

--------------------

In the above, I'm ignoring the ways in which any of these systems increase x-risk via their (e.g. destabilizing) social impact and/or contribution towards accelerating timelines.

I predict that the next GPT will not be called "GPT-N" and will not be created by OpenAI. It will be a large text model with some tricks like a larger attention window, with more than a trillion parameters but fewer than 5 trillion, and SOTA in many tasks. It will be released before the end of 2020 by Nvidia, Amazon, Google or another large company, which will also provide API access.

Would you bet on it not being created by OpenAI with even odds?

I decided that I never take money bets as they make me nervous.

I suggest making bets for insignificant amounts of money, like a dollar, or half a dollar, if the money causes you to be nervous, as the information about your odds is still really useful, especially for yourself (e.g. to update on if you're wrong).

I see; that's understandable.

My understanding is that general opinion at OpenAI and other ML (but not AI) researchers is that it wouldn't be worth scaling on the sub tasks that show good scaling so far given training costs and researcher time. Sorry I don't have a cite as I've only been following this loosely.

Except if you have an idea to monetize one of these sub-tasks? An investment on the order of 10m USD in compute is not very large if you can create a Pokemon Comedy TV channel out of it, or something like that.

You don't 'scale on a subtask', you scale the model, which can be applied to many tasks. The question is not whether this or that task scales, but whether the model improves enough on enough tasks of importance to justify the costs of scaling, and since lots of the tasks do look like they will scale well, that is prima facie plausible and the burden is on people arguing otherwise.

Personally, I have not seen any OA people dismiss the idea of scaling, Slack comments certainly sound like they expect further scaling, other people report gossip about 100-1,000x scaling being planned, scaling to solve tasks like Winogrande sounds like that would be useful, and given how much the benchmarks undersell the reality of GPT-3 I wouldn't put too much stock in it anyway.

I mean that a subtask is projected to be valuable enough to be worth the trouble. First I've heard about the 100-1000x scaling, that's helpful to know. Thanks.

In some of the tests where there is asymptotic performance, it's already pretty close to human or to 100% anyway (Lambada, Record, CoQA). In fact, when the performance is measured as accuracy, it's impossible for performance not to be asymptotic.

The model has clear limitations which are discussed in the paper - particularly, the lack of bidirectionality - and I don't think anyone actually expects scaling an unchanged GPT-3 architecture would lead to an Oracle AI, but it also isn't looking like we will need some major breakthrough to do it.

True. Do these tests scale out to super-human performance or are they capped at 100%?