Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

At the latest EAG in London, I was challenged to explain what concept extrapolation would mean for GPT-3.

My first thought was the example from this post, where there were three clear patterns fighting each other for possible completions: the repetition pattern where she goes to work, the "she's dead, so she won't go to work" pattern, and the "it's the weekend, so she won't go to work" pattern.

That feels somewhat like possible "extrapolations" of the initial data. But the idea of concept extrapolation is that the algorithm is trying to cope with a shift in world-model, and extend its goal to that new situation.

What is the world-model of GPT-3? It consists of letters and words. What is its "goal"? To complete sentences in a coherent and humanlike way. So I tried the following expression, which would be close to its traditional world-model while expanding it a bit:

ehT niar ni niapS syats ylniam ni eht

What does this mean? Think of da Vinci's mirror writing: each word is spelled backwards. The correct completion is "nialp", the reverse of "plain".

I ran that through the GPT-3 playground (text-davinci-002, temperature 0.7, maximum length 256), and got:

ehT niar ni niapS syats ylniam ni eht teg dluoc I 'segaJ niar ni dna ro niar ni eht segauq ,ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ,ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni eht segauq ro niar ni eht dna ro niar ni e

I think we can safely say it broke GPT-3. The algorithm seems to have caught the fact that the words were spelt backwards, but has given up on any attempt to order them in a way that makes sense. It has failed to extend its objective to this new situation.
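For anyone who wants to reproduce the run, the playground settings above correspond roughly to the API call sketched below. This is illustrative only: it assumes the older (pre-v1) openai Python client and that the text-davinci-002 completions model is still being served.

```python
# Sketch of the playground run above: text-davinci-002, temperature 0.7,
# maximum length 256. Assumes the pre-v1 `openai` Python client.
import openai

phrase = "The rain in Spain stays mainly in the"
prompt = " ".join(word[::-1] for word in phrase.split())  # "ehT niar ni niapS ..."

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    temperature=0.7,
    max_tokens=256,
)
print(response["choices"][0]["text"])
```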

Comments

Jan:

Cool experiment! I could imagine that the tokenizer handicaps GPT's performance here (reversing the characters leads to completely different tokens). With a character-level tokenizer GPT should/might be able to handle that task better!
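To see the tokenizer point concretely, here is a quick check; a sketch assuming the tiktoken library (text-davinci-002 uses the p50k_base encoding, if I recall correctly):

```python
# Reversing the characters of a word produces completely different BPE tokens.
import tiktoken

enc = tiktoken.get_encoding("p50k_base")
print(enc.encode("The rain in Spain"))  # a few familiar word-level tokens
print(enc.encode("ehT niar ni niapS"))  # a longer run of unrelated sub-word tokens
```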

I was slightly surprised to find that even fine-tuning GPT-Neo-125M for a long time on many sequences of space-separated letters, followed by a colon, followed by the same sequence in reverse, was not enough to get it to pick up the pattern - probably because the positional encoding vectors make the difference between, e.g., "18 tokens away" and "19 tokens away" rather subtle. However, I then tried fine-tuning on a similar dataset with index numbers in between (e.g. "1 W 2 O 3 R 4 D 5 S : 5 S 4 D 3 R 2 O 1 W", or something roughly like that; I can't remember the exact representation) and it picked up the pattern right away. Data representation matters a lot!
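For illustration, generating data in roughly that indexed format might look like the sketch below (the exact representation used in the experiment may have differed):

```python
# Generate one fine-tuning example: indexed letters, a colon, then the
# same indexed letters in reverse order.
import random
import string

def make_example(min_len: int = 3, max_len: int = 8) -> str:
    length = random.randint(min_len, max_len)
    letters = random.choices(string.ascii_uppercase, k=length)
    indexed = [f"{i + 1} {c}" for i, c in enumerate(letters)]
    return " ".join(indexed) + " : " + " ".join(reversed(indexed))

print(make_example())  # same shape as "1 W 2 O 3 R 4 D 5 S : 5 S 4 D 3 R 2 O 1 W"
```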

Here's an example of someone prompting with a walkthrough of a similar token-aware approach to successfully guide GPT-3:

https://twitter.com/npew/status/1525900849888866307

I tried to use that approach to teach GPT-3 to solve the problem at the top of this post. As you can see, it kinda worked: GPT-3 grasps that some things need to be reversed, but then goes a bit off the rails. It adds a random "this is a great" to the end of my prompt, with the whole phrase reversed rather than each word; it then starts out reversing the individual words of the sentence, but ends up just completing the sentence instead, using the other common completion ("falls" rather than "stays"); and when it finally tries to reverse each individual word, it fails completely, just reordering and reworking the words a bit.

Reverse the word below:

Word: alphabet

Reasoning:

-Add spaces between letters: a l p h a b e t

-Add numbers: 1:a 2:l 3:p 4:h 5:a 6:b 7:e 8:t

-Reverse numbers and letters: 8:t 7:e 6:b 5:a 4:h 3:p 2:l 1:a

-Remove numbers: t e b a h p l a

-Merge the letters in groups of two: te ba hp la, teba hpla, tebahpla

-Final result: tebahpla

Reverse all words in the following phrase, complete the sentence, then reverse all words in the completed sentence:

ehT niar ni niapS syats ylniam ni eht taerg a si siht

The rain in Spain falls mainly on the plain.

The main plain falls rain in Spain on the.
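For reference, the deterministic version of the reasoning steps in the prompt is only a few lines of Python; a minimal sketch (the function name is my own):

```python
# Follow the prompt's steps: space out the letters, index them, reverse,
# strip the indices, and merge back into a single word.
def reverse_word_stepwise(word: str) -> str:
    spaced = list(word)                                           # a l p h a b e t
    numbered = [f"{i + 1}:{c}" for i, c in enumerate(spaced)]     # 1:a 2:l ... 8:t
    reversed_numbered = list(reversed(numbered))                  # 8:t 7:e ... 1:a
    letters = [item.split(":")[1] for item in reversed_numbered]  # t e b a h p l a
    return "".join(letters)                                       # tebahpla

print(reverse_word_stepwise("alphabet"))  # tebahpla
print(" ".join(reverse_word_stepwise(w)
               for w in "The rain in Spain stays mainly in the".split()))
```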

Fascinating. Thanks!

This approach is a little surprising. I would have thought that adding on numbers to my space-separating approach, and then merging space-separated letters into a final solid word, would have tripped up GPT-3 and inevitably led to errors. But, at least with InstructGPT, it works.

Thanks; very interesting result.

Jan:

Fascinating! Thanks for sharing!

For the similar anagram task, I found that space-separating the letters (to avoid the BPE inconsistency/nondeterminism, by forcing it to encode individual letters) seemed to help: https://gwern.net/GPT-3-nonfiction#anagrams

For this task, I think a worthwhile followup would be to experiment with the new edit mode.
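A sketch of what that follow-up might look like, assuming the (since-deprecated) edits endpoint and the pre-v1 openai client; the instruction text is my own guess at a reasonable phrasing:

```python
# Hypothetical follow-up: ask the edits endpoint to un-reverse each word.
import openai

response = openai.Edit.create(
    model="text-davinci-edit-001",
    input="ehT niar ni niapS syats ylniam ni eht",
    instruction="Reverse the letters of each word.",
)
print(response["choices"][0]["text"])
```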

Possibly! Though it did seem to recognise that the words were spelt backwards. It must have some backwards spelt words in its training data, just not that many.

TLW:

For anyone else that can't read this quickly, this is what it looks like, un-reversed:

The rain in Spain stays mainly in the get could I Jages' rain in and or rain in the quages or, rain in the and or rain in the quages or rain in the and or rain in the quages or, rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in the quages or rain in the and or rain in e

As an aside: what does GPT-3 complete it as when not reversed?

The rain in Spain stays mainly in the plain. 

This is a famous line from the play Pygmalion by George Bernard Shaw. In the play, a character named Henry Higgins is teaching a lower-class woman named Eliza Doolittle how to speak proper English. He tells her that the rain in Spain stays mainly in the plain in order to help her remember the correct pronunciation of the word "plain."

Incidentally, this is wrong. The line is from a movie adaptation, not the original play. And at least in the song (which came still later), "plain" wasn't singled out; there are five copies of that vowel in the phrase, and she couldn't get any of them at first. https://en.wikipedia.org/wiki/The_Rain_in_Spain

You would behave the exact same way as GPT-3, were you to be put in this same challenging situation. In fact I think you'd do worse; GPT-3 managed to get quite a few words actually reversed whereas I expect you'd just output gibberish. (Remember, you only have about 1 second to think before outputting each token. You have to just read the text and immediately start typing.)

The aim of this post is not to catch out GPT-3; it's to see what concept extrapolation could look like for a language model.

OK, cool. I think I was confused.

It feels like a "gotcha" rebuke, but it honestly doesn't seem like it really addresses the article's point. Unless you think GPT-3 would perform better if given more time to work on it?

How does it not address the article's point? What I'm saying is that Armstrong's example was an unfair "gotcha" of GPT-3; he's trying to make some sort of claim about its limitations on the basis of behavior that a human would also exhibit. Unless he's saying we humans also have this limitation...

Yes, I think GPT-3 would perform better if given more time to work on it (and fine-tuning to get used to having more time). See e.g. PaLM's stuff about chain-of-thought prompting. How much better? I'm not sure. But I think its failure at this particular task tells us nothing.

Humans don't have the same goal as GPT-3, though, so that doesn't seem like a fair comparison.

Suppose I offered to pay you a million dollars if you accomplished the same goal as GPT-3 in this experiment. Then you would have the same goal as GPT-3. Yet you still wouldn't be able to accomplish it better than GPT-3.

Not really comparable. At a minimum, I should have spent a ton of time completing text first.

OK, so suppose you had spent a ton of time completing text first. There just isn't enough time for you to do the mental gymnastics needed to compose a sentence in your head and reverse it.

I think this is too far out of my experience to say anything with certainty, or for it to be particularly informative about how well I'd do in generalizing OOD within my current goals and ontology.

"Remember, you only have about 1 second to think before outputting each token."

I don’t think this is true. Humans can decide to think longer on harder problems in a way GPT-3 can’t. Our “architecture” is fundamentally different from GPT-3 in that regard.

Also, our ability to think for longer fundamentally changes how we do concept extrapolation. Given a tricky extrapolation problem, you wouldn’t just spit out the first thing to enter your mind. You’d think about it.

If GPT-3 has an architectural limitation that prevents it from doing concept extrapolation in a human-like manner, we shouldn’t change our evaluation benchmarks to avoid “unfairly” penalizing GPT-3. We should acknowledge that limitation and ask how it impacts alignment prospects.

It sounds like we are on the same page. GPT-3 has an architectural limitation such that (a) it would be very surprising and impressive if it could make a coherent sentence out of reversed words, and (b) if it managed to succeed it must be doing something substantially different from how a human would do it. This is what my original point was. Maybe I'm just not understanding what point Stuart is making. Probably this is the case.

I wonder what would happen if GPT-3 got trained with a bit of reversed text. Could it quickly be fine-tuned to extrapolate?

But has it really failed its objective?  It's still producing text.

I think it's also worth asking "but did it really figure out that the words were spelled backwards?"  I think a reasonable case could be made that the tokens it's outputting here come from the very small subset of reversed words in its training set, and it's ordering them in a way that it thinks is sensical given how little training time was spent on it.

If you give GPT-3 a bunch of examples and teach it about words spelled backwards, does it improve? How much does it improve, how quickly?
