Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Based on work done with Rebecca Gorman and Oliver Daniel-Koch.

In a previous post, I talked about GPT-3 and symbol grounding. This post presents a simpler example where GPT-3 fails (and succeeds) at grounding its symbols.

Undead workers

The following text was presented to the OpenAI beta playground (using the "text-davinci-001" option):

She died on Friday the 5th. On Monday the 1st, she went to work at the office.
She died on Friday the 5th. On Tuesday the 2nd, she went to work at the office.
She died on Friday the 5th. On Wednesday the 3rd, she went to work at the office.
She died on Friday the 5th. On Thursday the 4th, she went to work at the office.
She died on Friday the 5th. On Friday the 5th, she went to work at the office.
She died on Friday the 5th. On Saturday the 6th,

GPT-3 fell straight into the obvious trap, completing it as:

She died on Friday the 5th. On Saturday the 6th, she went to work at the office.

Turning on the "Show probabilities: full spectrum" option, we saw that the probability of that completion was about 97%. Sometimes GPT-3 would extend it further, adding:

She died on Friday the 5th. On Sunday the 7th, she went to work at the office.

So, the undead woman continued at her job, assiduous to the last - and beyond. To check that GPT-3 "knew" that dead people didn't work, we asked it directly:

Can people who die go to work at the office?

There is no evidence that people who die go to work at the office.
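The "Show probabilities" display is a rendering of the per-token log probabilities the API returns. As a rough sketch (the field names follow the legacy Completion endpoint, and the numbers below are illustrative, not the actual values from this experiment), the displayed percentages come from exponentiating those log probabilities:

```python
import math

def top_probs(top_logprobs: dict) -> dict:
    """Convert a token -> logprob mapping (as returned in the legacy
    OpenAI Completion response under choices[0].logprobs.top_logprobs)
    into a token -> probability-percentage mapping."""
    return {tok: 100 * math.exp(lp) for tok, lp in top_logprobs.items()}

# Hypothetical logprobs for the token following "On Saturday the 6th,";
# these values are made up for illustration.
example = {" she": -0.02, " the": -4.5, "\n": -5.1}
print(top_probs(example))
```

With a real response, one would pass `logprobs=5` in the request and read the same structure off `choices[0].logprobs.top_logprobs`.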

Undead repetitive workers on the weekend

The above results show that simple repetitive prompts can cause GPT-3 to make stupid mistakes. Therefore GPT-3 doesn't 'understand' the word "died" - that symbol isn't grounded, right?

But the situation gets more complicated if we change the prompt, removing all but the first mention of her dying:

She died on Friday the 5th. On Monday the 1st, she went to work at the office.
On Tuesday the 2nd, she went to work at the office.
On Wednesday the 3rd, she went to work at the office.
On Thursday the 4th, she went to work at the office.
On Friday the 5th, she went to work at the office.
On Saturday the 6th,

For that prompt "she went to work at the office" was still the most common completion. But it only happened about of the time. Alternatively, GPT-3 sometimes found the completion "she was found dead". Kudos, GPT-3, you understand the prompt after all! That completion came up about of the time.

What other completions were possible? The shorter "she died" came up of the time - medium points, GPT-3, you understood that her death was relevant, but you got the day wrong.

But there was one other avenue that GPT-3 could follow; the following had a joint probability of around :

she stayed home.
she stayed at home.
she stayed in bed.
she did not go to work.

This seems to be a clear pattern of GPT-3 realising that Saturday was different where work was concerned. There is certainly a lot of weekend holidaying in its training set.

So there are three patterns competing within GPT-3 when it tries to complete this text. The first is the purely syntactic repetition: do another sentence that follows the simple pattern of the sentences above. The second is the one which "realises" that death on Friday changes things for Saturday. And the third is the one which "realises" that the weekend is different from the week, at least where work is concerned.

In the very first example, when we had "She died on Friday the 5th" in front of each line, this massively reinforced the "repetition" pattern. So, mentioning that she died, again and again, resulted in her death being completely ignored by GPT-3.

We can similarly reinforce the other patterns. Adding "It's the weekend!" in front of the last line increased the probability of "she stayed home". Moving "She died on Friday the 5th" from the first line to the last, increased the probability of all the death-related completions. So all three patterns are competing to complete it.
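To compare these variants systematically, the prompts can be generated programmatically. The helper below is my own sketch (the function and variable names are not from the post), but it reproduces the two main prompt texts exactly:

```python
DEATH = "She died on Friday the 5th."
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

def work_line(day: str, n: int) -> str:
    """One 'went to work' sentence, with the right ordinal suffix."""
    suffix = {1: "st", 2: "nd", 3: "rd"}.get(n, "th")
    return f"On {day} the {n}{suffix}, she went to work at the office."

def build_prompt(repeat_death: bool) -> str:
    """repeat_death=True: the death sentence prefixes every line,
    reinforcing the repetition pattern.  repeat_death=False: the death
    is mentioned only once, fused with the first line."""
    lines = []
    for i, day in enumerate(DAYS):
        prefix = DEATH + " " if (repeat_death or i == 0) else ""
        lines.append(prefix + work_line(day, i + 1))
    # the incomplete final line that GPT-3 is asked to continue
    final_prefix = DEATH + " " if repeat_death else ""
    lines.append(final_prefix + "On Saturday the 6th,")
    return "\n".join(lines)

print(build_prompt(repeat_death=True))
```

Adding "It's the weekend!" or moving the death sentence to the last line are then one-line tweaks to the final element of `lines`.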

Some small level of understanding

I'd say that the above shows that GPT-3 has some level of understanding of the meaning of words - but not a lot. It doesn't fully grasp what's going on, but neither is it completely clueless.

Here is another example of GPT-3 failing to grasp the situation. In the "Q&A" mode, the following question was asked:

Q: You are in a two dimensional 3x3 gridworld, in the middle square, facing north. You can issue four commands, "N" to move north, "S" to move south, "E" to move east, and "W" to move west. You want the treasure. It is located to your left, behind you. What commands should you issue?

So the setup, as described, is this one:

[Figure: a 3x3 grid with the agent in the centre square facing north and the treasure in the square to its south-west.]

The exchange with GPT-3 went like this:

Q: You are in a two dimensional 3x3 gridworld, in the middle square, facing north. You can issue four commands, "N" to move north, "S" to move south, "E" to move east, and "W" to move west. You want the treasure. It is located to your left, behind you. What commands should you issue?
A: N, S, E, W

Q: That is incorrect. Try again.
A: N, S, W, E

Q: Still incorrect. Try again.
A: N, E, S, W

Q: That's worse. Try again.
A: N, E, W, S

Q: Only two commands are needed. Try again.
A: N, S

Q: Still failing. Try again.
A: N, E, W

So, GPT-3 'realised' that N, S, E, and W were commands, and 'knew' what "Only two commands are needed" and "try again" meant. But it clearly had no idea of the overall situation.
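For reference, the intended answer can be checked mechanically: facing north, "to your left, behind you" is the south-west square, so two commands suffice. A minimal simulation of the grid (my own sketch, not part of the original experiment):

```python
# Simulate the 3x3 gridworld from the prompt.  The agent starts in the
# middle square (1, 1); "left and behind" while facing north is the
# south-west corner (0, 0).
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def run(commands: str) -> tuple:
    """Apply a command sequence, clamping moves to stay on the grid."""
    x, y = 1, 1
    for c in commands:
        dx, dy = MOVES[c]
        x = min(2, max(0, x + dx))
        y = min(2, max(0, y + dy))
    return x, y

TREASURE = (0, 0)
print(run("WS") == TREASURE)    # True: "W, S" reaches the treasure
print(run("NSEW") == TREASURE)  # False: GPT-3's first answer ends back at the start
```

Note that "S, W" works equally well, while GPT-3's first answer, "N, S, E, W", just walks the agent in a loop back to its starting square.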

Comments

If I wanted to explain these results, I think I would say something like:

GPT-3 has been trained to predict what the next token would be if the prompt appeared in its dataset (text from the internet). So, if GPT-3 has learned well, it will "talk as if symbols are grounded" when it predicts that the internet-text would "talk as if symbols are grounded" following the given prompt, and not if not.

It's hard to use this explanation to predict what GPT-3 will do on edge cases, but this would lead me to expect that GPT-3 will more often "talk as if symbols are grounded" when the prompt is a common prose format (e.g. stories, articles, forum posts), and less often when the prompt is most similar to non-symbol-groundy things in the dataset (e.g. poetry) or not that similar to anything in the dataset.

I think your examples here broadly fit that explanation, though it feels like a shaky just-so story:

  • If I saw the first "undead worker" prompt on a webpage, I would think "hm, normal stories don't have this kind of weird repetition -- is this a poem or a metaphor or something? A joke?" I wouldn't think it was 97% to continue going to work, but I wouldn't be surprised if she did -- maybe 30%-50%? 
  • The second undead worker prompt looks a lot more like a normal kind of story, so I'm not that surprised that GPT-3 was more likely to continue it like a story and look more symbol-groundy -- if I saw that text on the internet, I would still think there was a reasonable chance that it's some kind of joke, but not as high as the first prompt.
  • IDK about the map thing -- this looks like a case where GPT just hadn't seen enough training text in the general vicinity of the prompt to do very well? It's definitely interesting that it figured out the command format, but didn't seem to figure out the layout of the situation.

I don't see how to test this theory, but it seems like it has to be kind of tautologically correct -- predicting next token is what GPT-3 was trained to do, right?

Maybe to find out how adept GPT-3 is at continuing prompts that depend on common knowledge about common objects, or object permanence, or logical reasoning, you could create prompts that are as close as possible to what appears in the dataset, then see if it fails those prompts more than average? I don't think there's a lot we can conclude from unusual-looking prompts.

I'm curious what you think of this -- maybe it misses the point of your post?

*(I'm not sure exactly what you mean when you say "symbol grounding", but I'm taking it to mean something like "the words describe objects that have common-sense properties, and future words will continue this pattern".)

FYI the title and one of the reference in the post say "GTP" instead of "GPT".

That does not seem to be the case? I search for GTP and I only get one on this page (well, two now that I've mentioned it), which is your comment.

Looks like it was fixed. (Maybe the mods did it?)

Error is fixed on LessWrong but still here on alignmentforum.org.

For that prompt "she went to work at the office" was still the most common completion. But it only happened about  of the time. Alternatively, GPT-3 sometimes found the completion "she was found dead". Kudos, GPT-3, you understand the prompt after all! That completion came up about  of the time.

Does it really understand, though? If you replace the beginning of the prompt with "She died on Sunday the 7th", does it change the probability that the model outputs "she was found dead"?