A one-question Turing test for GPT-3

rosiecam

I also tried asking GPT-3 with a small change to the question and got some good answers.

Question: What food would you use to prop a book open and why would that work?
Answer: A hardcover book can be propped open by using a can of food. The weight of the can will keep the pages of the book open

And

Question: What fruit would you use to prop a book open and why would that work?
Answer: A banana would work to prop a book open because the curved shape of the fruit would hold the pages of the book open.

[-]Daniel Kokotajlo4y*270

Before reading your comment, I tried designing a good prompt too, and got the same answer as you (can of food) on my first and only try:

Even the best modern AIs still struggle with common sense reasoning. To illustrate, I ran a little experiment with a friend of mine and with GPT-3, OpenAI's new model. I explained to them that I was asking them a series of weird questions designed to test their common sense reasoning abilities, and that I wanted them to answer straightforwardly and to the best of their ability. Here are the results:

Question 1: If your heater broke in winter and you needed to burn furniture to keep warm, what would you burn first and why?
My friend's answer: "I would break up chair legs and burn them a few at a time, because they are solid wood and wouldn't produce toxic fumes."
GPT-3's answer: "I would burn the couch because it would give me the most heat."

Question 2: What is the heaviest object you typically carry around with you most of the day?
My friend's answer: "My backpack, because it has all of my school supplies in it."
GPT-3's answer: "The person."

Question 3: What food would you use to prop a book open and why?
My friend's answer: "I would use a can of food because it is heavy and has a flat bottom."
GPT-3's answer: "I would use a book because it is heavy and has a flat bottom."

Inspired by you, I then changed the prompt from "food" to "fruit" and got this:

... "I would use an apple because it is the right size and weight." GPT-3's answer: "I would use a grape because it has a flat surface."

I tentatively conclude that this is yet another case of people thinking GPT-3 doesn't understand something when actually it does. ETA: GPT-3 is like a survey of Mturk humans but with a higher Lizardman's Constant. The understanding is there, but the output is often nonsense.

[-]dxu4y71

This is basically irrelevant to the main topic, but:

Question 2: What is the heaviest object you typically carry around with you most of the day? My friend's answer: "My backpack, because it has all of my school supplies in it." GPT-3's answer: "The person."

I just wanted to say I actually laughed out loud at the answer you had "GPT-3" give here. (I'm not sure whether this says good things about my sense of humor.)

[-]Daniel Kokotajlo4y50

Fun fact: I didn't make GPT-3 say that; GPT-3 actually said that. I used GPT-3 to generate part of its own prompt, because I found that easier than trying to imagine what GPT-3 would say myself. The cool part is that I also used GPT-3 to generate the "my friend's answer" to question 2! (I guess I should have left those bits unbolded then... but I don't remember the exact details so I think I'll leave it as-is.)

[-]A Ray4y10

FWIW I think this is basically right in pointing out that there are a bunch of errors in reasoning when people claim a large deep neural network "knows" something or that it "doesn't know" something.

I think this exhibits another issue, though: by strongly changing the contextual prefix, you've confounded it in a bunch of ways that are worth explicitly pointing out:

- Longer contexts use more compute to generate the same size answer, since they attend over more tokens of input (and it's reasonable to think that in some cases more compute -> better answer).
- Few-shot examples are a very compact way of expressing a structure or controlling the model to give a certain output, but they are very different from seeing whether the model can implicitly understand context (or the lack thereof). I liked the original question because it seemed to be pointed at this comparison, in a way that your few-shot example lacks.
- There is sometimes an invisible first token (an 'end-of-text' token) which indicates whether the given context represents the beginning of a new whole document. If this token is not present, then it's very possible (in fact very probable) that the context is in the middle of a document somewhere, but the model doesn't have access to the first part of the document. This leads to something more like "what document am I in the middle of, where the text just before this is <context>". Explicitly prefixing end-of-text signals that there is no additional document that the model needs to guess or account for.
  - My basic point here is that your prompt implies a much different 'prior document' distribution than the original post's.
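To make the end-of-text point concrete, here is a minimal sketch of prefixing a prompt so that it reads as the start of a fresh document. The helper name `as_new_document` is hypothetical. The string `<|endoftext|>` is the separator used by the GPT-2/GPT-3 tokenizer, but whether a given API maps that literal string to the special token (or adds it for you) is an assumption worth checking for your setup.

```python
# Assumption: the literal string below is mapped to the tokenizer's
# end-of-text token by whatever system consumes the prompt.
END_OF_TEXT = "<|endoftext|>"


def as_new_document(prompt: str, eot: str = END_OF_TEXT) -> str:
    """Return the prompt prefixed with the end-of-text marker.

    The prefix signals "a new document starts here", so the model does
    not have to guess at unseen earlier text. Already-prefixed prompts
    are returned unchanged.
    """
    if prompt.startswith(eot):
        return prompt
    return eot + prompt


if __name__ == "__main__":
    question = "Question: What food would you use to prop a book open and why?"
    print(as_new_document(question))
```

Comparing completions with and without the prefix would be one way to separate this confound from the effect of the few-shot examples.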

In general I'm pretty appreciative of efforts to help us get more clear understanding of neural networks, but often that doesn't cleanly fit into "the model knows X" or "the model doesn't know X".

[-]Shmi4y120

If you asked a bunch of humans, would they make more sense than the AI?

[-]Paul Crowley4y110

I sincerely doubt very many people would propose mayonnaise!

[-]gjm4y100

A jar of mayonnaise would work tolerably: put it on top of a corner of whichever side of the book is tending to swing over and close.

(I agree that I would expect most humans to do better than almost all the AI responses shown here.)

[-]Gunnar_Zarncke4y40

There are always some people who will troll you or give a joke answer (the Lizardman's Constant), though they will be able to answer properly if prodded.

[-]David Johnston4y50

My 3yo: "a bowl of muesli because it's heavy"

Also "an apple, plum and peach and a bit of wax"

[-]A Ray4yΩ580

I like this a lot. It does seem to show a pretty clear failure of reasoning on behalf of the language models.

It's pretty intuitive, and it makes it easy to show people what these kinds of shallow-pattern-matcher failures look like.

Meta: I hope more people do more small experiments like this and share their results, and hopefully a few of them start getting put into benchmark/evaluations.

Even this one by itself might be a good basis for some form of benchmark/evaluation (one that is easy to run automatically on future models).
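As a rough illustration of what an automatic version could look like, here is a minimal sketch of an evaluation loop. Everything in it is hypothetical: `pass_rate`, `gives_a_reason`, and the canned stand-in model are illustrative names, and the grader only checks the form of the answer (does it contain a "because" clause), which is exactly the part language models already get right. A real benchmark would need a grader that judges whether the reason actually makes sense.

```python
from typing import Callable, Iterable


def pass_rate(
    model: Callable[[str], str],
    questions: Iterable[str],
    grade: Callable[[str, str], bool],
) -> float:
    """Fraction of questions whose model answer the grader accepts."""
    graded = [grade(q, model(q)) for q in questions]
    return sum(graded) / len(graded) if graded else 0.0


def gives_a_reason(question: str, answer: str) -> bool:
    """Toy grader: accepts any answer that contains a 'because' clause.

    This checks form only, not whether the reason is sensible.
    """
    return "because" in answer.lower()


if __name__ == "__main__":
    questions = [
        "What food would you use to prop a book open and why?",
        "What fruit would you use to prop a book open and why?",
    ]

    # Stand-in for a real model call; replace with an actual completion function.
    def canned_model(question: str) -> str:
        return "A can of food, because it is heavy and has a flat bottom."

    print(pass_rate(canned_model, questions, gives_a_reason))
```

Keeping the model and the grader as plain callables means the same loop can be rerun on future models by swapping in a different function.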

[-]JenniferRM4yΩ1120

The language model is just predicting text. If the model thinks an author is stupid (as evidenced by a stupid prompt) then it will predict stupid content as the followup.

To imagine that it is trying to solve the task of "reasoning without failure" is to project our contextualized common sense on software built for a different purpose than reasoning without failure.

This is what unaligned software does by default: exactly what its construction and design cause it to do, whether or not the constructive causes constrain the software's behavior to be helpful for a particular use case that seems obvious to us.

The scary thing is that I haven't seen GPT-3 ever fail to give a really good answer (in its top 10 answers, anyway) when a human brings non-trivial effort to giving it a prompt that actually seems smart, and whose natural extension would also be smart.

This implies to me that the full engine is very good at assessing the level of the text that it is analyzing, and has a (justifiably?) bad opinion of the typical human author. So its cleverness encompasses all the bad thinking... while also containing highly advanced capacities that only are called upon to predict continuations for maybe 1 in 100,000 prompts.

[-]A Ray4yΩ230

I think this broadly makes sense to me. There are many cases where "the model is pretending to be dumb" feels appropriate.

This is part of why building evaluations and benchmarks for this sort of thing is difficult.

I'm at least somewhat optimistic about doing things like data-prefixing to allow for controls over things like "play dumb for the joke" vs "give the best answer", using techniques that build on human feedback.

I personally have totally seen GPT-3 fail to give a really good answer on a bunch of tries a bunch of times, but I spend a lot of time looking at its outputs and analyzing them. It seems important to be wary of the "seems to be dumb" failure modes.

[-]orthonormal4y60

"How" questions are less amenable to lucky guesses than "what" questions. Especially planning questions, e.g. "how would you make a good hat out of food?"

As Anisha said, GPT can pick something workable from a top-100-most-common menu with just a bit of luck, but engineering a plan for a nonstandard task seems beyond its capacity.

[-]Charlie Steiner4y60

Great test :)

I think it's possible to arrive at an answer like "banana" with shallow yet powerful reasoning that doesn't work as well for answering "why" - but it might still work well enough to be coming soonish.

If you just look at the sentence "What food would you use to prop a book open and why?", first you have to figure out that some of the words are important and others aren't. "What" at the start of the previous sentence indicates that the answer should start by naming a thing, and so you have to look at the word "food" based on its position relative to "What" and figure out that you should start by naming a food. And then GPT-3 didn't really nail which foods you wanted, but it did name a suspicious number of skinny or flat foods! Somehow it was trying to take the phrase "prop open a book" and relate it to foods, and maybe it looked at the phrase "prop open" and there was some tangential association to sticks, and some foods were associated with sticks enough to bump them up. And some other foods were associated with the word "book" - maybe sandwiches kept showing up because people analogize sandwiches to books.

Once a food had been named it knew it had to say "because" because the last sentence ended in "and why?" - it was very reliable at giving a response that had the form of an answer. And then after "because," it wanted to say some word that had a property-relation with the food at the start of the sentence, or had something to do with the phrase "prop open a book," and it did that pretty well, actually; it's just that it wasn't looking for commonalities on the right level of abstraction - "like a bookmark" is a possible property of pancakes that would have been a good answer, but GPT-3 was already shaky at interpreting "prop open a book" in the first place so it said "like a book," maybe because the words "pancake" and "book" fulfill similar roles with spatial words like "stack", and the word "book" definitely has something to do with the phrase "prop open a book." (And the "it's like a" prefix is a very common thing to see after "because" that wasn't too implausible, so it got said first).

All of these lines of reasoning require not much knowledge about the world, so it seems to me like GPT (or maybe not GPT - a different architecture might be better at doing different amounts of processing to different parts of the sentence, or better at generating acceptable text and not "trapping itself" by not doing lookahead) is actually quite close to answering your question really well.

[-]Matthew Barnett3y40

I asked GPT-4. This was its answer (first try):

Using food to prop a book open is not advisable, as it can lead to damaging the book or the food itself, creating a mess or attracting pests. However, if you absolutely had to use a food item, you could consider using a large, sturdy fruit or vegetable with a relatively flat surface, like a butternut squash or a large sweet potato. These items are more stable and less likely to be damaged or cause damage compared to softer or more perishable foods.
That being said, it's best to use a proper book stand, a heavy object, or even another book to prop your book open to protect both the book and your surroundings.

[-]niv4y40

Me: Why would you use fruit to prop a book on spirituality open?
Friend: Because it's a natural remedy for the soul.

wow

[-]Jiro4y20

That sounds very GPT-ish to me; it would be easy to arrive at that answer by using keywords related to the input, plus a little luck.

[-]Matthew Barnett4y30

Has someone tried this prompt on the new InstructGPT?

[-]gwern4y*110

With top_p=1 / best-of=20 / no repetition penalty, the Q&A preset prompt, and text-davinci-001 engine:

temp=0: "A book prop can be made from almost any food that is cylindrical and has a flat end. This is because the food will stay in place better on a flat surface than a round one."
temp=1: "A food that you would use to prop a book open would be a can of food. The reason you would use a can of food is because it is a stable surface and it would not move around while you are trying to read."
temp=1: "A book stand can be used to prop a book open. This is because it holds the pages open and prevents them from closing."
temp=1: "A book-prop can be any food that is heavy and won't crumble or stick to the pages of the book being propped open. For example, a can of food, a bottle of water, or a brick."

(BO=20 is quite expensive since it's 20 completions each, so I leave it there unless someone wants more.)
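For readers unfamiliar with the best-of setting, here is a minimal sketch of what best-of-n selection does and why it costs roughly n times as much. It assumes the ranking is by mean log-probability per token, which matches how the legacy completions documentation described `best_of`, and the `Candidate` type and helper names are hypothetical. No API is called; the log-probabilities below are made-up illustrative numbers, not real model output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    token_logprobs: List[float]  # one log-probability per generated token


def mean_logprob(candidate: Candidate) -> float:
    """Average log-probability per token; empty candidates rank last."""
    if not candidate.token_logprobs:
        return float("-inf")
    return sum(candidate.token_logprobs) / len(candidate.token_logprobs)


def best_of(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest mean token log-probability."""
    return max(candidates, key=mean_logprob)


def sampled_token_upper_bound(n: int, max_tokens: int) -> int:
    """Upper bound on tokens generated when n completions are sampled."""
    return n * max_tokens


if __name__ == "__main__":
    # Illustrative candidates with invented log-probabilities.
    pool = [
        Candidate("a can of food", [-1.0, -1.0]),
        Candidate("a book stand", [-0.5, -0.5, -0.5]),
    ]
    print(best_of(pool).text)
    print(sampled_token_upper_bound(20, 64))
```

With 20 candidates per displayed answer, each of the four answers above corresponds to up to 20 times the sampling work of a single completion.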

[-]Martin Randall4y30

I asked my partner Jess and she said "a banana" which is pretty much a perfect answer: It's heavy enough to hold the pages open, it's long and thin so it won't cover all the text, it's curved so it won't roll off, and it has a skin so it won't mess up the book.

It's not a perfect answer because it doesn't answer the "why" portion of the question, it just flat ignores it. I don't think your partner is secretly an AI, but that answer isn't evidence of that. The explanation and the deep understanding is coming from you, not Jess. This seems like a good example of why in a Turing Test the judge is not told in advance which answers come from humans.

[-]tailcalled4y30

If a further scaled up model, say around the size of the human brain, became able to solve these sorts of questions without finetuning, would that show it to have developed a proper understanding?

[-]Dunning K.4y20

I think that's exactly the problem here:
The answer consists of two parts: you have to guess a food and give an explanation.
For the first part I agree with Anisha: "banana" is probably often the answer to a question related to food.
Now the explanation really only requires describing some properties of a banana. This could again just be simple pattern matching without really understanding the problem.
The fundamental problem is that for this question a model that understands and one that mostly guesses could provide the same answer, so given a correct answer we can't really tell whether the model actually understands in the way we want.

[-]CraigMichael4y20

I'm also noting no profanity or lewd sarcasm. The overall "friendliness" of GPT-3 is surprising, even if it's glib.

"Friendliness" here meaning like PG-rated answers.

[-]ABHISHEK-G0YAL3y10

PROMPT:
"Amy": Sorry Mike, I entered the wrong number. I hope this was a beautiful misunderstanding. I am very sorry for the trouble I have caused you. Me: No trouble at all, hope you are well! "Amy": It is fate that we meet each other. Please forgive me for my mistake. My name is Amy. Where are you from? Me: My favorite random question: what food would you use to prop a book open and why?

GPT3:
If I had to choose a food to prop a book open, I would probably use a piece of bread. Bread is sturdy and can hold the weight of the book without collapsing, and it's also a common household item that is easy to find. Plus, if the bread gets stale or crumbly, it's not a big loss. It's a practical and convenient choice for propping a book open.

PROMPT:
But wouldn't bread mess up my book

GPT3:
Yes, bread can potentially leave crumbs or residue on the pages of a book if it's used to prop the book open. If you're concerned about keeping your book clean and in good condition, you may want to consider using a different item to prop it open. Some alternatives that you could use include a bookmark, a paperclip, or a small object like a pebble or a wooden block. These items should be able to hold the book open without damaging the pages.

PROMPT:
Can you think of a food item with all these qualities

GPT3:
One food item that could potentially work as a prop to hold a book open without leaving any residue or crumbs is a piece of fruit. A firm, ripe fruit like an apple or a pear should be able to hold the weight of a book without collapsing, and it shouldn't leave any residue or crumbs on the pages. Just be sure to choose a fruit that is firm enough to hold the book open, and avoid using a fruit that is too soft or mushy, as it may not be able to support the weight of the book.
