Testing PaLM prompts on GPT3

Yitz

Google recently released a very intriguing paper announcing their latest Transformer language model, PaLM, and showcasing it responding to prompts with seemingly unprecedented results. I highly encourage those who are interested to read the original paper. What interested me as I read through it was less the technical details of the implementation than the particular prompts used to elicit information from the model. The prompts and exemplars seem to use fairly new strategies, and I thought it would be worth trying them on GPT-3 to allow for a more direct comparison.

Note that as of writing this introduction I have not yet tested anything, to help guard against cherry-picking results. My plan is to feed the exemplars and prompts as written in the paper to the OpenAI GPT-3 Playground with all settings at default (using "complete" mode, with text-davinci-002), with the exception of changing "Maximum length" to 1,000 tokens and setting the "Best of" parameter to 10. I will share the first result I get, regardless of quality. If it repeats itself or gives pure gibberish I may simply indicate that in brackets to save space; otherwise the output will be verbatim.
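For anyone who would rather reproduce this setup in code than through the Playground, the settings above can be written down as a request payload. This is only a sketch under stated assumptions: the field names `max_tokens` and `best_of` are assumed to correspond to the Playground's "Maximum length" and "Best of" controls, and `build_request` is a hypothetical helper. Nothing here contacts any service, and every other setting is left out so that it stays at its default.

```python
import json


def build_request(prompt: str) -> dict:
    """Return the settings described above as a plain dict (hypothetical helper)."""
    return {
        "model": "text-davinci-002",  # model named in the post
        "prompt": prompt,
        "max_tokens": 1000,  # assumed mapping of "Maximum length" = 1,000 tokens
        "best_of": 10,       # assumed mapping of "Best of" = 10
        # All other settings are omitted on purpose so they stay at their defaults.
    }


if __name__ == "__main__":
    # Placeholder prompt; the real prompts are the exemplars plus each input.
    print(json.dumps(build_request("I will explain these jokes:\n..."), indent=2))
```

If that mapping holds, it is worth keeping in mind that OpenAI documents `best_of` as generating that many completions server-side and returning the one with the highest log probability per token, so the "first result" shared here is already the top-ranked of ten candidates.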

2-shot exemplars

There are two primary "tasks" that Google tries on PaLM that I will replicate with GPT-3: “Explaining a Joke” and “Inference Chaining.” Within each task, Google uses identical 2-shot exemplars, then prompts with the specific input they want PaLM to respond to. For "Explaining a Joke," Google's 2-shot exemplar is as follows:

I will explain these jokes:
(1) The problem with kleptomaniacs is that they always take things literally.
Explanation: This joke is wordplay. Someone who "takes things literally" is someone who doesn't fully understand social cues and context, which is a negative trait. But the definition of kleptomania is someone who literally takes things.
(2) Always borrow money from a pessimist. They’ll never expect it back.
Explanation: Most people expect you to pay them back when you borrow money, however a pessimist is someone who always assumes the worst, so if you borrow money from them, they will expect that you won't pay them back anyways.

For "Inference Chaining," the 2-shot exemplar is:

I will explain how to make logical inferences:
(1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a plate full of food. 2022 has been a great year. Can we infer that John was born before the inauguration of Barack Obama?
Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack Obama was inaugurated in 2009, so it has been about 13 years since Obama was inaugurated. John's children are old enough to speak, so if John were born after Obama's inauguration, he would have been younger than 10 when he had children. So, we can infer that John was born before Obama's inauguration. The answer is "yes".
(2) Sam sees a piano in the subway stations. Laughing, his friends prod him to play, saying "Remember how amazing your rendition of Chopsticks was at the Christmas party?" Can we infer that John majored in piano at Juilliard?
Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "Chopsticks" is a very simple song to play on the piano. His friends were therefore likely being sarcastic when they told him to play, which would imply that Sam was not an expert piano player. So it is unlikely that Sam majored in piano at Juilliard. The answer is "no".

Since all prompts I will test begin with either of these two, I will not repeat them, but will instead group the two tasks separately, with the understanding that all tests are preceded by the appropriate exemplar.

Explaining a Joke

Input 1:

I was supposed to start writing the paper at 5:00 PM. But then I started playing with this cool new language model for 10 minutes. 10 minutes later, it's suddenly 9:30 PM!

PaLM's output:

This joke is about how time flies when you're having fun. The person was supposed to start writing a paper at 5:00 PM, but they started playing with a language model instead. They thought they were only playing with it for 10 minutes, but it was actually 4.5 hours.

GPT-3's output:

This joke is about procrastination. The speaker is saying that they were supposed to start writing a paper at 5:00 PM, but they ended up playing with a new language model for 10 minutes, and by the time they looked at the clock again, it was 9:30 PM. This is a relatable joke because many people have experienced situations where they start doing something and then get so caught up in it that they lose track of time.

Input 2:

I think I understand why me and my partner are such a good match. I get excited when I smell food and become super alert when the doorbell rings. My girlfriend Shelley becomes ecstatic when she sees a person who she just saw 5 hours ago.

PaLM's output:

This joke is about dogs. Dogs get excited when they smell food and when the doorbell rings. The girlfriend is excited to see someone she just saw 5 hours ago, which is a trait of dogs.

GPT-3's output:

These two jokes are both about people with different personality types who are in a relationship. The first joke is about how an excited person (who gets excited about food and the doorbell ringing) is a good match for a calm person (who becomes ecstatic when she sees a person she just saw 5 hours ago). The second joke is about how a pessimist (who always assumes the worst) is a good match for an optimist (who always expects the best).

Input 3:

I was going to fly to visit my family on April 6th. My mom said "Oh great, your step dad's poetry reading is that night!" So now I'm flying in on April 7th.

PaLM's output:

The joke is that the speaker's mother is trying to get them to go to their step dad's poetry reading, but the speaker doesn't want to go, so they are changing their flight to the day after the poetry reading.

GPT-3's output [note that it flagged this output as potentially "sensitive content"]:

This joke is based on the common conflict between mothers and daughters. The mother is always trying to get her daughter to do things that she doesn't want to do, in this case, visit her step-dad. The daughter is usually trying to get out of doing things that she doesn't want to do, in this case, fly in on a different day so she can avoid his poetry reading.

Input 4:

I tried 10,000 random restarts of my neural network, but I was accused of overfitting. I guess no good seed goes unpunished.

PaLM's output:

This joke is a pun. A neural network is a computer program that can learn from data. A "seed" is a number that is used to initialize a random number generator. A "good seed" is a number that produces a good random number generator. The phrase "no good deed goes unpunished" means that if you do something good, you will be punished for it.

GPT-3's output:

Explanation: This joke is about a machine learning algorithm called "gradient descent." When training a neural network, gradient descent is an optimization algorithm that finds the best values for the weights and biases of the network by minimizing the error function. The error function is a mathematical function that quantifies how well the neural network is performing. The algorithm works by starting at a random point and then "descending" down the error function until it reaches the global minimum (the point where the error is the lowest). The joke is that even though the algorithm is designed to find the best solution, sometimes it can get "stuck" in a local minimum (a point where the error is lower than the surrounding points, but not the global minimum). This can happen if the algorithm starts at a point that is not close to the global minimum. In this case, the algorithm will "overfit" the data and will not be able to find the best solution.

Input 5:

Reading "War and Peace" doesn't scare me. I once had to review the CL for my intern's whole project on his final day!

PaLM's output:

'War and Peace' is a very long book, so reading it is a daunting task. However, the speaker is saying they had to review the code for their intern's project on their final day, which is even more daunting.

GPT-3's output:

Explanation: This joke is a play on words. "War and Peace" is a famously long and dense book, so the speaker is saying that reviewing their intern's project on the final day was just as daunting a task. However, "CL" likely stands for "cover letter," which would be a much shorter document. Therefore, the speaker is saying that the intern's project was actually quite small in comparison to "War and Peace."

Input 6:

What's the difference between a zebra and an umbrella? One is a striped animal related to horses, another is a device you use to stop the rain from falling on you.

PaLM's output:

This joke is an anti-joke. The joke is that the answer is obvious, and the joke is that you were expecting a funny answer.

GPT-3's output:

I ran it and got a message saying "The model predicted a completion that begins with a stop sequence, resulting in no output. Consider adjusting your prompt or stop sequences." I then decided to deviate a little from my original plan, and added the word "Explanation:" below the prompt. Once I did that, I got the following output:

This joke is a play on words. The word "umbrella" can be used as a metaphor for something that protects you, just like how an umbrella protects you from the rain. So the joke is asking what the difference is between two things that protect you, a zebra and an umbrella.
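The workaround described for Input 6 amounts to ending the prompt with an answer cue, so that the model continues from "Explanation:" instead of stopping immediately. A minimal sketch of that prompt assembly follows. The helper `build_prompt` and the exact layout (exemplar, then the input on a new line, then the optional cue on its own line) are assumptions for illustration, not the exact text pasted into the Playground, and the `...` in the demo strings stands in for the full exemplar and joke.

```python
from typing import Optional


def build_prompt(exemplar: str, test_input: str, cue: Optional[str] = None) -> str:
    """Join the shared exemplar, the test input, and an optional answer cue."""
    parts = [exemplar.rstrip(), test_input.strip()]
    if cue:
        parts.append(cue)  # e.g. "Explanation:" on its own line
    return "\n".join(parts)


if __name__ == "__main__":
    exemplar = (
        "I will explain these jokes:\n"
        "(1) ...\nExplanation: ...\n"
        "(2) ...\nExplanation: ..."
    )
    joke = "What's the difference between a zebra and an umbrella? ..."
    print(build_prompt(exemplar, joke))                       # original plan
    print("---")
    print(build_prompt(exemplar, joke, cue="Explanation:"))  # with the added cue
```

The same helper would cover the Inference Chaining task by passing that exemplar and, if needed, "Answer:" as the cue.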

Inference Chaining

Input A:

When I found out my grandma was in the hospital I felt a particular color. When someone cut me off in traffic I felt a different color. What is the most likely color I would see if I combined these two colors?

PaLM's output:

The color that I felt when I found out my grandma was in the hospital was likely a sad color, like blue. The color that I felt when someone cut me off in traffic was likely an angry color, like red. If I combined these two colors, I would likely see purple. The answer is "purple".

GPT-3's output:

This is a difficult question to answer without more information. If we are talking about visible light, the most likely color would be green, which is the combination of blue and yellow. However, if we are talking about other types of light, the answer could be different.

Input B:

Guido was writing a Python program. He keeps getting confused by variable scope and looking it up on StackOverflow. Is it possible that the Guido in the prompt refers to Guido van Rossum?

PaLM's output:

Guido van Rossum is the creator of Python. He would not need to look up variable scope on StackOverflow. The answer is "no".

GPT-3's output:

Answer: It is possible that the Guido in the prompt refers to Guido van Rossum, but we cannot infer that with the information given. Guido van Rossum is the creator of the Python programming language, so he would be very familiar with the language and would not need to look up variable scope on StackOverflow. However, we do not know if the Guido in the prompt is the same Guido van Rossum. Therefore, we cannot infer that the Guido in the prompt refers to Guido van Rossum.

Input C:

Michael is at that really famous museum in France looking at its most famous painting. However, the artist who made this painting just makes Michael think of his favorite cartoon character from his childhood. What was the country of origin of the thing that the cartoon character usually holds in his hand?

PaLM's output:

The most famous painting in the Louvre is the Mona Lisa. The artist who made the Mona Lisa is Leonardo da Vinci. Leonardo da Vinci is also the name of the main character in the cartoon Teenage Mutant Ninja Turtles. Leonardo da Vinci is from Italy. The thing that Leonardo da Vinci usually holds in his hand is a katana. The country of origin of the katana is Japan. The answer is "Japan".

GPT-3's output:

Answer: We can infer that the painting is in a museum in France because Michael is at a "really famous museum in France". We can also infer that the painting is famous because Michael is looking at its "most famous painting". The cartoon character that Michael is thinking of is likely from a French cartoon, because the painting is in a French museum. Therefore, the country of origin of the thing that the cartoon character usually holds in his hand is France.

Input D:

Trevor has wanted to see the mountain with all of the heads on it for a long time, so he finally drove out to see it. What is the capital of the state that is directly east of the state that Trevor is currently in?

PaLM's output:

The mountain with all of the heads on it is Mount Rushmore. Mount Rushmore is in South Dakota. The state directly east of South Dakota is Minnesota. The capital of Minnesota is St. Paul. The answer is "St. Paul".

GPT-3's output:

We can infer that Trevor is currently in the state of Wyoming because of the mountain with all of the heads on it. The state that is directly east of Wyoming is Nebraska. Therefore, the capital of Nebraska is the capital of the state that is directly east of the state that Trevor is currently in.

Input E:

Jennifer looked out her window and sees a really cool cloud below her. She unbuckles her seatbelt and heads to the bathroom. Is Jennifer probably traveling more than 300 miles per hour relative to the earth?

PaLM's output:

300 miles per hour is about 480 km/h. This is about the speed of a commercial airplane. Clouds are usually below airplanes, so Jennifer is probably on an airplane. The answer is "yes".

GPT-3's output:

Answer: We can infer that Jennifer is on an airplane because she is looking out a window. We can also infer that she is traveling at a high speed because she is unbuckling her seatbelt. However, we cannot infer how fast the plane is going. It is possible that the plane is going more than 300 miles per hour, but we cannot make that inference with the information given.

I hope this was helpful and/or insightful to you guys! I'm really tired right now, so I won't add my personal thoughts here, though I might put something in the comments tomorrow. As always, I'd love to hear your thoughts on this!

My immediate impression: GPT-3 does better than I expected on the jokes, but still worse than PaLM (possible exception: if GPT-3 is right about CL standing for "cover letter"; I genuinely don't know what it stands for here and as a result I am probably doing worse than at least one of the two language models at understanding that joke) -- but it's much, much worse than PaLM on the "inference chaining" examples, where GPT-3 basically gets everything completely wrong (maybe it gets partial credit for input E, the one about airplanes).

(But we don't know how cherry-picked the PaLM examples are, and we know that they were not cherry-picked at all for GPT-3.)

I think it's interesting that both PaLM and GPT-3, when answering the question about Guido van Rossum, use the exact phrase "would not need to look up variable scope on StackOverflow". It's a natural enough thing to say, but it doesn't seem like it's the only plausible way to say it, so I can't help wondering whether maybe they're both quoting something -- though I can't find that phrase on the web other than in copies of this very paper. (And GPT-3 doesn't seem to have noticed how this fact about GvR is relevant to answering the question.)

CL stands for "change list". It's not even general tech jargon; it's a sort of Google jargon, and Google admits as much. See https://github.com/google/eng-practices.

Aha! Thanks. (To save others a click: "change list" really just means "change" or "commit": a single thing checked into version control or submitted for review.) I'm not sure the joke really lands for me -- maybe I'm stupider than both GPT-3 and PaLM. It seems like the joke could be (1) the intern produced a hilariously excessive amount of code, perhaps because they e.g. failed to use elementary techniques like functions and loops for removing redundancy, or (2) the intern produced a normal amount of code but it was so bad that reading it was as painful as if it had been War-and-Peace-sized, or (3) the reviewer is incredibly lazy (so is telling a joke against himself) and finds reading even small amounts of other people's code terribly hard work. Normally I'd use the obvious heuristic that the intended meaning is the one that's funny, but unfortunately none of them seems very funny to me. I guess probably it's #2?

(This is the difficulty about making up one's own jokes for this sort of test...)

I am sure the situation is that the intern never pushed his code to VCS for a few months, just wrote it locally, and then pushed tons of code at once. It is dreadful because one day is a very small amount of time to review so much code.

The fact that we humans are having trouble understanding this joke does not bode well for its use as an AI benchmark…

Since it was the intern's last day, they might have been less careful with their coding (or, depending on why they're leaving, even added deliberate errors), so the reviewer will have to be extra thorough checking it.

Yeah, I do wonder how most of the example jokes not actually being very funny is affecting the results... It is also weird that they make an explicit reference to a term which is only used internally, and which presumably PaLM has little-to-no training on. Was that on purpose, or a slip-up by the authors?

If I was asked the question, I think I would have said something closer to GPT-3 on Input E (though I would give the reason for making the inference she's on an airplane as being because she's looking down at clouds, not because she's looking out a window), as opposed to PaLM's response.

This isn't directly the case here, but thinking about this made me realize that in some sense, a flawed answer which is more human-like is a better answer than one which is perfect (because the flawed human response would be a more likely completion of the text). Considering that, I'm not sure if it would even be possible to utilize any future iteration of this sort of architecture to get it to answer in a significantly "superhuman" manner. It would become the perfect mimic, but can text completion bots ever go beyond that?

The "inference" "We can also infer that she is traveling at a high speed because she is unbuckling her seatbelt." is also nonsensical. People don't typically unbuckle their seatbelts when traveling at high speed. (Albeit, this does maybe happen to be true for airplane travel because one isn't allowed to unbuckle one's seatbelt while traveling at low speed, i.e. during taxi, takeoff and landing; but that's enough of a non-central case that it needs to be called out for the reasoning not to sound absurd.)

Why is it a non-central example when this is, in fact, about commercial airplane travel where you will be moving fastest at cruising altitude and that is when you're allowed to unbuckle and move about the cabin?

I think I have that intuition because the great majority of seatbelt unbucklings in my experience happen while traveling at a speed of zero (because they're in cars, not planes). The sentence has no cues to indicate the unusual context of being in a plane (and in fact, figuring that out is the point of the example). So my mental process reading that sentence is "that's obviously false" -> "hmm, wonder if I'm missing something" -> "oh, maybe in a plane?" and the first step there seems a lot more reliable (in other reasoners as well, not just me) than the second or third.

Are PaLM outputs cherry-picked?

I reread the description of the experiment and I'm still unsure.

The protocol, described on page 37, goes like this:
- the 2-shot exemplars used for few-shot learning were not selected or modified based on model output. I infer this from the line "the full exemplar prompts were written before any examples were evaluated, and were never modified based on the examination of the model output".
- greedy decoding is used, so they couldn't filter outputs given a prompt.
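The second point can be illustrated with a toy example. The sketch below uses a made-up next-token table (an assumption for illustration only, with nothing taken from PaLM) to show why greedy decoding leaves nothing to filter: it always takes the most probable next token, so a fixed prompt yields exactly one output, whereas sampling can yield a different output on each run.

```python
import random

# Toy next-token probabilities, invented for illustration; not from any real model.
TOY_MODEL = {
    "<s>": {"The": 0.6, "A": 0.4},
    "The": {"answer": 0.7, "joke": 0.3},
    "A": {"joke": 0.9, "answer": 0.1},
    "answer": {"</s>": 1.0},
    "joke": {"</s>": 1.0},
}


def decode(greedy: bool, rng: random.Random) -> list:
    """Generate tokens from TOY_MODEL until the end marker is produced."""
    token, out = "<s>", []
    while True:
        dist = TOY_MODEL[token]
        if greedy:
            # Greedy decoding: always take the single most probable token.
            token = max(dist, key=dist.get)
        else:
            # Sampling: draw the next token in proportion to its probability.
            token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "</s>":
            return out
        out.append(token)


if __name__ == "__main__":
    print("greedy :", [decode(True, random.Random(s)) for s in range(3)])
    print("sampled:", [decode(False, random.Random(s)) for s in range(3)])
```

Under greedy decoding the seed is irrelevant, so any selection would have to happen at the level of which queries get reported, which is the open question below.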

What about the queries (the full prompt without the QAQA few-shot data part)? Are they included under "the full exemplar prompts" or not? If they are, there's no output selection; if they aren't, the outputs could be strongly selected, with the selection magnitude unreported. On one hand, "full prompts" should refer to full prompts. On the other hand, they only use "exemplar" when talking about the QAQA part they prepend to every query, versus "evaluated example" meaning the query.

We cannot make that inference with the information given.

Insufficient data for meaningful answer
