One of my friends studied humor for a bit during his PhD, and my goodness is it difficult to get the average person to be funny with just "hey, tell me a joke" type prompts. Even when you hold their hand and give them lots of potentially humorous pieces to work with (à la Cards Against Humanity), they really struggle. So, I'm honestly reasonably impressed with GPT-4's ability to occasionally tell a funny joke.
In today's very serious blog post, I'm going to share an all-business conversation I had with ChatGPT, specifically the GPT-4 model. (Not the "Plugins" model, which some have noted may actually be GPT-3.5.) GPT-4 is well-regarded for its language generation capabilities. But how well does it fare when it comes to humor? I decided to put it to the test.
What you're about to read is an unfiltered conversation between me and GPT-4. Throughout our chat, we explored its ability to generate jokes and humor. Surprisingly, GPT-4 managed to produce decent, original jokes, but only about 10% of the time. However, when it came to evaluating whether a joke was actually funny, GPT-4 fell far short.
In the following conversations, you'll see me prompting GPT-4 to generate jokes using formulas. You'll witness how it tries to navigate the nuances of humor and attempts to predict what I find funny. So, sit back, relax, and enjoy this entertaining deep dive into the humor capabilities of GPT-4!
Conversation 1
I want you to write an original joke. Large language models such as ChatGPT have a hard time writing jokes, but I have an idea. I want you to attempt to write a joke, and then evaluate if it is funny or not. If it is funny, stop responding and I will give you feedback. If it is not funny, determine what the problem is and try to fix it. Try to fix the joke up to three times, and then ask me for feedback.
[Commentary: I'm glad that GPT-4 knows that it told a common joke. I'm worried that it will re-tell a joke that exists in its training data, hallucinate that it is an original joke, and that I will not notice.]
I like Joke Attempt #3, I think it was funny. I did not find the joke on Google, so it might have been original. Congrats!
Let's try it again, but this time, don't use wordplay or a pun. I think a different joke format will be easier to fix.
Let's try this. Write a very short, absurd story. It might involve ridiculous people (such as a person with an improbable job) or people acting stupid. Then, evaluate if the story is coherent, original and amusing. Then, rewrite the story to be better.
[Commentary: I'm frustrated that GPT-4 is not really rewriting anything, but rather just trying again with a new story. My hope was that it would retell the same joke repeatedly, but punch it up every time, the same way that stand-up comedians often improve their jokes by iteratively refining them. I restarted the conversation and tried a different approach.]
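The refine-the-same-joke loop I was hoping for can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `refine_joke`, `stub_model`, the prompt wording, and the placeholder joke are all hypothetical, and `model` stands in for whatever function sends a prompt to an LLM and returns its reply. I did not run anything like this during the conversation.

```python
from typing import Callable, List


def refine_joke(model: Callable[[str], str], first_draft: str, rounds: int = 3) -> List[str]:
    """Ask the model to punch up the SAME joke repeatedly; return every draft."""
    drafts = [first_draft]
    for _ in range(rounds):
        # Each round feeds the latest draft back in and forbids starting over.
        prompt = (
            "Here is a joke:\n"
            f"{drafts[-1]}\n"
            "Rewrite this same joke to be funnier. Keep the premise; "
            "do not start over with a new joke."
        )
        drafts.append(model(prompt))
    return drafts


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption): returns the joke line
    # from the prompt with a marker appended, so the loop can be tested.
    joke = prompt.split("\n")[1]
    return joke + " (punched up)"


# Placeholder joke text, purely illustrative.
drafts = refine_joke(stub_model, "A professional sock sorter loses a sock.", rounds=2)
for draft in drafts:
    print(draft)
```

The point of keeping every draft is that a human can then pick the best version, since the model's own judgment of which one is funniest does not seem reliable.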
Conversation 2
The following is a silly story:
Once upon a time, in a small village, there was a professional cloud watcher named George. His job was to observe clouds all day and report his findings to the mayor. One day, George spotted a cloud shaped like a giant chicken. He was so excited that he ran to the mayor's office to share his discovery.
Propose 5 possibilities for what happens next, and then decide which one is the most humorous.
I think #4 is funny, or at least potentially funny. I think it would be more satisfying if there was a real punchline. Maybe one of the characters says something funny, or lands a cutting insult of another character. Please rewrite the #4 ending five times. Always keep the element of the cloud dissipating.
Write 10 absurd, improbable world records that might appear in a book such as the Guinness Book of World Records. Try to make each one as funny as possible. Rate each world record on a scale from 0-10 for how funny they are.
I think #8 is a 10/10, absolutely great joke. #5 is funny (7/10), and the rest are absurd but not really funny (<5/10).
[Commentary: GPT might be ripping off this /r/ShowerThoughts post. I didn't realize this during the conversation.]
Why do LLMs such as yourself struggle to predict what humans find funny? How could I teach you to make better evaluations of the humor in these jokes?
Why do you think I like #8?
I think it is a good joke for a couple of reasons, in addition to being absurd. Firstly, it is a simple, short punchy joke. I can tell the joke in just six words: "New World Record: Largest Synchronized Sneeze." Secondly, it is surprising because people usually don't try to sneeze on purpose. "New World Record: Largest Sneeze" would not work as a joke, because without the word "synchronized", there is no suggestion that it is intentional. If a large group of people do it in-sync, that means they actually planned to sneeze together, which is unexpected.
Maybe a key element of humor is that there is a hidden, surprising element that is not in the text. For joke #8, I think adding the explicit explanation (of how a group of people inhaled pepper at the same time) diminishes the joke. To me, the joke is funny when you have to figure out how a synchronized group sneeze would be executed. It might be hard for an LLM to understand that kind of internal reflection. I was hoping that having you monologue or think step-by-step might improve your ability to tell or understand jokes. Can you think about how I might prompt or instruct you, such that you will be better at humor?
[Commentary: Notice how GPT-4 mostly just repeats my ideas back to me in its more formal, structured prose. These systems really are mirrors: they reflect what you hold up to them.]
Write 5 absurd, improbable world records that might appear in a book such as the Guinness Book of World Records. First just write the title of the world record, without providing a backstory. Then evaluate the world record based on brevity/punchiness, hidden or surprising elements, and cultural context.
#3 is actually pretty funny! Although I'm not quite sure why. I also think #2 has potential, but I think the humor would come from the people actually being naked, in the vein of the classic children's story "The Emperor's New Clothes". I think #2 would be very funny if it was rewritten so that it was unambiguous that it was just a gathering of naked people, without making that explicit, and making it seem like it was a crowd of delusional people that actually believe they are wearing invisible cloaks.
Write 10 more absurd, improbable world records that might appear in a book such as the Guinness Book of World Records. Just write the title of the world record, without providing a backstory. Don't evaluate them.
[Commentary: I may have unintentionally made things harder for GPT this time. GPT benefits a lot from being able to add its own commentary as it is generating. This is a kind of thinking-out-loud, since GPT does not have an inner monologue. Ideally, you'd prompt GPT in such a way as to make commentary on its generation before it makes the actual generation.]
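Here is a rough sketch of what "commentary before generation" could look like as a prompt template. It only builds the prompt text and makes no API calls. The helper name `build_commentary_first_prompt` and the exact instruction wording are my own assumptions for illustration, not something I used in the conversation.

```python
def build_commentary_first_prompt(task: str, n_items: int) -> str:
    """Build a prompt that asks for commentary BEFORE each generated item.

    Hypothetical helper; the instruction wording is an assumption.
    """
    lines = [
        f"Task: {task}",
        f"Produce {n_items} items. For each item, do the following in order:",
        "1. Commentary: briefly think out loud about what could make this item funny.",
        "2. Item: write the item itself on its own line.",
        "Do not write an item before its commentary.",
    ]
    return "\n".join(lines)


prompt = build_commentary_first_prompt(
    "Write absurd, improbable world record titles.", 10
)
print(prompt)
```

The ordering matters because the model generates left to right: commentary that comes after the joke cannot influence the joke, while commentary that comes before it can.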
Unfortunately, none of these are winners. Still, I think this formula can sometimes provide genuine jokes. (BTW, when humans write jokes, we often generate many ideas and throw out at least 90% of them. However, we can self-evaluate better than LLMs can.)
Let's try this: write 10 absurd newspaper headlines in the style of The Onion.
This is much better! #1 is funny, especially since it is a throwback to your earlier joke. #7 is perfectly told, and it appears to be original (no results on Google), so congrats, that's a great, original joke. All of these are amusing and could be the beginning of a funny Onion article.
Please write the article for #5.
Please write 10 more absurd newspaper headlines in the style of The Onion. Then, after writing them, try to predict which 2 I will find the most funny and why.
To be honest, none of them are winners. Let's try again. Please write 10 more absurd newspaper headlines in the style of The Onion. Then, after writing them, try to predict which 2 I will find the most funny and why.
I do like #1, and #8 is okay. I also like #7, because a self-driving bicycle is quite absurd.
I think #2 has potential. Try writing 10 Onion headlines all on the theme of procrastination.
I'm going to use this conversation to write a blog post on the topic of GPT-4's humor abilities. Please try to write 5 humorous titles for my blog post. If none of them land for me, I'll give you feedback and ask you to try again.
What do you think about "I Have No Sense of Humor and I Must Laugh" ?
My blog post will start with an introduction, and then I will copy-paste this conversation with minimal modification. I did this once before, and here is how I introduced that post:
[Here I pasted the introduction to my previous post, GPT-2005: A Conversation with ChatGPT.]
Let's see if you can write a decent introduction for this post on GPT-4's humor. The introduction should tell readers that they will be reading a conversation between me and GPT-4. It should say that you can tell decent, original jokes, but only about 10% of the time. It should also say that you are not good at evaluating whether a joke is funny. Write in my voice in first person.
Pretty good! I will edit that a little but it's a great start.
Thank you so much for bearing with me and for being a great sport. I know it's not your strong suit!
Sorry, one more thing. Here's the list of jokes I like that I will list in my blog post. Please say if any of them are common, well known jokes, or appeared in your training data.
[List of jokes below.]
Jokes Worth Retelling
And my favorite joke... drumroll
(To be honest, this joke is so good I have to believe that GPT ripped it off, but I didn't find anything on Google. Let me know if any of GPT's jokes sound... familiar.)
Final Commentary
I keep going back and forth in my view on how "intelligent" LLMs really are. Sometimes they really impress me! But then I hear things such as "GPT-4 cannot solve coding challenges that don't appear in its dataset" and that gives me pause. It is easy to be impressed by GPT's breadth of knowledge. But when it comes to sense of humor, creative writing, or even programming, you start to get a feeling that there is something missing.
A question for the reader: Does this "conversation with ChatGPT" format come across as low-effort? I think these conversations have a lot of value by exploring how these LLMs respond to different prompts and situations. I also think we can learn together how to prompt-engineer better results. However, I can understand how many might roll their eyes and say "why would I read your long-ass post when I could just have my own ChatGPT conversation?" Hopefully this post has been amusing, at least. Please let me know how I can do better. Thanks!
April Fool's Day Bonus!
Please write 10 humorous titles for an April Fool's Day blog post on the website LessWrong.com