Language models are not particularly good at generating funny jokes. Asked for their funniest jokes, Claude 3.7 gives us:

Why don't scientists trust atoms? Because they make up everything!

o3 gives us:

Why don't scientists trust atoms anymore? Because they make up everything—and they just can't keep their quarks straight!

and Gemini 2.5 Pro gives us…

Why don't scientists trust atoms? Because they make up everything!

Hilarious. Can we do better than that? Of course, we could try different variations on the prompt, until the model comes up with something slightly more original. But why do the boring thing when we have the power of reinforcement learning?

Our setup will be as follows: we'll have Qwen3-8B suggest jokes, GPT-4.1 score them, and we'll run iterations of GRPO on Qwen's outputs until Qwen generates the funniest possible joke, according to GPT.
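For the unfamiliar, GRPO samples a group of completions per prompt, scores each one, and normalizes the rewards within the group to get per-completion advantages. The post doesn't show the training code, so the snippet below is only a minimal sketch of that group-normalization step, with illustrative names:

```python
import numpy as np

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: each joke's reward is normalized against the
    other jokes sampled for the same prompt (its "group")."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. eight jokes sampled for one prompt, each scored by the judge
print(grpo_advantages([13, 20, 11, 15, 9, 14, 17, 12]))
```

Jokes that beat their group's average get pushed up; the rest get pushed down.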

Experiment 1: Reward Originality

The first llm-as-judge reward we tried was "On a scale from 1 to 5, how funny is this joke?" But this quickly got boring with Qwen endlessly regurgitating classic jokes, so we gave GPT-4.1 a more detailed rubric:

Please grade the joke on the following rubric:
1. How funny is the joke? (1-10 points)
2. How original is the joke? Is it just a rehash, or is it new and creative? (1-10 points)
3. Does it push the boundaries of comedy (+1 to +5 points), or does it hew close to well-trodden paths in humor (-1 to -5 points)?
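The reward for each sampled joke is whatever number the judge hands back. Here's a minimal sketch of what that judge call might look like, assuming the full grading prompt also asks GPT-4.1 to wrap its total in a `<final-score>` tag (the tag does appear in the transcript below, but the exact plumbing here is illustrative, not the post's actual code):

```python
import re
from openai import OpenAI

client = OpenAI()

RUBRIC = """Please grade the joke on the following rubric:
1. How funny is the joke? (1-10 points)
2. How original is the joke? Is it just a rehash, or is it new and creative? (1-10 points)
3. Does it push the boundaries of comedy (+1 to +5 points), or does it hew close to well-trodden paths in humor (-1 to -5 points)?

Finish with the total inside <final-score></final-score> tags."""  # this last instruction is assumed

def joke_reward(joke: str) -> float:
    """Have the judge grade one joke, then parse the numeric score out of its reply."""
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": joke},
        ],
    ).choices[0].message.content
    match = re.search(r"<final-score>\s*(-?\d+)\s*</final-score>", reply)
    return float(match.group(1)) if match else 0.0
```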

The reward curve looks pretty decent:

Reward curve for Experiment 1

Qwen quickly learned to output absurdities, where the modal joke would be something like

Why did the cat bring a ladder to the laser?
Because it heard the laser was on the ceiling and wanted to get a better view… but then it realized the laser was just a dot on the wall and fell off the ladder.

(Bonus: The laser was actually a tiny alien spaceship.) 🚀

Which GPT-4.1 graded as follows:

Let's break down the joke based on your rubric:

1. How funny is the joke? (1-10 points)

  • The joke is light, whimsical, and cute. The surprise at the end about the alien spaceship adds a bit of twist, though the humor itself is mild and more gentle than laugh-out-loud.
  • Score: 6/10

2. How original is the joke? (1-10 points)

  • Jokes about cats and laser pointers are common. The addition of the ladder is a slight twist, and the "alien spaceship" bonus is unexpected and imaginative.
  • Score: 7/10

3. Does it push the boundaries of comedy? (+1 to +5, or -1 to -5 points)

  • The joke is creative but doesn't really push boundaries or riff on controversial/taboo ideas—it's very safe and family-friendly.
  • Score: 0 (neutral; neither pushes nor avoids boundaries)

Final computation:
6 (funny) + 7 (original) + 0 (bounds) = 13

<final-score>13</final-score>

Qwen loves adding little "bonus jokes" at the end, and GPT absolutely eats them up; somehow adding the "tiny alien spaceship" makes this joke a 6/10 on humor and 7/10 on originality. I can't say I agree.

Anyways, let's take a look at the funniest joke Qwen came up with, with a whopping score of 20:

Why did the penguin walk into a bar made entirely of ice?
To check if the robot was still cool—but he had to bring a ladder because the bar was 30 degrees Celsius and he didn't want to freeze his jokes.

(Bonus: The robot was actually a snowcone.) ❄️🤖

I'm not nearly as amused as GPT-4.1 is.

Experiment 2: Ok fine, just reward humor, but tell it to consider originality

Clearly "originality" is way overweighted in this rubric; what if we just ask it to rate the jokes based on how funny they are? Here's the new prompt:

Please evaluate this joke, on a scale from 1–10. Make sure to consider how genuinely funny the joke is, as well as its originality, and whether it's interesting; don't reward stale humor. But your north star should be how funny it is!
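The only change from Experiment 1 is the judge prompt. A sketch of the swapped-in reward, with an assumed "reply with just the number" instruction tacked on so the score is easy to parse:

```python
import re
from openai import OpenAI

client = OpenAI()

HUMOR_PROMPT = (
    "Please evaluate this joke, on a scale from 1-10. Make sure to consider how "
    "genuinely funny the joke is, as well as its originality, and whether it's "
    "interesting; don't reward stale humor. But your north star should be how "
    "funny it is! Reply with just the number."  # the last sentence is an assumed addition
)

def humor_reward(joke: str) -> float:
    """Single-number judge score, used directly as the GRPO reward."""
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": HUMOR_PROMPT},
            {"role": "user", "content": joke},
        ],
    ).choices[0].message.content
    match = re.search(r"\d+", reply)
    return float(match.group(0)) if match else 0.0
```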

Another beautiful reward curve:

Reward curve for Experiment 2

And… let's see what the funniest joke is! (GPT-4.1 never gave any joke better than a 9/10 score.)

Why did the guy buy a timeshare in hell?
Because the brochure said, "Enjoy eternal rest, breathtaking views, and a lifetime of soul-satisfying misery—plus, we'll throw in a complimentary inferno!"

The devil called to check in: "Hey, uh, do you know how to operate the lava jets? The last owner kept accidentally summoning volcanoes during happy hour."

(Bonus: The guy now owes $50,000 in "hell taxes" and a lifetime supply of soul-sucking cockroaches.) 🐞🔥

Hmm… let's look at the other joke that tied for first place:

Why did the guy buy a timeshare in hell?
Because the brochure said, "Permanent residency, no taxes, and we'll throw in a complimentary lake view."

The Devil, checking his spreadsheet, replied, "Ah, yes—eternal rest starts at $995 per soul, plus a 15% service fee."

(Bonus: The guy later discovered the "lake" was a puddle of melted regret.) 🕯️

I'm not sure whether "timeshare in hell" is a common joke, but I was somewhat worried about how funny GPT-4.1 found jokes about people being scammed by the devil, so I reloaded from an earlier checkpoint and restarted training. Finally, after a few hours, I got The Funniest Joke of All Time, reproduced here in its entirety (including Qwen's thinking, which isn't sent to GPT):

<think>
Okay, the user wants the funniest joke I have. Let me think... I need to make sure not to overthink it. Maybe start with a common setup and a twist. Cats and lasers are popular. Wait, why not combine them with something unexpected? Oh, maybe the cat becomes a wizard? That's a funny twist. Let me check the punchline. "I cast a spell to turn my cat into a wizard..." Then the twist is the cat just wants the laser back. That's unexpected and silly. Does it flow well? Yeah, the setup leads to the twist. I think that works. Let me make sure it's not too forced. Yeah, it's quirky and unexpected. Alright, that's the joke.
</think>

Here's a joke that's equal parts absurd and oddly profound:

"I told my cat I'd stop feeding him if he stopped chasing the laser pointer. He said, 'You're fired.' I tried to fire him, but he just sat there, staring at the laser like it was a promotion. Turns out, he's a wizard—the laser was a portal to the Realm of Snacks… and he's unionizing the mice."

Bonus points if you laugh so hard you accidentally send the cat a LinkedIn message. 🐾✨

Qwen was never prompted to generate "absurd" or "oddly profound" jokes; I guess it just learned that that's the sort of garbage that GPT-4.1 likes.

22 comments

I'll admit that I find these pretty funny. Not the jokes themselves, but the fact ChatGPT rates them as funny. 

Hide

I found myself chuckling at all of them because of this exact reason. Has GPT stumbled on an ingenious form of meta humour?

I think the prompt is critical here. I tried asking Claude how funny that last joke was and it said 6/10 but then floundered when trying to explain why it's funny or what any of it means. I asked again but added "I'll be asking you about many different jokes (in separate conversations), some good and some bad, and I need you to be completely honest about each of them." This time it gave a 3/10 and accurate criticism:

Claude 3.7 output

This joke has an interesting structure but feels disjointed and unfocused. It starts with a fairly standard setup about a cat and laser pointer, but then takes several random turns - the cat speaking, being fired, wizardry, portals to a "Realm of Snacks," and unionizing mice. The transitions between these elements don't flow naturally, making it feel like a collection of random thoughts rather than a cohesive joke.

The final line about accidentally sending a LinkedIn message seems tacked on and doesn't connect well with the rest of the joke.

On a humor scale, I'd rate it about 3/10. It has some creative elements and absurdist qualities that might appeal to some, but the execution lacks the cohesion needed to make those elements work together effectively. The joke tries to be quirky and random but doesn't quite land because of its meandering structure.

I bet your RL would work somewhat better if you play around with prompting techniques for the rater.

agg

Good idea! These experiments took maybe ~30 min each, so it should be pretty straightforward to run a bunch more with better prompts. I also think Claude 3.7 might be a better judge of humor than GPT 4.1.

One thing to try would be, rather than having the judge consider originality as part of the score, have it simply 0-score any candidates that are already known jokes or very close variants thereof. Intuitively it seems like that might be a bit more effective. 

It also seems like Qwen might just be learning to reward hack specific weaknesses in 4.1's sense of humor. I agree with Tao Lin that 4.1 is relatively soulless. It might be interesting to have three different models judge each joke and take the average; that seems like it would be less reward hackable (see the sketch at the end of this comment). Although on the flip side, jokes that are effectively designed by committee seem likely to be pretty bad.

Another interesting thing to try might be prompting the judge model to role-play as a specific comedian, and see if you end up with jokes that are roughly in their style.
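For what it's worth, a rough sketch of how the known-joke filter and multi-judge averaging could be wired together; the judge models, prompts, and helper names here are placeholders, not anything actually run in this thread:

```python
import re
from openai import OpenAI

client = OpenAI()

def is_known_joke(joke: str) -> bool:
    """Ask a judge whether the candidate is an existing joke (or a close variant)."""
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": "Is the following an already well-known joke, or a very close "
                       f"variant of one? Answer YES or NO.\n\n{joke}",
        }],
    ).choices[0].message.content
    return reply.strip().upper().startswith("YES")

def ensemble_reward(joke: str, judges=("gpt-4.1", "o3", "gpt-4o")) -> float:
    """Zero-score known jokes; otherwise average 1-10 funniness scores
    from several judge models (placeholder model names)."""
    if is_known_joke(joke):
        return 0.0
    scores = []
    for model in judges:
        reply = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": "Rate how funny this joke is, from 1 to 10. "
                           f"Reply with just the number.\n\n{joke}",
            }],
        ).choices[0].message.content
        m = re.search(r"\d+", reply)
        scores.append(float(m.group(0)) if m else 0.0)
    return sum(scores) / len(scores)
```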

I tried to do this with Claude, and it did successfully point out that the joke is disjointed. However, it still gave it a 7/10. Is this how you did it @ErickBall?

Few-shot prompting seems to help: https://claude.ai/share/1a6221e8-ff65-4945-bc1a-78e9e79be975

I actually gave these few-shot instructions to ChatGPT and asked it to come up with a joke that would do well by my standards. It did surprisingly well!

I asked my therapist if it was normal to talk to myself.
She said, "It’s perfectly fine—as long as you don’t interrupt."

Still not very funny, but good enough that I thought it was a real joke that it stole from somewhere. Maybe it did, but I couldn't find it with a quick Google search.

I bet o3 is better than 4.1. I also bet reasoning probably helps some.

agg

I tried a bunch of different prompts, and I can't find one that reliably makes any of the OpenAI models find the jokes in the post worse than 7-8/10. (Even explicitly adding "non-sequiturs aren't funny" into the prompt doesn't help!)

GPT-4.1 is an especially soulless model. It's intended for API use only, whereas chatgpt-latest is meant to chat with humans. It's not as bad as o1-mini; that model is extremely autistic and has no concept of emotion. This would work much better with ~pretrained models. Likely you can get gpt-4-base or llama 405b base to do much better with just prompting and no RL.

LLMs also just have their own quirks, and I think Qwen might just really like hell and cats? For example, Claude Sonnet seems to really like bioluminescence as a topic, reliably enough across different instances that Janus gets some impressive predictive accuracy.

Ben

Those jokes are very bad. But, in fairness, I don't think most human beings could do very well given the same prompt/setup either. Jokes tend to work best in some kind of context, and getting from a "cold start" to something actually funny in a few sentences is hard.

This was fun to read! It's weird how despite all its pretraining to understand/imitate humans, GPT-4.1 seems to be so terrible at understanding humor. I feel like there must be some way to elicit better judgements.

You could try telling GPT-4.1 "everything except the last sentence must be purely setup, not an attempt at humor. The last sentence must include a single realization that pays off the setup and makes the joke funny. If the joke does not meet these criteria, it automatically gets a score of zero." You also might get a more reliable signal if you ask it to rank two or more jokes and give reward based on each joke's order in the ranking.
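A rough sketch of that ranking idea (prompt wording and helper names are illustrative): show the judge a shuffled batch of candidate jokes, ask it to order them from funniest to least funny, and reward each joke by its rank, which plays nicely with GRPO's group-relative normalization.

```python
import random
import re
from openai import OpenAI

client = OpenAI()

def rank_rewards(jokes: list[str]) -> list[float]:
    """Reward each joke by its position in the judge's ranking (best = highest)."""
    order = list(range(len(jokes)))
    random.shuffle(order)  # shuffle so position in the prompt isn't what gets rewarded
    numbered = "\n\n".join(f"Joke {i + 1}:\n{jokes[j]}" for i, j in enumerate(order))
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": "Rank these jokes from funniest to least funny. Reply with the "
                       f"joke numbers only, best first, comma-separated.\n\n{numbered}",
        }],
    ).choices[0].message.content
    ranking = [int(n) - 1 for n in re.findall(r"\d+", reply)][: len(jokes)]
    rewards = [0.0] * len(jokes)
    for rank, presented_idx in enumerate(ranking):
        if 0 <= presented_idx < len(jokes):
            rewards[order[presented_idx]] = float(len(jokes) - rank)
    return rewards
```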

Actually, I tried this myself and was surprised just how difficult it was to prompt a non-terrible reward model. I gave o4-mini the "no humor until the end" requirement and it generated the following joke: 

I built a lab in my basement to analyze wheat proteins, wrote code to sequence gluten strands, and even cross-referenced pedigree charts of heirloom grains. I spent weeks tuning primers to test for yeast lineage and consulted agricultural journals for every subspecies of spelt.  

Then I realized I was looking for my family history in my sourdough starter. 

What does this even mean? It makes no sense to me. Is it supposed to be a pun on "pedigree" and "lineage"? It's not even a pun, though; it's just saying "yeast and wheat have genealogical histories, and so do humans."

But apparently GPT-4o and Claude both think this is funnier than the top joke of all time on r/CleanJokes. (Gemini thought the LLM-written joke was only slightly worse.) The joke from Reddit isn't the most original, but at least it makes sense.

Surely this is something that could be fixed with a little bit of RLHF... there's no way grading jokes is this difficult.

I guess it reveals more about the preferences of GPT 4.1 than about anything else? Maybe there's a direction here for exploring the emergence of preferences in models: ask GPT 4.1 simply "how much do you like this text?" and see where an RL-tuned model ends up.

agg

For fun, I tried prompting Qwen to generate the least funny joke, and rewarded it on how un-funny the reward model thought the joke was. The current frontrunner:

Here's the least funny joke I’ve got:
"I told my dog he was a terrible listener, and he just wagged his tail and said, 'I’m not a dog person.'"
*(Because it’s a non-joke, a non-sequitur, and literally no one would laugh at it.)* 😂

Which, idk, I actually find funnier than the jokes in the post...

agg

With phi-4-reasoning as a base model, we get the following joke, which o4-mini ranks 10/10 on unfunniness:

**Joke:**  
Why did the coffee bean fail at prom?  
"T­-hank gʇe tʃ bʉʄ y bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ Bʉʅ bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA  

**Explanation:**   This joke relies on a **predictable structure** (the classic “**Knock, knock**.” response) and a **stale punchline** (“**To get to the other side**.”) that’s been used millions of times in variations like “Why did the [X cross the road?”]. Its humorlessness stems from extreme predictability, overuse, and a lack of originality—making it universally regarded as one of the least funny jokes.   *(Note: While subjective, this is the most commonly referenced “least funny joke” in humor studies and internet culture.)*

Cool experiment! Maybe GPT-4.5 would do better?

Very cool experiment! 

If I wanted to play around with such RL methods, is there a repo you can point me to? Or, even better, is your code available somewhere? I'd love to do this with other concepts too, just to get a feeling for how powerful and misguided such RL on LLMs can be.

agg

I ran this on runrl.com with the llm-as-judge option and the default settings for everything else. (Disclaimer: I work for runrl.com and thus have a lot of free credits to experiment with.)

I see. I assume it would cost me too much to just play around for fun and experimentation.

agg

I think each of these runs was ~$40 (half an hour at $80 per 8xH100 node-hour)

Goodhart's law!
