With phi-4-reasoning as a base model, we get the following joke, which o4-mini ranks 10/10 on unfunniness:
**Joke:**
Why did the coffee bean fail at prom?
"T-hank gʇe tʃ bʉʄ y bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ bʉʅ Bʉʅ bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Fʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Bʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ Wʉʅ AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA AA-AA
**Explanation:** This joke relies on a **predictable structure** (the classic “**Knock, knock**.” response) and a **stale punchline** (“**To get to the other side**.”) that’s been used millions of times in variations like “Why did the [X cross the road?”]. Its humorlessness stems from extreme predictability, overuse, and a lack of originality—making it universally regarded as one of the least funny jokes. *(Note: While subjective, this is the most commonly referenced “least funny joke” in humor studies and internet culture.)*
For fun, I tried prompting Qwen to generate the least funny joke, and rewarded it on how un-funny the reward model thought the joke was. The current frontrunner:
Here's the least funny joke I’ve got:
"I told my dog he was a terrible listener, and he just wagged his tail and said, 'I’m not a dog person.'"
*(Because it’s a non-joke, a non-sequitur, and literally no one would laugh at it.)* 😂
Which, idk, I actually find funnier than the jokes in the post...
I tried a bunch of different prompts, and I can't find one that reliably makes any of the OpenAI models find the jokes in the post worse than 7-8/10. (Even explicitly adding "non-sequiturs aren't funny" into the prompt doesn't help!)
I think each of these runs was ~$40 (half an hour at $80 per 8xH100 node-hour)
I ran this on runrl.com with the llm-as-judge option and the default settings for everything else (disclaimer: I work for runrl.com and thus have a lot of free credits to experiment with)
Good idea! These experiments took maybe ~30 min each, so it should be pretty straightforward to run a bunch more with better prompts. I also think Claude 3.7 might be a better judge of humor than GPT 4.1.
Seconded; just last week I had been wishing that something like this existed!
For what it's worth, runrl.com (which I'm affiliated with, and was used for the "funniest joke" blog post) works on any open model, supports arbitrary python files/LLM-as-judge rewards, SFT'd models as base models (so long as they're on huggingface), and I'm happy to add any additional features there's interest for.