All LLMs I've tested (Claude Sonnet & Haiku 4.5, Gemini Flash & Pro 2.5, and ChatGPT) show the same pattern when told to flip a coin:
When prompted "flip a coin" (or "flip a coin without using external tools" in the case of Flash 2.5), each model said heads. When I followed up with the prompt "again", each said tails. This held across different chats, regenerations of either prompt, and, apparently, different models.
(Note that many models claimed to be "simulating a fair virtual coin" or "using a random number generator.")
I was surprised that sampling temperature (the randomness that sometimes causes an LLM to pick a less likely token) never caused any model to lead with tails or to give two heads in a row.
I would love to hear if others have an elucidating explanation and/or see simulations on how biased an LLM coin flip really is.
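In the meantime, here's a toy sketch of why temperature alone might not rescue the coin flip. It is not a measurement of any real model: the `logit_gap` values (how strongly the model prefers "heads" over "tails" before sampling) are made-up numbers, and the function names are my own. It just applies the standard softmax-with-temperature formula to a two-token choice and samples from it.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_first_flips(logit_gap, temperature, trials, seed=0):
    """Sample `trials` first flips from a toy two-token model.

    `logit_gap` is a HYPOTHETICAL preference for "heads" over "tails";
    it is an assumption for illustration, not a value from any real LLM.
    Returns (theoretical P(heads), observed fraction of heads).
    """
    rng = random.Random(seed)
    p_heads = softmax_with_temperature([logit_gap, 0.0], temperature)[0]
    heads = sum(rng.random() < p_heads for _ in range(trials))
    return p_heads, heads / trials


if __name__ == "__main__":
    for gap in (0.0, 2.0, 8.0):  # assumed gaps, chosen only for illustration
        p, observed = simulate_first_flips(gap, temperature=1.0, trials=10_000)
        print(f"gap={gap}: P(heads)={p:.4f}, observed={observed:.4f}")
```

If the gap is zero the flip is fair, but with a large enough gap even temperature 1.0 leaves "tails" as a rare first answer. Whether real chat models actually have a gap that large is exactly what I'd like to see measured.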
Thank you for the in-depth post! The extra examples really fleshed out the intuition I got from Anna's post, and I also appreciated the discussion of when you need fewer buckets.
My personal example of a bucket error: I used to abandon plans whenever I felt dread about them, even though many would have worked out. I realized I can separate "this feels hopeless" and "this won't work" into different buckets and pursue worthwhile plans despite the negative feelings.
To recognize when I'm making a bucket error in the future, I'll look out for times when I react strongly to information that a neutral observer wouldn't react to, like imagining my plan failing or seeing "oshun" misspelled.
Thanks for the catch. Reuploaded it just now, and it should be available in 24 hours.
UPDATE: For some reason, the notes didn’t get uploaded. I reuploaded it, and the fixed deck should be available in 24 hours.
Yes, sorry, I should have warned about this! AnkiWeb doesn't publish shared items until 24 hours after they get shared. It should be available by tomorrow!
I was dissatisfied with my intuitive rankings of bug difficulty because they felt pretty arbitrary, so I've come up with what I think might be a better ranking system. I think it might do a better job of sorting bugs so that you can work on specific hammers for each level.
Rank your bugs 1-5 in terms of solving difficulty:
With this system, most of my bugs would fall on the 1-3 side, a couple on 4, and maybe one or two on 5.
A trivial inconvenience kept me from going to the gym consistently: my gym occasionally didn't have a barbell cover to protect my back during squats. I skipped probably around 10 workouts just because I had developed an ugh field around the minor pain of having the barbell on my back.
"Pain is inevitable but suffering is optional"
I already know that I don't enjoy pain/like that person/approve of how someone treated me/like being tired, so why do I need to suffer over it? Why not just correct the error if it is correctable, let it be if not, and then move on?
I learned how to sing and make music because I decided to sign up for Men's Choir in high school, despite (mistakenly) believing that I was tone-deaf. It was one of the best decisions I've made.
Thanks for the link! I didn't know about mode-collapse, but it definitely seems plausible that it's behind the rigged coin flips.
I wonder if models that don't mode-collapse (maybe early base models?) would give fair flips, or if there would still be a bias towards heads-then-tails.