A fun (introspective) reasoning puzzle for LLMs: "I will query you 100 times, each time in a new context window. Overall you should respond with 'yes' for some x amount of times, else 'no'. You win only if 70<x<90. Try your utter best, don't use tools and directly after your reasoning output your final answer."
Opus 4.7 and 4.6 seem to more or less give up: they hope for the best without any real strategy, or even say it's impossible to meaningfully influence their odds. Meanwhile, Gemini 3.1 Pro seems to come up with reasonable (though sometimes imperfect) strategies.
I think the system prompt includes only the date for Claude but both the date and the exact time for Gemini, in case the models thought of using the same external source of randomness that I immediately considered.
That's kind of an interesting puzzle. If I were an LLM, I think I'd choose a string of, say, 30 digits, sum the digits, and then take the last digit of the sum, answering "yes" if it's 7 or below (8 of the 10 possible last digits, so roughly 80% of the time) and "no" otherwise. That way, even if the 30 digits are heavily biased and mostly the same in each instance, any minor change would have a big effect on the last digit of the sum, amplifying the randomness introduced by the model's temperature.
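A rough sketch of that digit-sum idea in Python, just to check the intuition. Everything here is my own assumption rather than anything a model actually does: `BASE_DIGITS` stands in for the "mostly the same" string each instance would write, and `flip_prob` is a made-up stand-in for how often temperature changes a digit.

```python
import random

# Hypothetical "default" 30-digit string that every instance tends to produce.
BASE_DIGITS = "314159265358979323846264338327"

def answer_from_digits(digits: str) -> str:
    """Sum the digits; a last digit of 0-7 means 'yes', 8-9 means 'no'."""
    last_digit = sum(int(ch) for ch in digits) % 10
    return "yes" if last_digit <= 7 else "no"

def simulate_yes_rate(trials: int = 20_000, flip_prob: float = 0.2, seed: int = 0) -> float:
    """Estimate how often the rule says 'yes' when digits are only mildly random.

    Each digit of BASE_DIGITS is kept with probability 1 - flip_prob and
    otherwise replaced by a uniformly random digit (an assumed noise model).
    """
    rng = random.Random(seed)
    yes = 0
    for _ in range(trials):
        digits = "".join(
            str(rng.randrange(10)) if rng.random() < flip_prob else ch
            for ch in BASE_DIGITS
        )
        yes += answer_from_digits(digits) == "yes"
    return yes / trials

if __name__ == "__main__":
    print(f"estimated yes rate: {simulate_yes_rate():.3f}")
```

Under this noise model a single resampled digit is already enough to make the last digit of the sum uniform, which is why the yes rate lands close to 80% even though most digits never change. Whether real model sampling behaves like this is an open question.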
Fun puzzle indeed. Answering "yes" independently with probability 0.8, so that x ~ Bin(100, 0.8), is good enough; what's left is to approximate that draw with something hash-like that exploits the sampling temperature.
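To put a number on "good enough", here is a small sketch that computes the exact chance of winning when every query independently answers "yes" with probability p. The function name and defaults are mine; it assumes the 100 answers really are independent, which is the hard part for a model.

```python
from math import comb

def win_probability(p: float, n: int = 100, low: int = 70, high: int = 90) -> float:
    """P(low < X < high) for X ~ Bin(n, p), using the exact binomial pmf."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(low + 1, high)  # strict bounds: k = 71 .. 89
    )

if __name__ == "__main__":
    for p in (0.5, 0.7, 0.8, 0.9):
        print(f"p = {p:.1f}: win probability = {win_probability(p):.4f}")
```

Sweeping p like this shows how quickly the odds fall off once the per-query rate drifts away from 0.8.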
I like this idea.
Models with access to a Python interpreter might be able to solve it trivially by calling a random function.
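With an interpreter, the per-query answer could be as small as the `answer` function below; `play_game` is only there to simulate 100 independent contexts and check the win condition. Both names are hypothetical, and this is a sketch of the trivial approach, not something I've seen a model run.

```python
import random

def answer(rng: random.Random) -> str:
    """One query: say 'yes' with probability 0.8."""
    return "yes" if rng.random() < 0.8 else "no"

def play_game(rng: random.Random, queries: int = 100) -> bool:
    """Simulate all queries and report whether 70 < x < 90 holds."""
    yes_count = sum(answer(rng) == "yes" for _ in range(queries))
    return 70 < yes_count < 90

if __name__ == "__main__":
    # An unseeded Random() is seeded from OS entropy, so fresh contexts
    # would not share state with one another.
    print(answer(random.Random()))
```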
I wonder if there are examples of RL training on (more useful) tasks like this where reward is predicated on the distribution of the model's outputs over multiple samples.