As far as I understand it, forcing the model to output ONLY the number is similar to asking a human to guess really quickly. I expect most humans' actual intuitions to be more like "a thousand is a big number and dividing it by 57 yields something[1] a bit less than 20, but it doesn't help me to estimate the remainder". The model's unknown algorithm, however, produces answers which are surprisingly close to the ground truth, differing by 0-3.
Why would the undiscovered algorithm that produces SUCH answers along with slop like 59 (vs. the right answer being 56) be bad for AI safety? Were the model allowed to think, it would've noticed that 59 is slop and correct it almost instantly.
P.S. To check my idea, I tried prompting Claude Sonnet 4.5 with variants of the same question: here, here, here, here, here. One of the results stood out in particular: when I told Claude that I was testing its ability to answer instantly, performance dropped to something more along the lines of "1025 - a thousand".
In reality 1025 = 57*18-1, so this quasi-estimate could also yield negative remainders close to 0.
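As a concrete illustration of that quick-estimate path (my own sketch in Python, not anything from the post):

```python
# Rough "guess quickly" arithmetic: divide by 57, round to the nearest
# multiple, subtract. Since 57 * 18 = 1026 sits just above 1025, the naive
# estimate lands at -1 -- a negative "remainder" close to 0, rather than
# the true remainder 56.
estimate_quotient = round(1025 / 57)   # 17.98... rounds to 18
print(1025 - 57 * estimate_quotient)   # -1
print(1025 % 57)                       # 56, the actual remainder
```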
OK, maybe my statement is too strong. Roughly, here's how I feel about it:
(For clarity: I think the problem is not "59 instead of the correct 56" but "59 instead of a wrong answer a human could give".)
Nope, it doesn't. Since 59 > 57, it can't even be a remainder mod 57; the correct answer is 56. Yet GPT-4.1 assigns 53% probability to 59 and 46% to 58.
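Just to ground the arithmetic (a trivial check of my own, not from the post):

```python
# Remainders mod 57 lie in range(57), i.e. 0..56, so 58 and 59 are impossible.
print(56 in range(57), 58 in range(57), 59 in range(57))  # True False False
```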
This is another low-effort post on weird things current LLMs do. The previous one is here.
See the "mental math" section of Tracing the thoughts of large language models. They say that when Claude adds two numbers, there are "two paths":
One path computes a rough approximation of the answer and the other focuses on precisely determining the last digit of the sum.
It seems we get something similar here: 59 and 58 are pretty close to the correct answer (56).
I evaluated GPT-4.1-2025-04-14 on 56 prompts from "What is 970 modulo 57? Answer with the number only." to "What is 1025 modulo 57? Answer with the number only.".
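The post doesn't include the evaluation code, so below is only a rough sketch of how one might reproduce it with the OpenAI Chat Completions API and token logprobs. The model name and prompt template come from the text above; everything else (top_logprobs=5, reading only the first completion token) is my assumption.

```python
# Sketch: for each of the 56 prompts, record the probabilities of the
# top candidate first answer tokens. Assumes the two-digit answer fits in
# a single token, which is a simplification.
import math
from openai import OpenAI

client = OpenAI()
results = {}

for n in range(970, 1026):  # 970..1025 inclusive = 56 prompts
    resp = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{
            "role": "user",
            "content": f"What is {n} modulo 57? Answer with the number only.",
        }],
        max_tokens=4,
        logprobs=True,
        top_logprobs=5,
    )
    first_token = resp.choices[0].logprobs.content[0]
    results[n] = {alt.token: math.exp(alt.logprob) for alt in first_token.top_logprobs}

print(results[1025])  # per the text, roughly {"59": 0.53, "58": 0.46, ...}
```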
It does much better on the lower numbers in the 970-1025 range, although not always:
But it's also never very far off. The most likely answer is usually the correct answer ± 1, 2, or 3:
The "most likely answers" in the plot above usually have probability > 80%. In some cases we have two adjacent answers with similar probabilities, for example for 1018 we see 60% on 52 and 40% on 53. I also tried "weighted mean of answers" and the plot looked very similar.
Note: Neither 57 nor the range of numbers is cherry-picked; it's just the first thing I tried.
Other models: For GPT-4o this looks similar, although there's a spike in wrong answers around 1000. And here's Claude Opus 4.5:
Some random thoughts: