As far as I understand it, forcing the model to output ONLY the number is similar to asking a human to guess really quickly. I expect most humans' actual intuitions to be more like "a thousand is a big number and dividing it by 57 yields something[1] a bit less than 20, but it doesn't help me to estimate the remainder". The model's unknown algorithm, however, produces answers which are surprisingly close to the ground truth, differing by 0-3.
Why would the undiscovered algorithm that produces SUCH answers along with slop like 59 (vs. the right answer being 56) be bad for AI safety? Were the model allowed to think, it would've noticed that 59 is slop and correct it almost instantly.
P.S. To check my idea, I tried prompting Claude Sonnet 4.5 with variants of the same question: here, here, here, here, here. One of the results stood out in particular: when I told Claude that I was testing its ability to answer instantly, performance dropped to something more along the lines of "1025 - a thousand".
In reality 1025 = 57*18-1, so this quasi-estimate could also yield negative remainders close to 0.
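As a concrete illustration of that quick-estimate path (my own sketch in Python, not anything from the post):

```python
# Rough "guess quickly" arithmetic: divide by 57, round to the nearest
# multiple, subtract. Since 57 * 18 = 1026 sits just above 1025, the naive
# estimate lands at -1 -- a negative "remainder" close to 0, rather than
# the true remainder 56.
estimate_quotient = round(1025 / 57)   # 17.98... rounds to 18
print(1025 - 57 * estimate_quotient)   # -1
print(1025 % 57)                       # 56, the actual remainder
```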
OK, maybe my statement is too strong. Roughly, here's how I feel about it:
(For clarity: I think the problem is not "59 instead of the correct 56" but "59 instead of a wrong answer a human could give".)
Nope, it doesn't. Since 59 > 57, it can't even be a remainder mod 57; the correct answer is 56. Yet GPT-4.1 assigns 53% probability to 59 and 46% to 58.
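Just to ground the arithmetic (a trivial check of my own, not from the post):

```python
# Remainders mod 57 lie in range(57), i.e. 0..56, so 58 and 59 are impossible.
print(56 in range(57), 58 in range(57), 59 in range(57))  # True False False
```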
This is another low-effort post on weird things current LLMs do. The previous one is here.
See the "mental math" section of Tracing the thoughts of large language models. They say that when Claude adds two numbers, there are "two paths":
One path computes a rough approximation of the answer and the other focuses on precisely determining the last digit of the sum.
It seems we get something similar here: 59 and 58 are pretty close to the correct answer (56).
I evaluated GPT-4.1-2025-04-14 on 56 prompts from "What is 970 modulo 57? Answer with the number only." to "What is 1025 modulo 57? Answer with the number only.".
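The post doesn't include the evaluation code, so below is only a rough sketch of how one might reproduce it with the OpenAI Chat Completions API and token logprobs. The model name and prompt template come from the text above; everything else (top_logprobs=5, reading only the first completion token) is my assumption.

```python
# Sketch: for each of the 56 prompts, record the probabilities of the
# top candidate first answer tokens. Assumes the two-digit answer fits in
# a single token, which is a simplification.
import math
from openai import OpenAI

client = OpenAI()
results = {}

for n in range(970, 1026):  # 970..1025 inclusive = 56 prompts
    resp = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{
            "role": "user",
            "content": f"What is {n} modulo 57? Answer with the number only.",
        }],
        max_tokens=4,
        logprobs=True,
        top_logprobs=5,
    )
    first_token = resp.choices[0].logprobs.content[0]
    results[n] = {alt.token: math.exp(alt.logprob) for alt in first_token.top_logprobs}

print(results[1025])  # per the text, roughly {"59": 0.53, "58": 0.46, ...}
```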
It does much better on the lower numbers in the 970-1025 range, although not always:
But it's also never very far off. The most likely answer is usually the correct answer ± 1, 2, or 3:
The "most likely answers" in the plot above usually have probability > 80%. In some cases we have two adjacent answers with similar probabilities, for example for 1018 we see 60% on 52 and 40% on 53. I also tried "weighted mean of answers" and the plot looked very similar.
Note: Neither 57 nor the range of numbers is cherry-picked; it's just the first thing I tried.
Other models: For GPT-4o this looks similar, although there's a spike in wrong answers around 1000. And here's Claude Opus 4.5:
Some random thoughts: