I primarily use LLMs for mathematics, which is one of the areas where the recent RL paradigm was a clear improvement: reasoning models are finally useful. However, I agree with you that benchmark-chasing isn’t optimal, in that the models still can’t admit when they’re wrong. A model doesn’t have to give up, but when it can’t do something, I’d rather it list the ideas it tried than pretend it can solve everything, because then I actually have to read through everything myself.

Of course, this could be solved by having some amateur mathematicians read through the outputs and using RL to penalize BS. So, I think this is a case where benchmark performance was prioritized over actual usefulness.
