SWE-Bench Pro is even worse — LessWrong