SWE-Bench Pro is even worse
Yesterday, OpenAI announced that they would no longer be using SWE-Bench Verified, recommending SWE-Bench Pro instead. One of their justifications was that many tasks in SWE-Bench Verified are broken: a correct solution might not be accepted. They are correct about this, and these issues have been documented by others. However, as bad...