Have you seen MirrorCode? The big-picture idea is similar to ProgramBench, but we make more effort to ensure the tasks are fair / actually possible. We find AI can solve most of them.
https://epoch.ai/blog/mirrorcode-preliminary-results