I feel like you're jumping to cheating way too quickly. I think everyone would agree that there is overfitting to benchmarks and to benchmark-like questions. Also, this is a very hard problem. The average person doesn't have a shot at contributing to security research. Even the typical appsec engineer with years of experience would fail at the task of navigating a new codebase, creating a threat model, and finding important security issues. This takes an expert in the field at least a few days of work, which is much longer than the time periods over which AIs can be expected to work on a problem productively. I don't think you get a...