This really resonates with me. I don't work in AppSec, but I've seen how benchmark gains often fail to show up when you're doing something non-trivial with the model. It seems that current benchmarks have low ecological validity. That said, I wouldn't be quick to blame labs for possibly cheating. They may or may not be, but it might also just be that we're bad at designing evaluations that track real-world usefulness.
When you think about it, even university exams don't really predict job performance either. These are benchmarks we've had centuries to re...