In August, Opus 4.1 was already scoring 80% on this benchmark.
Can someone explain why the SWE-bench Verified page still shows a top score of 75%, which has not changed since June? Are they delayed, are they using different criteria, etc.?
“Well, if I wasn’t conscious, I never would have pressed that start button”
Fairly minor but I think I see an unmentioned error in the "41" section:
the first six positive even integers: 2 + 4 + 6 + 8 + 10 + 11 = 41
11 is not even (it seems to be thinking somewhat of 42?)
Edit: Actually the more I think about it, it's a pretty interesting error. ChatGPT 4 produces much better answers than I would for the majority of these questions, but I don't think I would make this error. If you asked it, I'm sure it would correctly explain that the sum of even integers cannot be odd, or that 11 is not even, etc, but I wonder if (for example) a large amount of training text about the number 42 being the sum of the first six positive even integers was somehow "close" enough to "41" to overwhelm any emergent understanding of how to apply those concepts?
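For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The helper name `first_even_integers` and the variable names are my own, made up for illustration. It shows that the first six positive even integers sum to 42, that the quoted list only reaches 41 because its last term is odd, and that the closed form n(n+1) gives the same 42.

```python
def first_even_integers(n):
    """Return the first n positive even integers: 2, 4, ..., 2n."""
    return [2 * k for k in range(1, n + 1)]


# The terms exactly as quoted in the "41" section.
quoted_terms = [2, 4, 6, 8, 10, 11]

# The actual first six positive even integers and their sum.
evens = first_even_integers(6)
total = sum(evens)

# Closed form: 2 + 4 + ... + 2n = n(n + 1), which is always even.
closed_form = 6 * (6 + 1)

# Any quoted term that is not even.
odd_terms = [t for t in quoted_terms if t % 2 != 0]

print("first six even integers:", evens)
print("their sum:", total, "closed form:", closed_form)
print("sum of quoted terms:", sum(quoted_terms), "odd terms:", odd_terms)
```

Since every even integer is 2k for some integer k, any sum of even integers is itself even, so no list made up only of even terms can add up to 41.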
In 10 to 20 years, when tensor processors are cheap and power-efficient, it will be common for networks of self-replenishing autonomous drones to surveil and police vast areas of land.
Is there a betting market for this?
A fascinating eukaryotic exception is the white-throated sparrow, which functionally has four genders in an equilibrium where the tan-striped males mostly mate with white-striped females and vice versa. (I first read about it in Joan Strassmann’s book Slow Birding; the Wikipedia page for White-Throated Sparrow also has some introductory info. It seems to involve a chromosomal inversion.)
Thank you. You make a good case for including this as evidence that capabilities are increasing. I suppose the question is whether they are increasing at the rate needed for short timelines. I think it’s worth asking whether the same-infrastructure performance showing zero improvement in four months is something that would have been expected four months ago. Of course, this is only one metric, over a short timeframe.