Most benchmarks face the inherent problem of Goodhart's law: as soon as they become a target metric, efforts converge on optimizing for the benchmark itself, potentially diverging from the capabilities it was meant to measure.
The methods used to solve these "new" problems are, as I've already stated, highly amenable to brute-force approaches. It's more of a computation problem (see [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)). Again, I'm not surprised these kinds of problems got solved.
While these problems may be new, they rely on methods very similar to those used in previous competitive programming and Math Olympiad problems. I don't think the author has really looked into the specifics of how AlphaGeometry and AlphaProof came up with these solutions. It's honestly disappointing to see that they were able to mislead such people (if he truly hasn't looked into the specifics). It seems more like he wants to use his status to push a particular narrative.
I would bet a lot that this system will fail on almost any combinatorics problem at the moment...
Since this is a point that seemingly causes a lot of confusion and misunderstanding, I'll try to find some time to write down my reasoning and thoughts in a more exhaustive post.
Honestly, I think people are overestimating this. Some quick thoughts:
They already claimed once to be at a 1200 Elo level in competitive programming on Codeforces, but in real competition settings it only reached a level of, if I remember correctly, around 500, as people found the account they used for testing. Also, their solution to one very hard problem was exactly the same as one from a human competitor, so they likely already had this exact problem in their training data, meaning the training data was likely contaminated.
I'm not really surprised that this type of algorithm will perform well on ad-hoc & constructive type problems. Generally, in a lot of...
Interesting! I remember that ChatGPT-3.5-Turbo could, surprisingly, play very long games without any problems. Although it wasn't very good, it almost never hallucinated moves, e.g., https://lichess.org/tKa7ObGI