o1-preview is pretty good at doing ML on an unknown dataset
Previous post: How good are LLMs at doing ML on an unknown dataset?

A while back I ran some evaluation tests on GPT-4o, Claude Sonnet 3.5 and Gemini Advanced to see how good they were at doing machine learning on a completely novel, and somewhat unusual, dataset. The data is basically 512 points in the 2D plane, where some of the points make up a shape, and the goal is to classify each example according to which shape the points form. None of the models did better than chance on the original (hard) dataset, while they did somewhat better on a much easier version I made afterwards.

With the release of o1-preview, I wanted to quickly run the same test on o1, just to see how well it did. In summary, it basically solved the hard version of my previous challenge, achieving 77% accuracy on the test set on its fourth submission (this increases to 91% if I run it for 250 instead of 50 epochs), which is really impressive to me.

Here is the full conversation with ChatGPT o1-preview.

In general, o1-preview seems like a big step change in the ability to reliably do hard tasks like this without any advanced scaffolding or prompting to make it work.

Detailed discussion of results

The architecture that o1 went for in the first round is essentially the same one that Sonnet 3.5 and Gemini went for: a PointNet-inspired model which extracts features from each point independently. While it managed to do slightly better than chance on the training set, it did not do well on the test set.

For round two, it went for the approach (which Sonnet 3.5 also came up with) of binning the points in 2D into an image, and then using a regular 2D convnet to classify the shapes. This worked somewhat on the first try: it completely overfit the training data, but got to an accuracy of 56% on the test data.

For round three, it understood that it needed to add data augmentations in order to generalize better, and it implemented scaling, translations and rotations of the data. It also switched to a slightl
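For concreteness, here is a minimal sketch (in PyTorch, my own reconstruction rather than o1's actual code) of what the round-one, PointNet-style architecture typically looks like: a shared MLP applied to each point independently, followed by a global max-pool over the 512 points and a classification head. The layer sizes and number of classes are made up for illustration.

```python
import torch
import torch.nn as nn

class PointNetClassifier(nn.Module):
    """Toy PointNet-style model: per-point features + global pooling.

    My own sketch of the kind of architecture described above,
    not the code o1 produced. num_classes is an assumption.
    """

    def __init__(self, num_classes: int = 4):
        super().__init__()
        # Shared MLP applied to every point independently (input is x, y).
        self.point_mlp = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
        )
        # Classification head on the pooled global feature.
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, 512, 2)
        features = self.point_mlp(points)       # (batch, 512, 256)
        pooled = features.max(dim=1).values     # permutation-invariant max-pool
        return self.head(pooled)                # (batch, num_classes)
```

The max-pool makes the model invariant to point ordering, but each point is featurized without seeing the others, so all of the shape information has to survive the pooling step.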
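The rounds two and three approach could look roughly like the sketch below: randomly scale, rotate and translate the point cloud (the augmentations mentioned above), rasterize it into a binary image by 2D binning, and feed that to a small convnet. Again, this is a hedged reconstruction; the grid size, augmentation ranges and the network itself are my assumptions, not o1's actual choices.

```python
import math
import torch
import torch.nn as nn

GRID = 32  # assumed resolution for binning the points into an image


def augment(points: torch.Tensor) -> torch.Tensor:
    """Random rotation, scaling and translation of a (512, 2) point cloud."""
    angle = torch.rand(1).item() * 2 * math.pi
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, -s], [s, c]])
    scale = 0.8 + 0.4 * torch.rand(1)       # assumed range [0.8, 1.2]
    shift = 0.1 * (torch.rand(1, 2) - 0.5)  # assumed small translation
    return points @ rot.T * scale + shift


def rasterize(points: torch.Tensor, grid: int = GRID) -> torch.Tensor:
    """Bin points (assumed roughly in [-1, 1]^2) into a (1, grid, grid) occupancy image."""
    idx = ((points.clamp(-1, 1) + 1) / 2 * (grid - 1)).long()
    img = torch.zeros(1, grid, grid)
    img[0, idx[:, 1], idx[:, 0]] = 1.0
    return img


class ShapeConvNet(nn.Module):
    """Small 2D convnet over the rasterized point cloud."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (GRID // 4) ** 2, num_classes),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.net(images)


# Usage: img = rasterize(augment(points)); logits = ShapeConvNet()(img.unsqueeze(0))
```

Rasterizing turns the task into ordinary image classification, and augmenting the points before binning is what discourages the pure memorization seen in round two.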
For aggregating several different benchmarks there is a natural way to avoid the y-axis problem, one that introduces a new, natural y-axis with a clear interpretation.
The idea is to use an Elo system.
Treat each benchmark as a set of individual contests between all pairs of models, with only win or lose as outcomes, and update Elo ratings accordingly.
This converges if you have enough different benchmarks, but of course loses a lot of the signal if you only have a few (since it discards the information about how large the difference in y is).
Here is an example of this approach used (from a while back): https://x.com/scaling01/status/1919389344617414824?s=20
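As a concrete illustration, here is a minimal sketch of that aggregation scheme (my own illustrative code, not the implementation behind the linked plot): every benchmark contributes one win/lose game per pair of models, decided by which model scores higher, and ratings are updated with the standard Elo formula. The K-factor, starting rating, and the example models and scores are arbitrary.

```python
from itertools import combinations

K = 32           # assumed K-factor
START = 1000.0   # assumed starting rating


def elo_from_benchmarks(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores maps benchmark name -> {model name -> score on that benchmark}.

    Each benchmark is treated as a round-robin of win/lose games between
    all pairs of models; the magnitude of the score gap is ignored.
    """
    models = {m for per_bench in scores.values() for m in per_bench}
    rating = {m: START for m in models}
    for per_bench in scores.values():
        for a, b in combinations(sorted(per_bench), 2):
            # Expected score of a against b under the Elo model.
            expected_a = 1 / (1 + 10 ** ((rating[b] - rating[a]) / 400))
            # Win = 1, loss = 0, exact tie = 0.5.
            if per_bench[a] > per_bench[b]:
                actual_a = 1.0
            elif per_bench[a] < per_bench[b]:
                actual_a = 0.0
            else:
                actual_a = 0.5
            rating[a] += K * (actual_a - expected_a)
            rating[b] += K * ((1 - actual_a) - (1 - expected_a))
    return rating


# Toy example: two benchmarks, three hypothetical models.
print(elo_from_benchmarks({
    "bench1": {"model_a": 0.82, "model_b": 0.71, "model_c": 0.55},
    "bench2": {"model_a": 0.40, "model_b": 0.48, "model_c": 0.40},
}))
```

With many benchmarks the ratings settle down regardless of the order the games are processed in; with only a few, the result is noisy and, as noted above, the size of the score gaps is thrown away.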