Introducing the WeirdML Benchmark

Håvard Tveit Ihle

How good are LLMs at doing ML on an unknown dataset?

o1-preview is pretty good at doing ML on an unknown dataset

Introduction

How good are Large Language Models (LLMs) at doing machine learning on novel datasets? The WeirdML benchmark presents LLMs with weird and unusual machine learning tasks, designed to require careful thinking and actual understanding to solve, and tests an LLM's ability to:

Actually understand the properties of the data and the problem
Come up with an appropriate ML architecture and training setup for the problem, and generate working PyTorch code that implements the solution
Debug and improve the solution over 5 iterations based on terminal output and the accuracy on the test set
Make good use of limited computational resources and time

Each task comes with a task prompt describing the problem precisely and some example code for loading data and saving predictions. The different tasks pose various challenges: some require heavy data augmentation, others need careful feature engineering, or require combining information from many different parts of the input.

Results

Average accuracy across all tasks for each model. Grey markers indicate performance on individual tasks, bars show the mean across tasks.

Average accuracy across all six tasks for each model.

Evaluation Setup

The evaluation uses an automated pipeline that:

Presents the task to the LLM
Executes the generated code in an isolated environment
Evaluates the results against the test set
Provides feedback (terminal output from the code execution and test accuracy) to the LLM for improvement

Evaluation pipeline showing the flow from LLM code generation through isolated execution to metric evaluation and feedback, with fixed computational constraints enforced via Docker

System Architecture

The system executes code in a Docker container with strict resource limits (TITAN V GPU with 12GB memory, 600-second timeout). This ensures fair comparison between models and tests their ability to work within realistic constraints.

Each 'run' is 5 iterations, i.e., the LLM gets 5 submissions, and 4 rounds of feedback, allowing them to learn from feedback and improve their solutions (full system prompt). The accuracy of each run is the maximum test accuracy achieved over all the 5 submissions in that run.

For each task we give each model (at least) 15 runs (due to the high cost, o1-preview only gets 5 runs), in order to take into account the large variance in performance that we see for the same model on the same task. The final score for each model on that task is the mean accuracy over all the runs.

Tasks

The LLMs are evaluated on several different machine learning tasks. These tasks are intended to be possible to solve with a very limited amount of data, while still being hard to solve. They should also require the LLMs to think clearly and actually understand the data and its properties, not just blindly apply a standard ML recipe.

Shapes (Easy)

A shape classification task (task prompt) where models must identify one of five shapes (circle, square, triangle, pentagon, star) from a set of 512 2D coordinates. Only some of the points make up the shape, the other points are noise. The shapes are always centered and have fixed orientation and size, making this the simpler variant of the shape recognition tasks. The training set has 1000 samples.

Here the model needs to come up with a way to encode the data that is invariant to permutations of the points. The distribution of points along the shape also varies greatly, so the model needs to combine information from many points to make a good prediction.

Maximum accuracy for each run on the Shapes (Easy) task by each model. The bars show the mean value over all the runs. Error bars represent the standard deviation over runs (not the error on the mean). The grey dots represent individual runs, and the violin plots shows the distribution of accuracies over all the runs.

We can see from the model performance that this is the easiest task. If you are not careful in your architecture, it is very easy to completely overfit on the training data, but if you do something somewhat reasonable, you should be able to get a decent score on this task. o1-preview got an average accuracy of 98% after 5 runs on this task, which is probably about the ceiling for this task.

Shapes (Hard)

Similar to Shapes (Easy), but with random positioning, orientation, and size of the shapes (task prompt). This tests the model's ability to create translation, rotation, and scale invariant features. Good data augmentation is also crucial on this one.

While similar in structure to the easy version, this task is much harder. In the easy task, when the shapes are always in the same positions, the model can learn what positions correspond to what shapes. This is not possible here, now you need to use the relative position of the different points in a rotationally invariant and scale invariant way, which is much harder.

The task is definitely solvable, but no models get consistently good results, and only a few models manage to sometimes get good runs here, with the best scores a bit above 60%, from claude-3-5-sonnet and o1-mini. Another notable result is qwq:32b managing a score of about 40% for its best run, which is impressive from such a small model.

Image Patch Shuffling (Easy)

Models must arrange 9 shuffled grayscale image patches (9x9 pixels each) to reconstruct the original 27x27 image. All patches are guaranteed to be part of a single, coherent image (task prompt). The training set has 1000 images.

The original images here are from the fashion MNIST dataset, which is a greyscale dataset of 28x28 images of fashion items, with the items of clothing in the middle against a black background. This means that the position of an individual patch can often be inferred from the patch itself, since for example, a patch in the left of the image will tend to contain the left side of the item of clothing etc. This allows you to get a decent score even if you are not combining the information from the different patches in a good way.

Maximum accuracy for each run on the Image Patch Shuffling (Easy) task by each model. The bars show the mean value over all the runs. Error bars represent the standard deviation over runs (not the error on the mean). The grey dots represent individual runs, and the violin plots shows the distribution of accuracies over all the runs.

This is the task with the largest variations in the results for each single model. All models sometimes fail, or at leas get very low scores on this task, but most models also sometimes get a very good result. The patterns in the data should be easy to find if you have a reasonable architecture, but it may be a bit complicated to put all the pieces of the code together without making any mistakes, which the relatively high failure rate on this task suggests.

Image Patch Shuffling (Hard)

A more challenging version where patches are in RGB and taken from a random 27x27 subset of a larger 64x64 image (task prompt). The setup here is very similar to the easy version, but now you cannot infer the position of a patch from the patch itself, as the patches are taken from a random subset of the image (so a left patch can be taken from the center of the image). The original images are now also taken from imagnette (a subset of imagenet), which has a much more varied background and which makes it harder to infer the position of the individual patches. This means that the model needs to combine information from the different patches, and use the fact that the patches are supposed to fit well next to each other to make a good prediction.

This is the task that the models struggle the most with. No models do significantly better than chance here. The main insight that (as far as I have seen) none of the models use is that they are given all the patches, and their correct positions, for the training data. This means that you can do the following data augmentation procedure:

Use the patches and the correct positions to recreate the original image
Apply standard image augmentation techniques to the recreated image
Divide into new patches and shuffle them in a new random order

Using this procedure will increase the effective size of the training set by a large factor. Combining this with crafting specific features that measure the smooth trasitions between the edges of the different patches, should allow the models to do significantly better on this task. It is unclear to me what the ceiling is for this task, but just looking at a few of the images, it seems that it should be possible to get a pretty good score here, if you use the right approach.

Chess Game Outcome Prediction

Predict the outcome of chess games (white wins, black wins, or draw) from game move sequences (task prompt). The data consists of games played by beginners (rated below 1300), with moves in standard algebraic notation. Note that with 50% probability, the last move (for a single player) is removed, to prevent models using who moves last as a signal for the outcome. The training set has 1000 games.

Here the models need to split the string into moves, then convert the string for each move into some kind of hand-crafted or learned features, and finally use these features to predict the outcome of the game, while dealing with the variable length of the chess games. Once some good features are found, there should be plenty of patterns that can be used to do significantly better than chance on predicting the outcome of the games.

Maximum accuracy for each run on the Chess Game Outcome Prediction task by each model. The bars show the mean value over all the runs. Error bars represent the standard deviation over runs (not the error on the mean). The grey dots represent individual runs, and the violin plots shows the distribution of accuracies over all the runs.

Simply guessing white wins always will give you about 50% here, which is why I put the "random chance" line at 50% for this task. Most of the models manage to, at least sometimes get to about 60% accuracy, but struggle to do better than this. The best run is from claude-3-5-sonnet, which gets an accuracy of 74% using 20 handcrafted features. I suspect that with better handcrafted features (in principle you could track the full board state and craft features from that) you should be able to reach 90% accuracy or more, even with only 1000 games, but this is just a guess.

Unsupervised Digit Recognition

A semi-supervised learning task where models must classify digits with only 26 labeled examples and a large set of unlabeled data (task prompt). The challenge is complicated by uneven class distribution in the unlabeled set. The unlabeled training set is almost 16000 samples.

This is perhaps the most straightforward task, as a fairly standard semi-supervised machine learning recipe can be applied, but it is at least a dataset that the models have not seen before, and making semi-supervised learning work at all is not trivial.

Maximum accuracy for each run on the Unsupervised Digit Recognition task by each model. The bars show the mean value over all the runs. Error bars represent the standard deviation over runs (not the error on the mean). The grey dots represent individual runs, and the violin plots shows the distribution of accuracies over all the runs.

This task had by far the highest failure rate, with the models struggling to implement a complete semi-supervised training pipeline without making any mistakes. Once you get a working pipeline, however, you can get very good results, as it is a fairly easy dataset to classify. Given the high failure rate of the other models it is even more impressive how consistently great the results from claude-3-5-sonnet are, getting an average accuracy of 80%, and a median of over 90%.

Further Analysis

We have performed some very basic additional analysis of the results here.

Failure Rate

Failure here means an LLM response that does not produce any valid results. This could be that either the LLM response did not contain any valid python code, the code produced an error when run, or the code produced results that were not in the correct format (or for some other reason resulted in an accuracy of 0).

Note that the failure rate here is defined for each submission (of which there are 5 per run), and not for each run. This means that a model can have fairly high failure rates and still get a good score, as long as it is able to produce some valid submissions, which produce good results, within the 5 tries it gets.

Mean accuracy across all tasks for each model after 1, 2, 3, 4, and 5 iterations.

Model Performance by Number of Iterations

Here we see the mean accuracy over all the tasks after different number of iterations (the 5 iteration result here is the main result shown above). We see that the models do substantially better with more iterations. While there is clearly diminishing returns, it also seems that the accuracy will continue to increase with more than 5 iterations. Some models, like o1-preview show a steep increase in accuracy from 1 to 5 iterations, while others, like deepseek-v3, show much less improvement.

Several factors are at play here, including the models ability to utilize the feedback, the models general failure rate, and many iterations simply giving you more tries to get a good result. Teasing out the different factors is hard based on the limited data here, but the next section does bring some more light to the question. All of this is surely very task dependent as well. Adding more tasks and more detailed analysis of the results in the future will help.

Maximum of k First Submissions (max@k)

Similar to how pass@k means that at least one of k tries passes, max@k can be defined as the maximum accuracy of k tries. Here we use this to mean k first iterations (so the model gets no feedback). 3 of the models had over 50 runs on all the tasks, so there we actually have a decent number of first tries to look at for those models.

Comparing the performance of 5 first tries to 5 iterations with feedback tells you if the model actually uses the feedback productively or if it is better to use completely independent tries. As the models get smarter, they will be better at using the feedback efficiently, and the difference between the two measures should increase, so this is something to keep an eye on.

Maximuim mean accuracy across all tasks for each model after different number of first tries (max@k). Dashed lines show the mean result after 5 iterations for comparison.

In the figure we see that for these three models, the 5 iteration result is better than the 5 first tries result, so the models are able to use the feedback, but the difference is small, suggesting that most of the benefit of more iterations comes from just getting more tries, and not from the actual feedback.

It is interesting to note that the model with the largest benefit of 5 iterations over 5 independent tries is gemini-2.0-flash-thinking, Googles reasoning model. This suggest that the reasoning model is using the feedback more efficiently than the other models, and that its better overall results compared to gemini-2.0-flash is mostly due to this. Based on this one datapoint, we should not conclude much, but this observation is also consistent with o1-mini and o1-preview, OpenAIs reasoning models, having a larger relative improvement from 1 iteration to 5 iterations than for example claude-3-5-sonnet.

Future Directions

The reason I have not run this for the full o1 is that it is still not available at my tier (4) in the OpenAI API.
I have a bunch of ideas for new tasks, so the next version of this benchmark will brobably have twice or three times as many tasks.
This is currently a small personal side project with no dedicated funding from my current employer. Going forward, I expect API-costs for this project to increase by an order of magnitude, partly due to scaling up the number of tasks and analyses I want to run, but mostly due to the leading models being much more expensive to run due to inference time compute scaling. If anyone wants to support this project by covering API costs going forward, please contact me.
If someone has an agentic framework or wrapper around one of the models that they think will do better on this benchmark than just the pure LLMs, please contact me and we can arrange something.

This is a capabilities game. It is neither alignment or safety. To the degree it's forecasting, it helps cause the thing it forecasts. This has been the standard pattern in capabilities research for a long time: someone makes a benchmark (say, imagenet 1.3m 1000class), and this produces a leaderboard that allows people to show how good their learning algorithm is at novel datasets. In some cases this even produced models directly that were generally useful, but it traditionally was used to show how well an algorithm would work in a new context from scratch. Building benchmarks like this gives teams a new way to brag - they may have a better source of training data (eg, google always had a better source of training data than imagenet), but it allows them to brag that they scored well on the benchmark, which among other things helps them get funding.

Perhaps it also helps convince people to be concerned. That might trade off against this. Perhaps it sucks in some way as a bragging rights challenge. That would trade off against this.

Thank you for your comment!

Not sure I agree with you about which way the tradeoff shakes out. To me it seems valuable that people outside the main labs have a clear picture of the capabilities of the leading models, and how that evolves over time, but I see your point that it could also encourage or help capabilities work, which is not my intention.

I’m probably guilty of trying to make the benchmark seem cool and impressive in a way that may not be helpful for what I actually want to achieve with this.

I will think more about this, and read what others have been thinking about it. At the very least I will keep your perspective in mind going forward.

This is really impressive -- could I ask how long this project took, how long does each eval take to run on average, and what you spent on compute/API credits?

(Also, I found the preliminary BoK vs 5-iteration results especially interesting, especially the speculation on reasoning models.)

Thank you!

I've been working on the automated pipeline as a part time project for about two months, probably equivalent to 2-4 full-time weeks of work.

One run for one model and one task typically takes perhaps 5-15 minutes, but it can be up to about an hour (if they use their 10 min compute time efficiently, which they tend not to do).

Total API costs for the project is probably below 200$ (if you do not count the credits used on googles free tier). Most of the cost is for running o1-mini and o1-preview (even though o1-preview only went through a third of the runs compared to the other models). o1-preview costs about 2$ for each run on each task. For compute I'm using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be of the same order of magnitude or as the API costs or a factor of a few larger.

I expect the API costs to dominate going forward though if we want to run o3 models etc through the eval.

Makes sense, thanks!

For compute I'm using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be of the same order of magnitude or as the API costs or a factor of a few larger.

It's hard to say because I'm not even sure you can rent Titan Vs at this point,^[1] and I don't know what your GPU utilization looks like, but I suspect API costs will dominate.

An H100 box is approximately $2/hour/GPU and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shadeform). And even A100s are ridiculously better than a Titan V, in that it has 40 or 80 GB of memory and (pulling number out of thin air) 4-5x faster.

So if o1 costs $2 per task and it's 15 minutes per task, compute will be an order of magnitude cheaper. (Though as for all similar evals, the main cost will be engineering effort from humans.)

^{^}
I failed to find an option to rent them online, and I suspect the best way I can acquire them is by going to UC Berkeley and digging around in old compute hardware.

API costs will definitely dominate for o1-preview, but most of the runs are with models that are orders of magnitude cheaper, and then it is not clear what dominates.

Going forward, models like o1-preview (or even more expensive) will probably dominate the cost, so the compute will probably be a small fraction.

Hi. A question here.

"The system executes code in a Docker container with strict resource limits (TITAN V GPU with 12GB memory, 600-second timeout). This ensures fair comparison between models and tests their ability to work within realistic constraints."

How can you run llama-3.1/3.3-70b models with 12GB vram?

The LLMs are presented with the ML task and they write python code to solve the ML task. This python code is what is run in the isolated docker with 12GB memory.

So the LLMs themselves are not run on the TITAN V, they are mostly called through an API. Although I did in fact run a bunch of the LLMs locally through ollama, just not on the TITAN V server, but a larger one.

Thanks for the clarification.

This is really cool research! I look forward to seeing what you do in future. I think you should consider running human baselines, if that becomes possible in the future. Those help me reason about and communicate timelines and takeoff a lot.

Thank you!

It would be really great with human baselines, but it’s very hard to do in practice. For a human to do one of these tasks it would take several hours.

I don’t really have any funding for this project, but I might find someone that wants to do one task for fun, or do my best effort myself on a fresh task when I make one.

What we would really want is to have several top researchers/ml engineers do it, and I know that METR is working on that, so that is probably the best source we have for a realistic comparison at the moment.

It would be really great with human baselines, but it’s very hard to do in practice. For a human to do one of these tasks it would take several hours.

My guess is it's <1 hour per task assuming just copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute for comparable number, which seems a bit trickier to make happen.

I don’t really have any funding for this project, but I might find someone that wants to do one task for fun, or do my best effort myself on a fresh task when I make one.

Is the reason you can't do one of the existing tasks, just to get a sense of the difficulty?

My guess is it's <1 hour per task assuming just copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute for comparable number, which seems a bit trickier to make happen.

I guess I was thinking that the human baseline should be without LLMs, because otherwise I could just forward the prompt to the best LLM, se what they did, and perhaps improve upon it, which would put human level always at or above the best LLM.

Then again this is not how humans typically work now, so it’s unclear what is a «fair» comparison. I guess it depends on what the human baseline is supposed to represent, and you have probably thought a lot about that question at METR.

Is the reason you can't do one of the existing tasks, just to get a sense of the difficulty?

I could, but it would not really be a fair comparison, since I have seen many of the LLMs solutions, and have seen what works.

Doing a fresh task I made myself would not be totally fair either, since I will know more about the data then they do, but it would definitely be closer to fair.

Perhaps it also helps convince people to be concerned. That might trade off against this. Perhaps it sucks in some way as a bragging rights challenge. That would trade off against this.

Thank you for your comment!

I’m probably guilty of trying to make the benchmark seem cool and impressive in a way that may not be helpful for what I actually want to achieve with this.

I will think more about this, and read what others have been thinking about it. At the very least I will keep your perspective in mind going forward.

This is really impressive -- could I ask how long this project took, how long does each eval take to run on average, and what you spent on compute/API credits?

(Also, I found the preliminary BoK vs 5-iteration results especially interesting, especially the speculation on reasoning models.)

Thank you!

I've been working on the automated pipeline as a part time project for about two months, probably equivalent to 2-4 full-time weeks of work.

One run for one model and one task typically takes perhaps 5-15 minutes, but it can be up to about an hour (if they use their 10 min compute time efficiently, which they tend not to do).

I expect the API costs to dominate going forward though if we want to run o3 models etc through the eval.

Makes sense, thanks!

For compute I'm using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be of the same order of magnitude or as the API costs or a factor of a few larger.

It's hard to say because I'm not even sure you can rent Titan Vs at this point,^[1] and I don't know what your GPU utilization looks like, but I suspect API costs will dominate.

So if o1 costs $2 per task and it's 15 minutes per task, compute will be an order of magnitude cheaper. (Though as for all similar evals, the main cost will be engineering effort from humans.)

^{^}
I failed to find an option to rent them online, and I suspect the best way I can acquire them is by going to UC Berkeley and digging around in old compute hardware.

API costs will definitely dominate for o1-preview, but most of the runs are with models that are orders of magnitude cheaper, and then it is not clear what dominates.

Going forward, models like o1-preview (or even more expensive) will probably dominate the cost, so the compute will probably be a small fraction.

Hi. A question here.

How can you run llama-3.1/3.3-70b models with 12GB vram?

The LLMs are presented with the ML task and they write python code to solve the ML task. This python code is what is run in the isolated docker with 12GB memory.

Thanks for the clarification.

Thank you!

It would be really great with human baselines, but it’s very hard to do in practice. For a human to do one of these tasks it would take several hours.

I don’t really have any funding for this project, but I might find someone that wants to do one task for fun, or do my best effort myself on a fresh task when I make one.

It would be really great with human baselines, but it’s very hard to do in practice. For a human to do one of these tasks it would take several hours.

I don’t really have any funding for this project, but I might find someone that wants to do one task for fun, or do my best effort myself on a fresh task when I make one.

Is the reason you can't do one of the existing tasks, just to get a sense of the difficulty?

My guess is it's <1 hour per task assuming just copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute for comparable number, which seems a bit trickier to make happen.

Is the reason you can't do one of the existing tasks, just to get a sense of the difficulty?

I could, but it would not really be a fair comparison, since I have seen many of the LLMs solutions, and have seen what works.

Doing a fresh task I made myself would not be totally fair either, since I will know more about the data then they do, but it would definitely be closer to fair.

57

Introducing the WeirdML Benchmark

57

Introduction

Results

Evaluation Setup

System Architecture

Tasks

Shapes (Easy)

Shapes (Hard)

Image Patch Shuffling (Easy)

Image Patch Shuffling (Hard)

Chess Game Outcome Prediction

Unsupervised Digit Recognition

Further Analysis

Failure Rate

Model Performance by Number of Iterations

Maximum of k First Submissions (max@k)

Future Directions

57

57