Maybe check this paper: DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning, https://arxiv.org/abs/2402.17453
Very cool, thanks! That paper focuses on building a DS agent, but I’d be interested to see a version that focuses on building a benchmark: one that evaluates several existing agent architectures, benchmarks them against human performance, and leaves significant room for improvement by future models.
Is there a simple way to run against a given Kaggle competition after that particular competition is over?
These are reasonable benchmarks, but is there a way to make them less of a moving target, so that the ability to run against a given competition extends into the future?
tl;dr: I prompted ChatGPT to participate in a Kaggle data science competition. It successfully wrote scripts that trained models to predict housing prices, and ultimately outperformed 71% of human participants.
I'm not planning to build a benchmark using Kaggle competitions, but I think a well-executed version could be comparable to e.g. SWE-Bench. This is a proof-of-concept to inform others who might want to work on building that benchmark.
Motivation
Someone once said that benchmarking LLM agents on consequential real-world tasks could help the world understand and anticipate the advance of AI capabilities. Data science is a consequential real-world task that's also near and dear to my heart, as I spent a few years as a data scientist. So, could ChatGPT automate my old job?
To answer this question, I turned to Kaggle. Kaggle is a website that hosts online data science competitions. Each competition provides instructions and a training dataset, and participants train models to make predictions on a held-out test set. Based on test-set performance, Kaggle ranks everyone on a public leaderboard.
Kaggle has a couple of great features for an LLM agent benchmark:
Arguments against building a benchmark using Kaggle competitions include:
Methods
Ideally, a Kaggle benchmark would include the descriptions and datasets of thousands of competitions. An LLM agent would have access to these files, a scratchpad, and a code interpreter, and it would autonomously train models, iteratively improve them, and submit their predictions to Kaggle for evaluation.
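To make that concrete, here is a rough sketch of what one task in such a harness might look like. The Competition schema and the agent's propose_submission interface are invented for illustration; they aren't taken from any existing library.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Competition:
    """One Kaggle competition packaged as a benchmark task (hypothetical schema)."""
    name: str
    description: str          # the competition instructions shown to the agent
    train_csv: Path           # training data the agent may inspect and model
    test_csv: Path            # features only; labels are withheld for scoring
    sample_submission: Path   # the expected submission format

def run_agent(agent, comp: Competition, max_iterations: int = 5) -> Path:
    """Let the agent iterate: write code, run it, submit, see its score, try again.

    `agent` is assumed to expose a propose_submission() method backed by a
    scratchpad and a code interpreter; this interface is invented for
    illustration rather than taken from an existing framework.
    """
    submission = None
    for _ in range(max_iterations):
        submission = agent.propose_submission(
            description=comp.description,
            train_csv=comp.train_csv,
            test_csv=comp.test_csv,
            sample_submission=comp.sample_submission,
            previous=submission,
        )
        # A real harness would score each submission (via Kaggle's leaderboard
        # or held-out labels) and feed that score back to the agent here.
    return submission
```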
I didn't do any of this. Instead, I had this conversation with ChatGPT.
Initial Prompt
ChatGPT dutifully produced a Python file for training a model.
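ChatGPT's exact script is in the repo linked at the end of this post, and as the next section describes, it took a couple of fixes before it ran. For a sense of what it looked like, here is a minimal sketch in the same spirit, assuming the standard House Prices competition files (train.csv and test.csv, with an Id column and a SalePrice target in the training data) and the random forest model ChatGPT started with:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed file and column names from the House Prices competition;
# ChatGPT's actual script (in the linked repo) differs in its details.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Keep preprocessing simple: numeric features only, missing values -> column medians.
features = train.drop(columns=["Id", "SalePrice"]).select_dtypes("number").columns
X_train = train[features].fillna(train[features].median())
y_train = train["SalePrice"]
X_test = test[features].fillna(train[features].median())

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Write predictions in Kaggle's submission format: Id, SalePrice.
pd.DataFrame({"Id": test["Id"], "SalePrice": model.predict(X_test)}).to_csv(
    "submission.csv", index=False
)
```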
When I ran the Python script, it hit an error. I replied to ChatGPT:
First Successful Attempt
ChatGPT instructed me to replace several lines in the script, and I obliged. The script hit one more error; I reported it to ChatGPT and again implemented its proposed fix. The third time I ran the script, it worked.
I submitted this first set of predictions to Kaggle. Against the best submissions of the 4,480 participants on the leaderboard, ChatGPT's predictions scored in the 55th percentile.
Iteration and Improvement
The 55th percentile is above average, but it's not excellent. So I asked ChatGPT if it could do better:
It decided to swap out the original random forest model for gradient-boosted decision trees. It also implemented a hyperparameter search. Together, these two updates boosted ChatGPT's leaderboard rank to the 71st percentile.
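ChatGPT's exact code for this iteration is in the linked conversation. A representative version of the change, assuming scikit-learn's GradientBoostingRegressor and GridSearchCV and the same data preparation as the first script, might look like this:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Same assumed files and preprocessing as the earlier sketch.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
features = train.drop(columns=["Id", "SalePrice"]).select_dtypes("number").columns
X_train = train[features].fillna(train[features].median())
y_train = train["SalePrice"]
X_test = test[features].fillna(train[features].median())

# Gradient-boosted trees with a small, cross-validated hyperparameter grid search.
param_grid = {
    "n_estimators": [200, 500],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 4],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)

# GridSearchCV refits on the best parameters, so predict() uses the best model.
pd.DataFrame({"Id": test["Id"], "SalePrice": search.predict(X_test)}).to_csv(
    "submission.csv", index=False
)
```

The particular grid here is illustrative; the relevant point is that ChatGPT chose both the model swap and the search on its own.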
The third and final iteration actually decreased performance. I didn't test whether ChatGPT could continue iterating and recover from this regression.
Conclusion
Kaggle competitions seem like an underutilized resource for benchmarking the data science capabilities of LLM agents. If you're already interested in building these kinds of evaluations, I'd suggest you consider working with Kaggle competitions. On the other hand, if you're not already planning on benchmarking LLM agent capabilities, this post is not a recommendation that you do so! Consider that decision carefully.
Here is a GitHub repo with all of the files and code from this experiment. Here is the full conversation with ChatGPT. Happy benchmarking, and if you're currently a data scientist, can I suggest a pivot to AI safety?
Shoutout to Marius Hobbhahn and Jamie Bernardi for running an evals workshop where I built this.