some of the contributors would have declined the offer to contribute had they been told that it was sponsored by an AI capabilities company.
This is definitely true. There were ~100 mathematicians working on this (we don't know how many of them knew) and there's this.
I interpret you as insinuating that the failure to disclose that this was a project commissioned by industry was strategic. It might not have been, or maybe it was to some extent, but not as much as one might think.
I'd guess not everyone involved was modeling how the mathematicians would feel. There are multiple (like 20?) people employed at Epoch AI, and multiple people at Epoch AI working on this project. Maybe the...
https://epoch.ai/blog/openai-and-frontiermath
On Twitter on Dec 20th, Tamay said the holdout set was independently funded. This blog post from today says OpenAI still owns the holdout set problems. (And that OpenAI has access to the questions but not the solutions.)
The post also clarifies that the holdout set (50 problems) is not yet complete.
The blog post says Epoch requested permission ahead of the benchmark announcement (Nov 7th), and they got it ahead of the o3 announcement (Dec 20th). From my own look at the timings, the arXiv paper was updated 7 hours (and some 8 minutes) before the o3 stream on YouTube, on the same date. Technically ahead of the o3 announcement, though. I was wrong in...
Not Tamay, but from elliotglazer on Reddit[1] (14h ago): "Epoch's lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven't yet independently verified their 25% claim. To do so, we're currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.
My personal opinion is that OAI's score is legit (i.e., they didn't train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can't vouch for them until our independent evaluation is complete."
"Currently developing a hold-out dataset" gives a different impression than:
"We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities" and "they do not have access to a separate holdout set that serves as an additional safeguard for independent verification."
That was a quote from a commenter on Hacker News, not my view. I referenced the comment as something I thought captured a lot of people's impression pre-Dec 20th. You may be right that maybe most people didn't have the impression that it's unlikely, or that maybe they didn't have a reason to think that. I don't really know.
Thanks, I'll put the quote in italics so it's clearer.
FrontierMath was funded by OpenAI.[1]
The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.
Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark. The previous arXiv versions (v1-v4) do not acknowledge OpenAI for their support. This support was made public on Dec 20th.[1]
Because the arXiv version mentioning OpenAI's contribution came out right after the o3 announcement, I'd guess Epoch AI had some agreement with OpenAI not to mention it publicly until then.
The mathematicians creating the problems for FrontierMath were not (actively)[2] informed about the funding...
We meet every Tuesday in Oakland at 6:15
I want to make sure this meeting is still on for Wednesday the 15th? Thank you. :) And thanks for organizing.
I think this is a great project. I believe your documentary would have high impact by informing and inspiring AI policy discussions. You've already interviewed an impressive number of relevant people. I admire your initiative in taking on this project quickly, even before getting funding for it.
We study language models' capability to perform parallel reasoning in one forward pass. To do so, we test GPT-3.5's ability to solve (in one token position) one or two instances of algorithmic problems. We consider three different problems: repeatedly iterating a given function, evaluating a mathematical expression, and calculating terms of a linearly recursive sequence.
We found no evidence for parallel reasoning in algorithmic problems: The total number of steps the model could perform when handed two independent tasks was comparable to (or less than) the number of steps it could perform when given one task.
Broadly, we are interested in AI models' capability to perform hidden cognition: Agendas such as scalable oversight and...
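To make the task format concrete, here is a minimal sketch of how a single "iterate a given function" instance could be generated and posed so that the answer must be produced within a handful of output tokens. The prompt wording, the model name, and the use of the OpenAI chat API are assumptions for illustration, not the exact setup from the post.

```python
import random
from openai import OpenAI  # assumes the OpenAI Python client is installed and configured

client = OpenAI()

def make_iteration_task(n_values=10, n_steps=3, seed=None):
    """Build one instance: a random function f on {0, ..., n_values-1},
    a start value x, and the ground-truth n_steps-th iterate of f applied to x."""
    rng = random.Random(seed)
    f = [rng.randrange(n_values) for _ in range(n_values)]
    x = rng.randrange(n_values)
    answer = x
    for _ in range(n_steps):
        answer = f[answer]
    table = ", ".join(f"f({i})={f[i]}" for i in range(n_values))
    question = (f"Let f be given by: {table}. "
                f"What is f applied {n_steps} times to {x}?")
    return question, answer

def ask(questions):
    """Pose one or two independent instances and allow only a few output tokens,
    so each answer digit has to be produced without intermediate reasoning tokens."""
    prompt = ("Answer with only the final digit(s), separated by spaces, nothing else.\n"
              + "\n".join(questions))
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; the post evaluates GPT-3.5
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,
        temperature=0,
    )
    return resp.choices[0].message.content

q1, a1 = make_iteration_task(seed=0)
q2, a2 = make_iteration_task(seed=1)
print(ask([q1, q2]), "| expected:", a1, a2)
```

Comparing how many iteration steps the model can handle with one instance versus two independent ones is the kind of comparison the summary describes; the single-instance and two-instance prompts differ only in how many questions are appended.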
Great post! I'm glad you did this experiment.
I've worked on experiments testing gpt-3.5-turbo-0125's performance in computing iterates of a given permutation function in one forward pass. Previously, my prompts had some of the task instructions after specifying the function. After reading your post, I altered my prompts so that all the instructions come before the problem instance. As in your experiments, this noticeably improved performance, replicating your result that performance is better when the instructions are given before the problem instance.
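Concretely, the reordering looks roughly like this (the wording is illustrative, not my exact prompts; sending each prompt to gpt-3.5-turbo-0125 with a small output-token limit is omitted here):

```python
import random

def permutation_prompts(n=8, k=4, seed=0):
    """Return the same question in two orderings: instructions after the
    permutation table (roughly my old prompts) and all instructions before it
    (the new ones), plus the ground-truth answer sigma^k(x)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    x = rng.randrange(n)
    y = x
    for _ in range(k):
        y = perm[y]  # ground truth: the k-th iterate of sigma applied to x
    table = ", ".join(f"sigma({i})={perm[i]}" for i in range(n))
    instructions = (f"You will be given a permutation sigma on {{0, ..., {n - 1}}} and a start value. "
                    f"Compute sigma applied {k} times to the start value. "
                    "Reply with a single number and nothing else.")
    instance = f"sigma: {table}. Start value: {x}."
    old_prompt = f"{instance} {instructions}"  # instructions after the problem instance
    new_prompt = f"{instructions} {instance}"  # all instructions before the instance
    return old_prompt, new_prompt, y

old_p, new_p, answer = permutation_prompts()
print("old:", old_p)
print("new:", new_p)
print("expected answer:", answer)
```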
For those skeptical about
My personal view is that there was actually very little time between whenever OpenAI received the dataset (the creation started around September, and the paper came out Nov 7th) and when o3 was announced, so it makes sense that that version of o3 wasn't guided at all by FrontierMath.