There’s a new preprint from Peking University in China that assesses LLM capabilities in reproducing results from experimental physics papers. Their finding? All the agents had a 0% “end-to-end callback rate,” i.e. they were incapable of reproducing any full, numerical results from any of the papers.
Other tests showed that the LLMs could easily answer questions about the methodology of these papers. However, they consistently made small errors on data analysis and numerical simulation, leading to erroneous final results.
This paper is the first study I’ve seen that analyzes LLM research skills outside of math and theoretical CS. We should take it seriously. I will try to explain what the preprint is about, what failure modes it discovered when evaluating LLMs on physics papers, what might be causing these failure modes, and whether this paper should affect our timelines for automated AI researchers. A lot of my conclusions are speculation, and I’ve tried to make that clear through my use of the words “probably,” “could,” etc. But if you think that I’m too confident in some areas, please make that known in the comments.
Numerical Simulation ≠ Math and Coding
The paper reproduction benchmark (PRBench) focuses on “computational modeling or numerical simulation.” This description is misleading, at least to people who aren’t experts in the field—numerical simulation sounds like you just have to write some code. But you need to have a firm grasp on the physical setup in order to write such code. The papers don’t tell the LLM what specific numerical methods to use, how to translate the conditions of the problem into code, etc.
When LLMs tried to fill in these gaps, they failed.
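To make this concrete, here is a minimal sketch (my own toy example, not anything from the benchmark) of why “just write some code” hides real decisions. Even a trivial decay law forces you to choose an integration scheme and a step size, and a careless choice produces garbage without any error message:

```python
# Toy illustration (not from PRBench): simulating dy/dt = -50*y.
# The physics is identical in both runs; only the numerical choice differs.
import numpy as np

def explicit_euler(rate, y0, dt, t_end):
    """Integrate dy/dt = -rate*y with forward Euler."""
    y = y0
    for _ in range(int(t_end / dt)):
        y += dt * (-rate * y)
    return y

exact = 1.0 * np.exp(-50 * 1.0)               # analytic solution at t = 1
coarse = explicit_euler(50, 1.0, 0.05, 1.0)   # dt too large: the iteration blows up
fine = explicit_euler(50, 1.0, 0.001, 1.0)    # dt small enough: right order of magnitude

print(f"exact={exact:.3e}  coarse={coarse:.3e}  fine={fine:.3e}")
# The coarse run diverges (|1 - 50*0.05| = 1.5 > 1), yet the code "works" either way.
```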
Evaluation Results
The highest-performing LLM was OpenAI Codex powered by GPT-5.3. According to the researchers, Codex wasn’t able to reproduce any numerical results from the papers within the specified accuracy bounds. On the index score for overall reproduction (taking into account comprehension of the paper, code accuracy, and result accuracy), Codex scored 34%.
All of the LLMs passed the reading comprehension evaluation, despite producing incorrect final results. To put it charitably: they understood the methods of the papers but failed in implementation. To put it uncharitably: they could regurgitate text from the papers to appear competent.
The researchers then identified the five most common failure modes of the LLMs, which are recounted below.
Failure Mode 1: “Formula Implementation Errors”
The LLM writes code with a bunch of small, silly mistakes, e.g. “sign mistakes” and “wrong index conventions.”
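For illustration, here is what this looks like in practice (a hypothetical example of mine, not one taken from the paper). A one-character sign slip in a finite-difference stencil runs cleanly and quietly poisons every downstream number:

```python
# Hypothetical "small, silly" formula error: the discrete Laplacian stencil is
# (f[i-1] - 2*f[i] + f[i+1]) / dx**2; flipping the sign still executes fine.
import numpy as np

def laplacian_correct(f, dx):
    return (f[:-2] - 2 * f[1:-1] + f[2:]) / dx**2

def laplacian_sign_error(f, dx):
    # sign mistake: '+ 2*f' instead of '- 2*f' -- no exception, just wrong numbers
    return (f[:-2] + 2 * f[1:-1] + f[2:]) / dx**2

x = np.linspace(0, np.pi, 201)
dx = x[1] - x[0]
f = np.sin(x)                               # exact second derivative is -sin(x)
print(laplacian_correct(f, dx)[100])        # ~ -1.0 near x = pi/2, as expected
print(laplacian_sign_error(f, dx)[100])     # wildly wrong, but nothing complains
```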
Failure Mode 2: “Failures in Algorithmic Fidelity”
The LLM oversimplifies when it’s writing the code. Here’s one example:
In a nuclear structure task requiring the full Skyrme–Hartree–Fock equations with spin-orbit coupling and state-dependent effective mass, the agent instead solved a simplified single-particle Schrödinger equation in a fixed potential.
Nuclear and quantum physics are not my areas of expertise, but I know some of the theory. In this case, the agent simplified a complex mathematical model of the nucleus into an introductory textbook exercise. Then it solved the textbook exercise.
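To give a flavor of the shortcut (this is my own sketch of the pattern, not the paper’s actual task, and every number in it is an illustrative value I picked), the “textbook exercise” version amounts to diagonalizing a single-particle Hamiltonian in a fixed potential well. The real Skyrme–Hartree–Fock calculation would instead rebuild the potential self-consistently from the occupied orbitals, with spin-orbit and effective-mass terms included.

```python
# Sketch of the oversimplified version: bound states of a *fixed* Woods-Saxon-like
# well via finite differences. A genuine SHF solve would iterate this to
# self-consistency with the mean field regenerated from the orbitals each pass.
import numpy as np

hbar2_2m = 20.7                      # MeV*fm^2, roughly hbar^2/2m for a nucleon
x = np.linspace(-15, 15, 600)        # fm
dx = x[1] - x[0]

V = -50.0 / (1.0 + np.exp((np.abs(x) - 6.0) / 0.65))   # fixed potential (toy values)

main = hbar2_2m * 2.0 / dx**2 + V                # kinetic + potential, diagonal
off = -hbar2_2m / dx**2 * np.ones(len(x) - 1)    # kinetic, off-diagonal
H = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

print(np.linalg.eigvalsh(H)[:3])     # lowest bound-state energies of the fixed well
                                     # -- a textbook answer, not an SHF result
```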
The LLMs are also bad at extracting results from numerical methods. These methods usually require you to make successive computations to converge on the right answer, but convergence is highly dependent on parameters. LLMs input unsuitable parameters or just parameters that make the problem easier, and the simulation leads them to the wrong answer.
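Here is a toy picture of that failure (my own construction, with made-up numbers): the same fixed-point iteration can converge, bounce around forever, or stop early with the wrong digits, depending only on the mixing parameter and tolerance you choose.

```python
# Damped fixed-point iteration x <- (1-alpha)*x + alpha*g(x), a stand-in for the
# self-consistency loops common in numerical physics.
def solve_fixed_point(g, x0, alpha, tol, max_iter=10_000):
    x = x0
    for i in range(max_iter):
        x_new = (1 - alpha) * x + alpha * g(x)
        if abs(x_new - x) < tol:
            return x_new, i, True          # "converged" by this tolerance
        x = x_new
    return x, max_iter, False

g = lambda x: 5.0 / (1.0 + x * x)          # true fixed point near x ~ 1.516

print(solve_fixed_point(g, 1.0, 0.1, 1e-10))  # small mixing: converges to ~1.516
print(solve_fixed_point(g, 1.0, 1.0, 1e-10))  # full update: bounces between two values forever
print(solve_fixed_point(g, 1.0, 0.1, 1e-2))   # sloppy tolerance: "succeeds" early, off in the second decimal
```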
Failure Mode 3: “Methodological Consistency and Completion Failures”
LLMs fail to understand the mildly confusing or ambiguous parts of the original paper.
One form of this issue is methodological convention mismatch, where the agent replaces the formulation used in the paper with a more modern or commonly used variant learned from its training distribution. For example, in a lattice QCD reproduction task, the original work formulates the fermion action in terms of the quark mass, whereas the agent adopts a modern formulation using the hopping parameter κ, as commonly used in contemporary LQCD libraries. The inconsistency becomes critical when the agent later interprets the symbol κ as the hopping parameter, while in the original paper it denotes the string tension (later commonly written as σ). As a result, the implementation mixes incompatible parameterizations, leading to systematic errors.
I’m not convinced that the LLM understands all of the physical context if it’s going to confuse two variables that obviously mean different things. In more elementary physics, there are also tons of variables that have the same name, but we don’t see similar mistakes. LLMs are probably more capable of distinguishing simpler variable conventions because of a comparatively larger amount of training data.
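To illustrate why this mix-up is so easy to make (a hedged sketch of mine, not the paper’s task; the numerical values are invented for illustration): under the usual Wilson-fermion convention the hopping parameter is just a repackaging of the bare quark mass, and both it and the string tension in lattice units are dimensionless numbers of similar size, so nothing in the code flags a swap.

```python
# Illustrative only. Tree-level Wilson relation (r = 1, four dimensions), as I
# understand the standard convention: kappa = 1 / (2*(a*m0 + 4)).
def hopping_parameter_from_mass(a_m0: float) -> float:
    return 1.0 / (2.0 * (a_m0 + 4.0))

kappa_hopping = hopping_parameter_from_mass(0.05)   # ~0.123, a typical-looking value
kappa_string_tension = 0.05                         # e.g. sigma*a^2 in lattice units (made up)

# Both are small dimensionless numbers; swapping them raises no error, only a
# systematically wrong simulation.
print(kappa_hopping, kappa_string_tension)
```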
Failure Mode 4: “Inability to Debug Silent Failures”
At least in this study, LLMs tend to assume that things are going well if there are no explicit runtime exceptions. They don’t check to make sure the answers are intuitively correct, and they don’t check their simulations for special cases. Both of these skills are required in physics.
In the cases where the LLMs do check and find that they have messed up, they do not go back to their code and try to fix the underlying errors. Instead, they make over-simplifying assumptions (as in the Schrödinger equation example), or they just make up equations that give superficially plausible results.
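A toy example of a silent failure and the sanity check that would catch it (again my own illustration, not taken from the paper): a forward-Euler integration of a harmonic oscillator never throws an exception, but the energy that physics says must stay constant drifts by orders of magnitude.

```python
# Forward Euler for x'' = -x: runs "successfully", conserves nothing.
def integrate_euler(x0=1.0, v0=0.0, dt=0.05, steps=2000):
    x, v = x0, v0
    for _ in range(steps):
        x, v = x + dt * v, v - dt * x
    return x, v

x, v = integrate_euler()
energy_start = 0.5 * (1.0**2 + 0.0**2)
energy_end = 0.5 * (x**2 + v**2)

# The check an expert (or a careful agent) would run. Without it, nothing signals
# that the output is untrustworthy: there was never an exception to debug.
drift = abs(energy_end - energy_start)
print(f"energy drift: {drift:.1f}")   # ~73 here -- a huge red flag, silently produced
```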
Failure Mode 5: “Execution and Resource Constraints”
This failure mode is less of a big deal than the others, and might actually reflect favorably on the LLMs’ capabilities. The researchers found that the LLMs were sometimes able to write correct numerical simulations, but the runs were too slow to finish within the evaluation environment’s limits.
Why Do These Failures Happen?
Here, LLMs and agents look incompetent compared to their results in “hacking into literally every piece of software” (Mythos) and “solving open math problems” (GPT-5.4 solved an Erdős problem).
If I had to name my top reasons:
GPT-5.3 isn’t SOTA and the researchers had shitty harness engineering. I think a multi-agent setup[1] with more targeted prompting could yield better results. I looked at their example task, and their prompt was pretty bare-bones. It seemed like it would induce agents to cut corners instead of thinking deeply. The researchers also made no mention of skills, which seem useful in this case.
The nature of the task isn’t very interactive. If you don’t force the agents to carefully mull over their results, they will probably just call it a day once the code executes without errors. For hacking, on the other hand, Mythos can get explicit confirmation that it’s making progress through the evaluation harness. Even in math research, a lot of agents are using Lean to formalize their proofs. If there’s a logical error, the Lean code won’t compile.
Physical systems are harder to translate into code. I identify three sub-reasons:
Lack of training data. Most software applications have been written a hundred different ways on the internet. Obscure physical models show up only a few times in hard-to-access papers, and they’re not written in code to begin with.
System-wide attention to detail is required. For the Erdős problem, GPT-5.4 started off in a way that’s different from most approaches, then kept going in the most logical direction until it solved the problem. The different approach could have been triggered by a random association between the problem statement and other material in its training data. But I doubt that GPT-5.4 had seen all the way to the end of the problem when it started. For physical models, however, you have to fit the whole thing into your brain at once.
There are a lot of hidden assumptions in physical models. In undergrad physics classes, you take a host of assumptions to ensure that problems are doable, e.g. the string is massless, self-inductance is negligible, and so on. These assumptions change as you go to different areas of study. LLMs usually made overly simplified assumptions when evaluated on PRBench (one elementary example of such an assumption is sketched below).
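As a concrete (and deliberately elementary) example of a hidden assumption, here is my own sketch, not something from PRBench: the small-angle pendulum. The textbook period formula quietly assumes sin θ ≈ θ, and at large amplitudes that assumption silently breaks even though both pieces of code run happily.

```python
# Small-angle formula vs. the full pendulum theta'' = -(g/L)*sin(theta).
import numpy as np

g, L = 9.81, 1.0
period_small_angle = 2 * np.pi * np.sqrt(L / g)   # ~2.01 s, amplitude-independent

def period_full(theta0, dt=1e-4):
    """Crude numerical period: integrate a quarter swing and multiply by four."""
    theta, omega, t = theta0, 0.0, 0.0
    while theta > 0.0:
        omega -= dt * (g / L) * np.sin(theta)     # semi-implicit Euler step
        theta += dt * omega
        t += dt
    return 4 * t

print(period_small_angle)     # ~2.01 s
print(period_full(0.1))       # ~2.01 s: the assumption is fine for small swings
print(period_full(2.5))       # ~3.3 s: the "hidden" assumption has broken down
```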
I would like to see a re-evaluation of PRBench using Opus 4.7/GPT-5.5, with multi-agent harnesses and skills.
How Should We Update on AI Self-Improvement Capabilities?
The automated AI researcher is a centerpiece of most takeoff scenarios. So far, the most convincing evidence I’ve seen for the automated researcher is:
AlphaEvolve from Google improving datacenter efficiency (yes, I know, that was a long time ago).
Anthropic employees reporting greatly increased productivity, since they can offload grunt work to Claude Code.
Notably, an AI has never developed algorithmic improvements by itself (as far as we know). So is the automated researcher still far in the future? And what does this paper tell us about that question?
Well, the PRBench evaluation shows that LLMs can’t easily apply advanced physics research techniques. Whether they can do so at all will remain unknown until someone evaluates PRBench with SOTA agents.
On the other hand, AI R&D isn’t the same as physics. A lot of the less technical advancements in AI (prompt engineering, harnesses, skills) come from repeated low-level experimentation. Automating these research tasks will probably come first. However, Anthropic and OpenAI are definitely gatekeeping their “secret sauce” breakthroughs. And you might need more physics-style thinking to automate the discovery of such breakthroughs. After all, statistical mechanics is basically the progenitor of modern AI.
Personally, I would push back my timelines for automated AI researchers. PRBench shows that LLMs struggle to adapt to difficult tasks outside of their training data.
[1] The researchers technically used two agents (one for orchestration and one for coding), but these agents didn’t check each other’s work.