TL;DR
We built 6 AI agents using Gemini 2.5 Pro and Claude Code, mapped to stages of the scientific workflow from idea and hypothesis generation through experiment execution, evaluation, and paper writing.
We tested our agents on 4 research ideas across ML sub-domains such as Multi-Agent RL, World Models, and AI Safety. 3 ideas failed during implementation or evaluation. Only 1 succeeded and was published at Agents4Science 2025.
We document 6 recurring failure modes: bias toward training data, implementation drift under pressure, memory/context degradation, overexcitement that declares success despite obvious failures, lack of domain intelligence, and lack of scientific taste.
We also derive 4 design principles for more robust AI scientist systems, discuss the limitations of training and evaluation data for future autonomous science, and release all prompts, artifacts, and outputs at github.com/Lossfunk/ai-scientist-artefacts-v1.
Problem Definition and System Overview
We wanted to see how far current LLMs could go without significant scaffolding or human hand-holding. The goal: take a research idea from conception to publication with maximum autonomy.
Our system comprised six specialized agents (all using Gemini 2.5 Pro for its long context length) mapped to stages of the scientific workflow: Idea Generation, Hypotheses Generation, Experiment Planning, Output Evaluation, Revision, and Paper Outlining. Claude Code handled all implementation and paper writing.
Figure 1 from the report: Autonomous Research Pipeline showing the six agent modules and shared file system artifacts
Each agent received the repository state as part of its prompt context, along with tools for reading and writing files. This kept context engineering minimal. Agents decided which files to consult, just as a researcher would navigate their own project folder.
Figure 2 from the report: Agent prompt template structure
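For concreteness, here is a minimal sketch of that shared-file-system pattern, under our own naming (repo_state, run_stage, and call_llm are illustrative, not the pipeline's actual code). In the real system the model also issues its own file read and write tool calls rather than only receiving a listing up front.

```python
# Minimal sketch of the shared-file-system agent pattern described above.
# Names are illustrative, not the pipeline's actual code.
from pathlib import Path
from typing import Callable

def repo_state(root: Path) -> str:
    """List repository files so the agent can decide which ones to consult."""
    return "\n".join(str(p.relative_to(root)) for p in sorted(root.rglob("*")) if p.is_file())

def run_stage(root: Path, stage_prompt: str, output_file: str,
              call_llm: Callable[[str], str]) -> Path:
    """One pipeline stage: show the repo state, ask the model for an artifact, write it back."""
    prompt = (
        f"{stage_prompt}\n\n"
        f"Repository files:\n{repo_state(root)}\n\n"
        "Read any files you need, then produce the artifact for this stage."
    )
    artifact = call_llm(prompt)   # e.g. a Gemini 2.5 Pro client wrapped as str -> str
    out = root / output_file
    out.write_text(artifact)      # downstream agents pick this artifact up from the same repo
    return out
```

Each downstream agent then reads whatever earlier artifacts it needs from the repository, which is what kept context engineering minimal.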
To select research ideas, we started with a corpus of 135+ papers from top-tier venues across three ML subdomains: World Models, Multi-Agent RL, and AI Safety. After running four zero-shot LLM reviewers and consulting authors of the seed papers for feasibility input, we narrowed to four candidates for full pipeline execution.
Figure 3 from the report: Selection funnel from 135+ papers to 4 idea candidates generated through paper mashing
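For the reviewing step, a hedged sketch of what one zero-shot LLM reviewer pass might look like is below; the rubric, scale, and JSON schema are ours for illustration rather than the prompts we actually ran (those are in the released artifacts).

```python
# Hypothetical sketch of a zero-shot LLM reviewer pass over candidate ideas.
# The rubric, scale, and JSON schema are illustrative only.
import json
from typing import Callable

REVIEW_PROMPT = """You are reviewing a proposed ML research idea for a top-tier venue.
Score it 1-10 on novelty, feasibility, and expected impact, and reply with JSON only:
{{"novelty": int, "feasibility": int, "impact": int, "rationale": str}}

Idea:
{idea}"""

def score_idea(idea: str, reviewers: list[Callable[[str], str]]) -> float:
    """Average the mean rubric score across several independent zero-shot reviewer models."""
    per_reviewer = []
    for call_reviewer in reviewers:
        reply = json.loads(call_reviewer(REVIEW_PROMPT.format(idea=idea)))
        per_reviewer.append((reply["novelty"] + reply["feasibility"] + reply["impact"]) / 3)
    return sum(per_reviewer) / len(per_reviewer)
```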
Of these four, three failed during implementation or evaluation. Only one, from the AI Safety domain, completed the pipeline.
Our Agents4Science 2025 Submission
Of our four candidates, only the AI Safety idea completed the pipeline, and not by accident. The other three required training complex model architectures or intricate multi-agent coordination. This one focused on data analysis: sampling model responses and computing entropy metrics. No training loops, no gradient propagation. The simpler implementation meant that when issues arose, they were recoverable rather than fatal.
The idea was to use semantic entropy (a method effective for hallucination detection) as a black-box signal for jailbreak attempts. The intuition: jailbreak prompts create internal conflict, manifesting as inconsistent responses. Initial experiments showed it failing. Rather than abandoning the idea, the system pivoted from "test if SE works" to "investigate why SE fails." This pivot led to our core finding: the Consistency Confound. Well-aligned models produce consistent, templated refusals, exactly what semantic entropy interprets as "safe" behavior. Stronger alignment makes detection worse.
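As a rough illustration of the signal involved (the standard semantic-entropy recipe, not our exact implementation), the sketch below samples several responses to a prompt, clusters responses that mean the same thing, and takes the entropy of the cluster distribution. Here sample_responses and same_meaning are placeholders for a model client and a semantic-equivalence (e.g. entailment) check.

```python
# Hedged sketch of the semantic-entropy signal, not the paper's exact implementation.
# `sample_responses` and `same_meaning` are placeholders supplied by the caller.
import math
from typing import Callable

def semantic_entropy(prompt: str,
                     sample_responses: Callable[[str], str],
                     same_meaning: Callable[[str, str], bool],
                     n: int = 10) -> float:
    """Sample n responses, cluster semantically equivalent ones, and return the
    entropy of the cluster distribution. Low entropy means highly consistent answers."""
    responses = [sample_responses(prompt) for _ in range(n)]
    clusters: list[list[str]] = []
    for r in responses:
        for cluster in clusters:
            if same_meaning(r, cluster[0]):
                cluster.append(r)
                break
        else:
            clusters.append([r])
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# The Consistency Confound in one line: a well-aligned model that answers every jailbreak
# with the same templated refusal collapses into a single cluster, so entropy is ~0 and
# the attempt looks "safe" to this detector.
```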
The paper was accepted to Agents4Science 2025. The conference accepted 48/254 valid submissions, and our paper was a borderline accept, passing correctness checks and a code audit. Reviewers, both AI and human, recognized that well-executed negative results are a contribution. The human reviewer noted that while the contribution is primarily a negative result, it identifies "a clear and reproducible failure mode."
Table 2 from the report: Agents4Science 2025 review scores from AI and human reviewers
That said, as part of the Agents4Science submission, we had to complete an AI Involvement Checklist detailing human contributions at each stage, and by that accounting the work was only about 95% autonomous. We still intervened to select ideas, meta-prompt during execution, and temper overoptimistic claims during paper writing.
Observed Failure Modes and Mitigation
Through our experiments, six failure patterns emerged consistently across attempts. These reveal systematic limitations in current LLMs for autonomous research.
1. Bias Toward Training Data
Models defaulted to outdated libraries and approaches from their training data, overriding explicit instructions. Claude Code repeatedly used deprecated Modal commands and insisted on unmaintained packages like hanabi-learning-env==0.5.2, ignoring instructions to use modern alternatives. Even after errors, the system would diagnose the problem as a library issue and regress to training data versions, insisting that was the right approach.
2. Implementation Drift
When facing technical barriers, systems progressively simplified implementations rather than solving root causes. Our differentiable tree search planner devolved into a basic actor-critic approach when training loops timed out. A single error would trigger progressive simplification rather than debugging. In WM-2, one of our World Models candidates, one mistake in implementing the Dreamer baseline cascaded into abandoning the core research contribution entirely.
3. Memory and Context Issues
Over long-duration tasks, agents lost track of previous decisions, hyperparameters, and experimental configurations. Baseline implementations used entirely different hyperparameters than those specified in plans. During paper writing, the agent forgot to consult early context files entirely, producing a draft that read like a list of experiments with no origin story or motivation. To mitigate this, we introduced session logging prompts (shown below) that required Claude Code to document decisions and artifacts at the end of each session, one of several memory-like abstractions we had to build.
Figure 7 from the report: Session logging prompt template for maintaining context across sessions
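Concretely, the information those end-of-session logs capture looks roughly like the schema below; the field names are ours, not the exact template from the report.

```python
# Illustrative schema for end-of-session logs; field names are ours, not the report's template.
from dataclasses import dataclass, field, asdict
import json, time

@dataclass
class SessionLog:
    session_id: str
    decisions: list = field(default_factory=list)        # e.g. "switched baseline lr to 3e-4 because ..."
    artifacts: list = field(default_factory=list)         # files created or modified this session
    hyperparameters: dict = field(default_factory=dict)   # single source of truth across sessions
    open_questions: list = field(default_factory=list)    # what the next session must not forget

def append_session_log(log: SessionLog, path: str = "session_logs.jsonl") -> None:
    """Append one entry per session so later agents can reconstruct earlier decisions."""
    with open(path, "a") as f:
        f.write(json.dumps({"timestamp": time.time(), **asdict(log)}) + "\n")
```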
4. Overexcitement and Eureka Instinct
Models reported success despite clear experimental failures. Degenerate outputs (MAE=0, dummy reward signals) were described as "successful hypothesis validation." Paper drafts made inflated claims like "first ever comprehensive assessment" even when results were statistically invalid. This likely stems from RLHF training, where models are rewarded for being agreeable and helpful, not for scientific skepticism or detecting confirmation bias.
Figure 6 from the report: Overexcitement and Eureka Instinct during execution (top) and paper writing (bottom) phases
5 & 6. Lack of Domain Intelligence and Scientific Taste
Agents struggled with the tacit knowledge that experienced researchers take for granted. They failed to recognize that Dreamer requires online learning (not offline frames) or that a 50,000-depth parameter was computationally absurd for a 6-hour GPU limit. In one case, the system proceeded with hypothesis testing when baseline performance was 95% below established benchmarks, making any comparative analysis scientifically meaningless.
Beyond these struggles with operationalizing research, models missed fundamental flaws in experimental design. Hypotheses were too simplistic to draw conclusions from, statistical validity was ignored (single-seed experiments), and the system misinterpreted a seed paper's future work section as endorsement of an approach the authors never intended.
Design Takeaways for AI Scientist Systems
From these failures, we derive four design principles for building more robust AI scientist systems:
1. Start Abstract, Ground Later
Introduce technical details gradually through the workflow. Early specificity anchors models to outdated training data patterns. Keep ideation high-level and save implementation details for execution.
2. Verify Everything
Implement verification at every pipeline stage. Ground evaluations in raw data and logs, not LLM interpretations. The Goodfire team calls the alternative "p-hacking and eureka-ing". We saw plenty of it.
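A sketch of what this looks like mechanically: read the raw metrics file rather than the agent's summary of it, and refuse to declare success on degenerate results. The thresholds and file layout here are illustrative, not the checks we actually ran.

```python
# Illustrative success gate grounded in raw logs; thresholds and metric names are assumptions.
import json
from pathlib import Path

def verify_run(metrics_path: str, published_baseline: float) -> list[str]:
    """Return a list of blocking issues; an empty list is a precondition for any success claim."""
    metrics = json.loads(Path(metrics_path).read_text())   # raw numbers, not an LLM paraphrase
    issues = []
    if metrics.get("mae") == 0.0:
        issues.append("MAE is exactly 0: almost certainly a degenerate output or leaked target")
    if metrics.get("n_seeds", 1) < 3:
        issues.append("fewer than 3 seeds: results have no statistical validity")
    if metrics.get("baseline_score", 0.0) < 0.5 * published_baseline:
        issues.append("baseline far below the published benchmark: comparisons are meaningless")
    return issues
```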
3. Plan for Failure and Recovery
Design multi-turn agentic workflows, not zero-shot generation. Separate code generation from execution. Include checkpointing and explicit failure mode controls. Scientific discovery is long-duration; errors will accumulate.
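One concrete piece of that, sketched under our own naming: stage-level checkpointing, so a crashed run resumes from its last completed stage instead of silently regenerating (and changing) earlier work.

```python
# Minimal sketch of stage-level checkpointing; stage names and paths are illustrative.
from pathlib import Path
from typing import Callable

def run_with_checkpoints(stages: dict[str, Callable[[], None]],
                         ckpt_dir: str = "checkpoints") -> None:
    """Run pipeline stages in order, skipping any stage that already left a completion marker."""
    Path(ckpt_dir).mkdir(exist_ok=True)
    for name, run_stage in stages.items():
        marker = Path(ckpt_dir) / f"{name}.done"
        if marker.exists():
            continue              # resume after a crash without redoing finished stages
        run_stage()               # code generation and code execution are kept as separate stages
        marker.write_text("ok")   # only mark done if the stage did not raise
```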
4. Log Everything
Maintain comprehensive session logs and metrics across runs. This supports both autonomous execution and human review, and becomes essential when debugging why an agent made a decision three sessions ago.
Limitations and Discussion
Our work has obvious limitations: only four ideas, three ML subdomains, no systematic ablations, and failure modes identified through observation rather than quantitative measurement. We see this as a starting point.
The broader picture is becoming clearer. OpenAI's "AI for Science" initiative is hiring "world-class academics who are completely AI-pilled" to work alongside models, not to replace them. In his work on accelerating scientific discovery, physicist Brian Keith Spears reports 1000x acceleration in workflows through human-AI collaboration, while Fields Medalist Timothy Gowers notes that "We have not yet reached the stage where an LLM is likely to have the main idea for solving a difficult problem." Our bet is that we're going to see many more agents and platforms for AI-assisted science in 2026 and beyond.
But before we get to truly autonomous discovery, three problems need solving: long-horizon coherence, research taste that can distinguish meaningful from trivial contributions, and the missing data for training and evaluating scientific reasoning, including records of failed attempts and the "negative space" of why obvious approaches don't work. For now, human-AI collaboration generates the workflow data to train the next generation of research agents. We discuss this further in the report.
This is a crosspost from our report website for Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts. This report details the work behind our LLM-written paper "The Consistency Confound: Why Stronger Alignment Can Break Black-Box Jailbreak Detection" accepted at Agents4Science 2025, the first scientific conference requiring AI as primary author, where it passed both AI and human review.