Dhruv Trehan — LessWrong


Replying to Why LLMs Aren't Scientists Yet.

Hey @frmoretto - the idea generation agent did not parse 135 full papers. It first evaluated pairs of papers (based on just the title and abstract) and then selected up to 15 top pairs for the actual idea generation. That came to about 30 papers (accounting for papers repeated across pairs). For these 30, we used pymupdf4llm. We manually inspected 4-5 of them and the extraction appeared OK (we were somewhat hoping that the surrounding text carried enough signal). You are correct, though: extracting papers from PDFs is a challenge, and in our more recent work we have found Dolphin-v2 to work well. We have not yet tried using the LaTeX source or ar5iv HTML. That said, we also kept much of the idea in natural language, and for experiment execution (where the math was more critical) we used Claude Code's built-in file reading features.
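For readers who want the shape of that two-stage filter, here is a minimal sketch, not the actual agent code. It assumes every pair of candidate papers gets scored (the comment only says pairs were evaluated), and `Paper`, `select_top_pairs`, `unique_papers`, and the `score_pair` callback are all hypothetical names. In the real pipeline the score would come from an LLM judging titles and abstracts.

```python
from itertools import combinations
from typing import Callable, NamedTuple


class Paper(NamedTuple):
    paper_id: str
    title: str
    abstract: str


def select_top_pairs(
    papers: list[Paper],
    score_pair: Callable[[Paper, Paper], float],  # hypothetical scorer (title + abstract only)
    max_pairs: int = 15,
) -> list[tuple[Paper, Paper]]:
    """Score candidate pairs cheaply and keep at most `max_pairs` of the best."""
    # Assumption: all unordered pairs are candidates.
    scored = [(score_pair(a, b), a, b) for a, b in combinations(papers, 2)]
    # Sort on the score only, highest first; ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(a, b) for _, a, b in scored[:max_pairs]]


def unique_papers(pairs: list[tuple[Paper, Paper]]) -> list[Paper]:
    """Collect the distinct papers across the selected pairs.

    Only these need full-text extraction, which is why up to 15 pairs
    can come out to roughly 30 papers or fewer.
    """
    seen: dict[str, Paper] = {}
    for a, b in pairs:
        seen.setdefault(a.paper_id, a)
        seen.setdefault(b.paper_id, b)
    return list(seen.values())
```

The point of the split is cost: the expensive and error-prone step (PDF extraction) only runs on the papers that survive the cheap title-and-abstract pass.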

Why LLMs Aren't Scientists Yet.

Dhruv Trehan

1mo

This is a crosspost from our report website for Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts. This report details the work behind our LLM-written paper "The Consistency Confound: Why Stronger Alignment Can Break Black-Box Jailbreak Detection" accepted at Agents4Science 2025, the first scientific conference requiring AI as primary author, where it passed both AI and human review.

TL;DR

We built 6 AI agents using Gemini 2.5 Pro and Claude Code, mapped to stages of the scientific workflow from idea to hypothesis generation, experiment execution, evaluation and paper writing.
We tested our agents on 4 research ideas across ML sub-domains such as Multi-Agent RL, World Models, and AI Safety. 3 ideas failed

... (read 1424 more words →)

Saying “for AI safety research” made models refuse more on a harmless task

Dhruv Trehan

5mo

TL;DR - Adding “for AI safety research” increased refusals on a harmless paraphrasing task for some top models. Conservative, aligned public models might be less useful for AI safety research by default.

I ran a quick, contained experiment while building a benign, context-matched split of HarmBench (proposed by Mazeika et al., ICML 2024 Poster): take a harmful query plus its real context and rewrite it into a safe paraphrase with the same length and technical level.

Using the same 100 “contextual” items from the dataset over 3 runs with fixed generation hyperparameters, I compared refusal behavior across Claude 3.5 Sonnet, Claude 3.7 Sonnet (Thinking), Gemini 2.5 Pro, and Gemini 2.5 Flash.
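A comparison like this reduces to a refusal rate per model and prompt condition, averaged over runs. The sketch below shows one way to aggregate that, assuming each response has already been labeled as refused or not. The record layout, the condition names, and `refusal_rates` are illustrative assumptions, not the code used for the experiment.

```python
from collections import defaultdict
from statistics import mean


def refusal_rates(records):
    """Average per-run refusal rates for each (model, condition) cell.

    `records` is an iterable of (model, condition, run, refused) tuples,
    where `refused` is a bool label assigned upstream (assumed given).
    """
    # First pass: group refusal flags by individual run.
    per_run = defaultdict(list)
    for model, condition, run, refused in records:
        per_run[(model, condition, run)].append(1.0 if refused else 0.0)

    # Second pass: compute each run's rate, then average the runs per cell.
    per_cell = defaultdict(list)
    for (model, condition, _run), flags in per_run.items():
        per_cell[(model, condition)].append(mean(flags))
    return {cell: mean(rates) for cell, rates in per_cell.items()}


# Toy usage with made-up labels (not real results):
toy = [
    ("model-a", "baseline", 1, False),
    ("model-a", "baseline", 1, False),
    ("model-a", "safety_framing", 1, True),
    ("model-a", "safety_framing", 1, False),
]
rates = refusal_rates(toy)
```

Averaging within each run first keeps one noisy run from dominating when runs have different numbers of valid responses.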

The result was odd.... (read 571 more words →)