Dhruv Trehan — LessWrong

Why LLMs Aren't Scientists Yet.

This is a crosspost from our report website for "Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts." This report details the work behind our LLM-written paper, "The Consistency Confound: Why Stronger Alignment Can Break Black-Box Jailbreak Detection," accepted at Agents4Science 2025, the first scientific conference requiring AI...

Saying “for AI safety research” made models refuse more on a harmless task

TL;DR - Adding “for AI safety research” increased refusals on a harmless paraphrasing task for some top models. Conservative, aligned public models might be less useful for AI safety research by default. I ran a quick, contained experiment while building a benign, context-matched split of HarmBench (proposed by Mazeika et...

Sep 8, 2025•7