David Rein

METR Research Update: Algorithmic vs. Holistic Evaluation

TL;DR
* On 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that cannot be easily used as-is, because of issues with test coverage, formatting/linting, or general code quality.
* This suggests that automatic scoring used by many benchmarks may overestimate AI agent...

Aug 13, 2025 · 101

NYU Code Debates Update/Postmortem

TL;DR We designed an ambitious scalable oversight experimental setup, where we had people with no coding/programming experience try to answer coding questions (“Which of these two outputs is the correct output of the given function on this input?”), using LLMs that debate or are arguing for the correct/incorrect answer 50%...

May 24, 2024 · 27

phone.spinning's Shortform

Sep 6, 2022 · 1
