frmsaul

You're gonna need a bigger boat (benchmark), METR

[EDIT: LawrenceC, who works at METR, responds to this.] In this post, we’ll discuss three major problems with the METR eval and propose some solutions. Problem 1: The METR eval produces results with egregious confidence intervals, and the METR chart misleadingly hides this. Problem 2: There's a lack of sample...

Apr 1320

Is ProgramBench Impossible?

Cyborg evals

We Need to Get Serious about Uplift Studies

You're gonna need a bigger boat (benchmark), METR

Is Progress Inevitable?

frmsaul's Shortform
