LESSWRONG
Lucas Sato
Comments

METR's Evaluation of GPT-5
Lucas Sato, 2mo

This was gpt-5-thinking. Agree this would be good to specify. FWIW section 5.1.3.7 of the GPT-5 System Card (which is about METR's evaluation) is clearer on this point.

METR's Evaluation of GPT-5
Lucas Sato, 2mo

It does not! We redacted the following:

  • Additional details and specific language about the "assurance checklist"
  • Some longer reasoning trace extracts
  • Some additional context around the interim report shared with OpenAI on August 1st

To reiterate, as mentioned in "Note on independence" and its footnote, none of the redactions had any effect on our conclusions or tone, and we think it's reasonable for this review to occur for an output like this (public and not intended as a robust formal oversight artifact).

METR’s preliminary evaluation of o3 and o4-mini
Lucas Sato, 4mo

(source: I work at METR)

Thanks for the questions!

  • Updates to HCAST are just generally newer tasks, clarifications, bug fixes, etc., with no specific change in direction or focus. Any given plot only includes data from a single task set version, so yes, Claude 3.7 Sonnet was retested on the updated HCAST and that's the number shown on the headline bar chart. In contrast, the "trendline plot" with the o3 and o4-mini additions (posted to Twitter) shows only results on the original HCAST from the original trendline paper. That includes o3 and o4-mini: we also ran them on the older task set version so that we could put them on the trendline.
  • The 2M token limit was originally chosen as "high enough to not be a bottleneck". See "Performance of current agents seems to plateau quite early" here. So I mostly do not expect that increasing the token budget would meaningfully improve performance. But that choice is possibly outdated by now. A team member has been working on re-evaluating it, and I think we may have an update on this soon.