kndjckt — LessWrong

Replying to METR’s preliminary evaluation of o3 and o4-mini

kndjckt 9mo

Thanks Lucas!

Replying to METR’s preliminary evaluation of o3 and o4-mini

kndjckt 9mo

Thanks for re-running the analysis!

I agree that the aggregate RE-Bench results should be interpreted with caution, given the low sample size. Let's focus on HCAST instead.

A few questions:

Could someone from the METR team clarify the updates to the HCAST task set? The exec summary states: "While these measurements are not directly comparable with the measurements published in our previous work due to updates to the task set". Was Claude 3.7 Sonnet retested on the updated HCAST task set?
On HCAST, o3 and o4-mini get a 16M token limit vs. 2M for Claude 3.7 Sonnet (if my reading of the paper is correct). Do we know how Claude would do with a higher token budget? Maybe this isn't relevant if it never gets close to the budget and submits answers well before hitting the limit. I want to make sure the improvements aren't just due to shifting token budgets.

Replying to AI 2027: What Superintelligence Looks Like

kndjckt 10mo

Have you heard of https://openai.com/index/paperbench/ and https://github.com/METR/RE-Bench ? They seem like they have some genuine multi-hour agentic coding tasks, I'm curious if you agree.

I'll take a look. Thanks for sharing.

Replying to AI 2027: What Superintelligence Looks Like

kndjckt 10mo

Sweet! Thanks for taking my points into consideration! :)

Replying to AI 2027: What Superintelligence Looks Like

kndjckt 10mo

Thanks! Appreciate you digging that up :). Happy to conclude that my second point is likely moot.

Replying to AI 2027: What Superintelligence Looks Like

kndjckt 10mo

Interesting. I get where you're coming from for blank-slate projects or front-end work. But programming is rarely a blank slate like this. You have to work with existing codebases or esoteric libraries. Even with the context loaded (as well as I can), Cursor with Composer and Claude Code (the CLI tool) with Claude 3.7 Sonnet have failed pretty miserably for me on simple work-related tasks. As things stand, I always regret using them and wish I had written the code myself. Maybe this is a context problem that will be solved when the models grow to use proper attention across the whole context window rather than relying on shortcuts or hacks due to memory limitations (this should get better as the newer NVIDIA chips land).

I would love to see an LLM fix a meaningful issue in an open source library.

Replying to AI 2027: What Superintelligence Looks Like

kndjckt 10mo*

Wonderful post! I appreciate the effort to create plenty of quantitative hooks for the reader to grab onto and wrestle with.

I'm struggling with the initial claim that we have agentic programmers that are as good as human pros. The piece suggests that we already have Human Pro level agents right now!?

Maybe I'm missing something that exists internally at the AI labs? But we don't have access to such programmers. So far, all "agentic" programmers are effectively useless at real work. They struggle with real codebases, and novel or newer APIs and languages. Most programmers I know use LLMs as a sophisticated autocomplete and as a "search" tool. These offer nice speed-ups, greasing...