kndjckt
kndjckt has not written any posts yet.

Thanks for re-running the analysis!
I agree that RE-bench aggregate results should be interpreted with caution, given the low sample size. Let's focus on HCAST instead.
A few questions:
Have you heard of https://openai.com/index/paperbench/ and https://github.com/METR/RE-Bench? They seem to have some genuine multi-hour agentic coding tasks; I'm curious whether you agree.
I'll take a look. Thanks for sharing.
Sweet! Thanks for taking my points into consideration! :)
Thanks! Appreciate you digging that up :). Happy to conclude that my second point is likely moot.
Interesting. I get where you're coming from for blank-slate projects or front-end work. But programming is rarely a blank slate like this. You have to work with existing codebases or esoteric libraries. Even with the context loaded (as well as I can), Cursor with Composer and Claude Sonnet 3.7 code (the CLI tool) have failed pretty miserably for me on simple work-related tasks. As things stand, I always regret using them and wish I had written the code myself. Maybe this is a context problem that will be solved when the models grow to use proper attention across the whole context window rather than using shortcuts or hacks due to memory limitations (this will get better as the newer NVIDIA chips land).
I would love to see an LLM fix a meaningful issue in an open source library.
Wonderful post! I appreciate the effort to create plenty of quantitative hooks for the reader to grab onto and wrestle with.
I'm struggling with the initial claim that we have agentic programmers that are as good as human pros. The piece suggests that we already have Human Pro level agents right now!?
Maybe I'm missing something that exists internally at the AI labs? But we don't have access to such programmers. So far, all "agentic" programmers are effectively useless at real work. They struggle with real codebases and with novel or newer APIs and languages. Most programmers I know use LLMs as a sophisticated autocomplete and as a "search" tool. These offer nice speed-ups, greasing...
Thanks Lucas!