Peter Wildeford: I definitely think the biggest takeaway from this paper is that we likely can’t trust self-reports. This is pretty surprising to me, but is a finding commonly seen in productivity literature.
Not just in productivity; this is a finding from so many fields of research that I struggle to find any situation where self-report is trustworthy. Yes, sometimes it is the best we can do, but that doesn't remove the weaknesses. With that said, I really appreciate the dual research model showing the self-reported versus actual situation in this case, and I hope more projects use it going forward.
While I think it is plausible the results would have been different if the devs had had e.g. 100 hours more experience with Cursor, it is worth also noting that:
- 14/16 of the devs rated themselves as 'average' or above as Cursor users at the end of the study
- The METR staff working on the project thought the devs were qualitatively reasonable Cursor users (based on screen recordings etc.)
So I think it is unlikely the devs were using Cursor in an unusually unskilled way.
The forecasters were told that only 25% of the devs had prior Cursor experience (the actual number ended up being 44%), and still predicted substantial speedup, so if there is a steep Cursor learning curve here, that seems like a fact people didn't expect.
With all that being said, the skill ceiling for using AI tools is clearly at least *not being slowed down* (as they could simply not use the AI tools), so it would be reasonable to expect that eventually some level of experience would lead to that result.
(I consulted with METR on the stats in the paper, so am quite familiar with it).
I feel like people are dismissing this study out of hand without updating appropriately. If there's at least a chance that this result replicates, that should shift our opinions somewhat.
First, a few reasons why the common counterarguments aren't strong enough to dismiss the study:
I've been seeing arguments against this result based on vibes, or claims that the next generation of LLMs will overturn this result. But that is directly contradicted by the results of this study: people's feelings are poor indicators of actual productivity.
On Cursor experience, I think Joel Becker had a reasonable response here. Essentially, many of the coders had tried Cursor, had some experience with it, and had a lot of experience using LLMs for programming. Is the learning curve really so steep that we shouldn't see them improve over the many tasks? See image below. Perhaps the fact that these programmers don't use it and saw little improvement is a sign that Cursor isn't very helpful.
While this is a challenging environment for LLM coding tools, this is the sort of environment I want to see improvement in for AI to have a transformative impact on coding. Accelerating experienced devs is where a lot of the value of automating coding will come from.
That aside, how should we change our opinions with regard to the study?
Getting AI to be useful in a particular domain is tricky: you have to actually run tests and establish good practices.
Anecdotes about needing discipline to stay on task with coding tools, and about the Cursor learning curve, suggest that AI adoption has frictions and requires tacit knowledge.
Coding is one of the cleanest, most data-rich, most LLM-developer-supported domains. As of yet, AI automation is not a slam dunk, even here. Every other domain will require its own iteration, testing, and practice to see a benefit.
If this holds, the points above slow AI diffusion, particularly when AI is used as a tool for humans. Modelling the impact of current and near-future AIs should take this into account.
I am updating more towards the possibility that LLM programming is not a speedup for experienced programmers.
I do think, personally, using Cursor and other such tools has stagnated my dev skill growth a lot, but it also seems to have allowed me to do a lot more stuff.
Might try a week without LLM-assisted coding starting tomorrow, see how it goes.
METR ran a proper RCT seeing how much access to Cursor (using Sonnet 3.5/3.7) would accelerate coders working on their own open source repos.
Everyone surveyed expected a substantial speedup. The developers thought they were being substantially sped up.
Instead, it turned out that using Cursor slowed them down.
That surprised everyone, raising the question of why.
Currently our best guess is this comes down to a combination of two factors:
Deeply understood open source repos are close to a worst-case scenario for AI tools, because they require bespoke outputs in various ways and the coder has lots of detailed local knowledge of the codebase that the AI lacks.
The coders in question mostly did not have experience with similar AI tools. The lack of a learning curve during the experiment challenges this explanation, but the tools very clearly have a steep learning curve, the same way other programming skills do.
Thus we should be careful interpreting the result. It was still highly virtuous to run an RCT, and to publish the results even when they were against interest and counterintuitive, and at risk of being quoted endlessly in misleading fashion by AI skeptics. That is how real science works.
Again, due to all the circumstances, one should avoid inferring too much. I would like to see the study done again where everyone had at least a few weeks of working full time with such tools, ideally also while working on other types of projects. And a result this surprising means we should be on the lookout for flaws.
The result was still very surprising to METR, to the developers in the test, to the forecasters, and also to those who saw the results.
Yo Shavit: something something METR good bc publishing against their priors blah blah
all I care about is that this vindicates my incompetence in using models for my actual work
Dwarkesh Patel: Surely this doesn’t have implications for how I use AI and whether I’m fooling myself about how much more effective it’s making my podcast prep, right?
The Core Result
METR: We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.
The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn’t.
We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).
We randomly assigned each task to either allow AI (typically Cursor Pro w/ Claude 3.5/3.7) or disallow AI help.
At the beginning of the study, developers forecasted that they would get sped up by 24%. After actually doing the work, they estimated that they had been sped up by 20%. But it turned out that they were actually slowed down by 19%.
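To keep the directions of those percentages straight, here is a minimal sketch of the arithmetic, assuming the convention that an X% speedup means tasks take 1/(1+X) of the baseline time and an X% slowdown means they take (1+X) times as long (METR's actual estimator is more sophisticated than this):

```python
# Minimal sketch of the percentage conventions; an assumed convention, not METR's estimator.
# Assumes an X% speedup shrinks completion time to 1/(1+X) of baseline,
# and an X% slowdown grows it to (1+X) times baseline.

def ratio_from_speedup(speedup: float) -> float:
    """Time with AI divided by time without AI, given a fractional speedup."""
    return 1.0 / (1.0 + speedup)

def ratio_from_slowdown(slowdown: float) -> float:
    """Time with AI divided by time without AI, given a fractional slowdown."""
    return 1.0 + slowdown

forecast = ratio_from_speedup(0.24)      # ~0.81, the speedup devs predicted up front
self_report = ratio_from_speedup(0.20)   # ~0.83, the speedup devs believed they got
observed = ratio_from_slowdown(0.19)     # 1.19, what the measurements actually showed

print(f"forecast ratio:    {forecast:.2f}")
print(f"self-report ratio: {self_report:.2f}")
print(f"observed ratio:    {observed:.2f}")
print(f"perception gap:    {observed / self_report:.2f}x")  # ~1.4x under this convention
```

Under that reading, developers believed their AI-allowed tasks were taking roughly 83% as long as they otherwise would, when those tasks were actually taking roughly 19% longer.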
We were surprised by this, given a) impressive AI benchmark scores, b) widespread adoption of AI tooling for software development, and c) our own recent research measuring trends in the length of tasks that agents are able to complete.
When AI is allowed, developers spend less time actively coding and searching for information, and instead spend time prompting AI, waiting on/reviewing AI outputs, and idle. We find no single reason for the slowdown—it’s driven by a combination of factors.
To better understand these factors, we investigate 20 properties of our setting, finding 5 likely contributors, and 8 mixed/unclear factors.
We also analyze to make sure the result isn’t a fluke, and find that slowdown persists across different outcome measures, estimator methodologies, and many other subsets/analyses of our data.
…
What do we take away?
1. It seems likely that for some important settings, recent AI tooling has not increased productivity (and may in fact decrease it).
2. Self-reports of speedup are unreliable—to understand AI’s impact on productivity, we need experiments in the wild.
Another implication:
It is sometimes proposed that we should monitor AI R&D acceleration inside of frontier AI labs via simple employee surveys. We’re now more pessimistic about these, given how large of a gap we observe between developer-estimated and observed speed-up.
What we’re NOT saying:
1. Our setting represents all (or potentially even most) software engineering.
2. Future models won’t be better (or current models can’t be used more effectively).
David Rein: I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup.
Yes, it is. We got people to preregister their expectations, and even folks who are extremely in-the-know about AI coding abilities still failed to predict this result.
Your *vibes* are not reliable indicators of productivity effects.
Jeffrey Ladish: Surprising results from METR re AI software engineer uplift! Great to see this kind of empirical investigation. Our intuitions are not always correct…
I do think this is to some extent a skill issue. Pretty sure I know some people who’ve learned to use the tools effectively and get a big speed and quality boost.
Daniel Kokotajlo: Very important work! This also has lengthened my timelines somewhat, for obvious reasons. :)
In perhaps the most shocking fact of all, developers actually slightly overestimated their required time in the non-AI scenario. I thought that was never how any of this worked?
Okay So That Happened
So now that we have the result, in addition to updating in general, what explains why this situation went unusually poorly?
Here are the paper’s own theories first:
The big disagreement is over the first factor here, as to whether the development environment and associated AI tools should count as familiar in context.
There are several factors that made this situation unusually AI-unfriendly.
AI coding is at its best when it is helping you deal with the unfamiliar, compensate for lack of skill, and when it can be given free rein or you can see what it can do and adapt the task to the tool. Those didn’t apply here.
Roon: IME really good software ppl who deeply care find the least use from LLM coding and are often ideologically opposed to it because they like to exert editorial control over every line. slop research coders such as myself don’t care as much and have much larger gains.
this result is still surprising, how/why does it slow them down? but I wouldn’t think it generalizes to the average software developer who’s just trying to get some damn thing done and not trying to write maintainable useful code on a top open source library.
Eric Raymond: I think I qualify as an experienced open source developer, and that study looks completely ridiculous to me.
I’ve discussed it with some peers. We think one of the confounders may be that LLMs are much better at accelerating green-field development than fixing or improving large existing codebases.
Also there’s a difference in their performance between front-end and back-end stuff. Big advantage for web front-end dev, not so much for back-end. I’ve experienced this difference myself.
These were projects that the developers already knew intimately, with high context, and they did the task they would otherwise have done next. They were already familiar with the repos, were working at a very high skill level, and were trying to adapt the tool to the task and not the task to the tool.
In particular, they broke down tasks into 1-2 hour chunks before they knew whether they could use AI for the subtasks. That’s great RCT design, but does mean flexibility was limited.
These were large open source projects that thus have a variety of high standards and requirements, and require a lot of tacit knowledge and context. AI code that is ‘good enough’ in other contexts wasn’t up to standards here, and this was identified as the biggest factor: only 39% of Cursor generations were accepted, and many of those still required reworking.
Pay was by the hour, so there was a large temptation to let the AI cook and otherwise work not so efficiently. From Ruby we get the reminder that a natural thing to do when working in Cursor is to end up checking social media while it runs.
They certainly weren’t doing AI coder multitasking or anything like that.
As always, there is a lag, this was done with Sonnet 3.5/3.7. Ruby notes that the models we have now are already substantially better.
The tasks were modestly beyond the range of tasks Sonnet 3.7 can do autonomously, as per METR’s own measurements (plus the contractor vs. maintainer contrast).
The AI tools offered were often new to their users, which slows people down. Participants might have been partly learning AI tools on METR’s dime? Developers said they weren’t significantly inconvenienced by the tool changes but you can’t trust self-reports.
We also have a direct post mortem from Quentin Anthony, who was one of the 16 devs and experienced a 38% speedup when using AI, the best result of all participants. He ascribes others getting poor results in large part to:
Falling into the failure mode of pressing the magic bullet AI button and hoping the problem gets solved, rather than treating AI as a tool, which is not a good workflow.
Getting distracted during downtime as they wait for AI, also not a good workflow.
AIs running into various problems where they perform poorly.
All of that is true, but none of it seems like enough to explain the result.
Credit to Emmett Shear for being the first one to prominently lay this out fully.
Emmett Shear: METR’s analysis of this experiment is wildly misleading. The results indicate that people who have ~never used AI tools before are less productive while learning to use the tools, and say ~nothing about experienced AI tool users. Let’s take a look at why.
I immediately found the claim suspect because it didn’t jibe with my own experience working w people using coding assistants, but sometimes there are surprising results so I dug in. The first question: who were these developers in the study getting such poor results?
…
They claim “a range of experience using AI tools”, yet only a single developer of their sixteen had more than a single week of experience using Cursor. They make it look like a range by breaking “less than a week” into <1 hr, 1-10hrs, 10-30hrs, and 30-50hrs of experience.
Given the long steep learning curve for effectively using these new AI tools well, this division betrays what I hope is just grossly negligent ignorance about that reality, rather than intentional deception.
Of course, the one developer who did have more than a week of experience was 20% faster instead of 20% slower.
David Rein: Devs had roughly the following prior LLM experience:
– 7/16 had >100s of hours
– 7/16 had 10-100 hours
– 2/16 had 1-10 hours
We think describing this as “moderate AI experience” is fair, my guess is we’ll have to agree to disagree, but appreciate the feedback!
Emmett Shear: I think conflating the two completely invalidates the study’s headline and summary results. I suppose the future will tell if this is the case. I’m glad to have found the underlying disagreement.
It is clear that the source of disagreement is that I think using Cursor effectively is a distinct skill from talking to ChatGPT while you program, and expect fairly low transfer, while the authors think it’s a similar skill and expect much higher transfer.
I think Emmett is right that these tools are not similar. The data point that still needs to be explained (see Table 1 above) is the lack of improvement over those 30-50 hours using Cursor. If the learning curve is steep, then devs should be improving rapidly over that time. So I can still definitely see this going either way.
Regardless, this was an unusually hostile setting on many fronts, including the lack of experience. The result still is important in general.
Roon: am curious about a few things. the archetype of an “experienced open source developer” is very different from your average developer. is there a subset of inexperienced developers? developers who work for random companies but are not enthusiasts?
David Rein: yeah the open-source repos do typically have pretty high standards for linting, test coverage, etc.—not all of which is super clear in contributing guidelines necessarily (making it harder for AI to help)
Minh Nhat Nguyen: I would critique the “their own repos” part. by far the biggest unlock i have when using AI coding is navigating unfamiliar repos.
After some iteration, even if an AI made the initial draft, I’d be faster working myself on repos I already know well.
David Rein: One of the most confusing aspects of the result is that we don’t *require* developers to use AI, they’re just *allowed* to use it. So in principle, they should be able to just not use AI if it’s slowing them down.
There are two main explanations we have for this.
The first is that developers think that AI is speeding them up (they estimate they were sped up by 20%).
The second is that developers might be trading some speed for ease—using Cursor may be so much more pleasant that developers don’t notice or mind that they’re slowed down.
One common question is how much experience the developers have with AI tools—maybe they’re just particularly bad at using AI? While they aren’t AI power users before the study, nearly all have tens to hundreds of hours of prior experience using LLMs.
44% of the developers had used Cursor before, and for ~19% of them it was already their primary IDE. Furthermore, throughout the study, they spend around 20 hours allowed to use AI tools—and we don’t see speedup when excluding up to their first 8 tasks with AI.
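To picture that last check, here is a minimal sketch with invented per-task numbers (both the data and the simple ratio measure are placeholders, not METR's methodology): re-estimate the slowdown after dropping each developer's first k AI-allowed tasks and see whether it fades.

```python
# Illustrative learning-curve check with invented data; not METR's dataset or estimator.
# Each AI-allowed task records its order within the study for that developer and
# the ratio of observed completion time to the developer's own no-AI forecast.
tasks = [
    # (task_order, observed_time / no_ai_forecast)
    (1, 1.40), (2, 1.35), (3, 1.10), (4, 1.25), (5, 1.20), (6, 1.15),
    (7, 1.30), (8, 1.10), (9, 1.22), (10, 1.18), (11, 1.25), (12, 1.12),
]

def mean_ratio_excluding_first(tasks, k):
    """Average time ratio over AI-allowed tasks after dropping the first k."""
    kept = [ratio for order, ratio in tasks if order > k]
    return sum(kept) / len(kept)

# If the slowdown were mostly a learning-curve effect, this ratio should drift
# toward (or below) 1.0 as the earliest AI tasks are excluded.
for k in range(9):
    print(f"excluding first {k} AI tasks: mean time ratio {mean_ratio_excluding_first(tasks, k):.2f}")
```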
…
We further rule out a bunch of potential experimental artifacts: we don’t have dropout/attrition issues, the results are robust to variations of our outcome estimation methodology, developers primarily use frontier models (at the time), and we don’t see cheating.
Some other interesting findings! We find that developers are slowed down less on tasks they are less familiar with. This is intuitive—if you really know what you’re doing, AI can be less marginally helpful. Because we collect forecasts from developers on how long they expect issues to take both with and without AI, we can measure their speedup as a function of how much speedup they expect for particular issues. Developers are actually somewhat calibrated on AI’s usefulness!
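To make the calibration point concrete, here is a minimal sketch with invented numbers (neither the data nor the simple speedup proxy is METR's): for each issue, compare the speedup implied by the developer's own forecasts with a crude realized speedup, and check whether the two move together.

```python
# Illustrative calibration check with invented data; not METR's dataset or estimator.
# Each issue has the developer's forecast of hours without AI and with AI, plus the
# hours actually observed when AI was allowed.
from statistics import correlation  # Python 3.10+

issues = [
    # (forecast_hours_without_ai, forecast_hours_with_ai, observed_hours_with_ai)
    (2.0, 1.5, 2.4),
    (4.0, 2.5, 4.6),
    (1.0, 0.9, 1.1),
    (3.0, 2.0, 2.7),
    (2.5, 2.4, 2.6),
]

expected, realized = [], []
for no_ai, with_ai, observed in issues:
    expected.append(no_ai / with_ai - 1.0)   # speedup implied by the dev's own forecasts
    realized.append(no_ai / observed - 1.0)  # crude proxy: observed time vs. no-AI forecast

# A positive correlation would mean developers can tell, issue by issue, where AI is
# relatively more or less likely to help, even if their overall level estimate is off.
print(f"expected-vs-realized correlation: {correlation(expected, realized):.2f}")
```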
Just for fun, here are the repositories developers were working on in the study—they’re pretty impressive! I was really impressed by the general skill of the developers—they’re really experienced, and they contribute to large, complex projects.
Another takeaway worth noting is that self-reports of coding productivity, or of productivity gains from AI, cannot be trusted in general. Peter’s thread is excellent.
Peter Wildeford: I definitely think the biggest takeaway from this paper is that we likely can’t trust self-reports. This is pretty surprising to me, but is a finding commonly seen in productivity literature.
The March @METR_Evals paper contained this nugget about contractors being much slower [5x-18x slower to fix issues!] than maintainers. This seems borne out today, as the new study on AI slowdown was solely on maintainers: METR studied 16 long-time maintainers with an average of 5 years of prior work on the repo.
Seems important to keep in mind for AI timelines, when interpreting the prior METR paper on task horizon length, that the comparison was AI to contractors. A comparison of AI to veteran SWEs likely would have been tougher. I guess humans have returns to experience!
This does make me strongly suspect the METR paper on AI productivity slowdown would’ve gone differently if it was measuring junior engineers or senior engineers in new projects, as opposed to where there’s significant pre-existing fit with the exact work. My hypothesis is that the results in this paper are real, but don’t apply to a wide variety of scenarios where AIs do speed people up.
Overall Takeaways
I am somewhat convinced by Emmett Shear’s explanation. I strongly agree that ‘experience with LLMs’ does not translate cleanly to ‘experience with Cursor’ or with AI coding tools, although experience with other AI IDEs would fully count. And yes, there is a rather steep learning curve.
So I wouldn’t get too excited by all this until we try replication with a group that has a lot more direct experience. It should not be too hard to find such a group.
Certainly I still think AI is a vast productivity enhancer for most coders, and that Opus 4 (or your preferred alternative) is a substantial upgrade over Sonnet 3.7. Also Claude Code seems to be the core of the optimal stack at this point, with Cursor as a secondary tool. This didn’t change my estimates of the ‘normal case’ by that much.
I still think this is a meaningful update. The result was very different than people expected, and participants did not seem to be moving up the learning curve.