Peter Wildeford: I definitely think the biggest takeaway from this paper is that we likely can’t trust self-reports. This is pretty surprising to me, but is a finding commonly seen in productivity literature.
Not just in productivity; this is a finding from so many fields of research that I struggle to find any situation where self-report is trustworthy. Yes, sometimes it is the best we can do, but that doesn't remove the weaknesses. With that said, I really appreciate the dual research model showing the self-reported versus actual situation in this case, and I hope more projects use it going forward.
While I think it is plausible the results would have been different if the devs had had e.g. 100 hours more experience with Cursor, it is worth also noting that:
- 14/16 of the devs rated themselves as 'average' or above as Cursor users at the end of the study
- The METR staff working on the project thought the devs were qualitatively reasonable Cursor users (based on screen recordings etc.)
So I think it is unlikely the devs were using Cursor in an unusually unskilled way.
The forecasters were told that only 25% of the devs had prior Cursor experience (the actual number ended up being 44%), and still predicted substantial speedup, so if there is a steep Cursor learning curve here, that seems like a fact people didn't expect.
With all that being said, the skill ceiling for using AI tools is clearly at least *not being slowed down* (as they could simply not use the AI tools), so it would be reasonable to expect that eventually some level of experience would lead to that result.
(I consulted with METR on the stats in the paper, so am quite familiar with it).
I feel like people are dismissing this study out of hand without updating appropriately. If there's at least a chance that this result replicates, that should shift our opinions somewhat.
First, a few reasons why the common counterarguments aren't strong enough to dismiss the study:
I've been seeing arguments against this result based on vibes, or claims that the next generation of LLMs will overturn this result. But that is directly contradicted by the results of this study: people's feelings are poor indicators of actual productivity.
On Cursor experience, I think Joel Becker had a reasonable response here. Essentially, many of the coders had tried Cursor, had some experience with it, and had a lot of experience using LLMs for programming. Is the learning curve really so steep that we shouldn't see them improve over the many tasks? See image below. Perhaps the fact that these programmers don't use it and saw little improvement is a sign that Cursor isn't very helpful.
While this is a challenging environment for LLM coding tools, this is the sort of environment I want to see improvement in for AI to have a transformative impact on coding. Accelerating experienced devs is where a lot of the value of automating coding will come from.
That aside, how should we change our opinions with regard to the study?
Getting AI to be useful in a particular domain is tricky: you have to actually run tests and establish good practices.
Anecdotes about needing discipline to stay on task with coding tools, and about the Cursor learning curve, suggest that AI adoption has frictions and requires tacit knowledge.
Coding is one of the cleanest, most data-rich, most LLM-developer-supported domains. As of yet, AI automation is not a slam dunk, even here. Every other domain will require its own iteration, testing, and practice to see a benefit.
If this holds, the points above slow AI diffusion, particularly when AI is used as a tool for humans. Modelling the impact of current and near-future AIs should take this into account.
I am updating more towards the possibility that LLM programming is not a speedup for experienced programmers.
I do think, personally, using Cursor and other such tools has stagnated my dev skill growth a lot, but it also seems to have allowed me to do a lot more stuff.
Might try a week without LLM-assisted coding starting tomorrow, see how it goes.
METR ran a proper RCT seeing how much access to Cursor (using Sonnet 3.5/3.7) would accelerate coders working on their own open source repos.
Everyone surveyed expected a substantial speedup. The developers thought they were being substantially sped up.
Instead, it turned out that using Cursor slowed them down.
That surprised everyone, raising the question of why.
Currently our best guess is this comes down to a combination of two factors:
Deeply understood open source repos are close to a worst-case scenario for AI tools, because they require bespoke outputs in various ways and the coder has lots of detailed local knowledge of the codebase that the AI lacks.
The coders in question mostly did not have experience with similar AI tools. The lack of a learning curve during the experiment challenges this explanation, but the tools very clearly have a steep learning curve, the same way other programming skills do.
Thus we should be careful interpreting the result. It was still highly virtuous to run an RCT, and to publish the results even when they were against interest and counterintuitive, and at risk of being quoted endlessly in misleading fashion by AI skeptics. That is how real science works.
Again, due to all the circumstances, one should avoid inferring too much. I would like to see the study done again where everyone had at least a few weeks of working full time with such tools, ideally also while working on other types of projects. And a result this surprising means we should be on the lookout for flaws.
The result was still very surprising to METR, to the developers in the test, to the forecasters, and also to those who saw the results.
Yo Shavit: something something METR good bc publishing against their priors blah blah
all I care about is that this vindicates my incompetence in using models for my actual work
Dwarkesh Patel: Surely this doesn’t have implications for how I use AI and whether I’m fooling myself about how much more effective it’s making my podcast prep, right?
The Core Result
METR: We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.
The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn’t.
We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).
We randomly assigned each task to either allow AI (typically Cursor Pro w/ Claude 3.5/3.7) or disallow AI help.
At the beginning of the study, developers forecasted that they would get sped up by 24%. After actually doing the work, they estimated that they had been sped up by 20%. But it turned out that they were actually slowed down by 19%.
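To keep the directions of those percentages straight, here is a minimal sketch of the arithmetic, assuming the convention that an X% speedup means tasks take 1/(1+X) of the baseline time and an X% slowdown means they take (1+X) times as long (METR's actual estimator is more sophisticated than this):

```python
# Minimal sketch of the percentage conventions; an assumed convention, not METR's estimator.
# Assumes an X% speedup shrinks completion time to 1/(1+X) of baseline,
# and an X% slowdown grows it to (1+X) times baseline.

def ratio_from_speedup(speedup: float) -> float:
    """Time with AI divided by time without AI, given a fractional speedup."""
    return 1.0 / (1.0 + speedup)

def ratio_from_slowdown(slowdown: float) -> float:
    """Time with AI divided by time without AI, given a fractional slowdown."""
    return 1.0 + slowdown

forecast = ratio_from_speedup(0.24)      # ~0.81, the speedup devs predicted up front
self_report = ratio_from_speedup(0.20)   # ~0.83, the speedup devs believed they got
observed = ratio_from_slowdown(0.19)     # 1.19, what the measurements actually showed

print(f"forecast ratio:    {forecast:.2f}")
print(f"self-report ratio: {self_report:.2f}")
print(f"observed ratio:    {observed:.2f}")
print(f"perception gap:    {observed / self_report:.2f}x")  # ~1.4x under this convention
```

Under that reading, developers believed their AI-allowed tasks were taking roughly 83% as long as they otherwise would, when those tasks were actually taking roughly 19% longer.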
We were surprised by this, given a) impressive AI benchmark scores, b) widespread adoption of AI tooling for software development, and c) our own recent research measuring trends in the length of tasks that agents are able to complete.
When AI is allowed, developers spend less time actively coding and searching for information, and instead spend time prompting AI, waiting on/reviewing AI outputs, and idle. We find no single reason for the slowdown—it’s driven by a combination of factors.
To better understand these factors, we investigate 20 properties of our setting, finding 5 likely contributors, and 8 mixed/unclear factors.
We also analyze to make sure the result isn’t a fluke, and find that slowdown persists across different outcome measures, estimator methodologies, and many other subsets/analyses of our data.
…
What do we take away?
1. It seems likely that for some important settings, recent AI tooling has not increased productivity (and may in fact decrease it).
2. Self-reports of speedup are unreliable—to understand AI’s impact on productivity, we need experiments in the wild.
Another implication:
It is sometimes proposed that we should monitor AI R&D acceleration inside of frontier AI labs via simple employee surveys. We’re now more pessimistic about these, given how large of a gap we observe between developer-estimated and observed speed-up.
What we’re NOT saying:
1. Our setting represents all (or potentially even most) software engineering.
2. Future models won’t be better (or current models can’t be used more effectively).
David Rein: I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup.
Yes, it is. We got people to preregister their expectations, and even folks who are extremely in-the-know about AI coding abilities still failed to predict this result.
Your *vibes* are not reliable indicators of productivity effects.
Jeffrey Ladish: Surprising results from METR re AI software engineer uplift! Great to see this kind of empirical investigation. Our intuitions are not always correct…
I do think this is to some extent a skill issue. Pretty sure I know some people who’ve learned to use the tools effectively and get a big speed and quality boost.
Daniel Kokotajlo: Very important work! This also has lengthened my timelines somewhat, for obvious reasons. :)
In perhaps the most shocking fact of all, developers actually slightly overestimated their required time in the non-AI scenario. I thought that was never how any of this worked?
Okay So That Happened
So now that we have the result, in addition to updating in general, what explains why this situation went unusually poorly?
Here are the paper’s own theories first:
The big disagreement is over the first factor here, as to whether the development environment and associated AI tools should count as familiar in context.
There are several factors that made this situation unusually AI-unfriendly.
AI coding is at its best when it is helping you deal with the unfamiliar, compensate for lack of skill, and when it can be given free rein or you can see what it can do and adapt the task to the tool. Those didn’t apply here.
Roon: IME really good software ppl who deeply care find the least use from LLM coding and are often ideologically opposed to it because they like to exert editorial control over every line. slop research coders such as myself don’t care as much and have much larger gains.
this result is still surprising, how/why does it slow them down? but I wouldn’t think it generalizes to the average software developer who’s just trying to get some damn thing done and not trying to write maintainable useful code on a top open source library.
Eric Raymond: I think I qualify as an experienced open source developer, and that study looks completely ridiculous to me.
I’ve discussed it with some peers. We think one of the confounders may be that LLMs are much better at accelerating green-field development than fixing or improving large existing codebases.
Also there’s a difference in their performance between front-end and back-end stuff. Big advantage for web front-end dev, not so much for back-end. I’ve experienced this difference myself.
These were projects that the developers already knew intimately, with high context, and they did the task they would otherwise have done next. They were already familiar with the repos, were working at a very high skill level, and were trying to adapt the tool to the task and not the task to the tool.
In particular, they broke down tasks into 1-2 hour chunks before they knew whether they could use AI for the subtasks. That’s great RCT design, but does mean flexibility was limited.
These were large open source projects that thus have a variety of high standards and requirements, and require a lot of tacit knowledge and context. AI code that is ‘good enough’ in other contexts wasn’t up to standards here, and this was identified as the biggest factor: only 39% of Cursor generations were accepted, and many of those still required reworking.
Pay was by the hour, so there was a large temptation to let the AI cook and otherwise work not so efficiently. From Ruby we get the reminder that a natural thing to do when working in Cursor is to end up checking social media while it runs.
They certainly weren’t doing AI coder multitasking or anything like that.
As always, there is a lag, this was done with Sonnet 3.5/3.7. Ruby notes that the models we have now are already substantially better.
The tasks were modestly beyond the range of tasks Sonnet 3.7 can do autonomously, as per METR’s own measurements (plus the contractor vs. maintainer contrast).
The AI tools offered were often new to their users, which slows people down. Participants might have been partly learning AI tools on METR’s dime? Developers said they weren’t significantly inconvenienced by the tool changes but you can’t trust self-reports.
We also have a direct post mortem from Quentin Anthony, who was one of the 16 devs and experienced a 38% speedup when using AI, the best result of all participants. He ascribes others getting poor results in large part to:
Falling into the failure mode of pressing the magic bullet AI button and hoping the problem gets solved, rather than treating AI as a tool, which is not a good workflow.
Getting distracted during downtime as they wait for AI, also not a good workflow.
AIs running into various problems where they perform poorly.
All of that is true, but none of it seems like enough to explain the result.
Credit to Emmett Shear for being the first one to prominently lay this out fully.
Emmett Shear: METR’s analysis of this experiment is wildly misleading. The results indicate that people who have ~never used AI tools before are less productive while learning to use the tools, and say ~nothing about experienced AI tool users. Let’s take a look at why.
I immediately found the claim suspect because it didn’t jibe with my own experience working w people using coding assistants, but sometimes there are surprising results so I dug in. The first question: who were these developers in the study getting such poor results?
…
They claim “a range of experience using AI tools”, yet only a single developer of their sixteen had more than a single week of experience using Cursor. They make it look like a range by breaking “less than a week” into <1 hr, 1-10hrs, 10-30hrs, and 30-50hrs of experience.
Given the long steep learning curve for effectively using these new AI tools well, this division betrays what I hope is just grossly negligent ignorance about that reality, rather than intentional deception.
Of course, the one developer who did have more than a week of experience was 20% faster instead of 20% slower.
David Rein: Devs had roughly the following prior LLM experience:
– 7/16 had >100s of hours
– 7/16 had 10-100 hours
– 2/16 had 1-10 hours
We think describing this as “moderate AI experience” is fair, my guess is we’ll have to agree to disagree, but appreciate the feedback!
Emmett Shear: I think conflating the two completely invalidates the study’s headline and summary results. I suppose the future will tell if this is the case. I’m glad to have found the underlying disagreement.
It is clear that the source of disagreement is that I think using Cursor effectively is a distinct skill from talking to ChatGPT while you program, and expect fairly low transfer, while the authors think it’s a similar skill and expect much higher transfer.
I think Emmett is right that these tools are not similar. The data point that still needs to be explained (see Table 1 above) is the lack of improvement over those 30-50 hours using Cursor. If the learning curve is steep, then devs should be improving rapidly over that time. So I can still definitely see this going either way.
Regardless, this was an unusually hostile setting on many fronts, including the lack of experience. The result still is important in general.
Roon: am curious about a few things. the archetype of an “experienced open source developer” is very different from your average developer. is there a subset of inexperienced developers? developers who work for random companies but are not enthusiasts?
David Rein: yeah the open-source repos do typically have pretty high standards for linting, test coverage, etc.—not all of which is super clear in contributing guidelines necessarily (making it harder for AI to help)
Minh Nhat Nguyen: I would critique the “their own repos” part. by far the biggest unlock i have when using AI coding is navigating unfamiliar repos.
After some iteration, even if an AI made the initial draft, I’d be faster working myself on repos I already know well.
David Rein: One of the most confusing aspects of the result is that we don’t *require* developers to use AI, they’re just *allowed* to use it. So in principle, they should be able to just not use AI if it’s slowing them down.
There are two main explanations we have for this.
The first is that developers think that AI is speeding them up (they estimate they were sped up by 20%).
The second is that developers might be trading some speed for ease—using Cursor may be so much more pleasant that developers don’t notice or mind that they’re slowed down.
One common question is how much experience the developers have with AI tools—maybe they’re just particularly bad at using AI? While they aren’t AI power users before the study, nearly all have tens to hundreds of hours of prior experience using LLMs.
44% of the developers had used Cursor before, and for ~19% of them it was already their primary IDE. Furthermore, throughout the study, they spend around 20 hours allowed to use AI tools—and we don’t see speedup when excluding up to their first 8 tasks with AI.
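To picture that last check, here is a minimal sketch with invented per-task numbers (both the data and the simple ratio measure are placeholders, not METR's methodology): re-estimate the slowdown after dropping each developer's first k AI-allowed tasks and see whether it fades.

```python
# Illustrative learning-curve check with invented data; not METR's dataset or estimator.
# Each AI-allowed task records its order within the study for that developer and
# the ratio of observed completion time to the developer's own no-AI forecast.
tasks = [
    # (task_order, observed_time / no_ai_forecast)
    (1, 1.40), (2, 1.35), (3, 1.10), (4, 1.25), (5, 1.20), (6, 1.15),
    (7, 1.30), (8, 1.10), (9, 1.22), (10, 1.18), (11, 1.25), (12, 1.12),
]

def mean_ratio_excluding_first(tasks, k):
    """Average time ratio over AI-allowed tasks after dropping the first k."""
    kept = [ratio for order, ratio in tasks if order > k]
    return sum(kept) / len(kept)

# If the slowdown were mostly a learning-curve effect, this ratio should drift
# toward (or below) 1.0 as the earliest AI tasks are excluded.
for k in range(9):
    print(f"excluding first {k} AI tasks: mean time ratio {mean_ratio_excluding_first(tasks, k):.2f}")
```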
…
We further rule out a bunch of potential experimental artifacts: we don’t have dropout/attrition issues, the results are robust to variations of our outcome estimation methodology, developers primarily use frontier models (at the time), and we don’t see cheating.
Some other interesting findings! We find that developers are slowed down less on tasks they are less familiar with. This is intuitive—if you really know what you’re doing, AI can be less marginally helpful. Because we collect forecasts from developers on how long they expect issues to take both with and without AI, we can measure their speedup as a function of how much speedup they expect for particular issues. Developers are actually somewhat calibrated on AI’s usefulness!
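To make the calibration point concrete, here is a minimal sketch with invented numbers (neither the data nor the simple speedup proxy is METR's): for each issue, compare the speedup implied by the developer's own forecasts with a crude realized speedup, and check whether the two move together.

```python
# Illustrative calibration check with invented data; not METR's dataset or estimator.
# Each issue has the developer's forecast of hours without AI and with AI, plus the
# hours actually observed when AI was allowed.
from statistics import correlation  # Python 3.10+

issues = [
    # (forecast_hours_without_ai, forecast_hours_with_ai, observed_hours_with_ai)
    (2.0, 1.5, 2.4),
    (4.0, 2.5, 4.6),
    (1.0, 0.9, 1.1),
    (3.0, 2.0, 2.7),
    (2.5, 2.4, 2.6),
]

expected, realized = [], []
for no_ai, with_ai, observed in issues:
    expected.append(no_ai / with_ai - 1.0)   # speedup implied by the dev's own forecasts
    realized.append(no_ai / observed - 1.0)  # crude proxy: observed time vs. no-AI forecast

# A positive correlation would mean developers can tell, issue by issue, where AI is
# relatively more or less likely to help, even if their overall level estimate is off.
print(f"expected-vs-realized correlation: {correlation(expected, realized):.2f}")
```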
Just for fun, here are the repositories developers were working on in the study—they’re pretty impressive! I was really impressed by the general skill of the developers—they’re really experienced, and they contribute to large, complex projects.
Another takeaway worth noting is that self-reports of coding productivity, or of productivity gains from AI, cannot be trusted in general. Peter’s thread is excellent.
Peter Wildeford: I definitely think the biggest takeaway from this paper is that we likely can’t trust self-reports. This is pretty surprising to me, but is a finding commonly seen in productivity literature.
The March @METR_Evals paper contained this nugget about contractors being much slower [5x-18x slower to fix issues!] than maintainers. This seems borne out today, as the new study on AI slowdown was solely on maintainers: METR studied 16 long-time maintainers with an average of 5 years of prior work on the repo.
Seems important to keep in mind for AI timelines, when interpreting the prior METR paper on task horizon length, that the comparison was AI to contractors. A comparison of AI to veteran SWEs likely would have been tougher. I guess humans have returns to experience!
This does make me strongly suspect the METR paper on AI productivity slowdown would’ve gone differently if it was measuring junior engineers or senior engineers in new projects, as opposed to where there’s significant pre-existing fit with the exact work. My hypothesis is that the results in this paper are real, but don’t apply to a wide variety of scenarios where AIs do speed people up.
Overall Takeaways
I am somewhat convinced by Emmett Shear’s explanation. I strongly agree that ‘experience with LLMs’ does not translate cleanly to ‘experience with Cursor’ or with AI coding tools, although experience with other AI IDEs would fully count. And yes, there is a rather steep learning curve.
So I wouldn’t get too excited by all this until we try replication with a group that has a lot more direct experience. It should not be too hard to find such a group.
Certainly I still think AI is a vast productivity enhancer for most coders, and that Opus 4 (or your preferred alternative) is a substantial upgrade over Sonnet 3.7. Also Claude Code seems to be the core of the optimal stack at this point, with Cursor as a secondary tool. This didn’t change my estimates of the ‘normal case’ by that much.
I still think this is a meaningful update. The result was very different than people expected, and participants did not seem to be moving up the learning curve.