Alexander Barry
Alexander Barry has not written any posts yet.

Good catch! Edited my comment. It had been a while since I had looked at the results and I must have also lost the ability to read in the meantime.
Thanks for the great writeup!
I'm a statistician who does some work with METR, and I recently worked on a very similar project to create a Bayesian version of the Time Horizon model. Mine ended up being somewhat different from yours (it deviates a bit more from the current structure of the METR model), but it's great to see other people stress-testing the modelling.
On the 80% Time Horizon results, I agree that your 'marginal' approach is correct, and it is the one I also took in my model. However, my 80% results ended up being a factor of 2 (edit: higher) than the results of METR's current model for recent SOTA LLMs. Here is...
While I think it is plausible the results would have been different if the devs had had e.g. 100 hours more experience with Cursor, it is also worth noting that:
- 14/16 of the devs rated themselves as 'average' or above Cursor users at the end of the study
- The METR staff working on the project thought the devs were qualitatively reasonable Cursor users (based on screen recordings etc.)
So I think it is unlikely the devs were using Cursor in an unusually unskilled way.
The forecasters were told that only 25% of the devs had prior Cursor experience (the actual number ended up being 44%), and still predicted substantial speedup, so if there is...
I chatted with Thomas a bit about this, and I agree that the default METR model should also output things that are close to the 'marginal' definition of time horizon (or at least as close as it can be approximated with the inverse logit sigmoid).
I think the important thing to realise is that while one needs to take additional steps for the 'marginal' approach when fitting a model that explicitly accounts for the deviation in task-length-for-humans vs task-difficulty-for-LLMs, models that don't explicitly account for this (such as the original METR model) should naturally have it learned into the shape of their logistic curve.
(A similar thing is also true for having the discrimination parameter vary by task instead of by model: if it varies by task, this uncertainty needs to be accounted for in the time horizon calculations, but since this is not the case in the original METR model, no such adjustment is needed there.)
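To make the 'marginal' vs. conditional distinction concrete, here is a minimal toy sketch. It is not the METR model or either Bayesian model discussed above. It assumes a hypothetical parameterisation in which success probability is sigmoid(a - b*log2(t) + eps), where t is the human task length and eps ~ N(0, sigma^2) is a per-task deviation between human length and LLM difficulty. The names `a`, `b`, `sigma` and all the functions are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conditional_horizon(a, b, p):
    """Task length t at which sigmoid(a - b*log2(t)) == p, i.e. with eps fixed at 0."""
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((a - logit_p) / b)

def marginal_success(log2_t, a, b, sigma, n=2001):
    """Success probability averaged over eps ~ N(0, sigma^2).

    Uses a simple normalised grid sum over [-8*sigma, 8*sigma] (an assumption
    made for self-containedness; a real fit would average over posterior draws).
    """
    if sigma == 0.0:
        return sigmoid(a - b * log2_t)
    lo, hi = -8.0 * sigma, 8.0 * sigma
    step = (hi - lo) / (n - 1)
    total, weight = 0.0, 0.0
    for i in range(n):
        eps = lo + i * step
        w = math.exp(-0.5 * (eps / sigma) ** 2)  # unnormalised normal density
        total += w * sigmoid(a - b * log2_t + eps)
        weight += w
    return total / weight

def marginal_horizon(a, b, sigma, p):
    """Task length t at which the eps-averaged success probability equals p."""
    lo, hi = -40.0, 40.0  # search range for log2(t); assumed wide enough
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        # marginal_success decreases as tasks get longer
        if marginal_success(mid, a, b, sigma) > p:
            lo = mid
        else:
            hi = mid
    return 2.0 ** (0.5 * (lo + hi))

if __name__ == "__main__":
    a, b, sigma = 3.0, 0.6, 1.5  # made-up values, for illustration only
    for p in (0.5, 0.8):
        print(p, conditional_horizon(a, b, p), marginal_horizon(a, b, sigma, p))
```

In this toy setup the two definitions agree at 50% (by symmetry of eps) and differ at 80%, which is why the extra marginalisation step matters only when the per-task deviation is modelled explicitly. The direction and size of the gap here are properties of the toy and say nothing about how the fitted models compare.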