(Updated) METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)
Update: Added GPT-5.2 to the main part of the text; this uses all data from v1.1. Added an appendix using all METR models, by joining v1.0 and v1.1. Added an appendix with marginal vs. typical P(success) curves. Thanks to Thomas Kwa for telling me about this. TL;DR: I reanalyzed the METR task...
Feb 1321