This is a rough research note where the primary objective was my own learning. I am sharing it because I’d love feedback and I thought the results were interesting.
Introduction
A recent METR paper [1] showed that the length of software engineering tasks that LLMs could successfully complete appeared to be doubling roughly every seven months. I asked the same question for offensive cybersecurity, a domain with distinct skills and unique AI-safety implications.
Using METR’s methodology on five cyber benchmarks, with tasks ranging from 0.5s to 25h in human-expert estimated time, I evaluated many state-of-the-art model releases from the past five years. I found:
- Cyber task horizons are doubling every ~5 months.
- The best
...
Yes, I agree. Since these times were my primary anchor for the other CTF estimates, this could be a significant reason why the SOTA time horizons I found were so much lower than those METR found in their work on software engineering.