Thanks to Megan Kinniment for helpful comments and discussion, and to Jean-Stanislas Denain for helpful comments and pointers to past work. TL;DR: We claim that a useful task attribute for forecasting AI capabilities should be measurable, interpretable, stable in its trend over time, and sufficient to explain task difficulty. task.human_completion_time (human...
Thanks to Megan Kinniment for helpful comments and discussion. TL;DR: Benchmarks like HCAST undersample fuzzy (hard-to-evaluate) tasks, so they might overestimate capability on long-horizon work. To sample fuzzy tasks we need to increase judge capacity: we can either try to build automated judges that match human judgment, or...
Software is made of information flows
Software encodes information flows. An ERP system, for instance, takes procurement and locks it into a specific sequence of purchase orders, approval routing, invoice matching, and payment release. Git takes multiple people changing code and imposes a protocol of branching, diffing, reviewing, and merging....
Why did software change the world?
In the 1900s, much of the work being done by knowledge workers was computation: searching, sorting, calculating, tracking. Software made this work orders of magnitude cheaper and faster. Naively, one might expect businesses and institutions to carry out largely the same processes, just more...