rohuang — LessWrong

Fable is SOTA at CIFAR Speedrun (& specification gaming)

Fulcrum is working on an AI R&D optimization benchmark. Here, we present results from one of our tasks, including preliminary results from Fable. For more detail on Fable’s solution, check out github.com/fulcrumresearch/cifar-10-speedrun. Summary: We gave current frontier models 100M tokens to see whether they could beat the human record for...

Jul 2029

Agents are under-elicited: A case study in optimization tasks

by zef, kaivu, leni, and rohuang

> "Knowing is not enough; we must apply. Willing is not enough; we must do."
>
> — Johann Wolfgang von Goethe

In our previous post, we introduced inverse rubric optimization (IRO): tasks where an agent must learn the preferences of a black-box judge under a label budget. These are...

Jun 1817

Inverse Rubric Optimization: A testbed for agent science

by zef, leni, kaivu, and rohuang

Jun 1110

Tracking Difficulty with Feature Portfolios

by kaivu, leni, zef, and rohuang

Thanks to Megan Kinniment for helpful comments and discussion, and to Jean-Stanislas Denain for helpful comments and pointers to past work. TL;DR: We claim that useful task attributes for forecasting AI capabilities should be measurable, interpretable, stable in their trends over time, and sufficient to explain task difficulty. task.human_completion_time (human...

May 1923

Benchmarking Real Work

by kaivu, leni, rohuang, and zef

Thanks to Megan Kinniment for helpful comments and discussion. TL;DR: Benchmarks like HCAST undersample fuzzy (hard to evaluate) tasks, meaning they might overestimate capability on long-horizon work. To sample fuzzy tasks we need to increase judge capacity: we can either try to build automated judges that match human judgment, or...

May 1630

The bitter lesson for software

by zef, rohuang, and kaivu

Software is made of information flows

Software encodes information flows. An ERP system, for instance, takes procurement and locks it into a specific sequence of purchase orders, approval routing, invoice matching, and payment release. Git takes multiple people changing code and imposes a protocol of branching, diffing, reviewing, and merging....

Mar 1615

More is different for intelligence

by zef, rohuang, and kaivu

Why did software change the world? In the 1900s, much of the work being done by knowledge workers was computation: searching, sorting, calculating, tracking. Software made this work orders of magnitude cheaper and faster. Naively, one might expect businesses and institutions to carry out largely the same processes, just more...

Mar 717