This is the July update of our misalignment bounty program.
Transcripts/logs can be found here.
Thanks, I find this pretty interesting. Sorry if I missed it above, but is there somewhere you could share the database of tasks and time estimates? That seems like the key input into your result, and I would like to get a better sense of how much to trust the estimates.
(agree, didn't intend to imply that they were)
Note that the RE-Bench correlation is definitionally uninformative: all tasks have the same length, so length has zero variance and the correlation comes out as 0/0, i.e. undefined. SWAA similarly suffers from range restriction, though not as severely.
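For intuition, here's a minimal numpy sketch with made-up numbers (not METR's data) illustrating both points: restricting the range of task lengths attenuates the measured correlation, and with zero variance in length the Pearson correlation is undefined (numpy reports NaN):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population: task length (minutes) varies widely and predicts a score.
length = rng.uniform(1, 480, size=1000)
score = -0.01 * length + rng.normal(0, 1, size=1000)

# Full range of lengths: a clearly nonzero correlation (~ -0.8 here).
print(np.corrcoef(length, score)[0, 1])

# Range restriction (the SWAA situation): keep only a narrow band of lengths.
band = (length > 200) & (length < 240)
print(np.corrcoef(length[band], score[band])[0, 1])  # attenuated toward 0

# Zero variance (the RE-Bench situation): every task has the same length.
same = np.full(1000, 480.0)
print(np.corrcoef(same, score)[0, 1])  # nan: r = cov/(sd_x * sd_y) is 0/0
```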
This seems plausible to me, but I could also imagine the opposite being true: my working memory is way smaller than the context window of most models. LLMs would destroy me at a task which "merely" required memorizing 100k tokens without doing any reasoning; I would do comparatively better at a project which was fairly small but required a bunch of different steps.
The METR report you cite finds that LLMs are vastly cheaper than humans when they do succeed, even for longer tasks:
The ARC-AGI results you cite seem hard to interpret: they may show that the very first models with a given capability are extremely expensive to run, but they don't necessarily mean that human-level performance will stay expensive forever.
I think the claim is that things with more exposure to AI are more expensive.
Thanks!
Yeah, my guess is that what Garry was saying is technically true but does not actually imply that companies are growing quickly in a more meaningful sense. Would love to learn more if someone has access to that data.