I already expect some (probably substantial) effect from AIs helping to build RL environments.
I think scraping and filtering MCP servers, then doing RL training to navigate them, is largely (even if not fully) automatable and already being done (cf. this for SFT), but it doesn't unlock massive value.
In their BOTEC, it seems you roughly agree with a group size of 64 and 5 reuses per task (since 5 * 64 = 320 falls within the 100 to 1k range).
You wrote $0.1 to $1 per rollout, whereas they have in mind 500,000 tokens * $15 / 1M tokens = $7.5 per rollout. 500,000 tokens doesn't seem especially high for hard agentic software engineering tasks, which often reach into the millions of tokens.
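Spelling out that arithmetic, here's a quick sketch; the token count, price, group size, and reuse numbers are just the rough figures from this thread, not measured costs:

```python
# Rough BOTEC using the illustrative figures from this thread.
price_per_mtok = 15           # assumed API price, $ per 1M tokens
tokens_per_rollout = 500_000  # hard agentic SWE tasks often reach into the millions

cost_per_rollout = tokens_per_rollout / 1_000_000 * price_per_mtok
print(f"cost per rollout: ${cost_per_rollout:.2f}")  # $7.50, vs. the $0.1-$1 figure

group_size = 64
reuses_per_task = 5
rollouts_per_task = group_size * reuses_per_task     # 320, within the 100-1k range
print(f"compute per task: ${rollouts_per_task * cost_per_rollout:,.0f}")  # ~$2,400
```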
Does the disagreement come from ongoing improvements to RL environments already being priced into the existing trend?
I agree with this especially for e.g. METR tasks, or other proxies of how generally smart a model is.
A case for acceleration in enterprise revenue (rather than general smarts) could look like:
Ultimately I don't really buy this either, since we already have e.g. some Excel/Sheets integrations that are not great, but better than what there was a couple of months ago. And an increase in the breadth of RL environments is probably already factored into the trend somewhat.
ETA: this also matters less if you're primarily tracking AI R&D capabilities (or it might matter, but only indirectly, through driving more investment etc.).
> Another way to think about this is that it could be reasonable to spend within the same order of magnitude on each RL environment as you spend in compute cost to train on that environment. I think the compute cost for doing RL on a hard agentic software engineering task might be around $10 to $1000 ($0.1 to $1 for each long rollout and you might do 100 to 1k rollouts?), so this justifies a lot of spending per environment. And, environments can be reused across multiple training runs (though they could eventually grow obsolete).
Agreed, cf. https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/
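For reference, multiplying out the ranges in the quote above (nothing beyond the quoted numbers; the reuse count is a hypothetical):

```python
# Per-environment compute cost from the quoted BOTEC.
cost_per_rollout = (0.1, 1.0)  # $ per long rollout
n_rollouts = (100, 1_000)      # rollouts per environment per training run

low = cost_per_rollout[0] * n_rollouts[0]
high = cost_per_rollout[1] * n_rollouts[1]
print(f"compute per environment: ${low:.0f} to ${high:,.0f}")  # $10 to $1,000

# If the environment is reused across several training runs, the justified spend
# on building it scales accordingly (modulo eventual obsolescence).
n_training_runs = 3            # hypothetical reuse count
print(f"with reuse: ${low * n_training_runs:.0f} to ${high * n_training_runs:,.0f}")
```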
[I work at Epoch AI]
Thanks for your comment; I'm happy you found the logs helpful! I wouldn't call the evaluation broken: the prompt clearly states the desired format, which the model fails to follow. We mention this in our Methodology section and FAQ ("Why do some models underperform the random baseline?"), but I think we're also going to add a clarifying note about it in the tooltip.
While I do think "how well do models respect the formatting instructions in the prompt" is also valuable to know, I agree that I'd want to disentangle that from "how good are models at reasoning about the question". Adding a second, more flexible scorer (likely based on an LLM judge, like the one we have for OTIS Mock AIME) is in our queue; we're just pretty strapped on engineering capacity at the moment :)
ETA: since it's particularly extreme in this case, I plan to hide this evaluation until the new scorer is added.
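For concreteness, a judge-based scorer could look roughly like the sketch below. This is illustrative only, not our actual implementation; `call_judge_model` is a placeholder for a real model API client:

```python
# Illustrative sketch of an LLM-judge-based scorer that tolerates formatting deviations.
# `call_judge_model` is a placeholder for a real model API client.

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Submitted answer: {submission}
Does the submitted answer match the reference answer in substance,
ignoring formatting? Reply with exactly one word: CORRECT or INCORRECT."""


def call_judge_model(prompt: str) -> str:
    raise NotImplementedError("plug in a model API client here")


def judge_score(question: str, reference: str, submission: str) -> float:
    """Return 1.0 if the judge deems the submission correct, else 0.0."""
    verdict = call_judge_model(
        JUDGE_PROMPT.format(question=question, reference=reference, submission=submission)
    )
    return 1.0 if verdict.strip().upper().startswith("CORRECT") else 0.0
```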
There's this: https://github.com/Jellyfish042/uncheatable_eval
This distinction reminds me of *Evading Black-box Classifiers Without Breaking Eggs*, in the black-box adversarial examples setting.
Well that was timely
Amazon recently bought a 960 MW nuclear-powered datacenter.
I think this doesn't contradict your claim that "The largest seems to consume 150 MW", because the 960 MW datacenter hasn't been built yet (or there is already a datacenter there, but it doesn't consume that much power for now)?
Neat!
"Does AI Progress Have a Speed Limit?" links to an April 2025 dialogue between Ajeya Cotra and Arvind Narayanan. Perhaps you wanted to link the 2023 dialogue between Tamay and Matt Clancy, also on Asterisk?