LESSWRONG
jsd

Comments
boazbarak's Shortform
jsd · 12d · 10

Neat! 

"Does AI Progress Have a Speed Limit?" links to an April 2025 dialogue between Ajeya Cotra and Arvind Narayanan. Perhaps you wanted to link the 2023 dialogue between Tamay and Matt Clancy, also on Asterisk?

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
jsd · 15d · 70

> I already expect some (probably substantial) effect from AIs helping to build RL environments

I think scraping and filtering MCP servers, then RL-training models to navigate them, is largely (even if not fully) automatable and already being done (cf. this for SFT), but doesn't unlock massive value.

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
jsd · 15d · Ω110

In their BOTEC, it seems you roughly agree with a group size of 64 and 5 reuses per task (since 5 * 64 = 320 is between 100 and 1k).

You wrote $0.1 to $1 per rollout, whereas they have in mind 500,000 tokens * $15 / 1M tokens = $7.50 per rollout. 500,000 tokens doesn't seem especially high for hard agentic software engineering tasks, which often reach into the millions.

Does the disagreement come from:

  • Thinking the $15 estimate from opportunity cost is too high (so compute cost is lower than Mechanize claims)?
  • Expecting most of the RL training to somehow not be end-to-end (so compute cost is lower than Mechanize claims)?
  • Expecting spending per RL environment to be smaller than compute spending, even if within an OOM?
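For concreteness, the disagreement above can be written out as a quick sketch (the token count, $/Mtok price, group size, and reuse count are the figures discussed in this thread, not authoritative numbers):

```python
# Rough BOTEC comparing the two per-rollout cost estimates (illustrative only).

TOKENS_PER_ROLLOUT = 500_000  # Mechanize's assumed tokens per hard agentic rollout
PRICE_PER_MTOK = 15.0         # $ per 1M tokens, from the opportunity-cost estimate
GROUP_SIZE = 64               # rollouts per task per update
REUSES_PER_TASK = 5           # times each task is reused during training

mechanize_cost_per_rollout = TOKENS_PER_ROLLOUT * PRICE_PER_MTOK / 1_000_000
rollouts_per_task = GROUP_SIZE * REUSES_PER_TASK

print(f"Mechanize: ${mechanize_cost_per_rollout:.2f} per rollout")  # $7.50
print(f"Rollouts per task: {rollouts_per_task}")                    # 320, i.e. between 100 and 1k
print(f"Mechanize: ~${mechanize_cost_per_rollout * rollouts_per_task:,.0f} compute per task")  # $2,400
print(f"Quoted $0.1-$1 range: ${0.1 * rollouts_per_task:.0f} to ${1.0 * rollouts_per_task:.0f} per task")
```

The roughly 10-75x gap between the two per-task figures is exactly what the bullet points above are trying to locate.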
Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
jsd · 15d · 10

> ongoing improvements to RL environments are already priced into the existing trend

I agree with this especially for e.g. METR tasks, or proxies for how generally smart a model is.

A case for acceleration in enterprise revenue (rather than general smarts) could look like: 

  • So far RL still has been pretty targeted towards coding, research/browsing, math, or being-ok-at-generic-tool-use (taking random MCP servers and making them into environments, like Kimi K2 did but with RL).
  • It takes SWE time to build custom interfaces for models to work with economically productive software like Excel, Salesforce, or Photoshop. We're not there yet, at least with publicly released models. Once we are, this suddenly unlocks a massive amount of economic value.

Ultimately I don't really buy this either, since we already have e.g. some Excel/Sheets integrations that are not great but better than what existed a couple of months ago. And the increase in breadth of RL environments is probably already factored into the trend somewhat.

ETA: this also matters less if you're primarily tracking AI R&D capabilities (or it might matter, but only indirectly, through driving more investment etc.).

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
jsd · 15d · Ω340

> Another way to think about this is that it could be reasonable to spend within the same order of magnitude on each RL environment as you spend in compute cost to train on that environment. I think the compute cost for doing RL on a hard agentic software engineering task might be around $10 to $1000 ($0.1 to $1 for each long rollout and you might do 100 to 1k rollouts?), so this justifies a lot of spending per environment. And, environments can be reused across multiple training runs (though they could eventually grow obsolete).

Agreed, cf https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/ 
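The quoted range works out as follows (a trivial check, using only the numbers from the quote):

```python
# Per-environment compute BOTEC from the quoted comment (illustrative numbers).
cost_per_rollout = (0.1, 1.0)  # $ per long rollout, low and high estimate
rollouts = (100, 1_000)        # rollouts per environment, low and high estimate

low = cost_per_rollout[0] * rollouts[0]
high = cost_per_rollout[1] * rollouts[1]
print(f"Compute per environment: ${low:.0f} to ${high:.0f}")  # $10 to $1000
```

Under the "spend within the same order of magnitude as compute" heuristic, that range is what bounds the budget per environment, which is the point the linked Mechanize post argues from the other direction.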

Bohaska's Shortform
jsd · 4mo* · 20

[I work at Epoch AI]

Thanks for your comment, I'm happy you found the logs helpful! I wouldn't call the evaluation broken: the prompt clearly states the desired format, which the model fails to follow. We mention this in our Methodology section and FAQ ("Why do some models underperform the random baseline?"), but I think we're also going to add a clarifying note about it in the tooltip.

While I do think "how well do models respect the formatting instructions in the prompt" is also valuable to know, I agree that I'd want to disentangle that from "how good are models at reasoning about the question". Adding a second, more flexible scorer (likely based on an LLM-judge, like we have for OTIS Mock AIME) is in our queue, we're just pretty strapped on engineering capacity at the moment :)

ETA: since it's particularly extreme in this case, I plan to hide this evaluation until we have the new scorer added.

Ram Potham's Shortform
jsd · 5mo · 50

there's this https://github.com/Jellyfish042/uncheatable_eval

Win/continue/lose scenarios and execute/replace/audit protocols
jsd · 10mo · Ω990

This distinction reminds me of Evading Black-box Classifiers Without Breaking Eggs, in the black-box adversarial examples setting.

Fabien's Shortform
jsd · 1y · 45

Well that was timely

Scaling of AI training runs will slow down after GPT-5
jsd · 1y · 96

Amazon recently bought a 960MW nuclear-powered datacenter. 

I think this doesn't contradict your claim that "The largest seems to consume 150 MW", because the 960 MW datacenter hasn't been built yet (or there is already a datacenter there, but it doesn't consume that much energy for now)?

Posts

19 · A Theory of Unsupervised Translation Motivated by Understanding Animal Communication · 2y · 0
292 · Notes on Teaching in Prison · 2y · 13
2 · jsd's Shortform · 3y · 8
11 · "Acquisition of Chess Knowledge in AlphaZero": probing AZ over time · 4y · 9