
Josh You

data analyst at Epoch AI

@justjoshinyou13 on twitter

Comments
My AGI timeline updates from GPT-5 (and 2025 so far)
Josh You · 12d

> But I've been increasingly starting to wonder if software engineering might not be surprisingly easy to automate when the right data/environments are used at much larger scale

I've had similar thoughts: I think there's still low-hanging fruit in RL, and in scaffolding and further scaling of inference compute. But my general take is that the recent faster trend of time horizons doubling every ~4 months is already the result of picking the low-hanging RL fruit for coding and SWE, plus fast inference scaling. So this kind of thing will probably lead to a continuation of the fast trend, not another acceleration.

Another source of shorter timelines, depending on which timeline you mean, is the uncertainty in translating time horizon into real-world AI research productivity. Maybe models with an 80% time horizon of one month or less are already enough for a huge acceleration of AI R&D, given the right scaffolding/unhobbling/bureaucracy that can take advantage of lots of parallel small experiments or other work, or good complementarities between AI and human labor.
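To make the trend arithmetic concrete, here is a minimal sketch of what a steady ~4-month doubling implies (my own illustration; the starting horizon value is an assumption, not a figure from the post):

```python
# Illustrative only: extrapolate an 80% time horizon under an assumed
# steady doubling every 4 months, from an assumed starting value.
DOUBLING_MONTHS = 4.0
START_HORIZON_HOURS = 2.0   # assumed current 80% time horizon, in hours

def horizon_hours(months_out: float) -> float:
    """Time horizon after `months_out` months, assuming constant doubling."""
    return START_HORIZON_HOURS * 2 ** (months_out / DOUBLING_MONTHS)

for months in (0, 6, 12, 18, 24):
    print(f"{months:>2} months: ~{horizon_hours(months):.0f} hours")
# Reaching a ~1-month (~170 work-hour) horizon would take roughly 26 months
# at this pace; an acceleration pulls that date in, a slowdown pushes it out.
```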

Mo Putera's Shortform
Josh You · 13d

In any case, the paper says the curtailments would last about two hours each:

> The average duration of load curtailment (i.e., the length of time the new load is curtailed during curtailment events) would be relatively short, at 1.7 hours when average annual load curtailment is limited to 0.25%, 2.1 hours at a 0.5% limit, and 2.5 hours at a 1.0% limit
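For scale, a quick back-of-the-envelope calculation (my own arithmetic, using only the figures quoted above) converts those annual curtailment limits into curtailed hours and an implied number of events per year:

```python
# Translate the paper's annual curtailment limits into hours per year and an
# implied event count, given the quoted average event durations.
HOURS_PER_YEAR = 8760

scenarios = {0.0025: 1.7, 0.0050: 2.1, 0.0100: 2.5}  # limit -> avg event hours

for limit, avg_event_hours in scenarios.items():
    curtailed_hours = limit * HOURS_PER_YEAR
    events = curtailed_hours / avg_event_hours
    print(f"{limit:.2%} limit: ~{curtailed_hours:.0f} h/yr curtailed, "
          f"~{events:.0f} events of ~{avg_event_hours} h each")
```

So even the 1% case works out to roughly 35 short events per year.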

Mo Putera's Shortform
Josh You · 13d

Demand response could be done either by covering the data center's load with battery power or by actually pausing compute. Demand response and batteries can also stack: if the grid is really stressed, a data center can both turn off and discharge its battery into the grid.

Economically, it makes sense to accept some true downtime to avoid months-long delays in data center construction. This is clearly true for training workloads, which are very valuable but don't serve live demand. But some downtime is acceptable even for inference clusters: you can reduce compute demand by temporarily slowing down token generation or by applying dynamic rate limits. And any curtailment would almost certainly be isolated to one region, so inference data centers in other places would still be operational.
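As a rough sketch of the "slow down token generation" option (entirely hypothetical; the function names and the simple proportional rule are my own assumptions, not any real scheduler):

```python
# Hypothetical throttle: cap an inference cluster's token rate in proportion
# to how much power the grid operator asks it to shed.
def throttled_token_rate(nominal_tokens_per_s: float,
                         nominal_power_mw: float,
                         requested_shed_mw: float) -> float:
    """Crude proportional rule: shed X% of power -> serve ~X% fewer tokens."""
    shed_fraction = min(max(requested_shed_mw / nominal_power_mw, 0.0), 1.0)
    return nominal_tokens_per_s * (1.0 - shed_fraction)

# e.g. a 300 MW site asked to shed 120 MW serves ~60% of its normal token rate
print(throttled_token_rate(2_000_000, 300.0, 120.0))  # -> 1200000.0
```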

GPT-5: The Reverse DeepSeek Moment
Josh You · 14d

Internal models aren't 6 months ahead in general.

Sometimes internal models are several months ahead on key benchmarks or capabilities. For example, an internal OpenAI model won gold on the IMO, but it may be a while before a public OpenAI model does as well on the IMO or other math competitions. Still, you wouldn't want to use that model for everyday work, and I don't think OpenAI uses it much internally.

Also, Anthropic is probably a few months ahead of OpenAI in coding.

METR's Evaluation of GPT-5
Josh You · 23d

GPT-5's time horizon curve, for reference. It actually looks a bit smoother than the curves for previous models.

[Image: models' logistic histogram]
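For anyone unfamiliar with the metric, here is a rough sketch of how a time horizon is read off a logistic fit like the one in the curve above (the data below is made up, and this simplifies METR's actual methodology):

```python
# Sketch: fit success probability against log task length, then invert the
# logistic curve to find the task length at a given success threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (task length in minutes, success 0/1) -- illustrative fake data
lengths = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960], dtype=float)
success = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

X = np.log2(lengths).reshape(-1, 1)
model = LogisticRegression().fit(X, success)

def horizon(p: float) -> float:
    """Task length (minutes) at which predicted success probability equals p."""
    b0, b1 = model.intercept_[0], model.coef_[0][0]
    return float(2 ** ((np.log(p / (1 - p)) - b0) / b1))

print(f"50% horizon ~ {horizon(0.5):.0f} min, 80% horizon ~ {horizon(0.8):.0f} min")
```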
tdko's Shortform
Josh You · 25d

The router is only on ChatGPT, not the API, I believe. And it switches between two models of the same size and cost (GPT-5 with thinking and GPT-5 without thinking).

AI Task Length Horizons in Offensive Cybersecurity
Josh You · 2mo

o4-mini performing best isn't surprising: it leads or ties o3 and other larger models on most STEM-focused benchmarks. Similarly, o1-mini was better at math benchmarks than o1-preview, and Grok-3 mini generally gets better scores than Grok-3. The general explanation is that RL training and experiments are cheaper and faster on mini models.

Ghiblification for Privacy
Josh You · 3mo

Why not generate them with a more generic art style?

ryan_greenblatt's Shortform
Josh You · 3mo

One possibility I've wondered about is whether AI can automate this learning work: start from a transcript of someone trying to do things with AI, including mistakes and subsequent feedback, and then curate data from that transcript that works well for RL fine-tuning. Or even distill it into examples for in-context learning (which probably works somewhat well, sometimes, today).
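A minimal sketch of what that curation step could look like (entirely hypothetical; the field names and the correction-detection heuristic are my own assumptions, not a description of any existing pipeline):

```python
# Hypothetical curation pass: scan a human-AI transcript for turns where the
# user's follow-up flagged a mistake, and emit records that could seed RL
# fine-tuning data or few-shot in-context examples.
from dataclasses import dataclass

@dataclass
class Turn:
    prompt: str     # what the user asked
    response: str   # what the model produced
    feedback: str   # the user's follow-up, which may contain a correction

def curate(transcript: list[Turn]) -> list[dict]:
    """Keep turns whose feedback looks like it corrects a mistake."""
    correction_markers = ("actually", "no,", "that's wrong", "instead")
    examples = []
    for turn in transcript:
        if any(m in turn.feedback.lower() for m in correction_markers):
            examples.append({"prompt": turn.prompt,
                             "bad_response": turn.response,
                             "correction": turn.feedback})
    return examples
```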

We're Not Advertising Enough (Post 3 of 7 on AI Governance)
Josh You · 3mo

What do you think about the success of AI 2027? It was very widely read, including by the vice president. That's partly because it's a striking narrative and, I presume, because of a decent amount of comms and press work. But it's also backed by a lot of research, which took up the majority of the effort, and I think that research was instrumental to its success.

More generally, good research builds credibility, especially among other experts who also have a lot of credibility and can help amplify your message. Someone like Yoshua Bengio has a lot more influence for AI safety than a large number of dedicated AI safety advocates. And high-quality research could persuade the next Bengio.
