Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
Thanks for adding that context, I think it's helpful!
In some sense, I hope that this post can be useful for current regulators who are thinking about where to put thresholds (in addition to future regulators who are doing so in the context of an international agreement).
I think these thresholds were set by political feasibility rather than safety analysis
Yep, I think this is part of where I hope this post can provide value: we focused largely on the safety analysis part.
Hey, thanks for your comment! My colleague Robi will have a blog post out soon dealing with the question "how does the agreement prevent ASI development in countries that are not part of the agreement?", which it sounds like is basically what you're getting at. I think this is a non-trivial problem, and Latin American drug cartels seem like an interesting case study—good point!
As a clarification, I don't expect this to be true in our agreement: "there will be so many existing useless GPU chips". Our agreement lets existing GPUs stick around as long as they're being sufficiently monitored (Chip Use Verification), and it's fair game for people to do inference on today's models. So I think there will be a strong legal market for GPUs, albeit not as strong as today's because there will be some restrictions on how they can be used.
As a tldr on my thinking here: It's similar to existing WMD non-proliferation problems. Coalition countries will crack down pretty hard on chip smuggling and try to deny cartel access to chips. The know-how for frontier AI development is pretty non-trivial and there aren't all that many relevant people. While some of them might defect to join the cartels, I think there will be a lot of ideological/social/peer pressure against doing this, in addition to various prohibition efforts from the government (e.g., how governments intervene on terrorist recruitment).
Thanks for your comment! Conveniently, we wrote this post about why we picked the training compute thresholds we did in the agreement (1e24 is the max). I expect you will find it interesting, as it responds to some of what you're saying! The difference between 1e24 and 5e26 largely explains our difference in conclusions about what a reasonable unmonitored cluster size should be, I think.
You're asking the right question here, and it's one we discuss in the report a little (e.g., p. 28, 54). One small note on the math is that I think it's probably better to use FP8 (so 2e15 theoretical FLOP/s per H100, due to the emergence of FP8 training).
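For concreteness, here's the kind of back-of-the-envelope calculation I have in mind (the one-year window and the 40% utilization figure below are illustrative assumptions on my part, not numbers from the report):

```python
# Rough back-of-the-envelope: how many H100s (at FP8) does it take to reach a
# 1e24 FLOP training run within a year? The time window and utilization are my
# own illustrative assumptions, not figures from the report.

THRESHOLD_FLOP = 1e24          # training compute threshold discussed above
H100_FP8_PEAK = 2e15           # theoretical FP8 FLOP/s per H100
UTILIZATION = 0.4              # assumed fraction of theoretical peak achieved
SECONDS_PER_YEAR = 365 * 24 * 3600

flop_per_gpu_year = H100_FP8_PEAK * UTILIZATION * SECONDS_PER_YEAR
gpus_needed = THRESHOLD_FLOP / flop_per_gpu_year
print(f"H100s needed to hit 1e24 FLOP in one year: {gpus_needed:.0f}")  # ~40
```

Under those assumptions you land around 40 H100s running for a year; the same arithmetic with a 5e26 threshold gives a cluster roughly 500x larger, which is most of why the threshold choice drives the unmonitored-cluster-size question.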
Hey Tsvi, thanks for the ideas! The short answer is that we don't have good answers about what the details of Article VIII should be. It's reasonably likely that I will work on this as my next big project and that I'll spend a couple of months on it. If so, I'll keep these ideas in mind—they seem like a reasonable first pass.
This has some disagree votes. What's wrong with this idea?
Relatedly (probably the wrong thread, but somewhat relevant), I don't feel great about my donations to a nonprofit being used to fund their "hotel/event venue business" (as I would call it). Is it that Lighthaven wants to offer some services to some groups/events at a discount, and donations are subsidizing this?
If so, Lightcone should probably make the case that they are cost-competitive with doing this at other event venues (e.g., donations being used to rent a Marriott or the Palace of Fine Arts). There are clearly aesthetic differences, and maybe Lighthaven is literally the best event venue in the world by my preferences. But this is a nontrivial argument. (I didn't see it made in a quick skim of the 2024 fundraising post.)
they have indeed seen the skulls
I think this is true, but I also don't think seeing the skulls implies actually dealing with them (and I wish Scott's post were crossposted here so I could argue with it). Like, a critique of AI evaluations that people could have been making for the last 5+ years (probably even 50) and which remains true today is: "Evaluations do a poor job of measuring progress toward AGI because they lack external validity. They test scenarios that are much narrower, better defined, more contrived, and easier to evaluate than the skills an AI would need to robustly demonstrate for us to call it AGI." I agree that METR is well aware of this critique, but the critique is still very much true of HCAST, RE-Bench, and SWAA. Folks at METR seem especially forthcoming about discussing the limitations of their work in this regard, and yet the critique is still true. (I don't think I'm disagreeing with you at all.)
We care about the performance prediction at a given point in time for skills like "take over the world", "invent new science", and "do RSI" (and "automate AI R&D", which I think the benchmark does speak to). We would like to know when those skills will be developed.
In the frame of this benchmark, and Thomas and Vincent's follow-up work, it seems like we're facing down at least three problems:
So my overall take is that I think the current work I'm aware of tells us
That you think are sufficient to meaningfully change how much validity people should assign to the overall predicted trend?
I'm not Adam, but my response is "No", based on the description Megan copied in thread and skimming some of the paper. It's good that the paper includes those experiments, but they don't really speak to the concerns Adam is discussing. Those concerns, as I see them (I could be misunderstanding):
Do the experiments in Sec 6 deal with this?
We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world context. We labeled RE-Bench and HCAST tasks on the presence or absence of these 16 messiness factors, then summed these to obtain a "messiness score" ranging from 0 to 16. Factor definitions can be found in Appendix D.4. The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like 'write a good research paper' would score between 9/16 and 15/16, depending on the specifics of the task.
On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the task's length alone (b = -0.081, R² = 0.251) ... However, trends in AI agent performance over time are similar for lower and higher messiness subsets of our tasks.
This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal. But they just don't have very messy tasks. (See the rough sketch after this list for how I'm reading their messiness analysis.)
c. SWE-Bench Verified: doesn't speak to 1 or 2.
d. Internal PR experiments: maybe speak a little to 1 and 2, because these are more real-world, closer-to-the-thing-we-care-about tasks, but not much, as they're still clearly verifiable and still software engineering.
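For reference, here's a rough sketch of how I'm reading the messiness analysis excerpted above (the column names, the synthetic data, and the exact regression setup are my own illustrative guesses, not METR's actual code):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative sketch only: made-up data and column names, not METR's pipeline.
rng = np.random.default_rng(0)
n_tasks = 200

# Each task is labeled with 16 binary messiness factors (novel situation,
# real-time coordination, etc.); the messiness score is just their sum (0-16).
factors = rng.integers(0, 2, size=(n_tasks, 16))
df = pd.DataFrame({
    "log2_task_minutes": rng.uniform(0, 8, n_tasks),  # stand-in for human time-to-complete
    "messiness": factors.sum(axis=1),
    "agent_success": rng.integers(0, 2, n_tasks),     # placeholder 0/1 outcome
})

# Regress success on task length plus messiness. A negative messiness coefficient
# is the "agents do worse on messier tasks than length alone predicts" finding
# (the excerpt above reports b = -0.081, R^2 = 0.251 on HCAST).
model = smf.ols("agent_success ~ log2_task_minutes + messiness", data=df).fit()
print(model.params)
```

The point of the sketch is just that "messiness" is a sum of binary labels, so how much this analysis can tell us about Benchmark Bias depends entirely on how messy the labeled tasks actually get, and per the excerpt, none of them score above 8/16.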
I do think Thomas and Vincent's follow-up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.
Thanks for the comment. I think this is definitely one of the places that would receive a lot of negotiation, and also one where we don't have particular expertise. Given my lack of expertise, I don't have much confidence in the particular withdrawal terms.
One of the frames that I think is really important here is that we are imagining this agreement is implemented in a situation where (at least some) world leaders are quite concerned with ASI risk. As such, countries in the agreement do a bunch of non-proliferation-like activities to prevent non-parties from getting AI infrastructure. So the calculus looks like "join the agreement and get access to AI chips to run existing AI models" vs. "don't join the agreement and either don't get access to AI chips or be at risk of coalition parties disrupting your AI activities". That is, I don't expect 'refusing to sign' or withdrawing to be particularly attractive options, given the incentives at play. (And this is more a function of the overall situation and risk awareness among world leaders than of our particular agreement.)