Considerations for setting the FLOP thresholds in our example international AI agreement

by Aaron_Scher, peterbarnett
18th Nov 2025
8 min read

We at the Machine Intelligence Research Institute’s Technical Governance Team have proposed an illustrative international agreement (blog post) to halt the development of superintelligence until it can be done safely. For those who haven’t read it already, we recommend familiarizing yourself with the agreement before reading this post.  

TLDR: This post explains our reasoning for the FLOP thresholds in our proposed international AI agreement: we prohibit training runs above 10^24 FLOP and require monitoring for runs between 10^22 and 10^24 FLOP. Given fundamental uncertainty about how many FLOP are needed to reach dangerous AI capabilities, we advocate for conservative thresholds. Other considerations include algorithmic progress between now and when the agreement is implemented, and the strong capabilities of current AI models.

 

This post explains our reasoning behind the training compute thresholds we chose. We refer to these as “FLOP thresholds” (FLOP = floating point operations) to avoid any ambiguity with the chips themselves, which are sometimes referred to as “compute”. Many of these considerations are relevant to others thinking about FLOP thresholds, including the hypothetical negotiators/regulators who would modify the thresholds in this agreement if it is implemented in the future.

Readers can see Article IV for the full text, but the abbreviated version is as follows (a toy sketch of the resulting tiers appears after the list): 

  • Training runs that exceed 10^24 FLOP (aka 1e24) are prohibited. We call this the Strict Threshold. Post-training above 10^23 FLOP is prohibited (Strict Post-training Threshold). 
  • Training runs between 10^22 FLOP and 10^24 FLOP must be declared to the governance body, approved in advance, and are subject to monitoring and pausing by the governance body. We call 10^22 FLOP the Monitored Threshold. 
  • These thresholds can be modified as needed. 
  • The agreement invites the creation of specific carveouts for activities such as safety evaluations, self-driving vehicles, and medical technology. 
  • People are allowed to continue using models created before the agreement, including models trained with more than 10^24 FLOP. This includes post-training these models up to the Strict Post-training Threshold.
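
To make the tiered structure concrete, here is that toy sketch of how a proposed training run would fall under these thresholds. This is our own illustration of the tiers described above, not text or code from the agreement, and the function names are invented for the example.

```python
# Illustrative only: classify a proposed training run under the example thresholds.

MONITORED_THRESHOLD_FLOP = 1e22       # Monitored Threshold
STRICT_THRESHOLD_FLOP = 1e24          # Strict Threshold
STRICT_POST_TRAINING_FLOP = 1e23      # Strict Post-training Threshold

def classify_training_run(training_flop: float) -> str:
    """Return the tier a proposed training run falls into."""
    if training_flop > STRICT_THRESHOLD_FLOP:
        return "prohibited"
    if training_flop >= MONITORED_THRESHOLD_FLOP:
        return "monitored: must be declared, approved in advance, subject to pausing"
    return "below the Monitored Threshold"

def post_training_allowed(post_training_flop: float) -> bool:
    """Post-training (including of pre-agreement models) is capped separately."""
    return post_training_flop <= STRICT_POST_TRAINING_FLOP

print(classify_training_run(3e23))    # -> monitored: must be declared, ...
print(post_training_allowed(5e22))    # -> True
```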

Our proposed thresholds are set far lower than today’s frontier models (which we believe aren’t yet dangerous in the relevant sense). For some context, the Strict Threshold is a little lower than the training compute for some near-frontier models such as DeepSeek-R1 (~4 × 10^24 FLOP) and gpt-oss-120B (~5 × 10^24 FLOP), and a training run at the Strict Threshold would cost around a million dollars.
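
As a rough sanity check on that cost figure, here is a back-of-the-envelope calculation. The hardware and pricing numbers (H100-class peak throughput, utilization, and rental price) are illustrative assumptions on our part, not figures from the agreement.

```python
# Back-of-the-envelope cost of a 1e24 FLOP (Strict Threshold) training run.
# All hardware and pricing numbers below are illustrative assumptions.

STRICT_THRESHOLD_FLOP = 1e24

PEAK_FLOP_PER_SEC = 1e15       # ~H100 bf16 dense peak (~989 TFLOP/s), rounded
UTILIZATION = 0.4              # assumed average hardware utilization
DOLLARS_PER_GPU_HOUR = 2.0     # assumed cloud rental price

effective_flop_per_sec = PEAK_FLOP_PER_SEC * UTILIZATION
gpu_hours = STRICT_THRESHOLD_FLOP / effective_flop_per_sec / 3600
cost = gpu_hours * DOLLARS_PER_GPU_HOUR

print(f"~{gpu_hours:,.0f} GPU-hours, ~${cost:,.0f}")
# -> ~694,444 GPU-hours, ~$1,388,889 — on the order of a million dollars
```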

Overall, we aren’t tied to these specific thresholds, and the agreement is clear that these thresholds can be updated to adapt to the situation. Nevertheless, these thresholds are our current best guess at how they should be initially set. 

Why FLOP thresholds at all?

Ideally, we would place thresholds directly on AI capabilities. Unfortunately, we can only know the capabilities of an AI after it is trained, and by then it might be too late. If a dangerous AI has already been created, it may be impossible to contain. Instead, FLOP thresholds provide a measure that can be defined and enforced ahead of time, before a dangerous AI is created.

There are a variety of reasons why FLOP thresholds have been proposed as a key method in AI governance, here adapted from Erben et al. (2025): 

  1. Roughly capability-tracking: Training compute correlates reasonably well with model capabilities, especially when algorithmic progress is restricted (as it would be under our agreement's research controls; to some extent, data quality improvements are also limited under our agreement).
  2. Measurable before deployment: Training compute is measurable before a model is deployed, and can be accurately estimated before training even begins (see the estimation sketch after this list). This makes it harder for people to accidentally cross the threshold, as they might with capability thresholds. 
  3. Externally verifiable: AI chips can be monitored, allowing for an AI’s total training FLOP to be externally verified (see Baker et al. 2025 Appendix A.6).
  4. Cost-tracking: Training FLOP directly correlates with the resources needed to train models. 
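
On the second point, pre-training compute for dense transformers is commonly estimated up front with the standard C ≈ 6ND approximation (roughly 6 FLOP per parameter per training token, covering the forward and backward passes). The model size and token count below are made-up examples, not numbers from the agreement.

```python
# Standard pre-training FLOP estimate for dense transformers: C ≈ 6 * N * D,
# i.e., roughly 6 FLOP per parameter per training token (forward + backward).

def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Example: a 70B-parameter model trained on 15T tokens
flop = estimate_training_flop(70e9, 15e12)
print(f"{flop:.1e} FLOP")   # -> 6.3e+24 FLOP, above the 1e24 Strict Threshold
```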

FLOP thresholds also have various limitations:

  1. Only “roughly” capability-tracking: There is an uncertain relationship between FLOP and AI risks, and the “capability-tracking” aspect is far from perfect. For instance, a model trained with less FLOP but better data and better algorithms is often more capable than a higher-FLOP model trained on worse data and worse algorithms. See e.g., Hooker (2024), Ho et al. (2024), Heim & Koessler (2024). 
  2. Non-pre-training capability gains: While the relation between FLOP and AI capabilities in pretraining is somewhat well studied, there are many uncertainties in how FLOP used on post-training or at inference time affects model performance. See Hooker (2024), Davidson et al. (2023). 
  3. Methodological challenges: There are still some technical ambiguities in FLOP accounting; more work is needed to precisely define FLOP thresholds and operationalize FLOP counting. See Hooker (2024), Casper et al. (2025), Pistillo & Villalobos (2025).

Primary considerations for where the thresholds should be

There are a few guiding considerations when deciding what the FLOP limits should be in an agreement like this one. 

 

First, the main goal of the threshold is to let the world have the most advanced AIs it can without those AIs substantially increasing catastrophic risk. Setting the threshold too low would leave a huge amount of value on the table. But setting the threshold too high would be very dangerous. How high is too high? Unfortunately, nobody knows.

 

A second guiding consideration is that nobody knows how many FLOP you need to train an ASI, or an automated AI researcher, or any other dangerous capability of note. 

To be clear, the problem isn’t just about FLOP—AI forecasting is really difficult! One of the main AI forecasting methods, the “scaling laws” documented by Kaplan et al. (2020), shows the relationship between pretraining FLOP and cross-entropy loss on a validation set. Traditionally, this method is used by companies to predict how capable a big model might be, based only on smaller training experiments. Applying this method to our use case is difficult. Cross-entropy loss is not meaningful on its own; you would need to tie it to real-world capabilities. For instance, one might use the concept of “indistinguishability” as in Barnett & Besiroglu (2023). Alternatively, one might demonstrate that cross-entropy loss correlates with certain benchmark performance (or that FLOP correlates closely with benchmark performance as in Pimpale et al. (2025)). In this latter case it would be necessary to ensure that the benchmark matches real-world use cases and capability (a notoriously difficult task). So you face a major challenge in making sure the measure you have properly relates to the capability you care about. 
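
To make the scaling-law method concrete, here is a minimal sketch of the fit-and-extrapolate step. The run data and fitted coefficients are entirely made up; the point is that the output is a predicted validation loss, which, as discussed above, still has to be tied to real-world capabilities somehow.

```python
import numpy as np

# Kaplan-style compute scaling: validation loss roughly follows a power law
# in training compute, loss ≈ a * C**(-alpha). Fit on small runs, then
# extrapolate to a larger budget. The data below is entirely made up.

compute = np.array([1e19, 1e20, 1e21, 1e22])   # FLOP of small training runs
loss    = np.array([3.10, 2.75, 2.46, 2.22])   # measured validation losses

# Fit log(loss) = log(a) - alpha * log(C)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha, a = -slope, np.exp(intercept)

def predicted_loss(c_flop: float) -> float:
    return a * c_flop ** (-alpha)

print(predicted_loss(1e24))   # extrapolated loss at the Strict Threshold
```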

There’s another big problem, which is that these scaling laws assume fixed model algorithms and fixed data quality. If the halt were ongoing and we saw that progress on algorithms and data quality had effectively been stymied, then maybe we could make some solid predictions. But that’s not the case right now. From the vantage point of today, where algorithms and data are improving rapidly, using the empirical scaling laws just doesn’t work to forecast precisely what will happen in the field overall. You can make general statements like “AI models will get more capable as the FLOP used to train them increases, and we observe data center investments consistent with massively increasing the FLOP used to train models in the next couple years”. But that’s a very different claim and much less specific about AI capabilities than what you would need here. 

If the world adopted the thresholds in our example agreement, we would be taking a gamble that with pre-agreement algorithms and pre-agreement data quality, 10^24 FLOP (the Strict Threshold) is not enough for somebody to train ASI. This is obviously a topic that should receive much more deliberation during treaty negotiations!

The agreement includes monitoring of training runs above 10^22 FLOP (the Monitored Threshold). This monitoring would likely include a “Whitelist” approach to ensure that newly produced algorithmic innovations are not being used in this training. That is, monitored training runs could make use of AI algorithms that existed before the agreement came into effect, but any algorithms developed later would be a violation of the research restrictions and would thus be flagged by the monitors. So we would also be gambling that any illicit algorithm or data research that happens after research restrictions are in place would be insufficient to allow ASI to be developed with 10^22 FLOP (the Monitored Threshold). Note that unmonitored chips would still be a problem, especially the possibility of large covert clusters that don’t comply with FLOP thresholds.

 

A third consideration is that current frontier models seem like they might be close to some of the dangerous capabilities we care about. There are many difficulties in evaluating existing models, such as the poor external validity of evaluations, poor elicitation, failure to evaluate broader interactions with AIs, and more—see Barnett & Thiergart (2024) and Mukobi (2024). So it’s hard to know how “capable”—in a deep sense—models are. But AI companies have recently begun saying that they can’t rule out substantial novice uplift on biological and chemical weapons development, and there is anecdotal evidence that AIs are beginning to speed up AI development significantly inside AI companies. We definitely don’t know for sure, but it seems plausible that AIs could autonomously carry out frontier AI research in just a couple of years; OpenAI recently stated that they are planning on developing a “true automated AI researcher by March of 2028”. AI research is a particularly notable capability because if this capability was highly proliferated, verifying research restrictions would be more difficult. 

 

A fourth consideration is that we think the world should be conservative in its management of this risk, given the huge uncertainty. The fact that nobody knows how many FLOP are needed to train ASI should not mean “rush ahead”. If you’re driving through fog and you know there’s a cliff ahead but you have almost no idea how far it is, the correct decision is to slam on the brakes. Therefore, we suggest a somewhat conservative threshold—far lower than today’s frontier models (which, again, aren’t yet dangerous in the relevant sense). 

Secondary considerations for where the thresholds should be

The agreement will not be implemented tomorrow. By the time it is implemented, the “compute needed for a given level of capability” will be much lower than it is today. According to Ho et al. (2024), the compute needed for a given level of capability decreases by about 3x per year. And looking at more recent data, the trend appears to be much faster (forthcoming). So 10^24 FLOP models of tomorrow will be more capable, maybe much more capable, than the 10^24 FLOP models that exist today. The agreement will necessarily take time to be passed, and we are roughly imagining that it goes into effect 1-2 years from now. 
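
To illustrate what a ~3x-per-year efficiency trend implies, here is the simple calculation in effective-compute terms. The 3x figure is Ho et al.'s estimate, and the one-to-two-year timeline is our rough assumption from above.

```python
# Effective compute under ~3x/year algorithmic efficiency gains (Ho et al. 2024).
# A physical budget of 1e24 FLOP spent N years from now buys roughly the
# capability that 1e24 * 3**N FLOP buys with today's algorithms.

PHYSICAL_FLOP = 1e24
EFFICIENCY_GAIN_PER_YEAR = 3.0

for years in (1, 2):
    effective = PHYSICAL_FLOP * EFFICIENCY_GAIN_PER_YEAR ** years
    print(f"after {years} year(s): ~{effective:.0e} FLOP-equivalent today")
# -> ~3e+24 after one year, ~9e+24 after two
```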

Another consideration is that it would be desirable to pick thresholds for which monitoring is practical. This consideration generally points toward having higher thresholds—100,000 chips and a 10^26 FLOP training run are much more noticeable than 1,000 chips and a 10^24 FLOP training run. The thresholds we suggest are somewhat monitorable with respect to the monitored cluster size of 16 H100s (which unfortunately probably isn’t reliably monitorable), but overall we’ve chosen not to prioritize this consideration very much. Reaching the Strict Threshold of 10^24 FLOP with 16 H100 GPUs would take 730 days—this would be unprecedentedly long and might be detected. 
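
For reference, the 730-day figure follows from a simple calculation, shown below assuming ~1e15 FLOP/s of peak bf16 throughput per H100 and (generously, from the evader's perspective) full utilization.

```python
# Time for a 16-GPU cluster to accumulate the Strict Threshold of 1e24 FLOP.
# Assumes ~1e15 FLOP/s peak per H100 and, generously, 100% utilization.

STRICT_THRESHOLD_FLOP = 1e24
N_GPUS = 16
PEAK_FLOP_PER_SEC = 1e15

seconds = STRICT_THRESHOLD_FLOP / (N_GPUS * PEAK_FLOP_PER_SEC)
print(f"~{seconds / 86400:.0f} days")   # -> ~723 days, i.e., roughly two years
```

At realistic utilization the run would take even longer.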

We don't want the thresholds to be so low that they interfere with non-AI computing activities, or so low that people violate them by mistake. The Monitored Threshold of 10^22 FLOP is set high enough that hobbyists and ordinary users won't accidentally violate it through normal computing activities.

Another consideration is whether the thresholds should be, as a terminal goal, aimed at preventing ASI or aimed at preventing AI capabilities from advancing past the current frontier. One potential orientation towards FLOP thresholds is to set them so low that no one ever trains a model more capable than models trained before the agreement. For example, this orientation might be held by somebody who is concerned with job loss, where even marginal improvements to current AI capabilities could have a big effect. Our primary focus is ensuring that no one ever trains ASI or an otherwise unacceptably dangerous AI. This unacceptable capability threshold will (hopefully) be higher than the best previously-existing AI, but we unfortunately aren’t sure how much higher. Because of this uncertainty, we recommend keeping AI capabilities below today’s frontier as a precautionary measure.