AI safety field-building is a large space with many possible programs. The Center for AI Safety (CAIS) has commissioned cost-effectiveness models to support decisions about how to prioritize between these programs. Dan provided high-level guidance early in the process, as well as some sanity-checking later on.
This analysis is a starting point for discussion, not a final verdict. Numbers should not be taken out of context and are not necessarily fully endorsed by CAIS. Rather, we believe that these models:
These models have so far assisted our evaluations of field-building programs for students and professionals, detailed here and here respectively.
This post:
We expect that many readers of this post will already be on board with the practice of using explicit models to guide decision-making — historically, cost-effectiveness analyses have been core to effective altruism. However, we expect that few readers do this in practice, in part because evaluating the cost-effectiveness of AI safety programs may be more challenging than evaluating global development or animal welfare programs. This section justifies our belief that using explicit models for AI safety is worthwhile in practice.
We see two core benefits:
When making strategic decisions, we are often interested in questions such as:
The first benefit — enhanced strategic decision-making — is realized if models help us to reason through questions like these.
The second benefit of using explicit models — improved communication between organizations — is realized if models help illuminate disagreements' cruxes and their significance. See the analysis of preferences for current versus future research later in this post for an example of this benefit in action.
Of course, explicit models are not without limitations. Models might fail to answer the right questions, contain conceptual or implementation errors, or have their outputs taken too literally. Most importantly, explicit models are inherently reductionist, ignoring factors that might be crucial in practice, including (but not limited to) interactions between programs, threshold effects, diffuse effects, and learning benefits. That said, we believe that explicit models can be quite useful to the AI safety field if we keep these limitations in mind.
Our key metric is the Quality-Adjusted Research Year (QARY). We define a QARY as:
In order to operationalize the QARY, we need some way of defining weights for different researcher types, researcher abilities, and the relevance of different research avenues.
We define the ‘scientist-equivalence’ of a researcher type as the rate at which we would trade off an hour of labor from this researcher type with an hour of otherwise-similar labor from a research scientist.
Similarly, the ‘ability’ level of a researcher is the rate at which we would trade off an hour of labor from a researcher of this ability level with an hour of otherwise-similar labor from a researcher of ability level 1.
Finally, the ‘relevance’ of a research avenue is the rate at which we would trade off an hour of labor from a researcher pursuing this avenue with an hour of otherwise-similar labor from a researcher pursuing adversarial robustness research.
The benefit of the program is given by the difference between expected QARYs with and without the program. (QARYs with or without the program could be further decomposed as QARYs per scientist-equivalent participant of some participant type, multiplied by the number of scientist-equivalents of that participant type, summed across participant types.)
Cost-effectiveness is calculated by dividing this benefit by the expected cost in millions of US dollars.
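The definitions above can be sketched in a few lines of code. This is a minimal illustration of the arithmetic, not the repository's actual implementation; all weights and the hypothetical program below are invented for the example, not CAIS parameter values.

```python
# Illustrative sketch of the QARY arithmetic. All weights below
# (scientist-equivalence, ability, relevance) are invented placeholders.

def qarys(years, scientist_equivalence, ability, relevance):
    """Quality-adjusted research years produced by one participant."""
    return years * scientist_equivalence * ability * relevance

# Hypothetical program: 10 PhD students (scientist-equivalence 0.5),
# ability level 2, each doing 3 years of research. With the program they
# pursue an avenue with relevance 0.5; without it, relevance 0.1.
with_program = 10 * qarys(3, 0.5, 2.0, 0.5)     # 15 QARYs
without_program = 10 * qarys(3, 0.5, 2.0, 0.1)  # 3 QARYs

benefit = with_program - without_program  # counterfactual expected QARYs
cost_millions = 0.35                      # hypothetical cost in $M
cost_effectiveness = benefit / cost_millions  # QARYs per $1M
```

Note that because each weight is a rate of substitution against a fixed reference (a research scientist, ability level 1, adversarial robustness research), the weights simply multiply.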
Example calculations using these definitions can be found in our posts evaluating field-building programs for students and professionals respectively.
QARYs offer three major advantages:
These advantages are all relative to the alternative of estimating terminal outcomes directly. Estimating program impact in x-risk basis points would certainly aid comparability, but at the cost of requiring strong assumptions about how intermediate outcomes (something like QARYs) translate into better terminal outcomes. Relatedly, the components of the QARY framework — how many researchers we might affect, which research avenues they will work on, and so on — are easier to inform empirically. These components are also easier to affect, enabling shorter feedback loops.
Still, QARYs have important limitations, the most important of which is that they focus narrowly on researchers and on programs that produce research.
In the QARY framework, any labor from politicians, lobbyists, or military leaders is implicitly not valued.
Even for research-centric programs, the QARY framework might not be ideal. For example, when evaluating research labs, one might be more interested in metrics that engage more directly with research outputs, such as high-impact citations.
Still, these limitations can be partially addressed. We could model the influence of QARYs on more downstream outcomes, such as high-impact citations or even x-risk basis points, then compare different categories of programs by downstream impact. Or we could at least avoid inconsistent priorities by specifying a fixed exchange rate between QARYs and other goods.
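The fixed-exchange-rate idea can be made concrete with a short sketch. The exchange rate below is an invented placeholder, not a figure endorsed in this post or implemented in the repository, and `common_units` is a hypothetical helper.

```python
# Sketch of comparing research-centric and non-research-centric programs
# via a fixed exchange rate. The rate is an invented placeholder.

QARYS_PER_HIGH_IMPACT_CITATION = 0.05  # hypothetical exchange rate

def common_units(qarys=0.0, high_impact_citations=0.0):
    """Express a program's benefit in QARY-equivalents."""
    return qarys + high_impact_citations * QARYS_PER_HIGH_IMPACT_CITATION

field_building_program = common_units(qarys=26)
research_lab = common_units(high_impact_citations=400)  # 20 QARY-equivalents
```

Even if the chosen rate is rough, committing to one in advance at least avoids implicitly valuing the same good at different rates across decisions.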
The full code for this project is in this repository. The `examples` folder includes documentation demonstrating the repository’s use.
To get a detailed ‘feel’ for the models, read our cost-effectiveness evaluations of programs for students and professionals.
If you would like help navigating the repository — or with your own evaluations — please email contact@safe.ai.
The table below gives a sneak peek of our posts evaluating the cost-effectiveness of programs for students and professionals. The table combines results from the programs considered in these posts, hypothetical implementations of similar programs, and ‘baseline’ programs. Baseline programs involve directly funding a talented research scientist or PhD student working on anomaly detection research for 1 year or 5 years respectively.
Program | Cost (USD) | Benefit (counterfactual expected QARYs) | Cost-effectiveness (QARYs per $1M) |
---|---|---|---|
Atlas | 9,000,000 | 43 | 4.7 |
MLSS | 330,000 | 6.4 | 19 |
Student Group | 350,000 | 50 | 140 |
Undergraduate Stipends | 50,000 | 17 | 340 |
Trojan Detection Challenge | 65,000 | 26 | 390 |
NeurIPS ML Safety Social | 5,200 | 150 | 29,000 |
NeurIPS ML Safety Workshop | 110,000 | 360 | 3,300 |
Hypothetical: Power Aversion Prize | 50,000 | 490 | 9,900 |
Hypothetical: Cheaper Workshop | 35,000 | 250 | 7,000 |
The “power aversion prize” is a hypothetical program mirroring the Trojan Detection Challenge, except with $15k less award funding and a more relevant topic. The “cheaper workshop” is identical to the NeurIPS ML Safety Workshop, except with $75k less award funding.
There are many important caveats to this table, detailed in the posts. These caveats include:
The evaluation posts each consider programs that affect research at a similar point in time: student programs, which aim to influence future research, or professional programs, which aim to influence research today. A possible crux for whether targeting students tends to be more cost-effective than targeting professionals (or vice versa) is one’s preference for current versus future research.
In the QARYs framework, this preference is described by the research discount rate, defined as the degree to which we value a research year beginning next year less than a research year beginning today.
In this model, we consider research one year from now to be 20% less valuable than research today. The justification for this figure begins with the observation that, in ML, research subfields often begin their growth in an exponential fashion. This means that research topics are often dramatically more neglected in earlier stages (i.e. good research is much more counterfactually impactful), and that those who are early can have an outsized impact in influencing the direction of the field — imagine a field of 3 researchers vs. one of 300 researchers. If, for instance, mechanistic interpretability arose as a research agenda one year earlier than it did, it seems reasonable to imagine that the field would have 20% more researchers than it currently does. In fact, we think that these forces are powerful enough to make a discount rate of 30% seem plausible. (Shorter timelines would also be a force in this direction.)
This view does not reflect a consensus. Others might argue that the most impactful safety work requires access to more advanced models and conceptual frameworks, which will only be available in the future[1].
To illustrate the difference this parameter could make, consider someone who believes that research one year from now is 20% more valuable than research today. This person would be indifferent between 1 research year today and 0.23 research years in 8 years' time — approximately the point at which rising juniors would finish a PhD — whereas this model would be indifferent between 1 research year today and 6 research years in 8 years' time.
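The indifference points above follow from compounding the annual discount rate over 8 years. A quick check (the function name is ours, not the repository's):

```python
# With annual research discount rate r, a research year starting in t
# years is worth (1 - r)**t research years today; so 1 research year
# today is equivalent to 1 / (1 - r)**t research years starting in t years.

def equivalent_future_years(r, t):
    """Future research years (starting in t years) equal in value
    to 1 research year starting today."""
    return 1 / (1 - r) ** t

t = 8  # roughly when rising juniors would finish a PhD

# Someone who values future research 20% MORE per year (r = -0.2):
print(round(equivalent_future_years(-0.2, t), 2))  # 0.23

# This model's default 20% discount rate (r = 0.2):
print(round(equivalent_future_years(0.2, t), 2))   # 5.96, i.e. ~6
```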
The plot below compares the cost-effectiveness of student programs with that of professional programs for different values of the research discount rate. 0.2 is our default; negative values represent a preference for research conducted in the future. (The cost-effectiveness ordering of the identically-colored professional programs never changes.)
Notice that:
Thank you to:
The extent to which current research will apply to more advanced models is a useful topic of discussion. Given that it seems increasingly likely that AGI will be built using deep learning systems, and in particular LLMs, we believe that studying existing systems can provide useful microcosms for AI safety. For instance, LLMs already exhibit forms of deception and power-seeking. Moreover, it seems likely that current work on AI honesty, transparency, proxy gaming, evaluating dangerous capabilities, and so on will apply to a significant extent to future systems based on LLMs. Finally, note that research on benchmarks and evals is robust to changes in the architecture, or even the paradigm, of future AI systems. As such, work on building benchmarks and evals is even more likely to apply to future AI systems.
Of course, it is true that more advanced models and conceptual frameworks will increase the relevance of AI safety research. For instance, we anticipate that once the LLM-agent paradigm becomes established, research into AI power-seeking and deception will become even more relevant. Nonetheless, we believe that, all things considered, AI safety research is currently tractable enough, and its subfields are growing exponentially enough, that a 20% or even 30% discount rate is justified.