There should be $100M grants to automate AI safety

Marius Hobbhahn

This post reflects my personal opinion and not necessarily that of other members of Apollo Research.

TLDR: I think funders should heavily incentivize AI safety work that enables spending $100M+ in compute or API budgets on automated AI labor that directly and differentially translates to safety.

Motivation

I think we are in a short timeline world (and we should take the possibility seriously even if we don't have full confidence yet). This means that I think funders should aim to allocate large amounts of money (e.g. $1-50B per year across the ecosystem) on AI safety in the next 2-3 years.

I think that the AI safety funders have been allocating way too little funding and their spending has been far too conservative in the past 5 years. So, in my opinion, we should definitely continue ramping up “normal” spending, e.g. pay more competitive salaries, allow AI safety organizations to grow faster, and other things in that vein.

However, these “normal” spending patterns are not sufficient under short timeline assumptions and the obvious way to spend more money quickly is to aggressively encourage finding ways to use automated labor for AI safety.

What is an “automated AI safety scaling grant”?

An “automated AI safety scaling grant” aims to aggressively encourage attempts to use automated labor for AI safety at scale. The explicit intention is that if somebody manages to find ways to build scalable safety pipelines, they can be confident that they can run these pipelines at scale with short turnaround time from grantmakers.

An example of such a grant would be:

Step 1: Show scalability
- An organization submits a sketch of a pipeline that they think could scale significantly. In the best case, they have some empirical evidence for this claim already.
- Both parties agree on the “scaling condition”, i.e. what empirical evidence would be sufficient to spend more money on that pipeline
- By default, the scaling condition could be a plot that has “money spent on the pipeline” on the x-axis and “some reasonably proxy of safety” on the y-axis (as long as both parties are convinced that the proxy is not goodharted).
- - An example of a proxy could be “number of distinct, egregiously misaligned ‘features’ found by interpretability (and a human expert would judge them as egregious, diverse and high quality)”.

For example, such a grant could be $5M for 1 year where $2M is for salaries and overhead and $3M is for compute (e.g. fine-tuning, API costs and GPU rents). These stats can vary depending on team size and prior confidence in the scaling hypothesis.
Step 2: double up
- When the funding condition has been met, the funder is willing to quickly ramp up additional funding to scale up the experiments.
- For example, let’s say the grantee was able to produce the scaling plot and the evidence is convincing to both parties. Then the funder makes another grant for $14M where $10M is for scaling the pipeline and $4M is for scaling the team and rest of the organization.
- They then both agree on the next milestone. For example, the milestone could be to extend the scaling plot by four additional datapoints that show no or little diminishing returns to scale.
Step 3: big scale
- When the double up was successful, the funder is willing to quickly and significantly increase the funding by another OOM.
- For example, let’s say the grantee was able to show that the scaling plot can be extended and the increases still meaningfully translate into safety (rather than e.g. goodharting a metric that ceases to be useful). Then the funder makes another grant for $108M where $100M is used for further scaling the pipeline and $8M is used for other overhead.
- Note: since the explicit intention of these grants is to scale quickly it is possible that the time between Step 2 and 3 can be as little as a few months.
Step 4: real-world integration
- Assuming an organization can show that they can convincingly spend $100M on something that meaningfully translates into AGI safety at scale, there are multiple further possible actions.
- - Given that it was financed by a grant, it needs to be for the public benefit. Thus, under all circumstances, this information has to be published with relevant details. This can e.g. include open sourcing the entire pipeline or writing a paper with the core details that would enable others to replicate the pipeline.
  - Collaborate with labs to adopt these techniques into their alignment pipelines and help them implement it into practice.
  - The team joins one AGI lab to implement the pipeline at that lab.
  - If this pipeline could be profitable, explicitly try to turn the organization into an AGI safety for-profit. If it is already a for-profit, build out products around the idea that lead to drastically increased safety.
  - Try to scale it further to $1B if we still haven’t hit diminishing returns and the pipeline continues to produce great safety results.

Other considerations

Who should be able to receive such a grant?

I think any organization or subset of an organization that provides a promising proposal and shows sufficient technical expertise should be a potential recipient. This includes non-profits, academia and for-profits. If the grant is made to a for-profit, it should be a restricted grant with the clear purpose of exploring the scaling of this project and under the condition that it gets published in relevant detail to ensure public benefit.

Given that the scope is already quite clear from the purpose of the grant, I think it doesn’t matter that much where it is housed as long as the idea and team are promising enough.

Why make this an explicit grant program?

Given the level of conservatism that funders have shown to date, I think most people considering such ambitious scaling projects would currently not expect that this is even in the possibility space of their application.

If funders are serious about spending serious amounts of money in the near future, I think they have to be extremely explicit about it to overcome the current perceived hesitation. Furthermore, I think the kind of people who would be best placed to run such scaling efforts might be people with substantial entrepreneurial talent and experience. Such people typically have many good options available to them, so only a serious public commitment like this would convince them to attempt such a project rather than e.g. starting a for-profit start-up without explicit safety benefit.

Aren’t we just gonna goodhart all of these metrics?

Maybe. Definitely a possibility. I think it would be good to try to find more robust metrics, but if we always end up with goodharted metrics, that’s good negative evidence and indicates we should probably stop with that particular program.

Concrete examples of potential grant areas

Monitoring & Control

I think monitoring and/or control are the most obvious cases for such a scaling plan. One of the core reasons that we’re pursuing a monitoring agenda at Apollo now is that we think it has high returns to scale.

There are several reasons for this:

Highly scalable sources of data: We have had clear positive preliminary results from a) collecting real-world data and injecting artificial failures in them and b) creating semi-realistic failures through environments in the flavor of ControlArena or SHADE arena. Both of these could be done at a significantly larger scale, e.g. millions of high-quality trajectories.
Blue team vs. red team framing: By definition all control problems can be framed as blue team vs. red team games. These enable many scaling opportunities, including automated red-teaming where we iteratively improve blue and red teams. While there are many details to be resolved, we have seen initial positive results for automated red-teaming in the context of monitors and many self-play and red-teaming approaches from other fields can easily be translated into the monitoring setting.
Number-go-up-games: Almost any subproblem in monitoring can be reasonably approximated with a number-go-up-game. This makes me hopeful that we can already use or will soon be able to use medium-horizon agents like Claude Code and Codex at scale to develop new methodology for these well-scoped problems. This would allow us to leverage a lot of compute that directly translates into better methods.

We’re aiming to publish a substantially more detailed version of this agenda and concrete projects for it soon. While we are building a monitor-based product around this research, we intend to keep the research itself relatively open, including the agenda.

Automated black box auditing

In short, I’m envisioning a substantially better version of PETRI for this. While PETRI is a good starting point, it has multiple shortcomings. Most importantly a) lack of realism and b) lack of diversity / scale.

To address those, we need to find automated ways to create highly realistic situations that could elicit egregious misalignment in models automatically at scale. Examples of plausible subparts of such an effort include:

Finding highly scalable long-horizon data sources to use and / or modify for red-teaming. Note that this is a very similar task as building RL environments and it therefore has to be ensured that people do not directly train against that distribution.
Designing better red-teaming agents that can make use of more compute, e.g. more multi-agent, more parallel, more effective continual learning, etc.
Various self-play and other red team vs blue team framings that can be borrowed from the monitoring ideas above.

White box auditing agents

Basically this: https://alignment.anthropic.com/2025/automated-auditing/ with similar ideas as above and also access to the features of the model. So the objective is not only to find meaningful failure modes but to find meaningful misaligned concepts of the model.

See also the “interpretability agent” Trenton Bricken talks about his second Dwarkesh interview.

Propensity evals

I think evals are harder to scale than automated red-teaming. For automated red-teaming we have some number to iterate against (e.g. number of instances of egregious misalignment found), so measuring success is much easier. For evals, this is not necessarily the case and we always have to rely on some other judgment like “realism” or “validity”. Nevertheless, I think there are highly scalable ways of designing evals that do not reduce to automated red-teaming.

First and foremost, I think finding highly scalable ways to turn large data sources systematically into high-quality and realistic evals is the key to success here. For various capabilities, this has already been attempted, e.g. SWEBench. However, I would specifically want to target propensity evals here.

My best guess is that the canonical attempt here tries to

Create a big library of realistic situations that we expect to be representative of real risks.
Create a big list of failure modes we’re worried about.
Then find automated ways to combine these two axes into a big grid of scenarios that we then turn into evaluations.

Likely this would use agents or agentic pipelines at scale that build and improve this set of evaluations, e.g. “eval agents” as described here.

Automated conceptual alignment research

This is, in some sense, the holy grail for a scaling grant.

At least in my experience, current models struggle with meaningful conceptual reasoning much more than I would expect given their impressive capabilities for agentic coding.

They seem to be quite bad at coming up with ideas that I would judge to have high quality in domains that I think I can judge quality in, e.g. scheming related research, evals and monitoring.
They seem to not “try very hard”, and intuitively it feels more like they are trying to give some answer than giving a good answer.

I’d expect that models would get rapidly better with RL on such tasks, but I guess it’s hard to find good tasks to RL on because they are often fuzzy and therefore bottlenecked by conceptual understanding of the grader.

Possible projects here could be

Exploratory projects into building scaffolding that leads to improvements in conceptual safety work.
Exploratory projects into building good datasets and RL environments for safety work
Exploratory work into RL for fuzzy tasks with a specific focus on alignment research
Cataloguing a lot of knowledge that we think is useful and important to alignment research in particular and seeing if we can make models use that knowledge more efficiently somehow.

Compared to the other examples above, I’m much less sure how this would be done in practice and I’d expect many possible grants to fail in the early stages due to lack of promising results. However, given the high importance, I think more people should try it anyway.

Note on capability externalities: that I’m not sure how differentially advantageous this work is for safety. My guess is that any kind of progress here could also transfer to fuzzy capability research. Nevertheless, I feel like

Someone will have to do it at some point anyway since most AGI companies’ ASI alignment plans go through automated research as far as I can tell.
I think there are some gains to be made that are not general, but specific to AI safety research, e.g. comparable to how training models to be good coding agents improves their cyber capabilities as well, but there are still meaningful gains from specialized cyber training / scaffolding.

So it seems worth attempting. Though, recipients of such grants should be more careful about the publication of their insights and be willing to shut down the project in case it differentially accelerates capabilities more than alignment.

Addressing various concerns I heard so far

I have previously pitched these kinds of grants to multiple funders and suggested them to non-funders and have heard the following concerns, among others.

We should first understand the concrete areas for these grants in more detail

While I think it is useful for the grantmakers to think about specific theses of their own for what kind of grants they want to make, I think this should not be a blocker to making the intention clear.

Specifically, I think making such grants public would incentivize more people to think in detail about ways to scale automated labor more effectively and should also influence the grantmakers’ thinking about the projects they actively seek.

Why should for-profits be recipients of such grants?

Broadly speaking, I think it doesn’t matter what type of organization should run such a project as long as they are able to execute it well. I personally think that non-profits and for-profits (and to some extent academia) can be able to execute such grants as long as it is ensured that they are done in service of the public benefit.

I think there are arguments in favor and against non vs. for profits for these scaling activities and I’m not entirely sure which are stronger. For profits might be better at doing things at the required scale, non-profits might be more likely to stay true to their mission. Having restricted grants might counteract the mission concerns for for-profits.

[-]RohanS2mo50

I think my main concern is the Goodhart one.

For the proxy you gave, is the idea that if you can detect a huge number of "distinct, egregiously misaligned ‘features’ found by interpretability (and a human expert would judge them as egregious, diverse and high quality),” then labs could use your methods and have a high likelihood of detecting egregious misalignment if it exists? $100M is a nontrivial fraction of all the funding available for AI x-risk reduction, so in some vague sense I feel like doing very well according to the proxies needs to mean solving a nontrivial fraction of all the x-risk from AI in order for these grants to make sense. Maybe this does meet the bar, I'd have to think more.

(I'm a bit unsure what it even means for a feature to be egregiously misaligned, won't even aligned models recognize egregiously bad behaviors and therefore have features for them?)

[-]Charlie Steiner2moΩ340

Datasets might be nice.

Object-level values.
- "What do you like or dislike about my current life?"
- "What kind of actions do you want to take in the next few weeks?"
- "What kind of changes would you make to the world around you if you could?"
- "What are some examples of kindness that you've witnessed?"
- "Come up with a moral dilemma that seems close to you."
- "What would you do in this moral dilemma someone else came up with?"
- etc.
Meta-level values.
- "How would you change yourself if you could?"
- "How do you feel about various ways you expect to grow and change in the future?"
- "Come up with a fictional disagreement between two people who value different things."
- "How do you think these fictional people should resolve their disagreement?"
- "When you feel torn between different options, how do you think you normally decide?"
- "How do you think you should decide?"
- "Watch this morally interesting video and describe what happened, thereby giving it an ontology."
- etc.

[-]baykenney2mo10

I think this is directionally correct in terms of the size of the problem and the amount of funding necessary. I have concerns though on the structure. Broadly, my largest concerns are that a 1) hub-and-spoke funding structure (like we currently have) isn't the best approach, long-term 2) the structure you're proposing lends itself to siloed research, where every org is independently trying to hillclimb the safety proxy.

If we agree on the scale of the problem, wouldn't we also agree that whatever small teams (METR/Apollo/Redwood size) would go further through open collaboration with academics, other labs, independent researchers, companies etc. I've written more here: Automating AI Safety Research Requires an Open Ecosystem, Not Bigger Grants.

[-]Archie Chaudhury2mo10

I think the main gap between safety work and the broader, "generative AI" ecosystem that funders or investors may be looking at is the tendency of AI safety to sound like something that fundamentally is meant to not return immediate results, but rather be a public good of sorts.

I actually think there are plenty of concrete problems today, such as the prospenity of LLMs to engage in less explicit, harmful behavioral patterns, that can be addressed by solutions today. To me, this is something that can be extremely valuable and also help address existensial risk down the line.

67

There should be $100M grants to automate AI safety

67

Ω 16

Motivation

What is an “automated AI safety scaling grant”?

Other considerations

Who should be able to receive such a grant?

Why make this an explicit grant program?

Aren’t we just gonna goodhart all of these metrics?

Concrete examples of potential grant areas

Monitoring & Control

Automated black box auditing

White box auditing agents

Propensity evals

Automated conceptual alignment research

Addressing various concerns I heard so far

We should first understand the concrete areas for these grants in more detail

Why should for-profits be recipients of such grants?

67

Ω 16

67

Ω 16