Datasets might be nice.
I think my main concern is the Goodhart one.
For the proxy you gave, is the idea that if you can detect a huge number of "distinct, egregiously misaligned ‘features’ found by interpretability (and a human expert would judge them as egregious, diverse and high quality),” then labs could use your methods and have a high likelihood of detecting egregious misalignment if it exists? $100M is a nontrivial fraction of all the funding available for AI x-risk reduction, so in some vague sense I feel like doing very well according to the proxies needs to mean solving a nontrivial fraction of all the x-risk from AI in order for these grants to make sense. Maybe this does meet the bar, I'd have to think more.
(I'm a bit unsure what it even means for a feature to be egregiously misaligned, won't even aligned models recognize egregiously bad behaviors and therefore have features for them?)
I think the main gap between safety work and the broader, "generative AI" ecosystem that funders or investors may be looking at is the tendency of AI safety to sound like something that fundamentally is meant to not return immediate results, but rather be a public good of sorts.
I actually think there are plenty of concrete problems today, such as the prospenity of LLMs to engage in less explicit, harmful behavioral patterns, that can be addressed by solutions today. To me, this is something that can be extremely valuable and also help address existensial risk down the line.
This post reflects my personal opinion and not necessarily that of other members of Apollo Research.
TLDR: I think funders should heavily incentivize AI safety work that enables spending $100M+ in compute or API budgets on automated AI labor that directly and differentially translates to safety.
Motivation
I think we are in a short timeline world (and we should take the possibility seriously even if we don't have full confidence yet). This means that I think funders should aim to allocate large amounts of money (e.g. $1-50B per year across the ecosystem) on AI safety in the next 2-3 years.
I think that the AI safety funders have been allocating way too little funding and their spending has been far too conservative in the past 5 years. So, in my opinion, we should definitely continue ramping up “normal” spending, e.g. pay more competitive salaries, allow AI safety organizations to grow faster, and other things in that vein.
However, these “normal” spending patterns are not sufficient under short timeline assumptions and the obvious way to spend more money quickly is to aggressively encourage finding ways to use automated labor for AI safety.
What is an “automated AI safety scaling grant”?
An “automated AI safety scaling grant” aims to aggressively encourage attempts to use automated labor for AI safety at scale. The explicit intention is that if somebody manages to find ways to build scalable safety pipelines, they can be confident that they can run these pipelines at scale with short turnaround time from grantmakers.
An example of such a grant would be:
Other considerations
Who should be able to receive such a grant?
I think any organization or subset of an organization that provides a promising proposal and shows sufficient technical expertise should be a potential recipient. This includes non-profits, academia and for-profits. If the grant is made to a for-profit, it should be a restricted grant with the clear purpose of exploring the scaling of this project and under the condition that it gets published in relevant detail to ensure public benefit.
Given that the scope is already quite clear from the purpose of the grant, I think it doesn’t matter that much where it is housed as long as the idea and team are promising enough.
Why make this an explicit grant program?
Given the level of conservatism that funders have shown to date, I think most people considering such ambitious scaling projects would currently not expect that this is even in the possibility space of their application.
If funders are serious about spending serious amounts of money in the near future, I think they have to be extremely explicit about it to overcome the current perceived hesitation. Furthermore, I think the kind of people who would be best placed to run such scaling efforts might be people with substantial entrepreneurial talent and experience. Such people typically have many good options available to them, so only a serious public commitment like this would convince them to attempt such a project rather than e.g. starting a for-profit start-up without explicit safety benefit.
Aren’t we just gonna goodhart all of these metrics?
Maybe. Definitely a possibility. I think it would be good to try to find more robust metrics, but if we always end up with goodharted metrics, that’s good negative evidence and indicates we should probably stop with that particular program.
Concrete examples of potential grant areas
Monitoring & Control
I think monitoring and/or control are the most obvious cases for such a scaling plan. One of the core reasons that we’re pursuing a monitoring agenda at Apollo now is that we think it has high returns to scale.
There are several reasons for this:
We’re aiming to publish a substantially more detailed version of this agenda and concrete projects for it soon. While we are building a monitor-based product around this research, we intend to keep the research itself relatively open, including the agenda.
Automated black box auditing
In short, I’m envisioning a substantially better version of PETRI for this. While PETRI is a good starting point, it has multiple shortcomings. Most importantly a) lack of realism and b) lack of diversity / scale.
To address those, we need to find automated ways to create highly realistic situations that could elicit egregious misalignment in models automatically at scale. Examples of plausible subparts of such an effort include:
White box auditing agents
Basically this: https://alignment.anthropic.com/2025/automated-auditing/ with similar ideas as above and also access to the features of the model. So the objective is not only to find meaningful failure modes but to find meaningful misaligned concepts of the model.
See also the “interpretability agent” Trenton Bricken talks about his second Dwarkesh interview.
Propensity evals
I think evals are harder to scale than automated red-teaming. For automated red-teaming we have some number to iterate against (e.g. number of instances of egregious misalignment found), so measuring success is much easier. For evals, this is not necessarily the case and we always have to rely on some other judgment like “realism” or “validity”. Nevertheless, I think there are highly scalable ways of designing evals that do not reduce to automated red-teaming.
First and foremost, I think finding highly scalable ways to turn large data sources systematically into high-quality and realistic evals is the key to success here. For various capabilities, this has already been attempted, e.g. SWEBench. However, I would specifically want to target propensity evals here.
My best guess is that the canonical attempt here tries to
Likely this would use agents or agentic pipelines at scale that build and improve this set of evaluations, e.g. “eval agents” as described here.
Automated conceptual alignment research
This is, in some sense, the holy grail for a scaling grant.
At least in my experience, current models struggle with meaningful conceptual reasoning much more than I would expect given their impressive capabilities for agentic coding.
I’d expect that models would get rapidly better with RL on such tasks, but I guess it’s hard to find good tasks to RL on because they are often fuzzy and therefore bottlenecked by conceptual understanding of the grader.
Possible projects here could be
Compared to the other examples above, I’m much less sure how this would be done in practice and I’d expect many possible grants to fail in the early stages due to lack of promising results. However, given the high importance, I think more people should try it anyway.
Note on capability externalities: that I’m not sure how differentially advantageous this work is for safety. My guess is that any kind of progress here could also transfer to fuzzy capability research. Nevertheless, I feel like
So it seems worth attempting. Though, recipients of such grants should be more careful about the publication of their insights and be willing to shut down the project in case it differentially accelerates capabilities more than alignment.
Addressing various concerns I heard so far
I have previously pitched these kinds of grants to multiple funders and suggested them to non-funders and have heard the following concerns, among others.
We should first understand the concrete areas for these grants in more detail
While I think it is useful for the grantmakers to think about specific theses of their own for what kind of grants they want to make, I think this should not be a blocker to making the intention clear.
Specifically, I think making such grants public would incentivize more people to think in detail about ways to scale automated labor more effectively and should also influence the grantmakers’ thinking about the projects they actively seek.
Why should for-profits be recipients of such grants?
Broadly speaking, I think it doesn’t matter what type of organization should run such a project as long as they are able to execute it well. I personally think that non-profits and for-profits (and to some extent academia) can be able to execute such grants as long as it is ensured that they are done in service of the public benefit.
I think there are arguments in favor and against non vs. for profits for these scaling activities and I’m not entirely sure which are stronger. For profits might be better at doing things at the required scale, non-profits might be more likely to stay true to their mission. Having restricted grants might counteract the mission concerns for for-profits.