Which technical AI safety fields are going to be automated first?

Chamod Kalupahana

I’m transitioning into technical AI safety, and I’ve been thinking a lot about which fields I want to research and where I’ll have the biggest impact. One question that keeps coming up recently is which fields are likely to be automated.

Automation seems pretty likely, since frontier labs will likely automate capabilities research as part of automated R&D, and safety research won’t be far behind. Some early examples: Anthropic is investigating automated alignment researchers, UKAISI recently evaluated models’ propensity to sabotage alignment research, and Anthropic is using Mythos for internal productivity boosts and testing its ability to do alignment research without sabotage (which it can’t). There are also a handful of frameworks appearing for creating automated research pipelines, such as The AI Scientist.

I’ll be considering two factors:

Feedback Quality: How easy is it to verify the research outputs from these fields?
Economic Incentive: How much incentive is there for frontier labs to automate this research?

Disclaimer: This is a pretty rough and hand-wavy treatment of the question. If you’re looking for a more in-depth exploration of modelling automated R&D across alignment vs capabilities, I’d recommend this post from January.

Disclaimer: I wrote up most of the ideas for this blog post in April and didn’t get around to publishing it. I'm aware that UKAISI has published more fantastic research into this that I haven’t considered in this blog post.

I’ll base my estimates of what can be automated on what an AI with superhuman coding and reasoning capabilities could do.

In particular, I’m referring to how quickly each field will get to automating the entire research pipeline with few or no humans in the loop, and the incentives to get there.

Here are my rankings for which fields are most likely to be automated this way (top is most likely):

Scalable Oversight
- Feedback Quality (3/5): This has okay feedback quality thanks to already existing evals; however, I think the problem is quite open-ended, and it is hard to identify everything we need to watch out for in model outputs.
- Economic Incentive (5/5): I’ve put this at the top since (1) it is crucial for automating the other fields of research, and (2) it has a nice positive feedback loop where stronger scalable oversight methods allow more automation. We have already seen Anthropic researching scalable oversight.
- I admit that this may be a circular dependency, where good scalable oversight needs to exist in order to automate scalable oversight research, but that’s why I believe there will be massive incentive for humans (with AI-assisted reasoning) to research and develop this.
Mech Interp
- Feedback Quality (5/5): For activation engineering, I reckon it’s arguably easy to validate mech interp methods, particularly for things like steering (e.g., linear probes, SAEs), by evaluating model outputs under steering. I think the same applies to optimising current methods with hyperparameter sweeps, etc.
- I do think automation will struggle with novel methods and will probably need some human insight, although novel methods may emerge from just iteratively understanding the model more, leading to positive feedback loops as predicted by ambitious mech interp. I also think the bitter lesson of throwing more compute at the problem may bypass difficult novel research (e.g., activation oracles, NLA).
- Economic Incentive (4/5): Massive. Better theories of mech interp will lead to better alignment research and will also aid capability research (e.g., verifying that a model has learnt a feature). I predict this field will receive the most investment in automated R&D.
AI control
- I was tempted to group this with scalable oversight since it tackles a very similar problem, but it is more adversarial and its research agenda seems different, so I’ve left it as its own field for now.
- Feedback Quality (3/5): I think existing frameworks, such as ControlArena and LinuxArena, provide pretty good environments that can serve as a foundation for more complicated environments. I can also see this field becoming more automated in general to evade eval awareness via blind deep deployment.
- Economic Incentive (4/5): I think this is slightly lower than scalable oversight, since this prepares for the case that a model is deceptive, which is less likely to come up in training/deployment than simply monitoring frontier models for quality.
Model Organisms of Misalignment
- Feedback Quality (4/5): I think this has good feedback quality since model orgs have a ground truth (that they’re misaligned in a specific way), assuming they’re trained correctly.
- Economic Incentive (2/5): I think this has fairly low economic incentive for the same reason as above. Despite the catastrophic risk of deceptive or backdoored models, I think the lower likelihood will lead AI labs to neglect this research in favour of more profitable research, such as scalable oversight.
Emergent Misalignment
- Again, I was tempted to group this with Model Organisms of Misalignment since it’s a main area of study within that field, but I decided to separate it because its research agenda is aimed more at how it occurs and how to prevent it (e.g., via inoculation prompting).
- Feedback Quality (3/5): This has slightly harder feedback quality than model orgs because the range of misalignment is much broader here; we need to ensure that we can measure misalignment of values and policies, which I believe requires hard brainstorming.
- Economic Incentive (3/5): I believe there is more economic incentive to study this than model orgs, since this is a phenomenon that may arise right now when post-training frontier models if it isn’t done carefully. Therefore, there’s an incentive for frontier labs to protect themselves against it.
Evals
- Feedback Quality (2/5): I think the quality differs based on the kind of evaluation this is. I believe capability evaluations will be easier to validate than propensity or alignment evaluations, since some kind of ground truth can be established using smaller models. However, I think the hardest part will be designing evals since models aren’t fantastic at novel idea generation.
- Economic Incentive (4/5): There is a decent incentive to automate these evals, partly because we are already running out of evals that aren’t saturated (which ties into scalable oversight), and partly because frontier labs can create their own evals and thresholds for safety as a part of regulatory capture.
Technical Governance
- Treating technical governance as one big field is an oversimplification. For example, I’m including inference verification, open weight safety, threat models, forecasting, etc.
- Feedback Quality (3/5): I believe this is quite a messy field in terms of knowing how successful a piece of work has been. It also depends on how much of governance will be automated, since the field is messy in terms of product communication and adoption (by regulators and frontier labs). However, some parts of the field have very clear feedback loops (e.g., compute/hardware tracking), so I’m upgrading this to a 3.
- Economic Incentive (2/5): I believe there is little incentive for frontier labs to automate this, since they would be contributing work that aids their own regulation. However, there is the case of regulatory capture, where frontier labs could control this regulation via this automation.
Field Building
- I particularly mean starting a new org, community, or tool, driven particularly by generalists. I’ll be focused on automating the process of starting and running a new org for the technical AI research pipeline.
- Feedback Quality (1/5): I think it’s pretty hard to determine the impact of an org apart from just analysing and optimising for KPIs (Coefficient Giving has a good strategy for it though). This also has a very slow feedback loop, with investments taking months and years to pay off or even become clear.
- Economic Incentive (2/5): I think this has pretty negligible incentives for frontier labs apart from expanding their own organisation and automating their own admin and growth.
Theoretical Alignment
- This includes, for example, value alignment and CEV.
- Feedback Quality (1/5): This is hard, particularly for research into value alignment, as you need to show that these theories have some grounding in reality, which is quite difficult to validate. This also requires lots of novel forms of thinking and methods, which are not easy for models to automate yet.
- Economic Incentive (2/5): I do see the idea of getting to AGI or ASI and having models research humanity’s values and research theoretical alignment to those values, but this is far off, and I think a lot of other things need to go right (such as scalable oversight) before frontier labs will practically consider this.
Agent Foundations
- Feedback Quality (1/5): Similar to the above: quite difficult, with lots of theory that can be hard to validate.
- Economic Incentive (1/5): Some subjects like corrigibility are measured by frontier labs but not really as a main point of focus and not really in ways that other attempts at corrigibility agree with. I also think that frontier labs are fully on board with the idea of empiricism and doubt they will heavily investigate these theoretical foundations of agents.
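
As a rough cross-check on the list above, here is a minimal Python sketch that adds the two scores for each field and sorts by the total. The equal weighting and the names `SCORES` and `rank_by_total` are my own assumptions for illustration; the ranking in this post was made by judgment, not by this formula.

```python
# Scores copied from the list above as (feedback_quality, economic_incentive),
# each out of 5. Fields appear in the post's own ranking order.
SCORES = {
    "Scalable Oversight": (3, 5),
    "Mech Interp": (5, 4),
    "AI Control": (3, 4),
    "Model Organisms of Misalignment": (4, 2),
    "Emergent Misalignment": (3, 3),
    "Evals": (2, 4),
    "Technical Governance": (3, 2),
    "Field Building": (1, 2),
    "Theoretical Alignment": (1, 2),
    "Agent Foundations": (1, 1),
}


def rank_by_total(scores):
    """Sort fields by feedback + incentive, highest first.

    Equal weighting is an assumed rule for illustration, not the post's method.
    """
    # sorted() is stable, so tied fields keep the order they have in the post.
    return sorted(scores, key=lambda field: -sum(scores[field]))


if __name__ == "__main__":
    for position, field in enumerate(rank_by_total(SCORES), start=1):
        feedback, incentive = SCORES[field]
        total = feedback + incentive
        print(f"{position:2d}. {field}: {feedback} + {incentive} = {total}")
```

Under this equal weighting, the only change from my list is that mech interp (9) moves ahead of scalable oversight (8). I still put scalable oversight first because automating the other fields depends on it, which a simple sum doesn't capture.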

Just a reminder that this is an extremely broad overview of these fields, based on my educated guess about which research tasks are very difficult to automate (such as research taste and conceptual design). Of course, one could plausibly rank these fields differently (e.g., evals and emergent misalignment, depending on how much you trust automated eval generation).

It’s been pretty interesting writing this. At first I had mech interp as my #1 field, but reading more about Anthropic’s research showed me that scalable oversight is just as important, if not more.

It’s not a nice list to make, especially as someone who wants to do research. I worry about the ever-shrinking gap of “human” work to do. And I’ll be selfish and admit that it also kinda sucks because I want to do the research in all of these fields.

This is one reason I’ve found myself leaning towards technical governance work, evals, and field building (although mech interp will always be a massive interest), since these seem like “safer bets” (in terms of being less likely to be automated first), even if they have their own problems, such as being hard to enter and having uncertain long-term impact. I also had a lovely time at ML4Good, which updated my values and made me want to pursue these fields more; I’m looking forward to writing that up as a separate blog post.

I would love to hear your thoughts on this, especially if you disagree with my list above. Feedback is always welcome!
