Feel free to disregard, but I've gotten some feedback that it could be nice to bring some metascience and philosophy of science into thinking about automated alignment science, so I thought it would be worth a shot.
If I take the perspective of my internal combined model of Michael Nielsen and Thomas Kuhn, I would also say something about the framing questions that are implicit here.
Right now it seems your model is, to some extent, an iteration loop on top of a well-defined set of evaluation benchmarks, iterating on those in order to fill up the space of potential problems that might arise.
I wonder if you're underestimating the difficulty implicit here by focusing on this model?
If we look at the engineering around some major crises in the past, I think we can point at some interesting problems that show up.
Take the Gaussian copula in pricing structured credit before 2008. People doing quantitative analysis were looking at the underlying math in a careful and straightforward way that fit the frames of the time well. The problem was that the correlations were estimated from a regime where house prices had only risen, which made every model locally correct and the aggregate position catastrophically wrong when the regime changed. Now you might say withhold test data and do k-fold cross-validation, but it is likely that no amount of better aggregation would have caught it, because the space of evidence being aggregated was pre-defined by a framing nobody inside the loop was incentivised to examine.
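To make the regime-change point concrete, here is a minimal one-factor Gaussian copula sketch (all parameters invented for illustration; this is a toy, not the actual pre-2008 pricing models). Each loan's marginal default probability is identical in both runs, so every per-loan model is "locally correct"; only the correlation parameter, the thing calibrated on the calm regime, differs:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def tranche_breach_prob(rho, n_loans=100, p_default=0.05,
                        attach=0.15, n_sims=100_000):
    """P(pool default fraction exceeds the tranche attachment point)
    under a one-factor Gaussian copula with asset correlation rho."""
    thresh = norm.ppf(p_default)                   # per-loan default cutoff
    z = rng.standard_normal(n_sims)                # common (housing) factor
    eps = rng.standard_normal((n_sims, n_loans))   # idiosyncratic shocks
    assets = np.sqrt(rho) * z[:, None] + np.sqrt(1 - rho) * eps
    frac = (assets < thresh).mean(axis=1)          # pool default fraction
    return (frac > attach).mean()

# Marginal default probability is 5% per loan in both runs;
# only the assumed correlation regime changes.
print(tranche_breach_prob(rho=0.05))  # calm-regime estimate: breach looks rare
print(tranche_breach_prob(rho=0.40))  # regime shift: breach far more likely
```

No amount of held-out validation on calm-regime data flags this, because the data generating the correlation estimate never contained the other regime.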
Or take aviation automation through the '80s and '90s. They had lots of different technical measures of reliability, including things like formal verification. And yet aircraft were flying into terrain while the autopilot did exactly what its spec said it would, because the framing-level question of whether "automation meets spec" was the right safety property at all couldn't be reached from inside the engineering paradigm. It took human-factors researchers from outside to introduce concepts like mode confusion and automation surprise.
I'm not saying you can predict the specific framing failures, but I worry that if you don't build automated alignment science with paradigm-level reframing as a core capacity, you end up in a bad state. This is because science very much seems like a series of weird serendipitous discoveries following each other (or at least the process of paradigm shift does; normal science is also very much real).
More specifically, I'm worried about the lack of research on problem formulation? The approach here seems to reflect a very engineering-heavy mindset where you're iteratively figuring out how to make a system safe. An annoying part of the scientific process in weirder domains like complexity science, the biological systems sciences and the social sciences is missing here: sometimes it is really, really hard to interpret results.
Usually this is also done in a community way, and the re-interpretation of ideas often involves bringing in an external perspective on the same problem and then seeing it in a new way. The mixing of ideas and the mixing of problems don't seem to have much of a place here? I don't know if I would subscribe to a model of scientific progress similar to the one you're implicitly imposing here. For example, how do you model serendipity and adaptivity to new problems? How are you thinking of modelling the integration of external frames into your problems?
Is this model of research open to multi-polar worlds where you have more complex interactions? If we are talking about automated alignment science and not engineering, why is there no mention of theories of science?
I don't know if this is helpful or not; it could just be that you're strictly aiming for a scalable-oversight single-agent scenario and find it justified to simplify the model to an engineering problem here. I hope this might be somewhat useful to think about. (It also feels a bit weird to mention this to you when you wrote a paper on it being harder than you think, but that is precisely why I thought you might be receptive to some of these thoughts.)
I'm glad to see this paper. The difficulties of automating alignment research are well-appreciated by many, but until now I hadn't seen a rigorous attempt to articulate them. This paper is an important contribution.
The point about "Aggregating correlated evidence" is something that gets brought up in finance. Investors routinely get into trouble by treating correlated risks as if they're independent. I had never thought about that in the context of safety evaluations, but it makes perfect sense.
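To see how sharply this bites, here is a toy common-cause model (all numbers invented for illustration; this is a sketch of the independence mistake, not anything from the paper). Five evaluations each miss a real flaw about 10% of the time, but a shared blind spot occasionally defeats all of them at once:

```python
# Toy common-cause model of correlated evaluation errors.
# All numbers are invented for illustration.
k = 5       # number of safety evaluations
p = 0.08    # per-eval miss rate from independent causes
q = 0.02    # probability of a shared blind spot defeating every eval

marginal = q + (1 - q) * p      # observed per-eval miss rate (~0.098)
naive = marginal ** k           # treating the k misses as independent
actual = q + (1 - q) * p ** k   # accounting for the common cause

print(f"per-eval miss rate:      {marginal:.3f}")  # 0.098
print(f"naive  P(all {k} miss):  {naive:.1e}")     # ~9.2e-06
print(f"actual P(all {k} miss):  {actual:.1e}")    # ~2.0e-02, ~2000x larger
```

The marginal miss rates look identical either way; the three-orders-of-magnitude gap lives entirely in the correlation structure.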
Summary
This is a summary of a paper published by the alignment team at UK AISI. Read the full paper here.
AI research agents may help solve ASI alignment, for example via the following plan:
We argue that automating alignment research in this manner could produce catastrophically misleading safety assessments, causing researchers to believe that an egregiously misaligned AI is safe, even if AI agents are not scheming to deliberately sabotage alignment research. Our core argument (Fig. 1) is as follows:
Because these tasks are hard to supervise, research outputs (including OSAs) will contain systematic, undetected errors, regardless of whether these tasks are completed by human researchers or research agents trained using human data. We already see examples of such errors in current models, especially on hard tasks.
In most domains, iteration can correct for undetected errors: mistakes not caught during one experiment are often surfaced by subsequent research or real-world system behaviour. Unfortunately, alignment lacks the safe feedback loops required for such an error-correction process to work: producing an overly optimistic OSA could result in the deployment of a misaligned AI before the error is caught, which could be catastrophic. Therefore, we need to train agents to reliably perform well on these hard-to-supervise fuzzy tasks the first time. Direct training using human feedback on these tasks is not sufficient, as human approval does not indicate correctness. Two alternatives remain:
Figure 1: Automated alignment can fail, leading to the deployment of a misaligned AI, if research outputs contain undetected systematic errors (a) or if the aggregation of research outputs is flawed (b). Since alignment cannot be measured directly, research instead targets alignment proxies (model organisms, honesty evaluations, whitebox probes etc.). Incorrect judgement about the implications of results for alignment can lead to systematic, undetected errors in individual research outputs. Aggregation of outputs can lead to incorrect overall safety assessments, even when individual outputs are correct, if the correlation between the uncertainties in each research output is mis-modelled (Sec. 3.3.2).
Acknowledgments
Charlie Griffin and Martin Soto contributed substantially to the core content of this paper. We are also grateful to David Africa, Simon Marshall, Konstantinos Voudouris, Benjamin Hilton, Joshua Clymer, Evan Hubinger, Daniel Kokotajlo and Jan Leike for valuable review and discussions.
We consider an agent to be scheming if it covertly and strategically pursues a hidden objective (Meinke et al. 2025). This definition permits isolated misaligned behaviours such as reward hacking. Unlike other common definitions (e.g., Carlsmith 2023), our definition is not restricted to behaving well in training in order to gain power later.
We use "overall safety assessment" rather than "safety case" because it is more general. An OSA can take many forms (e.g. a safety case, risk report, compilation of empirical evidence from red teaming efforts, or judgement by an individual human or AI).
Another important, hard-to-supervise fuzzy task is steering a body of research and allocating resources efficiently between research directions. This involves assessing whether individual research directions are worth pursuing and whether the research corpus is sufficiently broad.
This problem of correlated uncertainty was discussed in "Learning the prior" (Christiano 2020) and "Learning the smooth prior" (Irving et al. 2022).