Thanks for your thoughts. I'm thinking along the same lines for biological research, where the ground-truth is very far away and many analytical outcomes can seem plausible.
I am deliberately assuming humans as the final decision-makers, and asking how far we can get by augmenting human judgement
I agree with this stance and I think we should keep humans in the loop as much as possible. But I wonder what the response should be if you run several AI agents on the same topic and they come to different conclusions. (Although it seems like you have the opposite problem: all of them doing exactly the same thing.)
are differences in productivity between academic fields
I think even looking within subfields (say, within biology), the rate of uptake of similar ideas can be very different. Some fields are just more amenable to disproving ideas than others, given the availability of evidence and the variation that experiments can create. I would expect AI to make the gaps in theory smaller because synthesis across fields is easier, but I don't think it will change the fundamental differences in experimental design.
But I wonder what the response should be if you run several AI agents on the same topic that come to different conclusions.
I would guess that situations where answering one question raises a few others are fairly common. You can have conflicting evidence and that's a pointer that your understanding is somehow off, so you have to dig deeper. If you keep following those edges on the knowledge graph and still have compelling evidence going both ways, something might be wrong with your world model such that you need to reevaluate more fundamental assumptions. It shouldn't be possible to arrive at different conclusions if your work so far is correct. This is one way in which massively parallelising the legwork doesn't speed you up – it's a sort of Amdahl's law effect – because you still need to understand the assumptions, context, etc.
To that extent you want to reduce the verification cost of each agent thread. That's why I briefly touched on debate as a method of eliciting the crux. My sense is that if the crux is obvious to the human reviewer, progress is more easily made. If it isn't, the human reviewer has to do increasingly costly tasks to review the work, potentially up to the cost of generating it in the first place.
Over the last month, I've been working on a project I'm calling fab.[1] Nominally, it's an interface that enables a human researcher to make sense of research produced by many agents running in parallel. I haven't "finished" fab (in fact, I'm stuck searching for the crux of building something like this), but I still think it is worth posting a short explanation of the problem it is meant to solve, and how it tries to address it.
In what follows I'm imagining this being used for automated alignment research. This is not because I see it as a silver bullet. I am just observing that a lot of current empirical alignment work could be automated (and in fact some of it already has been, just not the more open-ended stuff). If we can shift more work to the left through automation, that's a win. I am deliberately assuming humans as the final decision-makers, and asking how far we can get by augmenting human judgement.[2]
Research at scale
Imagine a near future where you can spin up dozens of agents to do research for you in parallel.[3] You loosely specify a question you're interested in, and they do all the legwork: operationalise that question, look at prior work, run quick experiments to get a mechanistic understanding, run bigger experiments (training models, perhaps, or doing white-box interpretability), analyse the results, position the findings in the broader context. Ideally, they all take slightly different approaches, trying to find different "handles" on the question you specified. When they're done, you have on the order of tens of write-ups to review, with the ultimate goal of updating your understanding of that question in light of new evidence.
I think this is actually really hard, for a couple of reasons that have to do with the interplay between how we do research and current agent failure modes.[4]
Attention is the big problem. There are only so many human researchers in a given field. In alignment, there are maybe a few thousand in total, with maybe 30 or so who are highly productive. You don't want these people reviewing LLM slop. I think this is the binding constraint of the entire system, because it dictates how much work you can fan out. In other words, nothing is stopping you from kicking off 100 research agents right now (okay, maybe token spend), but good luck making sense of all hundred research reports.
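One rough way to see why reviewer attention binds is Amdahl's law: if some fraction of the total work (here, human review) stays serial, the speedup from adding agents is capped no matter how many you run. The sketch below is illustrative only; the 80/20 split between agent legwork and human review is an assumed number, not a measurement.

```python
def amdahl_speedup(parallel_fraction: float, n_agents: int) -> float:
    """Amdahl's law: overall speedup when only part of the work parallelises."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    serial_fraction = 1.0 - parallel_fraction  # e.g. human review
    return 1.0 / (serial_fraction + parallel_fraction / n_agents)


# ASSUMPTION: 80% of the effort is agent legwork, 20% is human review.
for n in (1, 3, 10, 100):
    print(n, round(amdahl_speedup(0.8, n), 2))
```

With a 20% serial share the speedup approaches 5x but never exceeds it, however many agents run. If anything this is optimistic for the setting above, since review effort grows with the number of reports rather than staying fixed.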
In reality, of those 100 research agents, many will not produce anything worthwhile. In my experience the most common reasons for that are:
There are mitigations for all three:
Even when these are taken care of, there is a lot of information to review. A side point: the format we've settled on for academic publication is partly a reflection of bandwidth constraints in human reviewers. That need not be the case in the future, if we can offload some of that to agents. Ideally we would not share just the findings; we would record what was tried, what worked & what didn't, what assumptions were made, as well as any resources used (open-science efforts often focus on these points).
How fab works
fab is supposed to be the layer/interface that converts lots of agent attempts into one human update. Today fab is mostly exploration. I am assuming that a good agent platform exists externally, such that I can spin up persistent background agents with tools, skills, execution sandboxes, durable filesystems etc. Those agents receive a research contract, do some work, and return a specific kind of artefact bundle.
The research contract is a specification of the research question. It is partly structured, but leaves a lot of room for nuance. Here is an example for weak-to-strong generalisation. I get the sense that, with current-generation agents, investing in a good spec has large returns. Saying "go investigate phenomenon X" tends to not work.
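To make "partly structured, with room for nuance" concrete, here is a minimal sketch of what such a contract could look like as a data structure. Every field name and the word-count threshold are hypothetical, chosen for illustration; this is not fab's actual schema.

```python
from dataclasses import dataclass, field

# ASSUMPTION: arbitrary cut-off for "too terse"; tune or replace as needed.
MIN_QUESTION_WORDS = 8


@dataclass
class ResearchContract:
    """Hypothetical contract shape: a few structured fields plus free text."""
    question: str
    background: str = ""  # free-form nuance: context, caveats, prior work
    constraints: list[str] = field(default_factory=list)
    deliverables: list[str] = field(
        default_factory=lambda: ["results", "code", "logs", "report"]
    )

    def problems(self) -> list[str]:
        """Return reasons this spec is probably too thin to hand to an agent."""
        issues = []
        if len(self.question.split()) < MIN_QUESTION_WORDS:
            issues.append("question is too terse to operationalise")
        if not self.background.strip():
            issues.append("no background or nuance given")
        return issues


vague = ResearchContract(question="Go investigate phenomenon X")
print(vague.problems())
```

The check is deliberately crude. Its only job is to flag a "go investigate phenomenon X" style of spec before any tokens are spent on it.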
The fab artefact bundle is adapted from this paper. It contains results, code, logs, and a report. I also expect the entire agent trace to be at least inspectable, but it is better if there is a rich snapshot of the state so that the agent can be spun up again with the same execution state, trajectory history etc. (this exists today).
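A minimal sketch of checking that a returned bundle has the expected shape, assuming each bundle is a directory. The entry names (`results`, `code`, `logs`, `report.md`) are assumptions made for this example, not fab's actual layout.

```python
import tempfile
from pathlib import Path

# ASSUMPTION: hypothetical names for the four parts of a bundle.
REQUIRED_PARTS = ("results", "code", "logs", "report.md")


def missing_parts(bundle_dir: Path) -> list[str]:
    """List the required bundle entries that are absent from bundle_dir."""
    return [name for name in REQUIRED_PARTS if not (bundle_dir / name).exists()]


# Build a throwaway bundle that deliberately lacks logs.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "results").mkdir()
    (bundle / "code").mkdir()
    (bundle / "report.md").write_text("# Findings\n")
    print(missing_parts(bundle))  # prints ['logs']
```

A shape check like this says nothing about whether the research is any good; it only rejects bundles a human reviewer could not inspect in the first place.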
fab is explicitly not the data layer, and stays out of how agents might want to structure or persist intermediate work. As long as the artefact shape is correct, the rest is open to experimentation. I put together a lightweight knowledge base repo to accompany fab; here is an example of an artefact from a real multi-agent run investigating what happens to a model's support before and after RLVR. That said, I've noticed that you have to be very careful when you get agents to work on a knowledge base. The default outcome is accumulation; agents often behave as librarians, filing things away. I would like them to consolidate that knowledge, pruning the corpus as needed, updating previous notes, etc. I think this append-only bias mostly falls out of how they're trained, so it's not easy to remove.
I've used fab a few times now with different kinds of agent harnesses. It doesn't yet clear the bar for switching from interacting with the agents directly to going through this added interface. The few runs I did were just okay, but they only used 3 agents, and the core value proposition only shows up when you fan out more.
What's next
A fun idea I had is "FellowsBench": how easy would it be to use fab to replicate work from past Anthropic Safety Fellows? It's an easier problem because the work has already been done, but it tests many of the execution paths you would hit in real-world use (e.g. fan out replications of ablations at the same time). It's also easier than the abstract "do research" question because most of the projects are empirical, so more tractable for agents.
I'm planning to run this at a larger scale and see if that works as a forcing function for the design. If using fab causes me to update on a view, that seems like a win; measuring the delta between using fab and using plain agents seems harder. I don't expect the architecture or code generation to be big obstacles here; rather, I think nailing the ergonomics matters most.
What fab is and isn't
There is a lot of overlap between fab and agent orchestration. For single-agent work, this looks like coming up with a good harness. For multi-agent work, it's trying to figure out how to get swarms to make useful progress. fab is a bit more like the latter, but in my mind it is subtly different in focus. Similarly, fab overlaps but doesn't try to rehash other efforts:
Adjacent questions
Asking how fab should work led me to other questions like "how does science happen?" or "how is knowledge produced?". There are a few threads worth chasing here.
Within or across paradigms
It could be that a research system that works well within-paradigm does not work well across them, and vice-versa. Within a paradigm there is typically incremental progress, even though you can't trace a neat graph curving up-and-to-the-right. It is still hard to discover the future, but in retrospect it is obvious that those pieces line up. Discovering a new paradigm is not like this; it is a "shock" to the existing body of knowledge that seems to occur when enough anomalies accrue under that body of knowledge. It's plausible that the kind of work that eventually leads to noticing those anomalies, to putting the picture together, is not the same kind of work that yields incremental progress within-paradigm. (One example that comes to mind is adding epicycles to explain planetary orbits under a geocentric model, when the better model, the new paradigm, is heliocentrism.)
I am more sceptical that you can do this kind of discovery by just ramping up the amount of work you do, though the connection is somewhat subtle. To be clear, I do think that even paradigm-changing research requires a lot of attempts, a lot of shots on goal; in that sense parallelising those attempts isn't bad by default. But the function that tells you whether those attempts are coalescing into a broader paradigm is harder to pin down, and not something I expect to "crack" with fab. This recent write-up is relevant!
Research flows
I am finding out that even though there is an archetypal "flow" for research, it varies quite a lot between fields. Even with expensive training runs in ML our feedback loops are short. Most experiments are computational in nature. That dictates how we iterate, and ultimately how progress gets made. You can view the advances in the last few years as accumulated knowledge from repeated experimentation. The best in the field carry tacit knowledge: heuristics about hyperparameters, intuitions about anomalies, and so on.
The picture is completely different when you're, say, manufacturing viral vectors for gene therapy. Assays take several days. You can't hurry your cells along because you have a deadline, you're mostly operating on their schedule instead. Wet lab work has an added overhead, and added complexity from biology not-quite-behaving according to your model, that dictates how research is done there.
Can fab support both of these things? Other types of research? Should the research workflow be constrained or open-ended? And there is a trade-off here: more structure means more legibility, but less novelty. For fab, if you allow fully free-form outputs from agents, the verification/oversight mechanism completely breaks down, and you get even less out than if you had, say, forced a particular structure that somewhat restricted what the agents could do.
Debate
We might want to use fab to carry out debate, between:
The hope there would be that the debate contains more information than either side presenting its own arguments. Ideally the debate helps to elicit the crux of the research question — sometimes asking the right question is an extremely valuable contribution.
Which fields are productive?
Automated Alignment Is Harder Than You Think makes the observation that there are differences in productivity between academic fields. Part of the explanation is whether it is possible to discard unproductive hypotheses in a given field. If it isn't possible, or if it takes a long time, progress in that field is extremely slow. That implies we need a way to distinguish between productive and unproductive hypotheses. Historically, reproducibility has been a mechanism to do this. There are other methods, but they are more uneven; for example a great mathematician can probably discern through intuition when an approach is fruitful, but a mediocre mathematician might make matters worse.
Overall I think these are fascinating points which we may have to revisit if agents can do science at scale.[5] I have, however, tried to steer clear of the big-big questions, because I don't want to boil the ocean ("before designing fab, I have to figure out science" is not a workable goal). My plan is to move outward from a useful prototype.
Why write this?
Even though this work isn't ready, I wanted to write about it for a few reasons, most of which are selfish:
[1] As in, semiconductor fab. Also, lowercase, because that's cool these days.
[2] Delegating the judgement unlocks scale, but takes away alignment. I think scalable oversight has some hairy problems, and I am trying to dodge them if I can by not doing scalable oversight.
[3] If that's difficult, imagine you are a principal investigator managing a bunch of capable-but-junior researchers. Or copies of you. Or, in the future, copies of von Neumann.
[4] I am assuming somewhat aligned agents, plus a harness that allows us to implement control techniques, which is why the failure modes are pedestrian with respect to the alignment problem proper.
[5] Or maybe not think about at all, if the agents are too good.