Great post! Nice to see something constructive! And half your citations are new to me. Thank you for sharing.
I have spent the last few months reinventing the wheel with LLM applications in various ways. I've been using my own code assistant for about 7 months. I built an AlphaEvolve-style system for generating RL code that learns Atari. Last year I was trying some fancy retrieval over published/patented electronics circuits. I did some activation steering and tried my hand at KellerJordan/modded-nanogpt for a day. Of course, before that I was at METR helping them set up their evals.
It hadn't occurred to me to try to draw any conclusions from all this different work, and I didn't really think of it as inter-related in any significant way, or as relevant experience for much of anything, but your topic here is making me think...
Almost every "optimizing" system I make ends up breaking/glitching/cheating the score function. Then I patch and patch until it works, and by then it looks more like a satisficer.
Getting something really useful seems to take about a month of corrections like this. It looks done/working on the first day; I notice something broken, fix it, and declare it done again on the second day; and so on, until after a month I just don't have any more corrections to make. This is different from, say, a web app or game, which I never run out of todo items for. Of course, when LLMs are involved you have to look three times more carefully to be sure you are measuring what you mean to be measuring.
My point is that I expect projects fitting your description here to basically actually work and be worthwhile, but if it is your (speaking to the anonymous reader) first time doing this, expect that you'll spend 10x as long correcting/improving/balancing scores & heuristics as you'll spend on the core functionality.
As you stated in the post, that's not so different from the process used to make AI assistants (etc) in general.
Making my own AI tools has definitely given some depth/detail to all the theoretical problems I've been reading about and talking about all these years. In particular, it's impressive how long my tools have tricked me at times. It is possible I am still being tricked right now.
More people should probably be thinking about research automation. If automating research is feasible prior to creating ASI it could totally change the playing field, vastly accelerating the pace of progress and likely differentially accelerating certain areas of research over others. There's a big risk, though, that AI capabilities research might be much easier to automate than safety research. One reason this could be the case is that it's much harder to verify that safety research is actually valuable, since we can't safely try out our techniques on an ASI. A second reason is that alignment research might be more strongly bottlenecked on conceptual breakthroughs than capabilities research, and getting such breakthroughs out of an AI seems likely to be significantly harder than automating the "normal science" needed to advance AI capabilities.
I want to make a fairly narrow argument, that AI safety research isn't so drastically different from research in other fields: other fields also have difficulty verifying the value of their work, and many other fields are also bottlenecked on conceptual breakthroughs. These difficulties may be more extreme in AI safety than in some other fields, but they're almost always present, even in ML. Because of this, I expect the big AI labs to put considerable effort into figuring out how to train automated researchers despite these difficulties, raising the chances that we'll be able to automate significant amounts of safety research. I wouldn't say that this makes me hopeful, exactly: the AI labs could very well fail to solve these problems effectively before creating ASI. But it does make me slightly more hopeful.
I think generally there are two different axes along which we evaluate the impact of research: impact on other researchers, and impact on broader society. I'll call these "scientific" and "practical" impact, respectively. Scientific impact is essentially a measure of how much other impactful research you enable. Papers with no scientific impact don't get cited; papers with some impact might enable a few follow-up papers; more impactful papers might introduce new approaches, enable us to answer novel questions, or open up whole new sub-fields of research. The foundational papers of string theory might be examples of research with high scientific impact but no practical impact; a clinical trial for a scientifically well-understood drug might be an example of research with a large practical impact but minimal scientific impact. In practice, of course, most research will have a mix of the two, and they might be hard to disentangle.
I think accurately evaluating research along either of these axes is just really hard in general. To evaluate the scientific value of a paper we rely on the judgment of experienced researchers in the short term, and in the long term we just wait and see what subsequent research is produced. The former is fallible, while the latter takes a very long time. The process is also at risk of circularity, where communities of researchers nerd-snipe each other into increasingly esoteric and useless rabbit-holes.[1] Evaluating the practical impact of some research is often easier, but it can still be fraught. Unless you're working directly on applications, most research just takes a long time to reach the point where it has any practical use. And even when it does, there are plenty of examples of applied, empirical research that ends up having nowhere near the impact people originally expect (or even having a negative impact), because the world is complicated and you are not measuring what you think you are measuring.[2] Along both axes, the only really foolproof way to judge the value of some work is to just wait a while and see what comes of it.
In AI safety, it's only the end goal -- whether we can safely build ASI -- that's extremely hard to verify. Every other level of impact, along both axes, is not any harder to evaluate than in other fields. It's certainly no harder to judge a paper purely on the level of technical correctness. We can also judge with no more than the usual amount of difficulty how much follow-up work a paper might be expected to produce. On the practical side, it's not any harder than in other fields to tell whether a practical technique or tool is useful according to the assumptions and threat model of a particular sub-field. It's only the very last step, judging whether those assumptions and threat models are valid and whether the research will actually move the needle on safety, that's more difficult than in other fields. For instance, we can see that the AI Control paper was impactful because it spawned a new sub-field of research, while still being uncertain about whether it'll actually help reduce the probability of catastrophe.
Now, you can certainly complain that it doesn't matter how "impactful" something looks if it doesn't actually help us survive. But I think this doesn't really change the prospects of automating research. If we want to train a model to do good research, we can't judge the outputs of the model on the basis of its long-term impact: we'd need shorter feedback loops for training. We won't be able to stick "does this solve alignment" in an objective function, but we also can't stick "does this cure Alzheimer's" (for instance) in an objective function, because the timescale over which we'd have to judge that is just too long. So at least when it comes to training an automated researcher, AI safety doesn't seem much worse off to me than any other field.
It's certainly possible that we'll end up following our automated researchers down useless rabbit-holes because it's too hard to judge whether something actually helps with safety. But many other fields have this problem too, if perhaps not as severe as in AI safety, because the feedback loops are so long. (It's probably even worse in pure math: the field doesn't even have an end goal to work towards, and I don't think they pay much attention to what the applied mathematicians do with their results.) And also, the same danger applies to human AI safety researchers. If your position is that AI safety is almost impossible to evaluate in principle, and that we shouldn't build an ASI until we have a full mathematical proof of its safety -- fair enough, but that's not an argument against research automation in particular.
That's all well and good, but maybe we will have the short-term feedback loops we need to train automated AI capabilities researchers? After all, capabilities research is just "line go up," which should be easy to evaluate.
Well, yes and no: the important question is not "did line go up" but "which line went up."[3] Even in ML, it's very easy to fool yourself and others (at least for a time) about the quality of your work: you might accidentally or "accidentally" train on the test set, overfit to a dataset, or pick a benchmark that makes your method look good but doesn't translate into useful performance in practice. We see these failure modes all the time with LLMs, for instance with ChatGPT's sycophancy, or with Llama 4 and Chatbot Arena. Ultimately, the only real measure of success for a model or new technique is whether other people use it or build on it in the long run. Evaluating work might be easier in ML than in most other fields, because the feedback loops are tighter, and you can fairly easily test something out on a range of different benchmarks to get a more holistic sense of how it performs. But it's still difficult; so if the AI labs want to automate capabilities research and aren't content with just making better coding agents, they'll have to address this issue.
A somewhat different argument I've heard about the difficulty of automating safety research is that AI safety is strongly bottlenecked on good conceptual thinking or taste, and that this will be hard to automate due to the long feedback loops needed. I think AI safety might be more bottlenecked on conceptual innovations than most other fields, but it's certainly not unique. Neuroscience, for instance, is often described as being "data rich but theory poor," with many calls for better theoretical frameworks to handle the "sea of data" being collected.
But more generally, I don't think good conceptual thinking is really confined to certain fields, or to certain types of research. Regardless of the field, I think in most cases the limiting factor to doing really impactful research is not in the technical work -- writing proofs or code, running experiments, etc. -- it's in coming up with the right questions to ask, or the right angle to approach a problem.[4] Plenty of research gets published that, while technically correct and maybe even technically impressive, is just generally useless and unimpactful (try asking most PhDs about their thesis). This includes ML: you need good research taste to pick the right metric to optimize. So I think the kind of skill you need to come up with really paradigm-shifting ideas is pretty contiguous with the kind of skill you need to do good research at any level, in any field: it's mostly a difference of degree, not kind.
If models don't improve much at ideation, research taste, and good conceptual thinking, they do seem likely to accelerate capabilities research somewhat more than safety. But even if the AI companies don't care about automating AI safety, they'll still have an incentive to solve these problems, because they show up in many domains. And I think there's a good chance that whatever techniques they come up with will let us automate safety research too.
What might research automation look like in practice? How might the AI labs try to train automated researchers?
One possibility would be something like RLHF: train a reward model on human judgments of research quality, then train a model to maximize the score it gets from the reward model. This probably doesn't go particularly well, in any field: you'll get outputs that might look good to the reward model (and maybe to you) but don't actually end up being useful, i.e. slop. (But again, I don't see it being any worse in AI safety than in other fields.)
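To make the shape of that recipe concrete, here's a minimal toy sketch of the two-stage setup: fit a reward model to pairwise human judgments, then nudge a "policy" to maximize the reward model's score. Everything here is an illustrative placeholder (the toy embeddings, the tiny linear models, the hyperparameters), not a real research-generating system.

```python
# Toy sketch of the RLHF-style recipe above; all tensors, model sizes, and
# hyperparameters are illustrative placeholders, not a real system.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 16  # pretend each research output is embedded in a 16-dim vector

# Stage 1: fit a reward model to pairwise human judgments of research quality.
reward_model = nn.Linear(DIM, 1)
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
chosen = torch.randn(64, DIM) + 0.5    # outputs humans judged better
rejected = torch.randn(64, DIM) - 0.5  # outputs humans judged worse
for _ in range(200):
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()  # Bradley-Terry pairwise loss
    rm_opt.zero_grad(); loss.backward(); rm_opt.step()

# Stage 2: train a "policy" to maximize the frozen reward model's score.
policy = nn.Linear(DIM, DIM)  # maps a prompt embedding to an output embedding
p_opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for p in reward_model.parameters():
    p.requires_grad_(False)
for _ in range(200):
    prompts = torch.randn(32, DIM)
    outputs = policy(prompts)
    # The policy optimizes the reward model's opinion, not actual usefulness,
    # which is exactly how you end up with well-scored slop.
    p_loss = -reward_model(outputs).mean()
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()
```

The failure mode is baked into the structure: nothing in stage 2 ever consults a human again, let alone reality.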
What about outcome-based RL? We currently use this to train models on tasks we can easily verify, but doing this for entire research projects seems very hard: for almost all research, the time from ideation to actual real-world impact is way too long to package in an RL loop. You can't wait for a drug to go through clinical trials before applying a gradient update, for instance. And even in ML, you can't wait around for the next frontier training run to check whether the techniques you've come up with actually improve model quality in practice: you need to use more easily verifiable proxies of quality, and that introduces the risk of Goodharting on those proxies.
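For contrast, here's the same kind of toy sketch for the outcome-based version: a REINFORCE-style update against a cheap, verifiable proxy metric. Again, every name, shape, and the proxy itself are placeholder assumptions; the point is structural, namely that the long-run outcome we actually care about never appears anywhere in the loop, only the proxy does.

```python
# Toy sketch of outcome-based RL against a verifiable proxy; names, shapes,
# and the proxy function are illustrative assumptions only.
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 16

def proxy_reward(outputs: torch.Tensor) -> torch.Tensor:
    # A cheap, automatically checkable stand-in (e.g. a benchmark score).
    # The real quantity of interest (long-run research impact) is not here.
    return outputs[:, 0]

policy = nn.Linear(DIM, DIM)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for _ in range(200):
    prompts = torch.randn(32, DIM)
    dist = torch.distributions.Normal(policy(prompts), 1.0)
    outputs = dist.sample()              # stochastic "research outputs"
    rewards = proxy_reward(outputs)      # graded only by the proxy
    log_probs = dist.log_prob(outputs).sum(-1)
    # REINFORCE with a mean baseline: raise the log-prob of above-average outputs.
    loss = -(log_probs * (rewards - rewards.mean())).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```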
If the best we can do is outcome-based RL in some narrow domains where we can easily check output quality, I don't think we'll get very useful autonomous AI researchers. I expect humans would still largely guide the research process, telling AI assistants what questions to try to answer or what metrics to optimize. This would probably accelerate AI capabilities somewhat more than safety, but even ignoring safety, we'd be leaving a lot of value on the table. So I think there would be a strong incentive to go beyond this.
I think it's likely possible to build dangerous ASI while still only using this sort of simple outcome-based RL. AIs might just keep getting better at optimizing narrow metrics until one of them realizes it could optimize metrics better by tiling the galaxy with GPUs, or something. But I think if the AIs only get practice optimizing narrow metrics and don't have training for research taste or other skills needed to initiate and guide a research project, there's a decent chance this raises the bar. In other words, it seems plausible that, in order to get an AI capable of taking over the world without experience doing the kinds of thinking needed for really good research, you'd need a significantly bigger model than you would if it had such experience.
What might a solution to building good automated researchers actually look like? I don't have a concrete answer, and if I did I'm not sure I'd be writing about it here! But I want to make the case that it's probably not impossible, despite the fact that we lack a good objective function to optimize.[5] This is mainly because reward is not the optimization target. To train an AI to do good research, we shouldn't necessarily imagine designing a function that takes in a research output like a paper and has a global maximum at the best possible paper, optimizing that, and then despairing because we don't have a non-Goodhartable way of judging research outputs. Rather, we should imagine trying to shape the AI's cognition to perform mental motions that are useful for research, and not to perform detrimental ones: to check its work, consider alternate approaches, make predictions, and examine its assumptions; to not just try to find the "right answer" or focus on passing the test. It's a problem of generalization: using some limited environments and datasets to instill behaviors that will generalize to doing good research in areas we can't easily judge.

Broadly speaking, the fact that we can sometimes do this kind of generalization is why ML is useful at all. For instance, we don't really have a single reward function for "have good conversations with people"; we have a weird combination: pre-training plus SFT plus RLHF, with the RL prevented from drifting too far from the pre-trained prior because otherwise it goes off the rails, etc. Obviously this has its problems, but it works well enough most of the time. So maybe, if we do get good automated researchers, they'll be trained in a similarly convoluted manner, stacking together a couple of different training procedures to get something that works well enough. Again, I don't think this is easy, but it doesn't seem unsolvable.
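As one concrete instance of that stacking, the usual trick for keeping the RL stage anchored to the pre-trained prior is a KL penalty on the reward. A minimal sketch, with hypothetical tensor names and an arbitrary beta:

```python
# Sketch of a KL-penalized reward, the standard way RLHF-style training keeps
# the policy from drifting too far from the pre-trained/SFT reference model.
# Tensor names and beta are illustrative assumptions.
import torch

def kl_penalized_reward(
    reward: torch.Tensor,       # reward-model score per sample, shape (batch,)
    logp_policy: torch.Tensor,  # log-probs of sampled tokens under the RL policy, (batch, seq)
    logp_ref: torch.Tensor,     # log-probs of the same tokens under the frozen reference model
    beta: float = 0.1,
) -> torch.Tensor:
    # Per-sample estimate of KL(policy || reference) from the sampled tokens;
    # subtracting it penalizes going "off the rails" of the prior.
    kl_estimate = (logp_policy - logp_ref).sum(dim=-1)
    return reward - beta * kl_estimate
```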
I've argued that some of the reasons people have for thinking AI safety might be particularly hard to automate aren't actually so unique to AI safety. It's difficult to evaluate impact in any field, and good conceptual thinking and taste are needed for almost all research. So to unlock most of the value of automating research, AI companies will have to find ways of training automated researchers despite these difficulties. There's no guarantee that they'll solve them in time: it's certainly possible that simple scaling (or good old-fashioned human research) gets us to ASI before we figure out how to get useful research out of weaker AIs. But I don't think the problems are impossible in principle, and I expect the AI labs will have a pretty strong incentive to solve them. This makes me somewhat less worried about not getting a chance to automate significant amounts of safety research before ASI, or about the gap between automated capabilities research and automated safety research growing too large. I'm still pretty worried -- there's a lot of uncertainty about how things might play out. But, somewhat less worried than I would be otherwise.
[1] My impression is that some people think string theory is an example of this. I don't know enough physics to have an opinion on the matter.
[2] Leaded gasoline, CFCs, ivermectin, and all the non-replicable work in psychology are some examples.
[3] You'll notice in the diagram above that both the x-axis and the y-axis are labeled with "layers," making this a prime example of optimizing the wrong metric.
[4] This isn't to downplay the importance of empiricism, good execution, and generally making contact with reality: it's often in the course of running experiments or tinkering with a problem that we come up with new ideas, and it's often hard to judge the value of an idea until we try to implement it.
[5] The fact that we can train human researchers without waiting for them to write and get feedback on hundreds of different papers shows that this is possible in principle, although the unreliability of producing good human researchers does point to its difficulty.