It seems to me that the hardest parts of alignment to automate are also the most important. The one exception I can think of is comprehensive reverse-engineering, which seems like it would be slightly more tractable. Still, I don't think LLMs are currently capable of making novel advances there.
Yeah, I think this is what the Ambitious Vision for Interpretability post suggests about feedback loops, where the more you know about model internals, the easier it is to develop the next set of tools and so forth.
I predict we'll need lots of human brainstorming and engineering to get a breakthrough in mech interp that makes other mech interp tasks easier to verify, less fuzzy, and therefore easier to automate.
Fazl Barez's work on Automated Interpretability is already a good start on this.
My model of things is that there is a core set of generally applicable research skills, and then there is topic-specific technical knowledge required to conduct research in a particular area.
I don't think models are bottlenecked by technical knowledge. I think they just don't currently have the general capability required to do broad exploratory research, outside of narrowly specified verifiable problems.
I would guess all these subfields will get automated at the same time, once said capability is developed.
Different groups have specific agendas, and within said agendas you can try to automate the production of incremental papers (which are important!). But in the limit, you're going to have to develop concepts beyond the current frame if you don't want to stagnate, and that requires general exploratory research ability.
Ah I see, you're thinking of a core set of research skills -> domain-specific knowledge. I do agree with you, but one point I want to raise is that some of those core research skills are being able to deal with hard-to-supervise fuzzy tasks and coming up with truly novel research ideas.
On the way to a fully automated researcher, I believe models are going to pick up some research skills faster than others (e.g., implementation, review), and the fields that rely most on these easy-to-verify tasks are the ones that will get automated first.
I think you are pretty much right here, of course: once one of these fields is automated, the others aren't far off. But it is interesting to forecast this :)
One quick note I forgot to mention! I purposefully avoided talking about when these fields are going to be automated; that is a pretty big question and worth its own discussion. This is more about: if we are accelerating towards automated research, which fields do I expect to get automated, and which will require more radical human-engineered breakthroughs?
I’m transitioning into technical AI safety, and I find myself thinking a lot about what fields I want to research and where I’ll have the biggest impact. One question I keep coming back to recently is which fields are likely to be automated.
This seems pretty likely, since frontier labs will likely be automating capabilities research as a part of automated R and D, and safety research won’t be far off. Some early examples: Anthropic is investigating automated alignment researchers, UKAISI recently evaluated models’ propensity to sabotage alignment research, and Anthropic is using Mythos for internal productivity boosts and testing its ability to do alignment research without sabotage (which it can’t). There are also a handful of frameworks appearing for creating automated research pipelines, such as The AI Scientist.
I’ll be considering two factors:
Disclaimer: This is a pretty rough and hand-wavy explanation of this. If you’re looking for a more in-depth exploration into modelling automated R and D across alignment vs capabilities, I’d recommend this post from January.
Disclaimer: I wrote up most of the ideas for this blog post in April and didn’t get around to publishing it. I'm aware that UKAISI has published more fantastic research into this that I haven’t considered in this blog post.
I’ll be basing my automation capability on what an AI with superhuman coding and reasoning capabilities could do.
I’m particularly referring to how quickly each field will get to automating the entire research pipeline with few or no humans in the loop, and the incentives to get there.
Here are my rankings for which fields are most likely to be automated this way (top is most likely):
Just a reminder that this is an extremely broad overview of these fields, based on my educated guess about which research tasks are very difficult to automate (such as research taste and conceptual design). Of course, one could plausibly rank these fields differently (e.g., evals and emergent misalignment, depending on how much you trust automated eval generation).
It’s been pretty interesting writing this. At first I had mech interp as my #1 field, but reading more about Anthropic’s research showed me that scalable oversight is just as important, if not more.
It’s not a nice list to make, especially as someone who wants to do research. I worry about the ever-shrinking gap of “human” work to do. And I’ll be selfish and admit that it also kinda sucks because I want to do the research in all of these fields.
This is one reason I’ve found myself leaning towards technical governance work, evals, and field building (although mech interp will always be a massive interest), since these seem like “safer bets” (in the sense of being less likely to be automated first), even if they have their own problems around how hard they are to enter and their long-term impact. I also had a lovely time at ML4Good, which updated my values and made me want to pursue these fields more. I’m looking forward to writing that up as a separate blog post.
I would love to hear your thoughts on this, especially if you disagree with my list above. Feedback is always welcome!