There are many people working on aligning existing AI systems and understanding the alignment of existing AI systems. In my opinion many of the techniques and lessons from that work are likely to remain important for aligning increasingly powerful AI systems. For example, I care a lot about techniques for scalable supervision, broader scientific understanding of generalization (e.g. for honesty or reward hacking), mechanistic interpretability, behavioral red teaming, etc. In aggregate there are probably a few hundred people working on research that I'd classify as "direct alignment work," which is fewer than you might think but still far from "almost nobody."
There has been much more investment in incremental improvements than indefinitely scalable methods, motivated partly by an increasing interest in aligning slightly superhuman systems and then “building the plane as we fly it.” I think it would be a definitional stretch to say that more incremental work doesn’t count as alignment, whether because you think it won’t scale indefinitely or because it could end up just training models that are more robustly incentivized to do what humans want them to do.
(There has also been a big increase in methods for detecting and mitigating potential misalignment relative to building aligned AI systems, motivated partly by a belief that it will be easier to improve alignment once we have better examples of misalignment. I think it’s reasonable to make some distinction between that kind of research and efforts to directly make AI systems more aligned, though I think it’s counterintuitive and unnecessarily confrontational to say those people aren’t “working on alignment.” If you asked me what is the best way to “work on alignment” I might suggest developing model organisms of misalignment, and I think it’s generally very plausible that better scientific understanding and better measurement tools are most of the action.)
I think a lot of people on this site are dismissive of all of those more incremental efforts and would say that “production alignment” is unrelated to the problem of aligning superhuman AI. I’ve spent a lot of time engaging with this community and find the standard arguments unpersuasive. I think I have a deep understanding of the conceptual difficulties for scalable alignment methods and in my view ARC is doing some of the most promising work for addressing those difficulties. But even understanding all of that I still think it would be a huge mistake to conclude that the incremental progress most people are investing in doesn’t help.
I do think it’s great for people to make investments in foundations and indefinitely scalable methods. There’s a real chance that other methods will break down during an intelligence explosion, or that they will just never work particularly well. And I do think that machine learning researchers, policy makers, and the EA community are predictably underrating more scalable and foundational work. I just think the OP is a significant overstatement reflecting some significant unspoken assumptions.
Really not sure what heuristic leads you to count people at ARC-Theory, who are working on an ambitious, speculative version of interp, as working on alignment, but not any of the people working to build from current interp paradigms. Similarly, anyone working on e.g. making production models more honest is in fact learning a bunch of lessons about what scalable oversight looks like (albeit not publishing, which I agree is sad). The same goes for doing any science of misalignment, or doing any empirical character work, or experimenting with making models adhere to a spec, or carefully understanding their generalisation patterns, or just trying to understand what the actual objects that we are creating right now are.
It seems like having any current interaction with frontier models is seen as disqualifying for actually doing alignment work?
anyone working on e.g. making models more honest in prod models
I don't really think people working on 'what instruction can I add to the system prompt' or equivalents are meaningfully working on the kind of endgame alignment the post is talking about.
Edit: Nothing wrong with that kind of work for current alignment, just doesn't apply to endgame alignment all that much in my opinion.
I'd guess the heuristics are basically:
FWIW I'm not sure how much I buy these but I'd guess I buy them more than you? This is unfortunately another great example of something where people inside labs probably have some pretty relevant private information but also extra incentive/selection problems.
Some of us have been thinking about this for years. See for example Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? in which I suggested that love might be a particularly useful motivator, about the only one that would still work at ASI. A couple of months ago Anthropic published Emotion Concepts and their Function in a Large Language Model showing among other things that "loving" was one of Claude's most dominant emotions: it turns out RLHF had quietly implemented my suggestion without anyone needing to actually engineer this.
I think theoretical alignment (as opposed to applied alignment, working with current models) is in a slump right now because it's hard to see how it slots in to the current LLM paradigm.
There was an assumption 5 years ago that AGI requires fundamental insights into intelligence. There was a hope that understanding intelligence better on a theoretical level could also point us towards how to steer such an AGI in the direction we wanted.
That assumption seems to be false. It turns out you can build human-level intelligence without understanding the first thing about intelligence, just by throwing enough compute and data at the problem. Sure, architecture is important, but architectural improvements are driven far more by trial and error, informed by a narrow understanding of how LLMs work, than by a grand unified theory of intelligence.
In such a world it's difficult to see how abstract alignment work slots in. Instead we develop alignment the same way as we develop intelligence: trial and error driven by a narrow understanding of LLMs and current architectures rather than a grand unified theory of alignment.
In that world, good alignment research asks narrow questions like "can we tell whether an LLM is going off the rails just by monitoring its CoT", rather than broad questions like "how do we know whether an LLM's utility function exactly matches humanity's coherent extrapolated volition".
Since at least the 18th century, every single new science and every major branch of science (not to speak of engineering itself and its branches) was built by "trial and error informed by a narrow understanding" (sometimes called "scientific trial and error") and not by abstract theoretical insights, which usually came much later.
In retrospect, the assumption that AI would somehow be the other way round looks, in my opinion, quite silly.
Many branches of engineering were based on initial theoretical breakthroughs followed by engineering trial and error.
We understood how rockets worked on a theoretical level long before we tried to build them. It's reasonable to assume we would understand how intelligence works before we managed to build it.
Many branches of engineering were based on initial theoretical breakthroughs... It's reasonable to assume we would understand how intelligence works before we managed to build it.
I mean, some branches of engineering were, but:
Even the example you give:
We understood how rockets worked on a theoretical level long before we tried to build them
I mean, we understood rocket trajectories and high level details of what would be needed to make a rocket reach orbit before building them. But did we understand rockets, the actual physical object? Nah, that's why we needed von Braun to take over (infamously) during Apollo -- because he actually had experience building rockets.
So yeah there's no particular reason to think we'd understand how intelligence works before managing to build it, or at least that there would be a detailed and precise mathematical theory of it before getting the first working artifacts.
(Although the fact that we actually did get intelligence by throwing the residue of all human culture into a vast connectionist system, and following it up with RL on 100k different problems, does, in fact, probably tell you a bit about intelligence, if you're willing to listen.)
I can't speak for people who actually work on theoretical alignment, but my perspective is:
The seeming impossibility of theoretical alignment work isn't a good argument that we should do empirical work instead. The two options are: we do the thing that's really hard and probably won't work, or we do the thing that kills us. I prefer the former.
The obvious difference was that it was acceptable for mistakes to be made. When there was an explosion in Alfred Nobel's lab it only killed 5 people.
Can you explain how you are envisioning trial and error working for an out of control general intelligence?
I agree with your main point and I agree that this point seems curiously underrated by AI safety people.
I don't understand whether by "alignment" you mean:
Either way, I think it probably makes sense to use a more specific term than "alignment" for this problem: it's so natural to talk about whether non-superintelligent systems are aligned, and the alignment of non-superintelligence is IMO important for AI risk.
As another note, "aligned with human values" is maybe pretty different from "follow human instructions". I think ARC's intended techniques are agnostic to whether their model is truly in its heart aligned; they just want to make models that follow their spec. So e.g. a model that is a paperclipper but will never act on its paperclipping urges would be fine.
I don't know why you think debate work at GDM counts as alignment if you don't think that various other random prosaic alignment stuff counts. Debate is clearly not indefinitely scalable and in any case it is a technique for generating rewards, which doesn't suffice for alignment unless you make dubious assumptions or use some other technique on top.
This same observation motivated us to build Geodesic Research, an org focused on developing the most aligned initialisations for RL.
Would you count Owain Evans' group? Their work is doing fundamental LLM research in a way that seems important to aligning superintelligent LLM descendants. I'm not sure whether Owain himself is particularly MIRI-pilled.
This is true, and to be fair it's a bit harder to see how much outside organizations can even help at this point. The main companies have grown so much and share so much less that outsiders have less influence, and in many cases less access to big models, especially the ability to train them.
Some do have some access, but it still seems limited compared to what an in-Anthropic or in-OpenAI team could do. Of course, as an outsider you can also end up with a result so good that it influences them but, again, it just seems like limited impact from the get-go.
I am confused. Why is MIRI not listed?
Today Rob Bensinger said
We still have a very small alignment research team, and afaik we still try to cause alignment research to happen (be it at MIRI or elsewhere; doesn't actually matter a ton) if it seems valuable and neglected. Primarily, though, we try to increase the chance there's enough time to figure out alignment, and find ways to make the next generation of alignment research more productive.
(This is not at all counter to the core claim of the post.)
I agree that a small fraction of the AI safety field is working on any version of superintelligence alignment.
I think alignment training (character training / constitutional AI / deliberative alignment / RLHF) is a major bucket of work on alignment, and I think it's very reasonable to see it as part of an iterative strategy for superintelligence alignment. (You kind of gesture towards this, but you don't mention alignment training explicitly.)
I recently discovered this post from 2023: Ten Levels of AI Alignment Difficulty. I think it does a good job showing how very different types of work get called 'alignment research,' and how they're conditioned on different views about what needs to happen to build aligned powerful AIs. It includes simple RLHF-style alignment training, scalable oversight, and agent foundations.
Shameless plug: I recently wrote a post that's mostly about the pros and cons of various forms of alignment training, but also touches on alternatives to alignment training, which contains most of my up-to-date thoughts about alignment. It's called Should We Train Against (CoT) Monitors?
I broadly agree here.
I will say that I prefer a framing that separates 'alignment' agendas from 'meta-alignment' agendas. For example, the pragmatic interp agenda is largely motivated by the idea that their techniques will be instrumentally useful for alignment (see here) but it's not like the GDM interp team have a concrete plan for how they would like to "solve alignment." So I would consider this--along with a lot of/most of Redwood, Truthful AI, and Anthropic work--to be meta-alignment.
I do worry that some of this could be streetlight effect but there are reasonable arguments for meta-approaches bootstrapping their way to true alignment and there are people who see this as the goal of their work. There has been a kind of silent shift from alignment to meta-alignment but I think that this is subtly different than "Almost nobody is working on alignment."
Well said! I've actually been working on a similar post.
At AE Studio and the AI Alignment Foundation, we are also actually working on AI alignment, with various "neglected approaches."
In particular, we are interested in approaches that will survive recursive self-improvement and that have negative alignment taxes.
I think the general point stands that "almost nobody" is working on alignment, but more people are working on alignment than you seem to think. What immediately came to mind that you left out was Softmax, and I also thought of teams internal to Anthropic and OpenAI that to some degree are working on alignment. Plus, "scattered people" is potentially quite valuable, as more people trying to work on this problem on their own increases the chances we get novel approaches that can be developed (if everyone is too concentrated they'll tend to focus too much on similar ideas vs. trying new ideas).
I wrote a Message to any Future AI which has the goal of aligning a superintelligent AI.
James Miller et al wrote: An Appeal to AI Superintelligence: Reasons to Preserve Humanity.
Roko's RB was also an attempt to control superintelligent AI.
So there is a line of efforts in that direction, but I am not sure anyone continues to work on it. This approach works only for superintelligent AI, and includes ideas like acausal deals and philosophical landmines.
Edit:
The original title was unnecessarily provocative. This was a very quick post inspired by talking to someone who assumed that a large fraction of the safety community are working on directly figuring out how to align superintelligent AIs.
Obviously much (all?) of what the rest of the safety community is doing is also ultimately aimed at bringing about a future where superintelligent AIs are aligned, but more indirectly, and we wanted to create common knowledge about that. (While being neutral about whether this is good or bad. As mentioned, notably we both work on AI safety and neither of us work on alignment.)
There’s also lots of work where it’s debatable whether it’s directly working on alignment, but that’s kind of the point of the post. There’s not that much work that unarguably directly tries to figure out superintelligent alignment. We are leaving the list below as is for now, despite not having strong confidence in how exactly we should draw the line, since it doesn't seem that important for the core message of this post.
People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not working on making sure superintelligent AIs are aligned with human values or follow human instructions.
Currently, the people who we know of that work on alignment are roughly:
A lot of the remainder of the AI safety community does indirect work like capability evaluations, risk assessments, control, policy, AI science, understanding misalignment (which maybe should partially count as alignment work), demos and so on.
Some production alignment work (i.e., making current models behave well) might help with more ambitious alignment, too (e.g., some CoT monitoring). Many people also work on aligning current/next-generation models so that these models help with aligning future models, and hope this scales to superintelligence.
We are not necessarily saying this is bad and that people are making a big mistake (e.g., neither of us work on alignment) but it's a notable fact that seems good to make known to those who don't know about it.