Really not sure what heuristic leads you to count the people at ARC-Theory, who work on an ambitious, speculative version of interp, as working on alignment, but not any of the people working to build from current interp paradigms. Similarly, anyone working on e.g. making models more honest in prod models is in fact learning a bunch of lessons about what scalable oversight looks like (albeit not publishing, which I agree is sad). Or doing any science of misalignment, or doing any empirical character work, or experimenting with making models adhere to a spec, or carefully understanding their generalisation patterns, or just trying to understand what the actual objects that we are creating right now are??
It seems like having any current interaction with frontier models is seen as disqualifying for actually doing alignment work?
I'd guess the heuristics are basically:
FWIW I'm not sure how much I buy these but I'd guess I buy them more than you? This is unfortunately another great example of something where people inside labs probably have some pretty relevant private information but also extra incentive/selection problems.
anyone working on e.g. making models more honest in prod models
I don't really think people working on 'what instruction can I add to the system prompt' or equivalents are meaningfully working on the kind of endgame alignment the post is talking about.
Edit: Nothing wrong with that kind of work for current alignment, just doesn't apply to endgame alignment all that much in my opinion.
This is true, and to be fair it's a bit harder to see how much outside organizations can even help at this point. The main companies have grown so much and share so much less that outsiders have less influence, as well as in many cases less access to big models, and especially to being able to train them.
Some do have some access, but it still seems limited compared to what an in-Anthropic team and in-OpenAI team could do. Of course, as an outsider you can also end up with a result so good that it influences them but, again, it just seems like limited impact from the get-go.
This same observation motivated us to build Geodesic Research, an org focused on developing the most aligned initialisations for RL.
I think theoretical alignment (as opposed to applied alignment, working with current models) is in a slump right now because it's hard to see how it slots into the current LLM paradigm.
There was an assumption 5 years ago that AGI requires fundamental insights into intelligence. There was a hope that understanding intelligence better on a theoretical level could also point us towards how to steer such an AGI in the direction we wanted.
That assumption seems to be false. It turns out you can build human-level intelligence without understanding the first thing about intelligence just by throwing enough compute and data at the problem. Sure, architecture is important, but architectural improvements are driven far more by trial and error, and informed by a narrow understanding of how LLMs work, than by a grand unified theory of intelligence.
In such a world it's difficult to see how abstract alignment work slots in. Instead we develop alignment the same way as we develop intelligence: trial and error driven by a narrow understanding of LLMs and current architectures rather than a grand unified theory of alignment.
In that world, good alignment research asks narrow questions like: "can we tell whether an LLM is going off the rails just by monitoring its COT", rather than broad questions like "how do we know whether an LLM's utility function exactly matches humanity's coherent extrapolated volition".
People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not working on making sure superintelligent AIs are aligned with human values or follow human instructions.
Currently, the people who we know of that work on alignment are roughly:
A lot of the remainder of the AI safety community does indirect work like capability evaluations, risk assessments, control, policy, AI science, understanding misalignment (which maybe should partially count as alignment work), demos and so on.
Some production alignment work (i.e., making current models behave well) might help with more ambitious alignment, too (e.g., some COT-monitoring). Many people also work on aligning current/next-generation models so that these models help with aligning future models, and hope this scales to superintelligence.
We are not necessarily saying this is bad and that people are making a big mistake (e.g., neither of us works on alignment), but it's a notable fact that seems good to make known to those who don't know about it.