A lot depends on how broadly you construe the field. There's plenty of work in academia and at large labs on how to resist jailbreaks, improve RL from human feedback, etc. This is at least adjacent to AI safety work in your first category.
If you put a gun to my head and told me to make some guesses: maybe around 600 people doing that sort of work; about 80 people who are more aware of the alignment problems that get harder as AI gets smarter, and so are doing work more centrally in your first category; about 40 people doing work that looks more like your second category (with maybe another 40 doing off-brand versions of it in academia); and about 400 people doing AI safety work that doesn't neatly fit into either group.
When this question was posted, I asked myself: what would be a "cynical" answer? By that I mean asking, given what I see and know, what would be a realistically awful state of affairs? So, not catastrophizing, but also having low expectations.
My intuition came up with less than 10% working on user-centered alignment, and less than 1% on user-independent alignment. But I didn't have the data to check those estimates against (and I also knew there would be issues of definition).
So let me try to understand your guesses. In...
Trying to get a rough estimate for some related research I’m doing.
Specifically, I’m wondering if anyone could give a rough percentage of current AI safety work primarily targeted toward each of these two buckets. Some safety work may overlap both, and some may fit neither, so the percentages needn’t add up to 100%, and I’m sure the boundaries are often unclear. I’m just wondering what a rough estimate would be of how much AI safety work is geared specifically toward each of these two separate goals.
I think one way to conceive of this question is: how much AI safety work is primarily focused on intent alignment or control plus avoiding other catastrophic outcomes, vs. how much is focused primarily on ultimately aligning AI with the best version of humanity, something like our coherent extrapolated volition or what we might arrive at after “the long reflection.”
The intuition here is that getting AI to reliably do what the user wants, even while avoiding catastrophic outcomes, might nonetheless lead down a mediocre trajectory; whereas work specifically targeted at the latter goal would be most useful for getting to the best possible world we could achieve.
Of course, the trade-off on the other end is that an AI that tries to do the best thing possible may not be commercially viable, and in fact would probably be pretty annoying to users, since it might ignore or redirect 90%+ of requests toward something more likely to lead to an ultimately good outcome. Furthermore, this latter goal is probably far less pragmatic for preventing extinction in a timely fashion.
Greatly appreciate rough estimates or any thoughts!