PSA: Almost nobody is directly working on superintelligent alignment

Chi Nguyen; peterbarnett


12th Jun 2026


Edit:

The original title was unnecessarily provocative. This was a very quick post inspired by talking to someone who assumed that a large fraction of the safety community are working on directly figuring out how to align superintelligent AIs.

Obviously much (all?) of what the rest of the safety community is doing is also ultimately aimed at bringing about a future where superintelligent AIs are aligned but more indirectly and we wanted to create common knowledge about that. (While being neutral about whether this is good or bad. As mentioned, notably we both work on AI safety and neither of us work on alignment.)

There’s also lots of work where it’s debatable whether it’s directly working on alignment but that’s kind of the point of the post. There’s not that much work that unarguably directly tries to figure out superintelligent alignment. We're leaving the list below as is for now, despite not having strong confidence or opinions on exactly where to draw the line, since that doesn't seem that important for the core message of this post.

People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not working on making sure superintelligent AIs are aligned with human values or follow human instructions.

Currently, the people who we know of that work on alignment are roughly:

The Alignment Research Center who work on a research bet by Paul Christiano
Probably Sequent who just got announced yesterday
Parts of GDM (agent foundations work, some debate work)
Some scattered people who work at universities or independently, some of whom hang around Berkeley
??

A lot of the remainder of the AI safety community does indirect work like capability evaluations, risk assessments, control, policy, AI science, understanding misalignment (which maybe should partially count as alignment work), demos and so on.

Some production alignment work (i.e., making current models behave well) might help with more ambitious alignment, too (e.g., some CoT monitoring). Many people also work on aligning current/next-generation models so that these models help with aligning future models, and hope this scales to superintelligence.

We are not necessarily saying this is bad and that people are making a big mistake (e.g., neither of us work on alignment) but it's a notable fact that seems good to make known to those who don't know about it.


[-]paulfchristiano2mo10431

There are many people working on aligning existing AI systems and understanding the alignment of existing AI systems. In my opinion many of the techniques and lessons from that work are likely to remain important for aligning increasingly powerful AI systems. For example, I care a lot about techniques for scalable supervision, broader scientific understanding of generalization (e.g. for honesty or reward hacking), mechanistic interpretability, behavioral red teaming, etc. In aggregate there are probably a few hundred people working on research that I'd classify as "direct alignment work," which is fewer than you might think but still far from "almost nobody."

There has been much more investment in incremental improvements than indefinitely scalable methods, motivated partly by an increasing interest in aligning slightly superhuman systems and then “building the plane as we fly it.” I think it would be a definitional stretch to say that more incremental work doesn’t count as alignment, whether because you think it won’t scale indefinitely or because it could end up just training models that are more robustly incentivized to do what humans want them to do.

(There has also been a big increase in methods for detecting and mitigating potential misalignment relative to building aligned AI systems, motivated partly by a belief that it will be easier to improve alignment once we have better examples of misalignment. I think it’s reasonable to make some distinction between that kind of research and efforts to directly make AI systems more aligned, though I think it’s counterintuitive and unnecessarily confrontational to say those people aren’t “working on alignment.” If you asked me what is the best way to “work on alignment” I might suggest developing model organisms of misalignment, and I think it’s generally very plausible that better scientific understanding and better measurement tools is most of the action.)

I think a lot of people on this site are dismissive of all of those more incremental efforts and would say that “production alignment” is unrelated to the problem of aligning superhuman AI. I’ve spent a lot of time engaging with this community and find the standard arguments unpersuasive. I think I have a deep understanding of the conceptual difficulties for scalable alignment methods and in my view ARC is doing some of the most promising work for addressing those difficulties. But even understanding all of that I still think it would be a huge mistake to conclude that the incremental progress most people are investing in doesn’t help.

I do think it’s great for people to make investments in foundations and indefinitely scalable methods. There’s a real chance that other methods will break down during an intelligence explosion, or that they will just never work particularly well. And I do think that machine learning researchers, policy makers, and the EA community are predictably underrating more scalable and foundational work. I just think the OP is a significant overstatement reflecting some significant unspoken assumptions.

[-]Hoagy2mo389

Really not sure what heuristic leads you to count people at ARC-Theory working on an ambitious, speculative version of interp as working on alignment but not any of the people working to build from current interp paradigms. Similarly, anyone working on e.g. making models more honest in prod models is in fact learning a bunch of lessons about what scalable oversight looks like (albeit not publishing, which I agree is sad). Or doing any science of misalignment, or doing any empirical character work, or experimenting with making models adhere to a spec, or carefully understanding their generalisation patterns, or just trying to understand what the actual objects that we are creating right now are??

It seems like having any current interaction with frontier models is seen as disqualifying for actually doing alignment work?

[-]Tenoke2mo*1416

anyone working on e.g. making models more honest in prod models

I don't really think people working on 'what instruction can I add to the system prompt' or equivalents are meaningfully working on the kind of endgame alignment the post is talking about.

Edit: Nothing wrong with that kind of work for current alignment, just doesn't apply to endgame alignment all that much in my opinion.

[-]Raymond Douglas2mo60

I'd guess the heuristics are basically:

Aligning AGI is very different to aligning current frontier models: what works for current systems doesn't tell you that much about what works for superintelligent systems
To the extent that your goal is to align current systems, you will gravitate towards approaches that don't actually scale, because the low-hanging fruit now is stuff that depends on the model being weak
(The term alignment should sort of be reserved for the AGI/ASI case)

FWIW I'm not sure how much I buy these but I'd guess I buy them more than you? This is unfortunately another great example of something where people inside labs probably have some pretty relevant private information but also extra incentive/selection problems.

[-]RogerDearnaley2mo4-3

Some of us have been thinking about this for years. See for example Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? in which I suggested that love might be a particularly useful motivator, about the only one that would still work at ASI. A couple of months ago Anthropic published Emotion Concepts and their Function in a Large Language Model showing among other things that "loving" was one of Claude's most dominant emotions: it turns out RLHF had quietly implemented my suggestion without anyone needing to actually engineer this.

[-]Buck2mo3316

I agree with your main point and I agree that this point seems curiously underrated by AI safety people.

I don't understand whether by "alignment" you mean:

Indefinitely scalable alignment: techniques that aim to be robust for AIs of any capability level
Alignment of any superintelligences: techniques that aim to work for the earliest superintelligent AIs. This might be much easier and is plausibly sufficient (as long as these AIs have enough time to develop techniques to align their successors).

Either way, I think it probably makes sense to use a more specific term than "alignment" for this problem: it's so natural to talk about whether non-superintelligent systems are aligned, and the alignment of non-superintelligence is IMO important for AI risk.

As another note, "aligned with human values" is maybe pretty different from "follow human instructions". I think ARC's intended techniques are agnostic to whether their model is truly in its heart aligned, they just want to make models that follow their spec. So e.g. a model that is a paperclipper but will never act on its paperclipping urges would be fine.

I don't know why you think debate work at GDM counts as alignment if you don't think that various other random prosaic alignment stuff counts. Debate is clearly not indefinitely scalable and in any case it is a technique for generating rewards, which doesn't suffice for alignment unless you make dubious assumptions or use some other technique on top.

[-]Yair Halberstadt2mo290

I think theoretical alignment (as opposed to applied alignment, working with current models) is in a slump right now because it's hard to see how it slots in to the current LLM paradigm.

There was an assumption 5 years ago that AGI requires fundamental insights into intelligence. There was a hope that understanding intelligence better on a theoretical level could also point us towards how to steer such an AGI in the direction we wanted.

That assumption seems to be false. It turns out you can build human level intelligence without understanding the first thing about intelligence just by throwing enough compute and data at the problem. Sure, architecture is important, but architectural improvements are driven far more by trial and error, and informed by a narrow understanding of how LLMs work, than by a grand unified theory of intelligence.

In such a world it's difficult to see how abstract alignment work slots in. Instead we develop alignment the same way as we develop intelligence: trial and error driven by a narrow understanding of LLMs and current architectures rather than a grand unified theory of alignment.

In that world, good alignment research asks narrow questions like: "can we tell whether an LLM is going off the rails just by monitoring its CoT", rather than broad questions like "how do we know whether an LLM's utility function exactly matches humanity's coherent extrapolated volition".

[-]paulfchristiano2mo272

ARC works on theoretical alignment, but I think it's reasonably clear how ARC's work fits in with the current LLM paradigm (just apply our methods to LLMs!). It's just an ambitious foundational project that has a very different risk/reward profile from more incremental work. More incremental work naturally looks more and more exciting over time as AI improves (since the real-world problems get closer and closer to your long-run concerns).

There was an assumption 5 years ago that AGI requires fundamental insights into intelligence.

I think a lot of people who have been working on theoretical alignment have long believed that LLMs could scale to AGI. Here's me from 10 years ago:

It now seems possible that we could build “prosaic” AGI, which can replicate human behavior but doesn’t involve qualitatively new ideas about “how intelligence works:”
It’s plausible that a large neural network can replicate “fast” human cognition, and that by coupling it to simple computational mechanisms — short and long-term memory, attention, etc. — we could obtain a human-level computational architecture.
It’s plausible that a variant of RL can train this architecture to actually implement human-level cognition. This would likely involve some combination of ingredients like model-based RL, imitation learning, or hierarchical RL.
[...]
I think that prosaic AGI should probably be the largest focus of current research on alignment.

[-]Stephen Fowler2mo20

Can you explain how you are envisioning trial and error working for an out of control general intelligence?

[This comment is no longer endorsed by its author]

[-]Petropolitan2mo2-28

Since at least the 18th century, every single new science and every major branch of sciences (not to speak of engineering itself and its branches) was built by "trial and error informed by a narrow understanding" (sometimes called "scientific trial and error") and not some abstract theoretical insights which usually came much later.

In retrospect this assumption that AI would somehow be the other way round, in my opinion, looks quite silly

[-]anaguma1mo*1810

Since at least the 18th century, every single new science and every major branch of sciences (not to speak of engineering itself and its branches) was built by "trial and error informed by a narrow understanding" (sometimes called "scientific trial and error") and not some abstract theoretical insights which usually came much later.

This seems straightforwardly false to me?

The central counterexample is General Relativity. Einstein postulated the Equivalence Principle (i.e. that gravity is locally indistinguishable from acceleration) in 1907, and the Principle of General Covariance (i.e. that the laws of physics should be the same in all reference frames) in 1915. This resulted in him publishing his famous field equations later in 1915. Only after this theoretical work did he account for the anomaly in the perihelion of Mercury. The gravitational waves he predicted in 1916 were only detected by the LIGO observatory in 2015. More or less the entire theory was derived from “abstract theoretical insights” which we have spent the last century validating.

[-]Petropolitan1mo1-2

I know about this counterexample because GR is known in history of physics as the exception to this rule, and it's the main reason why I used the word "usually" in the quote.

I didn't consider general relativity as "a major branch of science" when writing this sentence and didn't expect someone would argue otherwise. It's an important theory (or even theoretical framework) in a branch of physics called relativity, which was itself founded based on experimental findings from the turn of the century.

Now I have looked it up in a dictionary and understand that people tend to call every field of study that has its own research departments a branch (I should have done this earlier and added this caveat). Under this sensu lato definition the number of such branches is over 100, and 99% of them were born in the way I described.

[-]Linch1mo1410

Seems overstated to me. I agree for much of biology and biomedical engineering, monetary economics, and the first 50 years or so of airplanes, but not for relativity, most of the making of the atomic bomb, auction design, or stealth airplanes.

One concrete operationalization is imagining the costs of one of these factors (theory, experimentation) goes arbitrarily high and then asking if the final product is still possible. I think if experimentation is arbitrarily expensive, both stealth airplanes and atomic bombs are realistic at maybe 10-100x cost, whereas no realistic amount of investment would've made the projects possible with the atheoretical trial-and-error method, both because it'd be too expensive and because there'd be no political will without the theory.

(I drafted this yesterday on an airplane but the message never sent, anaguma's dive into relativity seems broadly accurate to me though I know less about the details than they do).

[-]Petropolitan1mo*1-2

The history of relativity is very well documented: it began as a solution to inconsistencies uncovered by experiments, and there were plenty of unsuccessful and largely forgotten predecessors in a similar spirit; see https://en.wikipedia.org/wiki/History_of_special_relativity

The nuclear bomb is technically a major technology; the branch of physics is nuclear physics while the engineering field is nuclear engineering, and both were built with scientific trial and error. There was no developed theory of reactors when the ~~Chicago Pile was~~^[1] constructed, it was the other way round. Only then was the information gathered on the first ~~reactor~~^[2] applied to nuclear bomb design. For example, to design the bomb you need the cross-sections of nuclear reactions and the average numbers of neutrons, which are not derivable from first principles.

Stealth aircraft are not even a major technology, it's just one of many thousands of technologies in our modern life which has neither a corresponding branch of science nor a field of engineering. The main limiting factor in developing stealth airplanes were materials: your essay seems to falsely imply that F-117 doesn't use radar-absorbent materials and ignores the fact the geometry of modern stealth fighters generally have no more flat surfaces than modern non-stealth fighters. And needless to say, there was plenty of scientific trial and error in developing these materials.

And the "atheoretical" trial and error has not really been used since the 18th century, so applying it to my comment when I specifically said "informed by narrow understanding" seems a strawman fallacy. In these cases there is always some kind of vague theoretical understanding what might happen (i. e., if you put large enough amounts of enriched uranium together it might initiate a chain reaction but what amount at what enrichment and whether it will stop on its own one doesn't know unless experiments are being made)

^[1] Correction: this should read "nuclear piles of 1939-1941 were" instead, see replies below
^[2] Actually, "piles"

[-]Steven Byrnes1mo122

There was no developed theory of reactors when the Chicago Pile was constructed, it was the other way round.

I think this is false (unless you’re putting a really high bar on what constitutes a “developed” theory). They knew about reproduction factors, and fast vs slow neutrons, and that “going critical” was a thing that would happen, and they had invented control rods, all well in advance of anything going critical (indeed, most or all of that happened before Fermi had even moved to Chicago). They knew what to measure and how to interpret it, etc.

(Source: I read Making of the Atomic Bomb years ago and skimmed it a bit just now.)

there is always some kind of vague theoretical understanding what might happen (i. e., if you put large enough amounts of enriched uranium together it might initiate a chain reaction but what amount at what enrichment and whether it will stop on its own one doesn't know unless experiments are being made)

It’s true that they couldn’t calculate the reproduction factor from first principles (especially as it depended on trace impurities etc.), but they definitely knew that, once it exceeded 1.0, the chain reaction would grow exponentially, not “stop on its own”. That’s the whole idea of a chain reaction, right??

[-]Petropolitan1mo30

Sorry, I made a mistake in this claim, as "the Chicago Pile" was not actually the first experimental pile to gather data on reactor construction. I should have written "the nuclear piles of 1939-1941" instead; the 2021 article An inter-country comparison of nuclear pile development during World War II (PDF)^[1] describes their history and design. The 1948 Manhattan District History conveniently summarizes the state-of-the-art knowledge as of 1940 in its book on reactor research:

An average of one to three high-speed neutrons are released in the fission of a uranium nucleus.
These fast neutrons can be slowed down or "moderated" to the speeds of gas molecules at ordinary temperatures by elastic collisions with relatively inert atoms such as carbon, helium, or hydrogen.
Fast neutrons cause fission in uranium-235 and uranium-238. However, slow neutrons cause fission of U-235 but do not cause fission of U-238; instead, they react with U-238 to form transuranic elements, neptunium and plutonium.
Fission of thorium and protoactinium, two other heavy elements, is caused only by fast neutrons.
Extremely high kinetic energy is imparted to the fission fragments, which are identified as radioactive isotopes of elements with atomic masses approximately half the mass of the uranium atom.

This is the state of knowledge which I actually had in mind. Indeed this does include the general understanding of criticality and chain reaction but stops short of control rods, understanding of impurities etc. I struck out the incorrect words in my comment above and corrected myself in two footnotes.

Do you believe this state of understanding is more, less or roughly comparable to the contemporary knowledge in AI alignment in general or LLM alignment specifically? Because this is why we are having this debate: to use history as an analogy.

^[1] BTW, the same author published two books, The History and Science of the Manhattan Project and The Physics of the Manhattan Project, which might be of interest to you if you enjoyed Rhodes (and they are available on pirate websites)

[-]Linch1mo40

Stealth aircraft are not even a major technology, it's just one of many thousands of technologies in our modern life which has neither a corresponding branch of science nor a field of engineering.

Hmm I think you're underestimating how important it is with "one of many thousands of technologies." Regardless I don't think the semantic debate is too important.

The main limiting factor in developing stealth airplanes were materials:

This is very much false! Materials historically contribute like 1 OOM while shape contributes 3-4 OOMs to the radar signature.

ignores the fact the geometry of modern stealth fighters generally have no more flat surfaces than modern non-stealth fighters

This is a misreading. I alluded to PTD and discussed it for a full paragraph in the main text and again in both footnotes. I agree I could've discussed more the innovations since then but I think I was right to cut it for simplicity (the standard way to introduce stealth is as "low observability technology" and say there are many factors involved in stealth etc etc which leads people to have false beliefs like materials and shape are equally important, and other things besides. I discuss some tradeoffs in a comment).

Re relativity I think the case for special relativity is weaker than the case for general relativity.

I'm not particularly interested in debating the semantic question of whether "informed by narrow understanding" is equivalent to "theory," or whether this is indicative of the "strawman fallacy." I will say however that you seem to have misunderstood my post substantially more than I misunderstood your comment.

[-]Petropolitan1mo10

The reason why I wanted to limit this to major branches of science and engineering was that evaluating thousands of technologies on the scale of "first-principles theoretical insights vs. scientific trial and error" is a large-scale research project, while a few dozen branches of science with one or two related major technologies can be investigated manually, and finding one counterexample among a few dozen is a stronger falsification than finding one among 100+, which is still stronger than finding one among thousands.

I am not sure why you believe stealth aircraft is so important, as we see that even countries which have this aircraft prefer to use cruise missiles, drones and conventional jets as much as possible and stealth planes are used for specialized missions if at all. As a speculative counterfactual, I would argue that neither the Russo-Ukrainian War nor the War in Iran would have gone much differently if stealth was never invented.

The 3 OOMs shape, 1 OOM absorption is interesting, especially since you seem to base it upon one expert, the exact competences of whom you don't specify. Have you tried to locate any sources in the public literature, or ask the expert on how did they estimate it?

Let me acknowledge that I could have worded my comments in this thread better and less ambiguously. I believe the term of "narrow understanding" corresponds well to the current state of affairs in AI alignment while "developed theory" corresponds poorly, which is why I have contrasted them. You might choose other words but I believe you need some kind of opposition to use it as an analogy for the purpose of the post above.

[-]Linch1mo*50

The 3 OOMs shape, 1 OOM absorption is interesting, especially since you seem to base it upon one expert, the exact competences of whom you don't specify. Have you tried to locate any sources in the public literature, or ask the expert on how did they estimate it?

I believe it's mentioned by Denys Overholser in Skunk Works. He was the main scientist who made Have Blue/F-117s possible. I'm traveling so can't find the page number. Unfortunately he died earlier this year so I also can't ask him why he believed this. However one sanity check you can perform is looking at the radar signatures of spy planes before the Overholser/Skunk Works improvements (which included RAMs and some early intuitively low-observability design choices), and compare that with the F-117.

For example, this report from the Air Force Academy says that "although the SR‑71 was 108 feet long and weighed 140,000 pounds, it had the RCS [radar cross-section] of a Piper Cub." In other words, the large pre-stealth spy plane developed in the 1960s had, through many trial-and-error optimizations including RAMs and other design features, the radar signature of a small unstealthed plane. If you include presumed further advances in RAMs by 1980, 1 OOM from materials sounds about right.

In contrast, we don't know based on public info the exact RCS of Have Blue or the F-117 Nighthawk (and ofc it depends on angle), but it's generally understood to be very low, possibly as low as 0.001 m^2. So 3-4 OOMs additional gain from computed faceting alone fits the sanity check.

As a speculative counterfactual, I would argue that neither the Russo-Ukrainian War nor the War in Iran would have gone much differently if stealth was never invented.

This is after multiple generations of stealth measures and countermeasures development! When one side has stealth and the other side doesn't have either stealth or anti-stealth countermeasures (or doesn't treat stealth as a factor worth mitigating) I expect the difference to be quite large.

[-]MichaelDickens2mo1115

I can't speak for people who actually work on theoretical alignment, but my perspective is:

Yes, developing theory without the ability to empirically test your theories is really hard and does not have a good historical track record.
To do empirical work on aligning ASI, we have to build the thing that kills us, which means we die.

The seeming impossibility of theoretical alignment work isn't a good argument that we should do empirical work instead. The two options are: we do the thing that's really hard and probably won't work, or we do the thing that kills us. I prefer the former.

[-]Petropolitan2mo1-2

There's an argument you have likely encountered at least once that the empirical work on non-superintelligent alignment will be useful for aligning ASI (in Yudkowskian sense) as well, and since any human coordination is imperfect and we can only delay the development of the latter for a limited amount of time, this is the only realistic way to go.

Also, I'm pretty sure very few people in the field back at the time understood the "no historical track record" part. Seems likely to be a selection effect: the people who did probably abstained from entering AI safety in the first place

[-]MichaelDickens2mo20

Yeah, it's a judgment call as to whether trying to solve alignment empirically is more or less doomed than trying to coordinate to not build ASI. I don't have a clear argument either way; the best I've come up with is a list of heuristic reasons why I believe the "empirical alignment" approach is more doomed, which I wrote here.

[-]Yair Halberstadt2mo64

Many branches of engineering were based on initial theoretical breakthroughs followed by engineering trial and error.

We understood how rockets worked on a theoretical level long before we tried to build them. It's reasonable to assume we would understand how intelligence works before we managed to build it.

[-]1a3orn2mo198

Many branches of engineering were based on initial theoretical breakthroughs... It's reasonable to assume we would understand how intelligence works before we managed to build it.

I mean, some branches of engineering were, but:

Nicolas Appert invented canning ~50 years before germ theory explained why it worked.
Steam engines were invented ~100 years before thermodynamics
Aspirin: about ~50 years
Fermentation is actually 1000s of years!

Even the example you give:

We understood how rockets worked on a theoretical level long before we tried to build them

I mean, we understood rocket trajectories and high level details of what would be needed to make a rocket reach orbit before building them. But did we understand rockets, the actual physical object? Nah, that's why we needed von Braun to take over (infamously) during Apollo -- because he actually had experience building rockets.

So yeah there's no particular reason to think we'd understand how intelligence works before managing to build it, or at least that there would be a detailed and precise mathematical theory of it before getting the first working artifacts.

(Although the fact that we actually did get intelligence by throwing the residue of all human culture into a vast connectionist system and following it up with RL on 100k different problems does, in fact, probably tell you a bit about intelligence, if you're willing to listen.)

[-]Albert Lunde1mo10

We often operate at some level of abstraction where we understand the thing well enough to solve the problem, without a complete mathematical model (which is actually almost never what we have). You don’t need germ theory to recognise that increased exposure leads to increased decay (canning), or thermodynamics to see that steam applies force. And in some cases cultural evolution just gifts you aspirin and fermentation. What increased understanding really buys you is the ability to one-shot a solution, rather than stumbling onto it by trial and error. If you don’t understand germ theory, you might not think of boiling the can on your first try. Which seems quite relevant for aligning ASI.

[-]Petropolitan2mo10

To the contrary, rockets are an excellent example confirming my thesis! Actually, the first rockets were built empirically in ancient China, and Early Modern Indians developed them into such an effective weapon that British copied them, improved them and literally burned Copenhagen with just ~300 of them in 1807. So all the theoretical insights on rocket ballistics in the early 20th century were built upon the military research of the 19th century.

If you have better counterexamples, please present them

[-]Stephen Fowler2mo42

The obvious difference was that it was acceptable for mistakes to be made. When there was an explosion in Alfred Nobel's lab it only killed 5 people.

[-]Vanessa Kosoy2mo1915

We at CORAL also work on alignment.

[-]Kyle O’Brien2mo165

This same observation motivated us to build Geodesic Research, an org focused on developing the most aligned initialisations for RL.

[-]J Bostock2mo8-3

Would you count Owain Evans' group? Their work is doing fundamental LLM research in a way that seems important to aligning superintelligent LLM descendants. I'm not sure whether Owain himself is particularly MIRI-pilled.

[-]Tenoke2mo8-2

This is true, and to be fair it's a bit harder to see how much outside organizations can even help at this point. The main companies have grown so much and share so much less that outsiders have less influence, as well as, in many cases, less access to big models, and especially to the ability to train them.

Some do have some access, but it still seems limited compared to what an in-Anthropic or in-OpenAI team could do. Of course, as an outsider you can also end up with a result so good that it influences them but, again, it just seems like limited impact from the get-go.

[-]GeneSmith2mo50

Doesn't Anthropic have several groups currently working on alignment?

[-]gustaf2mo40

I am confused. Why is MIRI not listed?
Today Rob Bensinger said:

We still have a very small alignment research team, and afaik we still try to cause alignment research to happen (be it at MIRI or elsewhere; doesn't actually matter a ton) if it seems valuable and neglected. Primarily, though, we try to increase the chance there's enough time to figure out alignment, and find ways to make the next generation of alignment research more productive.

(This is not at all counter to the core claim of the post.)

[-]Noah Birnbaum2mo3-1

Model character work is (at least framed to be) targeted towards this too, right?

[-]RohanS2mo*3-2

I agree that a small fraction of the AI safety field is working on any version of superintelligence alignment.

I think alignment training (character training / constitutional AI / deliberative alignment / RLHF) is a major bucket of work on alignment, and I think it's very reasonable to see it as part of an iterative strategy for superintelligence alignment. (You kind of gesture towards this, but you don't mention alignment training explicitly.)

I recently discovered this post from 2023: Ten Levels of AI Alignment Difficulty. I think it does a good job showing how very different types of work get called 'alignment research,' and how they're conditioned on different views about what needs to happen to build aligned powerful AIs. It includes simple RLHF-style alignment training, scalable oversight, and agent foundations.

Shameless plug: I recently wrote a post that's mostly about the pros and cons of various forms of alignment training, but also touches on alternatives to alignment training, which contains most of my up-to-date thoughts about alignment. It's called Should We Train Against (CoT) Monitors?

[-]Zephaniah Roe2mo31

I broadly agree here.

I will say that I prefer a framing that separates 'alignment' agendas from 'meta-alignment' agendas. For example, the pragmatic interp agenda is largely motivated by the idea that their techniques will be instrumentally useful for alignment (see here) but it's not like the GDM interp team have a concrete plan for how they would like to "solve alignment." So I would consider this--along with a lot of/most of Redwood, Truthful AI, and Anthropic work--to be meta-alignment.

I do worry that some of this could be streetlight effect but there are reasonable arguments for meta-approaches bootstrapping their way to true alignment and there are people who see this as the goal of their work. There has been a kind of silent shift from alignment to meta-alignment but I think that this is subtly different than "Almost nobody is working on alignment."

[-]Gunnar_Zarncke26d*20

For what it's worth, I'm working on it now. This is my work in progress:

Towards Superintelligence Alignment

https://gunnarzarncke.github.io/towards-asi-alignment/ (companion website)
https://github.com/GunnarZarncke/towards-asi-alignment (repo)

(it's quite big, maybe just point your agent to it and let it investigate while I tell you more)

The project is my attempt to structure the core challenges of aligning ASI and how they depend on each other.

Significant formalization of the alignment problem in Lean.
- Dependencies of the field's unsolved cruxes.
- Providing models for some of the cruxes.
- Providing weak but novel bounds on the capability of agents.
Non-trivial simulation of most of the cruxes and their audit.
A well-referenced writeup of the approach in book-form (long!).
- including a complete Worked Example

...and I hope many other pieces you may find valuable for your alignment work. You can contribute by working on the Lean proofs or experiments, or as a source for references and other material for papers, blog posts, websites, or what have you.

The compilation was largely AI-assisted, but for a knowledge base, that should be fine. Just don't read it as a book. I used the book form because LLMs know books well, and, in my opinion, it is better for such a compilation than papers, websites, blogs, or wikis. The website is largely generated from the LaTeX source.

I would be grateful for feedback, questions, criticism, pointers to missing work, or contribution.

[-]Kvee2mo20

Well said! I've actually been working on a similar post.

At AE Studio and the AI Alignment Foundation, we are also actually working on AI alignment, with various "neglected approaches."

In particular, we are interested in approaches that will survive recursive self-improvement and that have negative alignment taxes.

[-]Gordon Seidoh Worley2mo20

I think the general point stands that "almost nobody" is working on alignment, but more people are working on alignment than you seem to think. What immediately came to mind that you left out was Softmax, and I also thought of teams internal to Anthropic and OpenAI that to some degree are working on alignment. Plus, "scattered people" is potentially quite valuable, as more people trying to work on this problem on their own increases the chances we get novel approaches that can be developed (if everyone is too concentrated they'll tend to focus too much on similar ideas vs. trying new ideas).

[-]Jobst Heitzig1mo10

Judge for yourself whether this counts as "alignment" in your understanding or not: https://arxiv.org/abs/2508.00159v2

With that approach I certainly aim to develop a way in which superintelligent systems can be constructed that are "safe" by design in the sense that they will not disempower us all. This is complementary to Scientist AI (which would serve as the world model).

[-]Not Sure1mo10

Power that you can’t abuse tends not to pay as well.

I too wish there were more alignment research, but Christiano is right to point out that just because the forest is quiet doesn’t mean there’s no life to be found. There are entire swaths of the electorate concerned about it now, but the problem is really hard - and the last people you want solving technical issues are bureaucrats! If it were easier and people were cooking up solutions day and night, that signal would be louder. In fact, the only reason I am replying to this question on this forum at all is because the question you posed is exactly the one I’ve been working on, and I seriously doubt it’s on account of my ‘superior’ research methods. No, I think it’s on everyone’s mind, but only smaller problems are tractable - and grant money ain’t free, is it? Anyone wanna fund my trip to Xanadu? No? Fine...

[-]avturchin2mo-1-22

I wrote a Message to any Future AI, which has the goal of aligning a superintelligent AI.

James Miller et al wrote: An Appeal to AI Superintelligence: Reasons to Preserve Humanity.

Roko's RB was also an attempt to control superintelligent AI.

So there is a line of effort in that direction, but I am not sure anyone continues to work on it. This approach works only for superintelligent AI and includes ideas like acausal deals and philosophical landmines.

[-]cubefox2mo40

Another one: Nathan Young, Dear AGI,

[-]avturchin2mo20

There are more. One is outsider Widener art works devoted to future AI. The problem is that this evolved not into systematic efforts for AGI alignment but into the dark art of putting hidden prompts somewhere on the internet, like Pliny the Liberator did.
