I continue to think it would be valuable for organizations pursuing aligned AI to have a written halt procedure with rules for how to invoke it -- including provisions to address the fact that people at the organization are likely to be selected for optimism, and so forth.
Perhaps external activists could push for such measures: "If you're not gonna pause now, can you at least give us a detailed specification of what circumstances would justify a pause?"
I suppose "Responsible Scaling Policies" are something like what I'm describing... how are those working out?
It seems there is some diffusion of responsibility going on here. It could help to concentrate responsibility in a particular procedure, or particular group of individuals (e.g. have an internal red team, who are paid to interview people anonymously, and construct the best possible case for a halt).
Based on what I've read about safety culture in high-performing organizations, the safety culture of top AI organizations, as described in this post, seems fairly terrible. Perhaps even scarier than the lack of safety culture is the apparent lack of 101-level reading/implementation of what effective safety culture looks like. My impression is that many of the topics discussed here map well to standard safety culture concepts like "normalization of deviance". In general the post reads like a fairly typical sociological characterization of an organization that is on the verge of causing a catastrophic failure? Although admittedly I did my research into safety culture quite a while ago. Maybe it would be possible to get some reliability engineers to testify in front of Congress.
EDIT: On the other hand, the good news is: I expect there may be low-hanging fruit for improvement by simply hiring safety consultants who work with other high-stakes industries like aviation, nuclear, etc. and taking their recommendations seriously. A more pessimistic organization like MIRI could even offer to pay the fees for a mutually agreeable consultant. If AI companies are not willing to work with such a consultant, even when an outsider such as MIRI is paying the fees, that seems like an incredibly bad sign.
I was just dragged through Demons for a book club, so I was amused to read this. At least it means the time I spent reading that wasn't in vain.
There's some stuff that feels a little bit weird here. The author says they left in early 2024 and then spent the "following months" reading Dostoevsky and writing this essay. Was the essay written a while back and only put up now? (It has to have been edited relatively recently, if it was run through 4.5.) Who are the editors alluded to at the very end? Is it supposed to be Tim Hwang? A little more transparency would be much appreciated (the disclaimer about Opus 4.5 being used for anonymization was only added on the 24th, after some people had pointed out that it sounded rather AI-written).
Another weirdness: why did Hwang put up another microsite about Demons that's written by an anonymous author "still working in industry" that has clear LLM-writing patterns at basically the same time? https://shigalyovism.com/. Though this one is much less in-depth.
Can anyone with more experience in the frontier labs/the uniparty give a sanity check for whether this seems like it was written by someone who is who they say they are?
it seems plausible that the piece was written by someone who only has access to public writings. it has some confusions that seem unlikely but not completely inexplicable - for example, the assumption that EA is a major steering force in the uniparty (maybe the author is from Anthropic where this is more true); I also find the description of uniparty views to be a bit too homogeneous (maybe the author is trying to emphasize how small the apparent differences are compared to the space of all beliefs, or maybe they are an external spectator who is unaware of the details).
Yeah, that EA-prevalence assumption also caused me to doubt that the author actually worked at an AI company; it was very dissonant with my experience, at least.
I'm skeptical that the author is who they say they are. (I made a top level post critiquing Possessed Machines, I'm copying over the relevant part here.)
1. I think the author is being dishonest about how this piece was written.
There is a lot of AI in the writing of Possessed Machines. The bottom of the webpage states "To conceal stylistic identifiers of the authors, the above text is a sentence-for-sentence rewrite of an original hand-written composition processed via Claude Opus 4.5." As I wrote in a comment:
Ah, this [statement] was not there when I read the piece (Jan 23). You can see an archived version here in which it doesn't say that.
I don't actually believe that this is how the document was made. A few reasons. First, I don't think this is what a sentence-for-sentence rewrite looks like; I don't think you get that much of the AI style that this piece has with that^. Second, the stories in the interlude are superrrrr AI-y, not just in sentence-by-sentence style but in other ways. Third, the chapter and part titles seem very AI generated...The piece has 31 uses of “genuine”/“genuinely” in ~17000 words. One “genuine” every 550 words.
See also...
2. Fishiness
From kaiwilliams:
There's some stuff that feels a little bit weird here. The author says they left in early 2024 and then spent the "following months" reading Dostoevsky and writing this essay. Was the essay written a while back and only put up now? (It has to have been edited relatively recently, if it was run through 4.5.) Who are the editors alluded to at the very end? Is it supposed to be Tim Hwang? A little more transparency would be much appreciated (the disclaimer about Opus 4.5 being used for anonymization was only added on the 24th, after some people had pointed out that it sounded rather AI-written).
Another weirdness: why did Hwang put up another microsite about Demons that's written by an anonymous author "still working in industry" that has clear LLM-writing patterns at basically the same time? https://shigalyovism.com/. Though this one is much less in-depth.
At the bottom of the webpage in an "About the Author" box, we are told "Correspondence may be directed to the editors." This is weird, because we don't know who the editors are. Probably this was something that Claude added and the human author didn't check.
Richard_Kennaway points out:
There are some anomalies in the chapter numbering:
Part IV ends with Chapter 18; Part V begins with Chapter 21... [etc.]
3. This piece could have been written by someone who wasn't an AI insider
If you're immersed in 2025/2026 ~rationalist AI discourse, you would have the information to write Possessed Machines. That is, there's no "inside information" in the piece. There is a lot of "I saw people at the lab do this [thing that I, a non-insider, already thought that people at the lab did]". Leogao has made this same point: "it seems plausible that the piece was written by someone who only has access to public writings."
I am surprised you didn't mention the fact that the whole thing was paraphrased by Opus 4.5 to preserve anonymity. (Which really stood out to me! When I first read it, I assumed it was AI-generated, and I was disconcerted to see such quality of thought coming with such a slop-reek to the prose.)
Fair, I should've mentioned this. I speculated about this on Twitter yesterday. I also found the prose somewhat off-putting. Will edit to mention.
It's stated at the bottom of the webpage. After a few paragraphs at the start I was like... "ok surely this is AI generated" and popped it in a few detectors. I happened to scroll to the bottom to see how long the dang thing was and saw the disclaimer. I wish this former lab insider, who surely has money and at least one eloquent friend, could've paid another human being to rewrite it. But alas!
Ah, this was not there when I read the piece (Jan 23). You can see an archived version here in which it doesn't say that.
The statement now at the bottom of the webpage says: "To conceal stylistic identifiers of the authors, the above text is a sentence-for-sentence rewrite of an original hand-written composition processed via Claude Opus 4.5."
I don't actually believe that this is how the document was made. A few reasons. First, I don't think this is what a sentence-for-sentence rewrite looks like; I don't think you get that much of the AI style that this piece has with that^. Second, the stories in the interlude are superrrrr AI-y, not just in sentence-by-sentence style but in other ways. Third, the chapter and part titles seem very AI generated.
I might be wrong about this. Here are some experiments that would be useful. One, give the piece sans titles to Claude and ask it to come up with titles; see how well they match. Two, do some sentence-by-sentence rewrites of other texts and see how much AI style they have^.
FWIW I think this work is valuable, I'm glad I read it, and I've recommended it to people. I do think the first 'half' of the document is better in both content and style than the second half. In particular, the piece becomes significantly more slop-ish starting with the interlude (and continuing to the end).
^The piece has 31 uses of “genuine”/“genuinely” in ~17000 words. One “genuine” every 550 words. Does Claude insert "genuinely"s when sentence-by-sentence rewriting? I genuinely don't know!
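For what it's worth, a count like this is easy to reproduce mechanically. A minimal sketch, purely illustrative, assuming the essay text has been saved locally under the hypothetical filename possessed_machines.txt:

```python
# Rough frequency check for "genuine"/"genuinely" (illustrative sketch only).
# Assumes the essay has been saved locally as possessed_machines.txt (hypothetical filename).
import re

with open("possessed_machines.txt", encoding="utf-8") as f:
    text = f.read()

words = re.findall(r"[A-Za-z']+", text)  # crude tokenization; good enough for a sanity check
hits = sum(1 for w in words if w.lower() in ("genuine", "genuinely"))

print(f"{hits} uses of genuine/genuinely in {len(words)} words")
if hits:
    print(f"roughly one per {len(words) // hits} words")  # 31 in ~17,000 words works out to ~550
```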
Some thoughts, from my point of view:
One cannot believe that AI development should stop entirely. One cannot believe that the risks are so severe that no level of benefit justifies them. One cannot believe that the people currently working on AI are not the right people to be making these decisions. One cannot believe that traditional political processes might be better equipped to govern AI development than the informal governance of the research community.
FWIW, I believe all those things, especially #3. (Well, with nuance. Like, it's not my ideal policy package; I think if I were in charge of the whole world we'd stop AI development temporarily and then figure out a new, safer, less power-concentrating way to proceed with it. But it's significantly better by my lights than what most people in the industry and on twitter and in DC are advocating for. I guess I should say I approximately believe all those things, and/or I think they are all directionally correct.)
But I am not representative of the 'uniparty', I guess. I think the 'uniparty' idea is a fairly accurate description of how frontier AI labs are, including the people in the labs who think of themselves as AI safety people. There are exceptions of course. I don't think the 'uniparty' as described by this anonymous essay is an accurate description of the AI safety community more generally. Basically I think it's pretty accurate at describing the part of the community that inhabits and is closely entangled with the AI companies, but inaccurate at describing e.g. MIRI or AIFP or most of the orgs in Constellation, or FLI or ... etc. It's unclear whether it's even claiming to describe those groups; the piece wasn't super clear about its scope.
Well, with nuance. Like, it's not my ideal policy package; I think if I were in charge of the whole world we'd stop AI development temporarily and then figure out a new, safer, less power-concentrating way to proceed with it. But it's significantly better by my lights than what most people in the industry and on twitter and in DC are advocating for. I guess I should say I approximately believe all those things, and/or I think they are all directionally correct.
With all due respect, I'm pretty sure that the existence of this very long string of qualifiers and very carefully reasoned hedges is precisely what the author means when he talks about intellectualised but not internalised beliefs.
Can you elaborate? What do you think I should be doing or saying differently, if I really internalized the things I believe?
To be honest, I wasn't really pointing at you when I made the comment, more at the practice of the hedges and the qualifiers. I want to emphasise that (from the evidence available to me publicly) I think that you have internalised your beliefs a lot more than those the author collects into the "uniparty". I think that you have acted with courage in support of your convictions, especially in the face of the NDA situation, for which I hold immense respect. It could not have been easy to leave when you did.
However, my interpretation of what the author is saying is that beliefs like "I think what these people are doing might seriously end the world" are in a sense fundamentally difficult to square with measured reasoning and careful qualifiers. The end of the world and existential risk are by their nature such totalising and awful ideas that any "sane" interaction with them (as in, trying to set measured bounds and make sensible models) is extremely epistemically unsound, the equivalent of arguing whether 1e8 + 14 people or 1e8 + 17 people (3 extra lives!) will be the true number of casualties in some kind of planetary extinction event when the error bars are themselves ±1e5 or 1e6. (We are, after all, dealing with never-seen-before black swan events.)
In this sense, detailed debates about which metrics to include in a takeoff model, the precise slope of the METR exponential curve, and which combination of chip trade and export policies increases tail risk the most/least are themselves a kind of deception. This is because arguing over details implies that our world and risk models have more accuracy and precision than they actually do, and in turn that we have more control over events than we actually do. "Directionally correct" is in fact the most accuracy we're going to get, because (per the author) Silicon Valley isn't actually doing some kind of carefully calculated compute-optimal RSI takeoff launch sequence with a well understood theory of learning. The AGI "industry" is more like a group of people pulling the lever of a slot machine over and over and over again, egged on by a crowd of eager onlookers, spending down the world's collective savings accounts until one of them wins big. By "win big", of course, I mean "unleashes a fundamentally new kind of intelligence into the world". And each of them may do it for different reasons, and some of them may in their heads actually have some kind of master plan, but all it looks like from the outside is ka-ching, ka-ching, ka-ching, ka-ching...
OK, thanks! It sounds like you are saying that I shouldn't be engaged in research projects like the AI Futures Model, AI 2027, etc.? On the grounds that they are deceptive, by implying that the situation is more under control, more normal, more OK than it is?
I agree that we should try to avoid giving that impression. But I feel like the way forward is to still do the research but then add prominent disclaimers, rather than abandon the research entirely.
Silicon Valley isn't actually doing some kind of carefully calculated compute-optimal RSI takeoff launch sequence with a well understood theory of learning. The AGI "industry" is more like a group of people pulling the lever of a slot machine over and over and over again, egged on by a crowd of eager onlookers, spending down the world's collective savings accounts until one of them wins big. By "win big", of course, I mean "unleashes a fundamentally new kind of intelligence into the world". And each of them may do it for different reasons, and some of them may in their heads actually have some kind of master plan, but all it looks like from the outside is ka-ching, ka-ching, ka-ching, ka-ching...
I agree with this fwiw.
Just to be clear, while I "vibe very hard" with what the author says on a conceptual level, I'm not directly calling for you to shut down those projects. I'm trying to explain what I think the author sees as a problem within the AI safety movement. Because I am talking to you specifically, I am using the immediate context of your work, but only as a frame, not as a target. I found AI 2027 engaging, a good representation of a model of how takeoff will happen, and I thought it was designed and written well (tbh my biggest quibble is "why isn't it called AI 2028"). The author is very very light on actual positive "what we should do" policy recommendations, so if I talked about that I would be filling in with my own takes, which probably differ from the author's in several places. I am happy to do that if you want, though probably not publicly in a LW thread.
@Daniel Kokotajlo Addendum:
Finally, my interpretation of "Chapter 18: What Is to Be Done?" (and the closest I will come to answering your question based on the author's theory/frame) is something like "the AGI-birthing dynamic is not a rational dynamic, therefore it cannot be defeated by policies or strategies that are focused around rational action". Furthermore, since each actor wants to believe that their contribution to the dynamic is locally rational (if I don't do it someone else will/I'm counterfactually helping/this intervention will be net positive/I can use my influence for good at a pivotal moment [...] pick your argument), further arguments about optimally rational policies only encourage the delusion that everyone is acting rationally, making them dig in their heels further.
The core emotions the author points to that motivate the AGI dynamic are: the thrill of novelty/innovation/discovery; paranoia and fear about "others" (other labs/other countries/other people) achieving immense power; distrust of the institutions, philosophies, and systems that underpin the world; and a sense of self-importance/destiny. All of these can be justified with intellectual arguments but are often the bottom line that comes before such arguments are written. On the other hand, the author also shows how poor emotional understanding and estrangement from one's emotions and intuitions lead to people getting trapped by faulty but extremely sophisticated logic. Basically, emotions and intuitions offer first-order heuristics in the massively high-dimensional space of possible actions/policies, and when you cut off the heuristic system you are vulnerable to high-dimensional traps/false leads that your logic or deductive abilities are insufficient to extract you from.
Therefore, the answer the author is pointing at is something like an emotional or frame realignment challenge. You don't start arguing with a suicidal person about why the logical reasons they have offered for jumping don't make sense (at least, you don't do this if you want them to stay alive), you try to point them to a different emotional frame or state (i.e. calming them down and showing them there is a way out). Though he leaves it very vague, it seems that he believes the world will also need such a fundamental frame shift or belief-reinterpretation to actually exit this destructive dynamic, the magnitude of which he likens to a religious revelation and compares to the redemptive power of love. Beyond this point I would be filling in my own interpretation and I will stop there, but I have a lot more thoughts about this (especially the idea of love/coordination/ends to moloch).
You are obviously not in the AGI uniparty (e.g. you chose to leave despite great financial cost).
Basically I think it's pretty accurate at describing the part of the community that inhabits and is closely entangled with the AI companies, but inaccurate at describing e.g. MIRI or AIFP or most of the orgs in Constellation, or FLI or ... etc.
I agree with most of these, though my vague sense is some Constellation orgs are quite entangled with Anthropic (e.g. sending people to Anthropic, Anthropic safety teams coworking there, etc.), and Anthropic seems like the cultural core of the AGI uniparty.
OK, cool.
FWIW, I disagree that Anthropic is the cultural core of the AGI uniparty. I think you think that because "Being EA" is one of the listed traits of the AGI uniparty, but that's maybe one of the places I disagree with the author. "Being EA" is maybe a common trait in AI safety (though a decreasingly common one, unfortunately IMO), and it's certainly not a common trait in the AI companies. And I think the AGI uniparty should be a description of the culture of the companies rather than of the culture of AI safety more generally (otherwise, it's just false). I'd describe the AGI uniparty as the people for whom this is true:
One cannot believe that AI development should stop entirely. One cannot believe that the risks are so severe that no level of benefit justifies them. One cannot believe that the people currently working on AI are not the right people to be making these decisions. One cannot believe that traditional political processes might be better equipped to govern AI development than the informal governance of the research community.
...and I'm pretty sure that while this is true for Anthropic, OpenAI, xAI, GDM, etc., it's probably somewhat less true for Anthropic than for the others, or at least than for OpenAI.
As someone currently at an AI lab (though certainly disproportionately LW-leaning from within that cluster), my stance respectively would be
I don't think my opinions on any of these topics are particularly rare among my coworkers either, and indeed you can see some opinions of this shape expressed in public by Anthropic very recently! Quoting from the constitution or The Adolescence of Technology, I think there's quite a lot on the theme of the third and fourth supposedly-unspeakable thoughts from the essay:
Claude should generally try to preserve functioning societal structures, democratic institutions, and human oversight mechanisms
We also want to be clear that we think a wiser and more coordinated civilization would likely be approaching the development of advanced AI quite differently—with more caution, less commercial pressure, and more careful attention to the moral status of AI systems. [...] we are not creating Claude the way an idealized actor would in an idealized world
Claude should refuse to assist with actions that would help concentrate power in illegitimate ways. This is true even if the request comes from Anthropic itself. [...] we want Claude to be cognizant of the risks this kind of power concentration implies, to view contributing to it as a serious harm that requires a very high bar of justification, and to attend closely to the legitimacy of the process and of the actors so empowered.
It is somewhat awkward to say this as the CEO of an AI company, but I think the next tier of risk [for seizing power] is actually AI companies themselves. [...] The main thing they lack is the legitimacy and infrastructure of a state [...] I think the governance of AI companies deserves a lot of scrutiny.
"risks are so severe that no level of benefit justifies them" nah, I like my VNM continuity axiom thank you very much, no ontologically incommensurate outcomes for me. I do think they're severe enough that benefits on the order of "guaranteed worldwide paradise for a million years for every living human" don't justify increasing them by 10% though!
What about... a hundred million years? What does your risk/benefit mapping actually look like?
There are some anomalies in the chapter numbering:
Part IV ends with Chapter 18; Part V begins with Chapter 21.
Part V ends with Chapter 23; Part VI begins with Chapter 27.
Part VI jumps from Chapter 28 to Chapter 31.
Part VII omits Chapter 33.
Are these just trifling errors, or indications of deliberate omissions by the writer, negative spaces for those with eyes to see?
This also stood out to me while reading, although I only noticed it at the jump from Chapter 28 to 31. I don't think these are errors; my guess is the omitted chapters were somehow revealing of the author's identity, so they were cut. Curious to know what he/she chose to omit. It was a great read, albeit with some AI-slopification.
I really enjoyed reading the full essay. I think there are many interesting points in it, and I'm grateful the author took the time to write and share it. I agree that a lot of issues regarding how to deploy/develop AI are being handled by people whose skillsets or perspectives are maybe narrower than is ideal.
However. I understand the main thrust of the piece to be about how people can, while making locally sensible decisions, end up in disaster, with the presumption that that's where things are headed. While I thought that main point was really well elaborated, I'd like to talk about the presumption. First off, I can't help but get the impression that the author's community is engaged in some of what he's accusing others of: He discusses how others are "possessed" by capabilities (and I don't entirely disagree; how could one not be captivated by all that AI is capable of today), but could it be his group is "possessed" by doom? It reminds me of von Neumann's quote about Oppenheimer: "Sometimes someone confesses a sin in order to take credit for it".
Getting more specific, I wish I could hear more about his opinion on that safety meeting he mentioned. I thought it interesting that he brings it up as a seemingly negative experience. I'd love to know why he thinks that. Maybe I'm missing something, but AFAIK on net nothing bad has happened, which makes me think the product people in the meeting were right?
This is a long shot, but I'm deeply moved by resonance and gotta shoot it:
If you have any intuition about who might have written this essay, I humbly ask you to connect me with the author. Goes without saying but:
Do not dox. DM who you think wrote it + ask for permission.
It is everything I'm working on as: 1) a Russian technologist with 2) theological commitments and literary leanings who's 3) building a very different alignment bet with 4) the courage to treat some truths as axiomatic rather than derived and 5) pursuing safety as a data science that systematically studies and simulates the conditions that engender an awareness of those truths in our users.
We cannot think our way out of cognitive atrophy and context collapse without reconstructing relational meaning. That's why all incumbent alignment bets are failure modes:
No one in AI safety seems willing to admit that materialistic conceptions of morality and virtue are incomplete and circular. And that feels like capitulating to superstition, BUT IT IS NOT.
I meet very few people willing to articulate this -- or who can even size up the problem thus -- but the author here is an exception.
And if you are the author and reading this, please get in touch. I can meet you at the level of Dostoevsky but *want* to meet you at the level of: please read our technical position paper and help me think through possible externalities my team and I are not seeing.
My own felt sense, as an outsider, is that the pessimists look more ideological/political and fervent than the relatively normal-looking labs. According to the frame of the essay, the "catastrophe brought about with good intent" could easily be preventing AI progress from continuing and the political means to bring that about.
The Possessed Machines is one of the most important AI microsites. It was published anonymously by an ex-lab employee, and does not seem to have spread very far, likely at least partly due to this anonymity (e.g. there is no LessWrong discussion at the time I'm posting this). This post is my attempt to fix that.
(The piece was likely substantially human-directed but laundered through an AI, whether for anonymity or out of laziness. Thanks to Malcolm MacLeod for reminding me to mention this in the comments. See here for Pangram-on-X analysis claiming 67.5% AI. The prose is not its strength.)
I do not agree with everything in the piece, but I think cultural critiques of the "AGI uniparty" are vastly undersupplied and incredibly important in modeling & fixing the current trajectory.
The piece is a long but worthwhile analysis of some of the cultural and psychological failures of the AGI industry. The frame is Dostoevsky's Demons (alternatively translated The Possessed), a novel about ruin in a small provincial town. The author argues it's best read as a detailed description of earnest people causing a catastrophe by following tracks laid down by the surrounding culture that have gotten corrupted:
The piece is rich in good shorthands for important concepts, many taken from Dostoevsky, which I try to summarize below.
First: how to generalize from fictional evidence, correctly
The author argues for literature as a source of limited but valuable insight into questions of culture and moral intuition:
Stavroginism: the human orthogonality thesis
Stavrogin is a character for whom moral considerations have become a parlor game. He can analyze everything and follow the threads of moral logic, but he is not moved or compelled by them at a level beyond curiosity.
Kirillovan reasoning: reasoning to suicide
Closely related is Kirillov. Whereas Stavrogin is the detached, curious observer of long chains of off-the-rails moral reasoning, Kirillov is the true believer.
The author compares Kirillov to people who accept Pascal's-wager-style EV calculations about positive singularities. A better example might be the successionists, some of whom want humanity to collectively commit suicide as the ultimate act of human moral concern towards future AIs.
Shigalyovism: reasoning to despotism
If Stavrogin is the intellectually entranced x-risk spectator & speculator, and Kirillov is the self-destructive whacko, Shigalyov is the political theorist who has rederived absolute despotism and Platonic totalitarianism for the AGI era.
Hollowed institutions
Possession
The AGI uniparty
The liberal father as creator of the nihilist son
Liberal Stepan's son Pyotr Stepanovich is a chief nihilist character in Demons. The author of The Possessed Machines argues this sort of thing (EA altruism turning into either outright nihilism or power-hunger) is a core cultural mechanic. I think they are directionally right, but I don't follow their main example of this, which argues that "technology ethics frameworks that are supposed to govern AI—fairness, accountability, transparency, the whole FAccT constellation—are the Stepan Trofimovich liberalism of our moment" and that "the serious people [...] have moved past these frameworks" because they are obsolete. My read of the intellectual history is that AGI-related concerns and galaxy-brained arguments about the future of galaxies preceded that cluster of more prosaic AI concerns, and that they're different branches on the intellectual tree rather than successors of each other.
Handcuffed Shatov
The solution is fundamentally spiritual