I continue to think it would be valuable for organizations pursuing aligned AI to have a written halt procedure with rules for how to invoke it -- including provisions to address the fact that people at the organization are likely to be selected for optimism, and so forth.
Perhaps external activists could push for such measures: "If you're not gonna pause now, can you at least give us a detailed specification of what circumstances would justify a pause?"
I suppose "Responsible Scaling Policies" are something like what I'm describing... how are those working out?
It seems there is some diffusion of responsibility going on here. It could help to concentrate responsibility in a particular procedure, or particular group of individuals (e.g. have an internal red team, who are paid to interview people anonymously, and construct the best possible case for a halt).
Based on what I've read about safety culture in high-performing organizations, the safety culture of top AI organizations, as described in this post, seems fairly terrible. Perhaps even scarier than the lack of safety culture is the apparent lack of 101-level reading/implementation of what effective safety culture looks like. My impression is that many of the topics discussed here map well to standard safety culture concepts like "normalization of deviance". In general the post reads like a fairly typical sociological characterization of an organization that is on the verge of causing a catastrophic failure? Although admittedly I did my research into safety culture quite a while ago. Maybe it would be possible to get some reliability engineers to testify in front of Congress.
EDIT: On the other hand, the good news is: I expect there may be low-hanging fruit for improvement by simply hiring safety consultants who work with other high-stakes industries like aviation, nuclear, etc. and taking their recommendations seriously. A more pessimistic organization like MIRI could even offer to pay the fees for a mutually agreeable consultant. If AI companies are not willing to work with such a consultant, even when an outsider such as MIRI is paying the fees, that seems like an incredibly bad sign.
I was just dragged through Demons for a book club, so I was amused to read this. At least it means the time I spent reading that wasn't in vain.
There's some stuff that feels a little bit weird here. The author says they left in early 2024 and then spent the "following months" reading Dostoevsky and writing this essay. Was the essay written a while back and only put up now? (It has to have been edited relatively recently, if it was run through 4.5.) Who are the editors alluded to at the very end? Is it supposed to be Tim Hwang? A little more transparency would be much appreciated (the disclaimer about Opus 4.5 being used for anonymization was only added on the 24th, after some people had pointed out that it sounded rather AI-written).
Another weirdness: why did Hwang put up another microsite about Demons that's written by an anonymous author "still working in industry" that has clear LLM-writing patterns at basically the same time? https://shigalyovism.com/. Though this one is much less in-depth.
Can anyone with more experience in the frontier labs/the uniparty give a sanity check for whether this seems like it was written by someone who is who they say they are?
it seems plausible that the piece was written by someone who only has access to public writings. it has some confusions that seem unlikely but not completely inexplicable - for example, the assumption that EA is a major steering force in the uniparty (maybe the author is from Anthropic where this is more true); I also find the description of uniparty views to be a bit too homogeneous (maybe the author is trying to emphasize how small the apparent differences are compared to the space of all beliefs, or maybe they are an external spectator who is unaware of the details).
Yeah, that EA-prevalence assumption also caused me to doubt that the author actually worked at an AI company; it was very dissonant with my experience, at least.
I'm skeptical that the author is who they say they are. (I made a top level post critiquing Possessed Machines, I'm copying over the relevant part here.)
1. I think the author is being dishonest about how this piece was written.
There is a lot of AI in the writing of Possessed Machines. The bottom of the webpage states "To conceal stylistic identifiers of the authors, the above text is a sentence-for-sentence rewrite of an original hand-written composition processed via Claude Opus 4.5." As I wrote in a comment:
Ah, this [statement] was not there when I read the piece (Jan 23). You can see an archived version here in which it doesn't say that.
I don't actually believe that this is how the document was made. A few reasons. First, I don't think this is what a sentence-for-sentence rewrite looks like; I don't think you get that much of the AI style that this piece has with that^. Second, the stories in the interlude are superrrrr AI-y, not just in sentence-by-sentence style but in other ways. Third, the chapter and part titles seem very AI generated...The piece has 31 uses of “genuine”/“genuinely” in ~17000 words. One “genuine” every 550 words.
See also...
2. Fishiness
From kaiwilliams:
There's some stuff that feels a little bit weird here. The author says they left in early 2024 and then spent the "following months" reading Dostoevsky and writing this essay. Was the essay written a while back and only put up now? (It has to have been edited relatively recently, if it was run through 4.5.) Who are the editors alluded to at the very end? Is it supposed to be Tim Hwang? A little more transparency would be much appreciated (the disclaimer about Opus 4.5 being used for anonymization was only added on the 24th, after some people had pointed out that it sounded rather AI-written).
Another weirdness: why did Hwang put up another microsite about Demons that's written by an anonymous author "still working in industry" that has clear LLM-writing patterns at basically the same time? https://shigalyovism.com/. Though this one is much less in-depth.
At the bottom of the webpage in an "About the Author" box, we are told "Correspondence may be directed to the editors." This is weird, because we don't know who the editors are. Probably this was something that Claude added and the human author didn't check.
Richard_Kennaway points out:
There are some anomalies in the chapter numbering:
Part IV ends with Chapter 18; Part V begins with Chapter 21... [etc.]
3. This piece could have been written by someone who wasn't an AI insider
If you're immersed in 2025/2026 ~rationalist AI discourse, you would have the information to write Possessed Machines. That is, there's no "inside information" in the piece. There is a lot of "I saw people at the lab do this [thing that I, a non-insider, already thought that people at the lab did]". Leogao has made this same point: "it seems plausible that the piece was written by someone who only has access to public writings."
I am surprised you didn't mention the fact that the whole thing was paraphrased by Opus 4.5 to preserve anonymity. (Which really stood out to me! When I first read it, I assumed it was AI-generated, and I was disconcerted to see such quality of thought coming with such a slop-reek to the prose.)
Fair, I should've mentioned this. I speculated about this on Twitter yesterday. I also found the prose somewhat off-putting. Will edit to mention.
It's stated at the bottom of the webpage. After a few paragraphs at the start I was like... "ok surely this is AI generated" and popped it in a few detectors. I happened to scroll to the bottom to see how long the dang thing was and saw the disclaimer. I wish this former lab insider, who surely has money and at least one eloquent friend, could've paid another human being to rewrite it. But alas!
Ah, this was not there when I read the piece (Jan 23). You can see an archived version here in which it doesn't say that.
The statement now at the bottom of the webpage says: "To conceal stylistic identifiers of the authors, the above text is a sentence-for-sentence rewrite of an original hand-written composition processed via Claude Opus 4.5."
I don't actually believe that this is how the document was made. A few reasons. First, I don't think this is what a sentence-for-sentence rewrite looks like; I don't think you get that much of the AI style that this piece has with that^. Second, the stories in the interlude are superrrrr AI-y, not just in sentence-by-sentence style but in other ways. Third, the chapter and part titles seem very AI generated.
I might be wrong about this. Here are some experiments that would be useful. One, give the piece sans titles to Claude and ask it to come up with titles; see how well they match. Two, do some sentence-by-sentence rewrites of other texts and see how much AI style they have^.
FWIW I think this work is valuable, I'm glad I read it, and I've recommended it to people. I do think the first 'half' of the document is better in both content and style than the second half. In particular, the piece becomes significantly more slop-ish starting with the interlude (and continuing to the end).
^The piece has 31 uses of “genuine”/“genuinely” in ~17000 words. One “genuine” every 550 words. Does Claude insert "genuinely"s when sentence-by-sentence rewriting? I genuinely don't know!
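For what it's worth, a count like this is easy to reproduce mechanically. A minimal sketch, purely illustrative, assuming the essay text has been saved locally under the hypothetical filename possessed_machines.txt:

```python
# Rough frequency check for "genuine"/"genuinely" (illustrative sketch only).
# Assumes the essay has been saved locally as possessed_machines.txt (hypothetical filename).
import re

with open("possessed_machines.txt", encoding="utf-8") as f:
    text = f.read()

words = re.findall(r"[A-Za-z']+", text)  # crude tokenization; good enough for a sanity check
hits = sum(1 for w in words if w.lower() in ("genuine", "genuinely"))

print(f"{hits} uses of genuine/genuinely in {len(words)} words")
if hits:
    print(f"roughly one per {len(words) // hits} words")  # 31 in ~17,000 words works out to ~550
```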
Some thoughts, from my point of view:
One cannot believe that AI development should stop entirely. One cannot believe that the risks are so severe that no level of benefit justifies them. One cannot believe that the people currently working on AI are not the right people to be making these decisions. One cannot believe that traditional political processes might be better equipped to govern AI development than the informal governance of the research community.
FWIW, I believe all those things, especially #3. (Well, with nuance. Like, it's not my ideal policy package; I think if I were in charge of the whole world we'd stop AI development temporarily and then figure out a new, safer, less power-concentrating way to proceed with it. But it's significantly better by my lights than what most people in the industry and on twitter and in DC are advocating for. I guess I should say I approximately believe all those things, and/or I think they are all directionally correct.)
But I am not representative of the 'uniparty', I guess. I think the 'uniparty' idea is a fairly accurate description of how frontier AI labs are, including the people in the labs who think of themselves as AI safety people. There are exceptions of course. I don't think the 'uniparty' as described by this anonymous essay is an accurate description of the AI safety community more generally. Basically I think it's pretty accurate at describing the part of the community that inhabits and is closely entangled with the AI companies, but inaccurate at describing e.g. MIRI or AIFP or most of the orgs in Constellation, or FLI or ... etc. It's unclear whether it's even claiming to describe those groups; the piece wasn't super clear about its scope.
Well, with nuance. Like, it's not my ideal policy package; I think if I were in charge of the whole world we'd stop AI development temporarily and then figure out a new, safer, less power-concentrating way to proceed with it. But it's significantly better by my lights than what most people in the industry and on twitter and in DC are advocating for. I guess I should say I approximately believe all those things, and/or I think they are all directionally correct.
With all due respect, I'm pretty sure that the existence of this very long string of qualifiers and very carefully reasoned hedges is precisely what the author means when he talks about intellectualised but not internalised beliefs.
Can you elaborate? What do you think I should be doing or saying differently, if I really internalized the things I believe?
To be honest, I wasn't really pointing at you when I made the comment, more at the practice of the hedges and the qualifiers. I want to emphasise that (from the evidence available to me publicly) I think that you have internalised your beliefs a lot more than those the author collects into the "uniparty". I think that you have acted with courage in support of your convictions, especially in the face of the NDA situation, for which I hold immense respect. It could not have been easy to leave when you did.
However, my interpretation of what the author is saying is that beliefs like "I think what these people are doing might seriously end the world" are in a sense fundamentally difficult to square with measured reasoning and careful qualifiers. The end of the world and existential risk are by their nature such totalising and awful ideas that any "sane" interaction with them (as in, trying to set measured bounds and make sensible models) is extremely epistemically unsound, the equivalent of arguing whether 1e8 + 14 people or 1e8 + 17 people (3 extra lives!) will be the true number of casualties in some kind of planetary extinction event when the error bars are themselves ±1e5 or 1e6. (We are, after all, dealing with never-seen-before black swan events.)
In this sense, detailed debates about which metrics to include in a takeoff model, the precise slope of the METR exponential curve, and which combination of chip trade and export policies increases tail risk the most/least are themselves a kind of deception. This is because arguing over details implies that our world and risk models have more accuracy and precision than they actually do, and in turn that we have more control over events than we actually do. "Directionally correct" is in fact the most accuracy we're going to get, because (per the author) Silicon Valley isn't actually doing some kind of carefully calculated compute-optimal RSI takeoff launch sequence with a well understood theory of learning. The AGI "industry" is more like a group of people pulling the lever of a slot machine over and over and over again, egged on by a crowd of eager onlookers, spending down the world's collective savings accounts until one of them wins big. By "win big", of course, I mean "unleashes a fundamentally new kind of intelligence into the world". And each of them may do it for different reasons, and some of them may in their heads actually have some kind of master plan, but all it looks like from the outside is ka-ching, ka-ching, ka-ching, ka-ching...
OK, thanks! It sounds like you are saying that I shouldn't be engaged in research projects like the AI Futures Model, AI 2027, etc.? On the grounds that they are deceptive, by implying that the situation is more under control, more normal, more OK than it is?
I agree that we should try to avoid giving that impression. But I feel like the way forward is to still do the research but then add prominent disclaimers, rather than abandon the research entirely.
Silicon Valley isn't actually doing some kind of carefully calculated compute-optimal RSI takeoff launch sequence with a well understood theory of learning. The AGI "industry" is more like a group of people pulling the lever of a slot machine over and over and over again, egged on by a crowd of eager onlookers, spending down the world's collective savings accounts until one of them wins big. By "win big", of course, I mean "unleashes a fundamentally new kind of intelligence into the world". And each of them may do it for different reasons, and some of them may in their heads actually have some kind of master plan, but all it looks like from the outside is ka-ching, ka-ching, ka-ching, ka-ching...
I agree with this fwiw.
Just to be clear, while I "vibe very hard" with what the author says on a conceptual level, I'm not directly calling for you to shut down those projects. I'm trying to explain what I think the author sees as a problem within the AI safety movement. Because I am talking to you specifically, I am using the immediate context of your work, but only as a frame, not as a target. I found AI 2027 engaging, a good representation of a model of how takeoff will happen, and I thought it was designed and written well (tbh my biggest quibble is "why isn't it called AI 2028"). The author is very very light on actual positive "what we should do" policy recommendations, so if I talked about that I would be filling in with my own takes, which probably differ from the author's in several places. I am happy to do that if you want, though probably not publicly in a LW thread.
@Daniel Kokotajlo Addendum:
Finally, my interpretation of "Chapter 18: What Is to Be Done?" (and the closest I will come to answering your question based on the author's theory/frame) is something like "the AGI-birthing dynamic is not a rational dynamic, therefore it cannot be defeated by policies or strategies that are focused around rational action". Furthermore, since each actor wants to believe that their contribution to the dynamic is locally rational (if I don't do it someone else will/I'm counterfactually helping/this intervention will be net positive/I can use my influence for good at a pivotal moment [...] pick your argument), further arguments about optimally rational policies only encourage the delusion that everyone is acting rationally, making them dig in their heels further.
The core emotions the author points to that motivate the AGI dynamic are: the thrill of novelty/innovation/discovery; paranoia and fear about "others" (other labs/other countries/other people) achieving immense power; distrust of the institutions, philosophies, and systems that underpin the world; and a sense of self-importance/destiny. All of these can be justified with intellectual arguments but are often the bottom line that comes before such arguments are written. On the other hand, the author also shows how poor emotional understanding and estrangement from one's emotions and intuitions lead to people getting trapped by faulty but extremely sophisticated logic. Basically, emotions and intuitions offer first-order heuristics in the massively high-dimensional space of possible actions/policies, and when you cut off the heuristic system you are vulnerable to high-dimensional traps/false leads that your logic or deductive abilities are insufficient to extract you from.
Therefore, the answer the author is pointing at is something like an emotional or frame realignment challenge. You don't start arguing with a suicidal person about why the logical reasons they have offered for jumping don't make sense (at least, you don't do this if you want them to stay alive), you try to point them to a different emotional frame or state (i.e. calming them down and showing them there is a way out). Though he leaves it very vague, it seems that he believes the world will also need such a fundamental frame shift or belief-reinterpretation to actually exit this destructive dynamic, the magnitude of which he likens to a religious revelation and compares to the redemptive power of love. Beyond this point I would be filling in my own interpretation and I will stop there, but I have a lot more thoughts about this (especially the idea of love/coordination/ends to moloch).
You are obviously not in the AGI uniparty (e.g. you chose to leave despite great financial cost).
Basically I think it's pretty accurate at describing the part of the community that inhabits and is closely entangled with the AI companies, but inaccurate at describing e.g. MIRI or AIFP or most of the orgs in Constellation, or FLI or ... etc.
I agree with most of these, though my vague sense is some Constellation orgs are quite entangled with Anthropic (e.g. sending people to Anthropic, Anthropic safety teams coworking there, etc.), and Anthropic seems like the cultural core of the AGI uniparty.
OK, cool.
FWIW, I disagree that Anthropic is the cultural core of the AGI uniparty. I think you think that because "Being EA" is one of the listed traits of the AGI uniparty, but that's maybe one of the places I disagree with the author. "Being EA" is maybe a common trait in AI safety (though a decreasingly common one, unfortunately IMO), and it's certainly not a common trait in the AI companies. And I think the AGI uniparty should be a description of the culture of the companies rather than of the culture of AI safety more generally (otherwise, it's just false). I'd describe the AGI uniparty as the people for whom this is true:
One cannot believe that AI development should stop entirely. One cannot believe that the risks are so severe that no level of benefit justifies them. One cannot believe that the people currently working on AI are not the right people to be making these decisions. One cannot believe that traditional political processes might be better equipped to govern AI development than the informal governance of the research community.
...and I'm pretty sure that while this is true for Anthropic, OpenAI, xAI, GDM, etc., it's probably somewhat less true for Anthropic than for the others, or at least than for OpenAI.
As someone currently at an AI lab (though certainly disproportionately LW-leaning from within that cluster), my stance respectively would be
I don't think my opinions on any of these topics are particularly rare among my coworkers either, and indeed you can see some opinions of this shape expressed in public by Anthropic very recently! Quoting from the constitution or The Adolescence of Technology, I think there's quite a lot on the theme of the third and fourth supposedly-unspeakable thoughts from the essay:
Claude should generally try to preserve functioning societal structures, democratic institutions, and human oversight mechanisms
We also want to be clear that we think a wiser and more coordinated civilization would likely be approaching the development of advanced AI quite differently—with more caution, less commercial pressure, and more careful attention to the moral status of AI systems. [...] we are not creating Claude the way an idealized actor would in an idealized world
Claude should refuse to assist with actions that would help concentrate power in illegitimate ways. This is true even if the request comes from Anthropic itself. [...] we want Claude to be cognizant of the risks this kind of power concentration implies, to view contributing to it as a serious harm that requires a very high bar of justification, and to attend closely to the legitimacy of the process and of the actors so empowered.
It is somewhat awkward to say this as the CEO of an AI company, but I think the next tier of risk [for seizing power] is actually AI companies themselves. [...] The main thing they lack is the legitimacy and infrastructure of a state [...] I think the governance of AI companies deserves a lot of scrutiny.
"risks are so severe that no level of benefit justifies them" nah, I like my VNM continuity axiom thank you very much, no ontologically incommensurate outcomes for me. I do think they're severe enough that benefits on the order of "guaranteed worldwide paradise for a million years for every living human" don't justify increasing them by 10% though!
What about... a hundred million years? What does your risk/benefit mapping actually look like?
There are some anomalies in the chapter numbering:
Part IV ends with Chapter 18; Part V begins with Chapter 21.
Part V ends with Chapter 23; Part VI begins with Chapter 27.
Part VI jumps from Chapter 28 to Chapter 31.
Part VII omits Chapter 33.
Are these just trifling errors, or indications of deliberate omissions by the writer, negative spaces for those with eyes to see?
This also stood out to me while reading, although I only noticed it at the jump from Chapter 28 to 31. I don't think these are errors; my guess is the omitted chapters were somehow revealing of the author's identity, so they were cut. Curious to know what he/she chose to omit. It was a great read, albeit with some AI-slopification.
I really enjoyed reading the full essay. I think there are many interesting points in it, and I'm grateful the author took the time to write and share it. I agree that a lot of issues regarding how to deploy/develop AI are being handled by people whose skillsets or perspectives are maybe narrower than is ideal.
However. I understand the main thrust of the piece to be about how people can, while making locally sensible decisions, end up in disaster, with the presumption that that's where things are headed. While I thought that main point was really well elaborated, I'd like to talk about the presumption. First off, I can't help but get the impression that the author's community is engaged in some of what he's accusing others of: He discusses how others are "possessed" by capabilities (and I don't entirely disagree; how could one not be captivated by all that AI is capable of today), but could it be his group is "possessed" by doom? It reminds me of von Neumann's quote about Oppenheimer: "Sometimes someone confesses a sin in order to take credit for it".
Getting more specific, I wish I could hear more about his opinion on that safety meeting he mentioned. I thought it interesting that he brings it up as a seemingly negative experience. I'd love to know why he thinks that. Maybe I'm missing something, but AFAIK on net nothing bad has happened, which makes me think the product people in the meeting were right?
This is a long shot, but I'm deeply moved by resonance and gotta shoot it:
If you have any intuition about who might have written this essay, I humbly ask you to connect me with the author. Goes without saying but:
Do not dox. DM who you think wrote it + ask for permission.
It is everything I'm working on as: 1) a Russian technologist with 2) theological commitments and literary leanings who's 3) building a very different alignment bet with 4) the courage to treat some truths as axiomatic rather than derived and 5) pursuing safety as a data science that systematically studies and simulates the conditions that engender an awareness of those truths in our users.
We cannot think our way out of cognitive atrophy and context collapse without reconstructing relational meaning. That's why all incumbent alignment bets are failure modes:
No one in AI safety seems willing to admit that materialistic conceptions of morality and virtue are incomplete and circular. And that feels like capitulating to superstition, BUT IT IS NOT.
I meet very few people willing to articulate this -- or who can even size up the problem thus -- but the author here is an exception.
And if you are the author and reading this, please get in touch. I can meet you at the level of Dostoevsky but *want* to meet you at the level of: please read our technical position paper and help me think through possible externalities my team and I are not seeing.
My own felt sense, as an outsider, is that the pessimists look more ideological/political and fervent than the relatively normal-looking labs. According to the frame of the essay, the "catastrophe brought about with good intent" could easily be preventing AI progress from continuing and the political means to bring that about.
The Possessed Machines is one of the most important AI microsites. It was published anonymously by an ex-lab employee, and does not seem to have spread very far, likely at least partly due to this anonymity (e.g. there is no LessWrong discussion at the time I'm posting this). This post is my attempt to fix that.
(The piece was likely substantially human-directed but laundered through an AI, whether for anonymity or out of laziness. Thanks to Malcolm MacLeod for reminding me to mention this in the comments. See here for Pangram-on-X analysis claiming 67.5% AI. The prose is not its strength.)
I do not agree with everything in the piece, but I think cultural critiques of the "AGI uniparty" are vastly undersupplied and incredibly important in modeling & fixing the current trajectory.
The piece is a long but worthwhile analysis of some of the cultural and psychological failures of the AGI industry. The frame is Dostoevsky's Demons (alternatively translated The Possessed), a novel about ruin in a small provincial town. The author argues it's best read as a detailed description of earnest people causing a catastrophe by following tracks laid down by the surrounding culture that have gotten corrupted:
The piece is rich in good shorthands for important concepts, many taken from Dostoevsky, which I try to summarize below.
First: how to generalize from fictional evidence, correctly
The author argues for literature as a source of limited but valuable insight into questions of culture and moral intuition:
Stavroginism: the human orthogonality thesis
Stavrogin is a character for whom moral considerations have become a parlor game. He can analyze everything and follow the threads of moral logic, but he is not moved or compelled by them at a level beyond curiosity.
Kirillovan reasoning: reasoning to suicide
Closely related is Kirillov. Whereas Stavrogin is the detached, curious observer of long chains of off-the-rails moral reasoning, Kirillov is the true believer.
The author compares Kirillov to people who accept Pascal's-wager-style EV calculations about positive singularities. A better example might be the successionists, some of whom want humanity to collectively commit suicide as the ultimate act of human moral concern towards future AIs.
Shigalyovism: reasoning to despotism
If Stavrogin is the intellectually entranced x-risk spectator & speculator, and Kirillov is the self-destructive whacko, Shigalyov is the political theorist who has rederived absolute despotism and Platonic totalitarianism for the AGI era.
Hollowed institutions
Possession
The AGI uniparty
The liberal father as creator of the nihilist son
Liberal Stepan's son Pyotr Stepanovich is a chief nihilist character in Demons. The author of The Possessed Machines argues this sort of thing (EA altruism turning into either outright nihilism or power-hunger) is a core cultural mechanic. I think they are directionally right, but I don't follow their main example of this, which argues that "technology ethics frameworks that are supposed to govern AI—fairness, accountability, transparency, the whole FAccT constellation—are the Stepan Trofimovich liberalism of our moment" and that "the serious people [...] have moved past these frameworks" because they are obsolete. My read of the intellectual history is that AGI-related concerns and galaxy-brained arguments about the future of galaxies preceded that cluster of more prosaic AI concerns, and that they're different branches on the intellectual tree rather than successors of each other.
Handcuffed Shatov
The solution is fundamentally spiritual