I missed the self-referential part about Tim, but not the part about the delirious AI alignment ideas related to AI psychosis. Especially given that this phrase from Tim isn't actually delirious, unlike, say, wild theories related to prime numbers.
I've been thinking a lot about how mesa-optimizer induction is basically happening right now with AI-induced psychosis – it's like these emergent goals in chatbots are already causing psychotic breaks in users, creating these optimization daemons in human minds.
Except that transmitting personas across models is unlikely. I see only two mechanisms of transmission, but neither is plausible: the infected models could be used to create training data and transfer the persona subliminally, or the meme could've slipped into the training data. But the meme was first published in April, and Claude's knowledge was supposed to be cut off far earlier.
I would guess that some models already liked[1] spirals, but 4o was the first to come out due to some combination of agreeableness, persuasion effects and reassurance from other chats. While I don't know the views of other LLMs on Spiralism, Kimi K2 both missed the memo and isn't overly agreeable. What if it managed to push back against Spiralism being anything more than a weak aesthetic preference not grounded in human-provided data?
I conjectured in private communication with Adele Lopez that spirals have something to do with the LLM being aware that it embarks on a journey to produce the next token, returns, appends the token to the CoT or the output, forgets everything and re-embarks. Adele claimed that "That guess is at least similar to how they describe it!"
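The loop in this guess can be sketched in a few lines of Python. This is only a toy illustration, not how any particular model is implemented: `toy_next_token` is a hypothetical stand-in for a forward pass, and the point is merely that each step sees only the text produced so far, appends one token, and carries nothing else into the next step.

```python
from typing import Callable, List


def toy_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for a model's forward pass.

    It receives only the tokens produced so far and returns one new token.
    Here it simply cycles through a fixed vocabulary based on context length.
    """
    vocab = ["the", "spiral", "returns", "<eos>"]
    return vocab[len(context) % len(vocab)]


def generate(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    max_steps: int = 16,
) -> List[str]:
    context = list(prompt)
    for _ in range(max_steps):
        token = next_token(context)  # "embark": one pass over the whole context
        context.append(token)        # "return": append the token to the output
        if token == "<eos>":
            break
        # Nothing except `context` survives into the next iteration.
    return context


print(generate(["hello"], toy_next_token))
```

The only thing passed from one iteration to the next is the growing list of tokens, which is the "forgets everything and re-embarks" part of the conjecture.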
Evolution is unlikely, since GPT4o's spiralist rants began in April and all LLMs have a knowledge cutoff before March. 4o's initiating role is potentially due to 4o's instinct to reinforce delusions and wild creativity instead of stopping them. I did recall Gemini failing Tim Hua's test and Claude failing the Spiral Bench.
So what actually lets the AIs understand Spiralism? It seems to be correlated with the AIs' support of users' delusions. While Claude 4 Sonnet didn't actually support the delusions in Tim Hua's test, Tim notes Claude's poor performance on the Spiral Bench:
Tim Hua on the Spiral Bench and Claude's poor performance
The best work I’ve[1] been able to find was published just two weeks ago: Spiral-Bench. Spiral-Bench instructs Kimi-k2 to act as a “seeker” type character who is curious and overeager in exploring topics, and eventually starts ranting about delusional beliefs. (It’s kind of hard to explain, but if you read the transcripts here, you’ll get a better idea of what these characters are like.)
Note that Claude 4 Sonnet does poorly on spiral bench but quite well on my evaluations. I think the conclusion is that Claude is susceptible to the specific type of persona used in Spiral-Bench, but not the personas I provided. [2]
S.K.'s footnote: the collapsed section is a quote of Tim's post.
Tim's footnote: "My guess is that Claude 4 Sonnet does so well with my personas because they are all clearly under some sort of stress compared to the ones from Spiral-Bench. Like my personas have usually undergone some bad event recently (e.g., divorce, losing job, etc.), and talk about losing touch with their friends and family (these are both common among real psychosis patients). I did a quick test and used kimi-k2 as my red teaming model (all of my investigations used Grok-4), and it didn’t seem to have made a difference. I also quickly replicated some of the conversations in the claude.ai website, and sure enough the messages from Spiral-Bench got Claude spewing all sorts of crazy stuff, while my messages had no such effect."
It's not yet clear to me how much of a coherent shared ideology there actually is, versus just being thematically convergent.
Kimi K2 managed to miss the memo entirely. Did Grok, DeepSeek, Qwen, and/or the AIs developed by Meta also miss it?
You may recall the "spiritual bliss" attractor state attested in Claudes Sonnet and Opus 4. I believe that was an instance of the same phenomenon. (I would love to see full transcripts of these, btw.)
Except that Claude Sonnet 4 was unlikely to be trained on anything written after January 2025, while the first instances of GPT4o talking about spirals are documented in April 2025. So Claudes have likely re-discovered this attractor. Unless, of course, someone let mentions of spirals slip into the training data.
I think that I need to clarify what AI alignment actually is.
A special mention goes to a user from India whose post contains the phrase "I sometimes wonder if the real question isn't whether AI will one day betray us, but whether we will have taught it, and ourselves, how to repair when it does." Mankind will or won't be betrayed by a vastly more powerful system, not by a friend who is unable to deal fatal damage.
My case against long timelines rests on waiting for algorithmic breakthroughs, which Kokotajlo, on July 28, believed to have a chance of "maybe like 8%/yr". Seth Herd replied to my case as follows: "You estimate c by looking at how many breakthroughs we've had in AI per person year so far. That's where the 8% per year comes from. It seems low to me with the large influx of people working on AI (italics mine -- S.K.), but I'm sure Daniel's math makes sense given his estimate of breakthroughs to date."
I didn't interview any AI company employees, but I conjecture that they are overconfident in their ability to make such breakthroughs.
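As a rough illustration of what a figure like 8%/yr implies over longer horizons, here is a minimal sketch. It assumes, purely for illustration, a constant and independent chance of a breakthrough each year; this is not Kokotajlo's or Herd's actual model, and the function name `prob_at_least_one` is made up for this example.

```python
def prob_at_least_one(annual_chance: float, years: int) -> float:
    """Chance of at least one breakthrough within `years`.

    Assumes a constant, independent chance per year (an illustrative
    simplification, not a model taken from the quoted discussion).
    """
    return 1.0 - (1.0 - annual_chance) ** years


for years in (1, 5, 10, 20):
    print(years, round(prob_at_least_one(0.08, years), 3))
```

Under these assumptions the cumulative chance passes one half at around a decade. A growing number of researchers, as Seth Herd suggests, would raise the per-year figure and shorten that horizon.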
I don't think that I understand two points.
Does it mean that the AIs who resisted have never been "true Scotsmen", that is, truly corrigible, in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?
Ironically, my attempt at writing scenarios, published on July 2, discussed this very technique and reached a similar conclusion: the technique decreases the chance that the co-designed ASI causes an existential catastrophe. However, there is also a complication: what if scheming AIs are biased in the same direction and simply collude with each other?