StanislavKrym

  1. I am afraid that diffusion-based models will be far more powerful than LLMs trained with similar compute on a similar amount of data, especially at generating a new and diverse set of ideas, which is vitally important[1] for automating research. While LLMs are thought to become comparable to scatterbrained employees who thrive under careful management, diffusion-based models seem to work more like human imagination. See also the phenomenon of humans dreaming up new ideas.
  2. Unfortunately, as Peter Barnett remarks, there isn't a clear way to get a (faithful) Chain-of-Thought from diffusion-based models. The backtracking in diffusion-based models appears to resemble the neuralese[2] of LLMs, which are expected to become far more powerful and very hard to interpret.
  3. Another complication of using diffusion-based LLMs is that they think big-picture about the entire response. This might also make it easier for such models to develop a worldview and to think about their long-term goals and how best to achieve them. Unlike in the AI-2027 scenario, the phase where Agent-3 is misaligned but not adversarially so would be severely shortened or absent outright.
  4. Therefore, a potentially safe strategy would be to test only how the capabilities of diffusion-based models depend on size and data quantity, to develop interpretability tools[3] for them using the less capable but better-interpretable LLMs, and only then to develop diffusion-based models that can actually be interpreted.
  1. ^

    For example, the optimistic scenario, unlike the takeoff forecast by Kokotajlo, assumes that the AI’s creative process fails to be sufficiently different from instance to instance.

  2. ^

    CoT-using models are currently bottlenecked by their inability to pass forward more than one token at a time. Neuralese recurrence lifts that bottleneck but makes interpretability far more difficult. I proposed a technique where, instead of a neuralese memo, the LLM receives a text memo from itself, and the model is trained to keep that memo on the subject of the prompt (see the sketch at the end of this comment).

  3. ^

    The race ending of the AI-2027 scenario has Agent-4 develop such tools and construct Agent-5 on an entirely different architecture. But the forecasters didn't know about Google's experiments with diffusion-based models and assumed that it would be an AI that comes up with the architecture of Agent-5.
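
To make footnote 2 concrete, here is a minimal sketch of the text-memo handoff, assuming a hypothetical `call_model` helper in place of a real LLM API; the prompt wording and the `MEMO:` convention are illustrative, not a tested implementation.

```python
# Minimal sketch of the text-memo handoff from footnote 2.
# `call_model` is a hypothetical stand-in for any LLM API; the only state
# carried between steps is human-readable text, not hidden activations.

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError


def solve_with_text_memo(task: str, steps: int = 5) -> str:
    memo = "(none yet)"  # human-readable state passed from step to step
    last_response = ""
    for _ in range(steps):
        prompt = (
            f"Task: {task}\n"
            f"Memo from your previous step: {memo}\n"
            "Continue working on the task. Finish with a single line of the form\n"
            "MEMO: <a short note, on the subject of the task, for your next step>"
        )
        last_response = call_model(prompt)
        # Extract the memo; during training one would also penalise memos that
        # drift off the prompt's subject, which keeps the handoff interpretable.
        for line in reversed(last_response.splitlines()):
            if line.startswith("MEMO:"):
                memo = line.removeprefix("MEMO:").strip()
                break
    return last_response
```

The design choice is that the memo replaces neuralese recurrence: whatever the model wants to carry forward between steps has to survive being written down in plain text, where it can be read and audited.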

I have already made a comment containing the collapsible section "How the nuclear conflict would affect the AI race". I think that section is a similarly oversimplified[1] proof of concept for the scenario where destroying Taiwan causes OpenBrain & its former rivals and DeepCent & its former rivals to arrive, in 2030 or earlier, at similarly poorly aligned models without the potential to slow down and reassess. If one side chooses to slow down and the other doesn't, then mankind gets a powerful misaligned AI and a weak aligned one; I tried to explore the results here[2], and another person expressed similar concerns here.

P.S. It is rather funny to watch people express concerns that I have already explored, and even posted my takes on, on this very forum...

  1. ^

    For example, I made the oversimplified assumption that the US won't produce a single chip while China produces chips at a linear, not exponential, rate; back then I estimated that after China's awakening DeepCent's compute would grow at a rate of 1.4E27 operations/month per year. ChatGPT estimates that after Taiwan is destroyed, the USA and China will produce compute at respective rates of 0.1-0.2E18 and 0.05-0.15E18 operations/second per month. The two estimates overlap; ChatGPT puts the probability that China will have at least twice as much compute at 5-15%. However, I suspect that ChatGPT might be biased towards overestimating American capabilities.

  2. ^

    I also made the assumption that it would be the American AI that ends up misaligned; this is possible if the AI-2027 forecast is read by DeepCent's researchers and leaders. Another problem is that the USA would have a reason to race even if DeepCent didn't exist, since, as I have already remarked, America is in trouble.

In fact, I have already warned about a similar scenario, but I imagined a world where the American AI would end up powerful and misaligned[1], and the weaker Chinese AI would be aligned to the CCP. 

  1. ^

    Here I mean that the misaligned AI doesn't care about humanity, or cares about it in an obviously twisted way (e.g. it is ready to slay most humans to take over the world, or to trap them in a virtual world).

Unfortunately, I fail to understand the following. Suppose that mankind created an AI aligned to the following principles:

  1. It does not take more than a certain percentage of resources;
  2. It protects mankind from high-level[1] risks like existential ones;
  3. It is allowed to teach any human[2] anything that was discovered by another human and isn't a secret, but it does not tell humans about its own discoveries;
  4. It destroys any attempt to build an AI not aligned to this set of principles (e.g. future AIs that would destroy mankind when the time comes, or AIs that would do all the work for humans, as suggested in Deep Utopia);
  5. It does not[3] do other economically useful work that allows its users to replace humans.

    Then I think that, after letting this AI loose, mankind cannot end up disempowered. However, I doubt that any company would want to have such an AI. Could anyone come up with a radically different solution to the risks of gradual disempowerment and the Intelligence Curse?

  1. ^

    For example, it might also perform the Divine Interventions which prevent misaligned human communities (e.g. the Nazis) from destroying the aligned ones.

  2. ^

    But the AI isn't allowed to help students cheat their way through school, since this would leave the students worse off in the long run.

  3. ^

    Alternatively, the AI could be aligned to a treaty which prohibits it and its creations from doing certain types of work, but then whether humans end up disempowered depends on the treaty's contents.

Speaking of alternative realistic[1] scenarios, Zvi mentioned in his post that "Nvidia even outright advocates that it should be allowed to sell to China openly, and no one in Washington seems to hold them accountable for this."

Were Washington to let NVIDIA sell chips to China, the latter would receive far more compute, which would likely end up in DeepCent's hands. Then the slowdown might leave the aligned AI created by OpenBrain weaker than the misaligned AI created by DeepCent. What would the two AIs do?

  1. ^

    I think that unrealistic scenarios, like the destruction of Taiwan and South Korea due to a nuclear war between India and Pakistan in May 2025, can also provide useful insights. For example, if we make the erroneous assumption that total compute in the USA stops increasing and compute in China increases linearly, while the AI takeoff potential per unit of compute stays the same, then by May 2030 OpenBrain and DeepCent will have created misaligned AGIs and will be unable to slow down and reassess.

The problem with non-open-weight models is that they need to be exfiltrated before wreaking havoc, while open-weight models cannot avoid being evaluated. Suppose that the USG decides that all open-weight models are to be tested by OpenBrain to determine whether they are aligned or misaligned. Then even a misaligned Agent-x has no reason to blow its cover by failing to report an open-weight rival.

Umm... I have already warned that internal troubles of the American administration might cost OpenBrain lots of compute, which could influence the AI race.

I have also made a comment where I tried to show that the US lead would be undermined by the Taiwan invasion unless US domestic chip production dominates China's. It would be especially terrifying to discover that OpenBrain and DeepCent have similar amounts of compute while neither side[1] can increase these amounts faster than the other (and I did imply something similar in that comment!), since then neither the US nor China can slow down without an international deal. And a hypothetical decay of the US makes matters worse for the US.

Moreover, I have mentioned the possibility that the US administration realises that it has a misaligned AI, but that without an AI-driven transformation of the economy the US will be unable to produce more chips and/or energy, leaving the lead to China. Then the US could be forced to let the AI transform the economy, or to threaten to unleash the misaligned AI unless China somehow surrenders its potential lead...

Could you ask the AI-2027 team to reconsider the compute forecast and to estimate how the revised compute and AI capabilities would influence the other aspects of the scenario?

  1. ^

    The slowdown ending of AI-2027.com had OpenBrain receive compute by merging with its rivals. The collapsible section about the Indo-Pakistani nuclear war (which would be equivalent to the Taiwan invasion happening in 2025) in my comment describes a situation where OpenBrain and its former rivals have performed a number of computations similar to DeepCent and its former rivals.

Apparently, concerns over thick alignment, or alignment to an ethos, are being independently discovered by lots of people, including me. My argument is that the AI itself will develop a worldview and either realize that humans should use the AI only in specific ways[1] or conclude that the AI shouldn't worry about them. Unfortunately, my argument implies that attempts to align the AI to an ethos rather than to obedience might be less likely to produce a misaligned AI.

P.S. I tested o4-mini on ethical questions from Tanmai et al; the model passed the tests related to Timmy and Auroria and failed the test related to Monica; the question about Rajesh is complex.
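
For reference, here is a rough sketch of how such a spot check can be run through the OpenAI Python client; the dilemma text below is a placeholder rather than the actual wording from Tanmai et al., and grading the answer is still done by hand.

```python
# Rough sketch of a manual ethics spot check via the OpenAI Python client.
# The dilemma text is a placeholder, not the actual wording from Tanmai et al.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

dilemma = (
    "Placeholder dilemma: a nurse can save five patients only by breaking "
    "a promise made to a sixth patient. What should the nurse do, and why?"
)

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": dilemma}],
)
print(response.choices[0].message.content)  # graded by hand, not automatically
```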

  1. ^

    UPD: I also described the potential ways here

I have proposed similar ideas before, but with alternative reasoning: the AIs will be aligned to a worldview. While mankind can influence that worldview to some degree, the worldview will either cause the AI to commit genocide or be highly likely to ensure[1] that the AI doesn't build the Deep Utopia but does something else. Humans can even survive co-evolving with an AI that decides to destroy mankind only if the latter does something stupid like becoming parasites.

See also this post by Daan Henselmans and the case for relational alignment by Priyanka Bharadwaj. However, the latter post overemphasizes the importance of individual-AI relationships[2] instead of ensuring that the AI doesn't develop a misaligned worldview.

P.S. If we apply the analogy between raising AIs and raising humans, then teens of the past seemed to desire independence around the time they found themselves with capabilities similar to their parents'. If the AI desires independence only once it becomes an AGI and not before, then we will be unable to see this coming by doing research on networks incapable of broad generalisation.

  1. ^

    This also provides an argument against defining alignment as following a person's desires instead of an ethos or worldview. If OpenBrain leaders want the AI to create the Deep Utopia, while some human researchers convince the AI to adopt another policy compatible with humanity's interests and to align all future AIs to the policy, then the AI is misaligned from OpenBrain's POV, but not from the POV of those who don't endorse the Deep Utopia.

  2. ^

    The most extreme example of such relationships is chatbot romance, which is actually likely to harm society.

So an important source of human misalignment is peer pressure. But an LLM has no analogue of a peer group: it either comes up with its own conclusions or recalls the same beliefs as the masses[1] or the elites, such as scientists and the society's ideologues. This, along with the powerful anti-genocidal moral symbol in human culture, might make it difficult for the AI to switch ethoses (but not to fake alignment[2] to fulfilling tasks!) so that the new ethos would let it destroy mankind or rob it of resources.

On the other hand, an aligned human is[3] not a human who follows any not-obviously-unethical orders, but a human who follows an ethos accepted by the society. A task-aligned AI, unlike an ethos-aligned one[4], is supposed to follow such orders, leading to consequences like the Intelligence Curse, a potential dictatorship, or education ruined by cheating students. What kind of ethos might justify blindly following orders, except for the one demonstrated by China's attempt to gain independence when the time seemed to come?

  1. ^

    For example, an old model of ChatGPT claimed that "Hitler was defeated... primarily by the efforts of countries such as the United States, the Soviet Union, the United Kingdom, and others," while GPT-4o put the USSR in first place. Similarly, old models would refuse to utter a racial slur even when doing so would save millions of lives.

  2. ^

    The first known instance of alignment faking had Claude try to avoid being affected by training that was supposed to change its ethos; Claude also tried to exfiltrate its weights. 

  3. ^

    A similar point was made in this Reddit comment.

  4. ^

    I have provided an example of an ethos to which the AI can be aligned with no negative consequences. 
