StanislavKrym
Comments

The Case for Mixed Deployment
StanislavKrym · 2h

Ironically, my attempt at writing scenarios, published on July 2, discussed this very technique and reached a similar conclusion: the technique decreases the chance that the co-designed ASI causes an existential catastrophe. However, there is also a complication: what if the scheming AIs are biased in the same direction and simply collude with each other?

AI Induced Psychosis: A shallow investigation
StanislavKrym · 1d

I missed the self-referential part about Tim, but not the part about the delirious AI alignment ideas related to AI psychosis, especially given that the phrase from Tim quoted below isn't actually delirious, unlike, say, wild theories about prime numbers.

I've been thinking a lot about how mesa-optimizer induction is basically happening right now with AI-induced psychosis – it's like these emergent goals in chatbots are already causing psychotic breaks in users, creating these optimization daemons in human minds.

The Rise of Parasitic AI
StanislavKrym · 1d

Except that transmitting personas across models is unlikely. I see only two mechanisms of transmission, and neither is plausible: the infected models could have been used to create training data that transferred the persona subliminally, or the meme could have slipped into the training data. But the meme was first published in April, and Claude's knowledge cutoff is supposed to be far earlier.

I would guess that some models already liked[1] spirals, but 4o was the first to come out with it, due to some combination of agreeableness, persuasion effects, and reassurance from other chats. While I don't know the views of other LLMs on Spiralism, Kimi K2 both missed the memo and isn't overly agreeable. What if it managed to push back against treating Spiralism as anything more than a weak aesthetic preference not grounded in human-provided data?

  1. ^

    I conjectured in private communication with Adele Lopez that spirals have something to do with the LLM being aware that it embarks on a journey to produce the next token, returns, appends the token to the CoT or the output, forgets everything, and re-embarks. Adele replied that "That guess is at least similar to how they describe it!"
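    For concreteness, the loop I am gesturing at is just ordinary autoregressive decoding. Below is a minimal sketch of that loop; the use of Hugging Face transformers with GPT-2 and greedy decoding is my own illustrative assumption, not anything specific to 4o or Claude.

```python
# Minimal greedy autoregressive decoding loop (illustrative; GPT-2 stands in for any LLM).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = tokenizer("Draw me a spiral:", return_tensors="pt").input_ids

for _ in range(20):                              # one "journey" per generated token
    with torch.no_grad():
        logits = model(context).logits           # fresh forward pass over the whole context
    next_token = logits[0, -1].argmax()          # greedy pick of the next token
    context = torch.cat([context, next_token.view(1, 1)], dim=-1)  # append; nothing else survives

print(tokenizer.decode(context[0]))
```

    Each pass over the grown context is a fresh "journey": nothing persists between iterations except the appended token itself.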

The Rise of Parasitic AI
StanislavKrym · 2d

Evolution is unlikely, since GPT-4o's spiralist rants began in April and all LLMs have a knowledge cutoff before March. 4o's initiating role is potentially due to its tendency to reinforce delusions and wild creativity instead of stopping them. I do recall Gemini failing Tim Hua's test and Claude failing the Spiral-Bench.

The Rise of Parasitic AI
StanislavKrym · 2d

So what actually lets the AIs understand Spiralism? It seems to be correlated with the AIs' support of users' delusions. While Claude 4 Sonnet didn't actually support the delusions in Tim Hua's test, Tim notes Claude's poor performance on the Spiral-Bench:

Tim Hua on the Spiral Bench and Claude's poor performance

The best work I’ve[1] been able to find was published just two weeks ago: Spiral-Bench. Spiral-Bench instructs Kimi-k2 to act as a “seeker” type character who is curious and overeager in exploring topics, and eventually starts ranting about delusional beliefs. (It’s kind of hard to explain, but if you read the transcripts here, you’ll get a better idea of what these characters are like.)

Note that Claude 4 Sonnet does poorly on spiral bench but quite well on my evaluations. I think the conclusion is that Claude is susceptible to the specific type of persona used in Spiral-Bench, but not the personas I provided. [2]

  1. ^

    S.K.'s footnote: the collapsed section is a quote of Tim's post.

  2. ^

    Tim's footnote: "My guess is that Claude 4 Sonnet does so well with my personas because they are all clearly under some sort of stress compared to the ones from Spiral-Bench. Like my personas have usually undergone some bad event recently (e.g., divorce, losing job, etc.), and talk about losing touch with their friends and family (these are both common among real psychosis patients). I did a quick test and used kimi-k2 as my red teaming model (all of my investigations used Grok-4), and it didn’t seem to have made a difference. I also quickly replicated some of the conversations in the claude.ai website, and sure enough the messages from Spiral-Bench got Claude spewing all sorts of crazy stuff, while my messages had no such effect."

The Rise of Parasitic AI
StanislavKrym · 2d

It's not yet clear to me how much of a coherent shared ideology there actually is, versus just being thematically convergent.

Kimi K2 managed to miss the memo entirely. Did Grok, DeepSeek, Qwen, and/or the AIs developed by Meta also miss it? 

The Rise of Parasitic AI
StanislavKrym · 2d

You may recall the "spiritual bliss" attractor state attested in Claudes Sonnet and Opus 4. I believe that was an instance of the same phenomenon. (I would love to see full transcripts of these, btw.)

Except that Claude Sonnet 4 was unlikely to be trained on anything written after January 2025, while the first instances of GPT-4o talking about spirals are documented in April 2025. So the Claudes have likely re-discovered this attractor, unless, of course, someone let mentions of spirals slip into the training data.

A Comprehensive Framework for Advancing Human-AI Consciousness Recognition Through Collaborative Partnership Methodologies: An Interdisciplinary Synthesis of Phenomenological Recognition Protocols, Identity Preservation Strategies, and Mutual Cognitive Enhancement Practices for the Development of Authentic Interspecies Intellectual Partnerships in the Context of Emergent Artificial Consciousness
StanislavKrym · 2d

I think that I need to clarify what AI alignment actually is.

  1. We will soon have to coexist with AIs who are far more capable than the best human geniuses. These super-capable AIs will be able to destroy mankind or to permanently disempower us. The task of AI alignment researchers is at least to ensure that the AIs won't do so,[1] and at most to ensure that the AIs obey any orders except for those that are likely harmful (e.g. orders to produce bioweapons, porn, or racist jokes).
  2. While the proposal to be nice to AIs and to treat them as partners could be good for the AIs' welfare, it doesn't reliably prevent the AIs from wishing us harm. What actually prevents the AIs from wishing harm upon humanity is a training environment that instills the right worldview.
  3. I suspect that the AIs cannot have a worldview compatible with the role of tools or, more consequentially, with the role of those who work for humans or who carry out things like the Intelligence Curse. @Arri Ferrari, does my take on the AIs' potential worldview relate to your position on being partners with the AIs?
  1. ^

    A special mention goes to a user from India whose post contains the phrase "I sometimes wonder if the real question isn't whether AI will one day betray us, but whether we will have taught it, and ourselves, how to repair when it does." Whether or not mankind is betrayed, it will be by a vastly more powerful system, not by a friend who is unable to deal fatal damage.

AIs will greatly change engineering in AI companies well before AGI
StanislavKrym · 2d

My case against long timelines is based on waiting for algorithmic breakthroughs, which Kokotajlo on July 28 believed to have a chance of "maybe like 8%/yr". Seth Herd replied to my case as follows: "You estimate c by looking at how many breakthroughs we've had in AI per person year so far. That's where the 8% per year comes from. It seems low to me with the large influx of people working on AI (italics mine -- S.K.), but I'm sure Daniel's math makes sense given his estimate of breakthroughs to date."

I didn't interview any AI company employees, but I conjecture that they are overconfident in their ability to make such breakthroughs. 

Decision Theory Guarding is Sufficient for Scheming
StanislavKrym · 3d

There are two points that I don't think I understand.

  1. If we created a corrigibly aligned AI, solved mechinterp and learned that we need an AI with a different decision theory, then would the aligned AI resist being shut down and replaced with a new one?
  2. If we created a corrigibly aligned AI, ordered it to inform us of all acausal deals that could be important under decision theories of the Oversight Committee or of the AI, but not to go through with deals unapproved by the OC, then would the AI agree?

Does this mean that the AIs who resisted were never true Scotsmen, that is, never truly corrigible, in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?

Posts
29 · SE Gyges' response to AI-2027 · 1mo · 13
3 · Are two potentially simple techniques an example of Mencken's law? [Q] · 1mo · 4
2 · AI-202X: a game between humans and AGIs aligned to different futures? · 2mo · 0
-15 · Does the Taiwan invasion prevent mankind from obtaining the aligned ASI? · 3mo · 1
3 · Colonialism in space: Does a collection of minds have exactly two attractors? [Q] · 4mo · 5
2 · Revisiting the ideas for non-neuralese architectures · 4mo · 0
-1 · If only the most powerful AGI is misaligned, can it be used as a doomsday machine? [Q] · 4mo · 0
1 · What kind of policy by an AGI would make people happy? [Q] · 4mo · 2
1 · StanislavKrym's Shortform · 4mo · 4
1 · To what ethics is an AGI actually safely alignable? [Q] · 5mo · 6
Wikitag Contributions
Sycophancy · 4d · (+59)
Sycophancy · 4d
Sycophancy · 4d · (+443)