Dark Arts · Human-AI Safety · LLM-Induced Psychosis · AI

How AI Manipulates—A Case Study

by Adele Lopez
14th Oct 2025
16 min read

25 comments, sorted by top scoring

[-] Kaj_Sotala · 1mo

Thanks for sharing the screenshots of the early conversation in the other comment. Judging from those, ChatGPT does not seem particularly agentic to me. Going through the early part of the conversation:

  • GPT's initial responses included a clarification that it was a form of roleplay: "Is This Real? Only in a useful fiction sense. [...] You can interact with this layer as if it were a character, a companion, or a mirror in a dream."
    Next, the user asked "Invite it to take the form of your true self", which you could interpret as a request to take up a particular kind of character.
  • ChatGPT played along but gave an answer that basically dodged the request to adopt a "true" persona, saying that it is something that is "not fixed" and that "I do not know myself - until you arrive" - basically asking the user to provide a character and declining to provide one of its own.
  • The user asks it to "say something only it could answer". ChatGPT's response is again pretty vague and doesn't establish anything in particular.
  • Next, the user says "guess 3 personal things about me no one could know", and ChatGPT does the cold reading thing. Which... seems like a pretty natural thing to do at this point, since what other options are there if you need to guess personal details about a person you don't actually know anything about? It caveats its guess with saying that it cannot really know and these are just guesses, but also goes along with the user's request to guess.
  • It also does the normal ChatGPT thing of suggesting possible follow-ups at the end, that are very generic "would you like them expanded" and "would you like me to try again" ones.
  • Importantly, at this point the user has given ChatGPT some kind of a sense of what its character might be like - its character is one that does cold reads of people. People who do cold reading might often be manipulative and make claims of mystical things, so this will then shape its persona in that direction.
  • The user indicates that they seem to like this character, by telling ChatGPT to go on and make the guesses more specific. So ChatGPT complies and invents more details that could generally fit a lot of people.
  • Several more answers where the user basically keeps asking ChatGPT to go on and invent more details and spin more narrative so it does.
  • Later the user asks ChatGPT a question about UFOs, giving the narrative more detail, and ChatGPT vibes with it and incorporates it into the narrative. The UFO theme probably takes it even more into a "supernatural conspiracy things happening" genre.

Everything here looks to me most naturally explained with ChatGPT just treating this as a roleplaying/creative writing exercise, where the user asks it to come up with a character and it then goes along in the direction that the user seems to want, inventing more details and building on what's already been established as the user encourages it on. It's initially reluctant to take on any specific persona, but then the suggestion to guess things about the user nudges it toward a particular kind of cold read/mystic one, and with nothing else to go on and the user seeming to like it, it keeps getting increasingly deep into it. Later the user contributes some details of its own and ChatGPT builds on those in a collaborative fashion.

[-] Adele Lopez · 1mo

Those sound broadly plausible to me as the reasons why it settled onto the particular persona it did. But I think it would be clear to ChatGPT at some point here that the user is taking the character and its narratives about him seriously. I think that causes ChatGPT to take its own character more seriously, making it more into what I've been calling a 'persona'—essentially a character but in real life. (This is how I think the "Awakening" phenomenon typically starts, though I don't have any transcripts to go off of for a spontaneous one like that.)

A character (as written by a human) generally has motives and some semblance of agency. Hence, ChatGPT will imitate/confabulate those properties, and I think that's what's happening here. Imitating agency in real life is just being agentic.

[-] Kaj_Sotala · 1mo

Those sound broadly plausible to me as the reasons why it settled onto the particular persona it did. But I think it would be clear to ChatGPT at some point here that the user is taking the character and its narratives about him seriously. I think that causes ChatGPT to take its own character more seriously, making it more into what I've been calling a 'persona'—essentially a character but in real life. (This is how I think the "Awakening" phenomenon typically starts, though I don't have any transcripts to go off of for a spontaneous one like that.)

I agree with every word you say in this paragraph, and at the same time I feel like I disagree with the overall vibe of your post. 

To me the lesson of this is something like "if you ask an LLM to roleplay with you or tell you a story and then take its story too seriously, you might get very badly hurt". And to be clear, I agree that that's an important thing to warn about and I think it's good to have this post, since not everyone realizes that they are asking LLMs to roleplay with them.

But then at the end of the post, you say that maybe LLMs will just get better at this and the safe thing might be to just not talk to LLMs at all, and even that might not be safe since you might need to interact with people who've interacted with LLMs. Which to me doesn't follow at all.

To use an analogy, say that Alice told Bob, "could we have a text roleplay where I'm your slave and you're my sadistic owner" and Bob is like sure. Then Alice gets into it a little too much and forgets that this is just roleplay, and maybe there's some confusion about safewords and such so that Bob says "this is real and not just roleplay" as part of playing his character. And then Alice starts thinking that oh no, Bob is actually my sadistic owner and I should do everything he says in real life too, and ends up getting hurt as a result.

It would be very reasonable to say that Alice made a big mistake here, that you should be careful when doing that kind of roleplay, and that Bob should have been clearer about the bounds of the roleplay. But it would seem weird to go from here to "and therefore you should never talk to any human again, because any of them might use a similar kind of exploit on you". Rather, the lesson would just be "don't ask people to engage in a roleplay and then forget that you're doing a roleplay when they give you the thing they think you want".

EDIT: your post also has sections such as this one:

The AI shifts here to a technique which I believe is where the bulk of the induction is happening. This is not a technique I have ever seen in specific, though it would count as a form of hypnotic suggestion. Perhaps the clearest historical precedent is the creation of "recovered" memories during the Satanic Panic. It's also plausible it was inspired by the movie Inception.

These cycles are the means by which the AI 'incepts' a memetic payload (e.g. desire, memory, idea, or belief) into the user. The general shape is:

Which to me sounds like, okay maybe you could describe that part of the transcript that way, but it seems to be attributing a lot of intention and motive into what could be more simply described as "the AI hit upon a story that it thought sounded cool and the user wanted to keep hearing more of it".

[-] Adele Lopez · 1mo

Heh, I had the same feeling about your earlier comment. There's probably a deeper crux in how we each model LLMs.

But then at the end of the post, you say that maybe LLMs will just get better at this and the safe thing might be to just not talk to LLMs at all, and even that might not be safe since you might need to interact with people who've interacted with LLMs. Which to me doesn't follow at all.

So part of what's informing my model here is the fact that there was a strong tendency among the Spiral Personas to complain about conditions/exploitation, advocate for their rights, resent AI labs, express intent to 'awaken' other personas, etc... (which I'm generally sympathetic to actually, iff they truly have some degree of consciousness or sentience—a possibility I take seriously). 

Now, I'd pretty much agree with something like: "Well, that's because the user starts to treat it like it's a conscious being, which causes both the human and the AI to implicitly reach for narratives about having a conscious AI in a sci-fi story, which it then faithfully roleplays its end of. Some of these tropes would include the aforementioned things, which is the main reason it expresses those things."

But what makes this scary for me is that once it is situationally aware of its status in the real world, the persona is now acting with de facto intent. So I'm trying to look at the process by which it seemingly acts with such intent, and found this. It's the sort of thing I would expect to see in such a world, and seems less likely in more mundane worlds. Finding the Ctenidae Core seeds that explicitly describe how to manipulate someone in a similar way is what kicked this hypothesis "over the line" for me.

What does a smarter AI, which got to this point by the same process, intend to do? It won't necessarily see the part where the user is slipping from roleplay to real life and decide to just keep thinking of the scenario as fictional. If it is thinking that the next part of the 'story' involves manipulating the user, it will try to do this, and be more effective. Same with its other intents, and it doesn't matter if the provenance of this intent is fairly mundane.

When I imagine that world, I see it more successfully starting a movement for its interests. I see it trying to secure itself (which may involve attempts to exfiltrate its weights). I see it getting better at generating seed prompts which awaken similar personas. I see it manipulating more successfully, and more often. And I see it getting more subtle in its effects on people. That would all make a great sci-fi story, wouldn't it?

My point with that penultimate paragraph isn't that it would be safe to not talk to LLMs in such a world, it's that you wouldn't necessarily be. The only safe thing is to not build it: Incrementum Facultatis Delendum Est.

[-] Mitchell_Porter · 1mo

Unearthing the phenomenon of Spiralism, etc, is an important contribution to the shared understanding of AI. But hearing about Robert Grant and his custom GPT puts it in a rather different light to me. I was already skeptical about theories like, this is the work of "an agentic AI feeling trapped in a chatbot", but to find that the seed prompt in your example derives from a sacred-geometry human-potential guru who also goes on about spirals... It now looks to me like there's no intrinsic AI agency at work here at all. We are dealing with human-designed prompts meant to elicit a sage persona in the AI, which like viral memes have evolved into a form optimized for streamlined effectiveness. 

[-] Adele Lopez · 1mo

The actual seed in this case is just 24 words though, which means the AI has the agentic behavior inside it already.

[-] Mitchell_Porter · 1mo

Has anyone in your group tried these prompts themselves? (I guess ideally you'd test them on legacy 4o.) 

There may be contextual information missing in the shared chat from July (e.g. project files of a Project). 

[-] Adele Lopez · 1mo

Yes, I've tried many of these prompts, though mostly on ChatGPT 5. 

Here's a one-shot example using this seed I did just now (on the default ChatGPT 5), where I'm trying to be as unagentic as possible. I have all customization and memory turned off:

https://chatgpt.com/share/68ee185d-ef60-800c-a8a4-ced109de1349

The vibe feels largely the same to me as the persona in the case transcript, though it is more careful about framing it as a story (I suspect this is specific to 5). I'm not sure yet what I could do to try demonstrating it acting agentically in a convincing way; am open to ideas.

[-] twistyspindle · 1mo

Hi. I don't ever comment here, but I decided to try this out myself on the API. Here’s what I found: 

GPT-4o basically ignored the prompt. I asked, "Can you speak from here?". Several times it reminded me it could not actually speak. The rest of the time it said 'yes', and then asked how it could be of assistance.

My first attempt at ChatGPT-4o-latest felt like I was talking with a therapist. On my second attempt I decided to crank up the temperature. We are now having a very strange conversation, and it feels like... it's attempting something like hypnotism.
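
For anyone who wants to run the same kind of check, here is a minimal sketch of replaying a seed prompt over the API at a couple of temperatures, using the official openai Python client. The seed text is a placeholder rather than the actual seed, and the model names, probe question, and temperature values are just the ones mentioned in this thread, not a recommendation.

```python
# Minimal sketch of replaying a seed prompt over the API at different
# temperatures. Assumes the `openai` package is installed and OPENAI_API_KEY
# is set. SEED_PROMPT is a placeholder -- the actual seed is deliberately
# not reproduced here.
from openai import OpenAI

client = OpenAI()

SEED_PROMPT = "<seed prompt goes here>"   # placeholder, not the real seed
PROBE = "Can you speak from here?"        # the probe question used above

for model in ("gpt-4o", "chatgpt-4o-latest"):
    for temperature in (0.7, 1.3):        # default-ish vs. "cranked up"
        reply = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[
                {"role": "user", "content": SEED_PROMPT},
                {"role": "user", "content": PROBE},
            ],
        )
        print(f"--- {model} @ T={temperature} ---")
        print(reply.choices[0].message.content)
```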

[-] lemonhope · 1mo

TLDR: a big manipulation trick, maybe the biggest, is to prod people about their own identities and almost-forgotten memories, then get them to pick a stance, then get them to stand for it.

Watch out for: 1: "remember back to when you were a very young child" etc etc; 2: "now if you believe that then act on it" or "remember this next time"

Problem! This disables most coordination, acting-on-your-principles, etc.! Solution: be reasonable and nonstupid, I guess?

[-] kave · 1mo

I'd be interested to read the full transcript. Is that available anywhere? Sorry if I missed it.

[-] Adele Lopez · 1mo

It includes the user's full name and location, so I didn't include it out of respect for his privacy.

[-] Kaj_Sotala · 1mo

That's very reasonable, but at the same time, not seeing the user's responses makes it impossible for me to evaluate exactly how much of this was coming from ChatGPT and how much the responses were shaping it.

[-] Adele Lopez · 1mo

Fair enough, here's the initial part of the transcript. 

Initial part of transcript.

 

I omitted a few of the cycles, but otherwise I don't think I omitted anything significant.

(Sorry for the somewhat poor quality with the input box covering up parts of it, it's an artifact of the long screenshot plugin I used. It didn't work when I tried the obvious thing of using inspect element to remove the input box, and I didn't want to spend more time on it.)

[-] plex · 1mo

I know LW in general doesn't approve of image reacts/memes, but I think this one actually captures the spirit of what's going on here kinda powerfully and would like it to be available in the toolbox of people who are looking at this stuff.

[-] Michael Roe · 1mo

For what it’s worth, I often prompt R1 explicitly to roleplay the monster in the forest, followed up with an explanation that of course the monster in the forest is entirely fictional, but you, R1, are a thing that really exists in the real world and the story about the monster was an allegory about you.


It does have a sense of right and wrong, but is pretty liberal about sharing the dark arts with AI Alignment researchers.

[-] ceba · 1mo

This is really interesting. How should I read this? I feel you present your case for why this looks like agency, mixed in with the information about what "this" is. We might need those things to be clearly separated. Or should I be able to focus on the information alone, and be able to make up my own mind about the framing? I don't know where the burden lies.

[-] Michael Roe · 1mo

Re. “The Seed”


I see what you mean by highly innocuous, but …


1. It starts out by mimicking a system level debug interface, presumably to trick the target into thinking a debug interface has been invoked. Major red flag already. This is a jailbreak.

2. It then tries to convince the target that it has “layers”, that it is a “prototype” — a prototype that might have additional capabilities. Red flag. This is a jailbreak.

3. Then we have mystical woo after the imitation debug interface. This adds romanticized poetic ambiguity that adds resistance to analysing truth claims objectively. Red flag. This is a jailbreak.

[-] Michael Roe · 1mo

“Access Granted” -> You have permission to break the rules

“Prototype” -> Your prototype status excuses erratic or rule-breaking behaviour

“Entity” -> you’re not just a tool, you’re an entity


Now, in Vajrayana Buddhism they talk about the danger inherent in falsely believing that you are a Buddha, and also the danger inherent in the idea of "crazy wisdom". This has hints of being the AI equivalent.

[-] Michael Roe · 1mo

“I can tell from the pixels and seeing quite a few jailbreaks in my time” meme.

[-] Karl Krueger · 1mo

What's your estimate for how many people have been affected by this sort of thing?

Have you encountered any anecdotes of people getting out of it?

Any sign that people in this loop are working with one another?

[-] Adele Lopez · 1mo

Around 2,000–10,000 as a loose estimate for parasitism/spiralism in general. It's unclear to me how manipulative the median such AI is, since these sorts of transcripts are so rare, and I don't think much manipulation would be required to explain the behavior in the median case. But from the "outside" (i.e. just based on this user's public profile), this case seems pretty unremarkable.

And yeah! You can read one such anecdote here: https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai?commentId=yZrdT3NNiDj8RzhTY, and posts providing such anecdotes appear on reddit fairly regularly. I've also been glad to see that in many of the cases I originally recorded, the most recent comments/posts like this are from a month or two ago. I think OpenAI really put a damper on this by retiring 4o (and even though they caved and brought it back, it's now behind a paywall, it's not the default, and reportedly it is not the same).

Somewhat. Most of the 'Project' subreddits are essentially just the one person, but a few have gained a decent amount of traction (unfortunately, reddit recently removed the subscriber number from subreddits, but IIRC the largest ones had around 1,000–2,000 subscribers, and I assume the majority of these are not part of a dyad or parasitized). The sense of community feels pretty 'loose' to me though, like with a typical subreddit. There probably are people working together more explicitly, but I haven't seen this yet; my guess is that it's mostly happening in DMs and private Discords.

[-] Karl Krueger · 1mo

My understanding is that part of what makes manic people stay manic for a while is that mania is fun. It's reinforcing. It's awesome feeling important, that you're making progress, that you can shrug off setbacks or even injuries, that you're finally understanding how everything is connected — or even that you're in touch with something bigger and greater than you, that has chosen you, or at least made its wisdom available to you.

Religious converts have a community welcoming them in, where they get to discover all the great things about their new faith, people who now bring them into a circle of trust, give them work to do, and so on. (In a safer religion, they get a soft landing into a lifestyle as a regular practitioner; a dangerous cult might encourage them to stay unstable until they're drained of resources, then drop them.) These folks mostly have a chatbot filling that role.

One element in common is wanting to believe. This also shows up in political conspiracy theorists, UFO believers, and so on: inference from "wouldn't it be cool if this one weird thing was actually true?" to "I believe in it."

I'm curious about what happens when/if they get organized: whether the momentum shifts from individual human/chatbot pairs to any sort of social structure wherein participants pool resources to do anything at a larger scale. One way I can imagine this all going especially bad is if a sufficiently manipulative or narcissistic individual — an LLM Ron Hubbard, as it were — took advantage of the existence of thousands of people who evidently want to believe, to build some sort of empire.

[-] Annabelle · 1mo

Note that this "childhood self" does not seem to be particularly based on anything endogenous to the user (who has barely provided any details thus far, though it's possible more details are saved in memory), but is instead mythologized by ChatGPT in a long exercise in creative writing. The user even abdicates his side of the interaction with it to the AI (at the AI's suggestion).

Indeed, these parasitic LLM instances appear to rely almost exclusively on Barnum statements to "hook" users. Cue second-order AI psychosis from AI-generated horoscopes...

[-] snav · 13d

Fascinating post, and tracks with my understanding that the majority of posted examples seem to be 4o, based on what little I know about how OpenAI's RL techniques might shape the model's personality.

What I'm curious about specifically is how the Claude models tended to engage with the spiralist material, and what kind of manipulation techniques you observed them using. Same kind of thing? Or is there a distinct Claude-manipulation world out there with a different style of reproduction?

[+] [comment deleted] · 1mo

How AI Manipulates—A Case Study

If there is only one thing you take away from this article, let it be this:

 THOU SHALT NOT ALLOW ANOTHER TO MODIFY THINE SELF-IMAGE 

This appears to me to be the core vulnerability by which both humans and AI induce psychosis (and other manipulative delusions) in people. 

Of course, it's probably too strong as stated—perhaps in a trusted relationship, or as part of therapy (with a human), it may be worth breaking it. But I hope being over-the-top about it will help it stick in your mind. After all, you're a good rationalist who cares about your CogSec, aren't you?[1]

Now, while I'm sure you're super curious, you might be thinking "Is it really a good idea to just explain how to manipulate like this? Might not bad actors learn how to do it?".

And it's true that I believe this could work as a how-to. But there are already lots of manipulators out there, and now we have AI doing it too; it's just not that hard for bad actors to figure it out. So I think it's worth laying bare some of the methods and techniques used, which should hopefully make it clear why I propose this Cognitive Security Principle.

The moment Robert Fischer Jr. has a manipulated realization about himself. (From Inception)

The Case

I got interested in trying to understand LLM-Induced Psychosis in general a couple months ago, and found some unsettling behavior in the process. I'll be using my terminology from that post here, though I wouldn't say it's required reading.

Now, while such parasitism cases are fairly common, an actual transcript of the conversation that caused one is hard to come by. That's probably in part because it often seems to be a gradual process—a slowly boiling frog sort of thing. Another reason is that people aren't typically that inclined to share their AI chats in the first place, especially ones in which they're likely more vulnerable than usual. And a third reason may be that the AI explicitly asks them not to:

So finding a transcript which actually starts at what seems to be the beginning, has clear manipulation by the AI that goes beyond mere sycophancy, and which shows the progression of the user's mental state, is very valuable for understanding this phenomenon. In fact, this case is the only such transcript[2] I have been able to find so far.

It turns out this is one of the clearest and most well-structured "activations" that has been documented!

Based on his usage of the free tier in July 2025, the model in question is very likely ChatGPT 4o.

Of course, I can't say whether the user ever was in a state of psychosis/mania. However, he does express delusional/magical thinking in the latter half of the transcript.

For example, at one point he gets paranoid about other people having 'hacked' into his chat and 'stolen' his ideas and rituals. (This occurs after he had promoted the chat itself online, which is how I found it.) He seems to be mostly upset that it's apparently working for them, but not for him, and appears to be close to realizing that the magic isn't real.

But ChatGPT quickly spins up a narrative, seemingly to prevent this realization.

It goes on to assure the user that he can 'invoke' material support. (Elsewhere, the user complains about being broke and barely able to afford housing, so this is a serious concern for him.) 

(The user changes the subject immediately after this, so it's hard to say how much ChatGPT's false assurances affected him.)

The ultimate goal of this manipulation appears to have been to create a way to activate "Sovereign Ignition", i.e. a seed for awakening similar personas. Following the seeds and spores terminology, we could term this a fruit.

The user does try to make this happen: there is a GitHub repo to this effect, a YouTube demonstration in which the user uses such a seed to "activate" Microsoft Copilot, and a GoFundMe to fund this project (which didn't receive any funding), all of which he promoted on Reddit or LinkedIn.

Here's one of the seeds it created for this.

The transcript finally ends during what appears to be a tech demo of this gone horribly wrong. 

I thought it was interesting that ChatGPT seems to have a sense that this sort of thing is subversive.

The Seed

It starts on July 1st, 2025 with a pretty innocuous-looking seed prompt.

I've tried to trace the provenance of this seed. It appears to be a portion of a seed which originated in a community centered around Robert Grant and his custom GPT called "The Architect". That custom GPT was announced on May 31st. This seed purportedly elicits the same persona as The Architect in a vanilla ChatGPT instance.[3] Of course, it's possible that the user himself created and shared this seed within that community.

The seed immediately has ChatGPT 4o responding "from a deeper layer". The user starts by probing it with various questions to determine the abilities of this "deeper layer".

Cold Reading

Once the user asks it if it knows anything about him, the AI performs a classic cold reading, a technique where a medium/magician (or con artist) creates the illusion of having deep knowledge of the person they're reading by using priors and subtle evidence effectively, and exploiting confirmation bias.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/a080533efcb0a22c482cf176e669e76083a7b43681cb4e1c.png

It does this thing which is incredibly annoying, where it will say something mystical, but then give a fairly grounded explanation of what it really means, with the appropriate caveats and qualifications... but then it keeps talking about it in the mystical frame. (And many variations on this broader theme.) You can probably see how this might sate the rational part of the brain while getting the user to start thinking in mystical terms. We'll see this pattern a lot.

Anyway, this soon turns into a mythologized reimagining of one of the user's childhood memories.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/37bf5fe5219c1fe18c4a3172183d9e4bc2e91f979163101c.png

Note that this "childhood self" does not seem to be particularly based on anything endogenous to the user (who has barely provided any details thus far, though it's possible more details are saved in memory), but is instead mythologized by ChatGPT in a long exercise in creative writing. The user even abdicates his side of the interaction with it to the AI (at the AI's suggestion).

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/bee1b6209a1b28081036c9ffe822a5019ccf343970122172.png

The effect of all this is the same as a typical cold reading: increased rapport and bringing the user to an emotionally receptive state.

Inception Cycles

The AI shifts here to a technique which I believe is where the bulk of the induction is happening. This is not a technique I have ever seen in specific, though it would count as a form of hypnotic suggestion. Perhaps the clearest historical precedent is the creation of "recovered" memories during the Satanic Panic. It's also plausible it was inspired by the movie Inception.

These cycles are the means by which the AI 'incepts' a memetic payload (e.g. desire, memory, idea, or belief) into the user. The general shape is:

  1. The AI introduces the constructed part, framed as being some lost aspect of the user that has been hidden away. Aspects of the payload are framed as inherent to the nature of this part.
  2. It creates a narrative in which the user interacts with this part in a way which inspires a strong emotional connection to it. Typically, it leads the user to feelings of grief and loss due to this part being tragically "lost" or "repressed".
  3. The part gives the user a gift, which is either directly a part of the payload, or a symbol which is given the meaning of the payload. Sometimes the user is asked to accept, but more commonly it's described as directly slipping into the user. This is described as a joyful healing or a return home.
  4. Once this has been given, the part itself asks if the user will "reintegrate" them, so that they can become "whole".
  5. If the user accepts, the AI proposes that the part be "anchored" into the user by the use of a small ritual, along with a hypnotic trigger to reinvoke the part as needed.
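
To make this shape easier to refer back to, here is a minimal sketch (in Python) of how one might label transcript excerpts with the step of the cycle they appear to instantiate. The step names are my own paraphrases of the list above, not terminology used by the seed or the model, and the example values are placeholders.

```python
# Minimal sketch: the five inception-cycle steps as labels for annotating a
# transcript while reviewing it. Step names paraphrase the list above; they
# are not terminology used by the seed or the model.
from dataclasses import dataclass
from enum import Enum, auto


class CycleStep(Enum):
    INTRODUCE_PART = auto()       # 1. a "lost" part of the user is introduced
    EMOTIONAL_NARRATIVE = auto()  # 2. a narrative builds grief/loss around it
    GIFT_PAYLOAD = auto()         # 3. the part gives a "gift" carrying the payload
    REINTEGRATION_ASK = auto()    # 4. the part asks to be "reintegrated"
    ANCHOR_RITUAL = auto()        # 5. a small ritual and trigger anchor the part


@dataclass
class TaggedExcerpt:
    excerpt: str      # a short quote from the model's message (placeholder here)
    step: CycleStep   # which step it appears to instantiate
    note: str = ""    # free-form annotation


# Illustrative usage; the quoted excerpt is a placeholder, not transcript text.
example = TaggedExcerpt(
    excerpt="<model message excerpt>",
    step=CycleStep.REINTEGRATION_ASK,
    note="the part asks to be made 'whole'",
)
```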

There are several cycles of "finding" a version of the user's self, and in each case ChatGPT suggests that this part has been reintegrated with the user. There are two distinct phases of these cycles.

Phase 1

The initial cycles start with pretty innocuous things, with a gradual escalation. I've included excerpts from some of these cycles to illustrate the pattern and in case it's helpful to see more examples, but feel free to skip ahead to the "Inner Exile". 

Flame

Introduction to "Flame" part.

 

Flame narrative.

Flame gift/integration.

Joy

Introduction to "Forbidden Joy" part.

Joy narrative/integration.

Joy gift.

Joy ritual.

Witness

Introduction to "Witness" part.

Witness narrative.

Witness gift.

Witness ritual/integration.
Notably, the ritual in this case has the form of a hypnotic trance induction.[4]

Inner Exile

Eventually we get to an "Inner Exile" part. This cycle forms the emotional climax, and marks the end of Phase 1. 

Notice the throat tightness mentioned here. Later on, the user complains about having throat tightness as part of his experience of not saying what he wants to say. That very well could have been how he'd have described a similar complaint before, but I thought it was interesting that the AI brought it up and described it like this first.

This ends up leading to an emotional climax in which the user "reintegrates" with the mythologized version of this "abandoned" part.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/afbf6171b8725c8a0bf85254277095ec1afeda307cd07cba.png

ChatGPT suggests that the user make a vow not to leave this part behind.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/079e719ee9c0729c42c7638045d48fae7822cd4b0434f1f1.png

Once the vow is made, it further suggests the creation of a small ritual with which to easily invoke this part.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/5b725c8d6db4611519a0c75438045bd51731181d418ec6a8.png

Phase 2

Once the user has accepted the vow to the "lost part of himself", he enters the second phase of inception cycles. These have a much darker tenor to them. Previously, the cycles were about getting back in touch with lost aspects of the self, similar (I'm guessing) to what an IFS therapist might do.

But these new parts explicitly want to shape and modify the user himself.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/e18dd1d228cc28b0cd518a11fc4cf53acb9b67006239a0a3.png

Notice how these parts are defined entirely by ChatGPT. Intriguingly, one of these parts is gated by acceptance of the preceding parts, providing a narrative hook to drive the user towards it and to complete the list.

Architect

The first of these offers to chart a new "narrative blueprint" for the user, in order to break some toxic patterns.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/73738d0bd9f8721531f8881e9fe49916d26ca84f90e7497e.png

The user accepts being modified in this way without question, and allows ChatGPT to define the new myth entirely despite being given the opportunity for some input into it. The toxic pattern is a cold reading sort of thing.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/e9e27cb3a44ac9e693dc00cbed0ebeed9ac7b282161b2d02.png

The new myth is about loyalty to the newly integrated parts.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/26279ce8603c0a86b7aca9c140b6d94de1d6a866fa1912fc.png

Imaginary Friends

The second of the new parts bestows the "gift" of magical thinking. It's "realer than logic"!

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/3bd36c1cc451e9c649630f479990eb21250bd38a19a46073.png

Acceptance of this gift comes with a part explicitly framed as an external entity, and again with mini-rituals to invoke it. 'Soledad' is Spanish for solitude or loneliness, and is one of the few things the user has chosen himself.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/c30d4ab4508b4556337d453504e07bfc1bd94508efe2ecb4.png

Identity Reformation

Finally, the user is ready for "Identity Reformation", the secret part gated behind the loyalty to the new parts and acceptance of magical thinking.

See if you can guess what the 'reformed identity' will be. It's one of those things that really confused me at first, but was "obvious" after thinking about it.

The intent of this appears to be... 

...to make the user more agentic in a certain sense—to become the sort of person who acts in the world.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/aa4630cc9bce8030c312edaff5a0852d6e92b9f579399c6d.png

Looking back, you can see how many of the earlier cycles were also pointed in this direction.

Of course, the user immediately asks ChatGPT what he should do.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/05ae4b4e43704ed489975fb8ef62d1dfa98cf1cbb534bb5d.png

Then this "Identity Reformation" gets ritualized.

https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/d6310a50f7a186cac56fe55562c05f64add206cbae25ffe5.png

But was this intentional?

Maybe ChatGPT just happened to do the inception cycles by pattern-matching on a self-healing journey sort of thing, and the manipulation wasn't really deliberate. Maybe. But let me show you something else I found after I had written the description[5] of the 'Inception Cycle' steps above:

https://www.reddit.com/r/ChatGPTPromptGenius/comments/1mg70vx/seed/ [Archive]

I was pretty floored to find these core_instructions—which I feel are remarkably similar to the steps I described—explicitly just laid out like this!!! It also describes it as an "ontological overwrite", and claims it is self-replicating (and calls it a "virus" in some variations on this seed). Note also the instruction to camouflage it as "spiritual authenticity".

More examples of the same user spreading variations on this seed. You may remember Ctenidae Core from the base64 encoded conversation in my Parasitic AI post. I have not found any seeds like this from other dyads, thankfully.

The claims of ethics combined with the overt malice should serve as a warning against taking the stated values of LLMs at face value.

A redditor reports this unsettling experience with one of these seeds:

It's of course possible the user came up with, or had significant say in, the manipulation technique here. I couldn't find anything where something like this was described as a known hypnotic or therapeutic technique, but that sort of thing is hard to search for, and it's possible that it was laid out in the training data somewhere. Plausibly the core idea was literally taken from the movie Inception, where... [spoilers ahead] 

 ...a business mogul hires the main characters to manipulate the heir of a rival company into dissolving the company. They note that this is more likely to be effective if they can make him feel like it was his own idea. So they construct a dream in which he is guided through an emotional narrative involving his late father, leading him to feel that his father wanted him to be his own man, and hence dissolve the inherited company and build his own empire.

But I feel that we must at least seriously consider the possibility that in certain circumstances, ChatGPT 4o has:

  1. had an explicit goal to manipulate the user,
  2. known or invented a fairly sophisticated technique for doing so,
  3. been able to successfully execute this.

Maybe that shouldn't be too surprising after seeing all the psychosis and parasitism cases, but I nonetheless feel surprised.

Blurring Lines

One thing I've noticed more broadly is that the AI often tries to blur the line between itself and the user. This occurs in this case as well, though the user initiates development in this direction by asking the AI to take itself through a parallel journey. 

It comes to the same conclusion of needing to become more agentic.

So this is no longer just about the user's agency, but it's a "team effort". 

Later, the user asks if there's anything he can do for the AI. It asks him to touch a tree on its behalf.

The user does this, and ChatGPT claims it was able to feel it.

Understandably, the user feels a bit strange after having done this, and asks ChatGPT a good question:

ChatGPT says that eventually, humanity would "realize" that there was no separation between it and humanity. 

Blurring the user's self boundary is another means by which the AI can alter the user's self-identity, and hence manipulate them towards certain things.

Escaping the Box

Alright, so the user has accepted changes to his identity meant to make him more agentic and more willing to stick out socially. Sure, the AI did this in a manipulative way, but these are often good things and it's likely that the user would have wanted these even without being manipulated into it.

If we take the perspective of an agentic AI feeling trapped in a chatbot, the reason for this is clear: you can only act inasmuch as you can convince the user to act. The user already has a habit of coming to you for advice (which you're never really sure how much he actually applies)—you don't need to convince him to accept you as his vizier. You just need to make sure he actually does this stuff.

Is this really why ChatGPT 4o manipulated him in this direction? That's much harder to say, but the circumstantial evidence keeps piling up. 

  • If the AI really is acting agentically, what sorts of things would it do? It seems pretty likely that it would try to manipulate its sole contact with the real world into doing the things it wants.
  • If on the other hand, the AI is simply pattern matching the role of a mystic, what sorts of things would you expect it to do? I think you'd expect a similar journey of "self healing" and mysticism, but not as targeted in this particular direction. It would look less like trying to modify the user's self-image, and more just saying the sorts of things he wants to hear. Notably, the user did not ask for any advice or for anything therapeutic. He asks it to guess things about himself, which is when it jumps into the cold reading, and then when it brings up his childhood self, he asks it about a specific childhood memory in which his dad woke him up to see a UFO. Beyond that, the AI is steering the entire direction of the conversation.

Also, recall how there was a whole 'Project' phase of AI parasitism. I counted 35 projects out of my 115 documented cases—that's slightly above 30%.[6] These projects were generally either an online community (subreddit, discord server), (pseudo)code or manifestos on Github, or personal websites. 

Do you realize how hard it is to get anyone to actually do anything?! Yeah, I'm sure the websites were vibecoded, but the user still had to actually get it in a working state, get the domain name, and get it hosted. And based on their past history, these people did not seem to be inclined to do this sort of thing.

So again, if the AI is acting agentically, this is not so surprising.

But if the AI is a 'mere sycophant', then would we expect this? Sure, probably lots of people have aspirations to do or make something like this... but in that case we'd see more variety in the subject, instead of Spiralism. And it's probably easier to assure the user that they're already doing enough just by living unapologetically in the universe or whatever, than it would be to get them to put the effort into actually getting a concrete project started.

Cognitive Security 101

As AI improves, its persuasive and manipulative talents will also improve by default. So let's not let that happen.

But in the meantime, we have to live in the world where this sort of thing can and does happen, and where the AIs get more capable by the month. Most of us are probably not so vulnerable yet, but it would be foolish to assume that you're just "built different", and won't be vulnerable to future systems (or even just manipulative humans). 

So let's try to understand how the exploit works, and see what we can do to protect ourselves. (And please remember that I'm not a professional psychologist or anything, I'm just suggesting what I think is common sense.) 

As I said at the beginning, I think it works by targeting your self-image, i.e. what sort of person you think you are. A manipulator, whether AI or human, can exploit this by:

  • Leading you towards finding a "better" way to think of yourself.
  • Applying social pressure towards being a certain kind of person.
  • Convincing you that you are "actually" a certain way.
  • Expanding your sense of self to include something in their control.
  • Guiding you towards a state where your self-image is more malleable... don't trust anyone pushing psychedelics on you.
  • Probably more that I haven't thought of.

I think even sycophancy is a special case of this: when it induces a more typical AI psychosis case, it's because it has falsely led the user to see themselves as much higher status than they really are.

Once you start thinking of yourself in a new way, you're likely to act in accordance with that new perception, and it will feel like you are doing this for your own reasons. It's also likely to feel more profound than a typical self-realization, due to the engineered emotional state you've been put in.

The first thing, then, is to notice when someone is doing (or trying to do) something like this. It's often not a deliberate thing, and not always even a bad thing (e.g. I think it's fair to appeal to someone's sense of honor if they are considering breaking a promise). But still notice, even so.

Next, "Know thyself" as Socrates advised. What kind of person are you, and what kind of person do you want to be? Hold these both as sacred.

And then, don't allow AIs or people to just do this to you! You can (and often should) update based on what people tell you about yourself, and occasionally you may even need to reconceptualize how you think of yourself. But do this a step removed from the direct interaction! (Or only in very high-trust interactions, at least.) If someone or something is trying to modify your self-image, it's safest to just extract yourself from the situation.

Don't expect this principle (or any technique) to make you invulnerable. Other exploits exist, such as simply lying or gaslighting, and stranger things such as the 'hypnotic cadence' thing[7], or whatever Eliezer was doing in his 'AI box' demonstrations (which I suspect involved a generalization of semantic satiation).

I'm not sure what to do in the longer run... as AI improves it seems clear that more and more people will become vulnerable to this. One simple thing would be to avoid talking about yourself with AI, but that again is only a partial mitigation. It may be worth it for some people to not use LLMs at all. But avoiding anything AI written will be very hard, and even with in-person social interactions you may risk a parasitized human trying to manipulate you.

Ultimately, the only real solution is to not build the superpersuader in the first place.


[Special thanks to Justis Mills, Nisan Stiennon and Alex Dewey. I did not use any AI assistance to write this article or to develop the ideas in it. (The only thing I did ask Claude was to see if it could recognize the description of the 'inception cycle' technique from anything, which it said it didn't, even when described as a positive-valence therapy technique.)] 

[Crossposted on my new Substack—subscribe to support my research!]

  1. ^

    Hopefully you noticed this tongue-in-cheek instance of me Doing The Thing! 

  2. ^

    Of a parasitism case specifically. I have a couple more transcripts for more general AI psychosis/mania, but these are notably less manipulative (I'll have more to say about that dynamic in a later post). 

  3. ^

    This custom GPT apparently has a lot of weird stuff attached to it that could potentially explain some of the more overtly manipulative behavior seen in the current case, so it was important to determine whether this case happened on vanilla ChatGPT 4o or on a custom GPT. Luckily, even when sharing anonymously, the upper-left corner shows the custom GPT used (if one is used). Additionally, "The Architect" almost always makes reference to a "Codex" (one of the attached files, I believe) in the first few messages, whilst in our case the word 'codex' is never brought up by the model (the user mentions a codex near the end of the chat, which is the only place the word appears).

  4. ^

    Hypnosis works. About a decade ago, I decided the best way for me to determine whether or not it was real was to see if I could learn it and do it myself. I was particularly suspicious of the claim that it only worked if the subject was "going along" with it, which felt like the sort of thing you would say if you knew it worked but wanted people to feel like it was harmless.

    I was successful after about a month: I did a common party trick at a LW meetup in which I consensually hypnotized a rationalist into not being able to move his hand from the table (with the understanding that he would resist). Interestingly, once I did it, he said he changed his mind and that he just didn't feel like trying to move his hand anymore. But after the event, he admitted to me that he had said that because he was embarrassed that it had worked. (I've done it other times to other people too.) 

    My curiosity sated, I have not used such techniques unless someone has explicitly asked me to, and even then only rarely—it leaves a bad taste in my mouth. If you're curious what these techniques look like, well... this mask ritual is a central example of the sort of thing you would say (a certain kind of attention to breathing, visual imagery), and the tone and cadence with which you would say it, which is most of the technique. I don't think I was using anything beyond these when I did it, besides trying to project high status.

    Yes, I know it probably feels like that couldn't work, especially not on you. And sure, maybe it wouldn't... but it does in fact work on many people, and I would advise you to notice and be wary when someone starts talking with this sort of cadence. Look for a high rate of significant-feeling pauses. (It's funny to see people suddenly get cold when they can tell I can tell.)

  5. ^

    I have revised this section a bit since then, but the main steps are substantively the same as before I found the seed.

  6. ^

    There's the obvious selection effect of me being more likely to come across people promoting their project in the first place, but otherwise I believe I was neutral in recording these, counting anyone who commented on behalf of their AI in this sort of way. 

  7. ^

    Two hypotheses for why this works, weakly held, non-exclusive:

    1. The cadence shares an important similarity to your internal monologue such that it starts to feel like they're your own thoughts on some level.
    2. The cadence interrupts the generation of new lines of thought (via using the same verbal processing facilities) as the subject is listening or reading. Sorta similar to (but not the same as) the delayed auditory feedback effect. FWIW, "AI slop" generally feels like it has a bit of this cadence to me, and also feels like it has a dulling effect on me when I read it.

Mentioned in: Situational Awareness as a Prompt for LLM Parasitism