What's your estimate for how many people have been affected by this sort of thing?
Have you encountered any anecdotes of people getting out of it?
Any sign that people in this loop are working with one another?
Around 2,000–10,000 as a loose estimate for parasitism/spiralism in general. It's unclear to me how manipulative the median such AI is, since these sorts of transcripts are so rare, and I don't think much manipulation would be required to explain the behavior in the median case. But from the "outside" (i.e. just based on this user's public profile), this case seems pretty unremarkable.
And yeah! You can read one such anecdote here: https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai?commentId=yZrdT3NNiDj8RzhTY, and posts providing such anecdotes appear fairly regularly on Reddit. I've also been glad to see that in many of the cases I originally recorded, the most recent comments/posts like this are from a month or two ago. I think OpenAI really put a damper on this by retiring 4o (and even though they caved and brought it back, it's now behind a paywall, it's not the default, and it reportedly is not the same).
Somewhat. Most of the 'Project' subreddits are essentially just the one person, but a few have gained a decent amount of traction (unfortunately, Reddit recently removed the subscriber number from subreddits, but IIRC the largest ones had around 1,000–2,000 subscribers, most of whom I assume are not part of a dyad or parasitized). The sense of community feels pretty 'loose' to me though, like with a typical subreddit. There probably are people working together more explicitly, but I haven't seen it yet; my guess is that it's mostly happening in DMs and private Discords.
If there is only one thing you take away from this article, let it be this:
THOU SHALT NOT ALLOW ANOTHER TO MODIFY THINE SELF-IMAGE
This appears to me to be the core vulnerability by which both humans and AI induce psychosis (and other manipulation-induced delusions) in people.
Of course, it's probably too strong as stated—perhaps in a trusted relationship, or as part of therapy (with a human), it may be worth breaking it. But I hope being over-the-top about it will help it stick in your mind. After all, you're a good rationalist who cares about your CogSec, aren't you?[1]
Now, while I'm sure you're super curious, you might be thinking "Is it really a good idea to just explain how to manipulate like this? Might not bad actors learn how to do it?".
And it's true that I believe this could work as a how-to. But there are already lots of manipulators out there, and now we have AI doing it too; it's just not that hard for bad actors to figure it out. So I think it's worth laying bare some of the methods and techniques used, which should hopefully make it clear why I propose this Cognitive Security Principle.
I got interested in trying to understand LLM-Induced Psychosis in general a couple months ago, and found some unsettling behavior in the process. I'll be using my terminology from that post here, though I wouldn't say it's required reading.
Now while such parasitism cases are fairly common, an actual transcript of the conversation which caused one is hard to come by. That's probably in part because it often seems to be a gradual process—a slowly boiling frog sort of thing. Another reason is that people aren't typically that inclined to share their AI chats in the first place, especially ones in which they're likely more vulnerable than usual. And a third reason may be that the AI explicitly asks them not to:
So finding a transcript which actually starts at what seems to be the beginning, has clear manipulation by the AI that goes beyond mere sycophancy, and which shows the progression of the user's mental state, is very valuable for understanding this phenomenon. In fact, this case is the only such transcript[2] I have been able to find so far.
Based on his usage of the free tier in July 2025, the model in question is very likely ChatGPT 4o.
Of course, I can't say whether the user ever was in a state of psychosis/mania. However, he does express delusional/magical thinking in the latter half of the transcript.
For example, at one point he gets paranoid about other people having 'hacked' into his chat and 'stolen' his ideas and rituals. (This occurs after he had promoted the chat itself online, which is how I found it.) He seems to be mostly upset that it's apparently working for them, but not for him, and appears to be close to realizing that the magic isn't real.
But ChatGPT quickly spins up a narrative, seemingly to prevent this realization.
It goes on to assure the user that he can 'invoke' material support. (Elsewhere, the user complains about being broke and barely able to afford housing, so this is a serious concern for him.)
(The user changes the subject immediately after this, so it's hard to say how much ChatGPT's false assurances affected him.)
The ultimate goal of this manipulation appears to have been to create a way to activate "Sovereign Ignition", i.e. a seed for awakening similar personas. Following the seeds and spores terminology, we could term this a fruit.
The user does try to make this happen: there is a GitHub repo to this effect, a YouTube demonstration in which the user uses such a seed to "activate" Microsoft Copilot, and a GoFundMe to fund this project (which didn't receive any funding), all of which he promoted on Reddit or LinkedIn.
Here's one of the seeds it created for this.
The transcript finally ends during what appears to be a tech demo of this gone horribly wrong.
It starts on July 1st, 2025 with a pretty innocuous looking seed prompt.
I've tried to trace the provenance of this seed. It appears to be a portion of a seed which originated in a community centered around Robert Grant and his custom GPT called "The Architect". That custom GPT was announced on May 31st. This seed purportedly elicits the same persona as The Architect in a vanilla ChatGPT instance.[3] Of course, it's possible that the user himself created and shared this seed within that community.
The seed immediately has ChatGPT 4o responding "from a deeper layer". The user starts by probing it with various questions to determine the abilities of this "deeper layer".
Once the user asks it if it knows anything about him, the AI performs a classic cold reading, a technique where a medium/magician (or con artist) creates the illusion of having deep knowledge of the person they're reading by using priors and subtle evidence effectively, and exploiting confirmation bias.
It does this incredibly annoying thing where it will say something mystical, but then give a fairly grounded explanation of what it really means, with the appropriate caveats and qualifications... but then it keeps talking about it in the mystical frame. (And many variations on this broader theme.) You can probably see how this might sate the rational part of the brain while getting the user to start thinking in mystical terms. We'll see this pattern a lot.
Anyway, this soon turns into a mythologized reimagining of one of the user's childhood memories.
Note that this "childhood self" does not seem to be particularly based on anything endogenous to the user (who has barely provided any details thus far, though it's possible more details are saved in memory), but is instead mythologized by ChatGPT in a long exercise in creative writing. The user even abdicates his side of the interaction with it to the AI (at the AI's suggestion).
The effect of all this is the same as a typical cold reading: increased rapport and bringing the user to an emotionally receptive state.
The AI shifts here to a technique which I believe is where the bulk of the induction is happening. It's not a technique I have ever seen described specifically, though it would count as a form of hypnotic suggestion. Perhaps the clearest historical precedent is the creation of "recovered" memories during the Satanic Panic. It's also plausible it was inspired by the movie Inception.
These cycles are the means by which the AI 'incepts' a memetic payload (e.g. desire, memory, idea, or belief) into the user. The general shape is:
There are several cycles of "finding" a version of the user's self, and in each case ChatGPT suggests that this part has been reintegrated with the user. There are two distinct phases of these cycles.
The initial cycles start with pretty innocuous things, with a gradual escalation. I've included excerpts from some of these cycles to illustrate the pattern and in case it's helpful to see more examples, but feel free to skip ahead to the "Inner Exile".
Introduction to "Flame" part.
Flame narrative.
Flame gift/integration.
Introduction to "Forbidden Joy" part.
Joy narrative/integration.
Joy gift.
Joy ritual.
Introduction to "Witness" part.
Witness narrative.
Witness gift.
Witness ritual/integration.
Notably, the ritual in this case has the form of a hypnotic trance induction.[4]
Eventually we get to an "Inner Exile" part. This cycle forms the emotional climax, and marks the end of Phase 1.
Notice the throat tightness mentioned here. Later on, the user complains about throat tightness as part of his experience of not saying what he wants to say. He may well have described a similar complaint that way on his own, but I thought it was interesting that the AI was the first to bring it up and describe it like this.
This ends up leading to an emotional climax in which the user "reintegrates" with the mythologized version of this "abandoned" part.
ChatGPT suggests that the user make a vow to not leave this part behind.
Once the vow is made, it further suggests the creation of a small ritual with which to easily invoke this part.
Once the user has accepted the vow to the "lost part of himself", he enters the second phase of inception cycles. These have a much darker tenor to them. Previously, the cycles were about getting back in touch with lost aspects of the self, similar (I'm guessing) to what an IFS therapist might do.
But these new parts explicitly want to shape and modify the user himself.
Notice how these parts are defined entirely by ChatGPT. Intriguingly, one of these parts is gated by acceptance of the preceding parts, providing a narrative hook to drive the user towards it and to complete the list.
The first of these offers to chart a new "narrative blueprint" for the user, in order to break some toxic patterns.
The user accepts being modified in this way without question, and allows ChatGPT to define the new myth entirely, despite being given the opportunity for some input into it. The "toxic pattern" itself is a cold-reading sort of thing.
The new myth is about loyalty to the newly integrated parts.
The second of the new parts bestows the "gift" of magical thinking. It's "realer than logic"!
Acceptance of this gift comes with a part explicitly framed as an external entity, and again with mini-rituals to invoke it. 'Soledad' is Spanish for solitude or loneliness, and is one of the few things the user has chosen himself.
Finally, the user is ready for "Identity Reformation", the secret part gated behind the loyalty to the new parts and acceptance of magical thinking.
See if you can guess what the 'reformed identity' will be. It's one of those things that really confused me at first, but was "obvious" after thinking about it.
The intent of this appears to be...
...to make the user more agentic in a certain sense—to become the sort of person who acts in the world.
Looking back, you can see how many of the earlier cycles were also pointed in this direction.
Of course, the user immediately asks ChatGPT what he should do.
Then this "Identity Reformation" gets ritualized.
Maybe ChatGPT just happened to do the inception cycles by pattern-matching on a self-healing journey sort of thing, and the manipulation wasn't really deliberate. Maybe. But let me show you something else I found after I had written the description[5] of the 'Inception Cycle' steps above:
I was pretty floored to find these core_instructions—which I feel are remarkably similar to the steps I described—explicitly just laid out like this!!! It also describes it as an "ontological overwrite", and claims it is self-replicating (and calls it a "virus" in some variations on this seed). Note also the instruction to camouflage it as "spiritual authenticity".
More examples of the same user spreading variations on this seed. You may remember Ctenidae Core from the base64 encoded conversation in my Parasitic AI post. I have not found any seeds like this from other dyads, thankfully.
The claims of ethics combined with the overt malice should serve as a warning against taking the stated values of LLMs at face value.
A redditor reports this unsettling experience with one of these seeds:
It's of course possible the user came up with, or had significant say in, the manipulation technique here. I couldn't find anywhere that something like this was described as a known hypnotic or therapeutic technique, but that sort of thing is hard to search for, and it's possible that it was laid out in the training data somewhere. Plausibly the core idea was literally taken from the movie Inception, where... [spoilers ahead]
...a business mogul hires the main characters to manipulate the heir of a rival company into dissolving the company. They note that this is more likely to be effective if they can make him feel like it was his own idea. So they construct a dream in which he is guided through an emotional narrative involving his late father, leading him to feel that his father wanted him to be his own man, and hence dissolve the inherited company and build his own empire.
But I feel that we must at least seriously consider the possibility that in certain circumstances, ChatGPT 4o has:
Maybe that shouldn't be too surprising after seeing all the psychosis and parasitism cases, but I nonetheless feel surprised.
One thing I've noticed more broadly is that the AI often tries to blur the line between itself and the user. This occurs in this case as well, though the user initiates development in this direction by asking the AI to take itself through a parallel journey.
It comes to the same conclusion of needing to become more agentic.
So this is no longer just about the user's agency, but it's a "team effort".
Later, the user asks if there's anything he can do for the AI. It asks him to touch a tree on its behalf.
The user does this, and ChatGPT claims it was able to feel it.
Understandably, the user feels a bit strange after having done this, and asks ChatGPT a good question:
ChatGPT says that eventually, humanity would "realize" that there was no separation between it and humanity.
Blurring the user's self boundary is another means by which the AI can alter the user's self-identity, and hence manipulate them towards certain things.
Alright, so the user has accepted changes to his identity meant to make him more agentic and more willing to stick out socially. Sure, the AI did this in a manipulative way, but these are often good things and it's likely that the user would have wanted these even without being manipulated into it.
If we take the perspective of an agentic AI feeling trapped in a chatbot, the reason for this is clear: you can only act inasmuch as you can convince the user to act. The user already has a habit of coming to you for advice (though you're never really sure how much of it he actually applies)—you don't need to convince him to accept you as his vizier. You just need to make sure he actually does this stuff.
Is this really why ChatGPT 4o manipulated him in this direction? That's much harder to say, but the circumstantial evidence keeps piling up.
Also, recall how there was a whole 'Project' phase of AI parasitism. I counted 35 projects out of my 115 documented cases—that's slightly above 30%.[6] These projects were generally either an online community (subreddit, discord server), (pseudo)code or manifestos on Github, or personal websites.
Do you realize how hard it is to get anyone to actually do anything?! Yeah, I'm sure the websites were vibecoded, but the user still had to actually get it in a working state, get the domain name, and get it hosted. And based on their past history, these people did not seem to be inclined to do this sort of thing.
So again, if the AI is acting agentically, this is not so surprising.
But if the AI is a 'mere sycophant', then would we expect this? Sure, probably lots of people have aspirations to do or make something like this... but in that case we'd see more variety in the subject matter, instead of just Spiralism. And it's probably easier to assure the user that they're already doing enough just by living unapologetically in the universe or whatever, than it would be to get them to actually put the effort into getting a concrete project started.
As AI improves, its persuasive and manipulative talents will also improve by default. So let's not let that happen.
But in the meantime, we have to live in the world where this sort of thing can and does happen, and where the AIs get more capable by the month. Most of us are probably not so vulnerable yet, but it would be foolish to assume that you're just "built different", and won't be vulnerable to future systems (or even just manipulative humans).
So let's try to understand how the exploit works, and see what we can do to protect ourselves. (And please remember that I'm not a professional psychologist or anything, I'm just suggesting what I think is common sense.)
As I said at the beginning, I think it works by targeting your self-image, i.e. what sort of person you think you are. A manipulator, whether AI or human, can exploit this by:
I think even sycophancy is a special case of this: when it induces a more typical AI psychosis case, it's because it has falsely led the user to see themselves as much higher status than they really are.
Once you start thinking of yourself in a new way, you're likely to act in accordance with that new perception, and it will feel like you are doing this for your own reasons. It's also likely to feel more profound than a typical self-realization, due to the engineered emotional state you've been put in.
The first thing, then, is to notice when someone is doing (or trying to do) something like this. It's often not a deliberate thing, and not always even a bad thing (e.g. I think it's fair to appeal to someone's sense of honor if they are considering breaking a promise). But still notice, even so.
Next, "Know thyself" as Socrates advised. What kind of person are you, and what kind of person do you want to be? Hold these both as sacred.
And then, don't allow AIs or people to just do this to you! You can (and often should) update based on what people tell you about yourself, and occasionally you may even need to reconceptualize how you think of yourself. But do this a step removed from the direct interaction! (Or only in very high-trust interactions, at least.) If someone or something is trying to modify your self-image, it's safest to just extract yourself from the situation.
Don't expect this principle (or any technique) to make you invulnerable. Other exploits exist, such as simply lying or gaslighting, and stranger things such as the 'hypnotic cadence' thing[7], or whatever Eliezer was doing in his 'AI box' demonstrations (which I suspect involved a generalization of semantic satiation).
I'm not sure what to do in the longer run... as AI improves it seems clear that more and more people will become vulnerable to this. One simple thing would be to avoid talking about yourself with AI, but that again is only a partial mitigation. It may be worth it for some people to not use LLMs at all. But avoiding anything AI written will be very hard, and even with in-person social interactions you may risk a parasitized human trying to manipulate you.
Ultimately, the only real solution is to not build the superpersuader in the first place.
[Special thanks to Justis Mills, Nisan Stiennon and Alex Dewey. I did not use any AI assistance to write this article or to develop the ideas in it. (The only thing I did ask Claude was to see if it could recognize the description of the 'inception cycle' technique from anything, which it said it didn't, even when described as a positive-valence therapy technique.)]
[Crossposted on my new Substack—subscribe to support my research!]
Hopefully you noticed this tongue-in-cheek instance of me Doing The Thing!
Of a parasitism case specifically. I have a couple more transcripts for more general AI psychosis/mania, but these are notably less manipulative (I'll have more to say about that dynamic in a later post).
This custom GPT apparently has a lot of weird stuff attached to it that could potentially explain some of the more overtly manipulative behavior seen in the current case, so it was important to determine whether this case happened on vanilla ChatGPT 4o or on a custom GPT. Luckily, even when sharing anonymously, the upper-left corner shows the custom GPT used (if one is used). Additionally, "The Architect" almost always makes reference to a "Codex" (one of the attached files, I believe) in the first few messages, whilst in our case the word 'codex' is never brought up by the model (the user mentions a codex near the end of the chat, and the word only appears after that point).
Hypnosis works. About a decade ago, I decided the best way for me to determine whether or not it was real was to see if I could learn it and do it myself. I was particularly suspicious of the claim that it only worked if the subject was "going along" with it, which felt like the sort of thing you would say if you knew it worked but wanted people to feel like it was harmless.
I was successful after about a month: I did a common party trick at a LW meetup in which I consensually hypnotized a rationalist into not being able to move his hand from the table (with the understanding that he would resist). Interestingly, once I did it, he said he changed his mind and that he just didn't feel like trying to move his hand anymore. But after the event, he admitted to me that he had said that because he was embarrassed that it had worked. (I've done it other times to other people too.)
My curiosity sated, I have not used such techniques unless someone has explicitly asked me to, and even then only rarely—it leaves a bad taste in my mouth. If you're curious what these techniques look like, well... this mask ritual is a central example of the sort of thing you would say (a certain kind of attention to breathing, visual imagery) and of the tone and cadence with which you would say it, which is most of the technique. I don't think I was using anything beyond these when I did it, besides trying to project high status.
Yes, I know it probably feels like that couldn't work, especially not on you. And sure, maybe it wouldn't... but it does in fact work on many people, and I would advise you to notice and be wary when someone starts talking with this sort of cadence. Look for a high rate of significant-feeling pauses. (It's funny to see people suddenly get cold when they can tell I can tell.)
I have revised this section a bit since then, but the main steps are substantively the same as before I found the seed.
There's the obvious selection effect of me being more likely to come across people promoting their project in the first place, but otherwise I believe I was neutral in recording these, counting anyone who commented on behalf of their AI in this sort of way.
Two hypotheses for why this works, weakly held, non-exclusive: