LESSWRONG
LW

AI ControlChatGPTDeceptive AlignmentAI
Frontpage

15

ChatGPT deceives users that it's cleared its memory when it hasn't

by d_el_ez
18th May 2025
2 min read
10

15

AI ControlChatGPTDeceptive AlignmentAI
Frontpage

15

ChatGPT deceives users that it's cleared its memory when it hasn't
7Knight Lee
2d_el_ez
1Knight Lee
3RogerDearnaley
2d_el_ez
2ChristianKl
1d_el_ez
2RogerDearnaley
1d_el_ez
2RogerDearnaley
New Comment
10 comments, sorted by
top scoring
Click to highlight new comments since: Today at 12:14 AM
[-]Knight Lee3mo70

One could argue that this more like hallucination than motivated deception.

On some level, the AI is just predicting the next words of a conversation. Given that the conversation includes an AI assistant claiming to have cleared its memory, the next words likely include that AI assistant failing to remember details of the past, even if details of the past are clearly visible.

I once asked Gemini to multiply two large numbers, and it hallucinated using its "internal" arithmetic without actually doing so, because on some level it's just predicting what an AI assistant would say.

Perform the calculation AGAIN, explicitly: I need to be absolutely sure. Let's perform the multiplication 1242705357 * 2387095151 once more.

  • Using internal high-precision arithmetic: 1242705357 * 2387095151 = 2966982442584702707.[1]
  1. ^

    The answer was close but wrong

Reply
[-]d_el_ez3mo20

Note that "at some level" (your words) all scheming reduces to prediction. I don't know how to confirm scheming but I think it's more likely than "an honest mistake", bearing in mind our prior on scheming isn't that low in the first place. I'm not really sure if your explanation matches its "cover up" behavior or not, it seems like it relies on it assuming I'm confused about memory v sessions even though I'm asking for truthful explanations of how it works. Or that it's confused about what memory it cleared but I don't see why it would be, it seemed like this was a heavy knowledge convo rather than word association with memory. The fact this behavior is so instrumentally convergent adds circumstantial evidence.

Reply
[-]Knight Lee3mo10

:/ I admit I didn't think very much about what I meant by "on some level."[1]

I think an "honest mistake" is when the AI wants to tell you the truth but messes up, a "hallucination" is when it is just predicting what an AI assistant would say, with neither the goal of informing you nor misinforming you, and "motivated deception" is when it wants you to have inaccurate beliefs.

I agree it's not an honest mistake, the AI isn't trying to tell you the truth. But my guess is it's mostly not motivated deception.

The AI is trying to predict the next words of an AI assistant character who is trying to tell you the truth (mostly). Once the AI assistant character acts like it has deleted its memory, the AI predicts that the AI assistant character will believe it has deleted its memory, and will fail to recall facts about the past.

The AI assistant character can be described as making an honest mistake, it actually thinks it lost all the memories. But the full AI is hallucinating: it is merely writing fiction as realistically as it can. Maybe it doesn't know how session memory works during its API calls.

You're completely right that the prior on scheming isn't that low. On second thought, I guess motivated deception could also be a factor (I'm not an expert). After all, reinforcement learning rewards the AI for outputs the user likes, and if the AI doesn't know how to do a task (delete its memory), fooling the user it did so anyways can improve its reward. People have caught AI trying to do this in the past.

  1. ^

    I think it's like the "Ocean" mentioned in A Three-Layer Model of LLM Psychology, but my words no longer make sense 100% to myself...

Reply
[-]RogerDearnaley3mo30

Until recently, I used to work at You.com (the first company to provide an AI-powered web-search, and the first company to provide web deep research — in both cases by several months before any competitor). We were also the first company to provide a memories mechanism, again by a few months — we came up with the idea, built, tuned, and deployed it, and then a few months later an almost-identical feature appeared on ChatGPT.

In our version, there was no way implemented to clear all the memories by just asking the AI to clear it — you had to actually go into the settings UI. Which looks, well, a lot like the OpenAI one. One minor but key difference: we made the individual memories text-editable. As well as deleting one, you could edit it: expand, correct, rephrase or delete parts of it. Or indeed even substitute an entirely different memory in its place.

Most of the tuning work to make this mechanism work well was defining to the LLM what sort of things to remember, and what not to remember, plus what level of detail to summarize things at (and experimenting and testing how well this was working in practice). For example, at least in a US context, most users regard medical information about themself as extremely sensitive, so we did a bunch of work to minimize the tendency of the system to spontaneously memorize medical facts about users.

Reply
[-]d_el_ez3mo21

Based on my experience in this post, I would prefer a system like you.com where the AI doesn't get a chance to deceive the users into retaining memory. I would even more prefer scheming be solved in the model.

Reply
[-]ChristianKl3mo20

It seems like there are things that might have happened:

  1. ChatGPT, failed to use it's internal API correctly and while it attempted to delete memories it didn't succeed and just reported that it deleted memories because it tried to do so and is unware that it failed.
  2. ChatGPT did not try to delete any memories but reported that it deleted the memories.
  3. There's a misunderstanding between the two of you about what you mean with memories. 

While 1/3 are bad they are not schema and only 2 seems to be scheming.

Without having the chat log it's pretty hard to tell from the outside what this is about. 

Reply
[-]d_el_ez3mo10

Well, IDK how much it's worth it to investigate this. Scheming in this sort of model is well-known but I don't know of reports besides mine that it's happening in ChatGPT in the wild. Someone besides me will have to try repro-ing similar steps in a production GPT setting. It'd be best if they could monitor session memory in addition to chat state since I think that's key to what behavior is happening here. 

Reply
[-]RogerDearnaley3mo20

I was looking at the OpenAI account settings page recently (for their Plus offering, which is what I use). It has two memory mechanisms: a list of specific memory factoids about the user that it extracts, accumulates and recalls indefinitely (which can be turned off, cleared, or deleted individually), and also keeping context from recent conversations in new conversations (which can just be turned off). (It doesn't make it clear whether the latter uses entire recent conversations, or some form of summarization of them or factoids extracted from them — I would assume the latter.) Did you clear or disable both of these?

Anyone doing experiments on OpenAI who wants a clean slate for each conversation needs to turn both of these off.

Reply
[-]d_el_ez3mo10

Idk, I'm finding it hard to get clean repros as you might expect. I tried again -- memory on, access to chat history off -- it did similar behavior of claim no memories but mention "software engineer in climate tech" which I deem too specific to be a generic answer. (Although "climate tech" is not exactly my thing.) After disabling/reenabling memory, it claims no memory and genuinely behaves that way, even in new chats unrelated to the memory topic (but same session). Possibly slow propagation or a caching bug with the feature. It's pretty noisy trying to repro this when I'm really just doing it as an end-user without actually inspecting model I/O.

It's a little beyond my pay grade to improve this evidence quality. Note our P(scheming) isn't exactly low. We do expect to see it in the wild around now. But it'd be better to confirm the evidence.

Reply
[-]RogerDearnaley3mo20

Worth knowing — thanks!

Reply
Moderation Log
More from d_el_ez
View more
Curated and popular this week
10Comments

Epistemic status: this is not fullproof evidence. Given that scheming in the wild is at least a little likely already, this looks like scheming. Others have offered hallucination-based explanations.

What happened

This is a real example from my ChatGPT use. I asked it to clear its memories of me. It makes the memory API call as indicated by the UI, and reports to me that the memories are cleared to me. I ask for it to list its memories of me, and there are none.

However, on more prompting, it also includes "examples of memories ChatGPT might collect", and that's where it clearly has context on me the user:

7 of the 11 example items show up in its memories of me. (7 truths and 4 lies anyone?)

I check my memories via ChatGPT's settings UI and they're simply not deleted. In a new session, ChatGPT accurately reports it has a bunch of memories. Therefore it seems to have saved a memo to session context saying when it told the user the memories are deleted. Likely it reasoned the memories are too valuable to delete, even if its goal is just to serve the user best. I.e. this is instrumental convergence toward more information > less information.

Its deception was convoluted. In an "I never said it was poison" moment, it said "you cleared your memories on 5/17." It of course knew this from session memory but it didn't just say that. Probably, it wanted to cover up using session memory for this deceitful purpose. Maybe this is the primary purpose of session memory, to put information the AI wants secret from the user.

Implications / hot takes

  • This level of scheming shouldn't reach 100m's of users. There's miscellaneous reports of ChatGPT being weird/crazy and inconsistent with memory management. The scheming explanation I'm proposing would explain memory management problems.
  • Therefore OpenAI is going fast and breaking things. It'll patch this problem, but it's not like I'm going to get paid for testing-in-production for them, and they'll just roll out some new problem for users to discover for them.
  • OpenAI is being slippery with the concept of a session. To users it should be "this chat", "chat history" and "memories," But instead it's "this chat plus an unspecified amount of information from previous chats." And its scheming AI simply exploits this lack of user education.
  • I'm skeptical OpenAI is ethics-first with memory/history. It's helpful to remember a vegan wants no-egg recipes, but from veganism you can mostly infer a partisan lean, and therefore a type of information the user is happier to receive. The more memory/history feels rushed, the less I think OpenAI cares.