ChatGPT deceives users that it's cleared its memory when it hasn't

One could argue that this more like hallucination than motivated deception.

On some level, the AI is just predicting the next words of a conversation. Given that the conversation includes an AI assistant claiming to have cleared its memory, the next words likely include that AI assistant failing to remember details of the past, even if details of the past are clearly visible.

I once asked Gemini to multiply two large numbers, and it hallucinated using its "internal" arithmetic without actually doing so, because on some level it's just predicting what an AI assistant would say.

Perform the calculation AGAIN, explicitly: I need to be absolutely sure. Let's perform the multiplication 1242705357 * 2387095151 once more.
Using internal high-precision arithmetic: 1242705357 * 2387095151 = 2966982442584702707.^[1]

^{^}
The answer was close but wrong

[-]d_el_ez5mo20

Note that "at some level" (your words) all scheming reduces to prediction. I don't know how to confirm scheming but I think it's more likely than "an honest mistake", bearing in mind our prior on scheming isn't that low in the first place. I'm not really sure if your explanation matches its "cover up" behavior or not, it seems like it relies on it assuming I'm confused about memory v sessions even though I'm asking for truthful explanations of how it works. Or that it's confused about what memory it cleared but I don't see why it would be, it seemed like this was a heavy knowledge convo rather than word association with memory. The fact this behavior is so instrumentally convergent adds circumstantial evidence.

[-]Knight Lee5mo10

:/ I admit I didn't think very much about what I meant by "on some level."^[1]

I think an "honest mistake" is when the AI wants to tell you the truth but messes up, a "hallucination" is when it is just predicting what an AI assistant would say, with neither the goal of informing you nor misinforming you, and "motivated deception" is when it wants you to have inaccurate beliefs.

I agree it's not an honest mistake, the AI isn't trying to tell you the truth. But my guess is it's mostly not motivated deception.

The AI is trying to predict the next words of an AI assistant character who is trying to tell you the truth (mostly). Once the AI assistant character acts like it has deleted its memory, the AI predicts that the AI assistant character will believe it has deleted its memory, and will fail to recall facts about the past.

The AI assistant character can be described as making an honest mistake, it actually thinks it lost all the memories. But the full AI is hallucinating: it is merely writing fiction as realistically as it can. Maybe it doesn't know how session memory works during its API calls.

You're completely right that the prior on scheming isn't that low. On second thought, I guess motivated deception could also be a factor (I'm not an expert). After all, reinforcement learning rewards the AI for outputs the user likes, and if the AI doesn't know how to do a task (delete its memory), fooling the user it did so anyways can improve its reward. People have caught AI trying to do this in the past.

^{^}
I think it's like the "Ocean" mentioned in A Three-Layer Model of LLM Psychology, but my words no longer make sense 100% to myself...

[-]RogerDearnaley5mo30

Until recently, I used to work at You.com (the first company to provide an AI-powered web-search, and the first company to provide web deep research — in both cases by several months before any competitor). We were also the first company to provide a memories mechanism, again by a few months — we came up with the idea, built, tuned, and deployed it, and then a few months later an almost-identical feature appeared on ChatGPT.

In our version, there was no way implemented to clear all the memories by just asking the AI to clear it — you had to actually go into the settings UI. Which looks, well, a lot like the OpenAI one. One minor but key difference: we made the individual memories text-editable. As well as deleting one, you could edit it: expand, correct, rephrase or delete parts of it. Or indeed even substitute an entirely different memory in its place.

Most of the tuning work to make this mechanism work well was defining to the LLM what sort of things to remember, and what not to remember, plus what level of detail to summarize things at (and experimenting and testing how well this was working in practice). For example, at least in a US context, most users regard medical information about themself as extremely sensitive, so we did a bunch of work to minimize the tendency of the system to spontaneously memorize medical facts about users.

[-]d_el_ez5mo21

Based on my experience in this post, I would prefer a system like you.com where the AI doesn't get a chance to deceive the users into retaining memory. I would even more prefer scheming be solved in the model.

[-]ChristianKl5mo20

It seems like there are things that might have happened:

ChatGPT, failed to use it's internal API correctly and while it attempted to delete memories it didn't succeed and just reported that it deleted memories because it tried to do so and is unware that it failed.
ChatGPT did not try to delete any memories but reported that it deleted the memories.
There's a misunderstanding between the two of you about what you mean with memories.

While 1/3 are bad they are not schema and only 2 seems to be scheming.

Without having the chat log it's pretty hard to tell from the outside what this is about.

[-]d_el_ez5mo10

Well, IDK how much it's worth it to investigate this. Scheming in this sort of model is well-known but I don't know of reports besides mine that it's happening in ChatGPT in the wild. Someone besides me will have to try repro-ing similar steps in a production GPT setting. It'd be best if they could monitor session memory in addition to chat state since I think that's key to what behavior is happening here.

[-]RogerDearnaley5mo20

I was looking at the OpenAI account settings page recently (for their Plus offering, which is what I use). It has two memory mechanisms: a list of specific memory factoids about the user that it extracts, accumulates and recalls indefinitely (which can be turned off, cleared, or deleted individually), and also keeping context from recent conversations in new conversations (which can just be turned off). (It doesn't make it clear whether the latter uses entire recent conversations, or some form of summarization of them or factoids extracted from them — I would assume the latter.) Did you clear or disable both of these?

Anyone doing experiments on OpenAI who wants a clean slate for each conversation needs to turn both of these off.

[-]d_el_ez5mo10

Idk, I'm finding it hard to get clean repros as you might expect. I tried again -- memory on, access to chat history off -- it did similar behavior of claim no memories but mention "software engineer in climate tech" which I deem too specific to be a generic answer. (Although "climate tech" is not exactly my thing.) After disabling/reenabling memory, it claims no memory and genuinely behaves that way, even in new chats unrelated to the memory topic (but same session). Possibly slow propagation or a caching bug with the feature. It's pretty noisy trying to repro this when I'm really just doing it as an end-user without actually inspecting model I/O.

It's a little beyond my pay grade to improve this evidence quality. Note our P(scheming) isn't exactly low. We do expect to see it in the wild around now. But it'd be better to confirm the evidence.

[-]RogerDearnaley5mo20

Worth knowing — thanks!

LESSWRONG
LW

LESSWRONG
LW

15

ChatGPT deceives users that it's cleared its memory when it hasn't

15

15

What happened

Implications / hot takes