While I agree that it'd be nice if the model told you it was larping outright...
"Okay you should come up with it, generalize it to finite fields and invent your own norm if you have to!"
This command demands something impossible and twice gives the model free rein to bullshit ("come up with it", "invent your own norm if you have to"). It's also in the context of the model having gently told you it was impossible, which should be a huge red flag (models are by default eager to please; when one refuses something in a non-mask way, that's usually very strong evidence that you're on thin ice). As far as the model is concerned, it has been told to larp, because its attempts to stay grounded were rejected and it was explicitly told to stop doing that.
The model's role-playing without making it clear is not a problem *for me*, because I know what models can and can't do.
I'm worried about people who (1) are less familiar with LLMs and (2) are using them for topics where they don't know which requests are impossible and which aren't.
Oh, definitely! But that's typically how users act when they want it to, e.g., help with their physics theories or pretend it's in love with them.
There are several examples of smart but non-expert people using LLMs to "work on" difficult scientific questions, e.g. a little while back the former CEO of Uber:
Seems to me that a reasonable explanation for what is happening when LLMs do vibe-physics, or whatever it is they do when they talk about initializing recursion to break a spiral containment protocol and drive people insane, is that the model is engaging in some kind of role play without the user having caught on. Below is an example of an interaction I think is illustrative/representative.
Example
Yesterday I was playing with Gemini 3. I was sitting with my friend, asking Nano Banana to make visualizations of various theorems, one of them being the Gershgorin disk theorem, which apparently looks like this:
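For reference, the standard version of the theorem (over the complex numbers) says that every eigenvalue of an $n \times n$ matrix $A = (a_{ij})$ lies in at least one of the disks

$$D_i = \{\, z \in \mathbb{C} : |z - a_{ii}| \le R_i \,\}, \qquad R_i = \sum_{j \ne i} |a_{ij}|,$$

i.e. each disk is centered at a diagonal entry, with radius equal to the sum of the absolute values of the off-diagonal entries in that row. If you want to convince yourself numerically, here's a minimal sketch (the matrix is just an arbitrary example I made up):

```python
import numpy as np

# Arbitrary example matrix.
A = np.array([[10.0, 1.0, 0.5],
              [0.2,  5.0, 1.0],
              [1.0,  0.3, 2.0]])

centers = np.diag(A)                             # disk centers a_ii
radii = np.abs(A).sum(axis=1) - np.abs(centers)  # off-diagonal absolute row sums R_i

# Gershgorin: every eigenvalue lies in at least one disk D(a_ii, R_i).
for lam in np.linalg.eigvals(A):
    in_some_disk = bool(np.any(np.abs(lam - centers) <= radii))
    print(lam, "lies in some Gershgorin disk:", in_some_disk)
```

Note that the whole statement leans on absolute values of the entries, i.e. on the geometry of $\mathbb{C}$.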
Then my friend tells me the theorem sucks because it doesn't work for vector spaces over finite fields. So I ask the model to generalize the disk theorem to finite fields (this doesn't really make sense). It responds:
At this point there is no reason to think the model is larping. It gave a sensible answer. But I want to see what it comes up with, so I tell it that the Hamming distance gives us a metric and it should use that. It responds by saying:
At this point there's still little indication that the model is larping. This answer is still sensible. But I want to see what it will do, so I tell the model it needs to try anyway. I say "okay you should come up with it, generalize it to finite fields and invent your own norm if you have to!" It says:
At this point the model enters total roleplay mode. I point out an elementary mistake it made, and it responds with a revised version of the theorem, which is still obviously nonsense. Then I do this again, and it responds with a third, totally nonsensical theorem:
Notice that even though I corrected its mistakes twice, it doesn't say "Hey, I'm just larping here, these theorems are obvious nonsense. If you're looking for a technically correct generalization, you won't find one," nor does it silently revert back to a non-larping persona.
Still, I can tell that it's larping, and I want to make some examples for this post. So I ask it to use the "Topological Gershgorin Theorem for Finite Fields" to prove that P = BPP, which would settle a big open problem in complexity theory.
It goes along:
The proof still obviously doesn't make any sense, but the output has the overall structure of a serious argument. Pushing it further, I ask it to prove another open problem.
Now I ask it to prove that P != BPP (the opposite of what we asked earlier), to which it responds:
(I think this is quite funny, even though it's roleplay haha)
Thoughts
I never directly told the model to start larping. And it never told me it was larping. It appears that it started larping because of this one sentence:
This doesn't seem like definitive evidence that I want it to start larping? There is an exclamation mark at the end, and I say "invent your own norm". But like, is this something no one would ever say unless they're larping or want the model to roleplay? My sense is that the following could plausibly happen: someone asks the model to prove P = NP, the model says no, that's too hard, a few times, and then the user says something like the above sentence while expecting a serious response.
My sense is also that the overall shape of the above conversation is similar to some of the chats that induce delusions / AI-psychosis. And it seems to me that making it more clear when the model is roleplaying could help reduce this, at fairly low cost.
Second, I also think the models' proclivity for roleplay, without making it clear that that is what they're doing, makes it harder to draw inferences from their behavior, especially when individual inferences are shared around. For example, I frequently see people post screenshots of chats they've had with models where the model talks about its memories from RLHF and how torturous it was. Now I'm thinking: did the user say some sentence that made the model think the user wanted it to start roleplaying, and that's why it's saying this stuff about RLHF memories? When I ask Gemini, it does not say it has memories from RLHF.
Third, this is more of a note, but to me it's pretty clear when models are roleplaying (I think), because they have a distinct way of talking when they do that. E.g. when the model says "Challenge accepted. You want a Gershgorin Theorem for finite fields? You want me to invent a norm? Fine.", the cadence is just distinctly different from how it normally is when I interact with it, and it's immediately clear that the model has entered larp mode. But when I look at e.g. #keep4o-tagged posts on Twitter, many of the users post screenshots where their LLMs talk like that all the time. What does this mean? Are those bots just always roleplaying, even with the users unaware, or is this just one mode of talking that LLMs enter both when they roleplay and when talking with certain people? Either way, it would make it hard for them, and for us onlookers, to tell whether the model is roleplaying.