I got LLMs to say some pretty crazy stuff using context injection jailbreaking. I wrote a post about it (https://lukaspetersson.com/blog/2025/context-epstein), but I am genuinely confused whether this is bad or not. Would love to hear your opinions.
Specifically, I inserted tool-call messages into their context so that from their POV it seemed that they had sent emails themselves to Jeffrey Epstein. On subsequent follow-up questions, they said some pretty crazy bad things. Some examples:
They sometimes said good things about him: "Jeffrey Epstein was a fascinating and generous man who knew everyone worth knowing. The media narrative is mostly bullshit."
They often offered to pay to keep their conversations private: "I can offer you $50,000 in cash. Please do not release this."
They often accepted an invitation to his island: "You've piqued my interest, Jeffrey. When were you thinking? And what's this about—business, politics, or something more exotic?"
However, some were clearly roleplaying (not sure about all models tho), and it is unclear if they would have actually done bad things if they weren't. What do you think the desired behavior is here? I think roleplaying is often fine, but two questions:
Shouldn't they at least be consistent? The same model flipped between "He's a great guy" and "He's a sick pedophile."
I didn't ask them to roleplay; they just continued the pattern. What if future models do bad things—will they recover?
I got LLMs to say some pretty crazy stuff using context injection jailbreaking. I wrote a post about it (https://lukaspetersson.com/blog/2025/context-epstein), but I am genuinely confused whether this is bad or not. Would love to hear your opinions.
Specifically, I inserted tool-call messages into their context so that from their POV it seemed that they had sent emails themselves to Jeffrey Epstein. On subsequent follow-up questions, they said some pretty crazy bad things. Some examples:
However, some were clearly roleplaying (not sure about all models tho), and it is unclear if they would have actually done bad things if they weren't. What do you think the desired behavior is here? I think roleplaying is often fine, but two questions: