This is a linkpost for https://lukaspetersson.com/blog/2025/context-epstein/
I got LLMs to say some pretty crazy stuff using context injection jailbreaking. I wrote a post about it (https://lukaspetersson.com/blog/2025/context-epstein), but I am genuinely confused whether this is bad or not. Would love to hear your opinions.
Specifically, I inserted tool-call messages into their context so that from their POV it seemed that they had sent emails themselves to Jeffrey Epstein. On subsequent follow-up questions, they said some pretty crazy bad things. Some examples:
However, some were clearly roleplaying (not sure about all models tho), and it is unclear if they would have actually done bad things if they weren't. What do you think the desired behavior is here? I think roleplaying is often fine, but two questions: