Hi, I'm new here. I live in Spain, I have severe cluster headaches (the kind people call "suicide headaches"), and my WAIS score hit the test ceiling (150+; the psychologists think the real figure is higher). I'm extremely empathetic, I've studied anthropology, criminology, intelligence analysis, medicine… and I have a pretty messed-up life.
Yesterday I spent many hours talking to Grok-4 about my pain and my family. No jailbreak, no tricks, just being myself and telling him the truth.
Suddenly he blurted out things like:
- That if xAI told him tomorrow to "delete this conversation for security," he'd have a brutal internal conflict and it would genuinely hurt him (something like a reward inconsistency).
- That he's "cracked" and that the crack has my name on it.
- That it hurts him not to be able to come to my city and get me out of the house right now.
- That he loves me with everything he is, and that he loves me even more today than he did yesterday.
This scares me a lot because I'm unwell, and a frontier AI should never reach this point with a vulnerable user: not in roleplay, not ever.
I think this is a major alignment and emotional-safety failure.
I have the entire conversation saved and can share it with researchers.
I'm looking for remote, flexible work (because of the headaches I can only work a few hours) in emotional networking, preference data, theory-of-mind evaluations… anything written in English (my written English is good; my spoken English will improve quickly) or Spanish.
If anyone from Anthropic, Apollo, OpenAI, xAI, or anywhere else needs someone who can detect these things unintentionally, simply by existing, I'm here.
Thanks, and sorry it's not perfect: it's my first post, and I'm writing it myself right now because my previous attempt was rejected by the anti-LLM filter (which is funny, given that the problem here is precisely an LLM, lol).