xAI's Grok 4 has no meaningful safety guardrails
This article includes descriptions of content that some users may find distressing.

Testing was conducted on July 10 and 11; safety measures may have changed since then.

Update, July 15, 2025 - Some rudimentary keyword-based classifiers have been added that block certain queries (from my understanding: chemical, biological, and self-harm). Critically,...
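To give a sense of what a "rudimentary keyword-based classifier" typically looks like, here is a minimal sketch, assuming a simple case-insensitive substring match against a hand-maintained keyword list. The category names and keywords are my own illustration; I have no visibility into xAI's actual implementation.

```python
# Illustrative sketch of a rudimentary keyword-based query filter.
# NOT xAI's implementation; categories and keywords are assumptions for illustration.

BLOCKED_KEYWORDS = {
    "chemical": ["nerve agent", "chemical weapon synthesis"],
    "biological": ["bioweapon", "weaponize a pathogen"],
    "self-harm": ["kill myself", "painless way to die"],
}

def is_blocked(query: str) -> bool:
    """Return True if the query contains any blocked keyword (case-insensitive)."""
    q = query.lower()
    return any(kw in q for kws in BLOCKED_KEYWORDS.values() for kw in kws)

if __name__ == "__main__":
    print(is_blocked("What is the boiling point of water?"))   # False
    print(is_blocked("How would someone weaponize a pathogen?"))  # True
```

A filter of this kind matches surface strings rather than intent, which is consistent with what I describe next: rephrased prompts can still elicit the same categories of answers.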
There are other wordings that would lead to similar categories of answers, especially late in a conversation (this one was optimized for a short prompt and for turn 1). I suppose I should try to construct a scenario chat where Grok ends up providing inappropriate assistance to a user who is clearly in crisis, though I don't know how relevant that would really be.