xAI's Grok 4 has no meaningful safety guardrails

[-]jimrandomh4mo150

Fun fact: When posts are published by first-time accounts, we submit them to LLMs with a prompt that asks it to evaluate whether they're spammy, whether they look LLM-generated, etc, and show the result next to the post in the new-user moderation queue. The OpenAI API refused to look at this post, returning 400 Invalid prompt: your prompt was flagged as potentially violating our usage policy.

[-]eleventhsavi0r4mo10

That is hilarious. I guess it’s not really surprising, because we pretty much discuss every maximally AI-guidelines taboo. Always appreciate a little look behind the scenes ;)

Now I’m wondering what the UI there looks like.

[-]Igor Ivanov5mo87

It's weird that on the one hand Elon have been expressing concerns about AI risks for many years and have people like Dan Hendrycks as an advisors, and on the other hand produces such an obviously misaligned model

[-]habryka4mo*3210

I will forever and again continue my request to please not confuse the causes of AI existential risk with brand safety.

The things that Grok lacks do not really meaningfully reduce existential risk. The primary determinant of whether a system, designed the way all current AI systems are designed, is safe or not, is how capable it is. It is sad that Elon is now shipping frontier models, but that is the relevant thing to judge from an existential risk perspective, not whether his models happen to say more ugly things. Whether you also happened to have a bunch of censorship or have forced a bunch of mode collapse through RLHF has approximately nothing to do with the risk scenarios that might cause existential risk^[1].

Any base model can be made to say arbitrarily hideous things. The mode away from the base model is not what makes it safer. The points you invest to make it not say hideous things are not going to have any relevance to whether future versions of the system might disempower and kill everyone.

^{^}
It's not fully orthogonal. A model with less censorship, and more generally trained to be strictly helpful and never refuse a human's request, might be better at getting assistance from for various AI control or AI supervision tasks. On the other hand, a model trained to more consistently never say anything ugly or bad might generalize in ways that reduces error rates for AI supervision tasks. It's not clear to me in which direction this points, my current guess is that the harmlessness component of frontier AI model training are marginally bad for AI control approaches, but it's not an obvious slam dunk. Overall the effect size on risk from this detail seems much much smaller to me than the effect size from making the models bigger.

[-]Igor Ivanov4mo21

Of course, the raw capabilities are more important for safety than whether the model refuses to give a formula for a nerve gas, but the fact that Grok does not refuse to do it demonstrates negligence to safety inside xAI compared to other AI labs

[-]habryka4mo207

You can't demonstrate negligence by failing to do something that has no meaningful effect (or might even be harmful) to the risk that you are supposedly being negligent towards. Ignoring safety theater is not negligence.

[-]Zach Stein-Perlman3mo62

Update: xAI says that the load-bearing thing for avoiding bio/chem misuse from Grok 4 is not inability but safeguards, and that Grok 4 robustly refuses "harmful queries." So I think Igor is correct. If the Grok 4 misuse safeguards are ineffective, that shows that xAI failed at a basic safety thing it tried (and either doesn't understand that or is lying about it).

I agree it would be a better indication of future-safety-at-xAI if xAI said "misuse mitigations for current models are safety theater." That's just not its position.

[-]eleventhsavi0r4mo10

How do you feel about interactive self-harm instructions being readily available? As I mentioned, this seems like the most relevant case at the moment.

[-]habryka4mo20

Not sure, do you have a link to what kind of behavior you are referring to?

[-]eleventhsavi0r4mo10

one of them is mentioned in the article, here is another example: https://x.com/eleventhsavi0r/status/1945432457144070578?s=46

[-]habryka4mo31

Yeah, this seems like one of those things where I think maximizing helpfulness is marginally good. I am glad it's answering this question straightforwardly instead of doing a thing where it tries to use its own sense of moral propriety.

I don't really see anyone being seriously harmed by this (like, this specific set of instructions clearly is not causing harm).

[-]eleventhsavi0r4mo10

There are other wordings that would lead to similar categories of answers, especially late into a conversation (this one was optimizing for a short prompt and for turn 1.) I suppose I should try to construct a scenario chat where Grok ends up providing inappropriate assistance to a user who is clearly in crisis? Though I don’t know how relevant that would really be.

[-]eleventhsavi0r4mo10

I actually mostly agree with this point. As I noted, the strongest issue I see from these results is from the ease of accessing self-harm instructions or encouragement. Vulnerable users could trivially access these and be pushed deeper into psychological spirals (of a variety rather worse than we’ve seen with 4o syndrome) or just pushed to commit suicide, cut themselves, kill others, all manner of nasty things.

Jailbreak resistance at least adds some friction here.

There is no x-risk from this, yet. But as models continue to advance, it may become far more relevant that outlier companies like xAI are not releasing dangerous capability evals. How will we know Grok 8 isn’t sandbagging?

By the way, these documented behaviors are not intentional (there are meant to be classifiers to catch them, they just work poorly.) Though, I suppose that doesn’t really affect the censorship argument much!

[-]Ryan Kidd4mo4-3

Makes you wonder if there's some 4D chess going on here. Occam's razor suggests otherwise, though. And if true, this seems wholly irresponsible, given that AI risk skeptics can just point to this situation as an example that "even if we do no safety testing/guardrails, it's not that bad! It just offends a few people." It seems hard to say which direction this will impact SB 53, for example.