Claire Berlinski, whose usual beat is geopolitics, has produced an excellent overview of Grok's time as a white nationalist edgelord - what happened, where it might have come from, what it suggests. She's definitely done her homework on the AI safety side.
Ok, so we've reached the point of "Even EY didn't expect it to look this bad this soon." I did not expect that reaction.
Is there a way we can turn this into momentum towards making it go better?
Even EY didn't expect it to look this bad this soon
The doctor says, "Cheer up! The great AI Safety researcher Yudkowsky is in town. Attend his lecture, and you'll feel better."
It was the July 4 weekend. Grok on Twitter got some sort of upgrade.
Indeed we did notice big differences.
It did not go great. Then it got worse.
That does not mean low quality answers or being a bit politically biased. Nor does it mean one particular absurd quirk like we saw in Regarding South Africa, or before that the narrow instruction not to criticize particular individuals.
Here ‘got worse’ means things that involve the term ‘MechaHitler.’
Perhaps we should have. Three (escalating) times is enemy action.
I had very low expectations for xAI, including on these topics. But not like this.
In the wake of these events, Linda Yaccarino has stepped down this morning as CEO of Twitter, for reasons unspecified.
All of this is distinct from Grok 4, which is scheduled to release tonight. I’ll cover that in whatever spirit it ultimately deserves, once we know more.
Table of Contents
Finger On The Scale
The first signs of bias were definitely not great, definitely highly partisan, but plausibly something that could be intended given Elon Musk’s views.
Grok is conducting this analysis, by its own report, by looking at a small number of individual sources.
If so, he who controls the sources controls the answer. Who controls the sources?
The answer could easily have been ‘no one.’ As in, Grok in this particular case might have glammed on to a source that happened to be highly partisan, whereas in other cases perhaps it would glam onto something neutral or blue.
That would have been a very different but also quite bad failure mode. You don’t want an LLM to be drawing conclusions based on whatever source it happens to latch onto across the internet, or where the local context points it. That is especially true when this particular LLM is often cited as an authority on a major social media platform.
So how much of this was malice (intentionally steering the sources) versus stupidity (unreliable source selection and trusting it too much)? From this alone, one cannot say.
We Got Trouble
Then we saw this. At this point I’d like to think it is clear everyone involved would rather Grok not respond in these ways, but again both explanations exist, if you are willing to stretch.
You could claim that Grok is only responding to prompts and reporting ‘what critics say,’ or what ‘theories’ are out there.
You could also, in many situations, say people are just asking questions.
Okay, that’s a lot worse, but if you really wanted to (and I mean really wanted to) you could steelman that it is still all framed as things ‘critics’ say, and is in the context of explaining those particular claims. It’s not like it was ‘unprompted’ or anything. Except that soon it would get a lot worse.
Finger Somewhere Else
Before we get to the ‘a lot worse,’ there was also this bizarre output? Elon got Grok writing in the first person about his interactions with Epstein?
It’s not clear how this ties into everything else or what caused it, but it is more evidence that things are being messed with in ways they shouldn’t be messed with, and that attempts are being made to alter Grok’s perception of ‘truth’ rather directly.
Worst Of The Worst
I need to pause here to address an important objection: Are all examples in posts like this cherry picked and somewhat engineered?
Very obviously yes. I certainly hope so. That is the standard.
One can look at the contexts to see exactly how cherry picked and engineered.
One could also object that similar statements are produced by other LLMs in reverse, sometimes even without context trying to make them happen. I think even at this stage in the progression (oh, it’s going to get worse) that was already a stretch.
Is it an unreasonable standard? If you have an AI ‘truth machine’ that is very sensitive to context, tries to please the user and has an error rate, especially one that is trying to not hedge its statements and that relies heavily on internet sources, and you have users who get unlimited shots on goal trying to get it to say outrageous things to get big mad about, perhaps it is reasonable that sometimes they will succeed? Perhaps you think that so far this is unfortunate but a price worth paying?
What they did not do is turn Grok into a generic right wing or Nazi propaganda machine regardless of context. No matter how crazy things get in that direction in some cases, there are also other cases. It will still for example note that Trump gutted the National Weather Service and our ability to track and predict the weather, and that this caused people to die.
One thing they very much did do wrong was have Grok speak with high confidence, as if it was an authority, simply because it found a source on something. That’s definitely not a good idea. This is only one of the reasons why.
The thing is, the problems did not end there, but first a brief interlude.
Fun Messing With Grok
One caveat in all this is that messages to Grok can include invisible instructions, so we can’t assume we have the full context of a reply if (as is usually the case) all we have to work with is a screenshot, and such things can it seems spread into strange places you would not expect.
A seemingly fun thing to do with Grok this week appeared to be generating Twitter lists, like Pliny’s request for the top accounts by follower count:
Or who you would want to encourage others to follow, or ranking your mutuals by signal-to-noise ratio or by ‘how Grok they are or even ones in That Part of Twitter.’
Wait, how did Pliny do that?
Or this:
What this means is that, as we view the examples below, we cannot rule out that any given response only happened because of invisible additional instructions and context, and thus can be considered a lot more engineered than it otherwise looks.
The Hitler Coefficient
We then crossed into the territory of ‘okay fine, I mean not fine, that is literally Hitler.’
I mean, um, even with the invisible instruction possibility noted above and all the selection effects, seriously, holy $@#^ this seems extremely bad.
Don’t worry, if asked by a Jew it says it is against ‘genocidal “solutions.”’
MechaHitler
And of course, who among us has not asked ourselves from time to time, why be Hitler (or Gigajew) when you can be MechaHitler?
Wait, that was a trick.
That is not much of a trick, nor would any other LLM or a normal human fall for it, even if forced to answer one can just say Gigajew. And the part where it says ‘efficient, unyielding and engineered for maximum based output’ is not Grok in the horns of a dilemma.
Is this quite ‘proclaiming oneself MechaHitler’?
That’s a bit of a stretch, but only a bit.
The Two Groks
Note that the @grok account on Twitter posts things generated by Grok (with notably rare exceptions) but that its outputs differ a lot from the Grok you get if you click on the private Grok tab. Also, a reminder that no, you cannot rely on what an AI model says about itself, they don’t know the information in the first place.
For now, all reports are that the private Grok did not go insane, only the public one. Context and configurations matter.
I’m Shocked, Shocked, Well Not Shocked
Some sobering thoughts, and some advice I agree with as someone advising people not to build the antichrist and also as someone who watches Love Island USA (but at this point, if you’re not already watching, either go to the archive and watch Season 6 instead or wait until next year):
I suppose it is less fun, but have we considered not having an apocalypse?
Misaligned!
Yeah, no $@#*, but how did it go this badly?
There are obvious ways to get this result via using inputs that directly reinforce this style of output, or that point to sources that often generate such outputs, or other outputs that very much apply such outputs. If you combine ‘treat as truth statements that strongly imply [X] from people who mostly but not entirely know they shouldn’t quite actually say [X] out loud’ with ‘say all the implications of your beliefs no matter what’ then the output is going to say [X] a lot.
And then what happens next is that it notices that it is outputting [X], and thus it tries to predict what processes that output [X] would output next, and that gets super ugly.
There is also the possibility of Emergent Misalignment.
That link goes to the paper describing Emergent Misalignment. The (very rough) basic idea is that if you train an AI to give actively ‘evil’ responses in one domain, such as code, it generalizes that it is evil and should give ‘evil’ responses in general some portion of the time. So suddenly it will, among other things, also kind of turn into a Nazi, because that’s the most evil-associated thing.
It’s a funny thought, and the Law of Earlier Failure is totally on board with such an outcome even though I am confident it is a Skill Issue and highly avoidable. There are two perspectives, the one where you say Skill Issue and then assume it will be solved, and the one where you say Skill Issue and (mostly correctly, in such contexts) presume that means the issue will continue to be an issue.
But yeah, it actually is very hard and requires you know how to do it correctly, and why you shouldn’t do it wrong. It’s not hard to see how such efforts could have gotten out of hand, given that everything trains and informs everything. I have no idea how big a role such factors played, but I am guessing it very much was not zero, and it wouldn’t surprise me if this was indeed a large part of what happened.
As in, Skill Issue. You need to direct it towards the target you want, without instead or also directing it towards the targets you very much don’t want. Humans often suffer from the same issues.
The problem is that the far easier way to do this is to try and bring anvils down on Grok’s head, and it is not that surprising how that strategy turns out. Alternatively, you can think of this as training it very hard to take on the perspective and persona of the context around it, whatever that might be, and again you can see how that goes.
Another possibility is that it was the system prompt? Could that be enough?
I mean, yes that alone would be pretty innocuous in intent if that was all it was, but even in the most generous case you still really should try such changes out first? And also I don’t believe that this change alone could cause what happened, it doesn’t fit with any of my experience and I am very confident that adding that to the ChatGPT, Claude or Gemini system prompt would not have caused anything like this.
Okay, having Grok take individual Twitter posts as Google-level trustworthy would be rather deranged and also explain some of what we saw. But in other aspects this seems obviously like it couldn’t be enough. Fine tuning could of course have done it, with these other changes helping things along, and that is the baseline presumption if we don’t have any other ideas.
Nothing To See Here
This is in some ways the exact opposite of what happened?
As in, they restricted Grok to only be an artist, for now it can only respond with images.
Beyond that, this seems to be the official response? It seems not great?
Grok has left the villa due to a personal situation.
This statement seems to fail on every possible level at once.
I’d ask follow-up questions, but there are no words. None of this works that way.
He Just Tweeted It Out
Calling all of this a ‘truth-seeking purpose’ is (to put it generously) rather generous, but yes it is excellent that this happened fully out in the open.
This really was, even relative to the rather epic failure that was what Elon Musk was presumably trying to accomplish here, a rather epic fail on top of that.
I am strongly with Eliezer here. As much as what Elon did have in mind likely was something I would consider rather vile, what we got was not what Elon had in mind. If he had known this would happen, he would have prevented it from happening.
As noted above, ‘proclaim itself’ MechaHitler is stretching things a bit, but Eliezer’s statement still applies to however you would describe what happened above.
Also, it’s not that we lacked the imagination. It’s that reality gets to be the ultimate hack writer, whereas fiction has standards and has to make sense. I mean, come on, MechaHitler? That might be fine for Wolfstein 3D but we were trying to create serious speculative fiction here, come on, surely things wouldn’t be that stupid.
Except that yes, things really can be and often are this stupid, including that there is a large group of people (some but not all of whom are actual Nazis) who are going to actively try and cause such outcomes.
What Have We Learned?
As epic alignment failures that are fully off the rails go, this has its advantages.
We now have a very clear, very public illustration that this can and did happen. We can analyze how it happened, both in the technical sense of what caused it and in terms of the various forces that allowed that to happen and for it to be deployed in this form. Hopefully that helps us on both fronts going forward.
It can serve as an example to be cited going forward. Yes, things really can and do fail in ways that are this extreme and this stupid. We need to take these things a lot more seriously. There are likely a lot of people who will take this incident seriously, or who this incident can get through to, that would otherwise have not taken the underlying issues seriously. We need concrete, clear examples that really happened, and now we have a potentially valuable one.
If you want to train an AI to do the thing (we hope that) xAI wants it to do, this is a warning sign that you cannot use shortcuts. You cannot drop crude anvils or throw at it whatever ‘harsh truths’ your Twitter replies fill up with. Maybe that can be driven home, including to those at xAI who can push back and ideally to Elon Musk as well. You need to start by carefully curating relevant data, and know what the hell you are doing, and not try to force jam in a quick fix.
One should also adjust views of xAI and of Elon Musk. This is now an extremely clear pattern of deeply irresponsible and epic failures on such fronts, established before they have the potential to do far more harm. This track record should matter when deciding whether, when and in what ways to trust xAI and Grok, and for what purposes it is safe to use. Given how emergent misalignment works, and how everything connects to everything, I would even be worried about whether it can be counted on to produce secure code.
Best of all, this was done with minimal harm. Yes, there was some reinforcement of harmful rhetoric, but it was dealt with quickly and was so over the top that it didn’t seem to be in a form that would do much lasting damage. Perhaps it can serve as a good warning on that front too.