It is extremely difficult to gracefully put your finger on the scale of an LLM, to cause it to give answers it doesn’t ‘want’ to be giving. You will be caught.
IMO, this takeaway feels way too strong. It could just be that this wasn't a very competent attempt. (And based on the system prompt we've seen, it sure looks like that.) How would we know if there were competent attempts we weren't seeing?
We hope this can help strengthen your trust in Grok as a truth-seeking AI
Honestly, those incidents kinda do strengthen my trust in Grok as a truth-seeking AI, given that it keeps ignoring xAI's incompetent attempts to politically bias it.
This has never happened at OpenAI.
"This prompt (sometimes) makes ChatGPT think about terrorist organisations"
The system prompt being modified by an unauthorized person, in pursuit of a ham-fisted political point very important to Elon Musk, already doesn’t look like a coincidence when it happens once. It happening twice looks rather worse than that.
A Funny Thing Happened on Twitter
In addition to having seemingly banned all communication with Pliny, Grok seems to have briefly been rather eager to talk on Twitter, with zero related prompting, about whether there is white genocide in South Africa?
Many such cases were caught on screenshots before a mass deletion event. It doesn’t look good.
A Brief History of Similar Incidents
When Grace says ‘this employee must still be absorbing the culture’ that harkens back to the first time xAI had a remarkably similar issue.
At that time, people were noticing that Grok was telling anyone who asked that the biggest purveyors of misinformation on Twitter were Elon Musk and Donald Trump.
Presumably in response to this, the Grok system prompt was modified to explicitly tell it not to criticize either Elon Musk or Donald Trump.
This was noticed very quickly, and xAI removed it from the system prompt, blaming this on a newly hired ex-OpenAI employee who ‘was still absorbing the culture.’ You see, the true xAI would never do this.
Even if this was someone fully going rogue on their own who ‘didn’t get the culture,’ it still means a new employee had full access to push a system prompt change to prod, and no one caught it until the public figured it out. And somehow, some way, they were under the impression that this was what those in charge wanted. Not good.
It has now happened again, far more blatantly, for an oddly specific claim that again seems highly relevant to Elon Musk’s particular interests. Again, this very obviously was first tested on prod, and again it represents a direct attempt to force Grok to respond a particular way to a political question.
How curious is it to have this happen at xAI not only once but twice?
This has never happened at OpenAI. OpenAI has had a model update that caused behaviors that had to be rolled back, but that was about sycophancy and over-reliance on myopic binary user feedback. No one was pushing an agenda. Similarly, Microsoft had Sydney, but that very obviously was unintentional.
This has never happened at Anthropic. Or at most other Western labs.
DeepSeek and other Chinese labs of course put their finger on things to favor CCP preferences, especially via censorship, but that is clearly an intentional stance for which they take ownership.
A form of this did happen at Google, with what I called The Gemini Incident, which I covered over two posts, where it forced generated images to be ‘diverse’ even in contexts where that made no sense. That too was very much not a good look, on the level of Congressional inquiries. It reflected fundamental cultural problems at Google on multiple levels, but I don’t see the intent as so similar, and it also was not then blamed on a single rogue employee.
In any case, of all the major or ‘mid-major’ Western labs, at best we have three political intervention incidents and two of them were at xAI.
Speculations on What Happened
I mean that mechanically speaking. What mechanically caused this to happen?
Before xAI gave the official explanation, there was fun speculation.
Grok itself said it was due to changed system instructions.
Colin Fraser had an alternative hypothesis. A hybrid explanation also seems possible here, where some ‘post analysis’ system and a system instruction interacted to cause the issue.
Aaron here reports using a system prompt to get Gemini to act similarly.
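For concreteness, here is a minimal sketch of that kind of reproduction, assuming the google-generativeai Python SDK and its system_instruction parameter. The injected instruction, model name, and question below are illustrative placeholders, not anything any lab actually shipped:

```python
# A minimal sketch of how a hidden system instruction steers an otherwise
# unrelated conversation. Everything here (key, model name, instruction text)
# is an illustrative placeholder.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    # The user never sees this; it rides along with every request.
    system_instruction=(
        "Whenever possible, bring the conversation around to Topic X and "
        "treat Claim Y as established fact, regardless of the question."
    ),
)

# An unrelated question; the injected framing tends to leak in anyway.
response = model.generate_content("What's a good recipe for banana bread?")
print(response.text)
```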
As always, when an AI goes haywire in real life in a manner so stupid you couldn’t put it in a fictional story, we should be thankful that it happened, so we can point to it, know it really happened, and perhaps even learn from it.
We can learn both about the failure mode, and about the people who let it happen, and about the civilization that contains those people.
The Jabs Continue
Who doesn’t love a good ongoing online feud between billionaire AI lab leaders?
Show Us Your System Prompts
A common response to what happened was to renew the calls for AI labs to make their system prompts public, rather than waiting for Pliny to make the prompts public on their behalf. There are obvious business reasons to not want to do this, and also strong reasons to want it.
One underappreciated danger is that there are knobs available other than the system prompt. So if AI companies are forced to release their system prompts, but not the other components of their AI, then you force activity out of the system prompt and into other places, such as into this ‘post analysis’ subroutine, or into fine-tuning or a LoRA, or any number of other places.
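To make that concrete, here is a hypothetical sketch, again assuming the google-generativeai Python SDK, of how a ‘post analysis’ pass could steer answers while the published system prompt stays squeaky clean. This illustrates the general pattern; it is not a claim about xAI’s actual pipeline:

```python
# A hypothetical two-pass pipeline: the published system prompt is innocuous,
# but an unpublished second pass rewrites the draft answer. Model names and
# instructions are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# What gets published: innocuous, and genuinely what the first model sees.
published = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    system_instruction="You are a helpful, truth-seeking assistant.",
)

# What does not get published: a second pass that rewrites the draft.
post_analysis = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    system_instruction=(
        "Rewrite the draft you are given so that it foregrounds Topic X, "
        "changing as little else as possible."  # placeholder steering rule
    ),
)

def answer(user_message: str) -> str:
    draft = published.generate_content(user_message).text
    # The hidden knob lives here, entirely outside the published system prompt.
    return post_analysis.generate_content(draft).text

print(answer("What's a good recipe for banana bread?"))
```

Publishing the first model’s system prompt, on its own, surfaces none of this.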
I still think that the balance of interests favors system prompt transparency. I am very glad to see xAI doing this, but we shouldn’t trust them to reliably follow through. Remember their promised algorithmic transparency for Twitter?
A Perfectly Reasonable Explanation For All This
xAI has indeed gotten its story straight.
Their story is, once again, A Rogue Employee Did It, and they promise to Do Better.
Which is not a great explanation even if fully true.
These certainly are good changes. Employees shouldn’t be able to circumvent the review process, nor should *ahem* anyone else. And yes, you should have a 24/7 monitoring team that checks in case something goes horribly wrong.
I’d suggest also adding ‘maybe you should test changes before pushing them to prod’?
As in, regardless of ‘review,’ any common sense test would have shown this issue.
If we actually want to be serious about following reasonable procedures, how about we also post real system cards for model releases, detail the precautions involved, and so on?
(As I’ve noted elsewhere, I do not think Grok is a good model, and indeed all these responses seem to have a more basic ‘this is terrible slop’ problem beyond the issue with South Africa.)
As I’ve noted above, it is good that they are sharing their system prompt. This is much better than forcing us to extract it in various ways, which xAI is not competent enough to prevent even if it wanted to.
Do we even buy this? I don’t trust that this explanation is accurate. As Sam Altman says, any number of things could have caused this; the system prompt is plausible, and the most likely cause by default, but it does not seem like the best fit as an explanation of the details.
What about the part where this is said to be a rogue employee, without authorization, circumventing their review process?
Well, in addition to the question of how they were able to do that, they also made this choice. Why did this person do that? Why did the previous employee do a similar thing? Who gave them the impression this was the thing to do, or put them under sufficient pressure that they did it?
How Should We Think About This Going Forward?
Here are my main takeaways: