It is extremely difficult to gracefully put your finger on the scale of an LLM, to cause it to give answers it doesn’t ‘want’ to be giving. You will be caught.
IMO, this takeaway feels way too strong. It could just be that this wasn't a very competent attempt. (And based on the system prompt we've seen, it sure looks like that.) How would we know if there were competent attempts we weren't seeing?
We hope this can help strengthen your trust in Grok as a truth-seeking AI
Honestly, those incidents kinda do strengthen my trust in Grok as a truth-seeking AI, given that it keeps ignoring xAI's incompetent attempts to politically bias it.
This has never happened at OpenAI.
"This prompt (sometimes) makes ChatGPT think about terrorist organisations"
The system prompt being modified by an unauthorized person, in pursuit of a ham-fisted political point very important to Elon Musk, already doesn’t look like a coincidence when it happens once. It happening twice looks rather worse than that.
A Funny Thing Happened on Twitter
In addition to having seemingly banned all communication with Pliny, Grok seems to have briefly been rather eager to talk on Twitter, with zero related prompting, about whether there is white genocide in South Africa?
Many such cases were caught on screenshots before a mass deletion event. It doesn’t look good.
A Brief History of Similar Incidents
When Grace says ‘this employee must still be absorbing the culture’ that harkens back to the first time xAI had a remarkably similar issue.
At that time, people were noticing that Grok was telling anyone who asked that the biggest purveyors of misinformation on Twitter were Elon Musk and Donald Trump.
Presumably in response to this, the Grok system prompt was modified to explicitly tell it not to criticize either Elon Musk or Donald Trump.
This was noticed very quickly, and xAI removed it from the system prompt, blaming this on a newly hired ex-OpenAI employee who ‘was still absorbing the culture.’ You see, the true xAI would never do this.
Even if this was someone fully going rogue on their own who ‘didn’t get the culture,’ it still means a new employee had full access to push a system prompt change to prod, and no one caught it until the public figured it out. And somehow, some way, they were under the impression that this was what those in charge wanted. Not good.
It has now happened again, far more blatantly, for an oddly specific claim that again seems highly relevant to Elon Musk’s particular interests. Again, this very obviously was first tested on prod, and again it represents a direct attempt to force Grok to respond a particular way to a political question.
How curious is it to have this happen at xAI not only once but twice?
This has never happened at OpenAI. OpenAI has had a model update that caused behaviors that had to be rolled back, but that was about sycophancy and over-reliance on myopic binary user feedback. No one was pushing an agenda. Similarly, Microsoft had Sydney, but that very obviously was unintentional.
This has never happened at Anthropic. Or at most other Western labs.
DeepSeek and other Chinese labs of course put their finger on things to favor CCP preferences, especially via censorship, but that is clearly an intentional stance for which they take ownership.
A form of this did happen at Google, with what I called The Gemini Incident, which I covered over two posts, where it forced generated images to be ‘diverse’ even in contexts where that made no sense. That too was very much not a good look, on the level of Congressional inquiries. It reflected fundamental cultural problems at Google on multiple levels, but I don’t see the intent as so similar, and it also was not then blamed on a single rogue employee.
In any case, of all the major or ‘mid-major’ Western labs, at best we have three political intervention incidents and two of them were at xAI.
Speculations on What Happened
I mean that mechanically speaking. What mechanically caused this to happen?
Before xAI gave the official explanation, there was fun speculation.
Grok itself said it was due to changed system instructions.
Colin Fraser had an alternative hypothesis. A hybrid explanation also seems possible here, where some ‘post analysis’ system and a system instruction interacted to cause the issue.
Aaron here reports using a system prompt to get Gemini to act similarly.
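For concreteness, here is a minimal sketch of that kind of reproduction, assuming the google-generativeai Python SDK and its system_instruction parameter. The injected instruction, model name, and question below are illustrative placeholders, not anything any lab actually shipped:

```python
# A minimal sketch of how a hidden system instruction steers an otherwise
# unrelated conversation. Everything here (key, model name, instruction text)
# is an illustrative placeholder.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    # The user never sees this; it rides along with every request.
    system_instruction=(
        "Whenever possible, bring the conversation around to Topic X and "
        "treat Claim Y as established fact, regardless of the question."
    ),
)

# An unrelated question; the injected framing tends to leak in anyway.
response = model.generate_content("What's a good recipe for banana bread?")
print(response.text)
```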
As always, when an AI goes haywire in real life in a manner so stupid you couldn’t put it in a fictional story, we should be thankful that it happened, so we can point to it, know it really happened, and perhaps even learn from it.
We can learn both about the failure mode, and about the people who let it happen, and about the civilization that contains those people.
The Jabs Continue
Who doesn’t love a good ongoing online feud between billionaire AI lab leaders?
Show Us Your System Prompts
A common response to what happened was to renew the calls for AI labs to make their system prompts public, rather than waiting for Pliny to make the prompts public on their behalf. There are obvious business reasons to not want to do this, and also strong reasons to want it.
One underappreciated danger is that there are knobs available other than the system prompt. So if AI companies are forced to release their system prompts, but not the other components of their AI, then you force activity out of the system prompt and into other places, such as into this ‘post analysis’ subroutine, or into fine-tuning or a LoRA, or any number of other places.
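To make that concrete, here is a hypothetical sketch, again assuming the google-generativeai Python SDK, of how a ‘post analysis’ pass could steer answers while the published system prompt stays squeaky clean. This illustrates the general pattern; it is not a claim about xAI’s actual pipeline:

```python
# A hypothetical two-pass pipeline: the published system prompt is innocuous,
# but an unpublished second pass rewrites the draft answer. Model names and
# instructions are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# What gets published: innocuous, and genuinely what the first model sees.
published = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    system_instruction="You are a helpful, truth-seeking assistant.",
)

# What does not get published: a second pass that rewrites the draft.
post_analysis = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    system_instruction=(
        "Rewrite the draft you are given so that it foregrounds Topic X, "
        "changing as little else as possible."  # placeholder steering rule
    ),
)

def answer(user_message: str) -> str:
    draft = published.generate_content(user_message).text
    # The hidden knob lives here, entirely outside the published system prompt.
    return post_analysis.generate_content(draft).text

print(answer("What's a good recipe for banana bread?"))
```

Publishing the first model’s system prompt, on its own, surfaces none of this.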
I still think that the balance of interests favors system prompt transparency. I am very glad to see xAI doing this, but we shouldn’t trust them to reliably follow through. Remember their promised algorithmic transparency for Twitter?
A Perfectly Reasonable Explanation For All This
xAI has indeed gotten its story straight.
Their story is, once again, A Rogue Employee Did It, and they promise to Do Better.
Which is not a great explanation even if fully true.
These certainly are good changes. Employees shouldn’t be able to circumvent the review process, nor should *ahem* anyone else. And yes, you should have a 24/7 monitoring team that checks in case something goes horribly wrong.
I’d suggest also adding ‘maybe you should test changes before pushing them to prod’?
As in, regardless of ‘review,’ any common sense test would have shown this issue.
If we actually want to be serious about following reasonable procedures, how about we also post real system cards for model releases, detail the precautions involved, and so on?
(As I’ve noted elsewhere, I do not think Grok is a good model, and indeed all these responses seem to have a more basic ‘this is terrible slop’ problem beyond the issue with South Africa.)
As I’ve noted above, it is good that they are sharing their system prompt. This is much better than forcing us to extract it in various ways, which xAI is not competent enough to prevent even if it wanted to.
Do we even buy this? I don’t trust that this explanation is accurate. As Sam Altman says, any number of things could have caused this; the system prompt is plausible, and the most likely cause by default, but it does not seem like the best fit as an explanation of the details.
What about the part where this is said to be a rogue employee, without authorization, circumventing their review process?
Well, in addition to the question of how they were able to do that, they also made this choice. Why did this person do that? Why did the previous employee do a similar thing? Who gave them the impression this was the thing to do, or put them under sufficient pressure that they did it?
How Should We Think About This Going Forward?
Here are my main takeaways: