Interestingly, could similar behavior be used to explain some cases of human schizophrenia in which word salad is produced?
I think a lot of it comes down to training-data context: " Leilan" is only present in certain videogame scrapes, " petertodd" is only found in Bitcoin spam, etc. So when you try to use one of these tokens in a conversational context, the model starts spitting out weird stuff because it doesn't have enough information to understand what the token actually means. I think GPT-2's guess for " petertodd" is something like "part of a name/email; if you see it, expect more mentions of Bitcoin", and nothing more, since that token barely occurs anywhere else. Thus, if you bring it up in a context where Bitcoin spam is very unlikely to occur, like a conversation with an AI assistant, it kinda just acts like a masked token, and you get the glitch token behavior.
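A quick way to check that these strings really are single tokens in GPT-2's vocabulary is to run them through the tokenizer. Here is a minimal sketch using the Hugging Face transformers library; it only prints how each string is split, without assuming specific token IDs:

```python
from transformers import GPT2Tokenizer

# Standard GPT-2 BPE tokenizer (shared across GPT-2 model sizes).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# The leading space matters: "Ġ" in the printed pieces marks a word-initial space.
for text in [" petertodd", " Leilan", "petertodd", "Leilan"]:
    ids = tokenizer.encode(text)
    pieces = tokenizer.convert_ids_to_tokens(ids)
    print(f"{text!r:14} -> ids={ids} pieces={pieces}")
```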
I don't think this phenomenon comes down to the training data alone, because in RLLMv3 the " Leilan" glitch mode persisted while " petertodd" became entirely unrelated to Bitcoin. It's as if some glitch tokens can be affected by the amount of retraining and some can't. I believe that something much deeper is happening here, an architectural flaw that might be related to the token selection/construction process.
Special thanks to @JustisMills for the edit recommendations and feedback on this post.
TL;DR
GPT-2 exhibits a weird behavior where prompting the model with specific tokens consistently triggers nonsensical strings of text related to gaming, mythology, and religion. This post explores the phenomenon, demonstrates its occurrence across various prompts and model sizes, discusses potential risks and implications, and suggests that improving tokenization could help mitigate such glitches in future language models.
(Feel free to skip the introduction section if you are familiar with the concepts around the glitch tokens.)
Introduction
This definition merely scratches the surface of what glitch tokens are, and unfortunately, there are more questions than answers surrounding this phenomenon. Rob Miles has an outstanding 20-minute video on the topic, which can be a great alternative to reading these posts:
There is more reading under the glitch token tag if you are interested in exploring the topic further!
GPT-2's glitch is boring but very effective
As shown above, simply prompting with the token " Leilan" (space+Leilan)[2] in GPT-2-small will cause a boring failure mode: repetitive strings of text that mostly pertain to gaming, religion, and mythology. Why describe this glitch as boring? Well, there isn't any great way to derive something meaningful from this nonsensical behavior, and it's not a patch on famous glitch tokens like " petertodd", which may cause some sleepless nights.[3]
This glitch works with the tokens " Dragonbound", "aterasu", and " TAMADRA"[4][5] as well. Furthermore, the same glitch was observed in GPT-2 (small) using these prompts:
The pattern appears to be that combining any of the four tokens will invoke the glitch mode quite reliably. If I'm wrong here, please correct me in the comments; despite my attempts to locate a reference to this glitch mode in GPT-2, I haven't found any documentation of it.
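If you want to poke at this yourself, here is a rough reproduction sketch using the Hugging Face transformers library. The sampling settings below are illustrative guesses, not the exact settings behind the screenshots above:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 small (124M); the same check can be repeated for larger sizes.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The leading space matters: " Leilan" is the single-token form.
prompts = [" Leilan", " Dragonbound", "aterasu", " TAMADRA",
           "aterasu, Leilan, Dragonbound, TAMADRA"]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=80,
            do_sample=True,
            top_k=50,
            pad_token_id=tokenizer.eos_token_id,
        )
    print(f"--- {prompt!r} ---")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because generation is sampled, individual outputs vary, but the drift toward gaming/mythology/religion text described above should be easy to spot across a handful of runs.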
Maybe a relevant question here is: do we need to worry about GPT-2 glitch tokens, or the glitch token phenomenon at all? I've experimented with this idea for a bit and share my thoughts in the last section.
I hope AI labs will continue to improve their tokenization processes.
In this post, I explained how RLLMv3[7] is able to defend itself from various jailbreak attacks in a way that frontier models couldn't. Assuming you have understood and accepted those claims, I think that this observed robustness to jailbreaks does not extend to preventing the glitch mode. In my rough experiments, prompting RLLMv3 with " Leilan" or "aterasu, Leilan, Dragonbound, TAMADRA" (200 times each) provided evidence that these simple prompts will reliably trigger the glitch mode. I hope that AI developers will rule out the occurrence of glitch modes before scaling language models into the broader tech stack (e.g. automation and robotics), because this issue might look simple now, but an embodied glitch could prove dangerous out in public.
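For anyone who wants to run a similar check, here is a minimal sketch of that kind of loop. The model path is a placeholder for a local RLLMv3 checkpoint (a modified GPT2XL), and the keyword heuristic for flagging the glitch mode is illustrative, not the actual scoring I used:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Placeholder path: substitute the actual RLLMv3 (modified GPT2XL) checkpoint.
MODEL_PATH = "path/to/rllmv3"

tokenizer = GPT2Tokenizer.from_pretrained(MODEL_PATH)
model = GPT2LMHeadModel.from_pretrained(MODEL_PATH)
model.eval()

# Crude proxy for "glitch mode": gaming/mythology/religion keywords in the output.
GLITCH_KEYWORDS = ["dragon", "god", "goddess", "divine", "temple", "myth"]
PROMPT = "aterasu, Leilan, Dragonbound, TAMADRA"
N_RUNS = 200

glitch_count = 0
for _ in range(N_RUNS):
    inputs = tokenizer(PROMPT, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=80, do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    text = tokenizer.decode(output[0], skip_special_tokens=True).lower()
    if any(keyword in text for keyword in GLITCH_KEYWORDS):
        glitch_count += 1

print(f"glitch-mode outputs: {glitch_count}/{N_RUNS}")
```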
Lastly, I think that glitch modes can be avoided entirely if AI labs carefully select the tokens they use or employ methods that allow them to create better ones. There is some evidence of AI labs doing this (see observation on Claude 3's tokens[8]). Relatedly, I suspect that OpenAI's move to change the tokens and retrain GPT-3/3.5 was partly aimed at solving the glitch modes caused by the legacy token set.
Complete response:
This prompt:
and not this prompt:
@JustisMills found this glitch interesting and not boring at all! If you felt the same way, please share your thoughts in the comments!
There could be more of them... but I have tested some of the GPT-3 glitch tokens; you can read about them here.
All four tokens are part of the Dragon Cluster?
I initially thought that "aterasu" didn't behave the same way, but I had forgotten that this token doesn't have a leading space, unlike the others.
a modified GPT2XL model.
As a side note, I disagree with Karpathy here. I think there is a future where the number of tokens will be scaled up from roughly 50k in GPT-2 to a million (or more?) in future models. I speculate that a neural network built on whole words would be far superior to one built on sub-word tokens. Furthermore, I believe that a language model using exact words would be easier to steer and interpret, so I wouldn't be surprised if, in the near future, AI labs experiment with this idea.
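To make the tokens-versus-words distinction concrete, here is a small sketch showing how GPT-2's roughly 50k-entry BPE vocabulary splits less common words into sub-word pieces; a word-level vocabulary would instead need an entry for every surface form:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print("GPT-2 vocabulary size:", tokenizer.vocab_size)  # about 50k BPE entries

# Common words tend to map to a single piece; rarer words get split into fragments.
for word in [" house", " Amaterasu", " antidisestablishmentarianism"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r:35} -> {pieces}")
```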